Introduction
In modern business intelligence (BI) projects, tracing the path of data from its source to its destination poses significant challenges. As projects become more intricate, spanning multiple sources, artifacts, and dependencies, unraveling complexities like “What happens if I alter this data?” or “Why isn’t this report updated?” demands deep expertise or extensive investigation.
One of the most popular BI tools is PowerBI.
Its major building blocks are: dashboards, reports, semantic models (datasets) and dataflows and they are all organized into workspaces. Datasets and dataflows can connect to external data sources or to datasets in other workspaces, complicating the task of tracking data origins.
A workspace acts as a container for various PowerBI elements. Workspaces can be of two types:
- Personal workspaces named “My workspaces”, which can be accessed only by their owner.
- Public workspaces named “Workspaces”, which are used to collaborate and share content with colleagues.
In PowerBI a data source refers to the origin of data utilized in reports and visualizations. This source could be any location or service where data resides, like databases, files, online service, other applications and so on.
Additionally, a dataset can be a tabular model on Azure Analysis Services. AAS allows for the creation of sophisticated data models that can be leveraged in Power BI to provide fast, interactive analysis.
A dataset contains ready to use data for reporting and visualization. The data is in tabular model form. This model consists of tables, which have columns and measures, where a measure is a calculation defined in a DAX expression aggregating data from one or more columns.
The dataset’s data can come from other datasets, dataflows or data sources. A dataset can be:
- hosted in PowerBI.
- externally hosted by Azure Analysis Service and by SQL Server Analysis Services.
A dataset can be developed in two ways:
- import mode, where data is copied from the data source. This mode delivers fast performance but requires more memory and it does not use real-time up-to-date data.
- directQuery mode, where only metadata defines the dataset structure and when it is queried, native queries are used to retrieve data from the underlying data source. This mode uses real-time data but it is slower than the import mode.
PowerBI
data lineage view
Currently, PowerBI offers a tool named Data Lineage View, which shows connections between all the artifacts in a workspace and all its external dependencies.
The Limits
Data Lineage View shows data lineage with a low level of granularity, which means that it does not display interconnections between artifacts like datasets tables, reports metrics and data sources database tables ( and columns ). This can impede a comprehensive understanding of how data moves through the system, making it impossible to find issues or assess the full impact of changes.
Blindata
data lineage
To address these limitations, Blindata offers a custom connector for PowerBI that enables a full data lineage view.
Tracing data lineage from Data Sources to Datasets
Tracing data lineage from data sources to datasets starts by first importing data source metadata through various means like:
- Built-in connectors, which automatically extract metadata from various data sources such as databases, data warehouses, and cloud storage.
- Spreadsheets, ingesting metadata through Excel or CSV files.
- Custom solutions using Blindata API.
Then the Blindata’s PowerBI connector gathers datasets metadata using PowerBI metadata scanning APIs. For each dataset’s table:
- It traces the origin of each column in the data source.
- It tracks down every source used from the measures, thanks to a custom-built parser designed for the PowerQuery M language, which analyzes the measure’s expression.
Tracing data lineage from Datasets to Reports and Dashboards
Using PowerBI metadata scanning APIs, the Blindata’s PowerBI connector retrieves all the reports that are present in each workspace and which datasets they use.
The connector offers an additional feature that allows analyzing each report’s .pbix file: this is first automatically downloaded using PowerBI Rest APIs, then it is parsed, and all useful metadata is extracted, like information about which tables, measures and columns are used inside a report. And it goes even further by retrieving where the data is displayed in the layout of the report (E.g. in which sections of the report a measure is displayed).
Conclusion
Understanding data lineage within PowerBI is crucial for maintaining data integrity, ensuring report accuracy, and facilitating informed decision-making. While PowerBI’s native data lineage view provides some insights, its limitations in granularity restrict a comprehensive understanding of complex data flows.
Blindata’s custom connector for PowerBI presents a valuable solution by amalgamating metadata from various data source systems, PowerBI metadata APIs, and .pbix files. This integration bridges the gaps left by PowerBI’s native functionality, allowing for a comprehensive data lineage view from source systems to PowerBI datasets and reports, achieving the highest level of granularity at the ‘column level.’