Introduction to Data Lineage
As in a game of telephone, by the time the information from the first person reaches the last person, it has turned into something altogether different. Employees are left perplexed, with no idea how the original data became something completely different. The same thing happens with poor Data Lineage as an enterprise’s data assets flow through its Data Architecture. For the customers, regulators, and enterprises that rely on a company’s Big Data, the result is far less entertaining than the game, although good Data Lineage can overcome several of these challenges.
Businesses require compliant and secure data. This information must be available when and where it is required. With multiple end-users, platforms, and sources in various formats, such as video, text, images, and audio, the need for clean Big Data becomes even more complicated. When Big Data is stored remotely, it becomes less clear how the data got there in the Cloud. Understanding Data Lineage addresses these and other issues.
What is Data Lineage?
The data lineage includes:
- The origin of the data.
- Each stop along the way.
- An explanation of how and why the data has moved over time.
Data lineage can be documented visually from the source to the final destination, including any stops, deviations, or changes along the way. This makes operational aspects such as day-to-day use and error resolution easier to track.
It answers one crucial question: where is the data coming from, and where is it going?
Data Lineage Example
Data lineage diagrams depict how data transforms and travels from source to destination throughout its complete data lifespan. A business lineage diagram is an interactive visualization that depicts the overall data flow from source to report without revealing all of the technical intricacies and adjustments. An information architect can use a technical data lineage diagram to see transformations, drill down into the table, column, and query-level Lineage, and traverse data pipelines.
Let’s look at an example that visualizes end-to-end data lineage at a high level by following the hops of one column, Customer ID, and shows how lineage can solve several related problems.
- Tables Customer A and Customer B include the columns Customer ID and Amount
- Table Accounts includes Customer ID and Amount for all customers
- Table Transactions includes Customer ID, Total Amount, and Number of Transactions per customer
- Table Customer Rating includes Customer ID and Customer Rating
- Table Rewards includes Customer ID, Customer Name, Customer Rating, and Reward
The lineage, for example, starts with the Customer ID column in the Customer A and Customer B tables and flows forward via the Accounts and Transactions tables to the Customer ID column in the Customer Rating table, and then to the corresponding columns in the Rewards and Marketing Strategy tables.
The above data lineage helps us ensure that data is coming from trusted sources and all the transformations have been applied correctly. In this case, it also plays a role in the marketing team’s strategic decision-making. If data operations aren’t adequately tracked, data verification becomes nearly impossible, or at the very least, extraordinarily costly and time-consuming.
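As a rough illustration of how this example could be checked programmatically, the sketch below stores the Customer ID hops as a small graph and walks backwards from the Rewards table to find the original sources. The identifiers (`customer_a.customer_id` and so on) and the `upstream_sources` helper are assumptions made for this sketch, and only the tables listed above are modeled.

```python
from collections import deque

# Column-level lineage edges for the example: each key feeds the columns
# listed as its values. The identifiers are illustrative spellings of the
# tables described above, not names from a real system.
EDGES = {
    "customer_a.customer_id": ["accounts.customer_id"],
    "customer_b.customer_id": ["accounts.customer_id"],
    "accounts.customer_id": ["transactions.customer_id"],
    "transactions.customer_id": ["customer_rating.customer_id"],
    "customer_rating.customer_id": ["rewards.customer_id"],
}


def upstream_sources(target, edges):
    """Return the root columns (those with no parents) that feed `target`."""
    # Invert the forward edges so we can walk from a column back to its parents.
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    roots, seen, queue = set(), {target}, deque([target])
    while queue:
        node = queue.popleft()
        node_parents = parents.get(node, [])
        if not node_parents and node != target:
            roots.add(node)
        for parent in node_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


sources = upstream_sources("rewards.customer_id", EDGES)
print(sorted(sources))
# ['customer_a.customer_id', 'customer_b.customer_id']
```

Comparing the returned roots against a list of trusted source tables is one simple way to confirm that a report column comes only from approved origins.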
Why do we need Data Lineage?
Organizations can use Data Lineage to track errors, migrate systems, bring data discovery and metadata closer together, and make process improvements less risky.
Data accuracy is critical for strategic business choices. Without strong data lineage, it becomes difficult to track and validate data processes. Data lineage allows users to see the entire data path from source to destination, making it easier to spot and correct errors. Users can also use data lineage to debug a data flow or regenerate lost output by replaying specific portions or inputs of it.
Data lineage assists with troubleshooting and system migrations, and it helps ensure data security and integrity by tracking changes, how they were made, and who made them. IT teams can use data lineage to view the end-to-end journey of data. It simplifies the task of IT professionals and gives business users the confidence to make informed decisions.
How can we achieve Data Lineage?
When developing a data lineage system, we must keep track of every operation that changes or processes the data. At each stage of data transformation, tables, views, columns, and reports must be mapped and tracked across databases and ETL operations.
To make this easier, collect metadata from each step and store it in a metadata repository for lineage analysis. Here’s how lineage is accomplished at various phases of the data pipeline:
- Data Ingestion: Monitoring data flow inside data ingestion jobs and checking for errors in the mapping between source and destination systems or any mistakes in data transfer.
- Data Processing: Tracking the results of specific actions performed on the data, for example, reading a text file, applying filters, counting values from a specified column, and publishing the results to another table. Each stage of data processing is examined independently for mistakes or security/compliance issues.
- Query History: Keeping track of user queries and automatic reports generated by databases and data warehouses. Filters, joins, and other operations allow users to create new datasets, so lineage must be performed on critical queries and reports to verify the flow. Users can also use lineage data to help improve their queries.
- Data Lakes: Identifying security or governance vulnerabilities by tracing user access to different objects or data fields. Because of the massive amount of unstructured data, security and governance policies can be challenging to enforce in large data lakes.
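The idea of collecting metadata from each step into a repository can be sketched roughly as follows. This is a minimal in-memory stand-in, not a real metadata store: the `LineageEvent` and `MetadataRepository` classes, their fields, and the dataset names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One recorded operation: which inputs produced which output, and how."""
    stage: str        # e.g. "ingestion", "processing", "query"
    operation: str    # free-text description of the step
    inputs: list
    output: str
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class MetadataRepository:
    """In-memory stand-in for a metadata repository (hypothetical)."""

    def __init__(self):
        self.events = []

    def record(self, stage, operation, inputs, output):
        event = LineageEvent(stage, operation, list(inputs), output)
        self.events.append(event)
        return event

    def history(self, dataset):
        """Return the events that produced `dataset`, oldest first."""
        return [e for e in self.events if e.output == dataset]


repo = MetadataRepository()
repo.record("ingestion", "copy text file into staging",
            ["events.txt"], "staging.events")
repo.record("processing", "filter rows and count values",
            ["staging.events"], "reports.event_counts")

for event in repo.history("reports.event_counts"):
    print(event.stage, event.inputs, "->", event.output)
```

Each pipeline phase would call `record` as it runs, so lineage analysis later becomes a matter of querying the stored events rather than reconstructing the flow by hand.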
What are the advantages of Data Lineage?
From IT to business, the entire organization can benefit from data lineage. It gives an organization’s data the visibility and context it requires and allows IT to focus on strategic projects rather than manually mapping data. These advantages of data lineage enable businesses to:
Trust and better understand data
The business user benefits from data lineage because it provides the required context for an organization’s data. The source of your data, how data sets are produced and aggregated, the quality of data sets, and any alterations along the data journey are all displayed in data lineage.
Comply with regulations
Data traceability for regulations such as BCBS 239, CCPA, and GDPR is challenging to map. It can take a long time, and if done incorrectly, it can lead to fines and penalties. Data lineage assists the Risk Management and Data Governance teams by documenting how data moves through various systems from source to destination and by giving risk management an audit trail for all data transformations.
Save time doing manual impact analysis
When making a data modification, data lineage allows IT to undertake impact analysis at a granular level (column, table, or business report) so that they can observe any changes to downstream systems. It eliminates approximately 98% of the time spent by IT on manual analysis.
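Impact analysis of this kind amounts to finding everything downstream of the asset being changed. The sketch below assumes lineage is already available as a mapping from each column to the columns or reports that read it. The `DOWNSTREAM` mapping, its entries, and the `impacted_by` function are hypothetical.

```python
# Hypothetical lineage edges: each column maps to the columns or reports
# that read from it.
DOWNSTREAM = {
    "accounts.amount": ["transactions.total_amount"],
    "transactions.total_amount": ["customer_rating.customer_rating"],
    "customer_rating.customer_rating": [
        "rewards.reward",
        "report.marketing_strategy",
    ],
}


def impacted_by(changed, downstream):
    """Return every asset reachable from `changed`, in discovery order."""
    impacted, stack = [], [changed]
    seen = {changed}
    while stack:
        node = stack.pop()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                stack.append(child)
    return impacted


print(impacted_by("accounts.amount", DOWNSTREAM))
```

Running the same function on a table-level or report-level mapping would give the coarser views of impact mentioned above.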
What are the best practices of Data Lineage?
The best practices of Data Lineage are described below:
Automate data lineage extraction
It used to be common practice for companies to document lineage manually. Manual tracking is no longer practicable given the dynamic, fast-paced nature of production systems. To keep up with a fast-paced business environment, you must automate the process.
To increase automation, best-in-class data catalogs are also advised. They use AI and machine learning to aggregate metadata from many systems and create a logical lineage flow. They can also extract metadata and draw inferences from it.
Metadata Validation
Because data is always prone to errors, it’s critical to include the owners of various processes and tools in lineage tracing. Owners are closest to and most aware of the details generated by their programmes. They can help point out defects or inaccuracies in records or procedures.
Inclusion of source metadata
When tracing data lineage, it is critical to include the metadata generated by the many operations that process, transform, or transport the data. Lineage tracking should therefore capture the metadata that these operations create about the data.
Modifying and Updating the data
The data owner has unique control over the data. The owner should keep the information in a secure location where only those with authorization rights can access it.
As a result, the owner knows who is updating, utilizing, and amending the data and who to contact if an issue arises.
Assigning a person to verify lineage
The owners of the tools and applications that generate metadata about your data understand better than anyone else how timely, accurate, and relevant that metadata is.
The data owner must properly hand over data handling rights to the person who will need to use the data in the future. Data lineage helps the owner or analyst track who is actively using and updating the data.
Progressive extraction and validation
To map the lineage as precisely as possible, it’s best to record metadata in the order of the data pipeline stages. This results in a well-defined timeframe and a much more legible structure for the massive metadata log. The high-level links can be validated first, making progressive validation of this data easy. The deeper complexities can be evaluated level-by-level once they’re evident. While reading or extracting data, the progressive technique maintains a logical pattern and reduces errors.
Identifying and marking critical Data
The company must identify how important each piece of data is, keep track of it, and separate out the critical data. Strict policies should be created for any sensitive data to keep it confidential and secure.
End-to-end lineage validation
Validate lineage in stages, beginning with high-level linkages between systems, then moving on to related datasets, data items, and transformation documentation.
Storing the data
Many people believe that once data has been used, it should simply be destroyed. The organization must recognise that every aspect of data is critical.
Even if you do not need the information now, you may require it in the future. To prepare for that, you’ll need to construct datasets that help manage and track any additional data that is useful in conjunction with your primary data.
Use of data within the organization
Data is now used multiple times in every business to extract information and generate reports. The reports help the organization gain insight into its operations and, as a result, make decisions. If there is an issue in a report, an organization that follows data lineage best practices can discover the cause of the error.
What are the techniques of Data Lineage?
Techniques for Data lineage are mentioned below:
Pattern-based Lineage
Instead of dealing with the code that alters the data, pattern-based lineage searches for patterns in the metadata to build the lineage. The essential advantage of this technique is that, unlike lineage by parsing, it does not require knowledge of any programming language used to process the data. It keeps an eye on the data rather than the algorithms.
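A toy version of this idea compares column names and sample values across tables and proposes a link when a target column’s values mostly appear in a source column of the same name. The `infer_lineage` function, the `min_overlap` threshold, and the sample tables are assumptions made for this sketch; real tools use far richer heuristics.

```python
def infer_lineage(tables, min_overlap=0.8):
    """Guess column-level links by comparing column names and sample values.

    `tables` maps a table name to {column name: list of sample values}.
    Returns (source, target, overlap) tuples where the target's values mostly
    appear in the source. No transformation code is inspected.
    """
    links = []
    for src_name, src_cols in tables.items():
        for dst_name, dst_cols in tables.items():
            if src_name == dst_name:
                continue
            for col, dst_values in dst_cols.items():
                if col not in src_cols or not dst_values:
                    continue
                src_set, dst_set = set(src_cols[col]), set(dst_values)
                overlap = len(src_set & dst_set) / len(dst_set)
                if overlap >= min_overlap:
                    links.append(
                        (f"{src_name}.{col}", f"{dst_name}.{col}", overlap)
                    )
    return links


# Made-up sample data for illustration only.
tables = {
    "accounts": {
        "customer_id": [1, 2, 3, 4, 5, 6],
        "amount": [10, 20, 30, 40, 50, 60],
    },
    "customer_rating": {
        "customer_id": [1, 2, 3],
        "customer_rating": ["A", "B", "A"],
    },
}
print(infer_lineage(tables))
# [('accounts.customer_id', 'customer_rating.customer_id', 1.0)]
```

Because only the data is inspected, the direction of a link is a guess: two columns with identical values would be linked both ways, which shows the kind of ambiguity a pattern-based approach has to resolve.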
Lineage by data tagging
Data that transforms or moves is tagged by a transformation engine. The tag is then tracked to create a lineage representation from beginning to end. It only works, though, if you have a reliable transformation mechanism in place to manage all data flow.
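The tagging approach can be sketched as a small engine that attaches a tag to every value it reads and appends another tag for every step it runs. The `Tagged` and `TaggingEngine` names, the tag format, and the sample amounts are hypothetical. The sketch also assumes every transformation goes through the engine, which is exactly the limitation noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value plus the lineage tag trail it has accumulated."""
    value: object
    trail: tuple = ()


class TaggingEngine:
    """Hypothetical transformation engine: every step runs through `apply`."""

    def source(self, value, origin):
        return Tagged(value, (f"source:{origin}",))

    def apply(self, step_name, func, *inputs):
        # Merge the trails of all inputs, preserving order and dropping repeats.
        merged = []
        for item in inputs:
            for tag in item.trail:
                if tag not in merged:
                    merged.append(tag)
        result = func(*(item.value for item in inputs))
        return Tagged(result, tuple(merged) + (f"step:{step_name}",))


engine = TaggingEngine()
a = engine.source(120.0, "customer_a.amount")
b = engine.source(80.0, "customer_b.amount")
total = engine.apply("sum_amounts", lambda x, y: x + y, a, b)
rounded = engine.apply("round_total", round, total)
print(rounded.value, rounded.trail)
```

Reading the trail of the final value from left to right gives the lineage from the original sources through each transformation step.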
Lineage by parsing
Lineage by parsing automatically reads the logic used to process the data. Because it follows data as it moves, this type of data lineage makes it simple to capture changes across systems. However, it does require a thorough understanding of the programming languages and tools used throughout the data lifecycle.
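To give a feel for what reading the processing logic involves, the sketch below extracts the target and source tables from a single SQL statement using regular expressions. This is a deliberately simplified assumption-laden example: the `parse_lineage` function and the query are made up, and the patterns only handle plain `INSERT INTO ... SELECT ... FROM ... JOIN ...` statements. Production tools rely on full parsers for each language they support.

```python
import re

# Deliberately simple patterns: they do not handle subqueries, CTEs,
# comments, or quoted identifiers.
TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql):
    """Return (target table, sorted source tables) for one INSERT ... SELECT."""
    target = TARGET_RE.search(sql)
    if target is None:
        raise ValueError("no INSERT INTO target found")
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return target.group(1), sources


# Hypothetical query based on the example tables.
sql = """
INSERT INTO rewards
SELECT r.customer_id, r.customer_rating
FROM customer_rating r
JOIN transactions t ON t.customer_id = r.customer_id
"""
print(parse_lineage(sql))
# ('rewards', ['customer_rating', 'transactions'])
```

The need for a separate, accurate parser per language and tool is the reason this technique demands so much knowledge of the systems in the data lifecycle.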
What are the best Data Lineage Tools of 2022?
The best Data Lineage Tools are listed below:
- Atlan
- Alation
- OvalEdge
- Datameer
- Trifacta
- Octopai
- Collibra
- CloverDX
Future of Data Lineage
As automation becomes more common in all aspects of software engineering, it’s only natural that data engineering follows suit, moving from DevOps to DataOps. High-quality lineage data will enable similar use cases in data engineering. Companies already run tests and QA on data pipelines, and we see some excellent prospects for impact analysis in this area.
The data engineer is the human component of data governance in most circumstances. Data lineage technology can enable a slew of services that augment or remove parts of the data engineer’s process, freeing up time for value-creating activities. Data governance is becoming more difficult, but we believe that use-case-driven tools powered by lineage technology will help data engineers meet the challenge.
Conclusion
Tracking data lineage is a must for any company that wants to be truly data-intelligent. Large firms have data dispersed around the enterprise in hundreds to thousands of systems and data sets, including on-premises, hosted, and Cloud. Furthermore, data is growing exponentially, making it even more challenging to track where data comes from and how it has changed over time.
Data lineage benefits both IT and business users by giving an end-to-end representation of where data is kept, where it came from, and where it is going. This visibility allows IT to work more efficiently and effectively and gives the company confidence in its data, allowing it to make more informed business decisions.