Exasol's development roadmap includes an in-database Data Lineage feature in one of the upcoming releases. In this thread, I would like to outline the data lineage solutions already available for Exasol, describe the methodology developed in our company, and discuss with the community how other data-driven companies approach this topic.
Data Lineage is the journey of data from its origin to its destination through all intermediate transformations. It helps to investigate data quality issues, ensures compliance with regulatory requirements, and enables impact analysis. Basically, data lineage has to answer a simple question: “How is the field X in the table Y calculated?”
To answer this question, some companies engage multiple IT specialists and business users who spend many hours manually mapping the relationships between entities. This work, however, is very error-prone, and the resulting lineage information can become outdated very quickly.
Other companies use metadata to automatically generate lineage diagrams. This functionality is now available in many BI and ETL tools. However, it is usually very difficult to combine dependencies from multiple technologies into a single, integrated data lineage graph.
This challenge is tackled in several enterprise-grade tools. Commercial products like Manta Lineage Platform and Collibra Data Lineage include out-of-the-box connectors to many DBMS, ETL, and reporting tools. However, even these costly solutions leave some problems unsolved:
In our projects, we use an approach designed in-house that effectively addresses these issues and helps to build reliable data lineage from the database query history. When auditing and profiling are enabled, Exasol stores information about all executed queries and referenced objects in the Statistical System Tables. We have developed pipelines that regularly retrieve the relevant data and transform it into an analyzable datamart. To visualize the dependency graphs, we either use our pre-built Data Lineage Analyzer or create custom dashboards in the enterprise BI tool used by the customer.
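To illustrate the kind of transformation such a pipeline performs, here is a minimal, self-contained sketch. The input row shape `(stmt_id, object_name, access)` is a simplified assumption; in practice these rows would be extracted from Exasol's Statistical System Tables (e.g. EXA_DBA_AUDIT_SQL together with the referenced-object information).

```python
# Minimal sketch: derive table-to-table lineage edges from audit-style rows.
# The input format (stmt_id, object_name, access) is a simplified assumption;
# real rows would come from Exasol's statistical system tables.
from collections import defaultdict

def lineage_edges(rows):
    """rows: iterable of (stmt_id, object_name, access), access in {"READ", "WRITE"}.
    Returns a sorted list of (source, target) edges: every object read by a
    statement is treated as feeding every object that statement writes."""
    reads, writes = defaultdict(set), defaultdict(set)
    for stmt_id, obj, access in rows:
        (reads if access == "READ" else writes)[stmt_id].add(obj)
    edges = set()
    for stmt_id, targets in writes.items():
        for src in reads.get(stmt_id, ()):
            for dst in targets:
                if src != dst:
                    edges.add((src, dst))
    return sorted(edges)

rows = [
    (1, "STAGE.ORDERS", "READ"),
    (1, "STAGE.CUSTOMERS", "READ"),
    (1, "CORE.FACT_ORDERS", "WRITE"),
    (2, "CORE.FACT_ORDERS", "READ"),
    (2, "MART.ORDERS_DAILY", "WRITE"),
]
print(lineage_edges(rows))
# → [('CORE.FACT_ORDERS', 'MART.ORDERS_DAILY'),
#    ('STAGE.CUSTOMERS', 'CORE.FACT_ORDERS'),
#    ('STAGE.ORDERS', 'CORE.FACT_ORDERS')]
```

The resulting edge list is exactly what a dependency-graph visualization (or a lineage datamart) can be built on top of.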
Compared to other data lineage approaches, this solution provides some important advantages:
Feel free to contact me @slavik_taubkin, if you would like to learn more about our solution Data Lineage Analyzer.
Which approach do you use to analyze dependencies between processes and tables?
We would be happy to discuss here the pros and cons of the various solutions!
I saw your post about data lineage in the Community Ideas. 👍
One of our challenges is to identify which SQL statements belong to which ELT process (step). This allows us to create table-to-process mappings -- very important information for impact analysis. How do you deal with that in your project?
Hi @slavik_taubkin ,
Our ELT jobs are mainly SQL executors written in Java.
For each ELT job we set the clientname and clientversion.
The clientname is set to the business transformation that is done in this step. The clientversion is set to the job ID from our enterprise scheduler.
With this ID we can look up the parameters that were set for this run (like tenant ID and time frame).
So clientname and clientversion appear as a concatenated string in the client field of exa_audit_sessions.
If you are using pyexasol, the clientname and clientversion can be set as well.
For BI reports we use SAP BI, and we don't really care which report triggers the SQL but rather which universe is used (universe = meta layer with measures and dimensions).
So we set up a dedicated connection for every universe and also use clientname and clientversion to add this information.
We don't need to set a report name because the user ID is passed through to Exasol (SAP BI user = Exasol DB user).
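To make the tagging convention above concrete, here is a hedged sketch. The step and job names are invented, and the space separator in the concatenated `client` string is an assumption; `client_name` and `client_version` are real pyexasol connect options.

```python
# Hedged sketch: set an ELT tag via pyexasol and split it back out of the
# "client" field of the audit sessions table. Step/job names are invented;
# client_name / client_version are pyexasol connect options.
def connect_tagged(dsn, user, password, step_name, job_id):
    """Open a connection whose audit sessions carry the ELT step and job id."""
    import pyexasol  # lazy import; requires "pip install pyexasol"
    return pyexasol.connect(
        dsn=dsn, user=user, password=password,
        client_name=step_name,       # business transformation of this step
        client_version=str(job_id),  # id of the enterprise-scheduler job
    )

def split_client_tag(client):
    """Split the concatenated '<clientname> <clientversion>' string as it
    appears in the client field (space separator is an assumption here)."""
    name, _, version = client.rpartition(" ")
    return (name, version) if name else (client, "")
```

A lineage pipeline can then join the audit SQL back to the scheduler run via `split_client_tag(session.client)`.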
Hi @slavik_taubkin ,
Table-to-process mappings are definitely very useful. Since you mentioned an ELT process (step), I'd say there has to be some kind of information exchange upon connection that passes the necessary "back reference" into the DB. In Exasol, this could be as simple as "pre-issuing" a SELECT statement
that contains a run_id or program identifier, which could then be parsed from exa_dba_audit_sql.
If your ELT process is JDBC-based, you might even use the additional connect parameters to put the "back reference" into the client field of exa_dba_audit_sessions (kind of like using a DBMS package to put module or action into the AWR). But in any case, I'd say the "ELT orchestrator" needs to "bring something" to the figurative table.
Hey @mwellbro, nice to see you here! 🙂
For Exasol I can see the following options:
I absolutely agree that "pre-issuing" a special statement (like "query band" in Teradata or "query tag" in Snowflake) would be the simplest and most reliable solution.
Exa-guys, what do you think?