Data Lineage for Exasol

slavik_taubkin
Contributor

 

Exasol's development roadmap foresees an in-database Data Lineage feature in one of the upcoming releases. In this thread, I would like to outline the data lineage solutions already available for Exasol, describe the methodology we developed in our company, and discuss with the community how other data-driven companies approach this topic.

Data Lineage is the journey of data from its origin to its destination through all intermediate transformations. It helps investigate data quality issues, ensures compliance with regulatory requirements, and enables impact analysis. Basically, data lineage has to answer a simple question: “How is the field X in the table Y calculated?”

To answer this question, some companies engage multiple IT specialists and business users who spend many hours manually mapping the relationships between entities. This work, however, is very error-prone, and the resulting lineage information can become outdated very quickly.

Other companies use metadata to automatically generate lineage diagrams. This functionality is now available in many BI and ETL tools. However, it is usually very difficult to unite the dependencies from multiple technologies into a single integrated data lineage graph.

This challenge is tackled in several enterprise-grade tools. Commercial products like Manta Lineage Platform and Collibra Data Lineage include out-of-the-box connectors to many DBMS, ETL, and reporting tools. However, even these costly solutions leave some problems unsolved:

  • Obsolete and inactive links can appear in the dependency graph just because they are present in the code/metadata, even if this part of the code is no longer executed.
  • Links might be missing because some of the SQL statements are generated only at runtime, and therefore not directly available in the metadata.
  • Complex, technology-dependent deployment is required, as each BI and ETL technology has its own metadata that must be read individually.

In our projects, we use a self-designed approach that effectively addresses these issues and helps to create a reliable data lineage based on the information from the database query history. When auditing and profiling are enabled, Exasol stores information about all executed queries and referenced objects in the Statistical System Tables. We have developed pipelines that regularly retrieve the relevant data and transform it into an analyzable datamart. For the visualization of the dependency graphs, we either use our pre-built Data Lineage Analyzer or create custom dashboards in the enterprise BI tool used by the customer.
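
To illustrate the idea, here is a minimal sketch (not the actual pipeline) that joins the audited statements with the profiled objects they referenced. It assumes auditing and profiling are switched on and that the statistical system tables EXA_DBA_AUDIT_SQL and EXA_DBA_PROFILE_LAST_DAY are readable; the DSN and credentials are placeholders.

```python
# Minimal sketch: which objects did each executed statement touch?
# Assumes auditing and profiling are enabled and that the connecting user
# may read the statistical system tables. DSN and credentials are placeholders.
import pyexasol

conn = pyexasol.connect(dsn='exacluster:8563', user='sys', password='***')

stmt = conn.execute("""
    SELECT a.SESSION_ID,
           a.STMT_ID,
           a.COMMAND_NAME,
           p.OBJECT_SCHEMA,
           p.OBJECT_NAME
    FROM   EXA_DBA_AUDIT_SQL        a
    JOIN   EXA_DBA_PROFILE_LAST_DAY p
      ON   a.SESSION_ID = p.SESSION_ID
     AND   a.STMT_ID    = p.STMT_ID
    WHERE  p.OBJECT_NAME IS NOT NULL
""")

# Each row is one (statement, referenced object) pair -- the raw material
# from which dependency edges can be aggregated.
for session_id, stmt_id, command, schema, name in stmt:
    print(session_id, stmt_id, command, schema, name)
```

From such pairs, the write/read dependencies of each statement can be aggregated into the edges of the lineage graph and loaded into the analyzable datamart mentioned above.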

Compared to other data lineage approaches, this solution provides some important advantages:

  • Every BI and ETL tool that accesses the database is automatically included in the lineage.
  • Only valid dependencies from the queries that were actually executed on the database are considered.
  • With this solution, there are no costs for additional software.

Feel free to contact me, @slavik_taubkin, if you would like to learn more about our Data Lineage Analyzer solution.

 

Which approach do you use to analyze dependencies between processes and tables?
We would be happy to discuss here the pros and cons of the various solutions!

6 REPLIES

Charlie
Xpert

Is there still a need for me to point out the importance of this topic, @exa-Chris? 😉

 

slavik_taubkin
Contributor

Hi @Charlie,

I saw your post about data lineage in the Community ideas. 👍

One of our challenges is to identify which SQL statements belong to which ELT process (step). This allows us to create table-to-process mappings -- very important information for impact analysis. How do you deal with that in your project?

Charlie
Xpert

Hi @slavik_taubkin ,

 

ELT

Our ELT jobs are mainly SQL executors written in Java.

For each ELT job we set the clientname and clientversion.

The clientname is set to the business transformation performed in this step. The clientversion is set to the ID of the job in our enterprise scheduler.

With this ID we can look up the parameters that were set for this run (like tenant ID and time frame).

So clientname and clientversion can be seen as a concatenated string in the CLIENT field of exa_audit_sessions.

 

If you are using pyexasol, the clientname and clientversion can be set as well.
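
For example, a minimal sketch (assuming a recent pyexasol version, where client_name and client_version are regular connect options; the DSN, credentials, and statements are placeholders):

```python
# Sketch: tag an ELT session so it can be identified later in the CLIENT
# field of the audit sessions. All values are placeholders.
import pyexasol

conn = pyexasol.connect(
    dsn='exacluster:8563',
    user='etl_user',
    password='***',
    client_name='load_customer_dimension',  # business transformation of this step
    client_version='scheduler_run_4711',    # job id from the enterprise scheduler
)

conn.execute("INSERT INTO dwh.dim_customer SELECT * FROM stage.customer")

# Later (with the right to read the audit tables), the sessions of this step
# can be picked out again:
rows = conn.execute("""
    SELECT SESSION_ID, USER_NAME, CLIENT
    FROM   EXA_DBA_AUDIT_SESSIONS
    WHERE  CLIENT LIKE 'load_customer_dimension%'
""").fetchall()
```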

 

BI Reports

For BI reports we use SAP BI, and we don't really care which report triggers the SQL but rather which universe is used (universe = metadata layer with measures and dimensions).

So we set up a dedicated connection for every universe and also use the clientname and clientversion to add this information.

We don't need to set the report name because the user ID is passed through to Exasol (SAP BI user = Exasol DB user).

 

mwellbro
Xpert

Hi @slavik_taubkin ,

table-to-process mappings are definitely very useful - I'd say that, since you mentioned an ELT process (step), there has to be some kind of information exchange upon connection that passes the necessary "back reference" into the DB. In Exasol this could, for example, be as simple as "pre-issuing" a select statement that contains a run_id or program identifier, which could then be parsed from exa_dba_audit_sql.

If your ELT process is "JDBC based" you might even use the additional connect parameters to put the "back reference" into the CLIENT field of exa_dba_audit_sessions (kind of like using a DBMS package to put module or action into the AWR), but in any case I'd say the "ELT orchestrator" needs to "bring something" to the figurative table.
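
A minimal sketch of this "pre-issue" idea (shown with pyexasol for brevity; the LINEAGE_TAG format, run_id, and connection details are made-up placeholders -- any convention the ELT orchestrator can emit and later parse would do):

```python
# Sketch: pre-issue a cheap, recognisable statement at the start of a step,
# then recover the tag from EXA_DBA_AUDIT_SQL. Tag format and values are
# placeholders.
import pyexasol

conn = pyexasol.connect(dsn='exacluster:8563', user='etl_user', password='***')

# 1) "Back reference" issued by the ELT orchestrator at the start of the step
conn.execute("SELECT 'LINEAGE_TAG run_id=4711 step=load_customers' FROM DUAL")

# ... productive statements of the step run on the same SESSION_ID ...

# 2) Later, map every session to its tag by parsing the audited SQL text
tags = conn.execute("""
    SELECT SESSION_ID,
           REGEXP_SUBSTR(SQL_TEXT, 'run_id=[0-9]+ step=[A-Za-z0-9_]+') AS lineage_tag
    FROM   EXA_DBA_AUDIT_SQL
    WHERE  SQL_TEXT LIKE 'SELECT ''LINEAGE_TAG%'
""").fetchall()
```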

slavik_taubkin
Contributor

Hey @mwellbro, nice to see you here! 🙂

For Exasol I can see the following options:

  • Encapsulate the description of the process step in a SQL comment and parse it from audit_sql. This "trick" has worked for years in combination with MicroStrategy (to pass the report ID and reporting username) and can also be used to pass the corresponding information about the ELT process.
  • Create dummy one-row tables named after the process (steps) and cross join them in the queries. Then no parsing is required -- they will automatically appear in the profiles (see the sketch after this list).
  • Just use different database users for different processes. This strategy has a natural limitation: having one user per process step would drive admins crazy 😉
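
A minimal sketch of the dummy-table option (schema, table, and step names are placeholders; it assumes profiling is enabled so that the marker table appears as a referenced object in EXA_DBA_PROFILE_LAST_DAY):

```python
# Sketch of the dummy-table variant: a one-row marker table per process step
# is cross-joined into the productive statement, so the step shows up as a
# referenced object in the profiles without any text parsing.
import pyexasol

conn = pyexasol.connect(dsn='exacluster:8563', user='etl_user', password='***')

# One-time setup per process step (schema and names are placeholders)
conn.execute("CREATE TABLE IF NOT EXISTS lineage.step_load_customers (dummy INT)")
conn.execute("INSERT INTO lineage.step_load_customers VALUES (1)")

# Productive statement with the marker table cross-joined in
conn.execute("""
    INSERT INTO dwh.dim_customer
    SELECT c.*
    FROM   stage.customer c
    CROSS JOIN lineage.step_load_customers
""")
```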

 

I absolutely agree that "pre-issuing" a special statement (like "query band" in Teradata or "query tag" in Snowflake) would be the simplest and most reliable solution.
Exa-guys, what do you think?

 

exa-Chris
Community Manager

Something on Data Lineage and no comment from @Charlie after 6 hrs? Are you away on holidays?

Connecting Customers, Partners, Prospects and Exasolians is my passion. Apart from that I cycle, listen to music, and try to understand what all those technical discussions really mean...