SCD type 2 in Exasol

dnienaber
Padawan

Hello,

we want to create an SCD type 2 history in Exasol.

In the past we used an MS SQL database and were also able to use psql to solve this. You can find an example below:

Type 2 SCD with SQL Merge - WaterOx Consulting, Inc

This works quite well. Now we want to use the same procedure on an Exasol database. Unfortunately, we are no longer able to use psql there, so it doesn't work.

Has somebody already managed this task?

Thanks a lot, 

Daniel


mwellbro
Xpert

Hi @dnienaber ,

just off the top of my head, maybe something along the lines of:

 

create table merge_test(
my_key decimal(15,0),
val_a varchar(20),
val_b varchar(20),
eff_date   timestamp,
end_date   timestamp);


merge into merge_test a
using (
select n.my_key,n.val_a,n.val_b,case when idx=1 then x.eff_date else n.eff_date end as eff_date,n.end_date from 
(select 1 as my_key,'foo' as val_a,'real barr!' as val_b,CURRENT_TIMESTAMP as eff_date,TO_TIMESTAMP('999912312359','YYYYMMDDHH24MI') as end_date from dual) n
left join (select * from merge_test where end_date='9999-12-31 23:59:00.0') x on n.my_key=x.my_key 
-- to get around the issue that doing update and insert in a single match is not supported we'll mirror the record
-- and use the CASE on eff_date to differentiate new vs old row
left join (select level as idx from dual connect by level<=2) y on hash_md5(x.val_a||'|'||x.val_b)<>hash_md5(n.val_a||'|'||n.val_b) and x.my_key is not null
where hash_md5(x.val_a||'|'||x.val_b)<>hash_md5(n.val_a||'|'||n.val_b)
) b
on (a.my_key=b.my_key and a.eff_date=b.eff_date)
when not matched then INSERT VALUES(b.my_key,b.val_a,b.val_b,b.eff_date,b.end_date)
when matched then update set a.end_date=current_timestamp
;
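Not Exasol SQL, but for anyone who wants to test-drive the close-and-insert logic behind this MERGE outside the database: a minimal Python/SQLite sketch. Table and column names are borrowed from the example above; `HIGH_DATE` and the two-statement split are my own framing, since SQLite has no MERGE.

```python
# Illustrative only: the SCD2 "close the old version, insert the new one" logic
# from the MERGE above, expressed as two statements against SQLite.
import sqlite3

HIGH_DATE = '9999-12-31 23:59:00'   # open-ended "current row" marker

con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE merge_test(
    my_key INTEGER, val_a TEXT, val_b TEXT,
    eff_date TEXT, end_date TEXT)""")

def scd2_upsert(con, my_key, val_a, val_b, now):
    # 1) close the current version, but only if the payload actually changed
    con.execute("""
        UPDATE merge_test SET end_date = ?
        WHERE my_key = ? AND end_date = ?
          AND (val_a <> ? OR val_b <> ?)""",
        (now, my_key, HIGH_DATE, val_a, val_b))
    # 2) insert a new current version if none exists (new key, or just closed)
    con.execute("""
        INSERT INTO merge_test
        SELECT ?, ?, ?, ?, ?
        WHERE NOT EXISTS (
            SELECT 1 FROM merge_test
            WHERE my_key = ? AND end_date = ?)""",
        (my_key, val_a, val_b, now, HIGH_DATE, my_key, HIGH_DATE))

scd2_upsert(con, 1, 'foo', 'bar', '2021-08-01 00:00:00')  # initial load
scd2_upsert(con, 1, 'foo', 'bar', '2021-08-02 00:00:00')  # unchanged: no new row
scd2_upsert(con, 1, 'foo', 'baz', '2021-08-03 00:00:00')  # changed: close + insert
rows = con.execute(
    "SELECT eff_date, end_date, val_b FROM merge_test ORDER BY eff_date").fetchall()
print(rows)
```

The single-MERGE version in the post avoids the second table scan this two-statement form needs, which is exactly why the record-mirroring trick exists.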

 

[Screenshot: mwellbro_1-1628092222110.png]

 

 

Doesn't yet take care of the eff_date/end_date overlap, but that's easily fixed. The separator between your payload fields for change tracking might need to be adjusted, but maybe it's enough to get you on the right track?
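To make the separator caveat concrete, here is a small Python sketch (hashlib stands in for HASH_MD5; `payload_hash` is a hypothetical helper, not an Exasol function) showing how different payloads collide when fields are concatenated without a separator:

```python
# Illustrative: why the '|' separator in the change-detection hash matters.
# ('ab','c') and ('a','bc') concatenate to the same string 'abc' without a
# separator, so their hashes collide and a real change would go undetected.
import hashlib

def payload_hash(*fields, sep='|'):
    # Note: fields containing the separator can still collide
    # ('a|','c' vs 'a','|c'), hence "might need to be adjusted".
    return hashlib.md5(sep.join(fields).encode()).hexdigest()

no_sep_a = hashlib.md5(('ab' + 'c').encode()).hexdigest()
no_sep_b = hashlib.md5(('a' + 'bc').encode()).hexdigest()
print(no_sep_a == no_sep_b)                                # True: collision
print(payload_hash('ab', 'c') == payload_hash('a', 'bc'))  # False: distinct
```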

Cheers,
Malte

mwellbro
Xpert

Hi @dnienaber ,

since this post has grown a bit older now, was this in any way helpful? Were you (seeing the tag) looking for a more script-based approach? Or have you already solved all your SCD-2 topics?

Just a few comments on the idea offered:

1) As mentioned, in order to get non-overlapping (aka "BETWEEN-capable") times you'd need to adjust, for example, the end_date with an add_seconds(end_date,-1) or similar.
2) This also works on Oracle, by the way, if you create a wrapper function called hash_md5 (which then uses the Oracle equivalent, which should be standard_hash(<your-data>,'MD5'), I think). Also, Oracle wants you to cast the string used for end_date='9999-12....'. Other than that: rather portable for that system.
3) This of course only addresses data that is "valid upon delivery". Taking care of data that brings its own eff_date and end_date is not (yet? 🙂) included.
4) The same goes for data that is no longer present in the source system and should be set to end_date=now (which, on the other hand, would also require your input data to either always give you the entire set or deliver some kind of marker for deleted records).
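For point 4, assuming the source delivers a full extract, end-dating vanished keys can be sketched like this in Python/SQLite (`dim`, `src`, and the dates are invented names and values for illustration; this is not part of the MERGE above):

```python
# Illustrative sketch of point 4: with a full extract, any key that is still
# "current" in the dimension but absent from the source gets end-dated "now".
import sqlite3

HIGH_DATE = '9999-12-31 23:59:00'
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE dim(my_key INTEGER, end_date TEXT);
    CREATE TABLE src(my_key INTEGER);
    INSERT INTO dim VALUES (1, '9999-12-31 23:59:00'), (2, '9999-12-31 23:59:00');
    INSERT INTO src VALUES (1);   -- key 2 vanished from the source system
""")
con.execute("""
    UPDATE dim SET end_date = ?
    WHERE end_date = ?
      AND my_key NOT IN (SELECT my_key FROM src)""",
    ('2021-08-05 00:00:00', HIGH_DATE))
print(con.execute("SELECT my_key, end_date FROM dim ORDER BY my_key").fetchall())
```

With a delta feed instead of a full extract, this only works if the feed carries an explicit delete marker, exactly as the comment above says.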

Would love to hear how you're doing on the topic - maybe you've already passed SCD-2 and moved on to bi-temporal data?

 

Cheers,
Malte

exa-MathiasHo
Community Manager

Hi @dnienaber,

I'm guessing that Malte's suggestion helped you with your issue?! Let us know if not. 😉

dnienaber
Padawan

Hi Malte,
we went with your proposal. Basically it works, but unfortunately the performance was very weak.
Our partner Sievers has already contacted you in order to get a clue about a possible solution.
Thank you

Daniel

mwellbro
Xpert

Hi Daniel,

sounds interesting - I'm guessing by "contacted you" you mean they've contacted Exasol (since I'm just a humble customer, much like yourself 😉)?

Anyway, would you be able to share some exa_dba_audit_sql or profiling data so all of us can have a look at what "the speed was very weak" means in terms of numbers?

I think I remember scaling the example up to a few more records, but I don't have the results at hand - I'll try and see if I can come up with something on that end.

Cheers,
Malte

mwellbro
Xpert

Hi again,

so in order to get a feeling on how this approach scales I looked at the following scenario:

 

CREATE SCHEMA SOME_RETAIL;

CREATE OR REPLACE TABLE PRODUCTS(
PRODUCT_KEY DECIMAL(15,0),
PRODUCT_NAME VARCHAR(200),
PRODUCT_CATEGORY_ID CHAR(1),
PRODUCT_PRICE DECIMAL(18,2),
PRODUCT_EFF_DATE DATE,
PRODUCT_END_DATE DATE
);


INSERT INTO PRODUCTS SELECT LEVEL,'Product '||HASH_MD5(level),'T',RAND(1,99999),TO_DATE('01.01.1900','DD.MM.YYYY') AS eff_date,TO_DATE('31.12.9999','DD.MM.YYYY') AS end_date FROM dual CONNECT BY level<=1000000;


CREATE TABLE PRODUCTS_ODS AS SELECT 
PRODUCT_KEY,
PRODUCT_NAME,
PRODUCT_CATEGORY_ID ,
PRODUCT_PRICE FROM PRODUCTS
;

merge into PRODUCTS a
using (
select n.product_key,n.product_name,n.product_category_id,n.product_price,case when idx=1 then x.product_eff_date else n.eff_date end as eff_date,n.end_date from 
(select product_key,product_name,product_category_id,product_price,TO_DATE('19010101','YYYYMMDD') as eff_date,TO_DATE('99991231','YYYYMMDD') as end_date from PRODUCTS_ODS) n
left join (select * from PRODUCTS where product_end_date=to_date('9999-12-31','YYYY-MM-DD')) x on n.product_key=x.product_key 
-- to get around the issue that doing update and insert in a single match is not supported we'll mirror the record
-- and use the CASE on eff_date to differentiate new vs old row
left join (select level as idx from dual connect by level<=2) y on 
hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price) and x.product_key is not null
where hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price)
) b
on (a.product_key=b.product_key and a.product_eff_date=b.eff_date)
when not matched then INSERT VALUES(b.product_key,b.product_name,b.product_category_id,b.product_price,b.eff_date,b.end_date)
when matched then update set a.product_end_date=TO_DATE('19010101','YYYYMMDD')
;

UPDATE PRODUCTS_ODS SET PRODUCT_PRICE=PRODUCT_PRICE-1;

merge into PRODUCTS a
using (
select n.product_key,n.product_name,n.product_category_id,n.product_price,case when idx=1 then x.product_eff_date else n.eff_date end as eff_date,n.end_date from 
(select product_key,product_name,product_category_id,product_price,TO_DATE('19010101','YYYYMMDD') as eff_date,TO_DATE('99991231','YYYYMMDD') as end_date from PRODUCTS_ODS) n
left join (select * from PRODUCTS where product_end_date=to_date('9999-12-31','YYYY-MM-DD')) x on n.product_key=x.product_key 
-- to get around the issue that doing update and insert in a single match is not supported we'll mirror the record
-- and use the CASE on eff_date to differentiate new vs old row
left join (select level as idx from dual connect by level<=2) y on 
hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price) and x.product_key is not null
where hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price)
) b
on (a.product_key=b.product_key and a.product_eff_date=b.eff_date)
when not matched then INSERT VALUES(b.product_key,b.product_name,b.product_category_id,b.product_price,b.eff_date,b.end_date)
when matched then update set a.product_end_date=TO_DATE('19010101','YYYYMMDD')
;

UPDATE PRODUCTS_ODS SET PRODUCT_CATEGORY_ID='X' WHERE MOD(PRODUCT_KEY,10)=0;

merge into PRODUCTS a
using (
select n.product_key,n.product_name,n.product_category_id,n.product_price,case when idx=1 then x.product_eff_date else n.eff_date end as eff_date,n.end_date from 
(select product_key,product_name,product_category_id,product_price,TO_DATE('19010102','YYYYMMDD') as eff_date,TO_DATE('99991231','YYYYMMDD') as end_date from PRODUCTS_ODS) n
left join (select * from PRODUCTS where product_end_date=to_date('9999-12-31','YYYY-MM-DD')) x on n.product_key=x.product_key 
-- to get around the issue that doing update and insert in a single match is not supported we'll mirror the record
-- and use the CASE on eff_date to differentiate new vs old row
left join (select level as idx from dual connect by level<=2) y on 
hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price) and x.product_key is not null
where hash_md5(x.product_name||'|'||x.product_category_id||'|'||x.product_price)<>hash_md5(n.product_name||'|'||n.product_category_id||'|'||n.product_price)
) b
on (a.product_key=b.product_key and a.product_eff_date=b.eff_date)
when not matched then INSERT VALUES(b.product_key,b.product_name,b.product_category_id,b.product_price,b.eff_date,b.end_date)
when matched then update set a.product_end_date=TO_DATE('19020101','YYYYMMDD')
;

SELECT row_count,a.* FROM exa_dba_audit_sql a WHERE session_id=CURRENT_SESSION AND command_name IN ('MERGE','UPDATE') AND error_code IS NULL ORDER BY stmt_id asc;

 

This throws 1 million records into an (albeit not very broad) dimension table, copies it into an "_ODS" variety to serve as the MERGE source, and then does a complete price reduction on all 1 million products, merges those, and afterwards, just to do something else, changes 100k product categories, followed by another MERGE.

And here´s for the timings I got:

[Screenshot: mwellbro_0-1632170451308.png]

The lower results come from a VirtualBox-based EXASolo (1 node) on my home system, with storage attached via USB, so there's that 😉
The upper, much better results were taken from a 4-node m5.large AWS cluster.

Of course row_counts can be deceiving since the actual data volume we are talking about is rather small:

[Screenshot: mwellbro_1-1632170708551.png]

From what I saw, the operation seemed CPU-bound, so it should scale rather well. Of course, HASH_MD5 isn't the most CPU-efficient way to produce hashes for comparisons due to its more expensive cryptographic nature... might be worth a shot to change things up in that department, perhaps...
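As a rough illustration of that trade-off, a non-cryptographic checksum such as CRC32 is much cheaper per byte than MD5, at the price of a far higher collision risk (32 bits vs 128). A small Python sketch (`crc_payload` is a hypothetical helper, not an Exasol function):

```python
# Illustrative: a non-cryptographic checksum (CRC32) as a cheaper
# change-detection hash than MD5. With only 32 bits, collisions are far more
# likely, so it flags changes cheaply but is weaker as a sole equality proof.
import zlib

def crc_payload(*fields, sep='|'):
    # same separator convention as the HASH_MD5 expressions in the MERGE above
    return zlib.crc32(sep.join(fields).encode())

old = crc_payload('Product 1', 'T', '19.99')
new = crc_payload('Product 1', 'T', '18.99')   # price changed
print(old != new)   # the change is detected
```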

Looking forward to any further input that comes trickling down this thread 🙂

By the way, I noticed that I got NODE SYNC profile values on my EXASolo... could anyone at Exasol kindly explain how that's possible? 😂

Cheers,
Malte

exa-Gabi
Team Exasol

There should be no NODE SYNC parts in profiling on single-node systems, or at least I can't think of any situation where it could happen. I took the liberty of running the SQL script above on a 7.1.1 single-node instance and didn't get any NODE SYNC parts in profiling.

Therefore, if possible, I'd like to kindly ask you to provide more info on NODE SYNC on single-node instances, ideally a SQL statement reproducing it together with the version it was run on.

 

Thanks in advance,

Gabi

EXA-Dude, database developer, history buff. Member of Team Exasol.

exa-Peggy
Moderator

Matching distribution keys will also help.

exa-Kristof
Team Exasol

Hi @dnienaber, I'd also be interested in what "weak performance" actually means. When I was still on the customer side, we used @mwellbro's approach (although with UNION ALL instead of CONNECT BY for the mirroring 😉) at a very large scale and experienced very good performance, even with multi-million-row upserts.
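For readers wondering what the UNION ALL variant of the mirroring looks like: a minimal Python/SQLite sketch (SQLite has neither CONNECT BY nor MERGE, so only the row-doubling subquery is shown; the `changed` table and date values are invented for illustration):

```python
# Illustrative: the "mirror each changed record" trick with UNION ALL instead
# of CONNECT BY LEVEL<=2. Each changed key yields two source rows: idx=1 is the
# row that closes the old version, idx=2 is the new current version.
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE changed(my_key INTEGER, val TEXT, old_eff TEXT);
    INSERT INTO changed VALUES (1, 'new value', '2021-01-01');
""")
rows = con.execute("""
    SELECT c.my_key, idx,
           CASE WHEN idx = 1 THEN c.old_eff ELSE '2021-08-05' END AS eff_date
    FROM changed c
    JOIN (SELECT 1 AS idx UNION ALL SELECT 2) y   -- replaces CONNECT BY level<=2
    ORDER BY idx""").fetchall()
print(rows)   # one "close" row and one "insert" row per changed key
```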