Fastest way to make copy of huge dataset on same Exasol server

Padawan

Any other quick way to creating copy of huge dataset except 'Create or replace table' or Insert into - for keeping original dataset intact and using second one for experimenting

5 REPLIES 5

Xpert

hi again 😉
you could go for an IMPORT within the same database - I´m not sure if you get any more "mileage" compared to CREATE OR REPLACE - I´ll test that on occasion

Padawan

Thanks for the input! If I understand correctly, import could be a possibility to replicate a table faster. Can this be faster if it can be made to execute parallelly across  nodes. If yes, how to force import to run parallely.

Xpert

I´d assume that IMPORT should work in parallel automatically:

COMMAND_NAME IPROC PART_NAME OBJECT_NAME OBJECT_ROWS IN_ROWS OUT_ROWS DURATION
IMPORT 0 IMPORT T3 0 6250000 6250000 5.215547
IMPORT 1 IMPORT T3 0 6250000 6250000 5.215421
IMPORT 2 IMPORT T3 0 6250000 6250000 5.215292
IMPORT 3 IMPORT T3 0 6250000 6250000 5.215446
IMPORT 4 IMPORT T3 0 6250000 6250000 5.215514
IMPORT 5 IMPORT T3 0 6250000 6250000 5.215219
IMPORT 6 IMPORT T3 0 6250000 6250000 5.215208
IMPORT 7 IMPORT T3 0 6250000 6250000 5.215290


What surprised me was that IMPORT did perform slower than CTAS on a 50M copy...

Creating testset: 

COMMIT 1.208 0 COMMIT /* AUTO */
CREATE TABLE AS 1312.431 50000000 create table t1 as select level as my_pk,HASH_MD5(to_char(level)) as some_text,current_date + mod(level,100000) as some_date,'fix text for good measure' as some_fix_text from dual connect by level<=50000000;

fyi: not the most ideal/performant why to create the data but I wanted to use a sql that can by copy&pasted&executed
Copy with CTAS:

COMMIT 1.845 0 COMMIT /* AUTO */
CREATE TABLE AS 1.013 50000000 create table t2 as select * From t1;

Copy with IMPORT:

COMMIT 1.318 0 COMMIT /* AUTO */
IMPORT 5.256 50000000 import into t3 from exa at ... table t1;

 

Resulting data:

SIZE_GB TABLE_ROW_COUNT OBJECT_NAME OBJECT_TYPE RAW_OBJECT_SIZE MEM_OBJECT_SIZE
1.737127877771854 50000000 T1 TABLE 3400000000 1865226856
1.737139143049717 50000000 T3 TABLE 3400000000 1865238952
1.737249948084354 50000000 T2 TABLE 3400000000 1865357928


I´d assume that on a table with a bigger size and more columns IMPORT should outperform CTAS ( which btw also works as parallel as your cluster can muster i.e. #nodes) ..
Depending on what "for experimenting" means in your case you might save a bit if you can manage to stay in the same session for copy&experimenting and not commit your copy ( i.e. CTAS or IMPORT with autocommit off and all tests in the same transaction ) - of course, that will take a toll on your DBRAM and is not viable if your tests can´t be made in the same session.

Team Exasol
Team Exasol

I´d assume that IMPORT should work in parallel automatically.

If you import from EXA as data source type, yes.

> What surprised me was that IMPORT did perform slower than CTAS on a 50M copy...
Of course CTAS/INSERT also work in parallel and can do without any driver data conversion overhead...

 

The only situation where IMPORT used to be (?) faster than INSERT was when the target table had a distribution key differing from the source table; with IMPORT the distribution was done in-stream, while the INSERT performed a two-phase approach.

 

All in all, CTAS should be the fastest possible way to copy a data set. If the data set does not fully fit into memory, maybe a loop of smaller INSERT statements is preferable.

 

Padawan

Thanks for the details!

Tried import but execution duration are almost same compared to current implementation. Seems CTAS/Insert into/Import have similar performance to replicate table on same server.