Export to multiple files, control the number of records per file

Gallus
SQL-Fighter

Hello everybody

I am using an EXPORT statement to store the records of a table into multiple files. I noticed that the size of the files and also the number of records per file is not equal and varies quite a bit. In the EXASOL documentation I found the following statement:

[Screenshot of the EXASOL documentation stating that the distribution of rows across multiple export files depends on several factors.]

Is there any documentation on the "several factors", or what could I do to make sure that the number of records is more or less evenly distributed?

My query looks like the following (this is generated code, out of WhereScape):

EXPORT SELECT  grl_mrt_mzlg_m_syr_update_file_export.fileline AS fileline
  FROM edd_dwh_grl.grl_mrt_mzlg_m_syr_update_file_export grl_mrt_mzlg_m_syr_update_file_export
   INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV FILE 'NC104_PartnerFlagging_DWH_update.txt_1_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_2_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_3_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_4_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_5_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_6_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_7_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_8_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_9_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_10_202110709'
   COLUMN SEPARATOR=';'
   COLUMN DELIMITER=''
   ENCODING='UTF-8'
   REPLACE;

This results in the following

[Screenshot of the resulting export: file sizes and record counts vary considerably between the ten files.]

Kind Regards

Gallus

1 ACCEPTED SOLUTION


exa-Kristof
Team Exasol

Hi @Gallus and @mwellbro,

I reached out to RnD and this is what I learned:


Basically, each file is written with the data that is on the corresponding node(s). If the number of files is equal to the number of nodes, then each file will receive the data on that node. If the number of files is less or greater than the number of nodes, then the data on some nodes will be combined or split according to the number of files.
There is no guarantee that the data will be equally distributed.

So Malte, your assumption about the 1:1 node/file ratio is correct. However, if this ratio is not given, the distribution across the files becomes quite a complicated process. For now, there is no real way to control file sizes or data distribution across those files (besides the 1:1 node/file ratio).
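Given that, the practical takeaway is to match the number of FILE clauses to the number of nodes. A minimal sketch of this, based on the query from the original post (it assumes Exasol's NPROC() function, which returns the number of nodes in the cluster; the shortened file names are illustrative):

```sql
-- Step 1: find out how many nodes the cluster has.
SELECT NPROC();

-- Step 2: issue the EXPORT with exactly that many FILE clauses
-- (here: 4), so each file receives the data stored on one node.
EXPORT SELECT fileline
  FROM edd_dwh_grl.grl_mrt_mzlg_m_syr_update_file_export
  INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
  FILE 'NC104_PartnerFlagging_DWH_update_1.txt'
  FILE 'NC104_PartnerFlagging_DWH_update_2.txt'
  FILE 'NC104_PartnerFlagging_DWH_update_3.txt'
  FILE 'NC104_PartnerFlagging_DWH_update_4.txt'
  COLUMN SEPARATOR=';'
  ENCODING='UTF-8'
  REPLACE;
```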

I hope this sheds a bit more light on this topic.

Best regards,

Kristof


13 REPLIES

mwellbro
Xpert

Hi @Gallus ,

just off the top of my head: since you are already using a STATEMENT and not a TABLE clause, maybe throw in an ORDER BY... just a hunch... you'll pay with extra CPU, memory and processing time, but perhaps it might influence the EXPORT distribution?

Cheers,
Malte

Gallus
SQL-Fighter

Hello @mwellbro 

Many thanks for your answer. I have just tried it with an order by statement, but that did not change anything.

Kind Regards

Gallus

mwellbro
Xpert

Good to know - let's see... I will try something and get back to you.

mwellbro
Xpert

A very brutish way of going about it, but you could use multiple EXPORT statements and WHERE the data accordingly => that should give you quite exact control over the distribution per file, at the expense of "ease of execution" (and probably higher resource pressure, since you'd use multiple sessions... just spewing out ideas 😉)
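A minimal sketch of that idea, staying with the table from the original post (the bucket count of 4, the MOD-based split via ROW_NUMBER, and the file name are illustrative assumptions, not from the thread):

```sql
-- One EXPORT per bucket: rows are assigned to buckets via MOD on a
-- row number, so each file gets roughly 1/4 of the data regardless
-- of the node count. Bucket 0 of 4 shown; repeat the statement with
-- MOD(rn, 4) = 1, 2, 3 (and distinct file names) for the rest.
EXPORT SELECT fileline
  FROM (SELECT fileline,
               ROW_NUMBER() OVER (ORDER BY fileline) AS rn
          FROM edd_dwh_grl.grl_mrt_mzlg_m_syr_update_file_export)
 WHERE MOD(rn, 4) = 0
  INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
  FILE 'NC104_PartnerFlagging_DWH_update_part_0.txt'
  COLUMN SEPARATOR=';'
  ENCODING='UTF-8'
  REPLACE;
```

Running the four statements in separate sessions would also parallelize them, which is presumably where the extra resource pressure mentioned above comes from.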

littlekoi
Xpert

The data set in this example is too small (~16 MB per file).

It gets more balanced for data sets of larger size (~250 MB+), when it actually matters.

mwellbro
Xpert

Hi @Gallus ,

I think one of the deciding factors in this case is the file-to-node ratio - taking up @littlekoi's input regarding volume, I spun up a 4-node cluster with the following setup:

CREATE SCHEMA TEMP;
OPEN SCHEMA TEMP;

-- Helper table with 100 rows, used to multiply row counts
CREATE TABLE MULTIPLIER( idx DECIMAL(15,0));
INSERT INTO MULTIPLIER SELECT LEVEL FROM dual CONNECT BY LEVEL<=100;
COMMIT;

-- Export source: 100^4 = 100 million rows with a unique text column
CREATE TABLE EXP_SRC(
  my_pk       DECIMAL(15,0),
  unique_text CHAR(32)
);

INSERT INTO EXP_SRC
WITH base(rid) AS (
  SELECT ROW_NUMBER() OVER() AS rid
  FROM MULTIPLIER a CROSS JOIN MULTIPLIER CROSS JOIN MULTIPLIER CROSS JOIN MULTIPLIER
)
SELECT rid, HASH_MD5(rid) FROM base;

-- Check the resulting table size
SELECT raw_object_size/1024/1024 AS r_size_mb,
       mem_object_size/1024/1024 AS m_size_mb
FROM exa_dba_object_sizes WHERE object_name='EXP_SRC';

CREATE CONNECTION CON_IG_FTP_MZLG_SYR_SND_DEV
  TO 'http://...s3.eu-central-1.amazonaws.com' USER '' IDENTIFIED BY '';

For the actual exports I pretty much stayed with your syntax:

EXPORT SELECT * FROM EXP_SRC 
INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_1_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_2_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_3_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_4_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_5_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_6_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_7_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_8_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_9_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_10_202110709'
   COLUMN SEPARATOR=';'
   COLUMN DELIMITER=''
   ENCODING='UTF-8'
REPLACE;

Using this I came up with the following export sizes:

[Screenshot: export file sizes for 10 files on the 4-node cluster, noticeably skewed.]

Still a bit skewed, so I tried using 4 files (i.e. a 1:1 node-to-file ratio):

[Screenshot: export file sizes for 4 files on 4 nodes, much more evenly distributed.]

That does seem much more equally distributed (FYI: a GROUP BY IPROC() was run beforehand to check that the row-to-node distribution is even).
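For reference, the row-to-node check mentioned above can be sketched like this (IPROC() returns the node a row is stored on; the table name is the EXP_SRC from the setup above):

```sql
-- Count how many rows of the export source live on each node;
-- an even spread here is the precondition for evenly sized files
-- at a 1:1 node-to-file ratio.
SELECT IPROC() AS node_id, COUNT(*) AS row_cnt
  FROM EXP_SRC
 GROUP BY IPROC()
 ORDER BY node_id;
```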

Interestingly enough, once I used an ORDER BY ROWNUM or ORDER BY MOD(ROWNUM,10), it still got me 4 files... I'll check that again, it might be an execution error on my part...

As a final test I added some files to get to a multiple of my node count, so here's the result for 12 files with 4 nodes:

[Screenshot: export file sizes for 12 files on 4 nodes, somewhat more even.]

So a bit more even-ish - it will be interesting to hear whether someone from Exasol can elaborate on your initial question regarding the "several factors" 🙂

Cheers,
Malte

mwellbro
Xpert

Hi @exa-Chris, do you think you could tap someone regarding Gallus' question for a more comprehensive list of the "several factors" involved here?

Cheers & a nice weekend to all,
Malte

exa-Chris
Community Manager

Thanks Malte for watching and helping so much. I will do some more pressing internally now 🙂 Happy dry (and not burning) weekend to all!

Connecting Customers, Partners, Prospects and Exasolians is my passion. Apart from that I cycle, listen to music, and try to understand what all those technical discussions really mean...
