Export to multiple files, control the number of records per file

Gallus
SQL-Fighter

Hello everybody

I am using an EXPORT statement to store the records of a table in multiple files. I noticed that the file sizes and the number of records per file are not equal and vary quite a bit. In the Exasol documentation I found the following statement:

[Screenshot: excerpt from the Exasol documentation]

Is there any documentation on these "several factors", and what could I do to make sure that the number of records is more or less evenly distributed?

My query looks like the following (this is generated code, out of WhereScape):

EXPORT SELECT  grl_mrt_mzlg_m_syr_update_file_export.fileline AS fileline
  FROM edd_dwh_grl.grl_mrt_mzlg_m_syr_update_file_export grl_mrt_mzlg_m_syr_update_file_export
   INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV FILE 'NC104_PartnerFlagging_DWH_update.txt_1_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_2_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_3_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_4_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_5_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_6_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_7_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_8_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_9_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_10_202110709'
   COLUMN SEPARATOR=';'
   COLUMN DELIMITER=''
   ENCODING='UTF-8'
   REPLACE;

This results in the following:

[Screenshot: the resulting file sizes, showing the uneven distribution]

Kind Regards

Gallus

1 ACCEPTED SOLUTION

exa-Kristof
Team Exasol

Hi @Gallus and @mwellbro,

I reached out to R&D and this is what I learned:


Basically, each file is written with the data that is on the corresponding node(s). If the number of files is equal to the number of nodes, then each file will receive the data on that node. If the number of files is less or greater than the number of nodes, then the data on some nodes will be combined or split according to the number of files.
There is no guarantee that the data will be equally distributed.

So Malte, your assumption about the 1:1 node/file ratio is correct. However, if this is not given, the distribution across the files becomes quite a complicated process. For now, there's no real way to control file sizes or data distribution across those files (besides the 1:1 node/file ratio).
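For reference, the node count can be queried with NPROC(), so a 1:1 export would look roughly like this (a sketch only - schema, table, connection and file names are placeholders):

SELECT NPROC();  -- number of cluster nodes, e.g. 4

-- With 4 nodes, hand EXPORT exactly 4 FILE clauses:
EXPORT my_schema.my_table
   INTO CSV AT my_connection
   FILE 'part_1.csv'
   FILE 'part_2.csv'
   FILE 'part_3.csv'
   FILE 'part_4.csv'
   COLUMN SEPARATOR=';'
   ENCODING='UTF-8'
   REPLACE;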

I hope this sheds a bit more light on this topic.

Best regards,

Kristof


13 REPLIES


mwellbro
Xpert

Hi @exa-Kristof ,

thanks for digging this up - still kind of leaves me wondering what "quite a complicated process" entails exactly, but I guess this is either venturing into trade secret territory or a flow chart that would make our eyes bleed? 😉

@Gallus : do you think this helps with your original topic? I was trying to get a bit deeper into this by setting up an FTP server next to my AWS Exasol cluster (which would have allowed for better tracing than my original S3-based approach), but I must have taken a wrong turn somewhere around my own security group architecture 🤔
If I get around to it I'll post any additional findings here.

Cheers,
Malte

mwellbro
Xpert

Hi all,

so in the end I did manage to get my AWS systems to talk ("a win-ftp-server, an exa-4-node-cluster and a bunch of data walk into a VPC...."), so here's how the data flow spread across connections & ports, for anyone interested:

[Screenshot: data flow spread across connections & ports]

The resulting skew as previously shown:

[Screenshot: the resulting file-size skew]

Won't bring us any closer to solving the mysteries of how the data gets chosen for each stream, but at least it offered me some closure 😉

Cheers,
Malte

Gallus
SQL-Fighter

Good Morning everybody

Many thanks for all your help. This definitely helps. As we have 17 nodes, I am going to check with the recipient whether it would also be possible to deliver 17 files, or whether they can deal with different file sizes. I assume they can deal with it; the reason to split the data into multiple files is so that they can spread the load over a longer period of time.

Kind Regards

Gallus

exa-Kristof
Team Exasol

Hi @Gallus , of course this is a rather technical way to influence things (and admittedly not easy to explain to a department).

But feel free to add an idea if you think that it would help you.

And thank you Malte for your tireless efforts to solve riddles - I guess it's a mixture of your two presumptions as to why there aren't more details on this topic. 😉

exa-Chris
Community Manager

Thanks Malte for watching and helping so much. I will do some more pressing internally now 🙂 Happy dry (and not burning) weekend to all

Connecting Customers, Partners, Prospects and Exasolians is my passion. Apart from that I cycle, listen to music, and try to understand what all those technical discussions really mean...

mwellbro
Xpert

Hi @Gallus ,

I think one of the deciding factors in this case is the file-to-node ratio - taking up @littlekoi's input regarding volume, I spun up a 4-node cluster with the following setup:

-- Build a 100-row helper table, then cross join it four times
-- to generate 100^4 = 100 million test rows
CREATE SCHEMA TEMP;
OPEN SCHEMA TEMP;

CREATE TABLE MULTIPLIER( idx DECIMAL(15,0));
INSERT INTO MULTIPLIER SELECT LEVEL FROM dual CONNECT BY LEVEL<=100;
COMMIT;

CREATE TABLE EXP_SRC(
my_pk decimal(15,0),
unique_text CHAR(32)
);

INSERT INTO EXP_SRC
WITH base(rid) AS (SELECT ROW_NUMBER() OVER() AS rid FROM MULTIPLIER a CROSS JOIN MULTIPLIER CROSS JOIN MULTIPLIER CROSS JOIN MULTIPLIER)
SELECT rid, HASH_MD5(rid) FROM base;

-- Check the raw and in-memory size of the generated table
SELECT raw_object_size/1024/1024 AS r_size_mb, mem_object_size/1024/1024 AS m_size_mb
FROM exa_dba_object_sizes WHERE object_name='EXP_SRC';

-- Connection credentials elided
CREATE CONNECTION CON_IG_FTP_MZLG_SYR_SND_DEV TO 'http://...s3.eu-central-1.amazonaws.com' USER '' IDENTIFIED BY '';

For the actual exports I pretty much stayed with your syntax:

EXPORT SELECT * FROM EXP_SRC 
INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_1_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_2_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_3_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_4_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_5_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_6_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_7_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_8_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_9_202110709' 
   FILE 'NC104_PartnerFlagging_DWH_update.txt_10_202110709'
   COLUMN SEPARATOR=';'
   COLUMN DELIMITER=''
   ENCODING='UTF-8'
   REPLACE;

Using this I came up with the following export sizes:

[Screenshot: export sizes with 10 files on a 4-node cluster]

Still a bit skewed, so I tried using 4 files (i.e. a node-to-file ratio of 1:1):

[Screenshot: export sizes with 4 files on 4 nodes]

This does seem much more equally distributed (FYI: a GROUP BY IPROC() was run beforehand to check that the row-to-node distribution is even).
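In case anyone wants to reproduce that check, it was along these lines (a sketch against the EXP_SRC test table from the setup above):

-- IPROC() returns the node a row is stored on; count rows per node
SELECT IPROC() AS node_id, COUNT(*) AS row_cnt
FROM EXP_SRC
GROUP BY IPROC()
ORDER BY node_id;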

Interestingly enough, once I used an ORDER BY ROWNUM or ORDER BY MOD(ROWNUM,10), it still got me 4 files... will check that again, might be an execution error on my part...

As a final test I added some files to get to a multiple of my node count, so here's the result for 12 files with 4 nodes:

[Screenshot: export sizes with 12 files on 4 nodes]

So a bit more even-ish - it will be interesting to hear if someone from Exasol elaborates on your initial question regarding the "several factors" 🙂

Cheers,
Malte

mwellbro
Xpert

Hi @exa-Chris , think you could tap someone regarding Gallus' question for a more comprehensive list of the "several factors" involved here?

Cheers & a nice weekend to all,
Malte

littlekoi
Xpert

The data set in this example is too small (~16 MB per file).

It gets more balanced for larger data sets (~250 MB+), which is when it actually matters.

mwellbro
Xpert

Hi @Gallus ,

just off the top of my head: since you are already using a STATEMENT and not a TABLE clause, maybe throw in an ORDER BY... just a hunch... you'll pay with extra CPU, memory and processing time, but perhaps it might influence the EXPORT distribution?
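To make that hunch concrete, roughly something like this (just a sketch - the file names are placeholders and the ORDER BY column is whatever fits your data):

EXPORT (
   SELECT fileline
     FROM edd_dwh_grl.grl_mrt_mzlg_m_syr_update_file_export
    ORDER BY fileline  -- the extra sort is where the CPU/MEM cost comes in
)
INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
FILE 'part_1.csv'
FILE 'part_2.csv'
COLUMN SEPARATOR=';'
ENCODING='UTF-8'
REPLACE;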

Cheers,
Malte

Gallus
SQL-Fighter

Hello @mwellbro 

Many thanks for your answer. I have just tried it with an ORDER BY, but that did not change anything.

Kind Regards

Gallus

mwellbro
Xpert

A very brutish way of going about it, but you could use multiple EXPORT statements and WHERE the data accordingly => that should give you quite exact control over the distribution per file at the expense of "ease of execution" (and probably higher resource pressure, since you'd use multiple sessions... just spewing out ideas 😉 )
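Roughly like this (a sketch bucketing by MOD() over a numeric key, here the my_pk column of the EXP_SRC test table from above - each bucket ends up with an almost identical row count):

-- One EXPORT per target file; each WHERE picks one MOD() bucket
EXPORT (SELECT * FROM EXP_SRC WHERE MOD(my_pk, 3) = 0)
   INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
   FILE 'part_1.csv' COLUMN SEPARATOR=';' REPLACE;

EXPORT (SELECT * FROM EXP_SRC WHERE MOD(my_pk, 3) = 1)
   INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
   FILE 'part_2.csv' COLUMN SEPARATOR=';' REPLACE;

EXPORT (SELECT * FROM EXP_SRC WHERE MOD(my_pk, 3) = 2)
   INTO CSV AT CON_IG_FTP_MZLG_SYR_SND_DEV
   FILE 'part_3.csv' COLUMN SEPARATOR=';' REPLACE;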

mwellbro
Xpert

good to know - let's see... will try something and get back to you