IMPORT FROM (large) CSV FILE

mgasperi_work
SQL-Fighter

Hello!

I'm facing issues importing several large CSV files (> 400 MB each). This is what I've noticed, also with smaller files:

1. Sometimes I'm receiving errors like the one below, and the reported line number changes every time I execute the import:

[Code: 0, SQL State: 42636]  java.sql.SQLException: ETL-2101: Error while parsing row=1319 (starting from 0) [CSV Parser found incorrect number of columns. Expected: [22], found [21] columns in file

2. When I retry the same file, it just executes without trouble.

3. There seems to be some kind of size limit: one file with 2.4 million rows works fine, while the next file with 2.41 million rows does not.

Is this something I should not be doing with the standard IMPORT command? Any hint where I should be looking?

Here is the command I'm trying to execute. I'm separating the statements with ";", and I have around 10 files of around 400 MB each.

IMPORT INTO CSV_TABLE FROM LOCAL CSV FILE 'C:\Users\xxxxx\Downloads\file1.csv' ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|' 
ROW SEPARATOR = 'LF' 
TRIM
NULL = 'na'
SKIP = 1;

KR

Michael

1 ACCEPTED SOLUTION

mgasperi_work
SQL-Fighter

I managed to import the files as follows:

1. For small files and files with fewer than 2 million rows: brute-force re-running the failed imports until Exasol swallowed the data.

2. For larger files, i.e. > 2 million rows: I split the files, and the data was imported (a sketch follows below).

3. In the meantime I've also seen Java timeouts; for me, Java is a nightmare anyway.

I've never seen this behavior on any other DBMS, and it never became fully clear whether it had something to do with one cluster node that had trouble until yesterday. In my role I have only limited access to Exasol, so I couldn't find out what caused the problem, but I needed a quick solution.
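For illustration, a rough sketch of the split-file import (the part file names are made up; Exasol's IMPORT accepts several FILE clauses, so the split parts can still be loaded in one statement):

IMPORT INTO CSV_TABLE FROM LOCAL CSV
FILE 'C:\Users\xxxxx\Downloads\file1_part1.csv'
FILE 'C:\Users\xxxxx\Downloads\file1_part2.csv'
ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|'
ROW SEPARATOR = 'LF'
TRIM
NULL = 'na'; -- SKIP = 1 omitted, assuming the split parts carry no header line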

 


6 REPLIES


mwellbro
Xpert

Hi @mgasperi_work,

What's a Java timeout? If you could share the exact error message, I'm curious about this one 😄

Just to put a bit of testing into play here:

 

-- let's get us some target table
CREATE OR REPLACE TABLE IMPORT_TARGET (
    C1 DECIMAL(10,0),
    C2 DECIMAL(10,0),
    C3 DECIMAL(10,0),
    CX CHAR(20) UTF8,
    C4 VARCHAR(256) UTF8,
    C5 TIMESTAMP,
    C6 DATE,
    C7 DATE,
    C8 VARCHAR(256) UTF8
);

-- create 3 million rows to match Michael's file sizes
export (
with base(c1,c2,c3) as (
-- 1000 x 1000 x 3 cross join = 3 million rows
select * from
(select level from dual connect by level<=1000)
cross join
(select level from dual connect by level<=1000)
cross join
(select level from dual connect by level<=3)
)
select rownum,rownum,rownum                  -- C1, C2, C3
,'some constant string'                      -- CX
,'some variable '||rownum||' string !'       -- C4
,CURRENT_TIMESTAMP                           -- C5
,CURRENT_DATE                                -- C6
,TO_DATE('00010101','YYYYMMDD') + rownum     -- C7
,'even more variable string: ' || rownum     -- C8, 9 columns matching IMPORT_TARGET
from base
)
into local csv file 'C:\temp\3m_file.gz'
-- into local csv file 'C:\temp\3m_file.csv'
ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|'
ROW SEPARATOR = 'LF'
WITH COLUMN NAMES; -- write a header line, since the import below uses SKIP = 1

-- repeat the following 10 to 20 times

truncate table import_target;

import into import_target from local csv
file 'C:\temp\3m_file.gz'
--file 'C:\temp\3m_file.csv'
ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|' 
ROW SEPARATOR = 'LF' 
TRIM
NULL = 'na'
SKIP = 1
;

 

I ran this against a v7.0.9 ExaOne with Exaplus 6.0.14 (don't even ask 😀) and DBeaver 21.0.2, around 10 to 20 times in each client, and then (because .gz might differ from "native" CSV) again with .csv... no errors to be found. I also killed the import a few times to check whether I could provoke an "erroneous error message" through a session kill, but no dice.

'Tis strange... But this Java business you mentioned seems to indicate that it might have been a client-side issue?

Can you (perhaps on a less time-critical occasion) reproduce this problem?

Anyway, I'm glad you found a way around your predicament!

Cheers,
Malte

mgasperi_work
SQL-Fighter

Hello @jens_areto!
Thanks, but that wouldn't help me much; I have > 1,700 files to import.

Here is another example with a very small file (248 KB, 1,041 lines), which demonstrates what I'm facing constantly:

10:54:16	INFO					Physical database connection acquired for: Exasol (PROD)	
10:54:16	SUCCESS	ALTER	0.031		0	OK. No rows were affected	ALTER SESSION SET NLS_NUMERIC_CHARACTERS=',.'
10:54:16	FAILED	IMPORT	5.237		0	[Code: 0, SQL State: 42636]  ETL-2101: Error while parsing row=799 (starting from 0) [CSV Parser found incorrect number of columns. Expected: [38], found [5] columns in file '28dce917-3e6c-4a5e-9e76-58422863f389'] (Session: 1697458375736779682) while executing 'IMPORT INTO CSV_TABLE FROM CSV AT 'https://10.101.80.76:43196' FILE '28dce917-3e6c-4a5e-9e76-58422863f389' ENCODING = 'UTF-8' COLUMN SEPARATOR = '|' ROW SEPARATOR = 'LF' TRIM NULL = 'na' SKIP = 1;'	IMPORT INTO CSV_TABLE FROM LOCAL CSV FILE 'C:\Users\xxxxxxx\Downloads\Import\file1.csv' ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|' 
ROW SEPARATOR = 'LF' 
TRIM
NULL = 'na'
SKIP = 1
10:54:21	FINISHED		5.268	0.0	0	Success: 1 Failed: 1	
10:56:06	STARTED					Executing for: 'Exasol (PROD)' [Exasol], Schema: SBX_SECPOC	
10:56:06	SUCCESS	ALTER	0.031		0	OK. No rows were affected	ALTER SESSION SET NLS_NUMERIC_CHARACTERS=',.'
10:56:06	SUCCESS	IMPORT	1.596		1040	OK	IMPORT INTO CSV_TABLE FROM LOCAL CSV FILE 'C:\Users\xxxxxxx\Downloads\Import\file1.csv' ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|' 
ROW SEPARATOR = 'LF' 
TRIM
NULL = 'na'
SKIP = 1
10:56:08	FINISHED		1.627	0.0	1040	Success: 2 Failed: 0	

mwellbro
Xpert

Hi @mgasperi_work,

I never saw this kind of problem before... are you sure about the "LF" for ROW SEPARATOR? The import directory looks more like Windows, and I'd expect CRLF there, but that's just an observation (it might be downloaded files, so you'll know best what your row separator should be).
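For illustration, the CRLF variant would just swap the row separator in your statement:

IMPORT INTO CSV_TABLE FROM LOCAL CSV FILE 'C:\Users\xxxxx\Downloads\file1.csv' ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|'
ROW SEPARATOR = 'CRLF' -- instead of 'LF', for files with Windows line endings
TRIM
NULL = 'na'
SKIP = 1;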

Does the "Success: 2" indicate 2 rows ? Seems small... also: what is the actual column count in your source files ?

Cheers,
Malte

mgasperi_work
SQL-Fighter

Hello @mwellbro!

Thanks for your feedback.

1. The separator is LF; it's a Unix export and the encoding is UTF-8, no BOM or anything else. I can't show more because it's highly sensitive data.

(screenshot attachment: mgasperi_work_0-1618897415043.png)

2. The "Success: 2" refers to the number of statements; I executed the SQL in DbVisualizer. I'll test with the command line, but I'm not expecting different results.

Anyway, thanks for your help. I've heard from our IT department that one cluster node has hardware issues; maybe that caused the trouble. I'll check with them.

Michael

jens_areto
Contributor

Hey Michael,

I have no direct solution for your problem. I've never had those problems during file import, but maybe you can add a reject limit and an error clause to your IMPORT statement, so that the import doesn't fail and you can inspect the rows that caused errors in detail in the error table. Maybe they give another hint as to whether it's something inside the data or faulty functionality with those huge amounts of data. You can then list the error message and the row number of each failed row in the error table.
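For illustration, a rough sketch of such an error clause on Michael's statement (ERR_CSV_TABLE is a placeholder name for the error table; the optional expression, here CURRENT_TIMESTAMP, is stored alongside each rejected row):

IMPORT INTO CSV_TABLE FROM LOCAL CSV FILE 'C:\Users\xxxxx\Downloads\file1.csv' ENCODING = 'UTF-8'
COLUMN SEPARATOR = '|'
ROW SEPARATOR = 'LF'
TRIM
NULL = 'na'
SKIP = 1
ERRORS INTO ERR_CSV_TABLE (CURRENT_TIMESTAMP)
REJECT LIMIT UNLIMITED;

-- afterwards, inspect row numbers and error messages of the rejected rows
SELECT * FROM ERR_CSV_TABLE;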

Best regards

Jens