Text file import - Is it possible to ignore rows at the end of the text file?

steve-remington
Padawan

I am trying to import the following pipe (|) separated variable file into an Exasol table:

http://www.nasdaqtrader.com/dynamic/SymDir/nasdaqtraded.txt

I used the following SQL to create the destination table and import the contents of the file into the table.

create table nasdaq_traded (
        nasdaq_traded char(1)
       ,symbol varchar(10)
       ,security_name varchar(256)
       ,listing_exchange char(1)
       ,market_category char(1)
       ,etf char(1)
       ,round_lot_size int
       ,test_issue char(1)
       ,financial_status char(1)
       ,cqs_symbol varchar(10)
       ,nasdaq_symbol varchar(10)
       ,next_shares char(1)
);

import into stocks_data.nasdaq_traded
        from csv at 'http://www.nasdaqtrader.com/dynamic/SymDir/' file 'nasdaqtraded.txt' (1..12)
        row separator = 'CRLF'
        column separator = '|'
        skip = 1
        reject limit 0;

When I ran the above SQL, the following error was returned:

[Code: 0, SQL State: 42636]  ETL-2101: Error while parsing row=9437 (starting from 0) [CSV Parser found incorrect number of columns. Expected: [12], found [6] columns in file 'nasdaqtraded.txt'] (Session: 1685652005867552768)

The reason for the error is that the last line of the file is a footer row containing the file creation time. See the sample below, showing the top four and bottom four rows of the file:

Nasdaq Traded|Symbol|Security Name|Listing Exchange|Market Category|ETF|Round Lot Size|Test Issue|Financial Status|CQS Symbol|NASDAQ Symbol|NextShares
Y|A|Agilent Technologies, Inc. Common Stock|N| |N|100|N||A|A|N
Y|AA|Alcoa Corporation Common Stock |N| |N|100|N||AA|AA|N
Y|AAA|Listed Funds Trust AAF First Priority CLO Bond ETF|P| |Y|100|N||AAA|AAA|N
...
Y|ZYME|Zymeworks Inc. Common Shares|N| |N|100|N||ZYME|ZYME|N
Y|ZYNE|Zynerba Pharmaceuticals, Inc. - Common Stock|Q|G|N|100|N|N||ZYNE|N
Y|ZYXI|Zynex, Inc. - Common Stock|Q|S|N|100|N|N||ZYXI|N
File Creation Time: 1209202018:03|||||

I know the "skip" operator can be used to ignore a specified number of rows at the top of the file. Is there an equivalent way to ignore rows at the end of a file? 

1 ACCEPTED SOLUTION

exa-Uwe
Moderator

Hi Steve,

this is actually a cool use case, thank you for sharing! 😎

This is one way to make it work:

import into nasdaq_traded
from csv at 'http://www.nasdaqtrader.com/dynamic/SymDir/' file 'nasdaqtraded.txt' (1..12)
row separator = 'CRLF'
column separator = '|'
skip = 1
errors into error_table (current_timestamp)
reject limit unlimited;

error_table is auto-generated by the IMPORT command and contains all rows that could not be loaded, which in your case is just the last row.
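
A quick way to double-check (just a sketch, assuming the error_table name from the statement above) is to look at what was rejected once the import has run:

-- only the footer line ("File Creation Time: ...") should show up here
select * from error_table;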


Best regards
Uwe

 

3 REPLIES


mwellbro
Xpert

Hi @steve-remington,

a bit more "clunky" than what Uwe suggested but here´s a way if you don´t want or can´t use the error_table:

select 
 -- the inline IMPORT below reads every line into a single column c1;
 -- the regexp expressions split out the first ten pipe-separated fields,
 -- the last expression takes whatever remains after the tenth pipe
 replace(regexp_substr(c1,'([\w\W].*?)\|',1,1),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,2),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,3),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,4),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,5),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,6),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,7),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,8),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,9),'|','')
,replace(regexp_substr(c1,'([\w\W].*?)\|',1,10),'|','')
,replace(substr(c1,regexp_instr(c1,'([\w\W].*?)\|',1,10) +1 ),'|','')
 from (
import into (c1 VARCHAR(2000000)) from csv at 'http://www.nasdaqtrader.com/dynamic/SymDir/' file 'nasdaqtraded.txt' 
        row separator = 'CRLF'
        column separator = '¡'  -- a separator that never occurs in the file, so the whole line ends up in c1
        skip = 1
        reject limit 0
) where c1 not like '%File Creation Time:%'

Won't work with local files because "inline import" is only supported for remote sources, but seeing as you use HTTP it could work for you.

I'd expect it to use more DBRAM than Uwe's solution. It's more of an illustration of what could be done if more "extensive modifications" are necessary within the IMPORT, but perhaps it'll come in handy at some point.
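
For a quick check before loading anything, the same inline import can also be used on its own to count the rows that survive the footer filter (a sketch under the same assumptions as the query above):

-- count the data rows once the footer line has been filtered out
select count(*) from (
import into (c1 VARCHAR(2000000)) from csv at 'http://www.nasdaqtrader.com/dynamic/SymDir/' file 'nasdaqtraded.txt' 
        row separator = 'CRLF'
        column separator = '¡'
        skip = 1
        reject limit 0
) where c1 not like '%File Creation Time:%';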

Cheers,
Malte

exa-Franz
Team Exasol

If you can guarantee that this is the case in all your files, just do a

reject limit 1

Problem solved?
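
In other words, the statement from the question, unchanged except that the limit now tolerates exactly one bad row (a sketch; note that it would also silently swallow any other single malformed row, not just the footer):

import into stocks_data.nasdaq_traded
        from csv at 'http://www.nasdaqtrader.com/dynamic/SymDir/' file 'nasdaqtraded.txt' (1..12)
        row separator = 'CRLF'
        column separator = '|'
        skip = 1
        reject limit 1;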