PyExasol export_to_pandas

TomHel
Contributor

Dear all, 

I was wondering if I am using PyExasol's export_to_pandas function wrong, or if I should be using something else. The thing I noticed is:

export_to_pandas appears to use an export to csv and reimport to transfer data into a python pandas dataframe. This leads to certain problems:

a) I had to ask special export permissions for my user type on the database from our overcompliant IT department

b) it seems to take quite a long time, mainly attributed to the spin-up time (depending on the query and table. I have some quick SQLs that now take noticeably longer on Python Exasol than they did via, for example, the teradata python driver)

c) this process gives me some issues with data types being lost. I have some columns with data that looks like decimal numbers, but is a string. I now have to painstakingly define data types (via callback parameters) for each column instead of getting them from the database directly. This also never used to be an issue in our old teradata architecture.

Is there a more elegant way to get the query results into a dataframe without an intermediate step in the background?

Best regards, 

Tom

 

7 REPLIES

exa-MathiasHo
Exasol Alumni

Hi @TomHel,

Thank you for posting. Let me reach out internally and find an answer for you. 😉

Best regards,
Mathias

exa-Matze
Team Exasol

Hi @TomHel,

Indeed, PyExasol is designed to handle huge data sets, but it is not optimized for smaller queries.

 

Alternatives are:

 

Exasol python package

This package is delivered by Exasol, but it is deprecated, since we recommend using PyExasol unless there is a reason not to 😉

This seems to be the case here. Data can be read into pandas data frames directly.

Maybe this is an option for your use case.

 

Turbodbc

Turbodbc is delivered as open source by one of our customers. I haven't used it myself, but I know that it's widely used.

It works with cursors, but not with data frames.

Might be an option, too.
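
A rough sketch of what the cursor-based path could look like — the DSN name "exasol", the credentials, and the query below are placeholder assumptions, not values from this thread:

import pandas
from turbodbc import connect

# assumes an ODBC data source named "exasol" has already been configured for the cluster
connection = connect(dsn='exasol', uid='my_user', pwd='my_password')

cursor = connection.cursor()
cursor.execute('SELECT ...')

# turbodbc returns plain rows; the column labels come from the DB-API cursor description
df = pandas.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])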

 

Let me ping @littlekoi (PyExasol) and @MichaelKoenigBY (Blue Yonder) for their thoughts.

 

Best,

Matze

 

 

 

littlekoi
Xpert

You may try to use a very basic approach:

 

import pyexasol
import pandas

connection = pyexasol.connect(...)

stmt = connection.execute('SELECT ...')
# "df" avoids shadowing the usual pandas alias; column_names() keeps the original column labels
df = pandas.DataFrame(stmt.fetchall(), columns=stmt.column_names())

 

 

You may also use stmt.columns() to access information about result data types, if necessary.

https://github.com/badoo/pyexasol/blob/master/docs/REFERENCE.md#columns

 

.export_to_pandas() is advised for medium to large data sets with ~100k+ rows. In that case the "spin-up" time becomes insignificant compared to the savings from the transport protocol and the reduced deserialisation overhead.
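
For reference, the bulk path is then a single call (the query here is a placeholder):

df = connection.export_to_pandas('SELECT ...')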

 

The main problem with Exasol vs. pandas data types comes from fundamental limitations of pandas / numpy. Pandas natively supports fewer data types than Exasol. It is especially noticeable with:

  • integers (pandas is limited to int64);
  • decimals (no native support in pandas, only float with loss of precision);
  • timestamps (Exasol ranges from 0001-01-01 to 9999-12-31; pandas can only cover a fraction of that).

 

So it is not possible to create a universal function that reads any data from Exasol into pandas and preserves data types. The current approach lets the user read any data at least as the "object" data type, and fix it later by providing data types explicitly or by transforming the data frame.

On the bright side, it's rarely an issue for typical users, and it's still easy to "fix" using a few lines of code and internal knowledge about your data.
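
For example, a minimal sketch of such a fix, assuming hypothetical columns order_id (an integer read as "object") and amount (a DECIMAL that should keep exact precision):

import decimal

# cast an integer column that arrived as "object" back to a proper integer dtype
df['order_id'] = df['order_id'].astype('int64')

# parse decimals explicitly instead of accepting a lossy float conversion
df['amount'] = df['amount'].map(decimal.Decimal)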

TomHel
Contributor

Thank you all for your great responses. I will try some of the hints and work out the benefits and downsides of the methods. I will accept one of your answers randomly as the solution, since they are all great.

exa-John
Team Exasol

Hi,

We are using an ExaOne environment and are experiencing data inconsistencies when retrieving data via PyExasol. We are using export_to_pandas from the PyExasol package to analyse data in Google Colab.

=============================
Here is the script:
=============================
import pyexasol
import ipywidgets as widgets
import ssl
import pandas as pd

C = pyexasol.connect(dsn='1.2.3.4:8563', user='', websocket_sslopt={'cert_reqs': ssl.CERT_NONE}, access_token=xtoken)

sql_0 = "SELECT * FROM TABLE WHERE ..."

df_0 = C.export_to_pandas(sql_0)

df_0.head()


=============================
What we know
=============================
1) We tried the community articles:
https://community.exasol.com/t5/discussion-forum/pyexasol-export-to-pandas-results-in-data-inconsist....
But it did not solve our issue.

2) The involved text fields are up to 2'000'000 characters long. We notice data inconsistencies in the resulting dataframe: the data does not seem to be consistent row-wise. In a single row we see data from different columns. When we try to find such a record in the database by filtering for that combination of values, we do not find a single record.

Thank you and best regards!

TomHel
Contributor

Hey, 

this could be a complete long shot, but I am currently fighting with unwanted line breaks on export in my data too (I haven't tried it with PyExasol). At first I thought this might have to do with our SQL client (DBeaver), but now I am wondering if it is a different issue altogether.

 

Anyway, the solution (or rather, workaround) was to remove potential line-break characters from the suspected culprit columns (I know which columns cause the problem):

SELECT ...
    ,REPLACE(REPLACE(field_name, CHAR(10), ''), CHAR(13), '') AS ...
FROM ...

If this works, it could mean that PyExasol is not the main problem, since these unwanted line breaks seem to appear in exported CSV files, and export_to_pandas is, as far as I understand, simply exporting to CSV before reading from that file into a pandas dataframe. This might not be helpful at all, just something that I have observed.
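
For completeness, the same cleanup could also be done on the Python side after the export; a minimal sketch, assuming the exported frame is df and the affected column is the hypothetical field_name from above:

# strip carriage returns (CHAR(13)) and line feeds (CHAR(10)) from the string column
df['field_name'] = (df['field_name']
                    .str.replace('\r', '', regex=False)
                    .str.replace('\n', '', regex=False))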

Best regards, 

Tom

 

exa-John
Team Exasol

@TomHel - I see you are going after carriage controls (line breaks). I reviewed a couple of sites to get more information:
https://community.boomi.com/s/article/whatisthedifferencebetweenchar13andchar10
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/a87b9fbb-40dc-4a2e-83df-eab150620122/differ...

As an alternative to doing the replace, I suggest trying the Linux command "dos2unix", which should convert the Windows line breaks to Linux line breaks.

1) Is "dos2unix" installed on your Linux server?

Command: which dos2unix

My output tells us it's not installed.

2) We can install dos2unix:

Command: yum install dos2unix

3) Once installed, the usage information can be found here: https://linux.die.net/man/1/dos2unix

I might suggest using the "-n" option with the dos2unix command to output the converted file to a new file, leaving the original input file pristine.