Accessing Excel files in Python

andreeroos
SQL-Fighter

I am trying to access an Excel file through Python, but the trustworthy pandas.read_excel doesn't seem to work (the xlrd library is missing).

Does anyone know of any other preinstalled python libraries that can read excel files?

1 ACCEPTED SOLUTION

jens_areto
Contributor

Hey andreeroos,

I think there is also another way how you can get a "quick win" here for your customer. For sure a natural support would be great but for processing data in my opinion Excel is not the best format which should be used here. But nevertheless, if you want to process an xlsl file you will need external libraries. If you do not want to generate a language container you have too other options.

In each case you need to configure a bucket for your Exasol. But this task will take less than 5 minutes  (https://docs.exasol.com/administration/on-premise/bucketfs/create_new_bucket_in_bucketfs_service.htm). Make sure that you configure a read and write password for your bucket and that you add an http port for your BucketFS Service. If the bucket is configured you can load all libraries which you need to solve your problem in wheel format or tar.gz format and add them straight to the bucket. 

You can load it from this page : https://pypi.org/project/xlrd/ and add this file to the bucket using curl or the BucketFSExplorer. If this is done you need to activate your uploaded file inside your UDF-Script and then you will be able to use the included functionality inside your UDF.

jens_areto_0-1618380707674.png

Inside your Bucket you can store whl and tar.gz file. For my tests I also uploaded two test files to process but for best practice those files (test) including data should not be loaded into the Bucket.



CREATE or replace PYTHON3 SCALAR SCRIPT UDF_TEST.EXCEL_PROCESSING_TEST() EMITS ( "RES" VARCHAR(2000000) UTF8) AS
import glob
#--Syntax for activating a tar.gz file
sys.path.extend(glob.glob('/buckets/bfsdefault/jsc/xlrd-1.2.0/*'))
import xlrd
#--Syntax for activating a wheel file
sys.path.extend(glob.glob('/buckets/bfsdefault/jsc/xlrd-2.0.1-py2.py3-none-any.whl'))
import pandas as pd
import os
def run(c):
   fe = os.path.exists("/buckets/bfsdefault/jsc/test.xlsx")
   df=pd.read_excel("/buckets/bfsdefault/jsc/test.xlsx",engine='xlrd')
   c.emit("File exists: " + str(fe))
   c.emit(str(df))
   c.emit("Pandas Version: " + str( pd.__version__ ) )
   c.emit("Xlrd Version: " + str( xlrd.__version__ ) )
/


select UDF_TEST.EXCEL_PROCESSING_TEST() ;

Here is a code example how I can acitvate both a wheel file or tar.gz file (there is a little difference in the syntax for the activation). 

jens_areto_1-1618381886736.png

In the result you can see that I am able to read the data from the xlsx into a Dataframe. This is the "quick win" how you can do this stuff inside the Exasol. But there will come another problem becaue you can only use versions from xlrd up to 1.2.0, all later version will no longer support xlsx format and you need to use other libraries like openpyxl (which will need a different pandas version than preinstalled). But it would not be a problem to also add a different pandas version with a wheel file inside the bucket and the openpyxl, include them in your UDF and there you go.

 

But as mentioned before best solution for me would be the language container which was already mentioned earlier. You don´t need to think about library dependencies when using pip here and do not need to add multiple files. I totally agree reading through the git and testing all this stuff will need some time. For this we created a bash script which takes some input parameter and is a little easier to use to generate this script, move it to your Exasol Bucket and activate the new language container inside your DB automatically. I will publish this script and a little article "How to use" next week on our blog and share the link to the blog post with this script under this post.

View solution in original post

15 REPLIES 15

exa-Aleksandr
Team Exasol
Team Exasol

Hi, @andreeroos 

My pandas on Windows, Exasol 7.0.7 (python 2/3) and Exasol 6.2.12 (python 2/3) - they all want additional packages like xlrd or openpyxl to process .xlsx(see engine parameter in https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html).

And none of them seem to be included in standard language containers: https://docs.exasol.com/database_concepts/udf_scripts/python3.htm

Therefore, you might want to build your own script language container with a set of packages of choice: https://github.com/exasol/script-languages-release

 

andreeroos
SQL-Fighter

This is unfortunate and will definetely complicate things.
As for me, currently being in a pre-sales situation, this probably won't resonate well with the potential customer.

Not being able to read excel files, as a standard functionality, is most likely considered a sort of geeky position since excel, for lots of people, is considered a really common way to store data. In this particular case it would enable combining fresh data from business users with large tables in Exasol.

I really think you should consider adding this functionality in Python if possible.

Meanwhile I will explore some other options. Apache Drill was mentioned by a Exasol contact, but this will require setting up an additional server for this purpose. Building script containers though seems a bit out of my league atm.

jens_areto
Contributor

Hey andreeroos,

I think there is also another way how you can get a "quick win" here for your customer. For sure a natural support would be great but for processing data in my opinion Excel is not the best format which should be used here. But nevertheless, if you want to process an xlsl file you will need external libraries. If you do not want to generate a language container you have too other options.

In each case you need to configure a bucket for your Exasol. But this task will take less than 5 minutes  (https://docs.exasol.com/administration/on-premise/bucketfs/create_new_bucket_in_bucketfs_service.htm). Make sure that you configure a read and write password for your bucket and that you add an http port for your BucketFS Service. If the bucket is configured you can load all libraries which you need to solve your problem in wheel format or tar.gz format and add them straight to the bucket. 

You can load it from this page : https://pypi.org/project/xlrd/ and add this file to the bucket using curl or the BucketFSExplorer. If this is done you need to activate your uploaded file inside your UDF-Script and then you will be able to use the included functionality inside your UDF.

jens_areto_0-1618380707674.png

Inside your Bucket you can store whl and tar.gz file. For my tests I also uploaded two test files to process but for best practice those files (test) including data should not be loaded into the Bucket.



CREATE or replace PYTHON3 SCALAR SCRIPT UDF_TEST.EXCEL_PROCESSING_TEST() EMITS ( "RES" VARCHAR(2000000) UTF8) AS
import glob
#--Syntax for activating a tar.gz file
sys.path.extend(glob.glob('/buckets/bfsdefault/jsc/xlrd-1.2.0/*'))
import xlrd
#--Syntax for activating a wheel file
sys.path.extend(glob.glob('/buckets/bfsdefault/jsc/xlrd-2.0.1-py2.py3-none-any.whl'))
import pandas as pd
import os
def run(c):
   fe = os.path.exists("/buckets/bfsdefault/jsc/test.xlsx")
   df=pd.read_excel("/buckets/bfsdefault/jsc/test.xlsx",engine='xlrd')
   c.emit("File exists: " + str(fe))
   c.emit(str(df))
   c.emit("Pandas Version: " + str( pd.__version__ ) )
   c.emit("Xlrd Version: " + str( xlrd.__version__ ) )
/


select UDF_TEST.EXCEL_PROCESSING_TEST() ;

Here is a code example how I can acitvate both a wheel file or tar.gz file (there is a little difference in the syntax for the activation). 

jens_areto_1-1618381886736.png

In the result you can see that I am able to read the data from the xlsx into a Dataframe. This is the "quick win" how you can do this stuff inside the Exasol. But there will come another problem becaue you can only use versions from xlrd up to 1.2.0, all later version will no longer support xlsx format and you need to use other libraries like openpyxl (which will need a different pandas version than preinstalled). But it would not be a problem to also add a different pandas version with a wheel file inside the bucket and the openpyxl, include them in your UDF and there you go.

 

But as mentioned before best solution for me would be the language container which was already mentioned earlier. You don´t need to think about library dependencies when using pip here and do not need to add multiple files. I totally agree reading through the git and testing all this stuff will need some time. For this we created a bash script which takes some input parameter and is a little easier to use to generate this script, move it to your Exasol Bucket and activate the new language container inside your DB automatically. I will publish this script and a little article "How to use" next week on our blog and share the link to the blog post with this script under this post.

View solution in original post

andreeroos
SQL-Fighter

That worked like a charm! 😊 
This clearly felt like the easiest way to solve this.

Thanks for the help!

jens_areto
Contributor

For the example I used a Exasol Virtual Machine 7.0.6. Forgot to say that in my answer above.

mwellbro
Xpert

@andreeroos : a fun little detail I discovered just now when changing from CREATE PYTHON SCALAR SCRIPT ( where reading xlsx still works ) to CREATE PYTHON3 SCALAR SCRIPT in EXA-6.2.8:

[22002] VM error: ImportError: Install xlrd >= 0.9.0 for Excel support (Session: 1696955423589142353)

 

mwellbro
Xpert

@andreeroos : wouldn´t it be an option to use a 6.2.8 where reading xlsx still works ( just checked, I´m able to read a fresh file saved as xlsx just like in your example ) ?

Or are you required to run on the latest version / community-edition ? This seems to be a case where the dependencies ( i.e. xlrd or openpyxl ) outpaced the platform ( exa ) and should be a rectifiable situation ( I would hope, @exa-* ? )

exa-Aleksandr
Team Exasol
Team Exasol

hi @mwellbro

So yes, the Excel pandas option was not declared as supported and somehow disappeared in one of minor versions after 6.2.8. We are now checking with Product management if it possible to add the required libraries permanently and if yes then when.

mwellbro
Xpert

Hi @exa-Aleksandr , while I agree that yes, as a short term solution "compiling" your own language container gets the job done it seems like a short coming that pandas is provided in an "in 80% percent of times it works 100% of the time" way....

And it seems like a "step back" when the pandas provided in 6.2.8 worked with xlsx ( or I´m constantly reading .xlsx that are indeed something else, will verify.... ).

Will need to include this as a test case on my side....any chances that the pandas provided in upcoming releases will be "dependency complete" regarding...let´s call it a mildly predominant file format, like xlsx ? 😉

Cheers,
Malte

exa-Aleksandr
Team Exasol
Team Exasol

Hi @mwellbro 

I've just tried 6.2.8 and got the same

SQL Error [22002]: VM error: ImportError: Install xlrd >= 0.9.0 for Excel support (Session: 1697031701562344619)

Could you please check how it works for you?

mwellbro
Xpert

Hi @exa-Aleksandr , on my end it works as long as I use a CREATE PYTHON instead of CREATE PYTHON3 ( see post with timestamp 13-04-2021 10:14 PM ).

mwellbro
Xpert

Hi @andreeroos ,

strange, was using EXA V.6.2.8 recently with the "pandas included therein" and it worked without a hinge....could you share the code you used ?

Cheers,
Malte

andreeroos
SQL-Fighter

Hi.

Thanks for the quick reply.
I'm using the code:

import pandas as pd
df=pd.read_excel("ftp://user:password@192.168.0.54:21/MightyExcelFile.xlsx", index_col=None, header=None)

 The error:

ModuleNotFoundError: No module named 'xlrd'

During handling of the above exception, another exception occurred:

Bolag_XLS:6 run
  File "/usr/lib/python3/dist-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pandas/io/excel.py", line 230, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/lib/python3/dist-packages/pandas/io/excel.py", line 263, in __init__
    raise ImportError(err_msg)
ImportError: Install xlrd >= 0.9.0 for Excel support

 

I don't have any problems accessing csv files with pd.read_csv(.....)

mwellbro
Xpert

could you check and post the result of:

 

CREATE or replace PYTHON SCALAR SCRIPT "PANDAS_VERSION" () EMITS ( "C3" VARCHAR(500) UTF8) AS
import pandas as pd
def run(c):
   c.emit( str( pd.__version__ ) )
/


select "PANDAS_VERSION"() ;

 Running 0.22.0 on my end - what was your EXA-Setup again ( Cluster or Solo ?  Version ? ) ?

andreeroos
SQL-Fighter

I'am using community edition, so solo then. 🙂

Pandas edition is 0.22.0.