Fully managed cloud. 30-day FREE full featured trial. Start Now
Showing results for 
Search instead for 
Did you mean: 
Community Manager
Community Manager


Exasol's UDF scripts make it possible to run machine-learning algorithms directly in the database. The training of a random-forest model and using this model to make predictions on a dataset in the Exasol database can be done in three steps:

  1. During a learning phase, the model is created based on a training dataset (e.g. from a table within the Exasol database)
  2. The model is put into BucketFS to make it accessible within UDF scripts
  3. A UDF script is used to make predictions.


As an example, we create two tables RF.TRAIN and RF.TEST in the Exasol database and import the data from the files train.csv and test.csv. They both contain information about thousands of wines, e.g. the amount of sugar, the PH value, the acidity, and so on. The training dataset contains the taste of the wine (bad / normal / good). The goal is to train a model and classify the unknown wines in the test dataset. The SQL commands to create the table and the two CSV files are attached to this solution.
We need an Exasol-Version 6.2 for the Python3 example.

How to perform Machine Learning in R and Python

Step 1: Learning Phase

1a. Learning Phase in R

 In R, a random-forest model can be created by firstly reading the data from the Exasol database using the Exasol R SDK. It can be directly installed from the GitHub repository:

install.packages('devtools') devtools::install_github("EXASOL/r-exasol") library(exasol) library(randomForest) con <- dbConnect("exa", exahost = "localhost:8563", uid = "sys", pwd = "exasol") train <- dbGetQuery(con, "SELECT * FROM RF.TRAIN")

The dataframe train contains the data from the Exasol table RF.TRAIN.  head(train) shows the first rows in this table, barplot(table(train$TASTE)) shows the taste distribution as a bar chart, e.g. in R Studio. Next, the random forest model can be created and stored into a file:

rf <- randomForest(as.factor(TASTE) ~ ., data = train, ntree=1000)
saveRDS(rf, "D:/rf.rds", version=2)

1b. Learning Phase in Python 3

Since Exasol Version 6.2, the language Python 3 is included with many built-in packages like pandas, scikit-learn and many more. On a local Python 3 environment, the random-forest model can be built analogously to the example above in R. First, the training data is loaded from the Exasol database as a dataframe using the Python package pyexasol:

import pyexasol
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier
C = pyexasol.connect(dsn='localhost:8563', user='sys', password='exasol')
train = C.export_to_pandas("SELECT * FROM RF.TRAIN")

The pandas dataframe train can be inspected with train.head(5) or train.describe(). Next, the labels (bad/normal/good) from the taste column need to be converted into integers (-1/0/1):

train.TASTE[train.TASTE == 'bad'] = -1
train.TASTE[train.TASTE == 'normal'] = 0
train.TASTE[train.TASTE == 'good'] = 1
labels = np.array(train['TASTE'])
labels = labels.astype('int')
features = train.drop('TASTE', axis=1)

The array labels just contains the numeric labels for a all wines, e.g. [1, 0, 0, -1, 1, 1, ...]. features looks like the original train dataframe but it does not contain the taste column. Next, the random-forest model can be trained and written into a file:

clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(features, labels)
pickle.dump(clf, open('D:/clf.dat', 'wb'))

Step 2:Putting the model into BucketFS

The easiest way to work with Exasol's BucketFS is using the BucketFS explorer. After logging in with the Exaoperation credentials, a BucketFS service and a bucket can be created. In this new bucket, we move the model file (rf.rds for R, clf.dat for Python 3).

Step 3: Creating a UDF script for prediction

3a. Creating a R UDF script for prediction

When the bucket is not publicly accessible but read-protected, a connection object has to be created in the Exasol database to provide the read password to UDF scripts:

CREATE CONNECTION my_bucket_access TO 'bucketfs:bucketfs1/udf' IDENTIFIED BY '12345';

In an R UDF script, the random-forest model is read from BucketFS. After that, the input set of the UDF script is converted into a dataframe which is accepted by the predict function of the randomforest package. We load batches of 1000 rows each to make 1000 predictions in each iteration.

create or replace r set script rf.predict(...) emits (wine_id INT, taste VARCHAR(6)) as

run <- function(ctx) {
rf <- readRDS("/buckets/bucketfs1/udf/rf.rds")

## load the first batch of 1000 rows in the input set
repeat {
  wine_ids <- ctx[[1]]()

  ## create a dataframe from all input columns (except the first one (wine_id))
  numCols <- exa$meta$input_column_count
  df <- data.frame(ctx[[2]]())
  for (i in 3:numCols) {
        df <- cbind(df,ctx[[i]]())


  prediction <- predict(rf, newdata=df)

  ctx$emit(wine_ids, as.character(prediction))  

  ## load the next batch of 1000 rows in the input set
  if (!(ctx$next_row(1000))){break}
select rf.predict(wine_id, fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol) from RF.test group by iproc();

Due to the GROUP BY iproc(), the UDF script is called once per Exasol node with the data that is locally stored on that node to enable parallel and fast predictions.

3b. Creating a Python UDF script for prediction

Using the scripting language PYTHON3 - which is built-in in Exasol since version 6.2 -, the input set can be accessed batch-wise as a pandas dataframe with ctx.get_dataframe.

CREATE OR REPLACE PYTHON3 SET SCRIPT test.predict_wine_py(...) emits (wine_id INT, taste VARCHAR(6)) as
import pickle
import pandas as pdclf = pickle.load(open('/buckets/bucketfs1/udf/clf.dat', 'rb'))
clf.n_jobs = 1def run(ctx):
	BATCH_ROWS = 1000
	while True:
		df = ctx.get_dataframe(num_rows=BATCH_ROWS)
		if df is None:
	wine_ids = df['0']
	features = df.drop('0', axis=1)
	res_df = pd.DataFrame(columns=['WINE_ID', 'TASTE'])
	res_df['WINE_ID'] = wine_ids
	res_df['TASTE'] = clf.predict(features)
	res_df.TASTE[res_df.TASTE == -1] = 'bad'
	res_df.TASTE[res_df.TASTE == 0] = 'normal'
	res_df.TASTE[res_df.TASTE == 1] = 'good'
select rf.predict_wine_py(wine_id, fixed_acidity, volatile_acidity, citric_acid, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol) from RF.test group by iproc();

Additional References



is there any way to train the model in exasol and not in the IDEs (Rstudio, Spyder, etc.). In this example and in the video from the course Advanced Analytics, the model is first created and trained outside exasol, and later put in BucketFS. 

 - Is there any way for training the model inside exasol and use the power of the cluster?


The following question is:

-Is it possible to train a model using parallelism? That means, train you model using more /all cores in the cluster.

Something like SELECT trainmymodel() FROM schema.TrainData group by iproc() ?


Thanks in advance



Two things to note here regarding the python version:

  1. Make sure that you have the same version of sklearn installed in the environment you are using for training the model as is being used in Exasol. Otherwise, you will most likely run into errors like ModuleNotFoundError: No module named 'sklearn.ensemble._forest' when executing the query on your Exasol database.
    Here's a little UDF that reads out the version of sklearn shipped with Exasol
    create or replace python3 scalar script rf.scikit_version() emits (version varchar(1024)) as
    import sklearn
    def run(ctx):
  2. The source code for the python3 UDF has some missing line breaks. The correct version would be:
    import pickle
    import pandas as pd
    clf = pickle.load(open('/buckets/bucketfs1/udf/clf.dat', 'rb'))
    clf.n_jobs = 1
    def run(ctx):

Hope this helps!