This solution focuses on loading multiple small files in parallel from an FTP server.
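A minimal sketch of such an import, assuming a hypothetical FTP host, credentials, table, and file names; listing several FILE clauses in one IMPORT statement lets Exasol fetch and load the files in parallel:

import into my_schema.my_table
from csv at 'ftp://ftp.example.com/' user 'ftpuser' identified by 'ftppass'
file 'part_01.csv'
file 'part_02.csv'
file 'part_03.csv'
column separator = ',';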
Background

Geospatial data can be stored and analyzed in the Exasol database using the GEOMETRY datatype. In this solution, we show some examples of how to import geospatial data from a CSV file and from a GeoJSON file and how to use SQL functions to perform analytics and geo-joins.

Spatial Reference Systems

In the system table SYS.EXA_SPATIAL_REF_SYS, you will find more than 3,000 different spatial reference systems that can be used for the GEOMETRY datatype. These reference systems all define points on earth, but they differ in strengths, accuracy, and properties; for example, SRID 31466 can only reference locations within Germany. We will use the two SRIDs 4326 and 2163 in our examples: 4326 uses degrees as its unit, 2163 uses meters. Conversions between different coordinate systems are often necessary; the ST_TRANSFORM function can be used for this.

Prerequisites

Working with Spatial Data in Exasol

The following example shows how to create a table with a geometry column and how to insert and query data. We use GEOMETRY(4326) because the provided coordinates are in degrees. Later, in the SELECT queries, we transform the geodata into SRID 2163 to get the distance between the two cities in meters:

CREATE TABLE cities(name VARCHAR(200), geo GEOMETRY(4326));
INSERT INTO cities VALUES ('Berlin', 'POINT (13.36963 52.52493)');
INSERT INTO cities VALUES ('London', 'POINT (-0.1233 51.5309)');

-- this shows the distance in degrees:
SELECT a.name, b.name, st_distance(a.geo, b.geo) FROM cities a, cities b;

-- this shows the distance in meters:
SELECT a.name, b.name, st_distance(ST_Transform(a.geo, 2163), ST_Transform(b.geo, 2163)) FROM cities a, cities b;

GEOMETRY columns can be filled with strings in the well-known text representation (WKT), e.g. 'POINT (13.36963 52.52493)'.

How to process geodata

Step 1: Importing geodata

Importing geodata from CSV files

Geodata is often present in CSV files, or in columns of tables imported from other database systems, in the form of latitude and longitude values. On https://openflights.org/data.html#airport, you can download a CSV file containing international airport data. We use the extended version of this file, which contains more than 12,000 international airports and train stations. The first two fields of the file hold the id and the name of the airport; columns 7 and 8 hold its latitude and longitude coordinates:

CREATE OR REPLACE TABLE airports(
    airport_id INT,
    name VARCHAR(500),
    latitude DECIMAL(9,6),
    longitude DECIMAL(9,6)
);
IMPORT INTO airports FROM LOCAL CSV FILE 'D:\airports-extended.dat' (1, 2, 7, 8);
ALTER TABLE airports ADD COLUMN geo GEOMETRY(4326);
UPDATE airports SET geo = 'POINT (' || longitude || ' ' || latitude || ')';
SELECT * FROM airports;

We first used DECIMAL(9,6) columns to store the latitude and longitude values and then added a GEOMETRY column to store a geodata point for each airport. As in the example above, we can now calculate the distance between two airports or train stations in our table using the ST_DISTANCE and ST_TRANSFORM functions:

select st_distance(ST_Transform(a.geo, 2163), ST_Transform(b.geo, 2163))
from airports a, airports b
where a.name = 'Berlin-Tegel Airport' and b.name = 'Berlin Hauptbahnhof';
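As a further example, the same functions can be combined with ordering to find the airports closest to a given point. This is a sketch: the CAST from a WKT literal to GEOMETRY(4326) is an assumption about the string-to-geometry conversion, and the point is central Berlin from the cities example above:

SELECT name,
       st_distance(ST_Transform(geo, 2163),
                   ST_Transform(CAST('POINT (13.36963 52.52493)' AS GEOMETRY(4326)), 2163)) AS distance_m
FROM airports
ORDER BY distance_m
LIMIT 5;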
Importing data from GeoJSON

GeoJSON is an often-used format for storing and exchanging geodata. On https://geojson-maps.ash.ms, you can download countries as geodata. When you generate a custom JSON file on this site, the file consists of one JSON object, a so-called FeatureCollection, which contains multiple polygons and multi-polygons, one for each country. We import the data in three steps:

1. Load the full GeoJSON string into a VARCHAR(2000000) column in a table (note that this does not work if the GeoJSON is larger than two million characters).
2. Parse the GeoJSON using the json_table UDF script (see querying-and-converting-json-data-with-the-json-table-udf); this UDF script emits a table with one row for each geo-object.
3. Convert each GeoJSON object into a GEOMETRY value using the ST_GeomFromGeoJSON UDF script (attached to this solution).

After these steps, we have all countries from the GeoJSON file in a GEOMETRY column. An alternative approach is to develop a UDF script that loads the GeoJSON from a server (e.g., using the Python packages requests or paramiko) and parses and iterates over the GeoJSON objects within the script. For each object, a WKT string is emitted, which can later be converted into a GEOMETRY value. That way, there is no limit on the number of characters. For our approach, the whole GeoJSON must not be larger than 2,000,000 characters:

create or replace table geo_import(v varchar(2000000));

-- dummy separators / delimiters to import a whole line as one column value
import into geo_import from local csv file 'D:\custom.geo.json'
column separator = '0x01' column delimiter = '0x02';

-- json_table (can be found at https://community.exasol.com/t5/database-features/querying-and-converting-json-data-with-the-json-table-udf/ta-p/1800)
-- emits a row for each country with the two columns name and geojson
create or replace view geojson as select json_table(v,
    '$.features[*].properties.name',
    '$.features[*].geometry'
) emits (name varchar(2000000), geojson varchar(2000000)) from geo_import;

-- ST_GeomFromGeoJSON is attached to https://community.exasol.com/t5/database-features/how-to-import-the-whole-world-into-exasol-geospatial-data-geo/ta-p/1669
create or replace table countries as
select name, cast(ST_GeomFromGeoJSON(geojson) as geometry(4326)) as geo from geojson;

select * from countries;

Step 2: Geo-Joins and Geo-Indexes

The following query performs a geo-join between the two tables countries and airports. It finds all airports that are located within the country Italy:

select a.* from countries c join airports a on st_contains(c.geo, a.geo) where c.name = 'Italy';

You can use profiling to see that a geo-index is created automatically and that this geo-index is used to execute the geo-join:

alter session set profile = 'ON';
alter session set query_cache = 'OFF';
select a.* from countries c join airports a on st_contains(c.geo, a.geo) where c.name = 'Italy';
flush statistics;
select * from exa_statistics.exa_user_profile_last_day
where session_id = current_session and command_name = 'SELECT'
preferring high stmt_id;

Like other indexes in Exasol, the geo-index is persisted so that it can be used by future queries, it is maintained automatically when the table data changes, and it is dropped automatically when it has not been used for five weeks or longer. Note that geo-indexes are a new feature as of Exasol 6.1.

Step 3: Visualizing Geometry Data

The SQL client DBeaver can show geospatial data on a map. On a result tab, right-click a GEOMETRY column and choose View/Format > View as Geometry. Then click the Panels button on the right to show the Value panel with the map.
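Beyond visual inspection, the imported country polygons can also be analyzed directly in SQL. The following sketch uses ST_AREA on geometries transformed to the metric SRID 2163 (so the result is in square meters; how accurate the numbers are depends on how well that projection fits each country) to list the largest countries by area:

SELECT name, ST_AREA(ST_Transform(geo, 2163)) / 1000000 AS area_km2
FROM countries
ORDER BY area_km2 DESC
LIMIT 10;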
Background

Exasol's UDF scripts make it possible to run machine-learning algorithms directly in the database. Training a random-forest model and using this model to make predictions on a dataset in the Exasol database can be done in three steps:

1. During a learning phase, the model is created from a training dataset (e.g., from a table within the Exasol database).
2. The model is put into BucketFS to make it accessible to UDF scripts.
3. A UDF script is used to make the predictions.

Prerequisites

As an example, we create two tables, RF.TRAIN and RF.TEST, in the Exasol database and import the data from the files train.csv and test.csv. Both contain information about thousands of wines, e.g. the amount of sugar, the pH value, the acidity, and so on. The training dataset additionally contains the taste of each wine (bad / normal / good). The goal is to train a model and classify the unknown wines in the test dataset. The SQL commands to create the tables and the two CSV files are attached to this solution. Exasol 6.2 or later is needed for the Python 3 example.
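The attached SQL commands are not reproduced here. As orientation, here is a minimal sketch of what creating and filling the training table could look like; the column names follow the prediction scripts in Step 3, while the datatypes and the local file path are assumptions:

CREATE SCHEMA RF;
CREATE TABLE RF.TRAIN(
    fixed_acidity DOUBLE,
    volatile_acidity DOUBLE,
    citric_acid DOUBLE,
    residual_sugar DOUBLE,
    chlorides DOUBLE,
    free_sulfur_dioxide DOUBLE,
    total_sulfur_dioxide DOUBLE,
    density DOUBLE,
    ph DOUBLE,
    sulphates DOUBLE,
    alcohol DOUBLE,
    taste VARCHAR(6)
);
IMPORT INTO RF.TRAIN FROM LOCAL CSV FILE 'D:\train.csv';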
How to do Machine Learning in R and Python?

Step 1: Learning Phase

1a. Learning Phase in R

In R, a random-forest model can be created by first reading the data from the Exasol database using the Exasol R SDK, which can be installed directly from its GitHub repository:

install.packages('randomForest')
install.packages('devtools')
devtools::install_github("EXASOL/r-exasol")

library(exasol)
library(randomForest)

con <- dbConnect("exa", exahost = "localhost:8563", uid = "sys", pwd = "exasol")
train <- dbGetQuery(con, "SELECT * FROM RF.TRAIN")

The dataframe train contains the data from the Exasol table RF.TRAIN. head(train) shows the first rows of this table; barplot(table(train$TASTE)) shows the taste distribution as a bar chart, e.g. in RStudio. Next, the random-forest model can be created and stored into a file:

rf <- randomForest(as.factor(TASTE) ~ ., data = train, ntree = 1000)
saveRDS(rf, "D:/rf.rds", version = 2)

1b. Learning Phase in Python 3

Since Exasol version 6.2, the language Python 3 is included with many built-in packages such as pandas and scikit-learn. In a local Python 3 environment, the random-forest model can be built analogously to the R example above. First, the training data is loaded from the Exasol database into a dataframe using the Python package pyexasol:

import pyexasol
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier

C = pyexasol.connect(dsn='localhost:8563', user='sys', password='exasol')
train = C.export_to_pandas("SELECT * FROM RF.TRAIN")

The pandas dataframe train can be inspected with train.head(5) or train.describe(). Next, the labels (bad/normal/good) from the taste column need to be converted into integers (-1/0/1):

train.TASTE[train.TASTE == 'bad'] = -1
train.TASTE[train.TASTE == 'normal'] = 0
train.TASTE[train.TASTE == 'good'] = 1
labels = np.array(train['TASTE'])
labels = labels.astype('int')
features = train.drop('TASTE', axis=1)

The array labels just contains the numeric labels for all wines, e.g. [1, 0, 0, -1, 1, 1, ...]. features looks like the original train dataframe but without the taste column. Next, the random-forest model can be trained and written into a file:

clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(features, labels)
pickle.dump(clf, open('D:/clf.dat', 'wb'))

Step 2: Putting the model into BucketFS

The easiest way to work with Exasol's BucketFS is the BucketFS Explorer. After logging in with the EXAoperation credentials, a BucketFS service and a bucket can be created. Into this new bucket, we move the model file (rf.rds for R, clf.dat for Python 3).

Step 3: Creating a UDF script for prediction

3a. Creating an R UDF script for prediction

When the bucket is not publicly accessible but read-protected, a connection object has to be created in the Exasol database to provide the read password to UDF scripts:

CREATE CONNECTION my_bucket_access TO 'bucketfs:bucketfs1/udf' IDENTIFIED BY '12345';

In an R UDF script, the random-forest model is read from BucketFS. After that, the input set of the UDF script is converted into a dataframe that is accepted by the predict function of the randomForest package. We load batches of 1000 rows each to make 1000 predictions per iteration:

--/
create or replace r set script rf.predict(...) emits (wine_id INT, taste VARCHAR(6)) as
library(randomForest)

run <- function(ctx) {
    rf <- readRDS("/buckets/bucketfs1/udf/rf.rds")
    ## load the first batch of 1000 rows of the input set
    ctx$next_row(1000)
    repeat {
        wine_ids <- ctx[[1]]()
        ## create a dataframe from all input columns except the first one (wine_id)
        numCols <- exa$meta$input_column_count
        df <- data.frame(ctx[[2]]())
        for (i in 3:numCols) {
            df <- cbind(df, ctx[[i]]())
        }
        colnames(df) <- c("FIXED_ACIDITY", "VOLATILE_ACIDITY", "CITRIC_ACID",
                          "RESIDUAL_SUGAR", "CHLORIDES", "FREE_SULFUR_DIOXIDE",
                          "TOTAL_SULFUR_DIOXIDE", "DENSITY", "PH", "SULPHATES", "ALCOHOL")
        prediction <- predict(rf, newdata = df)
        ctx$emit(wine_ids, as.character(prediction))
        ## load the next batch of 1000 rows of the input set
        if (!(ctx$next_row(1000))) {
            break
        }
    }
}
/

select rf.predict(wine_id, fixed_acidity, volatile_acidity, citric_acid, residual_sugar,
    chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol)
from RF.test group by iproc();

Due to the GROUP BY iproc(), the UDF script is called once per Exasol node with the data that is stored locally on that node, enabling fast, parallel predictions.

3b. Creating a Python UDF script for prediction

Using the scripting language PYTHON3, built into Exasol since version 6.2, the input set can be accessed batch-wise as a pandas dataframe with ctx.get_dataframe:

--/
CREATE OR REPLACE PYTHON3 SET SCRIPT test.predict_wine_py(...) emits (wine_id INT, taste VARCHAR(6)) as
import pickle
import pandas as pd

# load the model once per UDF instance from BucketFS
clf = pickle.load(open('/buckets/bucketfs1/udf/clf.dat', 'rb'))
clf.n_jobs = 1

def run(ctx):
    BATCH_ROWS = 1000
    while True:
        df = ctx.get_dataframe(num_rows=BATCH_ROWS)
        if df is None:
            break
        # the first input column ('0') is the wine_id, the rest are features
        wine_ids = df['0']
        features = df.drop('0', axis=1)
        res_df = pd.DataFrame(columns=['WINE_ID', 'TASTE'])
        res_df['WINE_ID'] = wine_ids
        res_df['TASTE'] = clf.predict(features)
        # map the numeric predictions back to their labels
        res_df.TASTE[res_df.TASTE == -1] = 'bad'
        res_df.TASTE[res_df.TASTE == 0] = 'normal'
        res_df.TASTE[res_df.TASTE == 1] = 'good'
        ctx.emit(res_df)
/

select test.predict_wine_py(wine_id, fixed_acidity, volatile_acidity, citric_acid, residual_sugar,
    chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol)
from RF.test group by iproc();
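To keep the results for further analysis, the emitted rows can be materialized into a table. A sketch based on the Python script above; the table name rf.predictions is hypothetical:

create table rf.predictions as
select test.predict_wine_py(wine_id, fixed_acidity, volatile_acidity, citric_acid, residual_sugar,
    chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, pH, sulphates, alcohol)
from RF.test group by iproc();

-- distribution of the predicted labels:
select taste, count(*) from rf.predictions group by taste;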
This article covers analytic (window) functions and links to additional analytics topics.
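A minimal sketch of two common analytic functions, assuming a hypothetical sales table with the columns region, order_date, and amount:

select region, order_date, amount,
       sum(amount) over (partition by region order by order_date) as running_total,
       row_number() over (partition by region order by amount desc) as amount_rank
from sales;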