Data Science
Harness the power of Exasol to perform data science with these examples, guides, and more!
Showing results for 
Search instead for 
Did you mean: 
Background This is a basic tutorial on User Defined Functions. For details regarding the programming languages see our documentation portal More technical insight is also given in the white paper   "Big Data Science - the future of analytics". We have to distinguish between UDFs and control scripts in the following ways: Control Scripts (Lua) User Defined Functions (UDFs) for control jobs for operations on data sets written in Lua written in SQL, Lua, Java, Python, R, or any other defined language executed via SQL statement: 'EXECUTE SCRIPT ...' executed within SQL 'SELECT' statement, e.g. 'SELECT udf(col1, col2) FROM tbl1;' supports executing of SQL statements & SQL error handling supports loading of Java / Python / R packages runs iteratively & single threaded on one node runs massively parallel   How to create a UDF Creating a UDF is as simple as creating a script in the database. Step 1 You should decide what language you want to create the script in and what kind of script it is (Scalar/Set).  Input Values SCALAR The keyword SCALAR specifies that the script processes single input rows. It's code is therefore called once per input row. SET If you define the option SET, then the processing refers to a set of input values. Within the code, you can iterate through those values (for more information, see   Examples). Output Values RETURNS In this case the script returns a single value. EMITS If the keyword EMITS was defined, the script can create (emit) multiple result rows (tuples). In case of input type SET, the EMITS result can only be used alone, thus not be combined with other expressions. However, you can nest it through a subselect to do further processing of those intermediate results. You can use the script signature generator to help create the syntax for the header Step 2 Write your script in the database - we'll do a simple one for this example: --/ create or replace R scalar script Rsquare(x int) returns int as run <- function(ctx) { ctx$x * ctx$x } /   Step 3 Test out your script by running it like below: select Rsquare(3); You should get '9' as the result.  Additional Notes Parallelism Tables are stored in EXASOL row-wise distributed across a cluster's nodes. A UDF environment runs "sandboxed" on each node, computing a local result set integrated into EXASOL's SQL pipeline (see white paper:   "A peek under the hood"). The local result sets are sent to the node handling the client connection and merged to the global result set to be sent to the client.   Scalar UDFs A scalar UDF computes on one input row per UDF instance and returns one output row. It is automatically executed in massively parallel in the Exasol cluster. The example below demonstrates this: reate or replace table t (x int, y int); insert into t values(1,1),(1,2),(4,2); select * from t; --/ CREATE OR REPLACE LUA SCALAR SCRIPT my_maximum (a DOUBLE, b DOUBLE) RETURNS DOUBLE AS function run(ctx) if ctx.a == nul or ctx.b == null then return null end if ctx.a > ctx.b then return ctx.a else return ctx.b end end / SELECT x, y, my_maximum(x, y) from t; Aggregate UDFs An aggregate UDF consumes multiple input tuples, i.e. the whole columns of the table or the group if a GROUP BY clause is defined in the SELECT statement. One UDF instance returns one single value. The number of UDF instances to be started is handled by the GROUP BY clause. Without a GROUP BY clause one instance is started which runs on one node, not in parallel. A GROUP BY clause that results in multiple groups causes multiple UDF instances, which are then computed massively parallel in Exasol. Please see the following example: --/ CREATE OR REPLACE LUA SET SCRIPT my_average (a DOUBLE) RETURNS DOUBLE AS function run(ctx) if ctx.size() == 0 then return null else local sum = 0 repeat if ctx.a ~= null then sum = sum + ctx.a end until not return sum/ctx.size() end end / SELECT my_average(x), my_average(y) from t; Analytical UDFs These consume as input values multiple data tuples (rows), and they also return a result set containing multiple rows. Parallelism is handled by the GROUP BY (see above). In the following example, a running sum is computed. Notice that the simple algorithm iterates through the entire input set within one instance (repeat loop). --/ CREATE OR REPLACE LUA SET SCRIPT my_sum ( a DOUBLE) EMITS (count DOUBLE, val DOUBLE, sum DOUBLE) AS function run(ctx) local sum = 0 local count = 0 repeat if ctx.a ~= null then sum = sum + ctx.a count = count +1 ctx.emit(count, ctx.a, sum) end until not end / SELECT my_sum(x ORDER BY x) from t; You can find all of these examples in the attached file.  Additional References Debug UDF outputs  UDF Scripts  UDF Script Signature Generator  Expanding Script Languages Using BucketFS  BucketFS 
View full article
This article shows you how to implement your own analytic functions with UDFs
View full article
This article shows you how to use custom JAR files in your Java UDFs
View full article