Background Exasol R SDK offers functionality to interact with the Exasol database out of R programs. It extends the ORDBC and offers fast data transfer between Exasol and R ( exa.readData()   and   exa.writeData() ) and it makes it convenient to run parts of your R code in parallel in Exasol using R UDF functions behind the scenes ( exa.createScript() ) Example (Digit recognition) We will use a well-known machine learning algorithm for classifying handwritten digits. It is using a randomforest library, which is used to create decision trees for any kind of image. For a given training set and the actual result for each element, the library returns a decision tree that can be applied to other similar images, the test data. The data is stored pixel-wise (one value for each pixel). This How-To will focus on how to use the EXASolution R SDK and not focus on details of the algorithm or R. Further information on the algorithm and the data can be found   here. The example was chosen to show how easy it is to read or write data from or to EXASolution and how easy it is to deploy a given algorithm to an Exasol UDF. Prerequisites Exasol R SDK and randomForest needs to be installed on the used system. The attached files can be used to setup your Exasol database for the example: Run   01_DDL.sql      (Creates a schema, a table for train data set and a table for test data set) Place csv files ( train.csv     and   test.csv   ) at an appropriate place and change the path to those files in   02_IMPORT.sql   Run   02_IMPORT.sql     (Imports the train and test data set) How to use ExaSolution R SDK R implementation (reading the data sets from an EXASolution database) The attached file   03_RandomForest.r   shows how to use   exa.readData()   and   exa.writeData()   to interact with the database in R. As of Exasol 6.2.x, we needed to modify 1 line in 03_RandomForest.r in order to account for R version differences. Specifically, we are adding a parameter to the save command - version = 2. You are welcome to skip this modification and come back to revisit if you run into issues. Before: save(rf, file = "< path to rf.dat>") After: save(rf, version = 2, file = "< path to rf.dat>")  Additional installation notes for 03_RandomForest.r - installing needed packages. Continuing on, the general procedure is: Step 1. Connect to EXASolution and read the train and test data This requires an ODBC DSN (named "solo" in this example) Step 2. Read train and test data from the database and prepare them for further processing Test data was enriched by an image id Train data was enriched by a label showing the actual handwritten digit Step 3. Use randomForest() to create a decision tree and print prediction In this example, the decision forest is kept and tested on the test data Extract the predictions for test data and print them Step 4. Write decision tree to Exasol We will use a bucket called binary to store the decision forest Detailed information on EXABucket FS and buckets can be found in https://community.exasol.com/t5/environment-management/how-to-create-an-exabucketfs-service-and-bucket/ta-p/1515** This item does not apply to Exasol 6+ and is only left in for archival purposes.   Important Deprecation Notice:  In Version 5 it is not possible to store binary data in EXASOL database. One possibility is to serialize the decision forest and store it in a table. SInce VARCHAR type in EXASOL is limited to 2 million characters, the resulting string should not exceed 2 million characters.  curl -v -X PUT -T rf.dat http: //w:<write_pw>@<ip of one cluster node>:<port>/binary/rf.dat Step 5. Deploying a UDF The other file ( 05_PREDICTINDB.sql ) deploys the functionality to a UDF in EXASOL. We create a variadic UDF receiving an arbitrary amount of input parameters (all pixels) The output is the predicted digit. Load forest from the bucket Create a dataframe for all input columns (pixels) Use   predict()   and emit the predictions Step 6. Using UDF within EXASolution Now it is possible to use the deployed function within EXASolution like any other built-in function: select rf.predict(Model_str, PIXEL0,PIXEL1,PIXEL2,PIXEL3,PIXEL4,PIXEL5,PIXEL6,PIXEL7,PIXEL8,PIXEL9,PIXEL10,PIXEL11,PIXEL12,PIXEL13,PIXEL14,PIXEL15,PIXEL16,PIXEL17,PIXEL18,PIXEL19,PIXEL20,PIXEL21,PIXEL22,PIXEL23,PIXEL24,PIXEL25,PIXEL26,PIXEL27,PIXEL28,PIXEL29,PIXEL30,PIXEL31,PIXEL32,PIXEL33,PIXEL34,PIXEL35,PIXEL36,PIXEL37,PIXEL38,PIXEL39,PIXEL40,PIXEL41,PIXEL42,PIXEL43,PIXEL44,PIXEL45,PIXEL46,PIXEL47,PIXEL48,PIXEL49,PIXEL50,PIXEL51,PIXEL52,PIXEL53,PIXEL54,PIXEL55,PIXEL56,PIXEL57,PIXEL58,PIXEL59,PIXEL60,PIXEL61,PIXEL62,PIXEL63,PIXEL64,PIXEL65,PIXEL66,PIXEL67,PIXEL68,PIXEL69,PIXEL70,PIXEL71,PIXEL72,PIXEL73,PIXEL74,PIXEL75,PIXEL76,PIXEL77,PIXEL78,PIXEL79,PIXEL80,PIXEL81,PIXEL82,PIXEL83,PIXEL84,PIXEL85,PIXEL86,PIXEL87,PIXEL88,PIXEL89,PIXEL90,PIXEL91,PIXEL92,PIXEL93,PIXEL94,PIXEL95,PIXEL96,PIXEL97,PIXEL98,PIXEL99,PIXEL100,PIXEL101,PIXEL102,PIXEL103,PIXEL104,PIXEL105,PIXEL106,PIXEL107,PIXEL108,PIXEL109,PIXEL110,PIXEL111,PIXEL112,PIXEL113,PIXEL114,PIXEL115,PIXEL116,PIXEL117,PIXEL118,PIXEL119,PIXEL120,PIXEL121,PIXEL122,PIXEL123,PIXEL124,PIXEL125,PIXEL126,PIXEL127,PIXEL128,PIXEL129,PIXEL130,PIXEL131,PIXEL132,PIXEL133,PIXEL134,PIXEL135,PIXEL136,PIXEL137,PIXEL138,PIXEL139,PIXEL140,PIXEL141,PIXEL142,PIXEL143,PIXEL144,PIXEL145,PIXEL146,PIXEL147,PIXEL148,PIXEL149,PIXEL150,PIXEL151,PIXEL152,PIXEL153,PIXEL154,PIXEL155,PIXEL156,PIXEL157,PIXEL158,PIXEL159,PIXEL160,PIXEL161,PIXEL162,PIXEL163,PIXEL164,PIXEL165,PIXEL166,PIXEL167,PIXEL168,PIXEL169,PIXEL170,PIXEL171,PIXEL172,PIXEL173,PIXEL174,PIXEL175,PIXEL176,PIXEL177,PIXEL178,PIXEL179,PIXEL180,PIXEL181,PIXEL182,PIXEL183,PIXEL184,PIXEL185,PIXEL186,PIXEL187,PIXEL188,PIXEL189,PIXEL190,PIXEL191,PIXEL192,PIXEL193,PIXEL194,PIXEL195,PIXEL196,PIXEL197,PIXEL198,PIXEL199,PIXEL200,PIXEL201,PIXEL202,PIXEL203,PIXEL204,PIXEL205,PIXEL206,PIXEL207,PIXEL208,PIXEL209,PIXEL210,PIXEL211,PIXEL212,PIXEL213,PIXEL214,PIXEL215,PIXEL216,PIXEL217,PIXEL218,PIXEL219,PIXEL220,PIXEL221,PIXEL222,PIXEL223,PIXEL224,PIXEL225,PIXEL226,PIXEL227,PIXEL228,PIXEL229,PIXEL230,PIXEL231,PIXEL232,PIXEL233,PIXEL234,PIXEL235,PIXEL236,PIXEL237,PIXEL238,PIXEL239,PIXEL240,PIXEL241,PIXEL242,PIXEL243,PIXEL244,PIXEL245,PIXEL246,PIXEL247,PIXEL248,PIXEL249,PIXEL250,PIXEL251,PIXEL252,PIXEL253,PIXEL254,PIXEL255,PIXEL256,PIXEL257,PIXEL258,PIXEL259,PIXEL260,PIXEL261,PIXEL262,PIXEL263,PIXEL264,PIXEL265,PIXEL266,PIXEL267,PIXEL268,PIXEL269,PIXEL270,PIXEL271,PIXEL272,PIXEL273,PIXEL274,PIXEL275,PIXEL276,PIXEL277,PIXEL278,PIXEL279,PIXEL280,PIXEL281,PIXEL282,PIXEL283,PIXEL284,PIXEL285,PIXEL286,PIXEL287,PIXEL288,PIXEL289,PIXEL290,PIXEL291,PIXEL292,PIXEL293,PIXEL294,PIXEL295,PIXEL296,PIXEL297,PIXEL298,PIXEL299,PIXEL300,PIXEL301,PIXEL302,PIXEL303,PIXEL304,PIXEL305,PIXEL306,PIXEL307,PIXEL308,PIXEL309,PIXEL310,PIXEL311,PIXEL312,PIXEL313,PIXEL314,PIXEL315,PIXEL316,PIXEL317,PIXEL318,PIXEL319,PIXEL320,PIXEL321,PIXEL322,PIXEL323,PIXEL324,PIXEL325,PIXEL326,PIXEL327,PIXEL328,PIXEL329,PIXEL330,PIXEL331,PIXEL332,PIXEL333,PIXEL334,PIXEL335,PIXEL336,PIXEL337,PIXEL338,PIXEL339,PIXEL340,PIXEL341,PIXEL342,PIXEL343,PIXEL344,PIXEL345,PIXEL346,PIXEL347,PIXEL348,PIXEL349,PIXEL350,PIXEL351,PIXEL352,PIXEL353,PIXEL354,PIXEL355,PIXEL356,PIXEL357,PIXEL358,PIXEL359,PIXEL360,PIXEL361,PIXEL362,PIXEL363,PIXEL364,PIXEL365,PIXEL366,PIXEL367,PIXEL368,PIXEL369,PIXEL370,PIXEL371,PIXEL372,PIXEL373,PIXEL374,PIXEL375,PIXEL376,PIXEL377,PIXEL378,PIXEL379,PIXEL380,PIXEL381,PIXEL382,PIXEL383,PIXEL384,PIXEL385,PIXEL386,PIXEL387,PIXEL388,PIXEL389,PIXEL390,PIXEL391,PIXEL392,PIXEL393,PIXEL394,PIXEL395,PIXEL396,PIXEL397,PIXEL398,PIXEL399,PIXEL400,PIXEL401,PIXEL402,PIXEL403,PIXEL404,PIXEL405,PIXEL406,PIXEL407,PIXEL408,PIXEL409,PIXEL410,PIXEL411,PIXEL412,PIXEL413,PIXEL414,PIXEL415,PIXEL416,PIXEL417,PIXEL418,PIXEL419,PIXEL420,PIXEL421,PIXEL422,PIXEL423,PIXEL424,PIXEL425,PIXEL426,PIXEL427,PIXEL428,PIXEL429,PIXEL430,PIXEL431,PIXEL432,PIXEL433,PIXEL434,PIXEL435,PIXEL436,PIXEL437,PIXEL438,PIXEL439,PIXEL440,PIXEL441,PIXEL442,PIXEL443,PIXEL444,PIXEL445,PIXEL446,PIXEL447,PIXEL448,PIXEL449,PIXEL450,PIXEL451,PIXEL452,PIXEL453,PIXEL454,PIXEL455,PIXEL456,PIXEL457,PIXEL458,PIXEL459,PIXEL460,PIXEL461,PIXEL462,PIXEL463,PIXEL464,PIXEL465,PIXEL466,PIXEL467,PIXEL468,PIXEL469,PIXEL470,PIXEL471,PIXEL472,PIXEL473,PIXEL474,PIXEL475,PIXEL476,PIXEL477,PIXEL478,PIXEL479,PIXEL480,PIXEL481,PIXEL482,PIXEL483,PIXEL484,PIXEL485,PIXEL486,PIXEL487,PIXEL488,PIXEL489,PIXEL490,PIXEL491,PIXEL492,PIXEL493,PIXEL494,PIXEL495,PIXEL496,PIXEL497,PIXEL498,PIXEL499,PIXEL500,PIXEL501,PIXEL502,PIXEL503,PIXEL504,PIXEL505,PIXEL506,PIXEL507,PIXEL508,PIXEL509,PIXEL510,PIXEL511,PIXEL512,PIXEL513,PIXEL514,PIXEL515,PIXEL516,PIXEL517,PIXEL518,PIXEL519,PIXEL520,PIXEL521,PIXEL522,PIXEL523,PIXEL524,PIXEL525,PIXEL526,PIXEL527,PIXEL528,PIXEL529,PIXEL530,PIXEL531,PIXEL532,PIXEL533,PIXEL534,PIXEL535,PIXEL536,PIXEL537,PIXEL538,PIXEL539,PIXEL540,PIXEL541,PIXEL542,PIXEL543,PIXEL544,PIXEL545,PIXEL546,PIXEL547,PIXEL548,PIXEL549,PIXEL550,PIXEL551,PIXEL552,PIXEL553,PIXEL554,PIXEL555,PIXEL556,PIXEL557,PIXEL558,PIXEL559,PIXEL560,PIXEL561,PIXEL562,PIXEL563,PIXEL564,PIXEL565,PIXEL566,PIXEL567,PIXEL568,PIXEL569,PIXEL570,PIXEL571,PIXEL572,PIXEL573,PIXEL574,PIXEL575,PIXEL576,PIXEL577,PIXEL578,PIXEL579,PIXEL580,PIXEL581,PIXEL582,PIXEL583,PIXEL584,PIXEL585,PIXEL586,PIXEL587,PIXEL588,PIXEL589,PIXEL590,PIXEL591,PIXEL592,PIXEL593,PIXEL594,PIXEL595,PIXEL596,PIXEL597,PIXEL598,PIXEL599,PIXEL600,PIXEL601,PIXEL602,PIXEL603,PIXEL604,PIXEL605,PIXEL606,PIXEL607,PIXEL608,PIXEL609,PIXEL610,PIXEL611,PIXEL612,PIXEL613,PIXEL614,PIXEL615,PIXEL616,PIXEL617,PIXEL618,PIXEL619,PIXEL620,PIXEL621,PIXEL622,PIXEL623,PIXEL624,PIXEL625,PIXEL626,PIXEL627,PIXEL628,PIXEL629,PIXEL630,PIXEL631,PIXEL632,PIXEL633,PIXEL634,PIXEL635,PIXEL636,PIXEL637,PIXEL638,PIXEL639,PIXEL640,PIXEL641,PIXEL642,PIXEL643,PIXEL644,PIXEL645,PIXEL646,PIXEL647,PIXEL648,PIXEL649,PIXEL650,PIXEL651,PIXEL652,PIXEL653,PIXEL654,PIXEL655,PIXEL656,PIXEL657,PIXEL658,PIXEL659,PIXEL660,PIXEL661,PIXEL662,PIXEL663,PIXEL664,PIXEL665,PIXEL666,PIXEL667,PIXEL668,PIXEL669,PIXEL670,PIXEL671,PIXEL672,PIXEL673,PIXEL674,PIXEL675,PIXEL676,PIXEL677,PIXEL678,PIXEL679,PIXEL680,PIXEL681,PIXEL682,PIXEL683,PIXEL684,PIXEL685,PIXEL686,PIXEL687,PIXEL688,PIXEL689,PIXEL690,PIXEL691,PIXEL692,PIXEL693,PIXEL694,PIXEL695,PIXEL696,PIXEL697,PIXEL698,PIXEL699,PIXEL700,PIXEL701,PIXEL702,PIXEL703,PIXEL704,PIXEL705,PIXEL706,PIXEL707,PIXEL708,PIXEL709,PIXEL710,PIXEL711,PIXEL712,PIXEL713,PIXEL714,PIXEL715,PIXEL716,PIXEL717,PIXEL718,PIXEL719,PIXEL720,PIXEL721,PIXEL722,PIXEL723,PIXEL724,PIXEL725,PIXEL726,PIXEL727,PIXEL728,PIXEL729,PIXEL730,PIXEL731,PIXEL732,PIXEL733,PIXEL734,PIXEL735,PIXEL736,PIXEL737,PIXEL738,PIXEL739,PIXEL740,PIXEL741,PIXEL742,PIXEL743,PIXEL744,PIXEL745,PIXEL746,PIXEL747,PIXEL748,PIXEL749,PIXEL750,PIXEL751,PIXEL752,PIXEL753,PIXEL754,PIXEL755,PIXEL756,PIXEL757,PIXEL758,PIXEL759,PIXEL760,PIXEL761,PIXEL762,PIXEL763,PIXEL764,PIXEL765,PIXEL766,PIXEL767,PIXEL768,PIXEL769,PIXEL770,PIXEL771,PIXEL772,PIXEL773,PIXEL774,PIXEL775,PIXEL776,PIXEL777,PIXEL778,PIXEL779,PIXEL780,PIXEL781,PIXEL782,PIXEL783) from rf.test cross join rfmodels.models; PRED 0 0 3 0 3 3 0 0 1 7
View full article
Background Below is an UDF script in R that executes a scoring model inside EXASOL. While example is using the flights dataset from R and applying a randomForest model, it may be used as a template for R models. It has already proven in a real case where a Neural Network-based scoring process on a multi-bilion dataset was transferred from an R Server into EXASOL, reducing the amount of time needed from approx. 1h to a few mins. How to deploy an R scoring model in EXASOL via UDF Typical points for adaptation are: Step 1 the library needed for scoring must be present (here: library(randomForest) Step 2 the column names in the data.frame used in the UDF must resemble those of the data frame used for building the model. Typically these can be extracted from the model, but for other kinds of models the location needs to be adapted: if (length(m) == 0) { ... # get the predictors names colnames <<- names(m$forest$xlevels) } Here the model is loaded from EXABucket (Redis in EXASOL 5) if not already present in the environment. Depending on the model size, the model loading is the main factor influencing the time needed for the scoring process. The character vector 'colnames' is later used to construct the data.frame; it must contain the column names. Step 3 Later when the InDB scoring is started by the SELECT, the columns need to be referenced in the same order as in the model building data frame (due to pt 2). The GROUP BY parameter ROWNUM / 100 000 may be a lever for performance fine tuning: the more rows per UDF, the faster the scoring, but if an instance gets too many rows, it runs out of memory. R (Client side) - download code for: EXASOL 6:   r_client_v6.R     (as shown here) EXASOL 5:   r_client_v5.R r_client_v6.R # build a model for testing # see www.github.com/exasol/r-exasol library(exasol) library(nycflights13) con <- dbConnect( "exa" , dsn= "exa" ) dbWriteTable(con, "R.FLIGHTS" , flights) df <- dbGetQuery(con, "select * from R.FLIGHTS limit 10000" ) dbDisconnect(con) library(randomForest) set.seed(852) df <- df[complete.cases(df),] model1 <- randomForest(as.factor(carrier) ~ dep_time + dep_delay, data=head(df, n=1000)) # like this , you could do prediction on the client-side: #predict(model1, df[,c( "dep_time" , "dep_delay" )]) # but instead, we want to predict in-database, thus we have to # upload the model to an EXABucket via httr library(httr) PUT( # EXABucket URL url = "http: //192.168.42.129:2101/bucket1/model1" , body = serialize(model1, ascii = FALSE, connection = NULL), # EXABucket: authenticate with write_user__name / write_user__password config = authenticate( "w" , "writepw" ) ) EXASOL (Server side) - download code for: EXASOL 6:   flights_v6.sql     (as shown here) EXASOL 5:   flights_v5.sql flights_v6.sql -- deploying the model open schema R; create or replace r set script pred_randomForest(...) emits (predict varchar (200000)) as library(randomForest) m <- list() colnames <- vector( mode = " character " ) run <- function (ctx) { if ( length ( m ) == 0) { # only load the model if not already present. Must be done in run() as ctx is needed. f <- file (ctx[[1]](), open = "rb" , raw = TRUE ) m <<- unserialize(f) # get the predictors names colnames <<- names ( m $forest$xlevels) } # load entire chunk ctx$next_row(NA) n <- exa$meta$input_column_count s <- paste0( "model_key" , "=ctx[[1]]()" ) for (i in 2:n) { # construct a data .frame from the chunk & colnames name <- colnames[i - 1] s <- paste0(s, ", " , name , "=ctx[[" , i, "]]()" ) } eval(parse( text = paste0( "inp_df <- data .frame(" , s, ")" ))) # predict & emit ctx$emit(predict( m , inp_df[, 2:n])) } / create connection sys_bucketfs to 'bucketfs:bfsdefault/bucket1' identified by 'readpw' ; select pred_randomForest( '/buckets/bfsdefault/bucket1/model1' , -- EXABucketFS path "dep_time" , "dep_delay" ) -- predictors from FLIGHTS group by iproc(), -- the node no for data locality cast ( rownum /100000 as int ); -- restrict the no of rows ( limit is R max . vector size ) Additional references: https://community.exasol.com/t5/database-features/how-to-use-exasolution-r-sdk/ta-p/1224 https://community.exasol.com/t5/environment-management/how-to-create-an-exabucketfs-service-and-bucket/ta-p/1515
View full article
Top Contributors