Tech Blog

Exasol AWS Glue Integration Using Spark Connector

AWS Glue is a serverless data integration service that makes it easy to discover, prepare and combine data for analytics, machine learning and application development. It provides all necessary capabilities for data integration so that users can start analyzing and making sense of data within minutes.

In this tutorial we are going to show you how to integrate AWS Glue ETL (extract, transform and load) with the Exasol database using the Exasol Spark connector.

Prerequisites

To follow this tutorial, you should know how to set up and configure AWS services such as VPC, subnets, security groups, endpoints and IAM roles.

Deploying Exasol Cluster in AWS

First, we are going to deploy an Exasol cluster in AWS. You can use CloudTools or the Terraform Module to set up an Exasol cluster.
Once you have the cluster running, note down the VPC id, subnet and security groups of the Exasol nodes.

For example, for this tutorial we are going to use these:

  • VPC Id: vpc-044ebad71ff8ee446
  • Subnet: subnet-0864474a549a6f3e3
  • Security Group: sg-081a168693892392f

Updating the Security Group

Please make sure that the security group follows the rules listed below.

  • Ensure that the security group has a self-referencing rule for all traffic
  • Ensure that the security group has a rule for all outbound traffic

You can also restrict the outbound traffic to TCP only and to the S3 bucket by using the s3-prefix-list-id as the destination. For more information, please read Setting Up a VPC to Connect to JDBC Data Stores in the AWS Glue documentation.
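If you prefer to script these rules instead of editing them in the console, the sketch below shows one way to add the self-referencing ingress rule with the AWS SDK for Java v2 used from Scala. This is only a minimal sketch, assuming the default allow-all egress rule is still in place and that the security group id matches the one noted above.

import software.amazon.awssdk.services.ec2.Ec2Client
import software.amazon.awssdk.services.ec2.model.{AuthorizeSecurityGroupIngressRequest, IpPermission, UserIdGroupPair}

object AllowSelfReference {
  def main(args: Array[String]): Unit = {
    // Security group of the Exasol nodes used in this tutorial
    val securityGroupId = "sg-081a168693892392f"
    val ec2 = Ec2Client.create()
    // Protocol "-1" means all traffic; the source is the security group itself (self reference)
    val selfReference = IpPermission.builder()
      .ipProtocol("-1")
      .userIdGroupPairs(UserIdGroupPair.builder().groupId(securityGroupId).build())
      .build()
    ec2.authorizeSecurityGroupIngress(
      AuthorizeSecurityGroupIngressRequest.builder()
        .groupId(securityGroupId)
        .ipPermissions(selfReference)
        .build())
    ec2.close()
  }
}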

Adding S3 Endpoint to VPC

We are going to store the Exasol Spark connector jar file in an S3 bucket that will be used by Glue. For that, Glue requires an S3 endpoint to access it.

Add an S3 endpoint to the VPC that will be used by Glue.

  • Open the Amazon VPC console at https://console.aws.amazon.com/vpc
  • In the navigation pane, choose Endpoints, and click Create Endpoint in the top right corner
  • For Service category, ensure that AWS services is selected
  • For Service Name, type “S3” and press Enter
  • From the provided options, choose the S3 gateway service
  • Then choose the VPC id, vpc-044ebad71ff8ee446
  • In the route tables section, select the route table associated with the subnet, subnet-0864474a549a6f3e3
  • Add tags and finish creating the endpoint

For this tutorial, we are going to allow Full Access to the S3 endpoint.
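The same gateway endpoint could also be created programmatically. The following is a minimal sketch using the AWS SDK for Java v2 from Scala; the route table id and the region in the service name are hypothetical placeholders that you would replace with the route table associated with your subnet and the region of your deployment.

import software.amazon.awssdk.services.ec2.Ec2Client
import software.amazon.awssdk.services.ec2.model.{CreateVpcEndpointRequest, VpcEndpointType}

object CreateS3GatewayEndpoint {
  def main(args: Array[String]): Unit = {
    val ec2 = Ec2Client.create()
    val request = CreateVpcEndpointRequest.builder()
      .vpcId("vpc-044ebad71ff8ee446")            // VPC of the Exasol cluster
      .serviceName("com.amazonaws.eu-west-1.s3") // adjust the region to your deployment
      .vpcEndpointType(VpcEndpointType.GATEWAY)
      .routeTableIds("rtb-0123456789abcdef0")    // hypothetical: route table of subnet-0864474a549a6f3e3
      .build()
    val response = ec2.createVpcEndpoint(request)
    println(s"Created endpoint: ${response.vpcEndpoint().vpcEndpointId()}")
    ec2.close()
  }
}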

Creating Glue IAM Role

In order to use the Glue service, we should create an IAM role with policies that allow Glue to act on our behalf.

  • Go to the IAM service at https://console.aws.amazon.com/iam/
  • Switch to Roles and click Create Role
  • Select AWS Service and then Glue, and click Next: Permissions
  • Type “S3” into Filter Policies and press Enter, and then select the AmazonS3FullAccess policy
  • Similarly, type “Glue” and press Enter, and then select the AWSGlueServiceRole policy
  • These are the default policies; however, you can also customize them
  • Then click Next, and provide a descriptive Role name and Role description
  • Add tags, review and create role
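If you would rather create the role from code than from the console, here is a minimal sketch using the AWS SDK for Java v2 from Scala. It uses the standard Glue trust policy and the two managed policies mentioned above; the role name GlueExasolETLRole is a hypothetical placeholder.

import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.iam.IamClient
import software.amazon.awssdk.services.iam.model.{AttachRolePolicyRequest, CreateRoleRequest}

object CreateGlueRole {
  def main(args: Array[String]): Unit = {
    // IAM is a global service
    val iam = IamClient.builder().region(Region.AWS_GLOBAL).build()
    // Trust policy that allows the Glue service to assume this role
    val trustPolicy =
      """{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]}"""
    val roleName = "GlueExasolETLRole" // hypothetical role name
    iam.createRole(CreateRoleRequest.builder()
      .roleName(roleName)
      .assumeRolePolicyDocument(trustPolicy)
      .build())
    // Attach the default managed policies used in this tutorial
    Seq(
      "arn:aws:iam::aws:policy/AmazonS3FullAccess",
      "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
    ).foreach { policyArn =>
      iam.attachRolePolicy(AttachRolePolicyRequest.builder().roleName(roleName).policyArn(policyArn).build())
    }
    iam.close()
  }
}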

Creating Glue Connection

In this section we are going to create a connection that will be attached to the Glue ETL jobs. However, this connection will not actually be used for querying. It is still required since it is the only way to connect the AWS Glue managed service to the Exasol nodes running in a separate VPC and subnet.

In this tutorial, we are going to create a temporary connection using a MySQL URL, because at the time of writing this guide, there is no Exasol connection type available.

  • Go to AWS Glue Service
  • Select Connections on the left, and click Add connection
  • Give a descriptive name and select “JDBC” as connection type, and click Next
  • For JDBC URL, add a custom MySQL connection string: jdbc:mysql://10.10.0.11:8563/schema=SYS
  • The IP address in the URL does not have to exist, since it will only be used as a placeholder
  • You can find other connections on AWS Glue Connection Properties page
  • Add “sys” as the username and a dummy password; they will not be used
  • Select the VPC id, vpc-044ebad71ff8ee446, where the Exasol cluster is running
  • Similarly, select the subnet, subnet-0864474a549a6f3e3 and choose security group, sg-081a168693892392f, of the Exasol nodes
  • Click on Next, review and click on Finish to create a connection

Uploading Spark Connector Jar to S3 Bucket

To integrate the Glue ETL job with the Exasol database, we are going to use the Exasol Spark connector. Before starting an ETL job, we should upload the assembled connector jar file to S3 so that Glue can access it. Download the latest assembled jar (with the ‘-assembly’ suffix) from the spark-connector latest release assets. Please make sure it is compatible with the AWS Glue Scala and Spark versions.

For this post, I am going to use AWS Glue with Spark version 3.1 and Scala version 2.x (Glue version 3).

And finally, upload the jar file to an S3 bucket. For example, I have stored it at the location s3://exasol-artifacts/spark-connector/spark-connector_2.12-1.2.0-spark-3.1.2-assembly.jar. We are going to reference it when creating the Glue ETL job.

This is also the reason we created an S3 endpoint for AWS Glue: so that it can access the jar file stored in the S3 bucket.
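As an optional sanity check, you can verify that the uploaded jar is reachable before wiring it into the job. Here is a minimal sketch using the AWS SDK for Java v2 from Scala, assuming the example bucket and key shown above:

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.HeadObjectRequest

object CheckConnectorJar {
  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create()
    // HEAD request against the example jar location used in this tutorial
    val head = s3.headObject(HeadObjectRequest.builder()
      .bucket("exasol-artifacts")
      .key("spark-connector/spark-connector_2.12-1.2.0-spark-3.1.2-assembly.jar")
      .build())
    println(s"Connector jar found, size: ${head.contentLength()} bytes")
    s3.close()
  }
}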

Creating Exasol Table

For testing the integration, we are going to access a table in the Exasol database, so let's create it first. Using your favorite SQL client, connect to the Exasol database that you deployed to AWS.

Run the following SQL statements to create a table:

DROP SCHEMA IF EXISTS BASIC CASCADE;
CREATE SCHEMA BASIC;

CREATE TABLE NUMERIC_TYPES (
  C_DECIMAL DECIMAL(3,2),
  C_DOUBLE DOUBLE PRECISION
);

INSERT INTO NUMERIC_TYPES VALUES(NULL, NULL);
INSERT INTO NUMERIC_TYPES VALUES(1.23, 3.14159265358979323846264338327950288);
INSERT INTO NUMERIC_TYPES VALUES(0.0, 0.0);
INSERT INTO NUMERIC_TYPES VALUES(-9.99, -5.555555555555);
INSERT INTO NUMERIC_TYPES VALUES(-1.11, -111.11111111111);
INSERT INTO NUMERIC_TYPES VALUES(9.99, 9.9999999999999);
INSERT INTO NUMERIC_TYPES VALUES(-9.99, -9.9999999999999);

It creates a small table with decimal and double precision columns.

Creating AWS Glue ETL Job

In this section, we are going to create the Glue ETL job that queries the Exasol database. We are going to choose the Spark Scala programming API. However, you can follow similar steps for creating a job using the Python PySpark API.

  • Go to the AWS Glue service, on the left side select ETL and then Jobs, and click Add job
  • Give it a name, and select the IAM role you created above
  • Select Spark as a type and “Spark 3.1, Scala 2 (Glue Version 3.0)” as a Glue version
  • Select “A new script to be authored by you” as how the job runs
  • Give a class name for the script, and a script file name
  • Similarly, select S3 locations for the script and temporary directory, or leave the default entries
  • And configure “Security configuration, script libraries, and job parameters (optional)”
  • For the “Dependent jars path”, add the S3 location of the Spark connector jar
  • For example, “s3://exasol-artifacts/spark-connector/spark-connector_2.12-1.2.0-spark-3.1.2-assembly.jar”
  • And optionally add some job parameters, so that they can be modified with each job run
  • For example, HOST, PORT, JDBC_USERNAME and JDBC_PASSWORD
  • If you are using a recent version of the Exasol database, please also add a parameter for the certificate fingerprint
  • Click Next and select the JDBC connection we created before
  • Save and Edit the script

Editing the ETL Script

We are going to use the following Scala Spark script. It runs a basic select query over the table we created in the Exasol database. You can modify it for more advanced ETL queries.

import scala.collection.JavaConverters._

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job

import org.apache.spark.SparkContext

object AWSGlueExasolETLJob {

    def main(args: Array[String]): Unit = {
        // Initialize the Glue and Spark contexts
        val glueContext = new GlueContext(new SparkContext())
        val sparkSession = glueContext.getSparkSession

        // Resolve the parameters provided with the job run
        val jobParams = Seq("JOB_NAME", "HOST", "PORT", "JDBC_USERNAME", "JDBC_PASSWORD", "FINGERPRINT")
        val options = GlueArgParser.getResolvedOptions(args, jobParams.toArray)
        Job.init(options("JOB_NAME"), glueContext, options.asJava)

        val host = options("HOST")
        val port = options("PORT")
        val username = options("JDBC_USERNAME")
        val password = options("JDBC_PASSWORD")
        val fingerprint = options("FINGERPRINT")

        // Query over the table created earlier in the Exasol database
        val queryString = """
            SELECT *
            FROM BASIC.NUMERIC_TYPES
            WHERE C_DOUBLE > 0.0
        """

        // Read the query result into a Spark dataframe using the Exasol connector
        val df = sparkSession
            .read
            .format("exasol")
            .option("host", host)
            .option("port", port)
            .option("username", username)
            .option("password", password)
            .option("fingerprint", fingerprint)
            .option("query", queryString)
            .load()

        // Print the resulting rows to the job logs
        df.collect().foreach(println)

        Job.commit()
    }
}
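The connector can also write data back to Exasol. As a hedged sketch, the following lines could be added inside the main method before Job.commit() to persist the query result into another table; the target table BASIC.RESULTS is a hypothetical placeholder that is not created in this tutorial, and the supported save modes may depend on the connector version.

// Hypothetical example: write the filtered rows into a separate Exasol table
df.write
  .mode("append")
  .format("exasol")
  .option("host", host)
  .option("port", port)
  .option("username", username)
  .option("password", password)
  .option("fingerprint", fingerprint)
  .option("table", "BASIC.RESULTS") // hypothetical target table
  .save()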

Save the script and let us run the job.

Running the ETL Job

You can click Run job, enter or modify the parameters such as the username and password for the Exasol database, and run the AWS Glue ETL job.

When you run the above script, you should see the following lines in the Logs section.

[1.23,3.141592653589793]
[9.99,9.9999999999999]

Conclusion

In this guide, we showed you how you can integrate AWS Glue with the Exasol database using our Spark connector. For more information on the Exasol Spark connector, you can check out the project user guide.

exa-Muhammet