I have a use case where I want to extract data from an API with Python and load it into the database. Instead of hosting the script somewhere, which involves additional infrastructure, I am thinking about using a UDF:
- while NextLink: call API, process result and append to DataFrame
- EMIT DataFrame
My particular API has a rate limit of 50 calls/s, so for every 10k records extracted the runtime is at least 3.33 minutes. Due to the low API limit, running it in parallel makes little sense, so it’s probably best to limit it to one instance and node:
%perNodeAndCallInstanceLimit 1
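The pagination loop described above can be sketched as plain Python. Everything here is illustrative: `fetch_page`, the URLs, and the record shape are hypothetical stand-ins for the real API, and the rate limiting is a simple minimum-interval throttle sized to the 50 calls/s limit.

```python
import time

# Hypothetical stand-in for the real API: returns a page of records plus
# the NextLink URL, or None when there are no more pages.
def fetch_page(url):
    pages = {
        "https://api.example.com/items?page=1": ([{"id": 1}, {"id": 2}],
                                                 "https://api.example.com/items?page=2"),
        "https://api.example.com/items?page=2": ([{"id": 3}], None),
    }
    return pages[url]

def extract_all(start_url, max_calls_per_sec=50):
    """Follow NextLink until exhausted, honoring the rate limit."""
    rows, url = [], start_url
    min_interval = 1.0 / max_calls_per_sec
    while url:
        started = time.monotonic()
        page, url = fetch_page(url)  # in a real UDF: e.g. requests.get(url).json()
        rows.extend(page)
        # Simple throttle: never issue calls faster than the API allows.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return rows
```

Inside an Exasol Python UDF, the collected rows would be emitted via `ctx.emit(...)` at the end instead of being returned.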
- Is there a runtime limit for UDFs?
- Does running a UDF with a long runtime (e.g. 30 minutes) have any negative effects on other users?
- How can I find out what the available RAM for my UDF is to avoid OOM exceptions?
Speaking of exceptions: the UDF could also be designed so that each new API result is processed and EMIT-ed immediately after processing:
- call API
- process and EMIT result
- call NextLink
What is the behavior if the UDF encounters an exception after already EMIT-ing some rows: are the (partial) results saved, or is the whole output considered void?
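The emit-as-you-go variant above can be sketched like this. `FakeContext` is a hypothetical stand-in for the `ctx` object Exasol passes to `run()`, and `fetch_page` is again an illustrative stub:

```python
# Minimal sketch of the incremental emit pattern, with stubbed API and context.
class FakeContext:
    """Stand-in for the UDF context: collects emitted rows in a list."""
    def __init__(self):
        self.rows = []

    def emit(self, *values):
        self.rows.append(values)

def fetch_page(url):
    # Hypothetical paginated API: (records, next_url_or_None).
    pages = {
        "page1": ([("a", 1), ("b", 2)], "page2"),
        "page2": ([("c", 3)], None),
    }
    return pages[url]

def run(ctx, start_url="page1"):
    url = start_url
    while url:
        page, url = fetch_page(url)
        for record in page:
            ctx.emit(*record)  # emitted immediately, before the next API call

ctx = FakeContext()
run(ctx)
# ctx.rows now holds everything emitted so far.
```

Whether rows already passed to `emit()` survive a later exception is exactly the question here; the structure just makes the failure point easy to reason about.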
Hi @robert_nagy,
With long-running UDFs, you have to pay attention to a few things.
- UDFs run inside a query and therefore inside a transaction. The emitted rows only become visible to other transactions once that transaction commits. So if your UDF runs for 30 minutes in one go, the 10k records appear in the table for other transactions only after those 30 minutes.
- The UDFs themselves have no runtime limit (if you mean the time they can run). However, Exasol supports a query timeout feature; you need to make sure this isn’t set. Furthermore, if your SQL client gets disconnected from the DB, your session might get killed as well, so a stable connection to the DB is needed.
- A remark regarding
%perNodeAndCallInstanceLimit 1
: this option doesn’t limit the UDF to one instance per cluster, it limits it to one instance per node and call in a query. That means, given enough input, a UDF with this option will still start one instance per node. Furthermore, if the UDF is called multiple times in a query, or in different queries, you still get at least (number of calls × number of nodes) instances. To reduce the number of instances to one for an arbitrary input size, you need to use a SET EMITS function that is called without a GROUP BY clause. This way you can at least guarantee an instance limit of 1 per call.
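A rough sketch of what that SET EMITS shape could look like. All names, types, and the exact DDL details are illustrative, not a tested definition:

```sql
-- Hypothetical sketch: a SET ... EMITS script called without GROUP BY
-- processes its whole input in a single instance.
CREATE OR REPLACE PYTHON3 SET SCRIPT api_extract(start_url VARCHAR(2000))
EMITS (id DECIMAL(18,0), payload VARCHAR(2000000)) AS
def run(ctx):
    # paginate over the API and ctx.emit(...) each record
    pass
/

-- No GROUP BY: the (single-row) input goes to one instance.
SELECT api_extract('https://api.example.com/items?page=1');
```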
- If a UDF encounters an unhandled exception, it fails and all previously emitted data is lost. So good error handling is crucial. Often that means catching all exceptions and, if the error is not recoverable for a specific input row, emitting a row that contains the error message.
- Regarding the influence on other users: as with all long-running queries, you occupy one of the database sessions. With many users, this can lead to a shortage of sessions.
- A first hint regarding the available RAM for UDFs can be found in the UDF metadata: it contains a memory-limit property which all UDF instances of a single query together shouldn’t exceed. Besides this, there is another RAM limit for UDFs and other processes (it doesn’t include DBRAM), the so-called System Heap. If all UDF instances and processes of all queries on a node together exceed this limit, a query will be killed to free up memory. For more information, see the following knowledge base article.
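One way to use that memory-limit metadata is to size an in-memory batch before emitting. This is only a sketch: the per-row size estimate and the headroom fraction are assumptions to tune, and in a real Python UDF the limit would come from the metadata object (e.g. `exa.meta.memory_limit`) rather than a literal:

```python
def safe_batch_size(memory_limit_bytes, est_row_bytes, headroom=0.25):
    """Rough batch sizing: keep buffered rows well below the UDF memory limit.

    `headroom` reserves a fraction of the limit for the interpreter itself;
    both it and the row-size estimate are assumptions, not measured values.
    """
    usable = memory_limit_bytes * (1.0 - headroom)
    return max(1, int(usable // est_row_bytes))

# In a real UDF: limit = exa.meta.memory_limit
# Here we just plug in 2 GiB with a ~1 KiB per-row estimate:
print(safe_batch_size(2 * 1024**3, 1024))  # → 1572864
```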
Taking all this together, it might be better in this specific scenario to let the UDF run for only a fixed number of requests, then finish the UDF instance and the query, commit the transaction, and call the UDF in a loop. You could do this, for example, with a Lua script (not a UDF): with pquery you can run SQL queries inside the script and also commit the transaction. Alternatively, you could run the loop on the client side with one of our drivers.
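A rough sketch of that batched loop as an Exasol Lua script. All table, script, and column names are illustrative, and the way batch state is tracked (a `state_table` holding the next URL and a has-more flag) is an assumption about how one might wire it up:

```sql
-- Each iteration runs the UDF for a bounded number of requests and commits,
-- so progress made by earlier batches survives a later failure.
CREATE OR REPLACE SCRIPT extract_in_batches() AS
    local more = true
    while more do
        local suc, res = pquery([[
            INSERT INTO target_table
            SELECT api_extract_batch(next_url, 500) FROM state_table
        ]])
        if not suc then error(res.error_message) end
        pquery([[COMMIT]])
        local suc2, check = pquery([[SELECT has_more FROM state_table]])
        more = suc2 and (check[1].HAS_MORE == true)
    end
/
```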