Parallel connections/connection pools in Python?

Padawan

Hey there,

I'm currently developing a web app using pyexasol and Flask to make some data viewable over the web.
As with most web apps, parallel visits can and will occur, which Flask handles gracefully. Unfortunately, pyexasol does not support concurrent requests on a single connection out of the box.
Instead, it raises an ExaConcurrencyError: "Connection cannot be shared between multiple threads sending requests simultaneously".
My first thought was to use connection pools, as I would with other database solutions.
Are there any best practices or solutions for this problem? I'm not limited to using pyexasol, but it seems to be the best choice when working with Python.
I can't be the only one facing the problem of integrating Exasol into a real-world app instead of using it only for data science purposes.
So far, I have only found this GitHub issue https://github.com/badoo/pyexasol/issues/23, which recommends mirroring the data into a caching layer backed by some other database.
But the whole point of using Exasol is to make the data quickly accessible and to reduce loading times. Is this really necessary?
Has anyone solved this problem yet? Any recommendations? Otherwise I'll probably have to implement some kind of queue, as also mentioned in the issue.

Thanks


4 REPLIES

Xpert

We don't have a Flask application but a Django one, which isn't that far from this scenario.

The application mainly runs predefined, fast-running reports and visualizations (with some parameters available).

We just use a regular connection pool and have never run into any issues (just like the solution you mentioned).

To see if too many active connections are being used, monitor for

exacs(3.2): Query queue limit of active sessions nearly reached, running: 80 of 100 (+ 0 WAIT FOR COMMIT) are in use.

in your EXAOperation log.

Even if you hit 100 of 100, the next sessions to be executed will simply be queued and won't fail with an error.

Padawan

Are you using pyexasol, and have you implemented the connection pooling yourself, or is there already some existing work I can use so I don't have to reinvent the wheel?

Xpert (Accepted Solution)

Unlike classic row-store databases (MySQL, PostgreSQL), analytical column-store databases cannot serve hundreds or thousands of simultaneous connections efficiently. The main goal of an analytical database is to serve a small number of complex queries in parallel.

When you expose Exasol directly in a web interface, it may work initially. But as soon as more users start running reports like this, you will see a drop in performance. You can expect to handle roughly 20-40 parallel queries before it becomes too slow.

In theory, you can implement basic connection pooling by creating multiple pyexasol connection objects and having a simple "slot" or "queue" system in place that assigns one connection to one specific user at any moment in time (a rough sketch of this is shown below). But it won't scale well.

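For illustration, such a slot system could look something like the sketch below. This is only an outline; the DSN, credentials and pool size are placeholders, not anything from this thread.

```python
# Rough connection-pool sketch for pyexasol (DSN, credentials and pool size
# are placeholders). Each request borrows one dedicated connection, so a
# connection is never shared between threads sending requests at the same time.
import queue
import pyexasol

def create_pool(size=5):
    pool = queue.Queue(maxsize=size)
    for _ in range(size):
        pool.put(pyexasol.connect(
            dsn='exasol.example.com:8563',   # placeholder DSN
            user='report_user',              # placeholder credentials
            password='***',
            fetch_dict=True,
        ))
    return pool

def run_query(pool, sql):
    conn = pool.get()                # blocks until a connection ("slot") is free
    try:
        return conn.execute(sql).fetchall()
    finally:
        pool.put(conn)               # hand the connection back to the pool
```

In a Flask app, the pool would be created once at startup and shared by all request handlers.
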
An alternative approach that scales well:

  1. When a user tries to run a report via the Flask web interface, save the report parameters in a queue and wait;
  2. A queue processor picks up the report parameters and runs the actual query, saving the result to MySQL / PostgreSQL / Hadoop / whatever. You may use an EXPORT query to write the result directly from Exasol to the row store (see the sketch after this list);
  3. Once the queue processor has finished working on the report, select the result from the row store and display it.

 

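As a rough illustration of steps 1 and 2, a queue processor could look like the sketch below. The queue, the example report query, the table names and the JDBC connection name ROWSTORE_JDBC are all assumptions made for this sketch; the JDBC connection to the row store would have to be created in Exasol beforehand with CREATE CONNECTION.

```python
# Queue-processor sketch (all table, connection and report names are placeholders).
# Report parameters arrive on a queue; the worker runs the report in Exasol and
# writes the result straight into a row-store table via EXPORT ... INTO JDBC,
# so the result set never has to pass through Python.
import queue
import pyexasol

report_queue = queue.Queue()   # filled by the Flask request handlers

def queue_processor(exa):      # exa: an open pyexasol connection
    while True:
        report = report_queue.get()   # e.g. {'name': 'revenue_by_country', ...}
        # Build the actual report query from the parameters in `report`;
        # a fixed example query is used here to keep the sketch short.
        exa.execute("""
            EXPORT (
                SELECT country, SUM(revenue) AS revenue
                FROM sales.orders
                GROUP BY country
            )
            INTO JDBC AT ROWSTORE_JDBC          -- connection created beforehand
            TABLE reporting.report_results      -- existing table in the row store
        """)
        report_queue.task_done()   # the web layer can now read the row store
```

Step 3 is then just an ordinary SELECT against reporting.report_results in the row store.
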
If another user opens the same report within a short amount of time, serve it directly from the row store, but add an explicit "Refresh" button just in case.

If another user opens the same report while the original one is still running, DO NOT enqueue a new report, but wait for the original one instead.

If too many users are trying to run different reports simultaneously, do not run more than 5 queries at once, but make sure to notify users that their query is not running yet and is waiting in the queue (a rough sketch of both rules is shown below). Ultimately, this approach will lead to better overall performance and very high scalability.
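
A hedged sketch of the de-duplication and the "at most 5 queries" limit, building on the hypothetical report_queue above; report_key() and run_report_in_exasol() are placeholder helpers standing in for your own code, not a real API.

```python
# Sketch of report de-duplication plus a hard cap of 5 concurrent queries.
# report_queue is the queue from the sketch above; report_key() and
# run_report_in_exasol() are hypothetical helpers.
import threading

MAX_PARALLEL_QUERIES = 5
query_slots = threading.BoundedSemaphore(MAX_PARALLEL_QUERIES)

in_flight = set()                     # report keys currently queued or running
in_flight_lock = threading.Lock()

def submit_report(params):
    key = report_key(params)          # e.g. a hash of the report parameters
    with in_flight_lock:
        if key in in_flight:
            return 'already_running'  # tell the user to wait for the original report
        in_flight.add(key)
    report_queue.put(params)
    return 'queued'                   # show the user "waiting in queue"

def worker():
    while True:
        params = report_queue.get()
        try:
            with query_slots:         # at most 5 reports hit Exasol at once
                run_report_in_exasol(params)   # e.g. the EXPORT from the sketch above
        finally:
            with in_flight_lock:
                in_flight.discard(report_key(params))
            report_queue.task_done()
```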


Xpert

Can't you do some magic here, @littlekoi? 🙂