Column value metadata

jwarlander
Xpert
Hello,

I'm curious about if Exasol keeps track of any interesting metadata about column values, in a way that one could access through system tables for example.. Eg. current min/max values, NULL value counts, or anything similar?

I'm working on an Exasol integration for Redata (https://github.com/redata-team/redata), a data monitoring solution, and it would be really nice to avoid querying for some of these things directly in the source data if Exasol already knows!
1 ACCEPTED SOLUTION

exa-Gabi
Team Exasol
Team Exasol

To add to @PeterK excellent answer, I'd like to point out some tradeoffs and hopefully shed some light on how things work:

* Exact statistics come with a non-trivial performance cost attached. For instance, say you have a table of 1 billion rows and you delete one row; at this precise moment you don't know whether your MIN or MAX statistics are still accurate and you need to recompute your statistics for the whole affected column. Such a scenario is really cost-prohibitive and the statistics are not recomputed. The statistics are recomputed only after many enough values have been changed (via DMLs), so that the relatively large cost of computing them is amortized over a larger set of row updates - either a lot of small updates or a few large updates or anything in between (all DML's count as updates here). For this reason the statistics are not can cannot be guaranteed to be always exact.
* The column statistics need to be maintained upon DML's and thus incur some small overhead for DML queries. To minimize this overhead we create and maintain the statistics only for the necessary columns (the ones used in joins) and not for all columns.
* The column statistics generated by the system are for intended to be for internal use and, eventually, help understand bad join queues, not for being used for checks against table data. To retrieve MIN, MAX and other table statistics one can always use the relevant aggregated functions provided by the system using normal SQL queries against the table of interest

Hope it helps.

EXA-Dude, database developer, history buff. Member of Team Exasol.

View solution in original post

6 REPLIES 6

exa-Uwe
Moderator
Moderator

It may also be worthwhile to point out why we are creating these statistics mentioned by Peter and Gabi  automatically: It's for our query optimizer that works cost-based (using statistics). They are not meant to be used for exact reports by customers. And disclaimer for all tables starting with '$': There's a reason why they are not documented. They are for internal purposes. There's no guarantee for consistency or reliability and their content might change with newer versions of Exasol without notice. Use at your on risk 😉

Best regards

Uwe

PeterK
Xpert

Hi @jwarlander  - There are some column min/max estimates in the "$EXA_COLUMN_STATISTICS" table. Note however that:

  1. It's only populated on an as-needed basis so it doesn't always have the info.
  2. They are estimates
  3. Tables starting with "$" can change without notice between Exasol versions...so it's "use at your own risk"

Hopefully that's helpful.

Peter

jwarlander
Xpert

Thanks @PeterK, that looks interesting!

I guess it would be a bit risky to use, as you say.. but on the other hand, I could test for "compliance" on startup, and use the table if it works as expected with the connected Exasol database. Then there's the "estimates" bit.. I really do need absolute data! But on the other hand I see this flag in the table, MIN_MAX_VALUE_EXACT_LIMIT. I wonder if it means what it seems to say, that they are exact values rather than limits if it's set?

Nonetheless, I only seem to have such "exact" statistics for *one* of my tables in the schema I'm testing on, probably because that's an old table that's not getting updated anymore. The other tables get replaced with full data dumps from a source system every day, and seem in fact to not even have any estimates at all.

We'll see where this goes; for now I'll have to rely on queries, I suspect 😎

exa-Gabi
Team Exasol
Team Exasol

To add to @PeterK excellent answer, I'd like to point out some tradeoffs and hopefully shed some light on how things work:

* Exact statistics come with a non-trivial performance cost attached. For instance, say you have a table of 1 billion rows and you delete one row; at this precise moment you don't know whether your MIN or MAX statistics are still accurate and you need to recompute your statistics for the whole affected column. Such a scenario is really cost-prohibitive and the statistics are not recomputed. The statistics are recomputed only after many enough values have been changed (via DMLs), so that the relatively large cost of computing them is amortized over a larger set of row updates - either a lot of small updates or a few large updates or anything in between (all DML's count as updates here). For this reason the statistics are not can cannot be guaranteed to be always exact.
* The column statistics need to be maintained upon DML's and thus incur some small overhead for DML queries. To minimize this overhead we create and maintain the statistics only for the necessary columns (the ones used in joins) and not for all columns.
* The column statistics generated by the system are for intended to be for internal use and, eventually, help understand bad join queues, not for being used for checks against table data. To retrieve MIN, MAX and other table statistics one can always use the relevant aggregated functions provided by the system using normal SQL queries against the table of interest

Hope it helps.

EXA-Dude, database developer, history buff. Member of Team Exasol.

View solution in original post

jwarlander
Xpert

Yeah, I suspected as much!

I'm also in need of min/max/avg values over a few different time periods, which of course wouldn't be possible to derive from total column statistics.. My best bet is probably to try and optimize the way Redata queries for these statistics, specifically try and grab as much as I can in a single query on the table instead of one query per statistic per column.

exa-MathiasHo
Community Manager
Community Manager

Hi @jwarlander ,

I mean, what is "interesting", right? Jokes aside, I definitely will see what I can find out! Please bear with me.