2 weeks ago
2 weeks ago
To add to @PeterK's excellent answer, I'd like to point out some tradeoffs and hopefully shed some light on how things work:
* Exact statistics come with a non-trivial performance cost. For instance, say you have a table of 1 billion rows and you delete one row; at that precise moment you don't know whether your MIN or MAX statistics are still accurate, and you would need to recompute the statistics for the whole affected column. Such a scenario is cost-prohibitive, so the statistics are not recomputed immediately. They are recomputed only after enough values have changed (via DMLs), so that the relatively large cost of computing them is amortized over a larger set of row updates - either a lot of small updates, a few large updates, or anything in between (all DMLs count as updates here). For this reason the statistics cannot be guaranteed to always be exact.
* The column statistics need to be maintained upon DMLs and thus incur some small overhead for DML queries. To minimize this overhead, we create and maintain statistics only for the necessary columns (the ones used in joins), not for all columns.
* The column statistics generated by the system are intended for internal use and, eventually, to help understand bad join queries - not for checks against table data. To retrieve MIN, MAX, and other table statistics, one can always use the relevant aggregate functions in normal SQL queries against the table of interest.
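For example, a single aggregate query gives exact values on demand; a minimal sketch, where MY_SCHEMA.MY_TABLE and MY_COLUMN are placeholders for your own objects:

```sql
-- Exact statistics computed on demand with ordinary aggregate functions.
-- MY_SCHEMA.MY_TABLE and MY_COLUMN are placeholders, not real objects.
SELECT MIN(MY_COLUMN)            AS min_value,
       MAX(MY_COLUMN)            AS max_value,
       COUNT(*)                  AS row_count,
       COUNT(DISTINCT MY_COLUMN) AS distinct_values
FROM MY_SCHEMA.MY_TABLE;
```

Unlike the internal statistics, such a query is always exact at the moment it runs, at the price of scanning the table.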
Hope it helps.
2 weeks ago - last edited 2 weeks ago
It may also be worthwhile to point out why we create these statistics mentioned by Peter and Gabi automatically: they are for our cost-based query optimizer, which uses statistics to plan queries. They are not meant to be used for exact reports by customers. And a disclaimer for all tables starting with '$': there's a reason why they are not documented. They are for internal purposes. There's no guarantee of consistency or reliability, and their content might change with newer versions of Exasol without notice. Use at your own risk 😉
Best regards
Uwe
2 weeks ago
Hi @jwarlander - There are some column min/max estimates in the "$EXA_COLUMN_STATISTICS" table. Note, however, that these are only estimates, and that the '$' tables are internal and undocumented, so relying on them carries some risk.
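A hedged sketch of how a lookup against that table might look. Since "$EXA_COLUMN_STATISTICS" is undocumented, all column names below other than MIN_MAX_VALUE_EXACT_LIMIT (which comes up later in this thread) are assumptions and may differ in your Exasol version:

```sql
-- Hedged sketch: "$EXA_COLUMN_STATISTICS" is undocumented, so the column
-- names below (other than MIN_MAX_VALUE_EXACT_LIMIT) are assumptions.
SELECT COLUMN_SCHEMA,
       COLUMN_TABLE,
       COLUMN_NAME,
       MIN_VALUE_ESTIMATE,        -- assumed name for the MIN estimate
       MAX_VALUE_ESTIMATE,        -- assumed name for the MAX estimate
       MIN_MAX_VALUE_EXACT_LIMIT  -- flag discussed in this thread
FROM "$EXA_COLUMN_STATISTICS"
WHERE COLUMN_TABLE = 'MY_TABLE';  -- placeholder table name
```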
Hopefully that's helpful.
Peter
2 weeks ago
Thanks @PeterK, that looks interesting!
I guess it would be a bit risky to use, as you say.. but on the other hand, I could test for "compliance" on startup, and only use the table if it behaves as expected with the connected Exasol database. Then there's the "estimates" bit.. I really do need exact data! But on the other hand, I see this flag in the table, MIN_MAX_VALUE_EXACT_LIMIT. I wonder if it means what it seems to say - that when it's set, the values are exact rather than just estimated limits?
Nonetheless, I only seem to have such "exact" statistics for *one* of my tables in the schema I'm testing on, probably because that's an old table that's no longer being updated. The other tables get replaced with full data dumps from a source system every day, and in fact don't seem to have any estimates at all.
We'll see where this goes; for now I'll have to rely on queries, I suspect 😎
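One way such a startup "compliance" check could be sketched: compare the undocumented statistics against a freshly computed aggregate for one known table, and fall back to plain queries on any mismatch. The statistics column names (MIN_VALUE_ESTIMATE, MAX_VALUE_ESTIMATE) are assumptions here, as are the placeholder table and column names:

```sql
-- Hedged sketch of a startup check: compare assumed statistics columns
-- against directly computed aggregates for one known table.
SELECT s.MIN_VALUE_ESTIMATE AS stats_min,  -- assumed column name
       a.actual_min,
       s.MAX_VALUE_ESTIMATE AS stats_max,  -- assumed column name
       a.actual_max
FROM "$EXA_COLUMN_STATISTICS" s
CROSS JOIN (SELECT MIN(MY_COLUMN) AS actual_min,
                   MAX(MY_COLUMN) AS actual_max
            FROM MY_SCHEMA.MY_TABLE) a    -- placeholder objects
WHERE s.COLUMN_SCHEMA = 'MY_SCHEMA'
  AND s.COLUMN_TABLE  = 'MY_TABLE'
  AND s.COLUMN_NAME   = 'MY_COLUMN';
-- If stats_min <> actual_min or stats_max <> actual_max, treat the
-- statistics table as unreliable and fall back to plain queries.
```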
2 weeks ago
Yeah, I suspected as much!
I'm also in need of min/max/avg values over a few different time periods, which of course can't be derived from whole-column statistics anyway.. My best bet is probably to optimize the way Redata queries for these statistics - specifically, to grab as much as I can in a single query on the table instead of one query per statistic per column.
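That batching idea could be sketched roughly like this: one table scan collects several statistics for several columns at once, optionally restricted to a time window. All table and column names here are placeholders:

```sql
-- Hedged sketch: one table scan instead of one query per statistic/column.
-- MY_SCHEMA.MY_TABLE, COL_A, COL_B and EVENT_TIME are placeholders.
SELECT COUNT(*)              AS row_count,
       MIN(COL_A)            AS col_a_min,
       MAX(COL_A)            AS col_a_max,
       AVG(COL_A)            AS col_a_avg,
       MIN(COL_B)            AS col_b_min,
       MAX(COL_B)            AS col_b_max,
       COUNT(DISTINCT COL_B) AS col_b_distinct
FROM MY_SCHEMA.MY_TABLE
WHERE EVENT_TIME >= CURRENT_TIMESTAMP - INTERVAL '7' DAY;  -- example window
```

Separate time periods could be handled the same way with conditional aggregation (e.g. CASE expressions inside the aggregates), still in a single scan.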
2 weeks ago
Hi @jwarlander ,
I mean, what even counts as "interesting", right? Jokes aside, I'll definitely see what I can find out! Please bear with me.