FIRST_VALUE and LAST_VALUE with empty OVER() shouldn't produce identical result

lukaseder
SQL-Fighter

I'm using Exasol 7.0.9.

The following query:

select
  a,
  first_value(a) over (),
  last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a

Produces:

|A  |FIRST_VALUE(T.A) OVER()|LAST_VALUE(T.A) OVER()|
|---|-----------------------|----------------------|
|1  |1                      |1                     |
|2  |1                      |1                     |
|3  |1                      |1                     |
|4  |1                      |1                     |

I think the LAST_VALUE function should produce 4, assuming the semantics of empty OVER() clauses for these functions is ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

It would also be more consistent with NTH_VALUE, which still produces 2 e.g. for NTH_VALUE(a, 2) OVER ()

1 ACCEPTED SOLUTION

exa-GeorgD
Team Exasol
Team Exasol

Hello Lukas,
as always thank you for your contribution and for finding another bug :). As this is is a good opportunity to share some Exasol internals in the community, I will try to give a thorough explanation.


FIRST_VALUE and LAST_VALUE with OVER()

Let us start with your query:

select
  a,
  first_value(a) over (),
  last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a

The Exasol compiler transforms this to the following query:

select
  a,
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL),
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL)
from (values(1), (2), (3), (4)) as t(a)
order by a;


Before you ask, yes there is no typo here ;). This explains also the results you get.
It is wrong to do that optimization if there is more than one empty OVER() in the query.However, the bug is not the existence of this optimization, but that it is used while FIRST_VALUE OVER () or
NTH_VALUE OVER() are present in the query together with LAST_VALUE OVER().
The optimization also only affects LAST_VALUE/FIRST_VALUE with an empty OVER() clause.

Workaround
You basically discovered the workaround yourself. You can replace OVER() with OVER(ORDER BY (SELECT 1)).
However, OVER(ORDER BY (SELECT 1) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) would be slightly faster in Exaol.
Based on the SQL standard, using an ORDER BY in an OVER clause without window frame automatically leads to the default window frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
For a single order value both deliver equivalent results. But this requires an extra detection and optimization by the database.

Consistency
You are also right, that the SQL standard guarantees the same order for analytic functions with the same ordering within a query. This means that all analytic functions with an empty OVER() clause need to have the same order.

Results are nondeterministic
However, what is not guaranteed is ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING for OVER(). In fact, as the computations in Exasol happen in parallel, there is no consistent order between two executions of the same query. You can see that if you execute your query on a larger data set multiple times:

CREATE SCHEMA test;
CREATE OR REPLACE TABLE t1(a int);
CREATE OR REPLACE TABLE ten(col int);
INSERT INTO ten VALUES 0,1,2,3,4,5,6,7,8,9;
INSERT INTO t1 select a.col+10*b.col+100*c.col+1000*d.col from ten a,ten b,ten c,ten d;SELECT COUNT(*) FROM t1;
SELECT FIRST_VALUE(a), LAST_VALUE(a) FROM t1 GROUP BY NULL;
select
  a,
  first_value(a) over (order by (select 1)) f,
  nth_value(a, 1) over (order by (select 1)) n1,
  nth_value(a, 2) over (order by (select 1)) n2,
  nth_value(a, 3) over (order by (select 1)) n3,
  nth_value(a, 4) over (order by (select 1)) n4,
  last_value(a) over (order by (select 1)) l
from t1
order by a limit 3;


There are several possible results. Here are just two, to illustrate the problem:
Result 1:

|A  |F  |N1 |N2 |N3 |N4 |L     |
|---|---|---|---|---|---|------|
|0  |0  |0  |1  |2  |3  |4163  |
|1  |0  |0  |1  |2  |3  |4163  |
|2  |0  |0  |1  |2  |3  |4163  |

Result 2:

|A  |F     |N1    |N2    |N3    |N4    |L     |
|---|------|------|------|------|------|------|
|0  |9182  |9182  |9183  |9184  |9185  |8227  |
|1  |9182  |9182  |9183  |9184  |9185  |8227  |
|2  |9182  |9182  |9183  |9184  |9185  |8227  |


The results are consistent but not deterministic.
As usual I will add a bug ticket in our internal issue tracker :).
Best wishes
Georg

View solution in original post

12 REPLIES 12

exa-GeorgD
Team Exasol
Team Exasol

Hello Lukas,
as always thank you for your contribution and for finding another bug :). As this is is a good opportunity to share some Exasol internals in the community, I will try to give a thorough explanation.


FIRST_VALUE and LAST_VALUE with OVER()

Let us start with your query:

select
  a,
  first_value(a) over (),
  last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a

The Exasol compiler transforms this to the following query:

select
  a,
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL),
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL)
from (values(1), (2), (3), (4)) as t(a)
order by a;


Before you ask, yes there is no typo here ;). This explains also the results you get.
It is wrong to do that optimization if there is more than one empty OVER() in the query.However, the bug is not the existence of this optimization, but that it is used while FIRST_VALUE OVER () or
NTH_VALUE OVER() are present in the query together with LAST_VALUE OVER().
The optimization also only affects LAST_VALUE/FIRST_VALUE with an empty OVER() clause.

Workaround
You basically discovered the workaround yourself. You can replace OVER() with OVER(ORDER BY (SELECT 1)).
However, OVER(ORDER BY (SELECT 1) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) would be slightly faster in Exaol.
Based on the SQL standard, using an ORDER BY in an OVER clause without window frame automatically leads to the default window frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
For a single order value both deliver equivalent results. But this requires an extra detection and optimization by the database.

Consistency
You are also right, that the SQL standard guarantees the same order for analytic functions with the same ordering within a query. This means that all analytic functions with an empty OVER() clause need to have the same order.

Results are nondeterministic
However, what is not guaranteed is ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING for OVER(). In fact, as the computations in Exasol happen in parallel, there is no consistent order between two executions of the same query. You can see that if you execute your query on a larger data set multiple times:

CREATE SCHEMA test;
CREATE OR REPLACE TABLE t1(a int);
CREATE OR REPLACE TABLE ten(col int);
INSERT INTO ten VALUES 0,1,2,3,4,5,6,7,8,9;
INSERT INTO t1 select a.col+10*b.col+100*c.col+1000*d.col from ten a,ten b,ten c,ten d;SELECT COUNT(*) FROM t1;
SELECT FIRST_VALUE(a), LAST_VALUE(a) FROM t1 GROUP BY NULL;
select
  a,
  first_value(a) over (order by (select 1)) f,
  nth_value(a, 1) over (order by (select 1)) n1,
  nth_value(a, 2) over (order by (select 1)) n2,
  nth_value(a, 3) over (order by (select 1)) n3,
  nth_value(a, 4) over (order by (select 1)) n4,
  last_value(a) over (order by (select 1)) l
from t1
order by a limit 3;


There are several possible results. Here are just two, to illustrate the problem:
Result 1:

|A  |F  |N1 |N2 |N3 |N4 |L     |
|---|---|---|---|---|---|------|
|0  |0  |0  |1  |2  |3  |4163  |
|1  |0  |0  |1  |2  |3  |4163  |
|2  |0  |0  |1  |2  |3  |4163  |

Result 2:

|A  |F     |N1    |N2    |N3    |N4    |L     |
|---|------|------|------|------|------|------|
|0  |9182  |9182  |9183  |9184  |9185  |8227  |
|1  |9182  |9182  |9183  |9184  |9185  |8227  |
|2  |9182  |9182  |9183  |9184  |9185  |8227  |


The results are consistent but not deterministic.
As usual I will add a bug ticket in our internal issue tracker :).
Best wishes
Georg

View solution in original post

lukaseder
SQL-Fighter

Thanks a lot for the excellent explanation! That has been very insightful

mwellbro
Xpert

Hi @lukaseder ,

why would you assume the stated semantic ? It feels a bit like indeterministic territory here because if you don´t specify an order ( order by in OVER() clause is always something else compared to order by for the RESULT set ) you don´t get an order ... or am I wrong here ?

Cheers,
Malte

lukaseder
SQL-Fighter

My expectation is by intuition here. Of course, it is better to be explicit about ordering, but since it isn't required by the parser (in some RDBMS it is, but the presence of ORDER BY does not necessarily guarantee determinism), then I'd expect it to be at least consistent with the ordering of FIRST_VALUE and NTH_VALUE:

 

select
  a,
  first_value(a) over () f,
  nth_value(a, 1) over () n1,
  nth_value(a, 2) over () n2,
  nth_value(a, 3) over () n3,
  nth_value(a, 4) over () n4,
  last_value(a) over () l
from (values(1), (2), (3), (4)) as t(a)
order by a

 

Yields:

 

|A  |F  |N1 |N2 |N3 |N4 |L  |
|---|---|---|---|---|---|---|
|1  |1  |1  |2  |3  |4  |1  |
|2  |1  |1  |2  |3  |4  |1  |
|3  |1  |1  |2  |3  |4  |1  |
|4  |1  |1  |2  |3  |4  |1  |

 

So, to me, intuitively, either the NTH_VALUE calculations are inconsistent, or the LAST_VALUE one is. If the 4th value is 4, and the 5th value would be NULL, then the 4th value appears to be the "last" value, no?

Also, that's exactly what PostgreSQL is doing 🙂 https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0d2f2ccd0b5139a51ad624550aa4d3fc, and MySQL as well: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=80eb53041ace73c36602caf2024d99da and Oracle as well: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=ab548eebbd709da7d73abb4a3af5a657 

mwellbro
Xpert

The only argument I´d see there is that from an EXA perspective:

"If the 4th value is 4, and the 5th value would be NULL, then the 4th value appears to be the "last" value, no?"

nth_value(a, 4) over () n4

 => this will not (necessarily) yield the 4th value but some value , so comparing this with an equally non-deterministic call to last_value(a) over () seems misleading.
I´m with you on the "intuition" part, but still, implementation trumps intuition 🙂

Your line of thought seems to assume an order where there is none as far as I can see.

@Exa-* : "Also, that's exactly what(...)" => here´s a chance to increase compatibility to other systems 🙂

lukaseder
SQL-Fighter

My line of thought assumes that irrespective of explicit or implicit (non-deterministic) order, there is an enumeration going on, just like with ROW_NUMBER, and each ordinal is assigned to each row exactly once. In fact, I'm even assuming that for any window specification w, NTH_VALUE(a, n) OVER w will evaluate the expression a for the row where ROW_NUMBER() OVER w = n

Or in other words, even if an order is non-deterministic, there's an accidental order, and ordinals can be assigned non-ambiguously per query execution. If this weren't the case explicit, but non-deterministic ordering also wouldn't work, such as ROW_NUMBER() OVER (ORDER BY (SELECT 1))

mwellbro
Xpert

To pick up on your statement: "It would also be more consistent with NTH_VALUE, which still produces 2 e.g. for NTH_VALUE(a, 2) OVER ()"

I think I did not change the general structure of your sql ( just added a few things left and right ), and nth_value(a,2) over() is by no means guaranteed to deliver "2":

mwellbro_0-1620837155302.png

I think the semantics here are more along the lines of "pick a card, any card" , as described here:

https://docs.exasol.com/sql_references/functions/analyticfunctions.htm

"If the analytic function does not contain an order_clause, Exasol does not sort the data within each partition. Without sorted data, the use of some analytic functions can cause non-deterministic results."

I hope I made sense here ? Sorry, in all honesty I did not entirely follow your last post, will have to read that a couple of times and maybe someone from EXA could chime in here 🙂

 

lukaseder
SQL-Fighter

> and nth_value(a,2) over() is by no means guaranteed to deliver "2":

I did not imply that it is guaranteed to deliver "2". That was just the example I made. It produces the value of a at the 2nd position in the window frame, and in my example, the data happened to be ordered already (by accident).

Anyway, I just noticed this difference to almost all other RDBMS that we're testing jOOQ with (a total of 30, 22 of which support LAST_VALUE). Among the ones that don't simply reject empty OVER () clauses in this case (and even in those cases, we generate OVER (ORDER BY (SELECT 1)) or something similar, so the non-deterministic semantics is ultimately the same), Exasol is the only one I've noticed to exhibit this behaviour, so I thought it might be a bug somewhere, and I'd let you know. The bug being that consistency is still possible and desired in the event of nondeterminism. E.g. in your latest screenshot, I can't reconcile how n1-n4 produce all the 4 distinct values, yet FIRST_VALUE / LAST_VALUE are the same (the "last value, this time", i.e. the value of NTH_VALUE(a, 4)). That just seems inconsistent to me.

mwellbro
Xpert

> "The bug being that consistency is still possible and desired in the event of nondeterminism"

I always thought of nondeterminism as being unable to produce consistency , possible consistency ( i.e. accidental consistency ) isn´t what I´d call consistency.


> "I can't reconcile how n1-n4 produce all the 4 distinct values, yet FIRST_VALUE / LAST_VALUE are the same (the "last value, this time", i.e. the value of NTH_VALUE(a, 4)). That just seems inconsistent to me."
But it´s only inconsistent as long as we assume that the OVER (ORDER BY (SELECT 1)) yields the same order accross those functions, right ? In any case, I´m with you that all in all FIRST / LAST_VALUE shouldn´t yield the same value if there are more than 1 discrete values.

Just for me to get this ( sorry if I´m getting on your nerves with this ) :

mwellbro_0-1620894927097.png

Would you expect FIRST / LAST_VALUE to equal N1 & N8 , the first and last value of "something" ( because of the identical window clause ) or just "at least" different values given the structure of the data set as a whole ?

Someone at EXA told me they are going to post something in this thread tomorrow, looking forward to it 🙂

lukaseder
SQL-Fighter

> I always thought of nondeterminism as being unable to produce consistency , possible consistency ( i.e. accidental consistency ) isn´t what I´d call consistency.

It's not accidental. The ordering is nondeterministic but repeatable across a single query execution, because the specification of the ordering is the same (see below example), and one can reasonably expect that a DBMS doesn't do unnecessary extra work to scramble results here but not there.

> But it´s only inconsistent as long as we assume that the OVER (ORDER BY (SELECT 1)) yields the same order accross those functions, right ?

Yes, I assume this. Of course, an RDBMS may make the case for ordering things completely randomly in the absence of determinism (though I personally don't think that's generally a good idea), but it doesn't seem to be the case elsewhere. I.e. it's not being done for NTH_VALUE, only for FIRST_VALUE and LAST_VALUE, and only when leaving OVER () empty, not when putting an explicitly non-deterministic ORDER BY clause in OVER ()

> Would you expect FIRST / LAST_VALUE to equal N1 & N8 , the first and last value of "something" ( because of the identical window clause ) or just "at least" different values given the structure of the data set as a whole ?

I would expect the former. Otherwise, even the various NTH_VALUE expressions would risk producing duplicate results for no good reason other than educating users about non-determinism.

The current behaviour seems to be very specific to the omission of any content in OVER (). Just look at the result of this query:

select
  a,
  first_value(a) over (order by (select 1)) f,
  nth_value(a, 1) over (order by (select 1)) n1,
  nth_value(a, 2) over (order by (select 1)) n2,
  nth_value(a, 3) over (order by (select 1)) n3,
  nth_value(a, 4) over (order by (select 1)) n4,
  last_value(a) over (order by (select 1)) l
from (values(2), (1), (4), (3)) as t(a)
order by a

There's my desired consistency and expected result, where F=N1 and L=N4:

|A  |F  |N1 |N2 |N3 |N4 |L  |
|---|---|---|---|---|---|---|
|1  |2  |2  |1  |4  |3  |3  |
|2  |2  |2  |1  |4  |3  |3  |
|3  |2  |2  |1  |4  |3  |3  |
|4  |2  |2  |1  |4  |3  |3  |

 Again, there might be a case for random ordering per expression, but if that's only being done occasionally, then it's quite confusing for a user, which is why I think all of this looks more like a bug than a feature.

mwellbro
Xpert

just to add on that:

 

mwellbro_0-1620817855822.png

was a bit surprised with the last_value not being taken from the entire set...need to think on that, seems strange....

mwellbro
Xpert

https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

mwellbro_0-1620818377718.png

The passage above the marked one also clears up my surprise with the last_values not considering the entire set...they are actually computed locally...in all honesty I´d have expected otherwise, but here we are.