FIRST_VALUE and LAST_VALUE with empty OVER() shouldn't produce identical result

lukaseder
SQL-Fighter

I'm using Exasol 7.0.9.

The following query:

 

select
  a,
  first_value(a) over (),
  last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a

 

Produces:

 

|A  |FIRST_VALUE(T.A) OVER()|LAST_VALUE(T.A) OVER()|
|---|-----------------------|----------------------|
|1  |1                      |1                     |
|2  |1                      |1                     |
|3  |1                      |1                     |
|4  |1                      |1                     |

 

I think the LAST_VALUE function should produce 4, assuming the semantics of empty OVER() clauses for these functions is ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

It would also be more consistent with NTH_VALUE, which still produces 2 e.g. for NTH_VALUE(a, 2) OVER ()

12 REPLIES 12

lukaseder
SQL-Fighter

> I always thought of nondeterminism as being unable to produce consistency , possible consistency ( i.e. accidental consistency ) isn´t what I´d call consistency.

It's not accidental. The ordering is nondeterministic but repeatable across a single query execution, because the specification of the ordering is the same (see below example), and one can reasonably expect that a DBMS doesn't do unnecessary extra work to scramble results here but not there.

> But it´s only inconsistent as long as we assume that the OVER (ORDER BY (SELECT 1)) yields the same order accross those functions, right ?

Yes, I assume this. Of course, an RDBMS may make the case for ordering things completely randomly in the absence of determinism (though I personally don't think that's generally a good idea), but it doesn't seem to be the case elsewhere. I.e. it's not being done for NTH_VALUE, only for FIRST_VALUE and LAST_VALUE, and only when leaving OVER () empty, not when putting an explicitly non-deterministic ORDER BY clause in OVER ()

> Would you expect FIRST / LAST_VALUE to equal N1 & N8 , the first and last value of "something" ( because of the identical window clause ) or just "at least" different values given the structure of the data set as a whole ?

I would expect the former. Otherwise, even the various NTH_VALUE expressions would risk producing duplicate results for no good reason other than educating users about non-determinism.

The current behaviour seems to be very specific to the omission of any content in OVER (). Just look at the result of this query:

select
  a,
  first_value(a) over (order by (select 1)) f,
  nth_value(a, 1) over (order by (select 1)) n1,
  nth_value(a, 2) over (order by (select 1)) n2,
  nth_value(a, 3) over (order by (select 1)) n3,
  nth_value(a, 4) over (order by (select 1)) n4,
  last_value(a) over (order by (select 1)) l
from (values(2), (1), (4), (3)) as t(a)
order by a

There's my desired consistency and expected result, where F=N1 and L=N4:

|A  |F  |N1 |N2 |N3 |N4 |L  |
|---|---|---|---|---|---|---|
|1  |2  |2  |1  |4  |3  |3  |
|2  |2  |2  |1  |4  |3  |3  |
|3  |2  |2  |1  |4  |3  |3  |
|4  |2  |2  |1  |4  |3  |3  |

 Again, there might be a case for random ordering per expression, but if that's only being done occasionally, then it's quite confusing for a user, which is why I think all of this looks more like a bug than a feature.

exa-GeorgD
Team Exasol
Team Exasol

Hello Lukas,
as always thank you for your contribution and for finding another bug :). As this is is a good opportunity to share some Exasol internals in the community, I will try to give a thorough explanation.


FIRST_VALUE and LAST_VALUE with OVER()

Let us start with your query:

select
  a,
  first_value(a) over (),
  last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a

The Exasol compiler transforms this to the following query:

select
  a,
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL),
  (SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL)
from (values(1), (2), (3), (4)) as t(a)
order by a;


Before you ask, yes there is no typo here ;). This explains also the results you get.
It is wrong to do that optimization if there is more than one empty OVER() in the query.However, the bug is not the existence of this optimization, but that it is used while FIRST_VALUE OVER () or
NTH_VALUE OVER() are present in the query together with LAST_VALUE OVER().
The optimization also only affects LAST_VALUE/FIRST_VALUE with an empty OVER() clause.

Workaround
You basically discovered the workaround yourself. You can replace OVER() with OVER(ORDER BY (SELECT 1)).
However, OVER(ORDER BY (SELECT 1) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) would be slightly faster in Exaol.
Based on the SQL standard, using an ORDER BY in an OVER clause without window frame automatically leads to the default window frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
For a single order value both deliver equivalent results. But this requires an extra detection and optimization by the database.

Consistency
You are also right, that the SQL standard guarantees the same order for analytic functions with the same ordering within a query. This means that all analytic functions with an empty OVER() clause need to have the same order.

Results are nondeterministic
However, what is not guaranteed is ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING for OVER(). In fact, as the computations in Exasol happen in parallel, there is no consistent order between two executions of the same query. You can see that if you execute your query on a larger data set multiple times:

CREATE SCHEMA test;
CREATE OR REPLACE TABLE t1(a int);
CREATE OR REPLACE TABLE ten(col int);
INSERT INTO ten VALUES 0,1,2,3,4,5,6,7,8,9;
INSERT INTO t1 select a.col+10*b.col+100*c.col+1000*d.col from ten a,ten b,ten c,ten d;SELECT COUNT(*) FROM t1;
SELECT FIRST_VALUE(a), LAST_VALUE(a) FROM t1 GROUP BY NULL;
select
  a,
  first_value(a) over (order by (select 1)) f,
  nth_value(a, 1) over (order by (select 1)) n1,
  nth_value(a, 2) over (order by (select 1)) n2,
  nth_value(a, 3) over (order by (select 1)) n3,
  nth_value(a, 4) over (order by (select 1)) n4,
  last_value(a) over (order by (select 1)) l
from t1
order by a limit 3;


There are several possible results. Here are just two, to illustrate the problem:
Result 1:

|A  |F  |N1 |N2 |N3 |N4 |L     |
|---|---|---|---|---|---|------|
|0  |0  |0  |1  |2  |3  |4163  |
|1  |0  |0  |1  |2  |3  |4163  |
|2  |0  |0  |1  |2  |3  |4163  |

Result 2:

|A  |F     |N1    |N2    |N3    |N4    |L     |
|---|------|------|------|------|------|------|
|0  |9182  |9182  |9183  |9184  |9185  |8227  |
|1  |9182  |9182  |9183  |9184  |9185  |8227  |
|2  |9182  |9182  |9183  |9184  |9185  |8227  |


The results are consistent but not deterministic.
As usual I will add a bug ticket in our internal issue tracker :).
Best wishes
Georg

View solution in original post

lukaseder
SQL-Fighter

Thanks a lot for the excellent explanation! That has been very insightful