cancel
Showing results for
Did you mean:

# FIRST_VALUE and LAST_VALUE with empty OVER() shouldn't produce identical result

SQL-Fighter

I'm using Exasol 7.0.9.

The following query:

``````select
a,
first_value(a) over (),
last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a``````

Produces:

``````|A  |FIRST_VALUE(T.A) OVER()|LAST_VALUE(T.A) OVER()|
|---|-----------------------|----------------------|
|1  |1                      |1                     |
|2  |1                      |1                     |
|3  |1                      |1                     |
|4  |1                      |1                     |``````

I think the `LAST_VALUE` function should produce 4, assuming the semantics of empty `OVER()` clauses for these functions is `ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`

It would also be more consistent with `NTH_VALUE`, which still produces 2 e.g. for `NTH_VALUE(a, 2) OVER ()`

1 ACCEPTED SOLUTION

Accepted Solutions
Team Exasol

## FIRST_VALUE and LAST_VALUE with OVER()

``````select
a,
first_value(a) over (),
last_value(a) over ()
from (values(1), (2), (3), (4)) as t(a)
order by a``````

## The Exasol compiler transforms this to the following query:

``````select
a,
(SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL),
(SELECT first_value(a) FROM (values(1), (2), (3), (4)) as t(a) GROUP BY NULL)
from (values(1), (2), (3), (4)) as t(a)
order by a;``````

## Results are nondeterministicHowever, what is not guaranteed is `ORDER BY <function argument> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` for `OVER()`. In fact, as the computations in Exasol happen in parallel, there is no consistent order between two executions of the same query. You can see that if you execute your query on a larger data set multiple times:

``````CREATE SCHEMA test;
CREATE OR REPLACE TABLE t1(a int);
CREATE OR REPLACE TABLE ten(col int);
INSERT INTO ten VALUES 0,1,2,3,4,5,6,7,8,9;
INSERT INTO t1 select a.col+10*b.col+100*c.col+1000*d.col from ten a,ten b,ten c,ten d;SELECT COUNT(*) FROM t1;
SELECT FIRST_VALUE(a), LAST_VALUE(a) FROM t1 GROUP BY NULL;
select
a,
first_value(a) over (order by (select 1)) f,
nth_value(a, 1) over (order by (select 1)) n1,
nth_value(a, 2) over (order by (select 1)) n2,
nth_value(a, 3) over (order by (select 1)) n3,
nth_value(a, 4) over (order by (select 1)) n4,
last_value(a) over (order by (select 1)) l
from t1
order by a limit 3;``````

There are several possible results. Here are just two, to illustrate the problem:
Result 1:

``````|A  |F  |N1 |N2 |N3 |N4 |L     |
|---|---|---|---|---|---|------|
|0  |0  |0  |1  |2  |3  |4163  |
|1  |0  |0  |1  |2  |3  |4163  |
|2  |0  |0  |1  |2  |3  |4163  |
``````

Result 2:

``````|A  |F     |N1    |N2    |N3    |N4    |L     |
|---|------|------|------|------|------|------|
|0  |9182  |9182  |9183  |9184  |9185  |8227  |
|1  |9182  |9182  |9183  |9184  |9185  |8227  |
|2  |9182  |9182  |9183  |9184  |9185  |8227  |``````

The results are consistent but not deterministic.
As usual I will add a bug ticket in our internal issue tracker :).
Best wishes
Georg

12 REPLIES 12
Xpert

Hi @lukaseder ,

why would you assume the stated semantic ? It feels a bit like indeterministic territory here because if you don´t specify an order ( order by in OVER() clause is always something else compared to order by for the RESULT set ) you don´t get an order ... or am I wrong here ?

Cheers,
Malte

Xpert

was a bit surprised with the last_value not being taken from the entire set...need to think on that, seems strange....

Xpert

https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

The passage above the marked one also clears up my surprise with the last_values not considering the entire set...they are actually computed locally...in all honesty I´d have expected otherwise, but here we are.

SQL-Fighter

My expectation is by intuition here. Of course, it is better to be explicit about ordering, but since it isn't required by the parser (in some RDBMS it is, but the presence of `ORDER BY` does not necessarily guarantee determinism), then I'd expect it to be at least consistent with the ordering of `FIRST_VALUE` and `NTH_VALUE`:

``````select
a,
first_value(a) over () f,
nth_value(a, 1) over () n1,
nth_value(a, 2) over () n2,
nth_value(a, 3) over () n3,
nth_value(a, 4) over () n4,
last_value(a) over () l
from (values(1), (2), (3), (4)) as t(a)
order by a``````

Yields:

``````|A  |F  |N1 |N2 |N3 |N4 |L  |
|---|---|---|---|---|---|---|
|1  |1  |1  |2  |3  |4  |1  |
|2  |1  |1  |2  |3  |4  |1  |
|3  |1  |1  |2  |3  |4  |1  |
|4  |1  |1  |2  |3  |4  |1  |``````

So, to me, intuitively, either the `NTH_VALUE` calculations are inconsistent, or the `LAST_VALUE` one is. If the 4th value is 4, and the 5th value would be `NULL`, then the 4th value appears to be the "last" value, no?

Also, that's exactly what PostgreSQL is doing 🙂 https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0d2f2ccd0b5139a51ad624550aa4d3fc, and MySQL as well: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=80eb53041ace73c36602caf2024d99da and Oracle as well: https://dbfiddle.uk/?rdbms=oracle_18&fiddle=ab548eebbd709da7d73abb4a3af5a657

Xpert

The only argument I´d see there is that from an EXA perspective:

"If the 4th value is 4, and the 5th value would be `NULL`, then the 4th value appears to be the "last" value, no?"

``nth_value(a, 4) over () n4``

=> this will not (necessarily) yield the 4th value but some value , so comparing this with an equally non-deterministic call to last_value(a) over () seems misleading.
I´m with you on the "intuition" part, but still, implementation trumps intuition 🙂

Your line of thought seems to assume an order where there is none as far as I can see.

@Exa-* : "Also, that's exactly what(...)" => here´s a chance to increase compatibility to other systems 🙂

SQL-Fighter

My line of thought assumes that irrespective of explicit or implicit (non-deterministic) order, there is an enumeration going on, just like with `ROW_NUMBER`, and each ordinal is assigned to each row exactly once. In fact, I'm even assuming that for any window specification `w`, `NTH_VALUE(a, n) OVER w` will evaluate the expression `a` for the row where `ROW_NUMBER() OVER w = n`

Or in other words, even if an order is non-deterministic, there's an accidental order, and ordinals can be assigned non-ambiguously per query execution. If this weren't the case explicit, but non-deterministic ordering also wouldn't work, such as `ROW_NUMBER() OVER (ORDER BY (SELECT 1))`

Xpert

To pick up on your statement: "It would also be more consistent with `NTH_VALUE`, which still produces 2 e.g. for `NTH_VALUE(a, 2) OVER ()"`

I think I did not change the general structure of your sql ( just added a few things left and right ), and nth_value(a,2) over() is by no means guaranteed to deliver "2":

I think the semantics here are more along the lines of "pick a card, any card" , as described here:

https://docs.exasol.com/sql_references/functions/analyticfunctions.htm

"If the analytic function does not contain an order_clause, Exasol does not sort the data within each partition. Without sorted data, the use of some analytic functions can cause non-deterministic results."

I hope I made sense here ? Sorry, in all honesty I did not entirely follow your last post, will have to read that a couple of times and maybe someone from EXA could chime in here 🙂

SQL-Fighter

> and nth_value(a,2) over() is by no means guaranteed to deliver "2":

I did not imply that it is guaranteed to deliver "2". That was just the example I made. It produces the value of a at the 2nd position in the window frame, and in my example, the data happened to be ordered already (by accident).

Anyway, I just noticed this difference to almost all other RDBMS that we're testing jOOQ with (a total of 30, 22 of which support `LAST_VALUE`). Among the ones that don't simply reject empty `OVER ()` clauses in this case (and even in those cases, we generate `OVER (ORDER BY (SELECT 1))` or something similar, so the non-deterministic semantics is ultimately the same), Exasol is the only one I've noticed to exhibit this behaviour, so I thought it might be a bug somewhere, and I'd let you know. The bug being that consistency is still possible and desired in the event of nondeterminism. E.g. in your latest screenshot, I can't reconcile how n1-n4 produce all the 4 distinct values, yet `FIRST_VALUE` / `LAST_VALUE` are the same (the "last value, this time", i.e. the value of `NTH_VALUE(a, 4)`). That just seems inconsistent to me.

Xpert

> "The bug being that consistency is still possible and desired in the event of nondeterminism"

I always thought of nondeterminism as being unable to produce consistency , possible consistency ( i.e. accidental consistency ) isn´t what I´d call consistency.

> "I can't reconcile how n1-n4 produce all the 4 distinct values, yet `FIRST_VALUE` / `LAST_VALUE` are the same (the "last value, this time", i.e. the value of `NTH_VALUE(a, 4)`). That just seems inconsistent to me."
But it´s only inconsistent as long as we assume that the OVER (ORDER BY (SELECT 1)) yields the same order accross those functions, right ? In any case, I´m with you that all in all FIRST / LAST_VALUE shouldn´t yield the same value if there are more than 1 discrete values.

Just for me to get this ( sorry if I´m getting on your nerves with this ) :

Would you expect FIRST / LAST_VALUE to equal N1 & N8 , the first and last value of "something" ( because of the identical window clause ) or just "at least" different values given the structure of the data set as a whole ?

Someone at EXA told me they are going to post something in this thread tomorrow, looking forward to it 🙂