USING vs ON join syntax performance difference ... why?

PeterK
Xpert

https://docs.exasol.com/7.0/performance/best_practices.htm?tocpath=Performance%20Guide%7C_____3#Avoi...  recommends using the ON join syntax instead of USING syntax however I don't understand the explanation given for why USING is slower. Can somebody explain it for me?

The explanation refers to figuring out which table to pull the non-joining column name from (if it appears in both tables presumably). However in that case both name columns are displayed separately so there doesn't seem to be any choice that needs to be made.

with
t1 (id, name) as (values (1, 'a')),
t2 (id, name) as (values (1, 'b'))
select * from t1 join t2 using(id);

ID NAME NAME 
-- ---- ---- 
1  a    b    

 

 If the documentation meant that it has to decide which value to use for displaying the id column then it seems that it can always just display the value of id from the first table since USING(id) is an equality check.

So it's not clear to me why it needs to be handled differently from the ON syntax.

4 REPLIES 4

exa-Aleksandr
Team Exasol
Team Exasol

Hi @PeterK , hi @mwellbro 

Basically it's EXASOL-2599. Deep inside discussions around this Roadmap ticket it is mentioned that our strict adherence to SQL Standard makes us transform joins like this (with combinatorial explosion in the number of joins). And for INNER joins it is probably possible to internally rewrite a query as a simple ON equijoin.

The status is OPEN => needs to be prioritized to be implemented.

PeterK
Xpert

Good to know - Thanks @exa-Aleksandr !

mwellbro
Xpert

Hi Peter,

EXA definitely handles the "USING join" differently from "ON joins", for a little EXA-mple:

 


-- take Peters SQLs and crank them up to eleven
with
t1 (id, name) as (select level,'a' from dual connect by level<=200000),
t2 (id, name) as (select level,'b' from dual connect by level<=100000),
t3 (id, name) as (select level,'c' from dual connect by level<=300000)
select * from t1 join t2 using(id)
                 join t3 using(id);

with
t1 (id, name) as (select level,'a' from dual connect by level<=200000),
t2 (id, name) as (select level,'b' from dual connect by level<=100000),
t3 (id, name) as (select level,'c' from dual connect by level<=300000)
select * from t1 join t2 on t1.id=t2.id
                 join t3 on t1.id=t3.id;

 

 

One thing to note in this case is the difference in the width of the result set ( 4 cols in the first and 6 cols in the seconds query as we are preserving the "id"´s from all tables ).
While that carries a little memory footprint implication, it´s not much of a performance hit:

Computation for table 'tmp_subselect6' (4 cols) with 100000 of 100000 global rows and 1491457/14336 data/memory size completed.

vs.

Computation for table 'tmp_subselect6' (6 cols) with 100000 of 100000 global rows and 2899314/21504 data/memory size completed.

 

What might be a bit more of a hit is the transformation that is done under the covers:

[Compiler] assigning DBO-name tmp_subselect6 to query 'SELECT ((COALESCE(COALESCE(ID,ID),ID)),NAME,NAME,NAME) FROM QESQL92JoinedTable( /*tmp_subselect3*/, /*RPL:tmp_subselect4*/, /*tmp_subselect5*/, <( /*tmp_subselect3*/) INNER JOIN ( /*RPL:tmp_subselect4*/) ON ID=(U)ID [FT_EQ] >, <( /*tmp_subselect3*/, /*RPL:tmp_subselect4*/) INNER JOIN ( /*tmp_subselect5*/) ON (COALESCE(ID,ID))=(U)ID [FT_EQ] >) WHERE TRUE PREFERRING TRUE'

vs.

[Compiler] assigning DBO-name tmp_subselect6 to query 'SELECT (ID,NAME,ID,NAME,ID,NAME) FROM QESQL92JoinedTable( /*tmp_subselect3*/, /*RPL:tmp_subselect4*/, /*tmp_subselect5*/, <( /*tmp_subselect3*/) INNER JOIN ( /*RPL:tmp_subselect4*/) ON ID=ID [FT_EQ] >, <( /*tmp_subselect3*/) INNER JOIN ( /*tmp_subselect5*/) ON ID=ID [FT_EQ] >, <( /*RPL:tmp_subselect4*/) INNER JOIN ( /*tmp_subselect5*/) ON ID=ID [FT_EQ] >) WHERE TRUE PREFERRING TRUE'

In case of an INNER JOIN I agree with you that this wouldn´t be necessary as none of the IDs stands a chance of becoming NULL but I guess there´s a reason for keeping this logic
even in this particular case.

I generally avoid the USING syntax , but that´s more of a personal preference - I could imagine scenarios where the phenomenon above might really sting ( more joins and higher data volumes ) , so I don´t see a reason to use USING joins.

Cheers,
Malte

PeterK
Xpert

Yes - It's definitely slower to use many USINGs because of the translation to COALESCE. We had to change our reporting tool to avoid USING() when we switched to Exasol because of this.

I guess my question is more wondering why the translation to COALESCE is necessary in the first place. Maybe some exa-folks can shed some light?