groupby inside udf python set script error

Lapchenkov
Contributor

Hi! Сan someone explain to a beginner why an error appears here:

CREATE OR REPLACE PYTHON3 SET SCRIPT "GOODS3" (...) EMITS (...) AS
import requests
import json
import pandas as pd
import numpy as np

def evaluate(df):
    df = df.groupby(['NETWORK_ID'])['PHARMACY_CODE_HISTORY'].count().reset_index()
    return df

def run(ctx):
    df = ctx.get_dataframe(num_rows='all')
    results = evaluate(df)
    ctx.emit(results)
SELECT "GOODS3"(NETWORK_ID, PHARMACY_CODE_HISTORY)
EMITS (NETWORK_ID INT, PHARMACY_CODE_HISTORY INT)
FROM TIER3.PHARMACIES

get error:

SQL Error [22002]: VM error: KeyError: 'Traceback (most recent call last):\n File "GOODS3", line 12, in run\n File "GOODS3", line 7, in evaluate\n File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 5162, in groupby\n **kwargs)\n File "/usr/lib/python3/dist-packages/pandas/core/groupby.py", line 1848, in groupby\n return klass(obj, by, **kwds)\n File "/usr/lib/python3/dist-packages/pandas/core/groupby.py", line 516, in __init__\n mutated=self.mutated)\n File "/usr/lib/python3/dist-packages/pandas/core/groupby.py", line 2934, in _get_grouper\n raise KeyError(gpr)\nKeyError: \'NETWORK_ID\'\n' (Session: 1695507820363597696)

But if i change for this one:

CREATE OR REPLACE PYTHON3 SET SCRIPT "GOODS3" (...) EMITS (...) AS
import requests
import json
import pandas as pd
import numpy as np

def evaluate(df):
    df = df.head(3)
    return df

def run(ctx):
    df = ctx.get_dataframe(num_rows='all')
    results = evaluate(df)
    ctx.emit(results)

 

 SELECT "GOODS3"(NETWORK_ID, PHARMACY_CODE_HISTORY)
 EMITS (NETWORK_ID INT, PHARMACY_CODE_HISTORY VARCHAR(100))
 FROM TIER3.PHARMACIES

 it woks fine

1 ACCEPTED SOLUTION

mwellbro
Xpert

Hi @Lapchenkov ,

I think you are confusing the headers provided by the EMIT with what is actually present in your dataframe ?

mwellbro_0-1617028422346.png

mwellbro_1-1617028443548.png

 

View solution in original post

4 REPLIES 4

mwellbro
Xpert

Hi @Lapchenkov ,

I think you are confusing the headers provided by the EMIT with what is actually present in your dataframe ?

mwellbro_0-1617028422346.png

mwellbro_1-1617028443548.png

 

View solution in original post

Lapchenkov
Contributor

@mwellbro , thank you! Now i got it !

mwellbro
Xpert

Hi @Lapchenkov ,

when I check the df.dtypes it seems like the columns in your dataframe don´t get the names from the SQL, hence NETWORK_ID

actually does not exist in your df.

Not great at python here, so these are just ideas:

1) work with column ids ( I got 0 and 1 as column headers when testing with a simple 2 column query similar to yours )

2) modify the df headers ( I don´t know how ) to reflect the column names

 

Cheers,
Malte

Lapchenkov
Contributor

Hi @mwellbro , thank you for help! 

when i use 

df = df.head(3)

 i see in resul this NETWORK_ID :

 

 SELECT "GOODS3"(NETWORK_ID, PHARMACY_CODE_HISTORY)
 EMITS (NETWORK_ID INT, PHARMACY_CODE_HISTORY VARCHAR(100))
 FROM TIER3.PHARMACIES

 gives me:

Lapchenkov_0-1617014463930.png

difference between 2 scripts : 

df = df.head(3) works  

df = df.groupby(['NETWORK_ID'])['PHARMACY_CODE_HISTORY'].count().reset_index() error