
I have applied a groupby and am calculating the standard deviation for two features in a PySpark DataFrame:

from pyspark.sql import functions as f


val1 = [('a',20,100),('a',100,100),('a',50,100),('b',0,100),('b',0,100),('c',0,0),('c',0,50),('c',0,100),('c',0,20)]
cols = ['group','val1','val2']
tf = spark.createDataFrame(val1, cols)
tf.show()

# this call raises the error below
tf.groupby('group').agg(f.stddev(['val1','val2']).alias('val1_std','val2_std'))

but it gives me the following error:

TypeError: _() takes 1 positional argument but 2 were given

How can I do this in PySpark?


The problem is that stddev operates on a single column, not on a list of columns as in your code (hence the error message about 1 positional argument vs. 2). One way to get what you want is to build a separate standard-deviation expression for each column:

# build a stddev expression for each column
expressions = [f.stddev(col).alias('%s_std' % col) for col in ['val1', 'val2']]
# unpack the list of expressions into agg()
tf.groupby('group').agg(*expressions).show()

#+-----+------------------+------------------+
#|group|          val1_std|          val2_std|
#+-----+------------------+------------------+
#|    c|               0.0|43.493294502332965|
#|    b|               0.0|               0.0|
#|    a|40.414518843273804|               0.0|
#+-----+------------------+------------------+
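
If you only have a couple of columns, the same aggregation can also be written out explicitly, which makes it clear that each column gets its own stddev call. A minimal sketch, using the tf DataFrame from the question:

# equivalent explicit form: one f.stddev() call per column
tf.groupby('group').agg(
    f.stddev('val1').alias('val1_std'),
    f.stddev('val2').alias('val2_std')
).show()

The list-comprehension version scales better when the column list is long or only known at runtime. Note also that f.stddev is an alias for stddev_samp (the sample standard deviation); use f.stddev_pop if you want the population standard deviation instead.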
