I am applying a groupby and calculating the standard deviation for two features in a PySpark dataframe:

from pyspark.sql import functions as f

val1 = [('a',20,100),('a',100,100),('a',50,100),('b',0,100),('b',0,100),('c',0,0),('c',0,50),('c',0,100),('c',0,20)]
cols = ['group','val1','val2']
tf = spark.createDataFrame(val1, cols)

but it gives me the following error:

TypeError: _() takes 1 positional argument but 2 were given

How can I do this in PySpark?


The problem is that f.stddev takes a single column, not several columns at once as in the code you have written (hence the error message about 1 vs 2 arguments). One way to get what you are looking for is to build a separate standard deviation expression for each column and pass them all to agg:

# std dev for each col
expressions = [f.stddev(col).alias('%s_std'%(col)) for col in ['val1','val2']]
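# Now run it: unpack the list so agg receives one expression per column
tf.groupBy('group').agg(*expressions).show()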

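#+-----+------------------+------------------+
#|group|          val1_std|          val2_std|
#+-----+------------------+------------------+
#|    c|               0.0|43.493294502332965|
#|    b|               0.0|               0.0|
#|    a|40.414518843273804|               0.0|
#+-----+------------------+------------------+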
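
As a sanity check, the numbers in the table can be reproduced without Spark. This is only a sketch: it relies on f.stddev being the sample standard deviation (it is an alias for stddev_samp, which divides by n - 1), and the helper name sample_std_by_group is made up for this illustration.

```python
from collections import defaultdict
from statistics import stdev  # sample standard deviation (n - 1 denominator)

# Same rows as in the question: (group, val1, val2)
rows = [('a', 20, 100), ('a', 100, 100), ('a', 50, 100),
        ('b', 0, 100), ('b', 0, 100),
        ('c', 0, 0), ('c', 0, 50), ('c', 0, 100), ('c', 0, 20)]

def sample_std_by_group(rows):
    """Return {group: (std of val1, std of val2)} for the given rows.

    Hypothetical helper for illustration only; not part of PySpark.
    Every group needs at least two rows, otherwise stdev raises an error.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, v1, v2 in rows:
        grouped[group][0].append(v1)
        grouped[group][1].append(v2)
    return {g: (stdev(a), stdev(b)) for g, (a, b) in grouped.items()}

for group, (s1, s2) in sorted(sample_std_by_group(rows).items()):
    print(group, s1, s2)
```

If the values printed here agree with the Spark output, the per-column expressions are doing what you expect. If you want the population standard deviation instead, PySpark has f.stddev_pop, and the plain Python counterpart is statistics.pstdev.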

