I have a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame(
    [("a", "65"), ("b", "23"), ("c", "65"), ("d", "23"),
     ("a", "66"), ("b", "46"), ("c", "23"), ("d", "66"),
     ("b", "5"), ("b", "3"), ("c", "3")],
    ["column2", "value"])
df.show()

+-------+-----+
|column2|value|
+-------+-----+
|      a|   65|
|      b|   23|
|      c|   65|
|      d|   23|
|      a|   66|
|      b|   46|
|      c|   23|
|      d|   66|
|      b|    5|
|      b|    3|
|      c|    3|
+-------+-----+

I want to make every 4 rows into one group, and then create a new column that assigns the group number to the corresponding rows. So the desired output is the following:

+-------+-----+------+
|column2|value|gr_val|
+-------+-----+------+
|      a|   65|     1|
|      b|   23|     1|
|      c|   65|     1|
|      d|   23|     1|
|      a|   66|     2|
|      b|   46|     2|
|      c|   23|     2|
|      d|   66|     2|
|      b|    5|     3|
|      b|    3|     3|
|      c|    3|     3|
+-------+-----+------+

I would appreciate any help!

Try this approach -

(1) Create a new column (dummy) that holds a sequentially increasing number for each row. lit('a') is used as a static value so that row_number() can generate this sequence without a real partitioning or ordering column.

(2) Divide that row number by the number of records you want in each group (e.g. 4) and take the ceiling. ceil returns the smallest integer not less than the value.
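
As a quick sanity check on step (2), the mapping that ceil(row_num/4) performs can be reproduced in plain Python (this snippet is only an illustration, not part of the Spark job):

import math

group_size = 4
for row_num in range(1, 12):
    print(row_num, math.ceil(row_num / group_size))
# rows 1-4 -> group 1, rows 5-8 -> group 2, rows 9-11 -> group 3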

Here is a detailed example -

from pyspark.sql.functions import *
from pyspark.sql.window import *

# Dummy window: partitioning and ordering by the constant lit('a') puts every row
# in a single partition, so row_number() yields 1, 2, 3, ... across the whole DataFrame.
w = Window().partitionBy(lit('a')).orderBy(lit('a'))

# Number the rows, then map each block of 4 consecutive rows to the same group id.
df.withColumn("row_num", row_number().over(w))\
    .selectExpr('column2', 'value', 'ceil(row_num/4) as gr_val')\
    .show()

#+-------+-----+------+
#|column2|value|gr_val|
#+-------+-----+------+
#|      a|   65|     1|
#|      b|   23|     1|
#|      c|   65|     1|
#|      d|   23|     1|
#|      a|   66|     2|
#|      b|   46|     2|
#|      c|   23|     2|
#|      d|   66|     2|
#|      b|    5|     3|
#|      b|    3|     3|
#|      c|    3|     3|
#+-------+-----+------+
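
Note: ordering the window by a constant such as lit('a') does not give Spark a real sort key, so the order in which rows get numbered is not strictly guaranteed. If the original order of the DataFrame matters, a common variation (sketched below as an assumption, not part of the answer above) is to tag each row with monotonically_increasing_id() first and order the window by that id:

from pyspark.sql.functions import ceil, col, monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# A window with no partitionBy still pulls all rows into one partition,
# so Spark will warn about performance on large data, same as the lit('a') trick.
w = Window.orderBy("mono_id")

df.withColumn("mono_id", monotonically_increasing_id())\
    .withColumn("row_num", row_number().over(w))\
    .withColumn("gr_val", ceil(col("row_num") / 4))\
    .drop("mono_id", "row_num")\
    .show()
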
  • Thanks for the answer. I have a question regarding the window function: what is the role of partitionBy(lit('a')) there? – Sascha Apr 15 at 7:58
  • I need to assign a sequentially increasing row number to each row, hence I require a column for partitionBy. So here I am using a dummy column, lit('a'). This has the static value "a" across all rows, and because the partition-by column has a single value, all the records in the DataFrame fall into a single partition and a sequentially increasing number is assigned. – Shan Apr 15 at 8:17
