The use case of ๐ž๐ฑ๐ฉ๐ซ() in PySpark

ยท

3 min read

The use case of ๐ž๐ฑ๐ฉ๐ซ() in PySpark

โ˜‘๏ธ It is a SQL function in pyspark to ๐ž๐ฑ๐ž๐œ๐ฎ๐ญ๐ž ๐’๐๐‹-๐ฅ๐ข๐ค๐ž ๐ž๐ฑ๐ฉ๐ซ๐ž๐ฌ๐ฌ๐ข๐จ๐ง๐ฌ.

๐Ÿ”ต ๐’๐ฒ๐ง๐ญ๐š๐ฑ:- ๐ž๐ฑ๐ฉ๐ซ(๐ฌ๐ญ๐ซ)

โ˜‘๏ธ It will take SQL expression as a ๐ฌ๐ญ๐ซ๐ข๐ง๐  ๐š๐ซ๐ ๐ฎ๐ฆ๐ž๐ง๐ญ and performs the operations within the expression.

โ˜‘๏ธ It allows using SQL-like functions that are not present in PySpark Column type & pyspark.sql.functions API. Ex:- ๐‚๐€๐’๐„ ๐–๐‡๐„๐

โ˜‘๏ธ We are allowed to use ๐ƒ๐š๐ญ๐š๐…๐ซ๐š๐ฆ๐ž ๐œ๐จ๐ฅ๐ฎ๐ฆ๐ง๐ฌ in the expression.

Follow for more:- linkedin.com/in/kishanyadav Notebook Link:- github.com/kishanpython/PySpark-Notebooks/b..

# install pyspark 
!pip install pyspark

# importing neccessary libs
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create dataframe
data = [("Prashant","Banglore",25, 58, "2022-08-01", 1), ("Ankit","Banglore",26, 54, "2021-05-02", 2),
        ("Ramakant","Gurugram",24, 60, "2022-06-02", 3), ("Brijesh","Gazipur", 26, 75, "2022-07-04", 4),
        ("Devendra","Gurugram", 27, 62, "2022-04-03", 5), ("Ajay","Chandigarh", 25, 72, "2022-02-01", 6)]
columns= ["friends_name","location", "age", "weight", "meetup_date", "offset"]
df_friends = spark.createDataFrame(data = data, schema = columns)
df_friends.show()

# Output:-
+------------+----------+---+------+-----------+------+
|friends_name|  location|age|weight|meetup_date|offset|
+------------+----------+---+------+-----------+------+
|    Prashant|  Banglore| 25|    58| 2022-08-01|     1|
|       Ankit|  Banglore| 26|    54| 2021-05-02|     2|
|    Ramakant|  Gurugram| 24|    60| 2022-06-02|     3|
|     Brijesh|   Gazipur| 26|    75| 2022-07-04|     4|
|    Devendra|  Gurugram| 27|    62| 2022-04-03|     5|
|        Ajay|Chandigarh| 25|    72| 2022-02-01|     6|
+------------+----------+---+------+-----------+------+


# Concatenating One or More Columns
# concate friends name, age and location columns using expr()
df_concat = df_friends.withColumn("name-age-location", expr("friends_name|| '-'|| age || '-' || location"))
df_concat.show()

# Output:-
+------------+----------+---+------+-----------+------+--------------------+
|friends_name|  location|age|weight|meetup_date|offset|   name-age-location|
+------------+----------+---+------+-----------+------+--------------------+
|    Prashant|  Banglore| 25|    58| 2022-08-01|     1|Prashant-25-Banglore|
|       Ankit|  Banglore| 26|    54| 2021-05-02|     2|   Ankit-26-Banglore|
|    Ramakant|  Gurugram| 24|    60| 2022-06-02|     3|Ramakant-24-Gurugram|
|     Brijesh|   Gazipur| 26|    75| 2022-07-04|     4|  Brijesh-26-Gazipur|
|    Devendra|  Gurugram| 27|    62| 2022-04-03|     5|Devendra-27-Gurugram|
|        Ajay|Chandigarh| 25|    72| 2022-02-01|     6|  Ajay-25-Chandigarh|
+------------+----------+---+------+-----------+------+--------------------+


# Add a New Column Based on Conditions
# check if exercise needed based on weight
# if weight is more or equal to 60 -- Yes
# if weight is less than 55 -- No
# else "Enjoy"
df_condition = df_friends.withColumn("Exercise_Need", expr("CASE WHEN weight >= 60  THEN 'Yes' " + "WHEN  weight < 55  THEN 'No' ELSE 'Enjoy' END"))
df_condition.show()

# Output:-
+------------+----------+---+------+-----------+------+-------------+
|friends_name|  location|age|weight|meetup_date|offset|Exercise_Need|
+------------+----------+---+------+-----------+------+-------------+
|    Prashant|  Banglore| 25|    58| 2022-08-01|     1|        Enjoy|
|       Ankit|  Banglore| 26|    54| 2021-05-02|     2|           No|
|    Ramakant|  Gurugram| 24|    60| 2022-06-02|     3|          Yes|
|     Brijesh|   Gazipur| 26|    75| 2022-07-04|     4|          Yes|
|    Devendra|  Gurugram| 27|    62| 2022-04-03|     5|          Yes|
|        Ajay|Chandigarh| 25|    72| 2022-02-01|     6|          Yes|
+------------+----------+---+------+-----------+------+-------------+



# Creating a new column using the existing column value inside the expression
# let increment the meetup month by number of offset
df_meetup = df_friends.withColumn("new_meetup_date", expr("add_months(meetup_date,offset)"))
df_meetup.show()

# Output:-

+------------+----------+---+------+-----------+------+---------------+
|friends_name|  location|age|weight|meetup_date|offset|new_meetup_date|
+------------+----------+---+------+-----------+------+---------------+
|    Prashant|  Banglore| 25|    58| 2022-08-01|     1|     2022-09-01|
|       Ankit|  Banglore| 26|    54| 2021-05-02|     2|     2021-07-02|
|    Ramakant|  Gurugram| 24|    60| 2022-06-02|     3|     2022-09-02|
|     Brijesh|   Gazipur| 26|    75| 2022-07-04|     4|     2022-11-04|
|    Devendra|  Gurugram| 27|    62| 2022-04-03|     5|     2022-09-03|
|        Ajay|Chandigarh| 25|    72| 2022-02-01|     6|     2022-08-01|
+------------+----------+---+------+-----------+------+---------------+

Thank You!!

Keep Learning!!!