Sampling method in PySpark

Simple Random Sampling or sample():-

☑️ In simple random sampling, we pick records at random, and every record has an equal chance of being picked.

🔵 Syntax:- sample(withReplacement, fraction, seed=None)

☑️ Arguments:-

===========

🔵 fraction:-

==========

👉 It takes a value in the range [0.0, 1.0] and is mandatory.

👉 It defines the fraction of records you want to sample from your DataFrame.

👉 For example, a value of 0.2 means you want to sample approximately 20% of the records in your DataFrame.

👉 The sample function doesn't return exactly the specified fraction of records. For example, if your DataFrame has 100 records and you set the fraction to 0.2, there is no guarantee that you get exactly 20 records.
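
One way to see why the count is only approximate: assume sampling without replacement works like an independent coin flip per record, where each record is kept with probability `fraction`. The plain-Python sketch below imitates that idea. `bernoulli_sample` is a hypothetical helper written only for this illustration; it is not a PySpark function and it does not reproduce Spark's random number generator.

```python
import random

def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

records = list(range(100))  # stand-in for a 100-row DataFrame

# Sample sizes for a few different seeds: near 20, but not necessarily exactly 20.
sizes = [len(bernoulli_sample(records, 0.2, seed=s)) for s in range(10)]
print(sizes)
```

Under this assumption the sample size changes from run to run, which matches the "approximately 20%" behaviour described above.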

🔵 Seed:-

=========

👉 Without a seed, sample() returns a different set of records each time.

👉 To get the same set of sample records every time you sample, you have to define a seed value.

👉 When you define a seed value, you get the same set of records each time you run the sample function on the same DataFrame.
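
The role of the seed can be sketched in plain Python as well. This is only an analogy, assuming the seed initialises the random number generator that decides which records are kept; `sample_with_seed` is a hypothetical helper, not part of PySpark.

```python
import random

def sample_with_seed(records, fraction, seed=None):
    """Hypothetical stand-in for sample(): one coin flip per record."""
    rng = random.Random(seed)  # seed=None uses system entropy, so results vary
    return [r for r in records if rng.random() < fraction]

records = list(range(100))

same_a = sample_with_seed(records, 0.2, seed=34)
same_b = sample_with_seed(records, 0.2, seed=34)
other = sample_with_seed(records, 0.2, seed=40)

print(same_a == same_b)  # True: same seed, same input, same sample
print(same_a == other)   # a different seed almost always picks different records
```

Note that the guarantee depends on the input as well as the seed: if the records or their order change, the same seed can select a different sample.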

🔵 withReplacement:-

=====================

👉 If you set it to True, you can get repeated or duplicate records in your sample data along with other records.

👉 If you set it to False or don't specify it (False is the default), each record is picked at most once, so you get unique records in your sample data.
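
The difference between the two modes can be sketched in plain Python. The sketch assumes a per-record model: without replacement, each record gets 0 or 1 copies (a coin flip); with replacement, each record gets a Poisson-distributed number of copies whose mean is `fraction`. Both `poisson_draw` and `sample_rows` are hypothetical helpers for this illustration, not Spark's actual code.

```python
import math
import random
from collections import Counter

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sample_rows(records, fraction, with_replacement, seed=None):
    """Hypothetical sketch of the two modes; not a PySpark API."""
    rng = random.Random(seed)
    out = []
    for r in records:
        if with_replacement:
            copies = poisson_draw(rng, fraction)          # 0, 1, 2, ... copies
        else:
            copies = 1 if rng.random() < fraction else 0  # at most one copy
        out.extend([r] * copies)
    return out

records = list(range(1000))
with_rep = sample_rows(records, 0.4, with_replacement=True, seed=35)
without_rep = sample_rows(records, 0.4, with_replacement=False, seed=35)

print(max(Counter(with_rep).values()))     # can exceed 1: some records repeat
print(max(Counter(without_rep).values()))  # 1: no record is picked twice
```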

===============================================

Dataset link:- github.com/kishanpython/PySparkNotebooks/bl..

Notebook link:- github.com/kishanpython/PySpark-Notebooks/b..

=================================================

Follow for more:- linkedin.com/in/kishanyadav

# importing necessary libs
from pyspark.sql import SparkSession

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create dataframe from the CSV file, inferring the column types
df_survey = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/content/dataset_for_sampling.csv")
df_survey.show(5)

# Output:-

+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2011.09|     78324|     R|Number|
|     BDCQ.SEA1AA|2011.12|     85850|     R|Number|
|     BDCQ.SEA1AA|2012.03|     90743|     R|Number|
|     BDCQ.SEA1AA|2012.06|     81780|     R|Number|
+----------------+-------+----------+------+------+

Example - 01 || passing fraction value ||

# providing fraction value only
df_sample_1 = df_survey.sample(fraction=0.2)
df_sample_1.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2013.06|     85614|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.12|     94206|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+

# check the number of records
df_sample_1.count()

# Output:-
13

Here we have 100 records in our DataFrame, and fraction=0.2 asks for about 20% of them, i.e. roughly 20 records. We actually got fewer, 13 records, because sample() returns only an approximate number of records.
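
To get a feel for how far the count can drift from 20, here is a small worked calculation. It assumes each of the 100 records is kept independently with probability 0.2, so the sample size follows a binomial distribution; the helper `prob_at_most` is hypothetical and exists only for this example.

```python
from math import comb, sqrt

n, fraction = 100, 0.2

expected = n * fraction                       # average sample size
spread = sqrt(n * fraction * (1 - fraction))  # standard deviation of the size

def prob_at_most(k):
    """P(sample size <= k) when each record is kept with probability `fraction`."""
    return sum(comb(n, i) * fraction**i * (1 - fraction)**(n - i) for i in range(k + 1))

print(expected, spread)
print(prob_at_most(13))  # chance of getting 13 or fewer records
```

Under this assumption the expected size is 20 with a standard deviation of 4, so 13 records is a little under two standard deviations below the average: uncommon, but not surprising.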

Example - 02 || passing seed value ||

# providing fraction value and seed value together
# for this example we sample from df_sample_1 (the 13-record sample above)
# so that we can compare the results easily
df_sample_2 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_3 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_4 = df_sample_1.sample(fraction=0.2, seed=40)
df_sample_2.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+

df_sample_3.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+


**We can see that both samples (df_sample_2 and df_sample_3) contain the same records, because we gave both the seed value 34. Now let's look at the records of df_sample_4.**

# this sample uses a different seed value (40); it also contains 3 records, but different ones
df_sample_4.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
+----------------+-------+----------+------+------+

Example - 03 || passing withReplacement argument value ||


# now we pass all three argument values to sample()
# we are using the df_sample_1 DataFrame
df_sample_5 = df_sample_1.sample(withReplacement=True, fraction=0.4, seed=35)
df_sample_5.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
+----------------+-------+----------+------+------+

When we set withReplacement=True, the same record can appear more than once in the output. Here the 2017.06 and 2019.09 rows of BDCQ.SEA1AS each appear twice.

# same kind of call, but this time with withReplacement=False
# we are using the df_sample_1 DataFrame
df_sample_6 = df_sample_1.sample(withReplacement=False, fraction=0.3, seed=35)
df_sample_6.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+

When we set withReplacement=False, each record appears at most once in the output.

Thank You!!

Keep Learning!!