𝐒𝐢𝐦𝐩𝐥𝐞 𝐑𝐚𝐧𝐝𝐨𝐦 𝐒𝐚𝐦𝐩𝐥𝐢𝐧𝐠 𝐨𝐫 𝐬𝐚𝐦𝐩𝐥𝐞():-

☑️ In simple random sampling, we pick records randomly and every record has an equal chance of being picked.

🔵 Syntax:- sample(withReplacement, fraction, seed=None)

☑️ Arguments:-

===========

🔵 𝐟𝐫𝐚𝐜𝐭𝐢𝐨𝐧:-

==========

👉 It takes a value in the 𝐫𝐚𝐧𝐠𝐞 𝐨𝐟 [𝟎.𝟎,𝟏.𝟎] and it's 𝐦𝐚𝐧𝐝𝐚𝐭𝐨𝐫𝐲 𝐭𝐨 𝐩𝐫𝐨𝐯𝐢𝐝𝐞.

👉 It defines the fraction of 𝐫𝐞𝐜𝐨𝐫𝐝𝐬 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐭𝐨 𝐬𝐚𝐦𝐩𝐥𝐞 𝐟𝐫𝐨𝐦 𝐲𝐨𝐮𝐫 𝐝𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞.

👉 For example, if you set it to 0.2, you want to sample approximately 20% of the records from your dataFrame.

👉 The sample function doesn't return exactly the specified fraction of records. For example, if you have 100 records in your dataFrame and you set the fraction to 0.2, there is no guarantee that you will get exactly 20 records.
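
👉 One way to see why the count is only approximate: conceptually, sampling without replacement keeps each record independently with a probability equal to the fraction. The sketch below imitates that idea in plain Python. It is an illustration under that assumption, not Spark's actual code, and `bernoulli_sample` is a hypothetical helper name.

```python
import random

def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the seed stays local
    return [r for r in records if rng.random() < fraction]

records = list(range(100))  # stand-in for a 100-row dataFrame

# the sample size can differ from seed to seed; on average it is 20
sizes = [len(bernoulli_sample(records, 0.2, seed=s)) for s in range(5)]
print(sizes)
```

Because every record is decided on its own, nothing forces the total to land on exactly 20.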

🔵 𝐬𝐞𝐞𝐝:-

=========

👉 Without a seed, sample() returns a different set of records each time you call it.

👉 To get the same set of sample records each time you do sampling, you have to define the seed value.

👉 When you define the 𝐬𝐞𝐞𝐝 𝐯𝐚𝐥𝐮𝐞, 𝐲𝐨𝐮 𝐰𝐢𝐥𝐥 𝐠𝐞𝐭 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞 𝐬𝐞𝐭 𝐨𝐟 𝐫𝐞𝐜𝐨𝐫𝐝𝐬 𝐞𝐚𝐜𝐡 𝐭𝐢𝐦𝐞 you run the sample function on the same data.

🔵 𝐰𝐢𝐭𝐡𝐑𝐞𝐩𝐥𝐚𝐜𝐞𝐦𝐞𝐧𝐭:-

=====================

👉 If you set it to 𝐓𝐫𝐮𝐞, the same record can be picked more than once, so you may get 𝐫𝐞𝐩𝐞𝐚𝐭𝐞𝐝 𝐨𝐫 𝐝𝐮𝐩𝐥𝐢𝐜𝐚𝐭𝐞 𝐫𝐞𝐜𝐨𝐫𝐝𝐬 𝐢𝐧 𝐲𝐨𝐮𝐫 𝐬𝐚𝐦𝐩𝐥𝐞 𝐝𝐚𝐭𝐚 along with other records.

👉 If you set it to 𝐅𝐚𝐥𝐬𝐞 or leave it unspecified (False is the default), each record is picked at most once, so 𝐲𝐨𝐮 𝐰𝐢𝐥𝐥 𝐠𝐞𝐭 𝐮𝐧𝐢𝐪𝐮𝐞 𝐫𝐞𝐜𝐨𝐫𝐝𝐬 𝐢𝐧 𝐲𝐨𝐮𝐫 𝐬𝐚𝐦𝐩𝐥𝐞 𝐝𝐚𝐭𝐚.
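
👉 To picture where the duplicates come from, you can think of the with-replacement mode as repeating each record a random number of times, where the average number of repeats equals the fraction. The plain-Python sketch below assumes a Poisson-distributed repeat count. Treat it as a conceptual model, not Spark's implementation; `poisson_draw` and `sample_with_replacement` are hypothetical helper names.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from a Poisson(lam) distribution using Knuth's method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(records, fraction, seed=None):
    """Repeat each record a Poisson(fraction) number of times."""
    rng = random.Random(seed)
    sample = []
    for r in records:
        sample.extend([r] * poisson_draw(rng, fraction))
    return sample

records = list(range(13))  # stand-in for a 13-row dataFrame
sample = sample_with_replacement(records, 0.4, seed=35)
print(sample)                           # a record may appear more than once
print(len(sample) != len(set(sample)))  # True if any record was repeated
```

A repeat count of 0 drops the record, 1 keeps it once, and 2 or more produces the duplicates.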

===============================================

Dataset link:- github.com/kishanpython/PySparkNotebooks/bl..

Notebook link:- github.com/kishanpython/PySpark-Notebooks/b..

=================================================

Follow for more:- linkedin.com/in/kishanyadav

# importing necessary libs
from pyspark.sql import SparkSession

# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()

# create dataframe from a CSV file with a header row, inferring column types
df_survey = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/content/dataset_for_sampling.csv")
df_survey.show(5)

# Output:-

+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2011.09|     78324|     R|Number|
|     BDCQ.SEA1AA|2011.12|     85850|     R|Number|
|     BDCQ.SEA1AA|2012.03|     90743|     R|Number|
|     BDCQ.SEA1AA|2012.06|     81780|     R|Number|
+----------------+-------+----------+------+------+

Example - 01 || passing fraction value ||

# providing fraction value only
df_sample_1 = df_survey.sample(fraction=0.2)
df_sample_1.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2013.06|     85614|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.12|     94206|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+

# check the number of records
df_sample_1.count()

# Output:-
13

Here we have 100 records in our dataFrame and we set fraction=0.2 to get 20% of the records, i.e. 20 records, but we got fewer, i.e. 13. The sample function only gives an approximate number of records.
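
Is 13 out of 100 surprising for fraction=0.2? A rough check, assuming each record is kept independently with probability 0.2 (a simplifying assumption, so the sample size follows a binomial distribution): the average size is about 20 and the typical spread is about 4, which puts 13 less than two spreads below the average.

```python
import math

n = 100         # records in the dataFrame
fraction = 0.2  # fraction passed to sample()

expected = n * fraction                            # average sample size
spread = math.sqrt(n * fraction * (1 - fraction))  # standard deviation

print(expected, spread)
# how many standard deviations the observed 13 is below the average
print((expected - 13) / spread)
```

On a small dataFrame this spread is large compared with the sample itself, so counts like 13 or 27 are nothing unusual.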

Example - 02 || passing seed value ||

# providing the fraction value and the seed value together
# for this example we sample from the dataFrame created above (df_sample_1)
# so that we can compare the results easily
df_sample_2 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_3 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_4 = df_sample_1.sample(fraction=0.2, seed=40)
df_sample_2.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+

df_sample_3.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+


**We can see that the records in both samples (df_sample_2 and df_sample_3) are the same, because we gave the seed value 34 for both. Now let's see the records of df_sample_4.**

# this sample contains 3 records, and they differ because we gave it a different seed value, i.e. 40
df_sample_4.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
+----------------+-------+----------+------+------+

Example - 03 || passing withReplacement argument value ||


# now we pass all three argument values to sample()
# we are using the df_sample_1 dataFrame
df_sample_5 = df_sample_1.sample(withReplacement=True, fraction=0.4, seed=35)
df_sample_5.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
+----------------+-------+----------+------+------+

When we set withReplacement=True, we can get duplicate records in the output, as with the repeated 2017.06 and 2019.09 rows above.

# again we pass all three argument values, this time with withReplacement=False
# we are using the df_sample_1 dataFrame
df_sample_6 = df_sample_1.sample(withReplacement=False, fraction=0.3, seed=35)
df_sample_6.show()

# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+

When we set withReplacement=False, each record appears at most once in the output.

Thank You!!

Keep Learning!!

TheKishanYadav
