Simple Random Sampling or sample():-
In simple random sampling, we pick records randomly, and every record has an equal chance of being picked.
🔵 Syntax:- sample(withReplacement=None, fraction=None, seed=None)
Arguments:-
===========
🔵 Fraction:-
==========
- It takes a value in the range [0.0, 1.0], and it must be provided.
- It defines the fraction of records you want to sample from your DataFrame.
- For example, if you set it to 0.2, you are asking for approximately 20% of the records in your DataFrame.
- The sample function doesn't return exactly the fraction of records specified. For example, if you have 100 records in your DataFrame and you set the fraction to 0.2, there is no guarantee that you will get exactly 20 records.
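The approximate count is easier to see with a small model. Below is a minimal plain-Python sketch, assuming each record is kept independently with probability equal to the fraction (a coin flip per record). The helper name bernoulli_sample is made up for illustration; this only models the idea and is not Spark's own code.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))  # stand-in for a 100-record DataFrame

# Repeat the sampling with 1000 different seeds and record each sample size.
sizes = [len(bernoulli_sample(rows, 0.2, seed=s)) for s in range(1000)]

# The sizes scatter around 20 instead of always being exactly 20.
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

Because every record is decided on its own, the fraction only fixes the average sample size, not the size of any single sample.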
🔵 Seed:-
=========
- Without a seed, sample() returns a different set of records each time it is called.
- To get the same set of sample records every time you do the sampling, you have to define the seed value.
- When you define the seed value, you will get the same set of records each time you run the sample function on the same DataFrame.
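The effect of the seed can be sketched in plain Python. This is only a model under the assumption that the seed drives one coin flip per record, in order; sample_positions is a hypothetical helper, not a PySpark function. It also hints at a caveat: the seed fixes which positions are kept, so if the records arrive in a different order, the same seed can select different records.

```python
import random

def sample_positions(n_rows, fraction, seed):
    """Positions kept by one pass of seeded coin flips (hypothetical helper)."""
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < fraction]

rows = [f"row-{i}" for i in range(100)]

# Same seed, same input order: identical samples.
first = [rows[i] for i in sample_positions(len(rows), 0.2, seed=34)]
second = [rows[i] for i in sample_positions(len(rows), 0.2, seed=34)]
print(first == second)  # True

# Same seed, but the records are in a different order:
# the same positions are kept, which are now different records.
reordered = list(reversed(rows))
third = [reordered[i] for i in sample_positions(len(rows), 0.2, seed=34)]
print(first)
print(third)
```

In practice this means a seed makes the sample repeatable only as long as the underlying data and its ordering stay the same.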
🔵 withReplacement:-
=====================
- If you set it to True, the same record can be picked more than once, so you may get repeated or duplicate records in your sample data along with other records.
- If you set it to False, or leave it unspecified (False is the default), each record can be picked at most once, so no record is repeated in your sample data.
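The difference between the two modes can be modelled in plain Python. This is a rough sketch under stated assumptions: without replacement, each record gets one coin flip; with replacement, each record is emitted a random number of times drawn from a Poisson distribution whose mean is the fraction. All function names here are hypothetical, and the code is an illustration of the idea rather than Spark's implementation.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(rows, fraction, seed=None):
    """Each row appears 0, 1, 2, ... times (hypothetical helper)."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson(rng, fraction))
    return sampled

def sample_without_replacement(rows, fraction, seed=None):
    """Each row appears at most once (hypothetical helper)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = [f"row-{i}" for i in range(20)]
with_repl = sample_with_replacement(rows, 0.4, seed=35)
without_repl = sample_without_replacement(rows, 0.4, seed=35)

print(with_repl)     # a row may show up more than once
print(without_repl)  # every row shows up at most once
```

Repeats in the first list come from a count of 2 or more for a single record; the second list can never contain them because each record is only ever kept or dropped.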
===============================================
Dataset link:- github.com/kishanpython/PySparkNotebooks/bl..
Notebook link:- github.com/kishanpython/PySpark-Notebooks/b..
=================================================
Follow for more:- linkedin.com/in/kishanyadav
# importing necessary libs
from pyspark.sql import SparkSession
# creating session
spark = SparkSession.builder.appName("practice").getOrCreate()
# create dataframe
df_survey = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/content/dataset_for_sampling.csv")
df_survey.show(5)
#Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2011.09|     78324|     R|Number|
|     BDCQ.SEA1AA|2011.12|     85850|     R|Number|
|     BDCQ.SEA1AA|2012.03|     90743|     R|Number|
|     BDCQ.SEA1AA|2012.06|     81780|     R|Number|
+----------------+-------+----------+------+------+
Example - 01 || passing fraction value ||
# providing fraction value only
df_sample_1 = df_survey.sample(fraction=0.2)
df_sample_1.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2013.06|     85614|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.12|     94206|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+
# check the number of records
df_sample_1.count()
# Output:-
13
Our DataFrame has 100 records, and we set fraction=0.2 to get 20% of them, i.e. about 20 records, but we got fewer: only 13. sample() returns an approximate number of records, not an exact one.
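A quick back-of-the-envelope check shows that 13 is not an unusual result. The sketch below assumes each of the 100 records is kept independently with probability 0.2, so the sample size follows a binomial distribution; the variable names are only for illustration.

```python
import math

n_rows = 100     # records in the DataFrame
fraction = 0.2   # value passed to sample()
observed = 13    # sample size we actually got

# Binomial mean and standard deviation of the sample size.
expected = n_rows * fraction
std_dev = math.sqrt(n_rows * fraction * (1 - fraction))

# How many standard deviations the observed size is from the mean.
z_score = (observed - expected) / std_dev

print(f"expected={expected:.1f}, std_dev={std_dev:.1f}, z_score={z_score:.2f}")
```

Under this assumption the expected size is 20 with a standard deviation of 4, so 13 records is 1.75 standard deviations below the mean: lower than average, but well within normal random variation.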
Example - 02 || passing seed value ||
# providing fraction value and seed value together
# for this example we are sampling from df_sample_1, the sample dataframe created above
# so that we can compare the results easily
df_sample_2 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_3 = df_sample_1.sample(fraction=0.2, seed=34)
df_sample_4 = df_sample_1.sample(fraction=0.2, seed=40)
df_sample_2.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+
df_sample_3.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2021.03|    101342|     R|Number|
+----------------+-------+----------+------+------+
**We can see that the records in both samples (df_sample_2 and df_sample_3) are the same, because we gave the seed value 34 for both. Now let's see the records of df_sample_4.**
# this sample contains 3 records, and they differ from the ones above because we gave a different seed value, i.e. 40
df_sample_4.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2017.06|     90510|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
+----------------+-------+----------+------+------+
Example - 03 || passing withReplacement argument value ||
# now we pass all three argument values to sample()
# we are using df_sample_1 dataFrame
df_sample_5 = df_sample_1.sample(withReplacement=True, fraction=0.4, seed=35)
df_sample_5.show()
#Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AA|2016.06|     88716|     R|Number|
|     BDCQ.SEA1AA|2019.03|    102031|     R|Number|
|     BDCQ.SEA1AS|2012.12|     84320|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2017.06|     94411|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
|     BDCQ.SEA1AS|2019.09|     94880|     R|Number|
+----------------+-------+----------+------+------+
When we set withReplacement=True, the same record can appear more than once in the output.
Here the 2017.06 and 2019.09 records of BDCQ.SEA1AS are each repeated.
# again we pass all three argument values to sample(), this time with withReplacement=False
# we are using df_sample_1 dataFrame
df_sample_6 = df_sample_1.sample(withReplacement=False, fraction=0.3, seed=35)
df_sample_6.show()
# Output:-
+----------------+-------+----------+------+------+
|Series_reference| Period|Data_value|STATUS| UNITS|
+----------------+-------+----------+------+------+
|     BDCQ.SEA1AA|2011.06|     80078|     R|Number|
|     BDCQ.SEA1AA|2012.09|     79261|     R|Number|
|     BDCQ.SEA1AA|2015.06|     87987|     R|Number|
|     BDCQ.SEA1AS|2021.03|     96418|     R|Number|
+----------------+-------+----------+------+------+
When we set withReplacement=False, each record is picked at most once,
so no record is repeated in the output.
Thank You!!
Keep Learning!!