How do I create my own dataset for NLP tasks?

In this blog article, I am going to explain how I created my own dataset for an NLP task. So, let's get started...

To collect the text data for the NLP task, I used the Inshorts news web application. The data I collected has three columns: news_headline, news_article, and news_category.

  • News Headline: a one-line sentence that gives an overview of the news article.
  • News Article: a multi-line passage that contains the full information about the news.
  • News Category: the category the news article belongs to.

Example:

  • news_headline: Musk's Boring Company shares a glimpse of Las Vegas loop station.
  • news_article: The Boring Company shared a short clip on Twitter showing one of the underground stations that the company is building as part of its Las Vegas Convention Center (LVCC) loop. In September, Founder Elon Musk said the first operational tunnel under Vegas was almost complete. "Tunnels under cities with self-driving electric cars will feel like warp drive," he had added.
  • news_category: Technology

The articles fall into 7 different categories: technology, sports, politics, entertainment, world, automobile, and science.

To collect this data, I used the following libraries: requests, BeautifulSoup, and pandas.
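If you don't already have these libraries installed, you can install them with pip (note that BeautifulSoup is published on PyPI as beautifulsoup4):

pip install requests beautifulsoup4 pandas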

Step #1

Import the necessary libraries.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Step #2

Define the URLs to scrape the data from.

# news urls
seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world',
             'https://inshorts.com/en/read/politics',
             'https://inshorts.com/en/read/entertainment',
             'https://inshorts.com/en/read/automobile',
             'https://inshorts.com/en/read/science']

Step #3

Now iterate over each URL, fetch the HTML content using the requests and BeautifulSoup libraries, and then extract the data you need from it.

news_data = []
# Collecting data
for url in seed_urls:
    news_category = url.split('/')[-1]
    data = requests.get(url)
    soup = BeautifulSoup(data.content, 'html.parser')
    # Pair each headline card with its corresponding article-body card
    news_articles = [{'news_headline': headline.find('span', attrs={"itemprop": "headline"}).string,
                      'news_article': article.find('div', attrs={"itemprop": "articleBody"}).string,
                      'news_category': news_category}
                     for headline, article in
                     zip(soup.find_all('div', class_=["news-card-title news-right-box"]),
                         soup.find_all('div', class_=["news-card-content news-right-box"]))]
    news_data.extend(news_articles)
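
One thing to be aware of: requests.get() can fail, and find() can return None if Inshorts changes its markup, in which case the list comprehension above raises an AttributeError. A slightly more defensive version of the same loop, written as a sketch under the same assumptions about the CSS classes and itemprop attributes used above, might look like this:

for url in seed_urls:
    news_category = url.split('/')[-1]
    response = requests.get(url)
    if response.status_code != 200:
        # Skip URLs that did not return a successful response
        print(f'Skipping {url}: HTTP {response.status_code}')
        continue
    soup = BeautifulSoup(response.content, 'html.parser')
    headlines = soup.find_all('div', class_=["news-card-title news-right-box"])
    articles = soup.find_all('div', class_=["news-card-content news-right-box"])
    for headline, article in zip(headlines, articles):
        headline_span = headline.find('span', attrs={"itemprop": "headline"})
        article_div = article.find('div', attrs={"itemprop": "articleBody"})
        # Skip cards where either element is missing instead of raising an error
        if headline_span is None or article_div is None:
            continue
        news_data.append({'news_headline': headline_span.string,
                          'news_article': article_div.string,
                          'news_category': news_category})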

Step #4

Now save your extracted data.

# Creating dataframe
df = pd.DataFrame(news_data)
df = df[['news_headline', 'news_article', 'news_category']]
# Note: the 'output' directory must already exist
file_save = 'output/' + 'inshort' + '.csv'
df.to_csv(file_save, index=False)
print('File saved successfully!')
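
As a quick sanity check, you can load the saved file back and look at how the articles are spread across categories (a minimal sketch, assuming the file was saved to output/inshort.csv as above):

df = pd.read_csv('output/inshort.csv')
print(df.shape)                             # (number of articles, number of columns)
print(df['news_category'].value_counts())   # articles per category
print(df.head())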

🎯 That's all! You have created your own dataset with minimal effort.

By running this script, I collected data for 10k+ articles in one month. I ran the script twice a day so that I could collect different news stories over time. You can also use a cron job to run the script automatically. Each run fetches the latest news across the different categories.
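Note that the script in Step #4 overwrites output/inshort.csv on each run, so to accumulate articles across runs you need to merge the new rows into the existing file. One simple way to do this (sketched here as an example, with the same file path as above) is to append the new rows and drop duplicate headlines before saving:

import os

file_save = 'output/inshort.csv'
new_df = pd.DataFrame(news_data)[['news_headline', 'news_article', 'news_category']]
if os.path.exists(file_save):
    # Merge with previously collected articles
    old_df = pd.read_csv(file_save)
    new_df = pd.concat([old_df, new_df], ignore_index=True)
# Keep only one copy of each headline collected across runs
new_df = new_df.drop_duplicates(subset='news_headline', keep='first')
new_df.to_csv(file_save, index=False)

To schedule the script twice a day, a crontab entry such as 0 9,21 * * * python3 /path/to/inshorts_scraper.py (the script path here is just a placeholder) would run it every day at 9 AM and 9 PM.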

Thank you for reading! Do share your valuable feedback and suggestions!