2: Retrieval of raw data
Our project revolves around the classification of news articles based on authors and headlines. To gather the necessary data, we retrieved four datasets through two channels: the Kaggle API and direct downloads with the Python requests library:
News Category Dataset (via Kaggle API): This dataset provides a large collection of news articles, each labeled with a category, along with metadata such as headlines, authors, publication dates, and article content. We wrote a script that retrieves it with a single Kaggle API call, and it serves as a valuable resource for training and testing our classification models based on headline and author features.
All the News Articles Dataset (via Kaggle API): Comprising a diverse range of news articles from various sources, this dataset offers extensive coverage of news topics and authors. Also acquired via the Kaggle API, it enriches our data collection by providing additional samples for training and validation. Together, these datasets form the backbone of our project, enabling us to develop robust classification models and gain insight into media narratives and authorship patterns.
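Both Kaggle downloads assume valid API credentials. The Kaggle client reads them from ~/.kaggle/kaggle.json, or from the KAGGLE_USERNAME and KAGGLE_KEY environment variables as a fallback, so a minimal pre-flight check (not part of get_data.py; shown here only as a sketch) could look like:

import os

# The Kaggle client authenticates from ~/.kaggle/kaggle.json, falling back
# to the KAGGLE_USERNAME and KAGGLE_KEY environment variables.
has_json = os.path.exists(os.path.expanduser('~/.kaggle/kaggle.json'))
has_env = 'KAGGLE_USERNAME' in os.environ and 'KAGGLE_KEY' in os.environ
if not (has_json or has_env):
    raise RuntimeError('Kaggle API credentials not found')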
All the News 2 Dataset: This dataset consists of a diverse selection of news articles from various sources and categories. For this dataset, we directly downloaded the SQLite database and zipped CSV file from the links provided on the dataset's website. Both files were stored locally for further preprocessing and analysis as part of our project; a sketch of how the SQLite file can be inspected follows.
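Since the schema of the SQLite file is not documented here, a reasonable first step is to list its tables and preview a few rows. The table name below is a placeholder to be replaced with whatever the query reports:

import sqlite3
import pandas as pd

conn = sqlite3.connect('all-the-news.db')
# List the tables in the database, since the schema isn't documented here
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table'", conn)
print(tables)
# 'articles' is a placeholder; substitute the table name printed above
preview = pd.read_sql_query('SELECT * FROM articles LIMIT 5', conn)
conn.close()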
NYTimes Front Page Dataset: For this dataset, we used the provided Dropbox link to download data from the front page of The New York Times. After downloading it, we parsed the CSV file to extract relevant metadata fields such as headlines, publication dates, and article content, and stored the result locally for preprocessing and integration into our analysis pipeline, as sketched below.
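A minimal version of that parsing step might look like the following; the column names are assumptions about the file's header and should be adjusted to match the actual CSV:

import pandas as pd

# Load the downloaded front-page data
nyt = pd.read_csv('nytimes_front_page.csv')
print(nyt.columns.tolist())  # inspect the actual header first
# Hypothetical column names; keep only the metadata fields we need
meta = nyt[['headline', 'date', 'content']]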
By programmatically retrieving and processing these datasets, we ensured a systematic approach to data acquisition, enabling comprehensive analysis and classification downstream.
import requests
import zipfile
import os

import kaggle


def download_file_from_url(url, local_path):
    """
    Download a file from a direct URL and save it locally.

    Args:
        url (str): The direct download URL.
        local_path (str): The path to save the downloaded file to.
    """
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an exception on HTTP errors
    with open(local_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"File downloaded successfully and saved to {local_path}")


def extract_zip(zip_path, extract_to):
    """
    Extract a zip file to a specified directory.

    Args:
        zip_path (str): The path to the zip file.
        extract_to (str): The directory to extract the files into.
    """
    # Ensure the target directory exists
    os.makedirs(extract_to, exist_ok=True)
    # Extract all the contents into the directory
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
    print(f"All files have been extracted to {extract_to}")


# Download and unzip the News Category Dataset from Kaggle
# (importing kaggle authenticates from ~/.kaggle/kaggle.json)
dataset_identifier = 'rmisra/news-category-dataset'
download_path = 'news-category-dataset'
kaggle.api.dataset_download_files(dataset_identifier, path=download_path, unzip=True)

# Direct-download URLs (dl=1 makes Dropbox serve the raw file)
url1 = 'https://www.dropbox.com/scl/fi/pp8gbi3j7lxlunucs93xu/all-the-news.db?rlkey=iy65g92gd7bsaligula1disfn&e=1&dl=1'
url2 = 'https://www.dropbox.com/scl/fi/ri2muuv85ri98odyi9fzn/all-the-news-2-1.zip?rlkey=8qeq5kpg5btk3vq5nbx2aojxx&e=1&dl=1'
url3 = 'https://www.dropbox.com/scl/fi/pm4c66u0exyj0ihaxr95e/nytimes%20front%20page.csv?rlkey=7mus7otnczy9w9q8wr3bo52ga&e=1&dl=1'

# Local paths where the downloaded files are saved
local_path1 = 'all-the-news.db'
local_path2 = 'all-the-news-2-1.zip'
local_path3 = 'nytimes_front_page.csv'

download_file_from_url(url1, local_path1)
download_file_from_url(url2, local_path2)
download_file_from_url(url3, local_path3)

# Unzip the All the News 2 archive into the current directory
zip_path = 'all-the-news-2-1.zip'
extract_to = '.'
extract_zip(zip_path, extract_to)

More information can be found in the get_data.py file.