4: Data enrichment

We enrich the dataset primarily by deriving additional features from the data it already contains.

Text Summarization: For each article in our dataset, we apply an LLM-based (large language model) text summarization method that condenses the article into a two-sentence summary. The resulting summaries are stored in a new column named “article_summary”.
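
The summarization step could be sketched roughly as follows. This is a minimal illustration, not the method actually used: `first_n_sentences`, `summarize_articles`, and `fake_summarizer` are hypothetical names, and the summarizer is only assumed to behave like a Hugging Face summarization pipeline (a list of strings in, a list of dicts with a `summary_text` key out). The sentence split is deliberately naive.

```python
import re


def first_n_sentences(text, n=2):
    """Keep at most the first n sentences (naive split after ., ! or ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


def summarize_articles(articles, summarizer, batch_size=8):
    """Summarize each article in batches and trim each result to two sentences.

    Assumption: summarizer(list_of_str) returns a list of dicts that each
    carry a "summary_text" key, as a summarization pipeline would.
    """
    summaries = []
    for start in range(0, len(articles), batch_size):
        batch = articles[start:start + batch_size]
        for result in summarizer(batch):
            summaries.append(first_n_sentences(result["summary_text"], n=2))
    return summaries


# Stand-in summarizer (hypothetical) so the sketch runs without a model download.
def fake_summarizer(texts):
    return [{"summary_text": text} for text in texts]


articles = [
    "First point. Second point. Third point.",
    "Only one sentence here.",
]
article_summaries = summarize_articles(articles, fake_summarizer)
print(article_summaries)
```

With a real model, `fake_summarizer` would be replaced by something like `pipeline("summarization", model=...)` from `transformers`, and the result assigned with `df["article_summary"] = summarize_articles(df["ARTICLE"].tolist(), summarizer)`. The model choice is left open here because the text does not name one.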

Sentiment Analysis: We employ LLMs to analyze the sentiment expressed in both the news headlines and the article summaries.
- Sentiment Categories: Each text is assigned one of three categories: “Positive”, “Negative”, or “Neutral”.
- Storing Results: Results for the article summaries are stored in a new column named “article_summary_sentiment”, and results for the headlines in a new column named “headline_sentiment”.

Emotion Analysis: Similarly, we use LLMs to analyze the emotions conveyed in both the news headlines and the article summaries.
- Emotion Classes: Each text is assigned one of six classes: “Anger”, “Sadness”, “Joy”, “Surprise”, “Love”, and “Fear”.
- Storing Results: Results for the article summaries are stored in a new column named “article_summary_emotion”, and results for the headlines in a new column named “headline_emotion”.
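
The column layout described above can be sketched as a small helper that runs one classifier over several text columns and writes one label column per source. This is an illustration under assumptions: `add_label_columns` and the two stand-in classifiers are hypothetical, and the classifier is only assumed to return a list of dicts with a `label` key, as a text-classification pipeline does. The helper uses only indexing and assignment, so it should work on a plain dict of lists as well as on a pandas DataFrame.

```python
def add_label_columns(table, classifier, source_to_target):
    """Classify each source column and store the labels in its target column.

    table: dict of lists or a pandas DataFrame (only table[col] is used).
    classifier: callable taking a list of strings and returning a list of
        dicts with a "label" key (assumed pipeline-like behaviour).
    source_to_target: mapping of source column name to new column name.
    """
    for source_col, target_col in source_to_target.items():
        texts = list(table[source_col])
        results = classifier(texts)
        table[target_col] = [result["label"] for result in results]
    return table


# Stand-in classifiers (hypothetical) so the sketch runs without model downloads.
def fake_sentiment(texts):
    return [{"label": "Neutral", "score": 1.0} for _ in texts]


def fake_emotion(texts):
    return [{"label": "Joy", "score": 1.0} for _ in texts]


table = {
    "TITLE": ["Rays top A's as Morton's unbeaten streak hits 21"],
    "article_summary": ["A placeholder two-sentence summary. It stands in for real output."],
}

add_label_columns(table, fake_sentiment, {
    "TITLE": "headline_sentiment",
    "article_summary": "article_summary_sentiment",
})
add_label_columns(table, fake_emotion, {
    "TITLE": "headline_emotion",
    "article_summary": "article_summary_emotion",
})
print(sorted(table))
```

For large datasets the classifier passed in would normally batch its input rather than receive the whole column at once.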

Together, text summarization, sentiment analysis, and emotion analysis add a summary, a sentiment label, and an emotion label for each record, covering both the headlines and the article summaries. These derived features support further analysis and interpretation of the data.

import pandas as pd
import torch
from transformers import pipeline

# Load your data
df = pd.read_csv('/content/cleaned_and_filtered.csv')

# Use the first GPU if one is available, otherwise fall back to the CPU (-1)
device = 0 if torch.cuda.is_available() else -1

# Load the sentiment and emotion pipelines
sentiment_analysis = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", device=device)
emotion_analysis = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion", device=device)

# Function to apply the model in batches
def apply_model_in_batches(model, texts, batch_size=100):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        results.extend(model(batch))
    return results

# Apply sentiment analysis in batches
sentiment_results = apply_model_in_batches(sentiment_analysis, df['TITLE'].tolist())
df['sentiment'] = [result['label'] for result in sentiment_results]
df['sentiment_score'] = [result['score'] for result in sentiment_results]

# Apply emotion analysis in batches
emotion_results = apply_model_in_batches(emotion_analysis, df['TITLE'].tolist())
df['emotion'] = [result['label'] for result in emotion_results]
df['emotion_score'] = [result['score'] for result in emotion_results]

def label_to_sentiment(label):
    mapping = {
        'LABEL_0': 'negative',
        'LABEL_1': 'neutral',
        'LABEL_2': 'positive'
    }
    return mapping.get(label, 'unknown')  # Return 'unknown' if the label is not recognized

# Apply the function to the 'sentiment' column
df['sentiment'] = df['sentiment'].apply(label_to_sentiment)

# Optionally save the updated DataFrame back to a CSV
df.to_csv('main.csv', index=False)

# Reload the saved file and preview the first rows to verify the changes
df = pd.read_csv('main.csv')
df.head()
| | Unnamed: 0 | DATE | AUTHOR | TITLE | cleaned | sentiment | sentiment_score | emotion | emotion_score | ARTICLE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 205074 | 2019-06-29 | Field Level Media | Sanchez, Nationals shut down Tigers | Sanchez , Nationals shut tiger | neutral | 0.892794 | anger | 0.772276 | Anibal Sanchez pitched six strong innings Frid... |
| 1 | 205655 | 2019-06-11 | Field Level Media | Rays top A's as Morton's unbeaten streak hits 21 | ray Morton unbeaten streak hit 21 | neutral | 0.626775 | joy | 0.990553 | Aided by three home runs, right-hander Charlie... |
| 2 | 205885 | 2019-06-15 | Field Level Media | Jimenez, White Sox crush Sabathia, Yankees | Jimenez , White Sox crush Sabathia , Yankees | neutral | 0.899590 | anger | 0.657171 | EditorsNote: Changed stat ‘eight runs’ to ‘fiv... |
| 3 | 206081 | 2019-06-15 | Field Level Media | Giants crack three homers, down Brewers | giant crack homer , brewer | neutral | 0.829722 | anger | 0.467322 | Kevin Pillar had three hits and drove in the t... |
| 4 | 206131 | 2019-06-19 | Field Level Media | A's crack six homers, obliterate Orioles | crack homer , obliterate Orioles | neutral | 0.706669 | anger | 0.490198 | Beau Taylor triggered a six-homer assault on B... |
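
In the preview, every headline is labelled neutral and four of five are labelled anger, so it may be worth checking how the labels are distributed before relying on them. The sketch below is one simple way to do that; `label_distribution` is a hypothetical helper, and the two lists are copied from the five preview rows rather than computed from the full dataset.

```python
from collections import Counter


def label_distribution(labels):
    """Return (label, share of total) pairs, most common label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Labels copied from the five preview rows, not from the full dataset.
preview_sentiment = ["neutral", "neutral", "neutral", "neutral", "neutral"]
preview_emotion = ["anger", "joy", "anger", "anger", "anger"]

sentiment_shares = label_distribution(preview_sentiment)
emotion_shares = label_distribution(preview_emotion)
print(sentiment_shares)
print(emotion_shares)
```

On the full DataFrame the same check could be run with `label_distribution(df['sentiment'].tolist())`, or with pandas directly via `df['sentiment'].value_counts(normalize=True)`.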