4: Data enrichment

Primarily, we enrich our data by obtaining certain metrics from the current data.

Our dataset is enhanced by adding additional information obtained from the existing data.

Text Summarization: For each article in our dataset, we utilize an LLM (Language Model) based text summarization method. This method condenses the content of each article into a concise summary consisting of two sentences. The resulting summaries are stored in a new column named “article_summary”.

Sentiment Analysis: We employ LLMs to analyze the sentiment expressed in both the news headlines and the article summaries. -Sentiment Categories: The sentiment analysis categorizes the sentiment into three main categories: “Positive”, “Negative”, and “Neutral”. -Storing Results: The sentiment analysis results for the article summaries are stored in new columns named “article_summary_sentiment”, while the sentiment analysis results for the headlines are stored in new columns named “headline_sentiment”.

Emotion Analysis: Similarly, we utilize LLMs to analyze the emotions conveyed in both the news headlines and the article summaries. -Emotion Classes: The emotion analysis classifies emotions into six distinct classes: “Anger”, “Sadness”, “Joy”, “Surprise”, “Love”, and “Disgust”. -Storing Results: The emotion analysis results for the article summaries are stored in new columns named “article_summary_emotion”, while the emotion analysis results for the headlines are stored in new columns named “headline_emotion”.

By performing text summarization, sentiment analysis, and emotion analysis using LLMs, we enrich our dataset with valuable insights into the sentiments and emotions conveyed in the news headlines and article summaries. These enriched features provide a deeper understanding of the content and help facilitate further analysis and interpretation of the data.

from transformers import pipeline
import pandas as pd

# Load your data
df = pd.read_csv('/content/cleaned_and_filtered.csv')

# Load the sentiment and emotion pipelines
sentiment_analysis = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", device=0)  # device -1 for CPU
emotion_analysis = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion", device=0)

# Function to apply the model in batches
def apply_model_in_batches(model, texts, batch_size=100):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        results.extend(model(batch))
    return results

# Apply sentiment analysis in batches
sentiment_results = apply_model_in_batches(sentiment_analysis, df['TITLE'].tolist())
df['sentiment'] = [result['label'] for result in sentiment_results]
df['sentiment_score'] = [result['score'] for result in sentiment_results]

# Apply emotion analysis in batches
emotion_results = apply_model_in_batches(emotion_analysis, df['TITLE'].tolist())
df['emotion'] = [result['label'] for result in emotion_results]
df['emotion_score'] = [result['score'] for result in emotion_results]

def label_to_sentiment(label):
    mapping = {
        'LABEL_0': 'negative',
        'LABEL_1': 'neutral',
        'LABEL_2': 'positive'
    }
    return mapping.get(label, 'unknown')  # Return 'unknown' if the label is not recognized

# Apply the function to the 'sentiment' column
df['sentiment'] = df['sentiment'].apply(label_to_sentiment)

# Optionally save the updated DataFrame back to a CSV
df.to_csv('main.csv', index=False)

# Print the head of the DataFrame to verify changes

import pandas as pd
df=pd.read_csv('main.csv')
df.head()

	Unnamed: 0	DATE	AUTHOR	TITLE	cleaned	sentiment	sentiment_score	emotion	emotion_score	ARTICLE
0	205074	2019-06-29	Field Level Media	Sanchez, Nationals shut down Tigers	Sanchez , Nationals shut tiger	neutral	0.892794	anger	0.772276	Anibal Sanchez pitched six strong innings Frid...
1	205655	2019-06-11	Field Level Media	Rays top A's as Morton's unbeaten streak hits 21	ray Morton unbeaten streak hit 21	neutral	0.626775	joy	0.990553	Aided by three home runs, right-hander Charlie...
2	205885	2019-06-15	Field Level Media	Jimenez, White Sox crush Sabathia, Yankees	Jimenez , White Sox crush Sabathia , Yankees	neutral	0.899590	anger	0.657171	EditorsNote: Changed stat ‘eight runs’ to ‘fiv...
3	206081	2019-06-15	Field Level Media	Giants crack three homers, down Brewers	giant crack homer , brewer	neutral	0.829722	anger	0.467322	Kevin Pillar had three hits and drove in the t...
4	206131	2019-06-19	Field Level Media	A's crack six homers, obliterate Orioles	crack homer , obliterate Orioles	neutral	0.706669	anger	0.490198	Beau Taylor triggered a six-homer assault on B...