Primarily, we enrich our data by obtaining certain metrics from the current data.
Our dataset is enhanced by adding additional information obtained from the existing data.
Text Summarization: For each article in our dataset, we utilize an LLM (Language Model) based text summarization method. This method condenses the content of each article into a concise summary consisting of two sentences. The resulting summaries are stored in a new column named “article_summary”.
Sentiment Analysis: We employ LLMs to analyze the sentiment expressed in both the news headlines and the article summaries. -Sentiment Categories: The sentiment analysis categorizes the sentiment into three main categories: “Positive”, “Negative”, and “Neutral”. -Storing Results: The sentiment analysis results for the article summaries are stored in new columns named “article_summary_sentiment”, while the sentiment analysis results for the headlines are stored in new columns named “headline_sentiment”.
Emotion Analysis: Similarly, we utilize LLMs to analyze the emotions conveyed in both the news headlines and the article summaries. -Emotion Classes: The emotion analysis classifies emotions into six distinct classes: “Anger”, “Sadness”, “Joy”, “Surprise”, “Love”, and “Disgust”. -Storing Results: The emotion analysis results for the article summaries are stored in new columns named “article_summary_emotion”, while the emotion analysis results for the headlines are stored in new columns named “headline_emotion”.
By performing text summarization, sentiment analysis, and emotion analysis using LLMs, we enrich our dataset with valuable insights into the sentiments and emotions conveyed in the news headlines and article summaries. These enriched features provide a deeper understanding of the content and help facilitate further analysis and interpretation of the data.
from transformers import pipelineimport pandas as pd# Load your datadf = pd.read_csv('/content/cleaned_and_filtered.csv')# Load the sentiment and emotion pipelinessentiment_analysis = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", device=0) # device -1 for CPUemotion_analysis = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion", device=0)# Function to apply the model in batchesdef apply_model_in_batches(model, texts, batch_size=100): results = []for i inrange(0, len(texts), batch_size): batch = texts[i:i + batch_size] results.extend(model(batch))return results# Apply sentiment analysis in batchessentiment_results = apply_model_in_batches(sentiment_analysis, df['TITLE'].tolist())df['sentiment'] = [result['label'] for result in sentiment_results]df['sentiment_score'] = [result['score'] for result in sentiment_results]# Apply emotion analysis in batchesemotion_results = apply_model_in_batches(emotion_analysis, df['TITLE'].tolist())df['emotion'] = [result['label'] for result in emotion_results]df['emotion_score'] = [result['score'] for result in emotion_results]def label_to_sentiment(label): mapping = {'LABEL_0': 'negative','LABEL_1': 'neutral','LABEL_2': 'positive' }return mapping.get(label, 'unknown') # Return 'unknown' if the label is not recognized# Apply the function to the 'sentiment' columndf['sentiment'] = df['sentiment'].apply(label_to_sentiment)# Optionally save the updated DataFrame back to a CSVdf.to_csv('main.csv', index=False)# Print the head of the DataFrame to verify changes
import pandas as pddf=pd.read_csv('main.csv')df.head()