Introduction to Fake News Detection
In today’s digital era, fake news spreads faster than the truth, and the consequences can be serious. From influencing elections to spreading health misinformation, tackling fake news is more important than ever. Fake news detection might seem like a job best suited for cutting-edge transformer models like BERT, but can traditional models like LSTM still hold their ground?
The Tech Question
Which model is better at detecting fake news — a classic LSTM or a modern transformer like BERT? In this guide, we’ll compare two approaches for detecting fake news using deep learning:
- LSTM trained from scratch
- BERT fine-tuned using HuggingFace
Dataset: Fake and Real News Dataset
We’ll be using the popular dataset from Kaggle, which contains over 44,000 news articles, split into:
- REAL: Legitimate news from verified sources
- FAKE: Fabricated news with misleading content
Each entry includes:
- Title: the title of the news article
- Text: the body text of the article
- Subject: the subject/category of the article
- Date: the publish date of the article
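Before modeling, it helps to spot-check the raw files. A minimal sketch, assuming the two CSVs are named Fake.csv and True.csv as in the steps that follow:
import pandas as pd

# Quick sanity check of the raw files from the Kaggle dataset
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
print(df_fake.columns.tolist())      # expected: ['title', 'text', 'subject', 'date']
print(len(df_fake) + len(df_real))   # a bit over 44,000 articles in total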
Approach 1: LSTM Trained from Scratch
Step 1: Import Libraries
Essential packages for data handling, preprocessing, and modeling.
import pandas as pd, numpy as np, re, nltk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout
from tensorflow.keras.callbacks import EarlyStopping
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
Step 2: Load Dataset
Read the fake and real news datasets.
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
Step 3: Add Labels
Label fake as 0 and real as 1.
df_fake['label'] = 0
df_real['label'] = 1
Step 4: Combine & Shuffle
Merge both datasets and shuffle.
df = pd.concat([df_fake, df_real], axis=0).sample(frac=1).reset_index(drop=True)
Step 5: Clean the Text
Remove HTML, punctuation, numbers, and stopwords.
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)     # strip HTML tags
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    text = re.sub(r'\d+', '', text)       # remove numbers
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text
df['text'] = df['title'] + " " + df['text']
df['text'] = df['text'].apply(clean_text)
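To see what the cleaner does, here is a quick illustrative call (the sample sentence is made up):
sample = "<p>Breaking: 3 reasons THIS story is false!</p>"
print(clean_text(sample))   # -> "breaking reasons story false"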
Step 6: Tokenize & Pad
Convert text to sequences and pad them.
tokenizer = Tokenizer(num_words=50000, oov_token="<oov>")
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
max_length = 500
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
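To confirm what this step produces, a small illustrative check (the exact word indices depend on your fitted vocabulary):
sample_seq = tokenizer.texts_to_sequences(["president said economy growing"])
print(sample_seq)   # a list of word indices; words outside the vocabulary map to the <oov> index
print(pad_sequences(sample_seq, maxlen=max_length, padding='post').shape)   # (1, 500)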
Step 7: Train-Test Split
Split into training and validation datasets.
X = padded_sequences
y = df['label'].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
Step 8: Build the LSTM Model
Create a stacked Bidirectional LSTM model.
model = Sequential([
Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length),
Bidirectional(LSTM(128, return_sequences=True)),
Bidirectional(LSTM(64)),
Dense(64, activation='relu'),
Dropout(0.5),
Dense(1, activation='sigmoid')
])
Step 9: Compile and Train
Compile the model and train it with early stopping.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stop])
Step 10: Visualize Accuracy
Plot training and validation accuracy.
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.title("Training vs Validation Accuracy")
plt.show()
Step 11: Evaluate Model
Final validation accuracy after training.
loss, acc = model.evaluate(X_val, y_val)
print(f"n Final Validation Accuracy: {acc:.4f}")
Output:
Final Validation Accuracy: 0.9125
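To classify a new article with the trained LSTM, push it through the same cleaning, tokenization, and padding pipeline. A minimal sketch (the headline is invented, and the 0.5 cut-off is a common default rather than a tuned threshold):
def predict_with_lstm(text):
    cleaned = clean_text(text)
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_length, padding='post')
    prob_real = model.predict(padded)[0][0]   # sigmoid output: probability of REAL (label 1)
    return "REAL" if prob_real >= 0.5 else "FAKE"

print(predict_with_lstm("Scientists confirm new vaccine reduces hospitalizations"))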
Approach 2: BERT Fine-Tuning with HuggingFace
Why build from scratch when you can fine-tune a powerful pre-trained model? In this approach, we’ll use bert-base-uncased, a general-purpose language model from HuggingFace, and fine-tune it on the Fake and Real News dataset. BERT understands syntax and context, making it ideal for classification tasks like fake news detection.
What Makes BERT Powerful?
- Pre-trained on a huge corpus (Wikipedia + BookCorpus)
- Captures contextual relationships between words
- Works great with minimal preprocessing
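To see what "minimal preprocessing" means in practice, you can inspect the tokenizer directly: bert-base-uncased lowercases the input and splits rare words into WordPiece subwords, so no manual stopword removal or stemming is needed. A small illustrative check:
from transformers import BertTokenizer
tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Misinformation spreads faster than corrections"))
# rare or unseen words come back as '##'-prefixed subword pieces instead of being dropped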
Step-by-Step Implementation
1. Install Required Libraries
pip install transformers datasets tensorflow
2. Load and Prepare the Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
df_fake['label'] = 0
df_real['label'] = 1
df = pd.concat([df_fake, df_real], axis=0).sample(frac=1).reset_index(drop=True)
df['text'] = df['title'] + " " + df['text']
train_texts, val_texts, train_labels, val_labels = train_test_split(
df['text'].tolist(), df['label'].tolist(), test_size=0.2, random_state=42)
3. Tokenization with BERT Tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
4. Prepare TensorFlow Datasets
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_labels
)).shuffle(1000).batch(16)
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_labels
)).batch(16)
5. Load and Fine-Tune BERT
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# The model returns raw logits, so compute the loss from logits; a standard Keras Adam
# optimizer replaces transformers' AdamW, which is a PyTorch-only optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_dataset, validation_data=val_dataset, epochs=3)
6. Evaluate Model Performance
loss, accuracy = model.evaluate(val_dataset)
print(f"n Final Validation Accuracy: {accuracy:.4f}")
Output:
Final Validation Accuracy: 0.9314
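As with the LSTM, you can sanity-check the fine-tuned model on new text. A minimal sketch (the headline is invented; TFBertForSequenceClassification returns raw logits, so we apply softmax ourselves):
import tensorflow as tf

def predict_with_bert(text):
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="tf")
    logits = model(dict(inputs)).logits
    probs = tf.nn.softmax(logits, axis=-1).numpy()[0]
    return "REAL" if probs[1] >= probs[0] else "FAKE"   # label 1 = real, label 0 = fake

print(predict_with_bert("Government announces new policy on renewable energy"))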
Comparative Analysis: LSTM vs BERT
Now that we’ve implemented and evaluated both models, let’s compare them side by side:
- Validation accuracy: the LSTM reached about 0.9125, while fine-tuned BERT reached about 0.9314
- Training: the LSTM is trained from scratch on this dataset; BERT is pre-trained on a huge corpus and only fine-tuned here
- Preprocessing: the LSTM pipeline needs manual cleaning, tokenization, and padding; BERT works with minimal preprocessing via its own tokenizer
- Compute: the LSTM is lighter to train and deploy; BERT fine-tuning demands more memory and time
Conclusion: BERT edges out LSTM in accuracy due to its deep understanding of context. However, if you’re short on compute or want a simpler model, LSTM still provides excellent results.
When to Use LSTM vs BERT
Choosing between LSTM and BERT depends on your goals and resources:
- Use LSTM if:
  - You're constrained on computational resources
  - You want to build models from scratch for educational purposes
  - Your dataset is relatively small and domain-specific
- Use BERT if:
  - You need state-of-the-art accuracy
  - You're working with large or noisy text data
  - You want to leverage transfer learning for better generalization
Conclusion
Both LSTM and BERT are powerful in their own right. While BERT dominates with context-awareness and pre-training, LSTMs remain relevant for faster deployment and simpler pipelines. In the battle against fake news, picking the right model is just one part — the key is using technology ethically and effectively to promote truth.
FAQs
Q: What is the main difference between LSTM and BERT?
A: LSTM is a type of Recurrent Neural Network (RNN) that is trained from scratch, while BERT is a pre-trained language model that is fine-tuned for specific tasks.
Q: Which model is more accurate for fake news detection?
A: BERT is more accurate for fake news detection due to its deep understanding of context and pre-training on a large corpus.
Q: What are the advantages of using LSTM?
A: LSTM is simpler to implement, requires fewer computational resources, and works well for smaller, domain-specific datasets.