Introduction to Fake News Detection
In today’s digital era, fake news spreads faster than the truth, and the consequences can be serious. From influencing elections to spreading health misinformation, tackling fake news is more important than ever. Fake news detection might seem like a job best suited for cutting-edge transformer models like BERT, but can traditional models like LSTM still hold their ground?
The Tech Question
Which model is better at detecting fake news — a classic LSTM or a modern transformer like BERT? In this guide, we’ll compare two approaches for detecting fake news using deep learning:
- LSTM trained from scratch
- BERT fine-tuned using HuggingFace
Dataset: Fake and Real News Dataset
We’ll be using the popular dataset from Kaggle, which contains over 44,000 news articles, split into:
- REAL: Legitimate news from verified sources
- FAKE: Fabricated news with misleading content
Each entry includes:
- Title: the title of the news article
- Text: the body text of the article
- Subject: the subject/category of the article
- Date: the publish date of the article
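Before modeling, it helps to spot-check the raw files. A minimal sketch, assuming the two CSVs are named Fake.csv and True.csv as in the steps that follow:
import pandas as pd

# Quick sanity check of the raw files from the Kaggle dataset
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
print(df_fake.columns.tolist())      # expected: ['title', 'text', 'subject', 'date']
print(len(df_fake) + len(df_real))   # a bit over 44,000 articles in total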
Approach 1: LSTM Trained from Scratch
Step 1: Import Libraries
Essential packages for data handling, preprocessing, and modeling.
import pandas as pd, numpy as np, re, nltk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout
from tensorflow.keras.callbacks import EarlyStopping
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
Step 2: Load Dataset
Read the fake and real news datasets.
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
Step 3: Add Labels
Label fake as 0 and real as 1.
df_fake['label'] = 0
df_real['label'] = 1
Step 4: Combine & Shuffle
Merge both datasets and shuffle.
df = pd.concat([df_fake, df_real], axis=0).sample(frac=1).reset_index(drop=True)
Step 5: Clean the Text
Remove HTML, punctuation, numbers, and stopwords.
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)     # strip HTML tags
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    text = re.sub(r'\d+', '', text)       # remove numbers
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text
df['text'] = df['title'] + " " + df['text']
df['text'] = df['text'].apply(clean_text)
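To see what the cleaner does, here is a quick illustrative call (the sample sentence is made up):
sample = "<p>Breaking: 3 reasons THIS story is false!</p>"
print(clean_text(sample))   # -> "breaking reasons story false"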
Step 6: Tokenize & Pad
Convert text to sequences and pad them.
tokenizer = Tokenizer(num_words=50000, oov_token="<oov>")
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
max_length = 500
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
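To confirm what this step produces, a small illustrative check (the exact word indices depend on your fitted vocabulary):
sample_seq = tokenizer.texts_to_sequences(["president said economy growing"])
print(sample_seq)   # a list of word indices; words outside the vocabulary map to the <oov> index
print(pad_sequences(sample_seq, maxlen=max_length, padding='post').shape)   # (1, 500)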
Step 7: Train-Test Split
Split into training and validation datasets.
X = padded_sequences
y = df['label'].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
Step 8: Build the LSTM Model
Create a stacked Bidirectional LSTM model.
model = Sequential([
Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length),
Bidirectional(LSTM(128, return_sequences=True)),
Bidirectional(LSTM(64)),
Dense(64, activation='relu'),
Dropout(0.5),
Dense(1, activation='sigmoid')
])
Step 9: Compile and Train
Compile the model and train it with early stopping.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), callbacks=[early_stop])
Step 10: Visualize Accuracy
Plot training and validation accuracy.
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.title("Training vs Validation Accuracy")
plt.show()
Step 11: Evaluate Model
Final validation accuracy after training.
loss, acc = model.evaluate(X_val, y_val)
print(f"n Final Validation Accuracy: {acc:.4f}")
Output:
Final Validation Accuracy: 0.9125
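To classify a new article with the trained LSTM, push it through the same cleaning, tokenization, and padding pipeline. A minimal sketch (the headline is invented, and the 0.5 cut-off is a common default rather than a tuned threshold):
def predict_with_lstm(text):
    cleaned = clean_text(text)
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_length, padding='post')
    prob_real = model.predict(padded)[0][0]   # sigmoid output: probability of REAL (label 1)
    return "REAL" if prob_real >= 0.5 else "FAKE"

print(predict_with_lstm("Scientists confirm new vaccine reduces hospitalizations"))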
Approach 2: BERT Fine-Tuning with HuggingFace
Why build from scratch when you can fine-tune a powerful pre-trained model? In this approach, we’ll use bert-base-uncased, a general-purpose language model from HuggingFace, and fine-tune it on the Fake and Real News dataset. BERT understands syntax and context, making it ideal for classification tasks like fake news detection.
What Makes BERT Powerful?
- Pre-trained on a huge corpus (Wikipedia + BookCorpus)
- Captures contextual relationships between words
- Works great with minimal preprocessing
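To see what "minimal preprocessing" means in practice, you can inspect the tokenizer directly: bert-base-uncased lowercases the input and splits rare words into WordPiece subwords, so no manual stopword removal or stemming is needed. A small illustrative check:
from transformers import BertTokenizer
tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Misinformation spreads faster than corrections"))
# rare or unseen words come back as '##'-prefixed subword pieces instead of being dropped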
Step-by-Step Implementation
1. Install Required Libraries
pip install transformers datasets tensorflow
2. Load and Prepare the Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
df_fake = pd.read_csv("Fake.csv")
df_real = pd.read_csv("True.csv")
df_fake['label'] = 0
df_real['label'] = 1
df = pd.concat([df_fake, df_real], axis=0).sample(frac=1).reset_index(drop=True)
df['text'] = df['title'] + " " + df['text']
train_texts, val_texts, train_labels, val_labels = train_test_split(
df['text'].tolist(), df['label'].tolist(), test_size=0.2, random_state=42)
3. Tokenization with BERT Tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
4. Prepare TensorFlow Datasets
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_labels
)).shuffle(1000).batch(16)
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_labels
)).batch(16)
5. Load and Fine-Tune BERT
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# The model returns raw logits, so compute the loss from logits; a standard Keras Adam
# optimizer replaces transformers' AdamW, which is a PyTorch-only optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_dataset, validation_data=val_dataset, epochs=3)
6. Evaluate Model Performance
loss, accuracy = model.evaluate(val_dataset)
print(f"n Final Validation Accuracy: {accuracy:.4f}")
Output:
Final Validation Accuracy: 0.9314
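As with the LSTM, you can sanity-check the fine-tuned model on new text. A minimal sketch (the headline is invented; TFBertForSequenceClassification returns raw logits, so we apply softmax ourselves):
import tensorflow as tf

def predict_with_bert(text):
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="tf")
    logits = model(dict(inputs)).logits
    probs = tf.nn.softmax(logits, axis=-1).numpy()[0]
    return "REAL" if probs[1] >= probs[0] else "FAKE"   # label 1 = real, label 0 = fake

print(predict_with_bert("Government announces new policy on renewable energy"))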
Comparative Analysis: LSTM vs BERT
Now that we’ve implemented and evaluated both models, let’s compare them side by side:
- Validation accuracy: the LSTM reached about 0.9125, while fine-tuned BERT reached about 0.9314
- Training: the LSTM is trained from scratch on this dataset; BERT is pre-trained on a huge corpus and only fine-tuned here
- Preprocessing: the LSTM pipeline needs manual cleaning, tokenization, and padding; BERT works with minimal preprocessing via its own tokenizer
- Compute: the LSTM is lighter to train and deploy; BERT fine-tuning demands more memory and time
Conclusion: BERT edges out LSTM in accuracy due to its deep understanding of context. However, if you’re short on compute or want a simpler model, LSTM still provides excellent results.
When to Use LSTM vs BERT
Choosing between LSTM and BERT depends on your goals and resources:
- Use LSTM if:
  - You're constrained on computational resources
  - You want to build models from scratch for educational purposes
  - Your dataset is relatively small and domain-specific
- Use BERT if:
  - You need state-of-the-art accuracy
  - You're working with large or noisy text data
  - You want to leverage transfer learning for better generalization
Conclusion
Both LSTM and BERT are powerful in their own right. While BERT dominates with context-awareness and pre-training, LSTMs remain relevant for faster deployment and simpler pipelines. In the battle against fake news, picking the right model is just one part — the key is using technology ethically and effectively to promote truth.
FAQs
Q: What is the main difference between LSTM and BERT?
A: LSTM is a type of Recurrent Neural Network (RNN) that is trained from scratch, while BERT is a pre-trained language model that is fine-tuned for specific tasks.
Q: Which model is more accurate for fake news detection?
A: BERT is more accurate for fake news detection due to its deep understanding of context and pre-training on a large corpus.
Q: What are the advantages of using LSTM?
A: LSTM is simpler to implement, requires fewer computational resources, and works well for smaller, domain-specific datasets.