Vanilla RNN vs LSTM on the IMDB Dataset: A Comparative Study for Sentiment Analysis

Introduction

Sentiment analysis is a fundamental task in natural language processing (NLP) that focuses on identifying the emotional tone behind textual data. From product reviews and social media posts to movie critiques, understanding sentiment helps businesses and researchers extract meaningful insights from large volumes of text. One of the most commonly used benchmark datasets for this task is the IMDB movie reviews dataset, which contains labeled positive and negative reviews.

Recurrent Neural Networks (RNNs) have traditionally been a popular choice for sentiment analysis because they are designed to handle sequential data such as text. However, standard RNNs often struggle with long-term dependencies, which led to the development of improved architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.

In this blog, we conduct a comparative study between a Vanilla RNN and a Long Short-Term Memory (LSTM) network for sentiment classification on the IMDB dataset. The goal is not only to compare performance metrics but also to understand how a standard RNN behaves during training, identify its practical limitations, and observe how LSTM addresses these shortcomings through its gated memory mechanism. Through this comparison, we highlight the strengths, weaknesses, and real-world trade-offs between these two sequential models.

Why Sentiment Analysis with RNNs?

Text data is inherently sequential. The meaning of a word often depends on the words that come before it, making sequence modeling essential for NLP tasks. RNNs are specifically designed to process sequences by maintaining a hidden state that captures information from previous time steps.

For sentiment analysis, this sequential modeling capability allows RNN-based models to understand context, negations, and dependencies across words in a sentence or paragraph. For example, phrases like “not good” or “although the movie started slow, it ended brilliantly” require contextual understanding that simple bag-of-words models often fail to capture.

Despite newer architectures like Transformers gaining popularity, RNN-based models remain important for understanding the evolution of sequence modeling and for use cases where computational resources are limited.

IMDB Dataset Overview

The IMDB dataset is a widely used benchmark for binary sentiment classification. It consists of movie reviews labeled as either positive or negative, making it ideal for evaluating text classification models.

Dataset Characteristics

Task: Binary sentiment classification (positive / negative)
Total reviews: 50,000
Training set: 25,000 reviews
Test set: 25,000 reviews
Balanced classes: Equal number of positive and negative samples

Preprocessing Steps

Common preprocessing steps applied to the IMDB dataset include:

Text tokenization
Converting words into integer indices
Limiting vocabulary size
Padding or truncating sequences to a fixed length

These steps ensure that the text data can be efficiently processed by neural network models.

Recurrent Neural Network Architectures

This section provides a high-level overview of the two recurrent architectures compared in this study: the Vanilla RNN and the LSTM.

Vanilla RNN

The Vanilla RNN is the simplest form of recurrent neural network. It processes sequences by updating a hidden state at each time step using the current input and the previous hidden state.

Advantages:

Simple and easy to implement
Fewer parameters
Useful for short sequences

Limitations:

Suffers from the vanishing gradient problem
Struggles with long-term dependencies
Performance degrades on longer text sequences

Because IMDB reviews can be lengthy, vanilla RNNs often fail to retain important contextual information from earlier parts of the review.

Long Short-Term Memory (LSTM)

LSTM networks were introduced to overcome the limitations of vanilla RNNs. They use a gated architecture that controls the flow of information through the network.

Key components of LSTM include:

Forget gate
Input gate
Output gate
Cell state

These mechanisms allow LSTMs to selectively remember or forget information over long sequences.

Advantages:

Excellent at capturing long-term dependencies
Stable training behavior
Strong performance on text-based tasks

Limitations:

Computationally expensive
Larger number of parameters
Slower training compared to simpler models

Want to learn more about RNN and LSTM architectures? Read our detailed guide explaining how Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) models work.

Experimental Setup

To ensure a fair and controlled comparison between the Vanilla RNN and the LSTM, both models were trained under identical experimental conditions. The only difference between experiments was the type of recurrent unit. This controlled design allows performance differences to be attributed to the architecture rather than to the training setup.

Dataset Split

The IMDb dataset consists of 50,000 labeled reviews, evenly split into:

25,000 training samples
25,000 test samples

From the training set, a validation subset was created to monitor generalization performance and enable early stopping. The test set was used strictly for final evaluation and was not accessed during training.

Model Configuration

Both RNN and LSTM models were configured with identical hyperparameters:

Vocabulary Size: 20,000
Embedding Dimension: 200
Hidden Size: 128
Number of Layers: 1
Direction: Unidirectional
Dropout: 0.5
Output Dimension: 1 (Binary classification)

This ensures architectural fairness across experiments.

Training Configuration

Loss Function: Binary Cross Entropy with Logits (BCEWithLogitsLoss)
Optimizer: Adam
Learning Rate: 0.001
Weight Decay: 1e-4
Batch Size: 128
Early Stopping: Based on validation loss

The models output raw logits, which are passed directly to BCEWithLogitsLoss. Sigmoid activation is applied only during evaluation for probability estimation.

Regularization Strategy

To improve generalization and prevent overfitting:

Dropout (0.5) was applied before the final classification layer.
L2 regularization was implemented through weight decay in the optimizer.
Early stopping was employed to halt training when validation performance stopped improving.
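The early-stopping rule can be sketched as a small helper. This is a minimal illustration, not the notebook's code: the class name `EarlyStopping` and its interface are hypothetical, and the patience value is not stated in this post (the default of 10 is an assumption that happens to match the gap between the best epoch and the stopping epoch reported in the results).

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):  # assumption: patience is not given in the post
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy validation curve: improves until epoch 3, then gets worse.
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.69, 0.60, 0.55, 0.56, 0.57, 0.58], start=1):
    if stopper.step(epoch, loss):
        break
print(epoch, stopper.best_epoch)  # 5 3
```

In practice the model weights from the best epoch would be saved and restored before the final test evaluation.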

Evaluation Metrics

Model performance was evaluated using:

Training Loss
Validation Loss
Training Accuracy
Validation Accuracy
Final Test Accuracy

Binary predictions were obtained by applying a sigmoid function to logits and thresholding at 0.5.
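As a minimal sketch of this thresholding step for a single review, assuming one raw logit per review (the helper name `predict_label` is hypothetical, and treating a probability of exactly 0.5 as positive is an assumption):

```python
import math

def predict_label(logit, threshold=0.5):
    """Convert a raw logit into (label, probability of the positive class)."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    label = "Positive" if prob >= threshold else "Negative"
    return label, prob

print(predict_label(2.0))
print(predict_label(-1.0))
```

Thresholding the probability at 0.5 is equivalent to checking whether the logit is non-negative, so the sigmoid is only needed when a probability is reported.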

Implementation Details

This section describes the technical implementation, including data preprocessing, model construction, and training workflow.

Data Preparation

Effective sentiment modeling begins with structured and consistent text preprocessing. Since neural networks operate on numerical tensors rather than raw text, the IMDb reviews were transformed into model-ready sequences through a carefully designed preprocessing pipeline.

1. Text Normalization and Tokenization

A lightweight custom tokenizer was implemented to standardize input text before numerical encoding.

Python

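import re

def simple_tokenize(text):
    """Lowercase, strip non-alphanumeric characters, split on whitespace."""
    text = text.lower()
    # Drop everything except a-z, 0-9, and whitespace (e.g. "don't" -> "dont").
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()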

This tokenizer performs three key operations:

Lowercasing – All text is converted to lowercase to reduce vocabulary size and avoid treating words like “Movie” and “movie” as separate tokens.
Character Filtering – A regular expression removes all non-alphanumeric characters except whitespace. This eliminates punctuation and special symbols that are unlikely to contribute meaningfully to binary sentiment classification.
Whitespace Tokenization – The cleaned text is split on whitespace (spaces, tabs, and newlines) to produce word-level tokens.

Although simple, this approach is effective for baseline sentiment analysis tasks and maintains computational efficiency without introducing external dependencies.

2. Vocabulary Construction

A vocabulary was built from the training dataset only to prevent data leakage. Tokens were ranked by frequency, and the top 20,000 most frequent words were retained.

Two special tokens were introduced:

<PAD> (index 0) — used for sequence padding
<UNK> — used to represent out-of-vocabulary words

Limiting vocabulary size improves training efficiency and reduces noise introduced by rare terms.
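A frequency-ranked vocabulary of this kind can be sketched with `collections.Counter`. The function name `build_vocab`, the index chosen for `<UNK>`, and the decision to count the two special tokens inside the 20,000 limit are assumptions made for illustration; the last one matches the 20,000-row embedding table described later in this post.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(tokenized_reviews, max_size=20_000):
    """Map the most frequent training tokens to integer indices."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    word_to_idx = {PAD: 0, UNK: 1}  # assumption: <UNK> takes index 1
    # Reserve two slots for the special tokens so the table has max_size rows.
    for word, _ in counts.most_common(max_size - len(word_to_idx)):
        word_to_idx[word] = len(word_to_idx)
    return word_to_idx

# Toy training set: only the two most frequent words fit when max_size=4.
train_tokens = [["great", "movie"], ["bad", "movie"], ["great", "cast"]]
word_to_idx = build_vocab(train_tokens, max_size=4)
print(word_to_idx)
```

Only training reviews should be passed in, so that test-set words never influence which tokens enter the vocabulary.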

3. Integer Encoding

Each token was mapped to its corresponding index in the vocabulary dictionary (word_to_idx).

This transforms text into sequences of integers suitable for embedding lookup.
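The encoding step amounts to a dictionary lookup with a fallback. A minimal sketch, in which the helper name `encode` and the toy `word_to_idx` contents are illustrative only:

```python
def encode(tokens, word_to_idx, unk_token="<UNK>"):
    """Replace each token with its vocabulary index, falling back to <UNK>."""
    unk_idx = word_to_idx[unk_token]
    return [word_to_idx.get(tok, unk_idx) for tok in tokens]

word_to_idx = {"<PAD>": 0, "<UNK>": 1, "great": 2, "movie": 3}
print(encode(["great", "movie", "tonight"], word_to_idx))  # [2, 3, 1]
```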

4. Sequence Length Analysis

Movie reviews in the IMDb dataset vary significantly in length. Instead of selecting an arbitrary truncation threshold, the distribution of review lengths was analyzed.

Based on this distribution, max_length was set to 595 tokens.

This value corresponds to approximately the 95th percentile of review lengths, meaning that 95% of all reviews fall within this length.

Choosing the 95th percentile provides a balanced trade-off:

Preserves most contextual information
Prevents excessive padding
Controls memory usage
Improves training efficiency
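A percentile-based cutoff can be computed directly from the list of review lengths. The sketch below uses the nearest-rank definition of a percentile, which is an assumption (the notebook may use a library routine that interpolates instead), and toy lengths rather than the real IMDb statistics.

```python
import math

def percentile_length(lengths, pct=95):
    """Nearest-rank percentile: smallest length covering pct% of reviews."""
    ordered = sorted(lengths)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

review_lengths = list(range(1, 101))  # toy data: lengths 1..100
max_length = percentile_length(review_lengths, pct=95)
print(max_length)  # 95
```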

5. Padding and Truncation Strategy

To ensure consistent batch processing:

Sequences shorter than 595 tokens were padded using <PAD> (index 0).
Sequences longer than 595 tokens were truncated to 595 tokens.

This ensures consistent input shape across batches.
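Padding and truncation can be handled by one helper that also returns the true length needed later for packing. Keeping the first 595 tokens of a long review and padding on the right are assumptions; the post does not say which end is truncated, and `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(indices, max_length=595, pad_idx=0):
    """Return (fixed-length sequence, original length capped at max_length)."""
    clipped = indices[:max_length]  # assumption: keep the first tokens
    length = len(clipped)
    padded = clipped + [pad_idx] * (max_length - length)  # pad on the right
    return padded, length

print(pad_or_truncate([5, 8, 2], max_length=5))  # ([5, 8, 2, 0, 0], 3)
```

Right-padding keeps the real tokens at the start of each row, which is the layout `pack_padded_sequence()` expects.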

6. Handling Variable-Length Sequences

To prevent padded tokens from influencing hidden state updates, pack_padded_sequence() was used before passing inputs to the recurrent layer.

This improves:

Computational efficiency
Gradient flow
Model stability

The original sequence lengths were passed alongside the padded inputs.
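The effect of packing can be checked on a toy batch: the final hidden state of a padded sequence should equal the one obtained by running the unpadded sequence on its own. The layer sizes and the use of `enforce_sorted=False` are assumptions chosen for this sketch, not settings confirmed by the post.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

# Two sequences of true lengths 5 and 2; the second is zero-padded to length 5.
batch = torch.randn(2, 5, 4)
batch[1, 2:] = 0.0
lengths = torch.tensor([5, 2])  # must live on the CPU

packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
_, h_packed = rnn(packed)  # shape: (num_layers, batch, hidden)

# Reference: run the short sequence alone, without any padding.
_, h_alone = rnn(batch[1:2, :2])

# True when the padded positions did not affect the final hidden state.
print(torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-6))
```

Without packing, the recurrent layer would keep updating the hidden state on the padded positions, so the "final" state of a short review would depend on how much padding followed it.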

Model Architecture Implementation

This section describes the implementation of the sentiment classification models used in the experiments. Both architectures follow a similar pipeline consisting of an embedding layer, a recurrent sequence modeling layer, a dropout regularization layer, and a fully connected output layer. The primary difference between the two models lies in the type of recurrent unit used: a Vanilla RNN or an LSTM.

The models are implemented using the PyTorch deep learning framework.

Overall Model Pipeline

The sentiment classification model follows this processing pipeline:

Tokenized Review → Embedding Layer → Recurrent Layer (RNN/LSTM) → Dropout → Fully Connected Layer → Output Logits

The final output represents the predicted sentiment score for each review.

1. Embedding Layer

The first layer of the model is an embedding layer that converts token indices into dense vector representations.

Python

self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

Configuration –

Vocabulary size: 20,000
Embedding dimension: 200
Padding index: 0

Purpose –

Text input to the model is represented as integer indices corresponding to words in the vocabulary. However, neural networks cannot directly interpret these indices as meaningful representations. The embedding layer addresses this by mapping each word index to a dense vector in a continuous vector space.

Instead of representing words using sparse one-hot vectors, embeddings allow the model to learn semantic relationships between words during training. Words that appear in similar contexts often develop similar vector representations.

For example, words such as “excellent”, “amazing”, and “fantastic” may eventually occupy nearby positions in the embedding space.

The padding_idx=0 parameter initializes the embedding vector for the padding token to zeros and excludes it from gradient updates, so padding used for sequence alignment never acquires a learned representation.

2. Recurrent Sequence Modelling Layer

After the embedding layer, the sequence of word vectors is passed to a recurrent neural network layer that processes the review sequentially.

Two variants were implemented:

Vanilla RNN
LSTM

Both layers operate on sequences of embeddings and update hidden states as they process each token in the sequence.

a. Vanilla RNN Layer

The Vanilla RNN model uses the following configuration:

Python

self.rnn = nn.RNN(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    num_layers=1,
    nonlinearity="tanh",
    batch_first=True
)

Configuration –

Input size: 200
Hidden size: 128
Number of layers: 1
Activation function: tanh
Batch-first format: True

Working Mechanism –

At each time step, the RNN updates its hidden state using the current input embedding and the previous hidden state.

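The update rule can be expressed as:

$$h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$$

Where:

$x_t$ is the input embedding at time step $t$
$h_{t-1}$ is the previous hidden state
$h_t$ is the updated hidden state
$W_{ih}$ and $W_{hh}$ are the learned input-to-hidden and hidden-to-hidden weight matrices, with biases $b_{ih}$ and $b_{hh}$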

As the network processes the review token by token, the hidden state gradually accumulates contextual information about the sequence.

After the entire sequence has been processed, the final hidden state is used as the representation of the entire review.
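To make the recurrence concrete, here is a dependency-free sketch of the tanh update applied over a three-token sequence. The weights are made up for illustration, the sizes are tiny (3-dimensional embeddings, 2-dimensional hidden state), and the two bias vectors that PyTorch's nn.RNN keeps are folded into a single `b`.

```python
import math

def rnn_step(x_t, h_prev, W_ih, W_hh, b):
    """One tanh RNN update: h_t = tanh(W_ih x_t + W_hh h_prev + b)."""
    h_t = []
    for i in range(len(h_prev)):
        pre = b[i]
        pre += sum(W_ih[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

# Made-up weights for a toy layer.
W_ih = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_hh = [[0.1, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

h = [0.0, 0.0]  # initial hidden state
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h = rnn_step(x_t, h, W_ih, W_hh, b)
print(h)  # final hidden state, used as the review representation
```

Every earlier token reaches the final state only through repeated multiplication by `W_hh` and squashing by tanh, which is where the vanishing-gradient problem on long reviews comes from.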

b. LSTM Layer

The LSTM model replaces the vanilla RNN layer with an LSTM layer.

Python

self.lstm = nn.LSTM(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    batch_first=True
)

Configuration –

Input size: 200
Hidden size: 128
Single-layer LSTM
Unidirectional processing

Working Mechanism –

Unlike vanilla RNNs, LSTMs maintain two states:

Hidden state ($h_t$)
Cell state ($c_t$)

The cell state acts as a long-term memory that allows the network to preserve information over long sequences.

LSTMs use three gating mechanisms to control information flow:

Forget Gate – Determines which information should be discarded from the cell state.

Input Gate – Controls which new information should be added to the cell state.

Output Gate – Determines which parts of the cell state should influence the hidden state output.

These gates enable LSTMs to capture long-term dependencies in text sequences and significantly mitigate the vanishing gradient problem that affects vanilla RNNs.
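For reference, one common way to write the gate computations is shown below. The notation follows the PyTorch nn.LSTM documentation and is an assumption about presentation rather than something taken from the original post; $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $g_t$ is the candidate cell update.

```latex
\begin{aligned}
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) && \text{(forget gate)} \\
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) && \text{(input gate)} \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) && \text{(candidate update)} \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Because the cell state is updated additively, the gradient along the direct path from $c_t$ to $c_{t-1}$ is scaled element-wise by $f_t$ instead of passing through a tanh and a full weight matrix at every step. When the forget gate stays close to 1, information and gradients can persist across many time steps.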

3. Dropout Regularization

Before classification, dropout is applied to the final hidden representation.

Python

self.dropout = nn.Dropout(0.5)

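During training, dropout zeroes each element of the hidden representation independently with probability 0.5 and scales the surviving elements by 2 to keep the expected activation unchanged. It is disabled at evaluation time. This reduces overfitting and improves generalization.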

4. Fully Connected Output Layer

The final layer of the model is a fully connected linear layer.

Python

self.fc = nn.Linear(hidden_size, 1)

This layer maps the hidden representation to a single scalar value representing the predicted sentiment score.

The model outputs raw logits, which are passed directly to the loss function.
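Putting the four layers together, the model might look roughly like the sketch below. The class name `SentimentRNN`, the `cell` switch, the attribute name `recurrent`, and the use of `enforce_sorted=False` are all assumptions made for illustration; the notebook's actual class may be organized differently.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SentimentRNN(nn.Module):
    """Embedding -> RNN or LSTM -> Dropout -> Linear, returning raw logits."""

    def __init__(self, vocab_size=20_000, embedding_dim=200, hidden_size=128,
                 cell="rnn", dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        recurrent = nn.LSTM if cell == "lstm" else nn.RNN
        self.recurrent = recurrent(embedding_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)  # (batch, seq, emb)
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, state = self.recurrent(packed)
        # nn.LSTM returns (h_n, c_n); nn.RNN returns h_n only.
        h_n = state[0] if isinstance(state, tuple) else state
        final_hidden = h_n[-1]  # (batch, hidden)
        return self.fc(self.dropout(final_hidden)).squeeze(1)  # (batch,)

model = SentimentRNN(cell="lstm")
token_ids = torch.randint(1, 20_000, (4, 12))  # dummy batch of 4 reviews
lengths = torch.tensor([12, 9, 5, 3])
logits = model(token_ids, lengths)
print(logits.shape)  # torch.Size([4])
```

Switching `cell` between "rnn" and "lstm" changes only the recurrent layer, which mirrors the controlled comparison described in the experimental setup.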

Loss Function and Output Handling

The models return raw logits.

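BCEWithLogitsLoss() internally applies the sigmoid activation and computes binary cross-entropy in a numerically stable manner. For a logit $z$ and a label $y \in \{0, 1\}$, the per-example loss is:

$$\ell(z, y) = -\left[\, y \log \sigma(z) + (1 - y) \log\bigl(1 - \sigma(z)\bigr) \right], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$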
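The numerical-stability point can be illustrated in plain Python. The rearrangement below is the standard log-sum-exp form of binary cross-entropy on logits; it is offered as a sketch of the idea, and PyTorch's internal implementation may differ in detail. The function name `bce_with_logits` is hypothetical.

```python
import math

def bce_with_logits(z, y):
    """Stable binary cross-entropy for one logit z and label y in {0, 1}."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    # rearranged so that exp() never receives a large positive argument.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

print(bce_with_logits(0.0, 1))     # log(2), about 0.6931
print(bce_with_logits(1000.0, 0))  # 1000.0
```

For the second call, a naive implementation would evaluate sigmoid(1000) as exactly 1.0 in floating point and then fail on log(0), whereas the rearranged form returns a finite loss.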

Optimization Strategy

The Adam optimizer was selected due to its adaptive learning rate mechanism and strong empirical performance in NLP tasks.

Weight decay (1e-4) was applied to introduce L2 regularization and reduce overfitting.

Python

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

Model Capacity

Vanilla RNN Total Parameters: 4,042,369

LSTM Total Parameters: 4,169,089

The majority of parameters originate from the embedding matrix (20,000 × 200), which dominates the model’s representational capacity.

The LSTM architecture contains more parameters due to its gated structure, increasing model complexity and learning capacity.
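These totals can be reproduced by hand from the layer sizes. The sketch below assumes PyTorch's parameter layout (separate input and hidden bias vectors per gate block) and a 20,000-row embedding table; the function name `count_parameters` is illustrative.

```python
def count_parameters(vocab_size=20_000, emb=200, hidden=128, gates=1):
    """Parameter count for Embedding -> recurrent layer -> Linear(hidden, 1)."""
    embedding = vocab_size * emb
    # Per gate block: input weights, recurrent weights, and two bias vectors
    # (PyTorch keeps separate input and hidden biases).
    recurrent = gates * (emb * hidden + hidden * hidden + 2 * hidden)
    classifier = hidden + 1
    return embedding + recurrent + classifier

print(count_parameters(gates=1))  # Vanilla RNN: 4042369
print(count_parameters(gates=4))  # LSTM (4 gate blocks): 4169089
```

The recurrent layer accounts for 42,240 parameters in the Vanilla RNN and 168,960 in the LSTM, against 4,000,000 in the embedding, so the two models differ by only about 3% in total size.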

Results and Comparative Analysis

This section presents the experimental results obtained from training Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models for sentiment classification on movie reviews. The models were evaluated using validation loss, test accuracy, and predictions on custom review samples to assess both quantitative performance and qualitative behavior.

Vanilla RNN Results

Training Performance

Early Stopping Triggered: Epoch 19
Best Validation Loss: 0.5234 (Epoch 9)

The model achieved optimal generalization relatively early during training, after which validation performance began to plateau.

Test Performance

Test Loss: 0.5173
Test Accuracy: 75.94%

Although the Vanilla RNN successfully learned sentiment patterns, its performance indicates limitations in handling long contextual dependencies.

Qualitative Evaluation — Custom Reviews

Review	Prediction	Confidence	Observation
Fantastic movie	Negative	0.6833	Misclassification
Terrible movie	Negative	0.8344	Correct
Confusing plot	Negative	0.8809	Correct
Best films ever	Positive	0.9660	Correct
Average movie	Negative	0.8201	Reasonable

Analysis

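The RNN incorrectly classified a strongly positive review ("Fantastic movie") as negative, with relatively low confidence (0.6833). A single two-word example cannot isolate the cause, but the error is consistent with known limitations of traditional RNNs, which affect what the model can learn from long training reviews: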

Difficulty preserving long-range sentiment cues
Sensitivity to earlier tokens in sequences
Gradual information decay during sequence processing

The model tends to rely heavily on localized patterns instead of overall contextual meaning.

LSTM Results

Training Performance

Early stopping triggered: Epoch 20
Best validation loss: 0.3661 (Epoch 10)

The substantially lower validation loss (0.3661 vs. 0.5234) indicates that the LSTM fits the data better and generalizes better than the Vanilla RNN.

Test Performance

Test Loss: 0.3677
Test Accuracy: 84.34%

The LSTM improved test accuracy by 8.4 percentage points over the Vanilla RNN (84.34% vs. 75.94%).

Qualitative Evaluation — Custom Reviews

Review	Prediction	Confidence	Observation
Fantastic movie	✅ Positive	0.5658	Correct
Terrible movie	✅ Negative	0.9872	Correct
Confusing plot	✅ Negative	0.9427	Correct
Best films ever	✅ Positive	0.9962	Correct
Average movie	✅ Negative	0.7564	Correct

Analysis

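Unlike the Vanilla RNN, the LSTM correctly classified all five custom reviews. Five short examples are too small a sample to be conclusive, but the result is consistent with the LSTM's higher test accuracy and its stronger ability to model sentiment across entire sequences.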

Even when confidence values were moderate, predictions remained semantically accurate, indicating better internal representation learning.

Quantitative Comparison

Metric	Vanilla RNN	LSTM
Best Validation Loss	0.5234	0.3661
Test Loss	0.5173	0.3677
Test Accuracy	75.94%	84.34%
Training Stability	Moderate	High
Long Context Understanding	Limited	Strong

Practical Implications

The experimental results suggest:

Vanilla RNNs are suitable as baseline models.
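LSTMs performed substantially better in this experiment (84.34% vs. 75.94% test accuracy) at the cost of a modest increase in parameters, making them the stronger choice for real-world sentiment analysis on long reviews.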

Interactive Model Deployment (Streamlit App)

To make the models accessible for real-time experimentation, an interactive web application is being developed using Streamlit.

The application will allow users to:

Enter custom movie reviews
Select a trained model (RNN or LSTM)
View sentiment predictions instantly
Observe prediction confidence scores

This deployment bridges the gap between research experimentation and real-world usability by enabling anyone to test the trained models directly through a web interface.

The Streamlit application link is provided below:

👉 IMDB Sentiment Classifier · Streamlit

Source Code and Reproducibility

To ensure transparency and reproducibility, the complete implementation used in this study is publicly available. The repository contains a Jupyter Notebook that documents the entire development process step by step.

Readers interested in exploring the implementation details or reproducing the experiments can access the full project below:

👉 GitHub Repository:
View the Complete Notebook and Source Code

The notebook provides a hands-on walkthrough of the entire workflow, from raw text processing to model evaluation, making it suitable for both learning and experimentation.

Conclusion

In this project, we explored sentiment analysis using recurrent neural network architectures on the IMDB movie review dataset, progressing from a basic Vanilla RNN to a Long Short-Term Memory (LSTM) model. The goal was not only to build accurate classifiers but also to understand how different sequence modeling architectures handle textual data and long-range dependencies.

The Vanilla RNN served as an important baseline, demonstrating how sequential neural networks can learn sentiment patterns from text. While it achieved reasonable performance with a test accuracy of 75.94%, qualitative analysis pointed to its limitations in preserving contextual information across long reviews. Its misclassification of a strongly positive review is consistent with the challenges posed by vanishing gradients and limited memory retention.

The LSTM model addressed these shortcomings through its gating mechanisms, enabling selective memory retention and improved contextual understanding. This architectural improvement translated into a significant performance gain, achieving 84.34% test accuracy along with lower validation and test loss. The LSTM consistently produced more reliable predictions and successfully handled nuanced sentiment expressions present in real-world reviews.

Beyond model training, this project emphasized the complete machine learning workflow — including data preprocessing, tokenization, sequence padding, architecture design, experimental evaluation, and deployment considerations. By building an interactive Streamlit application, the models were extended from research experimentation to practical usability, allowing users to test sentiment predictions on their own inputs in real time.

Overall, the experiments demonstrate a key insight in natural language processing: model architecture plays a crucial role in capturing linguistic context. While simple RNNs provide foundational understanding, gated recurrent models like LSTM offer substantial improvements for tasks involving long textual sequences.

This work highlights how deep learning models evolve from theoretical concepts into deployable intelligent systems, forming a strong foundation for further exploration into modern NLP architectures such as GRUs, attention mechanisms, and Transformer-based models.

