Introduction

Sentiment analysis is a fundamental task in natural language processing (NLP) that focuses on identifying the emotional tone behind textual data. From product reviews and social media posts to movie critiques, understanding sentiment helps businesses and researchers extract meaningful insights from large volumes of text. One of the most commonly used benchmark datasets for this task is the IMDB movie reviews dataset, which contains labeled positive and negative reviews.

Recurrent Neural Networks (RNNs) have traditionally been a popular choice for sentiment analysis because they are designed to handle sequential data such as text. However, standard RNNs often struggle with long-term dependencies, which led to the development of improved architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.

In this blog, we conduct a comparative study between a Vanilla RNN and a Long Short-Term Memory (LSTM) network for sentiment classification on the IMDB dataset. The goal is not only to compare performance metrics but also to understand how a standard RNN behaves during training, identify its practical limitations, and observe how LSTM addresses these shortcomings through its gated memory mechanism. Through this comparison, we highlight the strengths, weaknesses, and real-world trade-offs between these two sequential models.

Why Sentiment Analysis with RNNs?

Text data is inherently sequential. The meaning of a word often depends on the words that come before it, making sequence modeling essential for NLP tasks. RNNs are specifically designed to process sequences by maintaining a hidden state that captures information from previous time steps.

For sentiment analysis, this sequential modeling capability allows RNN-based models to understand context, negations, and dependencies across words in a sentence or paragraph. For example, phrases like “not good” or “although the movie started slow, it ended brilliantly” require contextual understanding that simple bag-of-words models often fail to capture.

Despite newer architectures like Transformers gaining popularity, RNN-based models remain important for understanding the evolution of sequence modeling and for use cases where computational resources are limited.

IMDB Dataset Overview

The IMDB dataset is a widely used benchmark for binary sentiment classification. It consists of movie reviews labeled as either positive or negative, making it ideal for evaluating text classification models.

Dataset Characteristics

  • Task: Binary sentiment classification (positive / negative)

  • Total reviews: 50,000

  • Training set: 25,000 reviews

  • Test set: 25,000 reviews

  • Balanced classes: Equal number of positive and negative samples

Preprocessing Steps

Common preprocessing steps applied to the IMDB dataset include:

  • Text tokenization

  • Converting words into integer indices

  • Limiting vocabulary size

  • Padding or truncating sequences to a fixed length

These steps ensure that the text data can be efficiently processed by neural network models.

Recurrent Neural Network Architectures

This section provides a high-level overview of the two recurrent architectures compared in this study: the Vanilla RNN and the LSTM.

Vanilla RNN

The Vanilla RNN is the simplest form of recurrent neural network. It processes sequences by updating a hidden state at each time step using the current input and the previous hidden state.

Figure: Vanilla RNN architecture (illustration created for this article)

Advantages:

  • Simple and easy to implement

  • Fewer parameters

  • Useful for short sequences

Limitations:

  • Suffers from the vanishing gradient problem

  • Struggles with long-term dependencies

  • Performance degrades on longer text sequences

Because IMDB reviews can be lengthy, vanilla RNNs often fail to retain important contextual information from earlier parts of the review.

Long Short-Term Memory (LSTM)

LSTM networks were introduced to overcome the limitations of vanilla RNNs. They use a gated architecture that controls the flow of information through the network.

Figure: LSTM architecture (illustration created for this article)

Key components of LSTM include:

  • Forget gate

  • Input gate

  • Output gate

  • Cell state

These mechanisms allow LSTMs to selectively remember or forget information over long sequences.

Advantages:

  • Excellent at capturing long-term dependencies

  • Stable training behavior

  • Strong performance on text-based tasks

Limitations:

  • Computationally expensive

  • Larger number of parameters

  • Slower training compared to simpler models

Want to learn more about RNN and LSTM architectures? Read our detailed guide explaining how Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) models work.

Experimental Setup

To ensure a fair and controlled comparison between the Vanilla RNN and the LSTM, both models were trained under identical experimental conditions. The only difference between experiments was the recurrent unit type, so performance differences can be attributed primarily to the architecture rather than to the training setup.

Dataset Split

The IMDb dataset consists of 50,000 labeled reviews, evenly split into:

  • 25,000 training samples

  • 25,000 test samples

From the training set, a validation subset was created to monitor generalization performance and enable early stopping. The test set was used strictly for final evaluation and was not accessed during training.

Model Configuration

Both RNN and LSTM models were configured with identical hyperparameters:

  • Vocabulary Size: 20,000

  • Embedding Dimension: 200

  • Hidden Size: 128

  • Number of Layers: 1

  • Direction: Unidirectional

  • Dropout: 0.5

  • Output Dimension: 1 (Binary classification)

This ensures architectural fairness across experiments.

Training Configuration

  • Loss Function: Binary Cross Entropy with Logits (BCEWithLogitsLoss)

  • Optimizer: Adam

  • Learning Rate: 0.001

  • Weight Decay: 1e-4

  • Batch Size: 128

  • Early Stopping: Based on validation loss

The models output raw logits, which are passed directly to BCEWithLogitsLoss. Sigmoid activation is applied only during evaluation for probability estimation.

Regularization Strategy

To improve generalization and prevent overfitting:

  • Dropout (0.5) was applied before the final classification layer.

  • L2 regularization was implemented through weight decay in the optimizer.

  • Early stopping was employed to halt training when validation performance stopped improving.
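The early-stopping rule can be sketched in a few lines of plain Python. The class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss values below are illustrative assumptions; the article does not state the patience that was used. (The reported runs stopped ten epochs after their best epoch, which would be consistent with a patience of 10, but that is an inference.)

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # assumed value; the article does not state it
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0       # a checkpoint would typically be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 3
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.69, 0.61, 0.55, 0.56, 0.57, 0.58], start=1):
    if stopper.step(epoch, loss):
        break
print(epoch, stopper.best_epoch, stopper.best_loss)  # 5 3 0.55
```

In practice the weights from the best epoch are restored before the final test evaluation, so the reported test metrics reflect the best checkpoint rather than the last one.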

Evaluation Metrics

Model performance was evaluated using:

  • Training Loss

  • Validation Loss

  • Training Accuracy

  • Validation Accuracy

  • Final Test Accuracy

Binary predictions were obtained by applying a sigmoid function to logits and thresholding at 0.5.
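As a small illustration of that last step, the sketch below converts a raw logit into a probability and a label. The helper name `predict` and the choice to treat a probability of exactly 0.5 as positive are assumptions made for the example.

```python
import math

def predict(logit, threshold=0.5):
    """Turn a raw logit into (label, probability of the positive class)."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    label = "Positive" if prob >= threshold else "Negative"
    return label, prob

for logit in (-1.6, 0.0, 2.3):
    label, prob = predict(logit)
    print(f"{logit:+.1f} -> {label} ({prob:.4f})")
```

Because the sigmoid is monotonic and maps 0 to 0.5, thresholding the probability at 0.5 gives the same labels as checking whether the logit is non-negative.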

Implementation Details

This section describes the technical implementation, including data preprocessing, model construction, and training workflow.

Data Preparation

Effective sentiment modeling begins with structured and consistent text preprocessing. Since neural networks operate on numerical tensors rather than raw text, the IMDb reviews were transformed into model-ready sequences through a carefully designed preprocessing pipeline.

1. Text Normalization and Tokenization

A lightweight custom tokenizer was implemented to standardize input text before numerical encoding.

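Python
import re

def simple_tokenize(text):
    text = text.lower()  # "Movie" and "movie" become the same token
    # Drop everything except letters, digits, and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()  # split on runs of whitespace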

This tokenizer performs three key operations:

  1. Lowercasing – All text is converted to lowercase to reduce vocabulary size and avoid treating words like “Movie” and “movie” as separate tokens.

  2. Character Filtering – A regular expression removes all non-alphanumeric characters except whitespace. This eliminates punctuation and special symbols that are unlikely to contribute meaningfully to binary sentiment classification.

  3. Whitespace Tokenization – The cleaned text is split on runs of whitespace (spaces, tabs, and newlines) to produce word-level tokens.

Although simple, this approach is effective for baseline sentiment analysis tasks and maintains computational efficiency without introducing external dependencies.
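A quick usage example makes the tokenizer's behavior concrete. The function is repeated so the snippet runs on its own, and the two input strings are made up for illustration.

```python
import re

def simple_tokenize(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()

print(simple_tokenize("Not good, at all!"))
# ['not', 'good', 'at', 'all']

print(simple_tokenize("Don't miss it.<br />Great!"))
# ['dont', 'miss', 'itbr', 'great']
```

The second call shows a side effect worth knowing about: punctuation is deleted rather than replaced by a space, so an apostrophe collapses "don't" into "dont", and text joined only by punctuation or markup (IMDB reviews contain `<br />` tags) can merge into a single token. Substituting a space instead of an empty string would keep such words apart, but that would be a change to the pipeline described here, not part of it.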

2. Vocabulary Construction

A vocabulary was built from the training dataset only to prevent data leakage. Tokens were ranked by frequency, and the top 20,000 most frequent words were retained.

Two special tokens were introduced:

  • <PAD> (index 0) — used for sequence padding

  • <UNK> — used to represent out-of-vocabulary words

Limiting vocabulary size improves training efficiency and reduces noise introduced by rare terms.

3. Integer Encoding

Each token was mapped to its corresponding index in the vocabulary dictionary (word_to_idx).

This transforms text into sequences of integers suitable for embedding lookup.
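Vocabulary construction and integer encoding can be sketched together as follows. The helper names `build_vocab` and `encode` are hypothetical, and two details are assumptions rather than facts from the article: that `<UNK>` sits at index 1, and that the two special tokens count toward the 20,000-entry limit.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(tokenized_reviews, max_size=20_000):
    """Map the most frequent training tokens to integer indices."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    # Assumption: special tokens count toward max_size and <UNK> gets index 1
    word_to_idx = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_size - len(word_to_idx)):
        word_to_idx[word] = len(word_to_idx)
    return word_to_idx

def encode(tokens, word_to_idx):
    """Replace each token by its index, falling back to <UNK>."""
    unk = word_to_idx[UNK]
    return [word_to_idx.get(tok, unk) for tok in tokens]

# Toy training corpus (already tokenized)
train = [["great", "movie"], ["bad", "movie"], ["great", "cast"]]
word_to_idx = build_vocab(train, max_size=5)
print(word_to_idx)
print(encode(["great", "unseen", "movie"], word_to_idx))  # [2, 1, 3]
```

Only the training reviews are passed to `build_vocab`; validation and test reviews are encoded with the same dictionary, so any word they introduce maps to `<UNK>`.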

4. Sequence Length Analysis

Movie reviews in the IMDb dataset vary significantly in length. Instead of selecting an arbitrary truncation threshold, the distribution of review lengths (in tokens) was analyzed.

Review Length Statistics:

  • Mean: 232.81
  • Median: 174.00
  • Min: 10
  • Max: 2469
  • Std: 172.96
  • 75th percentile: 283
  • 90th percentile: 456
  • 95th percentile: 595
  • 99th percentile: 909

Based on this distribution, max_length was set to 595.

This value is the 95th percentile of review lengths, so about 95% of reviews fit within it without truncation.

Choosing the 95th percentile provides a balanced trade-off:

  • Preserves most contextual information

  • Prevents excessive padding

  • Controls memory usage

  • Improves training efficiency
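The percentile-based choice can be reproduced with a short helper. This is a sketch using the nearest-rank definition of a percentile, which is an assumption: the article does not say how its percentiles were computed, and library routines that interpolate can return slightly different values. The lengths below are invented stand-ins for the real token counts.

```python
import math

def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct% of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts standing in for the real review lengths
lengths = [12, 40, 85, 120, 174, 174, 210, 283, 456, 595,
           909, 130, 95, 60, 300, 220, 150, 180, 250, 2469]

max_length = percentile_nearest_rank(lengths, 95)
coverage = sum(n <= max_length for n in lengths) / len(lengths)
print(max_length, coverage)
```

The `coverage` value is the fraction of reviews that fit without truncation, which is the quantity the 95th-percentile rule is designed to control.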

5. Padding and Truncation Strategy

To ensure consistent batch processing:

  • Sequences shorter than 595 tokens were padded using <PAD> (index 0).

  • Sequences longer than 595 tokens were truncated to 595 tokens.

This ensures consistent input shape across batches.
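A minimal sketch of this step is shown below. The function name `pad_or_truncate` is hypothetical, and so is the choice to keep the first 595 tokens and pad on the right; the article does not say which end of a long review was cut. Right-padding is the layout that length-based packing expects, since the valid tokens must come first.

```python
def pad_or_truncate(ids, max_length=595, pad_idx=0):
    """Return (fixed-length ids, true length capped at max_length)."""
    # Assumption: keep the first max_length tokens and pad on the right
    clipped = ids[:max_length]
    length = len(clipped)
    return clipped + [pad_idx] * (max_length - length), length

padded, length = pad_or_truncate([5, 9, 2], max_length=5)
print(padded, length)  # [5, 9, 2, 0, 0] 3

padded, length = pad_or_truncate(list(range(1, 9)), max_length=5)
print(padded, length)  # [1, 2, 3, 4, 5] 5
```

Returning the true length alongside the padded sequence keeps the information needed later to ignore the padded positions.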

6. Handling Variable-Length Sequences

To prevent padded tokens from influencing hidden state updates, pack_padded_sequence() was used before passing inputs to the recurrent layer.

This improves:

  • Computational efficiency

  • Gradient flow

  • Model stability

The original sequence lengths were passed alongside the padded inputs.
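The effect of packing can be checked on a toy batch. The sketch below uses small made-up dimensions rather than the article's configuration, and compares the final hidden state of a short sequence when it is packed, when it is run alone without padding, and when the padded positions are processed as if they were real input.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

# Two "embedded" sequences padded to length 5; real lengths are 5 and 2
batch = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 2])
batch[1, 2:] = 0.0  # padded positions of the short sequence

# lengths must be a CPU tensor; enforce_sorted=False lets PyTorch sort internally
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
_, h_packed = rnn(packed)          # shape: (num_layers, batch, hidden_size)

_, h_alone = rnn(batch[1:2, :2])   # short sequence on its own, no padding
_, h_padded = rnn(batch)           # padding processed like real tokens

print(torch.allclose(h_packed[0, 1], h_alone[0, 0], atol=1e-6))  # True
print(torch.allclose(h_padded[0, 1], h_alone[0, 0], atol=1e-6))  # False
```

With packing, the final hidden state of the short sequence is the state after its last real token. Without it, the state keeps being updated on padding, so the representation passed to the classifier depends on how much padding the review received.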

Model Architecture Implementation

This section describes the implementation of the sentiment classification models used in the experiments. Both architectures follow a similar pipeline consisting of an embedding layer, a recurrent sequence modeling layer, a dropout regularization layer, and a fully connected output layer. The primary difference between the two models lies in the type of recurrent unit used: a Vanilla RNN or an LSTM.

The models are implemented using the PyTorch deep learning framework.

Overall Model Pipeline

The sentiment classification model follows this processing pipeline:

Tokenized Review → Embedding Layer → Recurrent Layer (RNN/LSTM) → Dropout → Fully Connected Layer → Output Logits

The final output represents the predicted sentiment score for each review.

1. Embedding Layer

The first layer of the model is an embedding layer that converts token indices into dense vector representations.

Python
self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

Configuration –
  • Vocabulary size: 20,000

  • Embedding dimension: 200

  • Padding index: 0

Purpose –

Text input to the model is represented as integer indices corresponding to words in the vocabulary. However, neural networks cannot directly interpret these indices as meaningful representations. The embedding layer addresses this by mapping each word index to a dense vector in a continuous vector space.

Instead of representing words using sparse one-hot vectors, embeddings allow the model to learn semantic relationships between words during training. Words that appear in similar contexts often develop similar vector representations.

For example, words such as “excellent”, “amazing”, and “fantastic” may eventually occupy nearby positions in the embedding space.

The padding_idx=0 parameter ensures that padding tokens used for sequence alignment do not contribute to gradient updates during training.
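The behavior of `padding_idx` is easy to verify on a tiny embedding table. The sizes below are arbitrary and chosen only to keep the printout small.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
ids = torch.tensor([[3, 7, 0, 0]])   # the last two positions are <PAD>

out = emb(ids)
out.sum().backward()

print(out[0, 2])             # zeros: the padding row is initialised to zero
print(emb.weight.grad[0])    # zeros: the padding row receives no gradient
print(emb.weight.grad[3])    # ones: the row of a real token is updated
```

Since the padding row starts at zero and never receives a gradient, it stays a zero vector throughout training.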

2. Recurrent Sequence Modeling Layer

After the embedding layer, the sequence of word vectors is passed to a recurrent neural network layer that processes the review sequentially.

Two variants were implemented:

  • Vanilla RNN

  • LSTM

Both layers operate on sequences of embeddings and update hidden states as they process each token in the sequence.

a. Vanilla RNN Layer

The Vanilla RNN model uses the following configuration:

Python
self.rnn = nn.RNN(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    num_layers=1,
    nonlinearity="tanh",
    batch_first=True
)

Configuration –
  • Input size: 200

  • Hidden size: 128

  • Number of layers: 1

  • Activation function: tanh

  • Batch-first format: True

Working Mechanism –

At each time step, the RNN updates its hidden state using the current input embedding and the previous hidden state.

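The update rule can be expressed as:

$$h_t = \tanh\left(W_{ih}\, x_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh}\right)$$

Where:

  • x_t is the input embedding at time step t

  • h_{t-1} is the previous hidden state

  • h_t is the updated hidden state

  • W_{ih} and W_{hh} are the input-to-hidden and hidden-to-hidden weight matrices, and b_{ih} and b_{hh} are the corresponding bias vectors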

As the network processes the review token by token, the hidden state gradually accumulates contextual information about the sequence.

After the entire sequence has been processed, the final hidden state is used as the representation of the entire review.
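To make the recurrence concrete, here is a plain-Python sketch of the tanh update that PyTorch documents for nn.RNN, h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh). The function name `rnn_step`, the toy sizes, and the weight values are all made up for illustration; they are not the trained parameters.

```python
import math

def rnn_step(x_t, h_prev, W_ih, W_hh, b_ih, b_hh):
    """One vanilla RNN update: h_t = tanh(W_ih x_t + b_ih + W_hh h_prev + b_hh)."""
    h_t = []
    for i in range(len(h_prev)):
        pre = b_ih[i] + b_hh[i]
        pre += sum(w * x for w, x in zip(W_ih[i], x_t))
        pre += sum(w * h for w, h in zip(W_hh[i], h_prev))
        h_t.append(math.tanh(pre))
    return h_t

# Toy sizes (input 2, hidden 2) with made-up weights
W_ih = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_ih = [0.0, 0.0]
b_hh = [0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_ih, W_hh, b_ih, b_hh)
print(h)  # final hidden state, used as the summary of the whole sequence
```

Every step rewrites the whole hidden state through the same tanh transformation, which is why information from early tokens tends to fade as the sequence grows.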

b. LSTM Layer

The LSTM model replaces the vanilla RNN layer with an LSTM layer.

Python
self.lstm = nn.LSTM(
    input_size=embedding_dim,
    hidden_size=hidden_size,
    batch_first=True
)

Configuration –
  • Input size: 200

  • Hidden size: 128

  • Single-layer LSTM

  • Unidirectional processing

Working Mechanism –

Unlike vanilla RNNs, LSTMs maintain two states:

  • Hidden state (ht)

  • Cell state (ct)

The cell state acts as a long-term memory that allows the network to preserve information over long sequences.

LSTMs use three gating mechanisms to control information flow:

Forget Gate – Determines which information should be discarded from the cell state.

Input Gate – Controls which new information should be added to the cell state.

Output Gate – Determines which parts of the cell state should influence the hidden state output.

These gates enable LSTMs to capture long-term dependencies in text sequences and significantly mitigate the vanishing gradient problem that affects vanilla RNNs.
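The gate interactions can be sketched for a single hidden unit with scalar input. This is a simplified illustration of the standard LSTM equations, not the article's trained model: the function name `lstm_step`, the parameter dictionary, and all values are hypothetical, and nn.LSTM applies the same structure with vectors and weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update for a single hidden unit (scalar input and state)."""
    f = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_xg"] * x_t + p["w_hg"] * h_prev + p["b_g"])  # candidate value
    c_t = f * c_prev + i * g   # keep part of the old memory, add new information
    h_t = o * math.tanh(c_t)   # expose a gated view of the cell state
    return h_t, c_t

# Made-up parameters: forget gate biased open, input gate biased shut
params = dict(w_xf=0.0, w_hf=0.0, b_f=10.0,
              w_xi=0.0, w_hi=0.0, b_i=-10.0,
              w_xo=0.0, w_ho=0.0, b_o=0.0,
              w_xg=1.0, w_hg=0.0, b_g=0.0)

h, c = 0.0, 0.8                    # pretend the cell already stores a sentiment cue
for x_t in [0.3, -0.5, 0.9] * 50:  # 150 further time steps
    h, c = lstm_step(x_t, h, c, params)
print(round(c, 3))                 # still close to the stored 0.8
```

With the forget gate near 1 and the input gate near 0, the cell state passes through 150 steps almost unchanged. The additive update `c_t = f * c_prev + i * g` is what lets an LSTM carry information, and gradients, across long spans when it learns to set its gates this way.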

3. Dropout Regularization

Before classification, dropout is applied to the final hidden representation.

Python
 self.dropout = nn.Dropout(0.5)

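During training, dropout randomly zeroes each element of the hidden representation with probability 0.5 and rescales the remaining elements to compensate, which helps reduce overfitting and improves generalization. It is disabled at evaluation time.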

4. Fully Connected Output Layer

The final layer of the model is a fully connected linear layer.

Python
 self.fc = nn.Linear(hidden_size, 1)

This layer maps the hidden representation to a single scalar value representing the predicted sentiment score.

The model outputs raw logits, which are passed directly to the loss function.
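Putting the four layers together, a model along these lines could look like the sketch below. The class name `SentimentRNN`, the `cell` switch, and the argument names are hypothetical; the notebook's actual class may be organized differently. The default sizes follow the configuration listed earlier.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SentimentRNN(nn.Module):
    """Embedding -> RNN/LSTM -> dropout -> linear, one logit per review."""

    def __init__(self, vocab_size=20_000, embedding_dim=200, hidden_size=128,
                 cell="lstm", dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        rnn_cls = nn.LSTM if cell == "lstm" else nn.RNN  # nn.RNN defaults to tanh
        self.recurrent = rnn_cls(embedding_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)
        packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, state = self.recurrent(packed)
        # nn.LSTM returns (h_n, c_n); nn.RNN returns h_n only
        h_n = state[0] if isinstance(state, tuple) else state
        last_hidden = h_n[-1]                                 # (batch, hidden_size)
        return self.fc(self.dropout(last_hidden)).squeeze(1)  # raw logits

model = SentimentRNN(cell="rnn")
ids = torch.tensor([[4, 17, 9, 0, 0], [8, 3, 0, 0, 0]])  # two padded toy reviews
logits = model(ids, torch.tensor([3, 2]))
print(logits.shape)  # torch.Size([2])
```

Switching `cell` between "rnn" and "lstm" changes only the recurrent layer, which mirrors the controlled comparison described in the experimental setup.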

Loss Function and Output Handling

The models return raw logits.

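BCEWithLogitsLoss() internally applies the sigmoid activation and computes binary cross-entropy in a numerically stable manner. For a logit x and a label y ∈ {0, 1}, the per-sample loss is:

$$\ell(x, y) = -\left[\, y \log \sigma(x) + (1 - y) \log\bigl(1 - \sigma(x)\bigr) \right], \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

By default, the per-sample losses are averaged over the batch.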
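Why numerical stability matters can be illustrated in plain Python. The sketch below compares a direct sigmoid-then-log computation with one common stable rewrite of the same quantity; the function names are made up for the example, and the rewrite is shown as an illustration of the idea rather than as PyTorch's exact internal code.

```python
import math

def bce_with_logits(x, y):
    """Stable binary cross-entropy for one logit x and label y in {0, 1}."""
    # Algebraically equal to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x),
    # but it never takes the log of a value that has rounded to zero.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x, y):
    """Direct translation of the formula: sigmoid first, then logs."""
    s = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(s) + (1 - y) * math.log(1 - s))

print(bce_with_logits(2.0, 1), bce_naive(2.0, 1))  # agree for moderate logits
print(bce_with_logits(50.0, 0))                    # 50.0

try:
    bce_naive(50.0, 0)
except ValueError as err:
    # sigmoid(50) rounds to exactly 1.0 in floating point, so log(1 - s) is log(0)
    print("naive version fails:", err)
```

A confidently wrong prediction is exactly the case where the loss should be large and informative, so a formulation that breaks down there would be a poor training signal.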

Optimization Strategy

The Adam optimizer was selected due to its adaptive learning rate mechanism and strong empirical performance in NLP tasks.

Weight decay (1e-4) was applied to introduce L2 regularization and reduce overfitting.

Python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

Model Capacity

Vanilla RNN Total Parameters: 4,042,369

LSTM Total Parameters: 4,169,089

The majority of parameters originate from the embedding matrix (20,000 × 200), which dominates the model’s representational capacity.

The LSTM architecture contains more parameters due to its gated structure, increasing model complexity and learning capacity.
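Both totals can be reproduced by hand from the configuration. The helper names below are made up for the arithmetic; the per-gate layout (input weights, recurrent weights, and two bias vectors) follows how PyTorch stores nn.RNN and nn.LSTM parameters.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim

def recurrent_params(input_size, hidden_size, gates):
    # Per gate block: input weights, recurrent weights, and two bias vectors
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return gates * per_gate

def linear_params(in_features, out_features):
    return in_features * out_features + out_features

shared = embedding_params(20_000, 200) + linear_params(128, 1)
rnn_total = shared + recurrent_params(200, 128, gates=1)
# An LSTM has four weight blocks: forget, input, and output gates plus the candidate
lstm_total = shared + recurrent_params(200, 128, gates=4)
print(rnn_total, lstm_total)  # 4042369 4169089
```

The recurrent layer accounts for 42,240 parameters in the Vanilla RNN and 168,960 in the LSTM, so the LSTM has four times the recurrent capacity while the total grows by only about 3% because the embedding matrix contributes 4,000,000 parameters to both models.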

Results and Comparative Analysis

This section presents the experimental results obtained from training Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models for sentiment classification on movie reviews. The models were evaluated using validation loss, test accuracy, and predictions on custom review samples to assess both quantitative performance and qualitative behavior.

Vanilla RNN Results

Training Performance

  • Early Stopping Triggered: Epoch 19

  • Best Validation Loss: 0.5234 (Epoch 9)

The validation loss reached its minimum at epoch 9 and did not improve over the following ten epochs, at which point early stopping ended training.

Test Performance

  • Test Loss: 0.5173

  • Test Accuracy: 75.94%

Although the Vanilla RNN successfully learned sentiment patterns, its performance indicates limitations in handling long contextual dependencies.

Qualitative Evaluation: Custom Reviews

Figure: Vanilla RNN predictions on Custom Reviews

| Review | Prediction | Confidence | Observation |
|---|---|---|---|
| Fantastic movie | ❌ Negative | 0.6833 | Misclassification |
| Terrible movie | ✅ Negative | 0.8344 | Correct |
| Confusing plot | ✅ Negative | 0.8809 | Correct |
| Best films ever | ✅ Positive | 0.9660 | Correct |
| Average movie | ✅ Negative | 0.8201 | Reasonable |

Analysis

The RNN incorrectly classified a strongly positive review as negative. This behavior is consistent with known limitations of vanilla RNNs:

  • Difficulty preserving long-range sentiment cues

  • Sensitivity to earlier tokens in sequences

  • Gradual information decay during sequence processing

The model tends to rely heavily on localized patterns instead of overall contextual meaning.

LSTM Results

Training Performance

  • Early stopping triggered: Epoch 20

  • Best validation loss: 0.3661 (Epoch 10)

The substantially lower validation loss (0.3661 versus 0.5234) indicates that the LSTM fits held-out data better than the Vanilla RNN under the same training setup.

Test Performance

  • Test Loss: 0.3677

  • Test Accuracy: 84.34%

The LSTM improved test accuracy by 8.4 percentage points over the Vanilla RNN (84.34% versus 75.94%).

Qualitative Evaluation: Custom Reviews

Figure: LSTM predictions on Custom Reviews

| Review | Prediction | Confidence | Observation |
|---|---|---|---|
| Fantastic movie | ✅ Positive | 0.5658 | Correct |
| Terrible movie | ✅ Negative | 0.9872 | Correct |
| Confusing plot | ✅ Negative | 0.9427 | Correct |
| Best films ever | ✅ Positive | 0.9962 | Correct |
| Average movie | ✅ Negative | 0.7564 | Correct |

Analysis

Unlike the Vanilla RNN, the LSTM correctly classified all five custom reviews. Five examples are too few to draw firm conclusions, but the result is consistent with its stronger test performance and its ability to carry sentiment cues across a sequence.

Even when confidence values were moderate, predictions remained semantically accurate, indicating better internal representation learning.

Quantitative Comparison

| Metric | Vanilla RNN | LSTM |
|---|---|---|
| Best Validation Loss | 0.5234 | 0.3661 |
| Test Loss | 0.5173 | 0.3677 |
| Test Accuracy | 75.94% | 84.34% |
| Training Stability | Moderate | High |
| Long Context Understanding | Limited | Strong |

Practical Implications

The experimental results suggest:

  • Vanilla RNNs are suitable as baseline models.

  • LSTMs provide significantly better performance for real-world sentiment analysis tasks.

Interactive Model Deployment (Streamlit App)

To make the models accessible for real-time experimentation, an interactive web application is being developed using Streamlit.

The application will allow users to:

  • Enter custom movie reviews

  • Select a trained model (RNN or LSTM)

  • View sentiment predictions instantly

  • Observe prediction confidence scores

This deployment bridges the gap between research experimentation and real-world usability by enabling anyone to test the trained models directly through a web interface.

The Streamlit application link is provided below:

👉 IMDB Sentiment Classifier · Streamlit

Source Code and Reproducibility

To ensure transparency and reproducibility, the complete implementation used in this study is publicly available. The repository contains a Jupyter Notebook that documents the entire development process step by step.

Readers interested in exploring the implementation details or reproducing the experiments can access the full project below:

👉 GitHub Repository:
View the Complete Notebook and Source Code

The notebook provides a hands-on walkthrough of the entire workflow, from raw text processing to model evaluation, making it suitable for both learning and experimentation.

Conclusion

In this project, we explored sentiment analysis using recurrent neural network architectures on the IMDB movie review dataset, progressing from a basic Vanilla RNN to a Long Short-Term Memory (LSTM) model. The goal was not only to build accurate classifiers but also to understand how different sequence modeling architectures handle textual data and long-range dependencies.

The Vanilla RNN served as an important baseline, demonstrating how sequential neural networks can learn sentiment patterns from text. While it achieved reasonable performance with a test accuracy of 75.94%, qualitative analysis revealed its limitations in preserving contextual information across long reviews. Misclassifications in strongly positive sentences highlighted the challenges caused by vanishing gradients and limited memory retention.

The LSTM model addressed these shortcomings through its gating mechanisms, enabling selective memory retention and improved contextual understanding. This architectural improvement translated into a significant performance gain, achieving 84.34% test accuracy along with lower validation and test loss. The LSTM consistently produced more reliable predictions and successfully handled nuanced sentiment expressions present in real-world reviews.

Beyond model training, this project emphasized the complete machine learning workflow — including data preprocessing, tokenization, sequence padding, architecture design, experimental evaluation, and deployment considerations. By building an interactive Streamlit application, the models were extended from research experimentation to practical usability, allowing users to test sentiment predictions on their own inputs in real time.

Overall, the experiments demonstrate a key insight in natural language processing: model architecture plays a crucial role in capturing linguistic context. While simple RNNs provide foundational understanding, gated recurrent models like LSTM offer substantial improvements for tasks involving long textual sequences.

This work highlights how deep learning models evolve from theoretical concepts into deployable intelligent systems, forming a strong foundation for further exploration into modern NLP architectures such as GRUs, attention mechanisms, and Transformer-based models.
