
Image Captioning with Neural Networks

Built as part of a Natural Language Processing (NLP) course, this project explores the intersection of computer vision and NLP by building an end-to-end image captioning system. Using the Flickr8k dataset, I implemented a CNN–RNN pipeline in which images are encoded with a pretrained ResNet-18 and captions are generated by an LSTM language model. I extended the baseline with a conditioned LSTM that concatenates image features at every timestep, and I compared decoding strategies such as greedy search and beam search. Along the way, I addressed practical issues such as efficiently preprocessing 6,000+ images, managing tokenization and padding for sequences up to 40 words, and reducing repetition in generated captions.

Goals

  • Develop a complete image-to-caption pipeline combining vision and sequence modeling.

  • Compare baseline text-only captioning against a conditioned LSTM that integrates image features directly.

  • Explore search strategies (greedy vs. beam search) for decoding and analyze their impact on fluency and diversity.

  • Gain hands-on experience with integrating pretrained encoders, sequence models, and generation techniques.


Technical specs

  • Dataset: Flickr8k (≈6,000 training, 1,000 dev, 1,000 test images).

  • Image Encoder: ResNet-18 pretrained on ImageNet, last layer removed, outputting 512-dimensional embeddings (sketched after this list).

  • Text Processing: Tokenized captions padded to a max length of 40; special tokens <START>, <EOS>, <PAD>.

  • Models:

      • Baseline LSTM language model (embedding dim 512, hidden size 512).

      • Conditioned LSTM (concatenating the 512-dim image vector with the token embeddings at each step).

  • Training: Cross-entropy loss with AdamW, batch size 16, learning rate 1e-3 (a training-step sketch also follows this list).

  • Decoding: Greedy decoder, stochastic sampling, and beam search (beam widths 3 and 5).
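
A minimal sketch of the encoder setup described above, assuming a recent torchvision (the weights-enum API) and images already resized and normalized for ImageNet; dropping ResNet-18's final classification layer leaves the 512-dimensional pooled feature, which can be computed once per image and cached.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet-18 pretrained on ImageNet, with the final fc layer removed so the
# output is the 512-d globally pooled feature rather than class logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def encode_images(images):         # images: (B, 3, 224, 224), ImageNet-normalized
    feats = encoder(images)        # (B, 512, 1, 1)
    return feats.flatten(1)        # (B, 512)
```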
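
A companion sketch of the training step, using the hyperparameters listed above (cross-entropy loss, AdamW, learning rate 1e-3). It reuses encode_images from the encoder sketch; model, train_loader, and pad_id are placeholders, with the conditioned decoder itself sketched under Design process, so this is an illustration rather than the project's exact code.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, train_loader, pad_id):
    """One epoch of teacher-forced caption training.

    model: the conditioned LSTM decoder (sketched under Design process).
    train_loader: yields (images, captions) batches of padded token IDs.
    pad_id: index of the <PAD> token, excluded from the loss.
    """
    for images, captions in train_loader:
        features = encode_images(images)            # or read from the feature cache
        logits = model(features, captions[:, :-1])  # predict each next token
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),    # (B*T, vocab)
            captions[:, 1:].reshape(-1),            # shifted targets, (B*T,)
            ignore_index=pad_id,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Calling this with optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3) and a DataLoader batch size of 16 matches the settings above.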


Design process

  • Encoded all images up front into tensors to avoid recomputation and save training time.

  • Built a CaptionDataset class for flexible batching and alignment between images and their five captions (a rough sketch follows this list).

  • Implemented the conditioned LSTM by expanding the image vector to match the sequence length and concatenating it with the word embeddings (sketched below).

  • Iteratively refined decoding strategies, moving from greedy (deterministic but repetitive) to beam search for more varied, natural captions.

  • Adjusted the beam search logic to handle <EOS> tokens properly and to normalize scores across different caption lengths (see the beam-search sketch below).
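
A rough sketch of the CaptionDataset idea, assuming captions are already tokenized to integer IDs and image features are precomputed; the field names and constructor are illustrative, not the project's actual interface.

```python
import torch
from torch.utils.data import Dataset

class CaptionDataset(Dataset):
    """Pairs each precomputed image feature with one of its five captions."""

    def __init__(self, features, captions, pad_id, max_len=40):
        # features: dict of image_id -> 512-d feature tensor
        # captions: list of (image_id, [token IDs]) pairs, five per image
        self.features, self.captions = features, captions
        self.pad_id, self.max_len = pad_id, max_len

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        image_id, tokens = self.captions[idx]
        tokens = tokens[: self.max_len]                                  # cap at 40
        padded = tokens + [self.pad_id] * (self.max_len - len(tokens))   # right-pad
        return self.features[image_id], torch.tensor(padded)
```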
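
The conditioning itself is mostly a matter of tensor shapes: broadcast the 512-dim image vector along the time dimension and concatenate it with the word embeddings before the LSTM. A minimal sketch with the layer sizes from the specs above; everything else is an assumption.

```python
import torch
import torch.nn as nn

class ConditionedLSTM(nn.Module):
    """LSTM decoder that sees the 512-d image vector at every timestep."""

    def __init__(self, vocab_size, embed_dim=512, img_dim=512, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + img_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, tokens):
        emb = self.embed(tokens)                                   # (B, T, 512)
        img = img_feats.unsqueeze(1).expand(-1, emb.size(1), -1)   # (B, T, 512)
        states, _ = self.lstm(torch.cat([emb, img], dim=-1))       # (B, T, 1024) in
        return self.out(states)                                    # (B, T, vocab)
```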
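
And a simplified beam-search sketch over that decoder, showing the two fixes mentioned above: hypotheses that emit <EOS> are set aside as finished, and the final comparison divides each summed log-probability by caption length so short and long candidates compete fairly. It re-runs the decoder on the full prefix at each step for clarity and is an illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, img_feat, start_id, eos_id, beam_width=3, max_len=40):
    beams = [([start_id], 0.0)]          # (token IDs, summed log-probability)
    finished = []
    img = img_feat.unsqueeze(0)          # (1, 512)

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logits = model(img, torch.tensor([tokens]))      # (1, T, vocab)
            logp = F.log_softmax(logits[0, -1], dim=-1)      # next-token log-probs
            top = torch.topk(logp, beam_width)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [idx.item()], score + lp.item()))

        # Keep the best hypotheses; anything ending in <EOS> is set aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break

    finished = finished or beams
    # Length-normalize so longer captions are not penalized for having more terms.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```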


Challenges

  • Handling the mismatch between training and generation: captions sometimes hallucinated details not present in the image.

  • Preventing repetition in long captions, especially under beam search without length normalization.

  • Balancing efficiency and correctness when preprocessing thousands of images into tensors.

  • Debugging tensor shapes when concatenating image embeddings with word embeddings for the conditioned model.


Outcomes

  • Built a modular captioning pipeline capable of generating coherent captions on unseen dev images.

  • Conditioned LSTM produced more context-aware captions than the baseline, though still prone to hallucination.

  • Beam search improved fluency and reduced some repetition compared to greedy decoding.

  • Achieved ~60% token-level accuracy on validation data, with stable training behavior over multiple epochs.

  • Gained practical experience in sequence modeling, image–text integration, and debugging generation models.


Potential next steps

  • Integrate attention mechanisms so the model can focus on spatial regions of the image rather than a single global vector.

  • Add length normalization and coverage penalties to beam search to reduce bias toward shorter or repetitive captions.

  • Replace the LSTM decoder with a Transformer architecture for improved long-range sequence modeling.

  • Fine-tune the CNN encoder jointly with the decoder for end-to-end learning, rather than freezing ResNet features.
