
Image Captioning with Neural Networks

Built as part of a Natural Language Processing (NLP) course, this project explores the intersection of computer vision and NLP by building an end-to-end image captioning system. Using the Flickr8k dataset, I implemented a CNN–RNN pipeline in which images are encoded with a pretrained ResNet-18 and captions are generated by an LSTM language model. I extended the baseline with a conditioned LSTM that concatenates image features at every timestep, and I compared decoding strategies such as greedy search and beam search. Along the way, I addressed practical issues such as efficiently preprocessing 6,000+ images, managing tokenization and padding for sequences up to 40 words, and reducing repetition in generated captions.

Goals

  • Develop a complete image-to-caption pipeline combining vision and sequence modeling.

  • Compare baseline text-only captioning against a conditioned LSTM that integrates image features directly.

  • Explore search strategies (greedy vs. beam search) for decoding and analyze their impact on fluency and diversity.

  • Gain hands-on experience with integrating pretrained encoders, sequence models, and generation techniques.


Technical specs

  • Dataset: Flickr8k (≈6,000 training, 1,000 dev, 1,000 test images).

  • Image Encoder: ResNet-18 pretrained on ImageNet, last layer removed, outputting 512-dimensional embeddings (sketched after this list).

  • Text Processing: Tokenized captions padded to a max length of 40; special tokens <START>, <EOS>, <PAD>.

  • Models:

      • Baseline LSTM language model (embedding dim 512, hidden size 512).

      • Conditioned LSTM (concatenating the 512-dim image vector with the token embeddings at each step).

  • Training: Cross-entropy loss with AdamW, batch size 16, learning rate 1e-3 (a training-step sketch also follows this list).

  • Decoding: Greedy decoder, stochastic sampling, and beam search (beam widths 3 and 5).
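
A minimal sketch of the encoder setup described above, assuming a recent torchvision (the weights-enum API) and images already resized and normalized for ImageNet; dropping ResNet-18's final classification layer leaves the 512-dimensional pooled feature, which can be computed once per image and cached.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet-18 pretrained on ImageNet, with the final fc layer removed so the
# output is the 512-d globally pooled feature rather than class logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def encode_images(images):         # images: (B, 3, 224, 224), ImageNet-normalized
    feats = encoder(images)        # (B, 512, 1, 1)
    return feats.flatten(1)        # (B, 512)
```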
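
A companion sketch of the training step, using the hyperparameters listed above (cross-entropy loss, AdamW, learning rate 1e-3). It reuses encode_images from the encoder sketch; model, train_loader, and pad_id are placeholders, with the conditioned decoder itself sketched under Design process, so this is an illustration rather than the project's exact code.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, optimizer, train_loader, pad_id):
    """One epoch of teacher-forced caption training.

    model: the conditioned LSTM decoder (sketched under Design process).
    train_loader: yields (images, captions) batches of padded token IDs.
    pad_id: index of the <PAD> token, excluded from the loss.
    """
    for images, captions in train_loader:
        features = encode_images(images)            # or read from the feature cache
        logits = model(features, captions[:, :-1])  # predict each next token
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),    # (B*T, vocab)
            captions[:, 1:].reshape(-1),            # shifted targets, (B*T,)
            ignore_index=pad_id,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Calling this with optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3) and a DataLoader batch size of 16 matches the settings above.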


Design process

  • Encoded all images up front into tensors to avoid recomputation and save training time.

  • Built a CaptionDataset class for flexible batching and alignment between images and their five captions (a rough sketch follows this list).

  • Implemented the conditioned LSTM by expanding the image vector to match the sequence length and concatenating it with the word embeddings (sketched below).

  • Iteratively refined decoding strategies, moving from greedy (deterministic but repetitive) to beam search for more varied, natural captions.

  • Adjusted the beam search logic to handle <EOS> tokens properly and to normalize scores across different caption lengths (see the beam-search sketch below).
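
A rough sketch of the CaptionDataset idea, assuming captions are already tokenized to integer IDs and image features are precomputed; the field names and constructor are illustrative, not the project's actual interface.

```python
import torch
from torch.utils.data import Dataset

class CaptionDataset(Dataset):
    """Pairs each precomputed image feature with one of its five captions."""

    def __init__(self, features, captions, pad_id, max_len=40):
        # features: dict of image_id -> 512-d feature tensor
        # captions: list of (image_id, [token IDs]) pairs, five per image
        self.features, self.captions = features, captions
        self.pad_id, self.max_len = pad_id, max_len

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        image_id, tokens = self.captions[idx]
        tokens = tokens[: self.max_len]                                  # cap at 40
        padded = tokens + [self.pad_id] * (self.max_len - len(tokens))   # right-pad
        return self.features[image_id], torch.tensor(padded)
```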
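
The conditioning itself is mostly a matter of tensor shapes: broadcast the 512-dim image vector along the time dimension and concatenate it with the word embeddings before the LSTM. A minimal sketch with the layer sizes from the specs above; everything else is an assumption.

```python
import torch
import torch.nn as nn

class ConditionedLSTM(nn.Module):
    """LSTM decoder that sees the 512-d image vector at every timestep."""

    def __init__(self, vocab_size, embed_dim=512, img_dim=512, hidden_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + img_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feats, tokens):
        emb = self.embed(tokens)                                   # (B, T, 512)
        img = img_feats.unsqueeze(1).expand(-1, emb.size(1), -1)   # (B, T, 512)
        states, _ = self.lstm(torch.cat([emb, img], dim=-1))       # (B, T, 1024) in
        return self.out(states)                                    # (B, T, vocab)
```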
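
And a simplified beam-search sketch over that decoder, showing the two fixes mentioned above: hypotheses that emit <EOS> are set aside as finished, and the final comparison divides each summed log-probability by caption length so short and long candidates compete fairly. It re-runs the decoder on the full prefix at each step for clarity and is an illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, img_feat, start_id, eos_id, beam_width=3, max_len=40):
    beams = [([start_id], 0.0)]          # (token IDs, summed log-probability)
    finished = []
    img = img_feat.unsqueeze(0)          # (1, 512)

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logits = model(img, torch.tensor([tokens]))      # (1, T, vocab)
            logp = F.log_softmax(logits[0, -1], dim=-1)      # next-token log-probs
            top = torch.topk(logp, beam_width)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((tokens + [idx.item()], score + lp.item()))

        # Keep the best hypotheses; anything ending in <EOS> is set aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break

    finished = finished or beams
    # Length-normalize so longer captions are not penalized for having more terms.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```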


Challenges

  • Handling the mismatch between training and generation: captions sometimes hallucinated details not present in the image.

  • Preventing repetition in long captions, especially under beam search without length normalization.

  • Balancing efficiency and correctness when preprocessing thousands of images into tensors.

  • Debugging tensor shapes when concatenating image embeddings with word embeddings for the conditioned model.


Outcomes

  • Built a modular captioning pipeline capable of generating coherent captions on unseen dev images.

  • Conditioned LSTM produced more context-aware captions than the baseline, though still prone to hallucination.

  • Beam search improved fluency and reduced some repetition compared to greedy decoding.

  • Achieved ~60% token-level accuracy on validation data, with stable training behavior over multiple epochs.

  • Gained practical experience in sequence modeling, image–text integration, and debugging generation models.


Potential next steps

  • Integrate attention mechanisms so the model can focus on spatial regions of the image rather than a single global vector.

  • Add length normalization and coverage penalties to beam search to reduce bias toward shorter or repetitive captions.

  • Replace the LSTM decoder with a Transformer architecture for improved long-range sequence modeling.

  • Fine-tune the CNN encoder jointly with the decoder for end-to-end learning, rather than freezing ResNet features.
