Full-Stack & AI Developer | Columbia CS | Puzzle solver & systems thinker
Image Captioning with Neural Networks
Built as part of a Natural Language Processing (NLP) course, this project explores the intersection of computer vision and language generation through an end-to-end image captioning system. Using the Flickr8k dataset, I implemented a CNN–RNN pipeline in which images are encoded with a pretrained ResNet-18 and captions are generated by an LSTM language model. I extended the baseline with a conditioned LSTM that concatenates image features at every timestep, and I compared decoding strategies such as greedy search and beam search. Along the way I addressed practical issues such as efficiently preprocessing 6,000+ images, managing tokenization and padding for sequences of up to 40 words, and reducing repetition in generated captions.
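The encoding step looks roughly like the sketch below: a pretrained ResNet-18 with its final classification layer removed is run once per image so the 512-dimensional features can be cached before training. The helper name and the exact preprocessing transforms here are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing (assumed; the project's transforms may differ).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Drop the final fc layer so the network outputs pooled 512-d features.
resnet = models.resnet18(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    """Return a 512-d feature vector for a single image (hypothetical helper)."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    feats = encoder(img)                                            # (1, 512, 1, 1)
    return feats.flatten(1).squeeze(0)                              # (512,)
```

Caching these vectors once, rather than re-running the CNN every epoch, is what keeps training on roughly 6,000 images fast.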
Goals
- Develop a complete image-to-caption pipeline combining vision and sequence modeling.
- Compare baseline text-only captioning against a conditioned LSTM that integrates image features directly.
- Explore decoding strategies (greedy vs. beam search) and analyze their impact on fluency and diversity.
- Gain hands-on experience integrating pretrained encoders, sequence models, and generation techniques.
Technical specs
- Dataset: Flickr8k (≈6,000 training, 1,000 dev, 1,000 test images).
- Image Encoder: ResNet-18 pretrained on ImageNet with the final classification layer removed, outputting 512-dimensional embeddings.
- Text Processing: captions tokenized and padded to a maximum length of 40, with special tokens <START>, <EOS>, and <PAD>.
- Models:
  - Baseline LSTM language model (embedding dim 512, hidden size 512).
  - Conditioned LSTM that concatenates the 512-dim image vector to the token embeddings at each step (see the sketch after this list).
- Training: cross-entropy loss with AdamW, batch size 16, learning rate 1e-3.
- Decoding: greedy decoding, stochastic sampling, and beam search (beam widths 3 and 5).
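A minimal sketch of the conditioned decoder described above, using the dimensions from these specs (512-d embeddings, 512 hidden units, 512-d image vector); the class name, argument names, and padding index are my own shorthand rather than the project's code.

```python
import torch
import torch.nn as nn

class ConditionedLSTM(nn.Module):
    """LSTM decoder that sees the image vector at every timestep."""
    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 hidden_dim: int = 512, img_dim: int = 512, pad_idx: int = 0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # Input at each step = word embedding concatenated with the image feature.
        self.lstm = nn.LSTM(embed_dim + img_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, captions: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # captions: (B, T) token ids; img_feats: (B, 512)
        emb = self.embed(captions)                                 # (B, T, embed_dim)
        img = img_feats.unsqueeze(1).expand(-1, emb.size(1), -1)   # (B, T, img_dim)
        out, _ = self.lstm(torch.cat([emb, img], dim=-1))          # (B, T, hidden_dim)
        return self.fc(out)                                        # (B, T, vocab_size)
```

Training pairs this with cross-entropy loss and AdamW (batch size 16, learning rate 1e-3), as listed above; the baseline model is essentially the same decoder without the concatenated image vector.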
Design process
- Encoded all images up front into tensors to avoid recomputation and save training time.
- Built a CaptionDataset class for flexible batching and alignment between images and their five reference captions.
- Implemented the conditioned LSTM by expanding the image vector to match the sequence length and concatenating it to the word embeddings.
- Iteratively refined decoding, moving from greedy search (deterministic but repetitive) to beam search for more varied, natural captions.
- Adjusted the beam search logic to properly handle <EOS> tokens and normalize probabilities across different caption lengths (see the sketch after this list).
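The beam search changes boil down to two details: a beam that has emitted <EOS> is carried forward unchanged rather than extended, and final scores are compared after dividing by a power of the caption length so longer hypotheses are not unfairly penalized. A minimal sketch under those assumptions; step_fn is a hypothetical stand-in for one decoder step, and the length-penalty exponent is illustrative.

```python
import torch

def beam_search(step_fn, start_id: int, eos_id: int,
                beam_width: int = 3, max_len: int = 40, alpha: float = 0.7):
    """step_fn(token_ids) -> (vocab_size,) log-probs for the next token (assumed interface)."""
    beams = [([start_id], 0.0)]  # (token sequence, summed log-probability)

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:
                candidates.append((seq, score))      # finished beam: keep as-is
                continue
            log_probs = step_fn(torch.tensor(seq))   # scores over the whole vocabulary
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break

    # Length normalization: rank beams by log-prob / len^alpha, not by raw sums.
    best_seq, _ = max(beams, key=lambda b: b[1] / (len(b[0]) ** alpha))
    return best_seq
```

Without the normalization step, the raw sum of log-probabilities always favors shorter captions, which is exactly the bias noted under Challenges below.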
Challenges
- Mitigating hallucination: generated captions sometimes described details not present in the image.
- Preventing repetition in long captions, especially under beam search without length normalization.
- Balancing efficiency and correctness when preprocessing thousands of images into tensors.
- Debugging tensor shapes when concatenating image embeddings with word embeddings for the conditioned model.
Outcomes
- Built a modular captioning pipeline capable of generating coherent captions on unseen dev images.
- Conditioned LSTM produced more context-aware captions than the baseline, though still prone to hallucination.
- Beam search improved fluency and reduced some repetition compared to greedy decoding.
- Achieved ~60% token-level accuracy on validation data, with stable training behavior over multiple epochs.
- Gained practical experience in sequence modeling, image–text integration, and debugging generation models.
Potential next steps
- Integrate attention mechanisms so the model can focus on spatial regions of the image rather than a single global vector.
- Add length normalization and coverage penalties to beam search to reduce bias toward shorter or repetitive captions.
- Replace the LSTM decoder with a Transformer architecture for improved long-range sequence modeling.
- Fine-tune the CNN encoder jointly with the decoder for end-to-end learning, rather than freezing ResNet features.