🚀 Image Captioning and Visual Understanding with VLM
Ever wondered how fast cutting-edge vision-language models can run on Apple Silicon? This script puts Moondream2 to the test with a wide range of visual understanding tasks:
- 📝 Image Captioning — short, detailed, streamed, and non-streamed
- ❓ Visual Queries — ask natural language questions about what’s inside the image
- 🎯 Object Detection — find all instances of a given object
- 📍 Pointing — locate objects in the image by name
It also benchmarks runtime performance of these tasks to compare different Apple Silicon chips (M2 vs M4) using the Metal Performance Shaders (MPS) backend.
```python
import time
```
Code Overview
Load Model & Image
- Loads the `vikhyatk/moondream2` model with its custom methods (`caption`, `query`, `detect`, `point`).
- Opens a test image (`/tmp/img.jpg`) for processing.
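A minimal loading sketch, assuming the standard `from_pretrained` flow from the Moondream2 model card (the MPS `device_map` matches the benchmark setup below):

```python
import time
from PIL import Image
from transformers import AutoModelForCausalLM

# Moondream2 ships its caption/query/detect/point methods in the model repo,
# so trust_remote_code=True is required to expose them.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
    device_map={"": "mps"},  # {"": "cuda"} on NVIDIA GPUs, {"": "cpu"} otherwise
)

image = Image.open("/tmp/img.jpg")
```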
Captioning
- Generates a short caption (one sentence).
- Generates a normal caption:
  - In streaming mode, printing tokens as they arrive while measuring latency.
  - In non-streaming mode, returning the full caption at once.
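A sketch of the three captioning calls, assuming the `caption()` signature documented on the model card:

```python
# Short caption: one sentence.
print("Short caption:", model.caption(image, length="short")["caption"])

# Normal caption, streamed: print tokens as they arrive and time the whole stream.
start = time.time()
for token in model.caption(image, length="normal", stream=True)["caption"]:
    print(token, end="", flush=True)
print(f"\nStreamed caption took {time.time() - start:.2f}s")

# Normal caption, non-streamed: the full text is returned at once.
print("Full caption:", model.caption(image, length="normal")["caption"])
```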
Visual Query
- Answers natural-language questions about the image, e.g. “How many people are in the image?”
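For example, with the model's `query()` method (a sketch; the question string is the one quoted above):

```python
answer = model.query(image, "How many people are in the image?")["answer"]
print("Visual query answer:", answer)
```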
Object Detection
- Detects specified objects, e.g. “face”.
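A sketch with `detect()`; per the model card it returns a dict whose `"objects"` entry is a list of bounding boxes:

```python
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s):", objects)  # one bounding box per detected face
```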
Pointing
- Finds spatial locations of objects, e.g. “person”.
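Similarly for `point()`, which returns point locations instead of boxes:

```python
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s):", points)  # one (x, y) location per match
```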
Benchmarking
- Tracks runtime for each task.
- Reports first token latency, average token latency, total generation time, and end-to-end runtime.
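A minimal timing harness in the same spirit (the `timed` helper and the task list are illustrative, not the script's exact code); first-token and average-token latency come from timing inside the streaming loop shown earlier:

```python
def timed(label, fn):
    """Run fn(), print its wall-clock runtime, and return (result, seconds)."""
    start = time.time()
    result = fn()
    elapsed = time.time() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed

tasks = {
    "Short caption": lambda: model.caption(image, length="short"),
    "Visual query": lambda: model.query(image, "How many people are in the image?"),
    "Object detection": lambda: model.detect(image, "face"),
    "Pointing": lambda: model.point(image, "person"),
}
total = sum(timed(name, fn)[1] for name, fn in tasks.items())
print(f"Total runtime: {total:.2f}s")
```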
Results: M2 vs M4 (MPS Device)
We compare the two chips on these two images:
- Little snowman
- The public speech


Little Snowman
Task | M2 Runtime (s) | M4 Runtime (s) | Speedup (≈) |
---|---|---|---|
Short caption | 13.37 | 4.44 | 3× faster |
Normal caption (stream) | 29.45 | 5.84 | 5× faster |
Full caption (non-stream) | 21.45 | 5.77 | 4× faster |
Visual query | 15.19 | 3.58 | 4× faster |
Object detection | 13.21 | 3.38 | 4× faster |
Pointing | 12.67 | 3.36 | 4× faster |
Total runtime | 105.34 | 26.36 | ~4× faster |
Public speech
Task | M2 Runtime (s) | M4 Runtime (s) | Speedup |
---|---|---|---|
Short Caption | 13.2 | 4.3 | ~3× |
Normal Caption (full) | 20.0 | 6.8 | ~3× |
Visual Query | 13.0 | 3.6 | ~3.5× |
Object Detection | 12.9 | 3.5 | ~3.5× |
Pointing | 14.9 | 3.8 | ~4× |
Total Runtime | 95.3 | 28.9 | ~3.3× |
Key Takeaways
- M4 dramatically outperforms M2 across all tasks, often by a factor of 3–5×.
- First token latency (important for streaming applications) is much lower on M4 (0.08s vs 0.16s).
- Average token latency improves significantly (0.043s vs 0.214s).
- This makes M4 much better suited for interactive multimodal applications like real-time captioning, chat, or object recognition.
Example Output (Short Caption)
M2 & M4 produced nearly identical captions, but M4 was ~3× faster.
```
Short caption:
```
How to Run
```bash
pip install transformers pillow
```
- Replace `device_map={"": "cuda"}` with `"mps"` for Apple Silicon.
- Update `image_path` to your own test image.
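If you are unsure which backend is available, a small check (not part of the original script) can pick the device map automatically:

```python
import torch

# Prefer Apple's Metal backend, fall back to CUDA, then CPU.
if torch.backends.mps.is_available():
    device_map = {"": "mps"}
elif torch.cuda.is_available():
    device_map = {"": "cuda"}
else:
    device_map = {"": "cpu"}
```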