🚀 Image Captioning and Visual Understanding with VLM
Ever wondered how fast cutting-edge vision-language models can run on Apple Silicon? This script puts Moondream2 through its paces on a range of visual understanding tasks, sketched in the example below the list:
- 📝 Image Captioning — short, detailed, streamed, and non-streamed
- ❓ Visual Queries — ask natural language questions about what’s inside the image
- 🎯 Object Detection — find all instances of a given object
- 📍 Pointing — return point coordinates for named objects in the image
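
Here is a minimal sketch of how these tasks can be driven through the Hugging Face `transformers` interface to Moondream2. It assumes the model's `trust_remote_code` revision exposes the `caption`, `query`, `detect`, and `point` helpers; the image path and prompts are illustrative, not taken from the script.

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Load Moondream2 with its custom modeling code, then move it to the MPS device.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    trust_remote_code=True,
).to("mps")

image = Image.open("example.jpg")  # placeholder path

# 📝 Captioning: short and detailed, streamed and non-streamed
short_caption = model.caption(image, length="short")["caption"]
for token in model.caption(image, length="normal", stream=True)["caption"]:
    print(token, end="", flush=True)

# ❓ Visual query: ask a natural-language question about the image
answer = model.query(image, "What is the person holding?")["answer"]

# 🎯 Object detection: bounding boxes for every instance of a named object
boxes = model.detect(image, "dog")["objects"]

# 📍 Pointing: point coordinates for each named object
points = model.point(image, "dog")["points"]
```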
It also benchmarks runtime performance of these tasks to compare different Apple Silicon chips (M2 vs M4) using the Metal Performance Shaders (MPS) backend.
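
The benchmarking can be as simple as wrapping each task in a timer. The helper below is a hypothetical sketch (not the script's actual harness) that synchronizes the MPS queue before reading the clock, so the measured time reflects completed GPU work rather than queued calls.

```python
import time
import torch

def time_task(fn, *args, warmup=1, runs=3, **kwargs):
    """Hypothetical helper: average wall-clock latency of a model call on MPS."""
    for _ in range(warmup):      # warm-up pass to exclude one-time setup costs
        fn(*args, **kwargs)
    torch.mps.synchronize()      # wait for queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args, **kwargs)
    torch.mps.synchronize()
    return (time.perf_counter() - start) / runs

# Example: compare captioning latency across chips (e.g. M2 vs. M4)
# print(f"short caption: {time_task(model.caption, image, length='short'):.2f} s")
```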