Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), OpenAI (CLIP), and others, written by leading researchers and practitioners Merve Noyan, Miquel Farré, Andrés Marafioti, and Orr Zohar. From image captioning and document understanding to advanced zero-shot inference and retrieval-augmented generation (RAG), this book covers the full VLM application and development lifecycle.

Designed for ML engineers, data scientists, and developers, this guide distills cutting-edge VLM research into practical techniques. Readers will learn how to prepare datasets, select the right architectures, fine-tune and deploy models, and apply them to real-world tasks across a range of industries.

- Explore core model architectures and alignment techniques
- Train and fine-tune VLMs with Hugging Face, PyTorch, and others
- Deploy models for applications like image search and captioning
- Implement advanced inference strategies, from zero-shot to agentic systems
- Build scalable VLM systems ready for production use

Merve Noyan is a machine learning engineer on the ML advocacy engineering team at Hugging Face, where she builds tools that enable people to work with vision language models across the Hugging Face ecosystem (transformers, TRL, smolagents). Previously, she worked at several companies building natural language understanding solutions for information retrieval and conversational agents.

Andrés Marafioti holds a PhD in applied machine learning, with a focus on multimodal generative methods. Previously a senior ML engineer at Unity, he played a key role in bringing multimodal ML-based products from concept to market adoption. Now at Hugging Face, Andrés conducts cutting-edge research in multimodal and memory-efficient models, leading the development of SmolVLM, a state-of-the-art vision-language model. He has co-authored several impactful papers in the VLM space, such as "Building and Better Understanding Vision-Language Models."

Miquel Farré is a video technology expert with over 15 years of experience and more than 60 patents in machine learning and information science. His career began at the Fraunhofer Institute, where he designed advanced video codecs, and continued at Nagravision, where he developed video streaming security modules. Transitioning to video understanding, Miquel joined Disney to architect the enterprise content metadata platform, leading machine learning initiatives across Pixar, Marvel, Lucasfilm, ABC, and ESPN. He then moved to YouTube, driving search monetization before expanding his focus to lead monetization for the platform's Home and Watch Next surfaces. Before joining Hugging Face to work on video multimodal large language models, Miquel founded Arbro AI to build automated farming solutions.

Orr Zohar is a PhD candidate in the Stanford Vision and Learning Lab (SVL) at Stanford University, advised by Prof. Serena Yeung-Levy and supported by the Knight-Hennessy Scholarship. His research centers on large multimodal models, particularly video understanding, with a focus on self-training methodologies and agentic design. Orr has co-developed innovative approaches such as Video-STaR, a self-training method for video instruction tuning, and VideoAgent, an agent-based framework for long-form video comprehension.
Notably, he led the Apollo project, a comprehensive study of video understanding in large multimodal models, which resulted in the Apollo family of models that set new benchmarks in the field.