Liquid AI has unveiled LFM2.5-VL-450M, an upgraded version of its earlier LFM2-VL-450M vision-language model. The release adds bounding box prediction, stronger instruction following, broader multilingual comprehension, and function calling support, all within a 450M-parameter model designed to run directly on edge hardware, from embedded AI modules such as NVIDIA Jetson Orin to mini-PC APUs such as the AMD Ryzen AI Max+ 395 and flagship phone SoCs such as the Snapdragon 8 Elite found in the Samsung S25 Ultra.
What a Vision-Language Model Is and Why Model Size Matters
Before delving deeper, it’s essential to grasp the concept of a vision-language model (VLM). A VLM is a model capable of processing both images and text simultaneously. You can feed it a photo and ask questions about it in natural language, and it will provide responses. Most large VLMs require substantial GPU memory and cloud infrastructure to function. This poses a challenge for real-world deployment scenarios like warehouse robots, smart glasses, or retail shelf cameras, where computational resources are limited, and low latency is crucial.
LFM2.5-VL-450M addresses this constraint by offering a model compact enough to fit on edge hardware while still supporting a significant range of vision and language capabilities.
Architecture and Training
LFM2.5-VL-450M utilizes LFM2.5-350M as its language model backbone and SigLIP2 NaFlex shape-optimized 86M as its vision encoder. It operates with a context window of 32,768 tokens and a vocabulary size of 65,536.
For image processing, the model handles native resolution up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion. Larger images are split into non-overlapping 512×512 patches via a tiling strategy, with an additional thumbnail encoding for global context. The thumbnail matters because it gives the model a sense of the overall scene rather than only local patches. At inference time, users can adjust the maximum image tokens and tile count to trade speed against quality without retraining, which is useful when deploying across hardware with different compute budgets.
Liquid AI recommends specific generation parameters for text and vision inputs to optimize performance.
On the training front, Liquid AI expanded pre-training from 10T to 28T tokens compared to LFM2-VL-450M. Post-training involved preference optimization and reinforcement learning to enhance grounding, instruction following, and overall reliability across vision-language tasks.
New Capabilities Over LFM2-VL-450M
The most notable addition is bounding box prediction: LFM2.5-VL-450M scores 81.28 on RefCOCO-M, up from zero in the previous model. RefCOCO-M is a visual grounding benchmark that evaluates a model's ability to locate objects in an image from a natural language description. The model outputs structured JSON with normalized coordinates pinpointing where objects sit in a scene, returning spatial information rather than only a description. This distinguishes it from plain image captioning and makes it directly usable in pipelines that require spatial outputs.
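A downstream pipeline would typically convert those normalized coordinates back to pixels. The article does not specify the exact JSON schema, so the `{"label": ..., "bbox": [x1, y1, x2, y2]}` shape below is an assumed example; check the model card for the actual output format.

```python
import json

def to_pixels(model_output: str, width: int, height: int):
    """Convert normalized [0, 1] bounding boxes from the model's JSON
    output into integer pixel boxes. The schema parsed here is an
    assumption for illustration, not the documented format."""
    detections = json.loads(model_output)
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        results.append({
            "label": det["label"],
            "box": (round(x1 * width), round(y1 * height),
                    round(x2 * width), round(y2 * height)),
        })
    return results

# Hypothetical model output for a 1280x720 warehouse frame.
raw = '[{"label": "forklift", "bbox": [0.10, 0.25, 0.48, 0.90]}]'
print(to_pixels(raw, 1280, 720))
# [{'label': 'forklift', 'box': (128, 180, 614, 648)}]
```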
Multilingual support has also seen a substantial enhancement, with MMMB scores rising from 54.29 to 68.09 across languages such as Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish. This improvement is particularly relevant for global deployments where local-language prompts need to be understood alongside visual inputs without the need for separate localization pipelines.
Instruction following improves as well, with MM-IFEval scores rising from 32.93 to 45.00. In practice, this means the model more consistently adheres to explicit constraints in a prompt, such as responding in a specific format or restricting output to certain fields.
Additionally, function calling support for text-only input has been introduced, measured by BFCLv4 at 21.08. This capability allows the model to be utilized in pipelines that require invoking external tools, such as calling a weather API or triggering actions in downstream systems.
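The usual pattern for wiring such a model into tools is to parse its emitted call and dispatch to a local function. The weather tool and the `{"name": ..., "arguments": {...}}` call format below are purely illustrative assumptions; the model's actual calling convention is defined on its model card.

```python
import json

# Hypothetical local tool the model could invoke; stubbed data, no real API.
def get_weather(city: str) -> str:
    fake_data = {"Berlin": "12C, overcast"}
    return fake_data.get(city, "unknown")

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_text: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching
    registered function. The call shape parsed here is an assumed
    convention, not the model's documented format."""
    call = json.loads(tool_call_text)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Pretend this string came from the model's function-calling output.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
print(dispatch(model_output))  # 12C, overcast
```

The same dispatch loop extends to triggering actions in downstream systems, as the article suggests.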
Benchmark Performance
Across vision benchmarks assessed using VLMEvalKit, LFM2.5-VL-450M surpasses both LFM2-VL-450M and SmolVLM2-500M in most tasks. Noteworthy scores include 86.93 on POPE, 684 on OCRBench, 60.91 on MMBench (dev en), and 58.43 on RealWorldQA.
Two benchmark improvements stand out, going beyond the headline figures. MMVet, which tests more open-ended visual understanding, saw an increase from 33.85 to 41.10, representing a substantial relative gain. CountBench, evaluating the model’s object counting ability, improved from 47.64 to 73.31, marking one of the most significant relative enhancements in the evaluation. InfoVQA scores remained relatively stable at 43.02 compared to 44.56 in the previous model.
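The "substantial relative gain" wording can be made concrete with a quick calculation on the scores quoted above:

```python
def rel_gain(before: float, after: float) -> float:
    """Relative improvement in percent between two benchmark scores."""
    return (after - before) / before * 100

# Scores from the article's benchmark comparison.
print(round(rel_gain(33.85, 41.10), 1))  # MMVet: 21.4 (% relative gain)
print(round(rel_gain(47.64, 73.31), 1))  # CountBench: 53.9 (% relative gain)
```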
On language-only benchmarks, IFEval saw an improvement from 51.75 to 61.16, and Multi-IF from 26.21 to 34.63. It should be noted that the model does not excel in all tasks, as MMMU (val) experienced a slight drop from 34.44 to 32.67. Liquid AI emphasizes that the model may not be ideal for knowledge-intensive tasks or fine-grained OCR.
Edge Inference Performance
LFM2.5-VL-450M, with Q4_0 quantization, is capable of running efficiently on a wide range of target hardware, from embedded AI modules like Jetson Orin to mini-PC APUs like Ryzen AI Max+ 395 and flagship phone SoCs like Snapdragon 8 Elite.
The latency numbers back this up. On Jetson Orin, the model processes a 256×256 image in 233ms and a 512×512 image in 242ms, staying below 250ms at both resolutions. That is fast enough to run full vision-language comprehension, not just detection, on every frame of a 4 FPS video stream. On the Samsung S25 Ultra, latency is 950ms for 256×256 and 2.4 seconds for 512×512. On the AMD Ryzen AI Max+ 395, the model reaches 637ms for 256×256 and 944ms for 512×512, keeping the smaller resolution under one second on both consumer devices for responsive interactive use.
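The 4 FPS claim follows from simple budget arithmetic: at 4 frames per second, each frame has a 250ms window, and 242ms fits inside it. A quick sanity check, assuming sequential single-stream processing (one inference must finish before the next frame arrives):

```python
def max_fps(latency_ms: float) -> float:
    """Highest sustainable frame rate if each frame's inference must
    complete before the next frame is processed."""
    return 1000.0 / latency_ms

# Per-frame latencies quoted in the article (Q4_0 quantization).
print(round(max_fps(242), 2))  # Jetson Orin, 512x512: 4.13 FPS
print(round(max_fps(944), 2))  # Ryzen AI Max+ 395, 512x512: 1.06 FPS
```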
Real-World Use Cases
LFM2.5-VL-450M is particularly well-suited for real-world deployments where low latency, compact structured outputs, and efficient semantic reasoning are paramount, especially in scenarios where offline operation or on-device processing is crucial for maintaining privacy.
In industrial automation settings with limited compute resources, such as passenger vehicles, agricultural machinery, and warehouses, perception models are often constrained to bounding box outputs. LFM2.5-VL-450M goes a step further by providing grounded scene understanding in a single pass, enabling richer outputs for environments like warehouse aisles, encompassing worker actions, forklift movements, and inventory flow, all while remaining compatible with existing edge hardware like Jetson Orin.
For wearables and continuous monitoring devices like smart glasses, body-worn assistants, dashcams, and security or industrial monitors, the model offers a solution without the need for extensive perception stacks or continuous cloud streaming. A proficient VLM can generate compact semantic outputs locally, transforming raw video into structured insights while keeping computational demands low and preserving user privacy.
In retail and e-commerce applications, tasks such as catalog ingestion, visual search, product matching, and shelf compliance demand more than just object detection. However, deploying richer visual understanding at scale can be costly. LFM2.5-VL-450M makes structured visual reasoning practical for these workloads.
Key Takeaways
LFM2.5-VL-450M introduces bounding box prediction for the first time, scoring 81.28 on RefCOCO-M, up from zero in the previous model. This lets the model return structured spatial coordinates for detected objects rather than only descriptive text.
Pre-training was scaled from 10T to 28T tokens, coupled with post-training via preference optimization and reinforcement learning, resulting in consistent benchmark improvements across vision and language tasks compared to LFM2-VL-450M.
The model operates on edge hardware with sub-250ms latency, processing a 512×512 image in 242ms on NVIDIA Jetson Orin with Q4_0 quantization — fast enough to achieve full vision-language comprehension on every frame of a 4 FPS video stream without relying on cloud processing.
Multilingual visual understanding has significantly improved, with MMMB scores rising from 54.29 to 68.09 across Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish, making the model suitable for global deployments without the need for separate localization models.
For more technical details and the model weights, visit the Liquid AI blog.