IBM Introduces Granite 4.0 3B Vision: A Breakthrough in Enterprise Document Data Extraction
IBM recently unveiled Granite 4.0 3B Vision, a cutting-edge vision-language model (VLM) tailored for extracting data from enterprise-grade documents. Unlike larger general-purpose multimodal models, this release takes a specialized approach, focusing on high-fidelity visual reasoning built on top of the Granite 4.0 Micro language backbone.
This launch marks a shift towards modular, extraction-centric AI that emphasizes the accuracy of structured data conversion, such as transforming complex charts into code or tables into HTML, over general image captioning.
Architecture: Modular LoRA and DeepStack Integration
The Granite 4.0 3B Vision model is delivered as a LoRA (Low-Rank Adaptation) adapter with approximately 0.5B parameters. This adapter is designed to be stacked on top of the Granite 4.0 Micro base model, a 3.5B parameter dense language model. This architecture enables a ‘dual-mode’ deployment, where the base model can handle text-only requests independently, activating the vision adapter only when multimodal processing is necessary.
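The dual-mode idea can be sketched in a few lines of plain Python. This is an illustrative mock, not IBM's actual serving API: the class and function names (`TextBackbone`, `VisionAdapter`, `serve`) are hypothetical stand-ins for the 3.5B text backbone and the ~0.5B LoRA adapter.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of 'dual-mode' dispatch: the text backbone serves every
# request, and the vision LoRA adapter is activated only when a request
# carries images. All names here are illustrative, not IBM's real API.

@dataclass
class Request:
    prompt: str
    images: list = field(default_factory=list)  # empty list => text-only


class TextBackbone:
    """Stands in for the Granite 4.0 Micro (3.5B) dense language model."""
    def generate(self, prompt: str) -> str:
        return f"text-only answer to: {prompt}"


class VisionAdapter:
    """Stands in for the ~0.5B LoRA adapter stacked on the backbone."""
    def __init__(self, backbone: TextBackbone):
        self.backbone = backbone

    def generate(self, prompt: str, images: list) -> str:
        return f"multimodal answer to: {prompt} ({len(images)} image(s))"


def serve(request: Request, backbone: TextBackbone, adapter: VisionAdapter) -> str:
    # Route: activate the vision adapter only for multimodal requests,
    # so text-only traffic never pays the vision-processing cost.
    if request.images:
        return adapter.generate(request.prompt, request.images)
    return backbone.generate(request.prompt)
```

The point of the pattern is that a single deployment can keep the base weights resident and switch the adapter in per request, rather than running two separate models.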
Vision Encoder and Patch Tiling
The visual component of the model utilizes the google/siglip2-so400m-patch16-384 encoder. To maintain high resolution across various document layouts, the model employs a tiling mechanism. Input images are segmented into 384×384 patches, processed alongside a downscaled global view of the entire image. This methodology ensures the preservation of fine details, such as subscripts in formulas or small data points in charts, before they reach the language backbone.
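The tiling arithmetic is straightforward to sketch. The snippet below plans the 384×384 crop grid for a page plus the downscaled global view; the exact padding and overlap policy of the released model may differ, so treat this as an assumption-laden illustration.

```python
import math

TILE = 384  # SigLIP2 encoder input resolution (patch16-384)

def plan_tiles(width: int, height: int, tile: int = TILE):
    """Sketch of the tiling scheme: split the full-resolution page into
    tile x tile crops (the last row/column would be padded) and append one
    downscaled global view of the whole image. Illustrative only; the real
    model's crop/overlap policy is not documented here."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    crops = [(c * tile, r * tile, tile, tile)   # (x, y, w, h) in padded image
             for r in range(rows) for c in range(cols)]
    global_view = (0, 0, tile, tile)            # whole page resized to one tile
    return crops + [global_view]

# A 1700x2200 px scanned page yields a 5x6 grid of crops plus the global view.
tiles = plan_tiles(1700, 2200)
```

Because each crop stays at native resolution, small glyphs (subscripts, axis labels, fine chart detail) survive into the encoder rather than being averaged away by a single global resize.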
The DeepStack Backbone
To bridge the vision and language modalities, IBM utilizes a version of the DeepStack architecture. This involves deeply stacking visual tokens into the language model across 8 specific injection points. By routing visual features into multiple transformer layers, the model achieves a closer alignment between the semantic content (‘what’) and spatial layout (‘where’), which is crucial for maintaining structure during document parsing.
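A toy sketch makes the DeepStack idea concrete: instead of feeding visual tokens only at the embedding layer, projected visual features are re-injected into the hidden states at several depths. The layer count and the evenly spaced injection points below are assumptions for illustration; the released model's actual layer choices are not documented here.

```python
# Illustrative DeepStack-style injection. 'hidden' and 'visual' are lists of
# floats standing in for token features; the transformer blocks themselves
# are omitted (identity) so only the injection pattern is shown.

NUM_LAYERS = 32       # assumed backbone depth (illustrative)
NUM_INJECTIONS = 8    # per the DeepStack description above

def injection_layers(num_layers: int = NUM_LAYERS, k: int = NUM_INJECTIONS):
    """Evenly spaced injection points, e.g. layers 0, 4, 8, ..., 28."""
    step = num_layers // k
    return [i * step for i in range(k)]

def forward(hidden, visual):
    """Toy forward pass: at each injection layer, the visual features are
    added element-wise to the hidden states before the (omitted) transformer
    block would run."""
    inject_at = set(injection_layers())
    for layer in range(NUM_LAYERS):
        if layer in inject_at:
            hidden = [h + v for h, v in zip(hidden, visual)]
        # ... a real transformer block would transform `hidden` here ...
    return hidden
```

Repeating the injection at multiple depths lets later layers re-consult the raw spatial evidence instead of relying only on what the first layer distilled from it.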
Training Curriculum: Focused on Chart and Table Extraction
The training of Granite 4.0 3B Vision signifies a strategic pivot towards specialized extraction tasks. Instead of relying solely on general image-text datasets, IBM leveraged a curated mix of instruction-following data concentrated on complex document structures.
- ChartNet Dataset: The model underwent refinement using ChartNet, a multimodal dataset geared towards robust chart understanding.
- Code-Guided Pipeline: A key training highlight involves a ‘code-guided’ approach for chart reasoning. This pipeline utilizes aligned data comprising the original plotting code, resulting rendered image, and underlying data table, enabling the model to grasp the structural relationship between visual representations and their source data.
- Extraction Tuning: The model was fine-tuned on a blend of datasets focusing on Key-Value Pair (KVP) extraction, table structure recognition, and converting visual charts into machine-readable formats like CSV, JSON, and OTSL.
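The shape of one 'code-guided' training example can be sketched as follows. The helper below derives, from a single source data table, an illustrative plotting-code string plus the CSV and JSON extraction targets; the real pipeline also renders the code to an image, which is omitted here, and the record fields are hypothetical names, not IBM's schema.

```python
import csv
import io
import json

def make_code_guided_record(series: dict) -> dict:
    """Hedged sketch of one aligned training triple: (plotting code,
    rendered image [omitted], machine-readable targets) all derived from
    the same underlying data table."""
    header = list(series.keys())
    rows = list(zip(*series.values()))

    # (a) Illustrative matplotlib-style plotting code, kept as a string
    # and never executed in this sketch.
    plot_code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({series[header[0]]!r}, {series[header[1]]!r})\n"
        "plt.savefig('chart.png')\n"
    )

    # (b) Machine-readable extraction targets (CSV and JSON).
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)

    return {
        "plot_code": plot_code,        # hypothetical field names
        "csv_target": buf.getvalue(),
        "json_target": json.dumps(series),
    }
```

Because every field of the record is generated from the same table, the model sees a consistent mapping from visual structure to source data, which is the premise of the code-guided approach described above.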
Performance and Evaluation Benchmarks
In technical evaluations, Granite 4.0 3B Vision was benchmarked against several industry-standard suites for document understanding. Datasets such as PubTables-v2 and OmniDocBench were used to validate the model’s zero-shot performance in real-world scenarios.
| Task | Evaluation Benchmark | Metric |
|---|---|---|
| KVP Extraction | VAREX | 85.5% Exact Match (Zero-Shot) |
| Chart Reasoning | ChartNet (Human-Verified Test Set) | High Accuracy in Chart2Summary |
| Table Extraction | TableVQA-Bench & OmniDocBench | Evaluated via TEDS and HTML extraction |
The model currently holds the 3rd position among models in the 2–4B parameter class on the VAREX leaderboard (as of March 2026), showcasing its efficiency in structured extraction despite its compact size.
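To give a feel for how table-extraction quality is scored, here is a deliberately simplified stand-in for a TEDS-style comparison. Real TEDS computes a tree edit distance over the full HTML structure (tags included); the proxy below only compares the ordered cell texts with a crude regex parser, so it is an illustration of the idea, not the actual metric.

```python
import difflib
import re

def cell_sequence(html_table: str):
    """Flatten an HTML table into its ordered cell texts.
    Very crude regex parsing; for illustration only."""
    return re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", html_table, flags=re.S)

def table_similarity(pred_html: str, gold_html: str) -> float:
    """Simplified proxy for TEDS: similarity ratio over the two cell
    sequences. Unlike real TEDS, structural tags (rowspans, nesting)
    contribute nothing to the score here."""
    pred = cell_sequence(pred_html)
    gold = cell_sequence(gold_html)
    return difflib.SequenceMatcher(None, pred, gold).ratio()
```

A perfect extraction scores 1.0 and every missed or corrupted cell pulls the score down, which mirrors, in spirit, how TEDS penalizes both content and structure errors.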
Key Takeaways
- Modular LoRA Architecture: The model, a 0.5B parameter LoRA adapter, operates on the Granite 4.0 Micro (3.5B) backbone. This design allows for efficient handling of text-only workloads in a single deployment, activating vision capabilities as needed.
- High-Resolution Tiling: Using the google/siglip2-so400m-patch16-384 encoder, the model processes images by tiling them into 384×384 patches alongside a downscaled global view, ensuring preservation of fine details in complex documents.
- DeepStack Injection: To enhance layout awareness, the model employs a DeepStack approach with 8 injection points. This method routes semantic features to earlier layers and spatial details to later layers, essential for accurate table and chart extraction.
- Specialized Extraction Training: Beyond general instruction following, the model underwent refinement using ChartNet and a ‘code-guided’ pipeline aligning plotting code, images, and data tables to help internalize the logic of visual data structures.
- Developer-Ready Integration: The release, licensed under Apache 2.0, offers native support for vLLM (via a custom model implementation) and Docling, IBM’s tool for converting unstructured PDFs into machine-readable JSON or HTML.