IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

Optimizing Large Language Models with IndexCache

Large language models face challenges when processing extensive contexts because computational costs grow quadratically with sequence length. Researchers at Tsinghua University and Z.ai have developed IndexCache, a technique that eliminates redundant computation in sparse attention models. The result is up to a 75% reduction in computation, which translates into 1.82x faster time-to-first-token and 1.48x faster generation throughput at extended context lengths.

The Challenge of Sparse Attention Models

Large language models rely on self-attention mechanisms to predict the next token by analyzing the relationship between each token in the context and all preceding tokens. However, the computational complexity of self-attention scales quadratically with sequence length, leading to slow inference speeds and high computational costs for long-context applications.
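To make the quadratic scaling concrete, a back-of-the-envelope calculation (an illustration only, not a measurement of any particular model) shows how quickly the attention score matrix grows with context length:

```python
# Illustrative arithmetic: the attention score matrix for a sequence of
# length n has n * n entries, so the cost grows with the square of context.
def attention_score_entries(n: int) -> int:
    # Every token is scored against every token (causal masking roughly
    # halves this in practice, but the asymptotic cost stays quadratic).
    return n * n

short = attention_score_entries(8_000)    # 8K-token context
long = attention_score_entries(200_000)   # 200K-token context
print(long // short)  # a 25x longer context costs 625x more attention work
```

This is why a 200K-token context is so much more expensive than a short prompt, even though it is "only" 25 times longer.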

Sparse attention offers a solution by optimizing the process to focus only on the most relevant subset of tokens, rather than all preceding tokens. DeepSeek Sparse Attention (DSA) is an efficient implementation of this concept, dramatically speeding up models by reducing the heavy core attention computation from quadratic to linear.
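The core idea can be sketched in a few lines. The following is a minimal toy illustration of top-k token selection, not DSA's actual indexer; the function and variable names are hypothetical:

```python
import numpy as np

def select_relevant_tokens(q, keys, k_top):
    """Toy sketch of sparse attention's selection step: a cheap scoring
    pass picks a small subset of tokens, and the expensive attention
    computation then runs only over that subset."""
    scores = keys @ q                      # one cheap relevance score per token
    top_idx = np.argsort(scores)[-k_top:]  # keep the k_top highest-scoring tokens
    return np.sort(top_idx)                # return their positions in order

rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64))     # 1000 cached key vectors
q = rng.standard_normal(64)                # current query vector
selected = select_relevant_tokens(q, keys, k_top=32)
print(len(selected))  # attention now runs over 32 tokens instead of 1000
```

In a real DSA model the scoring pass is performed by a learned, lightweight indexer, but the shape of the computation is the same: score everything cheaply, attend to a few tokens expensively.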

Introducing IndexCache for Improved Efficiency

To address the bottleneck in DSA models caused by the quadratic complexity of indexers, the research team developed IndexCache. This technique leverages the stability of selected tokens across consecutive transformer layers to partition the model’s layers into full (F) and shared (S) categories. Full layers actively score and cache important tokens, while shared layers reuse the cached indices from the nearest preceding full layer.

During inference, the model checks each layer's type: full layers compute and cache fresh indices, while shared layers skip the indexer entirely and copy the cached indices from the nearest preceding full layer. This approach optimizes compute efficiency rather than memory footprint, since the savings come from the indexer work that shared layers no longer perform.
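A minimal sketch of this full/shared scheme follows. The toy indexer and the role list are assumptions for illustration; in the real system this logic lives inside the model's attention layers:

```python
def run_indexer(hidden_state, k_top):
    # Hypothetical stand-in for a full layer's indexer: score each token
    # position and return the k_top most relevant positions, in order.
    scored = sorted(((abs(h), i) for i, h in enumerate(hidden_state)), reverse=True)
    return sorted(i for _, i in scored[:k_top])

def forward_with_index_cache(layer_roles, hidden_state, k_top=4):
    """layer_roles is a list like ['F', 'S', 'S', 'F', 'S'].

    Full ('F') layers run the indexer and refresh the cache; shared ('S')
    layers skip the indexer and reuse the cache from the nearest
    preceding full layer."""
    cached = None
    indexer_calls = 0
    per_layer_indices = []
    for role in layer_roles:
        if role == 'F':
            cached = run_indexer(hidden_state, k_top)
            indexer_calls += 1
        # 'S' layers fall through and reuse `cached` unchanged
        per_layer_indices.append(cached)
    return per_layer_indices, indexer_calls

hidden = [0.9, -0.1, 2.3, 0.4, -1.7, 0.05, 1.1, -0.8]
indices, calls = forward_with_index_cache(['F', 'S', 'S', 'F', 'S'], hidden)
print(calls)  # 2 indexer runs for 5 layers; the other 3 copy cached indices
```

The speedup comes directly from the ratio of shared to full layers: every 'S' layer is an indexer pass the model no longer pays for.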

Real-World Performance Enhancements

Applying IndexCache to the 30-billion-parameter GLM-4.7 Flash model resulted in a 1.82x speedup in prefill latency and a 1.48x speedup in generation throughput at a 200K context length. These efficiency gains translate into cost savings for enterprises, particularly in long-context applications such as document analysis and agentic pipelines.

Remarkably, these speedups did not compromise the models’ reasoning capabilities. The researchers conducted tests on the 30B model and the production-scale 744B GLM-5 model, demonstrating significant performance improvements without sacrificing quality.

Implementing IndexCache in Production

For development teams looking to integrate IndexCache, the process involves using domain-specific data to calibrate the optimal layer configuration. Open-source patches are available on GitHub for major serving engines, facilitating seamless integration with existing inference stacks.
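The calibration step might look something like the following sketch, which marks a layer as shared when the tokens it selects on calibration data overlap heavily with the previous layer's selection. The procedure and the threshold here are assumptions for illustration, not the team's published recipe:

```python
def calibrate_layer_roles(per_layer_indices, overlap_threshold=0.9):
    """Hypothetical calibration pass: given the token indices each layer
    selected on calibration data, mark a layer shared ('S') when its
    selection overlaps heavily with the preceding layer's, else full ('F')."""
    roles = ['F']  # the first layer must always compute its own indices
    for prev, cur in zip(per_layer_indices, per_layer_indices[1:]):
        overlap = len(set(prev) & set(cur)) / len(set(cur))
        roles.append('S' if overlap >= overlap_threshold else 'F')
    return roles

# Example: some layers select nearly the same tokens as their predecessor
layer_selections = [
    [2, 4, 6, 8],
    [2, 4, 6, 8],   # identical to previous -> shared
    [2, 4, 6, 9],   # only 75% overlap -> full
    [2, 4, 6, 9],   # identical to previous -> shared
]
print(calibrate_layer_roles(layer_selections))  # ['F', 'S', 'F', 'S']
```

Running a pass like this on domain-specific data is what lets teams trade a little index freshness for indexer compute in a way that matches their workload.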

IndexCache represents a shift in how the AI industry approaches model design, emphasizing scalability, throughput, and latency optimization from the outset. Future models are likely to prioritize real-world performance considerations, reflecting a proactive approach to addressing computational bottlenecks.

Conclusion

IndexCache improves the efficiency of large language models by eliminating redundant indexer computation, speeding up inference without degrading quality. For enterprises, that means meaningful cost savings in long-context applications such as document analysis and agentic pipelines. The approach also reflects a broader trend in AI model design: treating real-world efficiency and scalability as first-class concerns from the earliest stages of development.
