As the demand for artificial intelligence continues to rise, companies are increasingly struggling with limited computing resources. Running advanced AI systems requires vast numbers of powerful GPUs, and for Chinese firms such as DeepSeek, access to cutting-edge chips from companies like Nvidia remains restricted due to US export controls. In response, DeepSeek claims to have developed a method that significantly accelerates AI responses without relying on the latest hardware.
The company has introduced DSpark, a speculative decoding framework designed for its V4 family of AI models. According to DeepSeek, the technology can improve inference speed by up to 85 percent. In practical terms, a GPU that could previously handle 100 requests in a given period may now be capable of processing as many as 185 in the same time.
DSpark focuses on improving AI inference, the stage at which a trained model generates a response. Since AI systems typically produce text one token at a time, generating long responses can be slow and computationally expensive. Tokens are the basic units of text, such as words or word fragments, that AI models use to process and generate content.
To address this challenge, DSpark uses speculative decoding. A smaller, more lightweight model first drafts the next several tokens, and the larger primary model then validates those drafts in batches instead of generating every token itself, one at a time. When the smaller model's predictions are correct, the system can skip ahead, saving time and computing power. When a prediction is wrong, the main model rejects it and supplies the correct token before drafting resumes.
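The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of greedy speculative decoding in general, not DeepSeek's implementation: the function names, the toy "models" (simple functions over integer tokens), and the draft length `k` are all assumptions made for the example.

```python
from typing import Callable, List

# A "model" here is any function that maps a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def plain_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the large model generates every token itself."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches plain_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The small draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The large model checks the proposal. A real system scores all
        #    positions in one batched pass; this loop only mimics that.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # correct guess: skip ahead
            else:
                accepted.append(expected)  # wrong guess: use the target's token
                break                      # and discard the rest of the draft
        out.extend(accepted)
    return out


# Hypothetical toy models: the target counts 0..9 in a loop, and the draft
# agrees except that it is deliberately wrong whenever the answer is 5.
def target_model(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft_model(prefix: List[int]) -> int:
    nxt = (prefix[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


if __name__ == "__main__":
    print(speculative_decode(target_model, draft_model, [0], 12))
```

Because the large model has the final say on every token, the output is identical to what it would have produced alone; the potential saving comes from checking several drafted tokens per pass rather than generating them one by one.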
DeepSeek says most tokens are relatively easy to predict, allowing the framework to move through responses more efficiently. The entire process remains on the GPU, avoiding the performance penalties that can arise when tasks are shifted to the CPU.
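To see why easy-to-predict tokens matter, consider a common back-of-the-envelope estimate for speculative decoding. It rests on simplifying assumptions that are not taken from DeepSeek's publication: each drafted token is accepted independently with probability `p`, the draft is `k` tokens long, and every verification pass yields one token from the main model itself. The function name and the sample values are illustrative only.

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per pass of the main model.

    Assumes each of the k drafted tokens is accepted independently with
    probability p, and that each pass also yields one token produced by
    the main model (a correction or a continuation).
    """
    if not 0.0 <= p <= 1.0 or k < 0:
        raise ValueError("need 0 <= p <= 1 and k >= 0")
    if p == 1.0:
        return float(k + 1)  # every draft accepted, plus the main model's token
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - p ** (k + 1)) / (1.0 - p)


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.95):  # hypothetical acceptance rates
        print(p, round(expected_tokens_per_pass(p, k=4), 2))
```

Under these assumptions, the more often the small model guesses correctly, the more tokens each pass of the main model yields. Real-world gains also depend on how much the drafting step itself costs.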
The framework also incorporates a semi-autoregressive generation approach. Rather than producing one token at a time, it generates small groups of tokens simultaneously, further improving speed.
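The idea of producing tokens in small groups can be illustrated with a short Python sketch. The article does not say how DSpark implements this step, so everything here is an assumption made for illustration: the `predict_block` interface, the block size, and the toy predictor.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given a prefix and a count n, return n new tokens
# produced together in a single model call.
BlockPredictor = Callable[[List[int], int], List[int]]


def semi_autoregressive_generate(predict_block: BlockPredictor,
                                 prompt: List[int], max_new: int,
                                 block_size: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens in blocks; return the tokens and the call count."""
    out = list(prompt)
    calls = 0
    while len(out) - len(prompt) < max_new:
        remaining = max_new - (len(out) - len(prompt))
        n = min(block_size, remaining)
        block = predict_block(out, n)[:n]
        calls += 1
        if not block:          # guard against a predictor that returns nothing
            break
        out.extend(block)      # later blocks are conditioned on earlier ones
    return out, calls


def toy_block_predictor(prefix: List[int], n: int) -> List[int]:
    """Toy stand-in for a model: keeps counting 0..9 in a loop."""
    return [(prefix[-1] + i + 1) % 10 for i in range(n)]


if __name__ == "__main__":
    tokens, calls = semi_autoregressive_generate(toy_block_predictor, [0], 10)
    print(tokens, calls)
```

With a block size of four, ten new tokens need three model calls instead of ten. Each block still depends on the blocks before it, which is what makes the approach semi-autoregressive rather than fully parallel.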
DeepSeek has released the DSpark research publicly through GitHub and Hugging Face in collaboration with Peking University. While the company notes that DSpark does not enhance a model’s overall intelligence or capabilities, it can improve efficiency and reduce the need for additional computing infrastructure.
The technology has been tested on several open-source models, including Google DeepMind’s Gemma and Alibaba’s Qwen, suggesting that the performance improvements could benefit a wider range of AI systems.
This development comes at a time when AI companies are investing heavily in data-center infrastructure to secure more computing power. Rising costs have even led organizations such as Uber and Walmart to limit employee AI usage. By improving efficiency, frameworks like DSpark could help reduce those pressures.
Earlier this year, DeepSeek also released its V4 Preview models, positioning them as cost-effective solutions for handling contexts of up to one million tokens. The V4-Pro variant is designed for higher performance, while V4-Flash focuses on faster and more economical deployment.
DeepSeek is not alone in pursuing faster AI systems. Earlier this month, Xiaomi’s AI research team announced that its MiMo-V2.5-Pro-UltraSpeed model achieved generation speeds exceeding 1,000 tokens per second, placing it among the fastest AI models currently available.
