Last week, Chinese artificial intelligence lab DeepSeek released an upgraded version of its R1 reasoning model, which demonstrated strong performance across several math and coding benchmarks. While the company declined to reveal details about the training data behind the model, speculation has emerged that DeepSeek might have incorporated material derived from Google’s Gemini AI.
Sam Paech, a Melbourne-based developer who builds emotional-intelligence evaluations for AI systems, published what he says is evidence that DeepSeek's latest model, designated R1-0528, was trained on outputs from Google's Gemini 2.5 Pro. In an online post, Paech observed that the words and expressions the DeepSeek model favors closely match those characteristic of Gemini's output.
Another developer, the pseudonymous creator of the AI evaluation project SpeechMap, pointed out that the DeepSeek model's reasoning traces, the intermediate text it generates as it works step by step toward an answer, also read strikingly like Gemini's.
This is not the first time DeepSeek has been accused of training on rivals' data. Last December, developers observed that DeepSeek's earlier V3 model frequently identified itself as ChatGPT, OpenAI's popular chatbot, fueling suspicion that it had been trained on ChatGPT conversation logs.
Moreover, earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to distillation, a technique in which one model is trained on outputs drawn from another. Around the same time, Microsoft reportedly detected large amounts of data being pulled through OpenAI developer accounts in late 2024, accounts believed to be tied to DeepSeek.
Distillation is an established practice in the AI community: a smaller "student" model learns to reproduce the behavior of a larger "teacher," capturing much of its capability at a fraction of the training cost. OpenAI's terms of service, however, forbid customers from using its model outputs to build competing AI products.
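As an illustration of the general idea, and not of any lab's actual pipeline, here is a minimal sketch of classic logit-level distillation in PyTorch. The toy teacher and student modules, the temperature, and the learning rate are assumptions made purely for the example.

```python
# Minimal, illustrative sketch of knowledge distillation: a small "student"
# model is trained to match the output distribution of a larger "teacher"
# using a temperature-softened KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

# Toy modules standing in for a large frozen teacher and a trainable student.
teacher = nn.Linear(128, 512)
student = nn.Linear(128, 512)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

x = torch.randn(8, 128)          # a batch of input features
with torch.no_grad():
    t_logits = teacher(x)        # teacher predictions serve as fixed targets

optimizer.zero_grad()
loss = distillation_loss(student(x), t_logits)
loss.backward()
optimizer.step()
```

The same objective scales from this toy setup to full language models; the only requirement is access to the teacher's output distributions, which is exactly what API terms of service like OpenAI's restrict.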
It is worth noting that mistaken self-identification and stylistic overlap are common among contemporary AI models, largely because the public web is now saturated with AI-generated text. Models trained on that contaminated data tend to converge on the same phrasings, which makes it difficult to prove definitively where any given training data originated.
Still, AI experts consider it plausible, given DeepSeek's known resource constraints, that the company sourced high-quality synthetic training data from industry-leading systems like Gemini. Nathan Lambert, a researcher at AI2, an AI-focused nonprofit research institute, has said publicly that generating synthetic data from superior models would be a logical and efficient strategy for a well-funded but GPU-limited company such as DeepSeek.
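The strategy Lambert describes amounts to sequence-level distillation: sampling completions from a stronger model and using them as supervised fine-tuning data. Below is a hedged sketch of what such a harvesting loop might look like; the query_teacher_model helper, the prompts, and the output file name are hypothetical placeholders, not any company's actual tooling.

```python
# Illustrative sketch of building a synthetic fine-tuning dataset from a
# stronger "teacher" model's API. Everything here is a hypothetical stand-in.
import json

def query_teacher_model(prompt: str) -> str:
    """Hypothetical placeholder for a real API client call."""
    # In practice this would call the provider's SDK; a canned string keeps
    # the sketch runnable without credentials.
    return f"[teacher completion for: {prompt}]"

prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that reverses a linked list.",
]

# Each (prompt, completion) pair becomes a supervised fine-tuning example
# for the smaller model -- sequence-level distillation, in effect.
with open("synthetic_sft.jsonl", "w") as f:
    for prompt in prompts:
        completion = query_teacher_model(prompt)
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

For a compute-constrained lab, the appeal is that the expensive reasoning happens on the teacher's infrastructure; the lab pays only API costs rather than the GPU hours needed to generate comparable data itself.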
Partly in response to these concerns over unauthorized use of proprietary outputs, AI companies have been tightening security. OpenAI now requires organizations to complete an identity-verification step with a government-issued ID before accessing certain advanced models, and the list of supported countries does not include China. Google, for its part, has begun summarizing the raw reasoning traces its Gemini models return to external developers, making it harder for rivals to train competing models on them. Anthropic recently announced similar trace-summarization measures, citing the need to protect its competitive advantages.
Google has been contacted for comment.