JeongwonLog

DeepDive into DeepSeek R1 – Part 1: Exploring the R1 Zero

Jeongwon Park
Introduction

DeepSeek made waves in the AI community with the release of DeepSeek R1 on January 20, 2025. Unlike many proprietary models from big tech companies such as OpenAI, DeepSeek R1 is an open-source reasoning model released under the MIT license. It was developed by DeepSeek, a Chinese AI startup spun out of a quantitative hedge fund, as part of the DeepSeek LLM series. DeepSeek R1 has demonstrated impressive benchmark results, outperforming OpenAI’s o1-mini and achieving performance comparable to OpenAI’s o1:

Figure: DeepSeek R1 benchmark results.

What truly sets DeepSeek R1 apart is its remarkable efficiency. While major U.S. tech companies invest vast resources into training their LLMs, DeepSeek R1 delivers competitive or superior performance at a much lower cost. For example, DeepSeek V3, the base model on which R1 is built, required only 2.788 million H800 GPU hours for its full training, an efficiency that challenges the conventional notion that only billion-dollar budgets can produce top-tier LLMs.

This prompted the question: What makes DeepSeek R1 so powerful? To answer that, I decided to dive deep into its architecture and explore the factors contributing to its remarkable performance. This post is the first in a DeepDive into DeepSeek R1 series, where I will break down the model’s structure in detail.

In this first post, I will focus specifically on the DeepSeek R1 Zero model.

Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek R1 Zero

DeepSeek R1 Zero is a reasoning LLM built on the DeepSeek V3 base model that takes a unique approach to training. Unlike traditional models that rely on supervised fine-tuning (SFT) in the early stages, DeepSeek R1 Zero completely eliminates the use of supervised data. Instead, it is trained purely through reinforcement learning (RL), demonstrating that large-scale RL alone can significantly enhance reasoning capabilities. This approach challenges conventional training methodologies and highlights the potential of reinforcement learning in advancing LLM performance.

GRPO

For its RL algorithm, DeepSeek R1 Zero employs GRPO (Group Relative Policy Optimization), which shares a similar structure with PPO (Proximal Policy Optimization). However, GRPO samples a group of outputs for each query and optimizes the policy model directly, estimating the baseline from the group's rewards rather than relying on a separate critic model. The mathematical formulation of GRPO is as follows:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)} A_i,\; \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}}) \right]
$$

From the equation above, we can derive the following insights:

1. Expectation terms

$$
\mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O|q)}
$$
  • $q \sim P(Q)$: The query (or input) $q$ is sampled from the distribution of input queries $P(Q)$.
  • $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O|q)$: The group of outputs $\{o_i\}_{i=1}^{G}$ is sampled from the distribution of output sequences generated by the old policy model $\pi_{\theta_{\text{old}}}$.
  • Instead of optimizing based on a single output, GRPO optimizes based on the relative quality of multiple outputs within a group.

2. Average across the group

$$
\frac{1}{G} \sum_{i=1}^{G} (\text{policy update term})
$$
  • The loss function aggregates multiple outputs within a group and updates the policy based on their relative quality.
  • Unlike PPO, which directly evaluates individual trajectories, GRPO compares outputs within a group, ensuring a relative ranking approach.

3. Policy probability ratio

$$
\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}
$$
  • This term is the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{\text{old}}}$ for each output $o_i$ in the group.
  • If the ratio is greater than 1, the new policy assigns a higher probability to the output $o_i$ than the old policy did.
  • If the ratio is less than 1, the new policy assigns a lower probability to the output $o_i$ than the old policy did.

4. Clipping mechanism

$$
\min\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)} A_i,\; \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
$$
  • GRPO uses this clipping mechanism to prevent the policy update from diverging too far from the old policy.
  • If the probability ratio falls outside the interval $[1-\varepsilon, 1+\varepsilon]$, it is clipped back to that interval to avoid excessive updates; see the short sketch below.
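To make the interaction between the ratio and the clipping concrete, here is a minimal PyTorch sketch with made-up sequence-level log-probabilities, advantages, and an illustrative ε. It is not DeepSeek's implementation, and per-token details are omitted.

```python
import torch

# Hypothetical sequence-level log-probabilities for a group of G = 4 outputs.
logp_new = torch.tensor([-12.3, -15.1, -10.8, -14.0])  # log pi_theta(o_i | q)
logp_old = torch.tensor([-12.5, -14.2, -11.5, -13.8])  # log pi_theta_old(o_i | q)
advantages = torch.tensor([0.9, -1.1, 1.2, -1.0])       # A_i (see point 6 below)

eps = 0.2  # illustrative clipping range

# Probability ratio pi_theta / pi_theta_old, computed in log space for stability.
ratio = torch.exp(logp_new - logp_old)

# Clipped surrogate: the element-wise minimum of the unclipped and clipped terms.
surrogate = torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
print(surrogate)
```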

5. KL divergence penalty

$$
-\, \beta\, D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\text{ref}})
$$
  • This term is the KL divergence between the new policy $\pi_{\theta}$ and a reference policy $\pi_{\text{ref}}$.
  • The penalty keeps the new policy from drifting too far from the reference policy (typically the frozen initial model that RL started from), regularizing each update.
  • The hyperparameter $\beta$ controls the strength of the KL divergence penalty; a sketch of this term follows below.
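The GRPO papers estimate this KL term with a simple per-output expression of the form $r - \log r - 1$ (with $r = \pi_{\text{ref}}/\pi_{\theta}$) rather than computing the full divergence. The sketch below illustrates that estimator with made-up log-probabilities; the β value is only illustrative.

```python
import torch

# Hypothetical sequence-level log-probabilities under the new and reference policies.
logp_new = torch.tensor([-12.3, -15.1, -10.8, -14.0])  # log pi_theta(o_i | q)
logp_ref = torch.tensor([-12.6, -14.9, -11.2, -13.9])  # log pi_ref(o_i | q)

beta = 0.04  # illustrative KL coefficient

# Per-output estimator of D_KL(pi_theta || pi_ref): r - log r - 1, where r = pi_ref / pi_theta.
log_r = logp_ref - logp_new
kl_estimate = torch.exp(log_r) - log_r - 1.0

kl_penalty = beta * kl_estimate.mean()
print(kl_penalty)
```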

6. Group based advantage calculation

Instead of using a traditional value function-based advantage, GRPO computes relative advantage within a group as follows:

$$
A_i = \frac{r_i - \text{mean}(r_1, r_2, \ldots, r_G)}{\text{std}(r_1, r_2, \ldots, r_G)}
$$
  • $r_i$: The reward associated with output $o_i$.
  • $\text{mean}(r_1, r_2, \ldots, r_G)$: The mean reward of the group.
  • $\text{std}(r_1, r_2, \ldots, r_G)$: The standard deviation of the group's rewards. A one-line sketch of this normalization follows below.
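In code, this is just a standardization of each group's rewards. A minimal sketch with illustrative 0/1 rewards (the small epsilon is an extra numerical safeguard, not part of the formula above):

```python
import torch

# Illustrative rule-based rewards for the G outputs of a single query.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

# Group-relative advantage: standardize each reward against its own group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)
```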

The detailed algorithm of GRPO is described in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
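Putting the pieces together, here is a compact sketch of the GRPO objective for a single query's group of outputs. It works with sequence-level log-probabilities and omits the token-level averaging, masking, and batching used in the actual algorithm, so treat it as an illustration rather than a faithful reimplementation.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one query's group of G outputs.

    Each argument is a 1-D tensor of length G: sequence-level log-probabilities
    under the new, old, and reference policies, plus scalar rewards.
    """
    # (6) Group-relative advantages.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # (3) Probability ratio between the new and old policies.
    ratio = torch.exp(logp_new - logp_old)

    # (4) Clipped surrogate term.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)

    # (5) KL penalty against the reference policy (per-output estimator).
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0

    # (1), (2) Average over the group; this is the quantity RL maximizes.
    return (surrogate - beta * kl).mean()


# Toy usage with made-up numbers for a group of G = 4 outputs.
objective = grpo_objective(
    logp_new=torch.tensor([-12.3, -15.1, -10.8, -14.0]),
    logp_old=torch.tensor([-12.5, -14.2, -11.5, -13.8]),
    logp_ref=torch.tensor([-12.6, -14.9, -11.2, -13.9]),
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
)
print(objective)
```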

Reward function

The choice of the reward function is crucial for the performance of reinforcement learning. In DeepSeek R1 Zero, a simple rule-based reward system is applied, consisting of two key reward mechanisms:

1. Accuracy rewards

  • Accuracy rewards assess the correctness of the model’s final response, particularly for deterministic tasks such as mathematical and coding problems.
  • In the case of coding problems, the model's output is compiled and executed against predefined test cases, so correctness generates immediate feedback. A toy sketch of a rule-based accuracy check follows below.
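As an illustration only, here is a toy rule-based accuracy reward for a math-style answer. The function names, normalization, and matching rule are assumptions; DeepSeek's actual verifiers (e.g. for boxed answers or code executed against test cases) are not public and are certainly more involved.

```python
def _normalize(answer: str) -> str:
    """Light normalization before comparison (assumption for this sketch)."""
    return answer.strip().lower().replace(" ", "")


def accuracy_reward(model_answer: str, reference_answer: str) -> float:
    """Toy accuracy reward: 1.0 if the final answers match, 0.0 otherwise."""
    return 1.0 if _normalize(model_answer) == _normalize(reference_answer) else 0.0


print(accuracy_reward("  3/4 ", "3/4"))  # 1.0
print(accuracy_reward("0.75", "3/4"))    # 0.0 -- naive matching misses equivalent forms
```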

2. Format rewards

  • Format rewards ensure that the model adheres to a structured response format by placing its reasoning process within <think></think> tags and the final answer within <answer></answer> tags.

  • The reasoning process must be encapsulated within the <think> tags, while the final answer must appear only within the <answer> tags. The training template is shown below:

    Training template (figure), where "prompt" is replaced with the specific reasoning question during training.
  • The purpose of this approach is to clearly separate the reasoning process from the final result, making it easier to interpret and verify model outputs. A toy sketch of such a format check follows after this list.
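A toy version of this format check could be a single regular expression over the response. The pattern below and the exact scoring are assumptions made for illustration, not DeepSeek's actual reward implementation.

```python
import re

# Assumed rule: one <think>...</think> block followed by one <answer>...</answer> block.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Toy format reward: 1.0 if the response follows the tag structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(response) else 0.0


good = "<think>Square both sides, then solve for x.</think><answer>x = 3</answer>"
bad = "The answer is x = 3."
print(format_reward(good), format_reward(bad))  # 1.0 0.0
```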

Unlike more complex outcome-based or process-based reward models, DeepSeek R1 Zero relies solely on simple rule-based rewards, demonstrating that even a minimal reward structure can be effective in reinforcement learning for reasoning tasks.

Self-evolution

During training with the RL algorithm and reward functions described above, it was observed that the model's average response length (and with it, its thinking time) increased as training progressed.

Figure: average response length per training step.
  • Over time, the model generated longer responses, attempting deeper reasoning.
  • This phenomenon was not explicitly programmed but emerged as a result of the reinforcement learning environment, where the model learned on its own that increasing reasoning time leads to better performance (self-evolution).

Aha moment of DeepSeek R1 Zero

One of the most remarkable aspects of DeepSeek R1 Zero is that it demonstrates how a model's reasoning ability can autonomously evolve through reinforcement learning alone, without the need for explicit instruction or supervised fine-tuning (SFT).

Figure: the "aha moment" of DeepSeek R1 Zero.

A compelling example of this can be seen in the figure above, where the model is solving a square-root equation. At one point, the response includes:

Wait, wait, wait. That’s an aha moment I can flag here.

This moment highlights how the model pauses its problem-solving process, reevaluates its approach, and revisits its previous reasoning before attempting a different solution path.

What makes this particularly fascinating is that the model was not explicitly programmed to behave this way. Instead, it learned on its own that improving its reasoning process leads to higher rewards. With just a simple reinforcement rule—“correct answers receive a reward”—the model naturally explores new problem-solving strategies and develops deeper, more structured reasoning.

This finding suggests that reinforcement learning alone can be a powerful driver of self-improving reasoning abilities, opening new possibilities for training advanced AI models without relying on extensive human-labeled datasets.

Drawback of DeepSeek R1 Zero

1. Poor Readability

Due to its extensive reasoning process, the model’s responses can sometimes be lengthy and convoluted, making them difficult to read. Additionally, language mixing issues may arise, where different languages and mathematical notations appear inconsistently within responses.

2. Unpredictable Behavior in RL Training

While the model autonomously develops its own reasoning strategies, it sometimes produces responses in a style that users may not prefer, such as being overly detailed or including unnecessary intermediate steps.

Conclusion

DeepSeek R1 Zero presents a groundbreaking approach to reinforcement learning-driven reasoning, proving that large-scale RL alone can significantly enhance an LLM’s problem-solving abilities. By eliminating supervised fine-tuning (SFT) and relying solely on simple rule-based rewards, the model autonomously learns to refine its reasoning process, leading to deeper, more structured thought patterns.

However, despite its impressive capabilities, DeepSeek R1 Zero has notable drawbacks, such as poor readability due to excessively detailed responses and unpredictable behavior in its reasoning strategies. These challenges highlight the need for further refinements to balance reasoning depth with clarity and user-friendliness.

In the next post, I will explore how DeepSeek R1 addresses these limitations, examining the improvements made to enhance both efficiency and usability while maintaining its strong reasoning performance.