DeepDive into DeepSeek R1 – Part 2: Enhancing Performance with DeepSeek R1

Author: Jeongwon Park

Introduction

In the previous post, we explored DeepSeek R1 Zero and its approach to reinforcement learning (RL). DeepSeek R1 Zero demonstrated impressive reasoning capabilities through RL alone, eliminating the need for supervised fine-tuning (SFT).

However, it also presented several challenges:

  • Unstable cold start phase: Training directly with RL resulted in instability in the early stages, making it difficult for the model to converge efficiently.
  • Language mixing in reasoning (CoT) processes: The model sometimes switched between different languages or inconsistent formats, leading to readability issues.
  • Overly verbose reasoning: The model frequently generated excessively long explanations, producing responses that users might not prefer.

To address these issues, DeepSeek R1 introduced several improvements, incorporating a combination of supervised fine-tuning (SFT), reinforcement learning (RL), and enhanced reward mechanisms to achieve a more stable and effective model.

SFT in cold start phase

Unlike DeepSeek R1 Zero, which started RL training immediately, DeepSeek R1 mitigated early instability by first performing a small amount of supervised fine-tuning (SFT) on high-quality cold-start data, much of it collected via few-shot prompting with long chain-of-thought examples.

This approach allowed the model to:

  • Stabilize initial training and prevent extreme fluctuations in performance.
  • Improve output format and accuracy with the help of human annotators.

By integrating this SFT phase, DeepSeek R1 established a stronger base model before transitioning into reinforcement learning.
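
To make the idea concrete, here is a minimal sketch of what such a cold-start SFT step could look like using the Hugging Face Trainer. The base checkpoint, the <think>/<answer> tag format, and the toy example are placeholders I chose for illustration, not DeepSeek's actual data, template, or code.

```python
# A minimal sketch of the cold-start SFT step, assuming a Hugging Face causal LM.
# The base checkpoint, tag format, and data below are placeholders, not DeepSeek's own.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "gpt2"  # small stand-in; the real pipeline starts from a large DeepSeek base model

# Hand-curated cold-start examples: a prompt plus a long, readable chain of thought.
cold_start = [
    {"prompt": "What is 17 * 24?",
     "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think><answer>408</answer>"},
]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def to_features(example):
    # Concatenate prompt and structured response into a single training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(cold_start).map(to_features, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-cold-start-sft", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```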

Once the cold-start phase had stabilized training, DeepSeek R1 continued with the standard RL process. Rather than a single training pass, however, the model was optimized through multiple alternating cycles of SFT + RL to progressively enhance performance.

Iterative SFT + RL Training

DeepSeek R1 Training Methodology
  • Cold Start Phase: Apply a small amount of SFT to stabilize training and ensure structured responses.
  • Initial RL Training: Train with reinforcement learning to improve reasoning capabilities.
  • Reward Refinement: Identify issues in reasoning (e.g., language mixing) and introduce additional reward functions.
  • Checkpoint SFT Enhancement:
    • Collect high-quality reasoning trajectories generated by the model.
    • Curate the best examples as new SFT data for additional fine-tuning.
  • Repeat the process to iteratively improve performance.

This hybrid training strategy enabled DeepSeek R1 to outperform DeepSeek R1 Zero, leading to more coherent, accurate, and readable outputs.
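
Below is a high-level sketch of this alternating pipeline. Every function here (sft_finetune, rl_train, collect_trajectories, curate) is a hypothetical stub standing in for the corresponding stage; the point is only to show the control flow, not DeepSeek's implementation.

```python
# A high-level sketch of the alternating SFT + RL loop described above.
# All function names are hypothetical stand-ins for the corresponding stages.
from typing import Dict, List

def sft_finetune(model: str, data: List[Dict]) -> str:
    """Supervised fine-tuning stage; returns an identifier for the new checkpoint."""
    return f"{model}+sft({len(data)} examples)"

def rl_train(model: str, reward_fns: List[str]) -> str:
    """Reinforcement-learning stage driven by the given reward functions."""
    return f"{model}+rl({','.join(reward_fns)})"

def collect_trajectories(model: str, n: int) -> List[Dict]:
    """Sample reasoning trajectories from the current checkpoint."""
    return [{"model": model, "trajectory_id": i} for i in range(n)]

def curate(trajectories: List[Dict]) -> List[Dict]:
    """Keep only high-quality, readable trajectories for the next SFT round."""
    return trajectories[: len(trajectories) // 2]

model = "base-checkpoint"
sft_data = [{"prompt": "...", "response": "..."}]   # cold-start data
reward_fns = ["accuracy", "format"]

for cycle in range(2):                               # number of cycles is illustrative
    model = sft_finetune(model, sft_data)            # cold-start / checkpoint SFT
    model = rl_train(model, reward_fns)              # RL on reasoning tasks
    reward_fns = ["accuracy", "format", "language_consistency"]  # refine rewards once issues appear
    sft_data = curate(collect_trajectories(model, n=1000))       # curate new SFT data from the model
    print(f"cycle {cycle}: {model}")
```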

Reward functions

DeepSeek R1 retained the accuracy and format rewards used in DeepSeek R1 Zero while introducing a language consistency reward to prevent language mixing in CoT (Chain of Thought) reasoning. The language consistency reward:

  • Addresses the language mixing problem observed in DeepSeek R1 Zero.
  • Ensures that the model maintains a single, consistent language throughout the reasoning process.
  • Improves the readability and coherence of responses.

By refining the reward system, DeepSeek R1 successfully mitigated issues from its predecessor, producing more structured and user-friendly outputs.
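
For illustration, here is a minimal sketch of how rule-based accuracy, format, and language consistency rewards could be combined. The regexes, the word-ratio proxy for language consistency, and the equal weighting are my assumptions for this example, not details taken from the paper.

```python
# A toy sketch of rule-based rewards: accuracy, format, and language consistency.
# Parsing details and weighting are illustrative assumptions.
import re

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the content of <answer>...</answer> matches the reference answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def language_consistency_reward(completion: str, target_lang_words: set) -> float:
    """Fraction of chain-of-thought words drawn from the target language (crude proxy)."""
    match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if not match:
        return 0.0
    words = match.group(1).lower().split()
    if not words:
        return 0.0
    return sum(w in target_lang_words for w in words) / len(words)

def total_reward(completion: str, reference: str, target_lang_words: set) -> float:
    # Equal weighting is an assumption made for this sketch.
    return (accuracy_reward(completion, reference)
            + format_reward(completion)
            + language_consistency_reward(completion, target_lang_words))

sample = "<think>17 times 24 equals 408</think><answer>408</answer>"
english = {"17", "times", "24", "equals", "408"}   # toy vocabulary for the example
print(total_reward(sample, "408", english))        # 3.0 for this toy case
```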

Conclusion

DeepSeek R1 builds upon DeepSeek R1 Zero by introducing a structured cold start phase with SFT, refining RL training, and enhancing reward functions. These improvements address the core limitations of R1 Zero, resulting in:

  • More stable training through early SFT.
  • Better reasoning structure with alternating SFT + RL training.
  • Improved readability by reducing language inconsistencies.

In the next post, we'll take a deeper look at the DeepSeek R1 distillation and my personal experience with it.