This is not a post, it is a lecture. Simply amazing, thank you.
This is the finest post on DeepSeek and RL in general. I would like to know: in this 5th step, was the “preference reward” applied using RLHF or in some other way? Thank you for the post :)
Thank you for the kind words! I believe they also used the (now very popular) GRPO here for the reinforcement learning. It's an elegant technique that seems to move this field forward in a new direction.
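If you're curious what "group relative" means in practice, here is a rough, illustrative sketch of the advantage computation GRPO uses in place of a learned value model. The function name and reward values are made up for the example, not taken from DeepSeek's code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled completion for the same prompt.
    # GRPO replaces a learned value baseline with group statistics:
    # each completion is scored relative to its sibling completions.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions for one prompt, scored by a rule-based reward.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
# Positive advantages for the rewarded completions, negative for the rest;
# these then plug into a PPO-style clipped policy-gradient objective.
```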
Very informative. Thanks for sharing this great lesson.
Loved it. Thanks for writing it.
DL had a Turing Award; Sutton and Barto have one for RL too. One's coming for you and Jay. Keep inspiring!
This is very helpful. Thank you for writing it!
Thank you for the awesome content. I have a question: what tools do you use for the visualizations?
Thank you for the kind words! I'm using Figma to create the visualizations, but all the visuals (as with my other visual guides) could have been made with something like PowerPoint, Keynote, etc.
The article is awesome. Thanks for the great content, the simplicity of the explanation, and the flow. I haven't finished it yet, but I can't wait to finish it and then read it again and again.
Thank you! Great content! One typo: the exploration and exploitation annotations in the Monte Carlo tree search illustration seem to be flipped.
Thanks for the feedback! I updated the visual.
Very thorough and easy-to-read article. I have one follow-up question: what gives us the confidence, beyond empirical evidence, that reasoning through Modifying Proposal Distribution actually improves the quality of the output? How could we explain this behaviour at a deeper level? So far, all the reports I have read on the matter either present empirical evidence or say that this is the way we do it as humans. What is not clear to me is why and how it works for a software program as well. The software might choose a series of poor-quality reasoning steps, after all. Granted, with RL + fine-tuning we can modify its weights to favour higher-quality answers, but how does this generalize? What happens inside the NN that makes it able to choose more correct reasoning steps for samples that weren't in the training data used during RL + fine-tuning?
I loved the article. Is my understanding below correct?
At test time, no learning happens at the model level. We add additional systems that may give feedback to the model, and the model regenerates the response. This process goes on for some steps until the LLM arrives at the final answer. It's a kind of Reflexion pattern, as used in agents.
It depends on the specific implementation, but generally it is still just next-token prediction. However, it tends to focus more on tokens that showcase "reasoning" rather than attempting to go straight for the answer. These "reasoning" tokens are passed back to the LLM, as with any transformer LLM, and become part of the context it attends to.
So it is not necessarily about adding systems but about nudging the model to choose tokens that showcase some sort of reasoning.
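To make that concrete, here is a minimal, illustrative sketch of greedy autoregressive decoding where the "reasoning" trace is simply more context. The checkpoint name and prompt template are placeholders, not a specific model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("some-reasoning-model")
model = AutoModelForCausalLM.from_pretrained("some-reasoning-model")

prompt = "Question: What is 17 * 24? Think step by step.\n<think>"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# One token at a time: every generated token, including the "reasoning"
# tokens, is appended to the context and attended to on the next step.
for _ in range(256):
    next_token = model(input_ids).logits[:, -1].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))
```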
Thanks a lot for this. Loved the crisp explanation!
Great post! A quick question about this sentence: "In step 2, the resulting model was trained using a similar RL process as was used to train DeepSeek-V3-Zero." I looked up DeepSeek-V3-Zero and couldn't find any other mentions of it; could it be a typo for DeepSeek-R1-Zero?
Thanks! It was indeed a typo. Fixed :)
Thanks for providing such detailed content with brilliant explanations; it's easy to read. I had some prior knowledge, but now I'm clear and confident on most of the topics I had been looking around the web for.
Thanks for the helpful article and nice visuals, but there is one point of confusion. What I understand from the report is that they directly fine-tune on the 800k samples for distillation rather than using logits from the bigger model. I think what they refer to as distillation is different from vanilla knowledge distillation.
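To be concrete about the difference I mean (variable names are made up for illustration): vanilla knowledge distillation matches the student's token distribution to the teacher's soft logits, whereas fine-tuning on the 800k teacher-generated samples only uses the sampled text as hard labels:

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Classic knowledge distillation: match the teacher's softened token distribution.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

def sft_on_teacher_samples_loss(student_logits, teacher_token_ids):
    # What the report seems to do instead: plain cross-entropy on text
    # sampled from the bigger model, so the teacher's logits are never needed.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_token_ids.view(-1),
    )
```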
Very informative content. Thank you so much!
Great article! Did you use DeepSeek for any part of it?