Qwen3 and Gemma3 GRPO reinforcement learning training examples
Reference:
https://docs.unsloth.ai/basics/reinforcement-learning-guide/tutorial-train-your-own-reasoning-model-with-grpo
Online Colab notebooks:
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/HuggingFace%20Course-Gemma3_(1B)-GRPO.ipynb
https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb#scrollTo=vzOuSVCL_GA9
Training is fairly time-consuming: the open-r1/DAPO-Math-17k-Processed dataset has 14116 rows, and GRPO training took 3 hours.
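To put the 3-hour figure in perspective, here is a rough back-of-the-envelope sketch. It assumes, and this is only an assumption since the note does not say so, that the 3 hours covered one full pass over all 14116 rows. The helper names `seconds_per_row` and `estimated_hours` are made up for illustration and are not part of Unsloth or any library.

```python
# Rough throughput estimate from the figures in this note.
# ASSUMPTION: the 3 hours covered one full pass over all 14116 rows;
# the note itself does not state how many steps or epochs were run.
NUM_ROWS = 14116
TRAIN_HOURS = 3.0


def seconds_per_row(num_rows: int, hours: float) -> float:
    """Average wall-clock seconds spent per dataset row."""
    return hours * 3600 / num_rows


def estimated_hours(num_rows: int, sec_per_row: float) -> float:
    """Extrapolated training time in hours for a subset of num_rows rows."""
    return num_rows * sec_per_row / 3600


rate = seconds_per_row(NUM_ROWS, TRAIN_HOURS)
print(f"{rate:.2f} s per row")
print(f"{estimated_hours(1000, rate):.2f} h for a 1000-row subset")
```

Under that assumption, a quick trial run on a small subset may be a cheaper way to check the setup before committing to the full 3 hours. Actual timing will depend on the GPU, the model size, and the number of completions sampled per prompt.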