
2025.9.25 Large Model Study Notes

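Following CS336, you are not learning "how to tune a large model". You are learning how a large model is actually built, why it works, and where it can still be improved.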

https://stanford-cs336.github.io/spring2025/

①sequence to sequence modeling

②Adam optimizer

③attention mechanism

④transformer architecture
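As a reminder of what the attention mechanism computes, here is a minimal single-head sketch of scaled dot-product attention in plain Python. The function names and the `causal` flag are illustrative choices, not taken from the course materials, and the causal mask assumes queries and keys cover the same positions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention for one head.

    queries, keys, values are lists of vectors (lists of floats).
    With causal=True, position i only attends to positions <= i.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        # Similarity of this query to each visible key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:limit]]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [sum(w * v[j] for w, v in zip(weights, values[:limit]))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

A real implementation would batch this as matrix multiplications over many heads; the loops here only make the arithmetic explicit.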

Variants:

Activation functions: ReLU, SwiGLU

Positional encodings: sinusoidal, RoPE

Normalization: LayerNorm, RMSNorm

Placement of normalization: pre-norm versus post-norm

MLP: dense, mixture of experts

Attention: full, sliding window, linear

Lower-dimensional attention: group-query attention (GQA), multi-head latent attention (MLA)

State-space models: Hyena
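To make the LayerNorm versus RMSNorm distinction in the list above concrete, the following sketch implements both for a single vector. The `eps` default and the parameter names `gain` and `bias` are assumptions chosen for illustration.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: divide by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]
```

RMSNorm drops the mean subtraction and the bias term, so it does slightly less work per element than LayerNorm.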

Training:

Optimizer (e.g., AdamW, Muon, SOAP)

Learning rate schedule (e.g., cosine, WSD)

Batch size (e.g., critical batch size)

Regularization (e.g., dropout, weight decay)

Hyperparameters (number of heads, hidden dimension): grid search
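Two of the training ingredients above, the cosine learning rate schedule and AdamW, are small enough to sketch directly. The version below handles a single scalar parameter; the linear warmup phase, the default hyperparameters, and the `state` dict layout are assumptions made for illustration.

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)

def adamw_step(param, grad, state, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    `state` is a dict holding the step count `t` and moment estimates `m`, `v`.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the moments starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Decoupled weight decay: applied to the parameter, not folded into grad.
    return param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)

# Usage: minimize f(x) = x**2 (gradient 2x) with a scheduled learning rate.
x = 1.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for step in range(1, 201):
    lr = cosine_lr(step, max_lr=0.1, min_lr=0.0, warmup_steps=10, total_steps=200)
    x = adamw_step(x, 2.0 * x, state, lr)
```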

Assignment 1:

Implement BPE tokenizer

Implement Transformer, cross-entropy loss, AdamW optimizer, training loop

Train on TinyStories and OpenWebText

Leaderboard: minimize OpenWebText perplexity given 90 minutes on an H100
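The first item of this assignment, a BPE tokenizer, can be sketched in a few lines. This is a simplified byte-level version: it skips pre-tokenization and special tokens, and the tie-breaking rule (prefer the lexicographically larger pair) is an assumption made to keep training deterministic, not a statement about what the assignment requires.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns a list of ((left, right), new_id)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges, next_id = [], 256
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties broken by the larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        ids = merge(ids, best, next_id)
        merges.append((best, next_id))
        next_id += 1
    return merges

def encode(text, merges):
    """Tokenize by replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Map token ids back to bytes, then to a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Replaying every merge over the whole input is slow for large corpora; it is used here only because it is the easiest form to read.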

Ways to speed up inference:

Tokenization: convert text into a sequence of integers

Use cheaper model (via model pruning, quantization, distillation)

Speculative decoding: use a cheaper "draft" model to generate multiple tokens, then use the full model to score in parallel (exact decoding!)

Systems optimizations: KV caching, batching
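The speculative decoding idea can be illustrated with a simplified greedy variant: the draft model proposes several tokens, the full model checks them, and the output is identical to plain greedy decoding with the full model. This is a sketch under assumptions. The two models are hypothetical callables mapping a token list to the next token, the parallel scoring pass is simulated with a loop, and the sampling-based version with rejection sampling is not shown.

```python
def speculative_greedy(target_next, draft_next, prompt, num_new_tokens, k=4):
    """Greedy speculative decoding with hypothetical next-token callables."""
    tokens = list(prompt)
    goal = len(tokens) + num_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The full model scores every drafted position. A real system does
        #    this in one batched forward pass; here it is a plain loop.
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # Keep the agreed prefix, then take the full model's token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens

# Toy models: the draft agrees with the target except at some positions.
def toy_target(prefix):
    return (sum(prefix) * 7 + len(prefix)) % 11

def toy_draft(prefix):
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 11

generated = speculative_greedy(toy_target, toy_draft, [1, 2], num_new_tokens=10)
```

Every appended token equals what the full model would have chosen for that prefix, which is why the result is exact. The speedup comes from the full model verifying several positions at once.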

Assignment 2:

Implement a fused RMSNorm kernel in Triton

Implement distributed data parallel training

Implement optimizer state sharding

Benchmark and profile the implementations
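The core of distributed data parallel training is that each rank computes gradients on its own shard of the batch and an all-reduce averages them, so every replica applies the same update. The single-process simulation below illustrates that with a one-parameter linear model. `all_reduce_mean` is a stand-in written for this sketch, not a real communication call, and the batch is assumed to divide evenly across ranks.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every rank receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def ddp_step(w, xs, ys, world_size, lr):
    """One simulated data-parallel SGD step; returns each replica's new weight."""
    shard = len(xs) // world_size  # assumes the batch divides evenly
    local_grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    synced = all_reduce_mean(local_grads)
    return [w - lr * g for g in synced]
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, so the replicas stay in sync without ever exchanging data samples.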

Choosing the right model size:

Compute-optimal scaling laws:  

[Kaplan+ 2020]

[Hoffmann+ 2022]
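A rough back-of-the-envelope version of compute-optimal sizing can be written down under two commonly used approximations, both of which are assumptions here rather than anything stated in these notes: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 training tokens per parameter, a rule of thumb often associated with [Hoffmann+ 2022].

```python
import math

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOPs budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(5.88e23)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

The ratio of 20 is an empirical fit under specific conditions, so it is best treated as a starting point for experiments rather than a constant.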

Assignment 3:

We define a training API (hyperparameters -> loss) based on previous runs

Submit "training jobs" (under a FLOPs budget) and gather data points

Fit a scaling law to the data points

Submit predictions for scaled up hyperparameters

Leaderboard: minimize loss given FLOPs budget
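The "fit a scaling law" step can be sketched as a least-squares line fit in log-log space. This assumes the simplest functional form, y = a · x^b, with no irreducible-loss offset. The data points in the usage lines are synthetic, generated from a known power law purely to show the fit recovering it.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept = log of the coefficient
    return a, b

def predict(a, b, x):
    """Extrapolate the fitted law to a new x (e.g. a larger FLOPs budget)."""
    return a * x ** b

# Synthetic example: points generated from loss = 5 * C**-0.1.
budgets = [1e15, 1e16, 1e17, 1e18]
losses = [5.0 * c ** -0.1 for c in budgets]
a, b = fit_power_law(budgets, losses)
```

Real loss curves usually flatten toward a floor, so fits in practice often add a constant term and need a nonlinear optimizer.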

What the model should do: a model's capabilities are determined by its data

Model evaluation:

Perplexity: textbook evaluation for language models

Standardized testing (e.g., MMLU, HellaSwag, GSM8K)

Instruction following (e.g., AlpacaEval, IFEval, WildBench)

Scaling test-time compute: chain-of-thought, ensembling

LM-as-a-judge: evaluate generative tasks

Full system: RAG, agents
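Perplexity, the first metric in the list above, is the exponential of the average per-token negative log-likelihood. A minimal sketch, assuming natural-log probabilities and with function names chosen for illustration:

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a sequence of token log-probs."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that is uniformly unsure among 4 tokens at every position.
uniform = [math.log(0.25)] * 10
ppl = perplexity(uniform)
```

Intuitively, a perplexity of k means the model is on average as uncertain as if it were choosing uniformly among k tokens.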

Data curation:

Data processing: convert HTML to text

Filtering: keep high quality data, remove harmful content (via classifiers)

Deduplication: save compute, avoid memorization; use Bloom filters or MinHash
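For the MinHash approach to deduplication, the sketch below estimates the Jaccard similarity between two documents from short signatures. The word 3-gram shingling, the 64 hash functions, and the use of salted `hashlib.blake2b` as the hash family are all assumptions chosen for illustration.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams; documents are compared as sets of these."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
similarity = estimated_jaccard(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
```

Pairs whose estimated similarity exceeds a chosen threshold are treated as near-duplicates. At scale, signatures are usually bucketed with locality-sensitive hashing so that not every pair needs to be compared.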

Assignment 4:

Convert Common Crawl HTML to text

Train classifiers to filter for quality and harmful content

Deduplication using MinHash

Leaderboard: minimize perplexity given token budget

Alignment:

SFT (supervised fine-tuning): learn from one-to-one prompt-response pairs

Preference data: choose between response A and response B; DPO and GRPO learn from experience
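To show how preference data between two responses turns into a training signal, here is the DPO loss for a single preference pair. The inputs are total log-probabilities of each response under the policy being trained and under a frozen reference model. The argument names and the `beta=0.1` default are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of response log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the reference model.
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the chosen response.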

Assignment 5:

Supervised fine-tuning, DPO, GRPO


