2025.9.25 LLM study notes
Following CS336, you are not learning "how to tune large models" but "how large models themselves are built, why they actually work, and where they can still be improved."
https://stanford-cs336.github.io/spring2025/
①sequence-to-sequence modeling
②Adam optimizer
③attention mechanism
④transformer architecture
Variants:
Activation functions: ReLU, SwiGLU
Positional encodings: sinusoidal, RoPE
Normalization: LayerNorm, RMSNorm (see the sketch after this list)
Placement of normalization: pre-norm versus post-norm
MLP: dense, mixture of experts
Attention: full, sliding window, linear
Lower-dimensional attention: group-query attention (GQA), multi-head latent attention (MLA)
State-space models: Hyena
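Of these variants, RMSNorm is the easiest to see in code. A minimal PyTorch sketch (module and names are mine, not from the assignment): it drops LayerNorm's mean subtraction and bias term, rescaling only by the root mean square.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: normalize by the root-mean-square of the activations.

    Unlike LayerNorm, there is no mean subtraction and no bias,
    which saves a little compute and tends to work just as well.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the last (hidden) dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(2, 5, 64)
print(RMSNorm(64)(x).shape)  # torch.Size([2, 5, 64])
```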
Training:
Optimizer (e.g., AdamW, Muon, SOAP)
Learning rate schedule (e.g., cosine, WSD; see the sketch after this list)
Batch size (e.g., critical batch size)
Regularization (e.g., dropout, weight decay)
Hyperparameters (number of heads, hidden dimension): grid search
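For the schedule entry above, a minimal sketch of cosine decay with linear warmup (function name and the example hyperparameters are illustrative, not from the course):

```python
import math

def cosine_lr(step: int, max_lr: float, min_lr: float,
              warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # progress in [0, 1] over the decay phase
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# e.g. peak 3e-4, decaying to 3e-5 over 10k steps with 500 warmup steps
print(cosine_lr(0, 3e-4, 3e-5, 500, 10_000))       # 0.0
print(cosine_lr(500, 3e-4, 3e-5, 500, 10_000))     # 3e-4 (peak)
print(cosine_lr(10_000, 3e-4, 3e-5, 500, 10_000))  # 3e-5 (floor)
```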
Assignment 1:
Implement BPE tokenizer (core merge loop sketched after this list)
Implement Transformer, cross-entropy loss, AdamW optimizer, training loop
Train on TinyStories and OpenWebText
Leaderboard: minimize OpenWebText perplexity given 90 minutes on an H100
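For the BPE part, a toy sketch of the core training loop: count adjacent symbol pairs, merge the most frequent pair, repeat. A real tokenizer like the assignment's works on bytes and handles pre-tokenization; the word-frequency representation here is a simplification.

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# toy corpus: words split into characters, with counts
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # learn 3 merges
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(vocab, best)
    print("merged:", best)
```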
Ways to speed up inference:
Tokenization: turn text into integer IDs
Use cheaper model (via model pruning, quantization, distillation)
Speculative decoding: use a cheaper "draft" model to generate multiple tokens, then use the full model to score them in parallel (exact decoding!); a greedy variant is sketched after this list
Systems optimizations: KV caching, batching
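A sketch of the speculative-decoding idea, in a simplified greedy form (the exact-sampling version adds a probabilistic accept/reject correction; the model-as-callable interface and batch size of 1 are assumptions for this sketch):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, prompt: torch.Tensor, k: int = 4):
    """One round of speculative decoding (greedy variant).

    `draft` and `target` are assumed to map token ids [1, L] to
    logits [1, L, V]. The cheap draft proposes k tokens one at a time;
    the target scores prompt+proposal in a SINGLE parallel forward
    pass, and we keep the longest prefix where the target's own greedy
    choice agrees, plus one "bonus" token from the target. The output
    matches plain greedy decoding with `target` exactly.
    """
    plen = prompt.shape[1]

    # 1) draft proposes k tokens autoregressively (cheap)
    seq = prompt
    for _ in range(k):
        nxt = draft(seq)[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, nxt], dim=-1)

    # 2) target scores every position in one (expensive) pass
    preds = target(seq).argmax(-1)  # target's greedy token at each position

    # 3) accept the longest agreeing prefix, then take the target's
    #    next token after that prefix for free
    proposed = seq[:, plen:]
    agree = (preds[:, plen - 1 : plen - 1 + k] == proposed)[0]
    n = int(agree.long().cumprod(0).sum())   # length of agreeing prefix
    bonus = preds[:, plen - 1 + n : plen + n]
    return torch.cat([prompt, proposed[:, :n], bonus], dim=-1)
```

Each round emits n accepted tokens plus one bonus token for roughly one expensive forward pass, which is where the speedup comes from when the draft agrees often.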
Assignment 2:
Implement a fused RMSNorm kernel in Triton (forward-pass sketch after this list)
Implement distributed data parallel training
Implement optimizer state sharding
Benchmark and profile the implementations
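For the Triton item, a forward-only sketch of a fused RMSNorm kernel: one program per row, with the load, reduction, and rescale fused into a single pass over memory. The assignment's version would also need a backward kernel; the names and row-per-program layout are my choices.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    # one program per row: load the row, compute its RMS, scale, store
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    assert x.is_cuda and x.is_contiguous()
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)  # whole row in one block
    rmsnorm_fwd[(n_rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out

x = torch.randn(1024, 768, device="cuda")
w = torch.ones(768, device="cuda")
ref = x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * w
print(torch.allclose(rmsnorm(x, w), ref, atol=1e-4))
```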
Choosing the right model size:
Compute-optimal scaling laws:
[Kaplan+ 2020]
[Hoffmann+ 2022]
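[Hoffmann+ 2022] finds that parameters and tokens should be scaled roughly in proportion, at about 20 tokens per parameter; combined with the standard C ≈ 6·N·D FLOPs approximation, this pins down a compute-optimal model size for a given budget (the budget below is illustrative):

```python
# Compute-optimal sizing under C ≈ 6*N*D with D ≈ 20*N (Chinchilla rule of thumb)
C = 1e23                   # FLOPs budget (illustrative)
N = (C / (6 * 20)) ** 0.5  # params: C = 6*N*(20*N)  =>  N = sqrt(C/120)
D = 20 * N                 # training tokens
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
# N ≈ 2.89e+10 params, D ≈ 5.77e+11 tokens
```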
Assignment 3:
We define a training API (hyperparameters -> loss) based on previous runs
Submit "training jobs" (under a FLOPs budget) and gather data points
Fit a scaling law to the data points (curve-fit sketch after this list)
Submit predictions for scaled up hyperparameters
Leaderboard: minimize loss given FLOPs budget
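For the fitting step, a sketch using the Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β with scipy. The data points are placeholders, and the actual fit in [Hoffmann+ 2022] uses a Huber loss in log space; plain least squares stands in here.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, B, alpha, beta):
    """Parametric scaling law L(N, D) = E + A/N**alpha + B/D**beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# placeholder data: (params, tokens) -> observed loss from training runs
N = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9])
D = np.array([1e9, 1e10, 1e9, 1e10, 1e9, 1e10])
loss = np.array([4.1, 3.7, 3.3, 2.9, 2.9, 2.5])

popt, _ = curve_fit(chinchilla_loss, (N, D), loss,
                    p0=[2.0, 300.0, 400.0, 0.3, 0.3], maxfev=20000)
E, A, B, alpha, beta = popt
print(f"E={E:.2f} alpha={alpha:.2f} beta={beta:.2f}")
# extrapolate to a scaled-up configuration
print("predicted loss:", chinchilla_loss((1e10, 2e11), *popt))
```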
What the model should do: model capabilities are determined by the data
Model evaluation:
Perplexity: textbook evaluation for language models (computation sketched after this list)
Standardized testing (e.g., MMLU, HellaSwag, GSM8K)
Instruction following (e.g., AlpacaEval, IFEval, WildBench)
Scaling test-time compute: chain-of-thought, ensembling
LM-as-a-judge: evaluate generative tasks
Full system: RAG, agents
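Perplexity is just the exponentiated mean per-token cross-entropy. A minimal sketch (the model-as-callable interface is assumed):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor) -> float:
    """PPL = exp(mean cross-entropy of next-token prediction).

    `model` is assumed to map ids [B, L] to logits [B, L, V];
    position i's logits predict token i+1, hence the shift below.
    """
    logits = model(token_ids)[:, :-1]   # drop last position
    targets = token_ids[:, 1:]          # drop first token
    nll = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                          targets.reshape(-1))
    return nll.exp().item()
```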
Data curation:
Data processing: convert HTML to text
Filtering: keep high quality data, remove harmful content (via classifiers)
Deduplication: save compute, avoid memorization; use Bloom filters or MinHash
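A toy sketch of MinHash dedup: per-document signatures whose positionwise agreement estimates Jaccard similarity over shingle sets. Salted MD5 stands in for a proper hash family, and real pipelines add LSH banding to avoid all-pairs comparison.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    """Overlapping n-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set: set[str], num_hashes: int = 64) -> list[int]:
    """Signature = per-hash-function minimum over the shingle set.
    P(signatures agree at one position) = Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
print(est_jaccard(a, b))  # high -> likely near-duplicates
```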
Assignment 4:
Convert Common Crawl HTML to text
Train classifiers to filter for quality and harmful content
Deduplication using MinHash
Leaderboard: minimize perplexity given token budget
Alignment:
SFT (supervised fine-tuning): learn from one-to-one prompt/response pairs
Preference data: choose between responses a and b (DPO); GRPO: learn from experience
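The DPO loss is compact enough to write out: a logistic loss on the policy's chosen-vs-rejected log-prob margin relative to a frozen reference model (a sketch; the tensor names and toy values are mine):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    """DPO: push the policy's margin (chosen vs. rejected) above the
    reference model's margin, scaled by beta, through a logistic loss."""
    margin = ((policy_chosen_lp - policy_rejected_lp)
              - (ref_chosen_lp - ref_rejected_lp))
    return -F.logsigmoid(beta * margin).mean()

# toy summed log-probs for a batch of 2 preference pairs
pc = torch.tensor([-12.0, -10.0])  # policy log p(chosen)
pr = torch.tensor([-13.0, -9.0])   # policy log p(rejected)
rc = torch.tensor([-12.5, -10.5])  # reference log p(chosen)
rr = torch.tensor([-12.5, -9.5])   # reference log p(rejected)
print(dpo_loss(pc, pr, rc, rr))
```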
Assignment 5:
Implement supervised fine-tuning, DPO, and GRPO