InternLM Camp 5 (书生五期): Fine-tuning a Small Edge-side Model for the Paper Classification Leaderboard
1. Resources
Project page: https://aicarrier.feishu.cn/wiki/I5ywwN1ZZi5m1pk5fhKcGL0Dn0j
The competition uses the arXiv dataset, a rich academic resource covering a huge number of papers in physics, mathematics, computer science, quantitative biology, quantitative finance, and other disciplines. As a well-known preprint platform, arXiv is updated quickly and covers cutting-edge research, so it provides diverse and specialized academic data for model training. The competition also adds newly collected arXiv data (including the two modest papers 尖米 posted over the past six months), which comes from a wide range of sources and complements the original arXiv dataset, further enriching the diversity and coverage of the data. This ensures the model is trained on a broader sample of papers and improves its ability to classify papers of different styles and topics.
On the model side, participants must fine-tune a model from the InternLM series. Based on the characteristics of the competition data and their own understanding of the models and technical ability, participants should choose a suitable InternLM version for fine-tuning, and strictly follow the relevant technical specifications and ethical guidelines to ensure the model's stability and reliability on the paper classification task.
2. Reproduction Steps
This walkthrough uses ModelScope's ms-swift framework; see the documentation: https://aicarrier.feishu.cn/wiki/HfQuwYa4Xi4WJVkuNF3cygqnnch
2.1 Environment Setup
- Install ms-swift:
conda create -n ms-swift python=3.10 -y
conda activate ms-swift
pip install ms-swift -U
pip install wandb
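After installation, a quick sanity check confirms the packages are importable. A minimal sketch, assuming ms-swift is imported as the swift package and exposes a version attribute:
# Quick environment check: verify ms-swift and wandb are importable.
import swift   # the ms-swift package is imported as `swift`
import wandb

print("ms-swift version:", getattr(swift, "__version__", "unknown"))
print("wandb version:", wandb.__version__)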
2.2 Download the Dataset
Activate the ms-swift environment created above and download the arXiv dataset; the dataset author has already preprocessed it:
conda activate ms-swift
pip install modelscope
modelscope download --dataset JimmyMa99/smartflow-arxiv-dataset --local_dir ./datasets/train
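To get a feel for what was downloaded, you can list the files and peek at a few records. A minimal sketch (the exact file names under ./datasets/train are an assumption; adjust to whatever the download actually produced):
import json
from pathlib import Path

data_dir = Path("./datasets/train")

# List everything that was downloaded, with file sizes.
for p in sorted(data_dir.rglob("*")):
    if p.is_file():
        print(p, f"{p.stat().st_size / 1024:.1f} KiB")

# Peek at the keys of the first record of each JSONL file, if the dataset ships as JSONL.
for jsonl in data_dir.rglob("*.jsonl"):
    with open(jsonl, encoding="utf-8") as f:
        first = json.loads(f.readline())
    print(jsonl.name, "->", list(first.keys()))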
2.3 Training
Training proceeds in two stages with two different datasets: continued pre-training first, then SFT (supervised fine-tuning) on top of the pre-trained model. The goal of pre-training is to learn general domain features and knowledge from a large corpus through self-supervised language modelling; these features generalize across tasks and domains. The goal of SFT is to specialize the pre-trained model to the target domain, using supervised learning to optimize its performance on the concrete task. The two stages also consume differently shaped data, sketched below.
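A rough illustration of what the two JSONL files might look like, assuming ms-swift's messages-style format (the field contents and the label set here are made up for illustration; the actual files in the downloaded dataset are authoritative):
import json

# Continued pre-training sample: plain unstructured text, no chat template
# (the pre-training script below passes --use_chat_template false).
pretrain_sample = {
    "messages": [
        {"role": "assistant",
         "content": "Title: ... Abstract: full unstructured paper text ..."}
    ]
}

# SFT sample: an instruction/answer pair where the answer is the category label.
sft_sample = {
    "messages": [
        {"role": "user",
         "content": "Classify the following paper into an arXiv category: ..."},
        {"role": "assistant", "content": "cs.CL"}
    ]
}

for sample in (pretrain_sample, sft_sample):
    print(json.dumps(sample, ensure_ascii=False))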
2.3.1 Pre-training
Pre-train with LoRA:
conda activate ms-swift
bash config/internlm3-8b-lora.sh
The content of config/internlm3-8b-lora.sh:
#!/bin/bash
# Create the log directory
LOG_DIR="logs"
mkdir -p $LOG_DIR  # ensure the log directory exists

# Timestamp used to build a unique log file name
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/internlm3-8b_lora_sft_${TIMESTAMP}.log"

# CUDA / parallelism environment variables
export NPROC_PER_NODE=1          # one process per node
export OMP_NUM_THREADS=1         # limit OpenMP threads to avoid contention
export CUDA_VISIBLE_DEVICES=0    # use GPU 0
export MASTER_PORT=$((10000 + RANDOM % 50000))

# Run training in the background with nohup so it survives the terminal closing.
# Key settings: LoRA rank 8, alpha 32, bfloat16, 1 epoch, per-device batch size 2,
# gradient accumulation 2 (effective batch size 4), learning rate 5e-5,
# 10% warmup, max sequence length 2048, no chat template, LoRA on all linear layers.
nohup swift sft \
    --model {model_path} \
    --train_type lora \
    --dataset {data_path} \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-5 \
    --warmup_ratio 0.1 \
    --split_dataset_ratio 0 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --use_chat_template false \
    --target_modules all-linear \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --save_total_limit 5 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir ./swift_output/InternLM3-8B-Lora \
    --dataloader_num_workers 256 \
    --model_author JimmyMa99 \
    --model_name InternLM3-8B-Lora \
    > "$LOG_FILE" 2>&1 &

# Print the process ID and the log file location
echo "Training started with PID $!"
echo "Log file: $LOG_FILE"
# How to follow the log in real time:
echo "To view logs in real-time, use:"
echo "tail -f $LOG_FILE"
After this step finishes, the LoRA checkpoint is written under the swift_output directory.
Merge the weights:
swift export --adapters xx/checkpoint-xxx --merge_lora true
Once the command completes, a merged checkpoint directory (checkpoint-xxx-merged) is produced.
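Before moving on to SFT, it can be worth verifying that the merged checkpoint loads as a standalone model. A minimal sketch with transformers (the checkpoint path is a placeholder; substitute your actual checkpoint-xxx-merged directory):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_path = "./swift_output/InternLM3-8B-Lora/vX-xxxx/checkpoint-xxx-merged"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(merged_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    merged_path, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

# A short generation just to confirm the weights load and produce text.
inputs = tokenizer("The arXiv category of this paper is", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))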
2.3.2 SFT
Train with LoRA.
Here, supervised fine-tuning uses the labeled dataset:
bash config/internlm3-8b-sft-lora.sh
The internlm3-8b-sft-lora.sh script:
#!/bin/bash
# Create the log directory
LOG_DIR="logs"
mkdir -p $LOG_DIR  # -p avoids an error if the directory already exists

# Timestamp used to build a unique log file name
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/internlm3-8b_lora_sft_${TIMESTAMP}.log"

# CUDA / parallelism environment variables
export NPROC_PER_NODE=1          # one process per node
export OMP_NUM_THREADS=1         # limit OpenMP threads
export CUDA_VISIBLE_DEVICES=0    # use GPU 0

# Set your own wandb API key here
export WANDB_API_KEY="********"

# Avoid CUDA memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Run SFT in the background with nohup so it survives the terminal closing.
# The base model is the merged checkpoint produced by the pre-training step
# (checkpoint-74-merged). Key settings: LoRA rank 8, alpha 32, bfloat16,
# 1 epoch, per-device batch size 22, gradient accumulation 2, learning rate 1e-4,
# 10% warmup, max sequence length 2048, metrics reported to Weights & Biases.
nohup swift sft \
    --model /root/code/camp5_course/swift_output/InternLM3-8B-Lora/v1-20250416-140542/checkpoint-74-merged \
    --train_type lora \
    --dataset '/root/code/camp5_course/data/swift_formatted_sft_train_data.jsonl' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 22 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --split_dataset_ratio 0 \
    --report_to wandb \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --save_total_limit 5 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir ./swift_output/InternLM3-8B-Lora \
    --dataloader_num_workers 256 \
    --model_author JimmyMa99 \
    --model_name InternLM3-8B-Lora \
    > "$LOG_FILE" 2>&1 &

# Print the process ID ($! is the PID of the most recent background job) and the log file location
echo "Training started with PID $!"
echo "Log file: $LOG_FILE"
# How to follow the log in real time:
echo "To view logs in real-time, use:"
echo "tail -f $LOG_FILE"
Merge the weights:
swift export --adapters xx/checkpoint-xxx --merge_lora true
After merging, a checkpoint-xxx-merged directory is produced, just as in the pre-training step.
2.4 Inference Test
Create the Streamlit demo script:
"""This script refers to the dialogue example of streamlit, the interactive
generation code of chatglm2 and transformers.We mainly modified part of the code logic to adapt to the
generation of our model.
Please refer to these links below for more information:1. streamlit chat example:https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps2. chatglm2:https://github.com/THUDM/ChatGLM2-6B3. transformers:https://github.com/huggingface/transformers
Please run with the command `streamlit run path/to/web_demo.py--server.address=0.0.0.0 --server.port 7860`.
Using `python path/to/web_demo.py` may cause unknown problems.
"""
# isort: skip_file
import copy
import warnings
from dataclasses import asdict, dataclass
from typing import Callable, List, Optionalimport streamlit as st
import torch
from torch import nn
from transformers.generation.utils import (LogitsProcessorList,StoppingCriteriaList)
from transformers.utils import loggingfrom transformers import AutoTokenizer, AutoModelForCausalLM # isort: skiplogger = logging.get_logger(__name__)
model_name_or_path="/root/finetune/models/internlm2-chat-7b"@dataclass
class GenerationConfig:# this config is used for chat to provide more diversitymax_length: int = 32768top_p: float = 0.8temperature: float = 0.8do_sample: bool = Truerepetition_penalty: float = 1.005@torch.inference_mode()
def generate_interactive(model,tokenizer,prompt,generation_config: Optional[GenerationConfig] = None,logits_processor: Optional[LogitsProcessorList] = None,stopping_criteria: Optional[StoppingCriteriaList] = None,prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor],List[int]]] = None,additional_eos_token_id: Optional[int] = None,**kwargs,
):inputs = tokenizer([prompt], padding=True, return_tensors='pt')input_length = len(inputs['input_ids'][0])for k, v in inputs.items():inputs[k] = v.cuda()input_ids = inputs['input_ids']_, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1]if generation_config is None:generation_config = model.generation_configgeneration_config = copy.deepcopy(generation_config)model_kwargs = generation_config.update(**kwargs)bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612generation_config.bos_token_id,generation_config.eos_token_id,)if isinstance(eos_token_id, int):eos_token_id = [eos_token_id]if additional_eos_token_id is not None:eos_token_id.append(additional_eos_token_id)has_default_max_length = kwargs.get('max_length') is None and generation_config.max_length is not Noneif has_default_max_length and generation_config.max_new_tokens is None:warnings.warn(f"Using 'max_length''s default \({repr(generation_config.max_length)}) \to control the generation length. "'This behaviour is deprecated and will be removed from the \config in v5 of Transformers -- we'' recommend using `max_new_tokens` to control the maximum \length of the generation.',UserWarning,)elif generation_config.max_new_tokens is not None:generation_config.max_length = generation_config.max_new_tokens + \input_ids_seq_lengthif not has_default_max_length:logger.warn( # pylint: disable=W4902f"Both 'max_new_tokens' (={generation_config.max_new_tokens}) "f"and 'max_length'(={generation_config.max_length}) seem to ""have been set. 'max_new_tokens' will take precedence. "'Please refer to the documentation for more information. ''(https://huggingface.co/docs/transformers/main/''en/main_classes/text_generation)',UserWarning,)if input_ids_seq_length >= generation_config.max_length:input_ids_string = 'input_ids'logger.warning(f'Input length of {input_ids_string} is {input_ids_seq_length}, 'f"but 'max_length' is set to {generation_config.max_length}. "'This can lead to unexpected behavior. You should consider'" increasing 'max_new_tokens'.")# 2. 
Set generation parameters if not already definedlogits_processor = logits_processor if logits_processor is not None \else LogitsProcessorList()stopping_criteria = stopping_criteria if stopping_criteria is not None \else StoppingCriteriaList()logits_processor = model._get_logits_processor(generation_config=generation_config,input_ids_seq_length=input_ids_seq_length,encoder_input_ids=input_ids,prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,logits_processor=logits_processor,)stopping_criteria = model._get_stopping_criteria(generation_config=generation_config,stopping_criteria=stopping_criteria)logits_warper = model._get_logits_warper(generation_config)unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)scores = Nonewhile True:model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)# forward pass to get next tokenoutputs = model(**model_inputs,return_dict=True,output_attentions=False,output_hidden_states=False,)next_token_logits = outputs.logits[:, -1, :]# pre-process distributionnext_token_scores = logits_processor(input_ids, next_token_logits)next_token_scores = logits_warper(input_ids, next_token_scores)# sampleprobs = nn.functional.softmax(next_token_scores, dim=-1)if generation_config.do_sample:next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)else:next_tokens = torch.argmax(probs, dim=-1)# update generated ids, model inputs, and length for next stepinput_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)model_kwargs = model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False)unfinished_sequences = unfinished_sequences.mul((min(next_tokens != i for i in eos_token_id)).long())output_token_ids = input_ids[0].cpu().tolist()output_token_ids = output_token_ids[input_length:]for each_eos_token_id in eos_token_id:if output_token_ids[-1] == each_eos_token_id:output_token_ids = output_token_ids[:-1]response = tokenizer.decode(output_token_ids)yield response# stop when each sentence is finished# or if we exceed the maximum lengthif unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):breakdef on_btn_click():del st.session_state.messages@st.cache_resource
def load_model():model = (AutoModelForCausalLM.from_pretrained(model_name_or_path,trust_remote_code=True).to(torch.bfloat16).cuda())tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,trust_remote_code=True)return model, tokenizerdef prepare_generation_config():with st.sidebar:max_length = st.slider('Max Length',min_value=8,max_value=32768,value=32768)top_p = st.slider('Top P', 0.0, 1.0, 0.8, step=0.01)temperature = st.slider('Temperature', 0.0, 1.0, 0.7, step=0.01)st.button('Clear Chat History', on_click=on_btn_click)generation_config = GenerationConfig(max_length=max_length,top_p=top_p,temperature=temperature)return generation_configuser_prompt = '<|im_start|>user\n{user}<|im_end|>\n'
robot_prompt = '<|im_start|>assistant\n{robot}<|im_end|>\n'
cur_query_prompt = '<|im_start|>user\n{user}<|im_end|>\n\<|im_start|>assistant\n'def combine_history(prompt):messages = st.session_state.messagesmeta_instruction = ('You are a helpful, honest, ''and harmless AI assistant.')total_prompt = f'<s><|im_start|>system\n{meta_instruction}<|im_end|>\n'for message in messages:cur_content = message['content']if message['role'] == 'user':cur_prompt = user_prompt.format(user=cur_content)elif message['role'] == 'robot':cur_prompt = robot_prompt.format(robot=cur_content)else:raise RuntimeErrortotal_prompt += cur_prompttotal_prompt = total_prompt + cur_query_prompt.format(user=prompt)return total_promptdef main():st.title('internlm2_5-7b-chat-assistant')# torch.cuda.empty_cache()print('load model begin.')model, tokenizer = load_model()print('load model end.')generation_config = prepare_generation_config()# Initialize chat historyif 'messages' not in st.session_state:st.session_state.messages = []# Display chat messages from history on app rerunfor message in st.session_state.messages:with st.chat_message(message['role'], avatar=message.get('avatar')):st.markdown(message['content'])# Accept user inputif prompt := st.chat_input('What is up?'):# Display user message in chat message containerwith st.chat_message('user', avatar='user'):st.markdown(prompt)real_prompt = combine_history(prompt)# Add user message to chat historyst.session_state.messages.append({'role': 'user','content': prompt,'avatar': 'user'})with st.chat_message('robot', avatar='assistant'):message_placeholder = st.empty()for cur_response in generate_interactive(model=model,tokenizer=tokenizer,prompt=real_prompt,additional_eos_token_id=92542,device='cuda:0',**asdict(generation_config),):# Display robot response in chat message containermessage_placeholder.markdown(cur_response + '▌')message_placeholder.markdown(cur_response)# Add robot response to chat historyst.session_state.messages.append({'role': 'robot','content': cur_response, # pylint: disable=undefined-loop-variable'avatar': 'assistant',})torch.cuda.empty_cache()if __name__ == '__main__':main()
Run it:
pip install streamlit==1.31.0
streamlit run /root/Tutorial/tools/L1_XTuner_code/xtuner_streamlit_demo.py
2.5 Evaluation
2.5.1 Local Evaluation
Evaluation uses OpenCompass (司南). Clone the OpenCompass repository and install it:
git clone https://github.moeyy.xyz/https://github.com/open-compass/opencompass opencompass
cd opencompass
conda create -n opencompass python=3.10 -y
conda activate opencompass
pip install -e .
Modify the opencompass/opencompass/models/huggingface.py file as described in the tutorial.
Create a new Python script, internlm3-oc_eval.py:
# To run this example:
# 1. Install the latest opencompass (see the steps above)
# 2. Point `path` and `tokenizer_path` below at your merged checkpoint
# 3. Point the dataset `path` at the competition test CSV
# 4. Start the evaluation with `python run.py internlm3-oc_eval.py --debug`
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        path='/root/code/camp5_course/swift_output/InternLM3-8B-Lora/v3-20250416-163856/checkpoint-95-merged',
        tokenizer_path='/root/code/camp5_course/swift_output/InternLM3-8B-Lora/v3-20250416-163856/checkpoint-95-merged',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left'),
        model_kwargs=dict(device_map='auto'),
        max_seq_len=4090,
        max_out_len=100,
        batch_size=4,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

datasets = [
    {
        "path": "/root/code/camp5_course/test_data/newformat_sft_test_data.csv",
        "data_type": "mcq",
        "infer_method": "gen",
    },
]
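The datasets entry points at a plain CSV of multiple-choice questions. A rough sketch of building such a file with pandas, assuming OpenCompass's custom mcq convention of question, option, and answer columns (check the test data shipped with the competition for the exact headers):
import pandas as pd

rows = [
    {
        "question": "Which arXiv category best fits this paper? <abstract text ...>",
        "A": "cs.CL",
        "B": "cs.CV",
        "C": "math.CO",
        "D": "q-bio.NC",
        "answer": "A",
    },
]

# Write the sample rows in the assumed column layout.
pd.DataFrame(rows).to_csv("newformat_sft_test_data.csv", index=False)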
Run the evaluation:
python run.py internlm3-oc_eval.py --debug > app.log 2>&1
The results are written to the app.log log file.
2.6 Upload to ModelScope
First, create a model repo of your own on the ModelScope community site.
Then upload the fine-tuned model:
from modelscope.hub.api import HubApi
from modelscope.hub.constants import Licenses, ModelVisibility

# Basic configuration
YOUR_ACCESS_TOKEN = '**************'  # your ModelScope API key
api = HubApi()
api.login(YOUR_ACCESS_TOKEN)

# Naming
owner_name = '**********'  # your ModelScope username; change to your own
model_name = '**********'  # pick a nice name for the model repo; change to your own
model_id = f"{owner_name}/{model_name}"

# Create the model repo. If it was already created through the ModelScope web UI,
# this block can be skipped.
# api.create_model(
#     model_id,
#     visibility=ModelVisibility.PUBLIC,
#     license=Licenses.APACHE_V2,
#     chinese_name=f"{owner_name}的论文分类打榜赛模型"
# )

# Upload the model
api.upload_folder(
    repo_id=f"{owner_name}/{model_name}",
    folder_path='/root/swift_output/InternLM3-8B-Lora/v4-20250521-215238/checkpoint-8-merged',  # folder containing the fine-tuned, merged model
    commit_message='',  # write something cheerful here
)
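After the upload finishes, one way to confirm the files actually landed in the repo is to pull them back down. A small sketch using modelscope's snapshot_download (the repo id is the placeholder from above; replace it with your real owner/model name):
from modelscope import snapshot_download

# Downloads the uploaded model into the local cache and prints where it went.
local_dir = snapshot_download("{owner_name}/{model_name}")  # replace with your actual repo id
print("Model downloaded to:", local_dir)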