DeepSeek R1 Deployment Notes (personal, framework-free deployment, slowly being updated)
Deployment
The new machine has a 4060 Ti with 8 GB of VRAM.
Planning to deploy DeepSeek-R1-Distill-Llama-8B.
Environment setup
nvidia-smi reports CUDA 12.7 (the version shown there is the highest CUDA the driver supports, not an installed toolkit). Anaconda is installed.
Created a conda environment and installed the dependencies.
The official index was slow from here, so I looked for a mirror.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # CUDA build; cu118 here, though cu121 or newer would be better, otherwise Flash Attention cannot be used to speed up generation
# pip install torch torchvision torchaudio # CPU build
# The CUDA download was too slow, so I switched to the Aliyun mirror
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 -f https://mirrors.aliyun.com/pytorch-wheels/cu118
pip install transformers==4.37.0 accelerate sentencepiece # 4.37 later turned out to throw errors; upgrading to 4.49 with pip install --upgrade transformers fixed it
# The mirror is roughly three times faster, but the download would still take an hour. I got impatient, pasted the wheel URL straight into the browser, and it finished in five minutes:
https://mirrors.aliyun.com/pytorch-wheels/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl
Then cd into the folder containing the download and install the wheel with pip:
`pip install <filename>.whl`
Several other packages either kept dropping the connection or downloaded too slowly, so I handled them the same way by downloading manually. Remember to put the package in the folder where you run the command, or cd into that folder before running pip.
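After installation, a quick sanity check confirms that the CUDA build of torch actually landed (a minimal sketch; the values printed will of course depend on what was installed):

import torch

print(torch.__version__)              # e.g. 2.3.1+cu118
print(torch.version.cuda)             # CUDA version the wheel was built against
print(torch.cuda.is_available())      # should be True on the 4060 Ti
print(torch.cuda.get_device_name(0))  # GPU name as seen by torch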
Model download
Planning to pull the original model from ModelScope with git lfs and deploy that.
pip install git-lfs
git lfs clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Llama-8B D:\PythonProject\DeepSeek_Project
The model weight files would not download (if anyone knows why, please tell me). In the end I downloaded them manually, which still took well over an hour. Lesson learned.
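If git lfs keeps failing, the ModelScope Python SDK is another route worth trying; a minimal sketch, assuming the modelscope package is installed and using an example cache_dir (I have not compared its speed against the manual download):

# pip install modelscope
from modelscope import snapshot_download

# Downloads the whole repository (weights included) and returns the local directory
model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    cache_dir=r"D:\PythonProject",  # example path, adjust as needed
)
print(model_dir)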
Running the model
Basic deployment and usage
I wrote a simple Python script and run it with python deepseek.py.
The script borrows from the code at https://blog.csdn.net/ddv_08/article/details/145412729.
# -*- coding: GBK -*-
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = r"D:\PythonProject\DeepSeek_8B"  # model path (raw string so the backslashes are not treated as escapes)
device = "cuda"  # if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16
    # device_map="auto"
).to(device)

# Generation function
def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(device)
    outputs = model.generate(
        inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9
    )
    response = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    print(len(response))  # debug: length of the generated text
    return response

# Simple chat loop for testing (prompt labels kept in Chinese: 用户 = user, 助手 = assistant)
if __name__ == "__main__":
    while True:
        user_input = input("用户:")
        if user_input.lower() == "exit":
            break
        print("助手:", generate_response(user_input))
Further optimization
Speeding things up with Flash Attention
To use this option you first have to install the flash-attn package, for example with pip install flash-attn --no-build-isolation, but for me the download was very slow and dropped the connection repeatedly.
These two articles cover the installation:
https://blog.csdn.net/A15216110998/article/details/144854255
https://blog.51cto.com/u_15344287/13120915
Since I am on Windows, I downloaded flash_attn-2.7.1.post1+cu124torch2.3.1cxx11abiFALSE-cp310-cp310-win_amd64.whl
In my own testing it does speed things up; I have not measured exactly by how much.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_flash_attention_2=True,
    # ...other arguments as in the script above
)
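Note that on recent transformers releases (including the 4.49 this environment ended up on) use_flash_attention_2=True is deprecated in favor of the attn_implementation argument; a minimal sketch of the equivalent call, assuming the same model_path as in the script above:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,                               # same local path as above
    torch_dtype=torch.float16,                # flash-attn kernels require fp16/bf16
    attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
)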
Errors encountered
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
Full traceback:
(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.11s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "D:\PythonProject\DeepSeek_8B\deepseek.py", line 16, in <module>
).to(device)
File "D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\accelerate\big_modeling.py", line 458, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
This error appears as soon as device_map="auto" is used together with .to(device): with device_map="auto", accelerate may offload some modules to the CPU (the log above says exactly that), and a model dispatched this way cannot be moved as a whole afterwards. For now I simply dropped the device_map argument; the two mutually exclusive loading styles are sketched below.
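A minimal sketch of the two loading styles, assuming the same model_path (pick one, never both):

from transformers import AutoModelForCausalLM
import torch

# Option 1: let accelerate place the layers (it may offload part of the model
# to CPU on an 8 GB card); do NOT call .to(device) on the result.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: what the script above does - load without device_map and move the
# whole model to the GPU yourself.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
).to("cuda")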
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. (does not affect running)
This warning appears at generation time, but it does not affect the output, so I left it alone. If you want to silence it, see the sketch after the log below.
(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.41s/it]
用户:解释一些use_flash_attention_2=True能加速推理的原理
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\transformers\integrations\sdpa_attention.py:53: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
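The last two lines also show this run falling back to PyTorch's SDPA kernel, which matches the earlier note that the cu118 torch build does not ship with flash attention. As for the attention-mask warning itself, it can be silenced by passing an explicit mask and pad token id to generate(). A minimal sketch of an adjusted generate_response, assuming the tokenizer, model, and device from the script above (return_dict=True makes apply_chat_template return the mask alongside the ids; I have not re-benchmarked this variant):

def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True returns both input_ids and attention_mask
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(device)
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # explicit pad id; same as eos for this model
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)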