DeepSeek R1 Deployment Notes (personal, framework-free deployment, slowly being updated)
Deployment
The new machine has a 4060 Ti with 8 GB of VRAM.
Planning to deploy DeepSeek-R1-Distill-Llama-8B.
Environment setup
nvidia-smi reports CUDA 12.7 (the version shown there is the highest CUDA the driver supports, not an installed toolkit). Anaconda is installed.
Created a conda environment and installed the dependencies.
The official index was slow from here, so I looked for a mirror.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # CUDA build; cu118 here, though cu121 or newer would be better, otherwise Flash Attention cannot be used to speed up generation
# pip install torch torchvision torchaudio # CPU build
# The CUDA download was too slow, so I switched to the Aliyun mirror
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 -f https://mirrors.aliyun.com/pytorch-wheels/cu118
pip install transformers==4.37.0 accelerate sentencepiece # 4.37 later turned out to throw errors; upgrading to 4.49 with pip install --upgrade transformers fixed it
# The mirror is roughly three times faster, but the download would still take an hour. I got impatient, pasted the wheel URL straight into the browser, and it finished in five minutes:
https://mirrors.aliyun.com/pytorch-wheels/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl
Then cd into the folder containing the download and install the wheel with pip:
`pip install <filename>.whl`
Several other packages either kept dropping the connection or downloaded too slowly, so I handled them the same way by downloading manually. Remember to put the package in the folder where you run the command, or cd into that folder before running pip.
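After installation, a quick sanity check confirms that the CUDA build of torch actually landed (a minimal sketch; the values printed will of course depend on what was installed):

import torch

print(torch.__version__)              # e.g. 2.3.1+cu118
print(torch.version.cuda)             # CUDA version the wheel was built against
print(torch.cuda.is_available())      # should be True on the 4060 Ti
print(torch.cuda.get_device_name(0))  # GPU name as seen by torch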
Model download
Planning to pull the original model from ModelScope with git lfs and deploy that.
pip install git-lfs
git lfs clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Llama-8B D:\PythonProject\DeepSeek_Project
The model weight files would not download (if anyone knows why, please tell me). In the end I downloaded them manually, which still took well over an hour. Lesson learned.
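If git lfs keeps failing, the ModelScope Python SDK is another route worth trying; a minimal sketch, assuming the modelscope package is installed and using an example cache_dir (I have not compared its speed against the manual download):

# pip install modelscope
from modelscope import snapshot_download

# Downloads the whole repository (weights included) and returns the local directory
model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    cache_dir=r"D:\PythonProject",  # example path, adjust as needed
)
print(model_dir)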
Running the model
Basic deployment and usage
I wrote a simple Python script and run it with python deepseek.py.
The script borrows from the code at https://blog.csdn.net/ddv_08/article/details/145412729.
# -*- coding: GBK -*-
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = r"D:\PythonProject\DeepSeek_8B"  # model path (raw string so the backslashes are not treated as escapes)
device = "cuda"  # if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16
    # device_map="auto"
).to(device)

# Generation function
def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(device)
    outputs = model.generate(
        inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9
    )
    response = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    print(len(response))  # debug: length of the generated text
    return response

# Simple chat loop for testing (prompt labels kept in Chinese: 用户 = user, 助手 = assistant)
if __name__ == "__main__":
    while True:
        user_input = input("用户:")
        if user_input.lower() == "exit":
            break
        print("助手:", generate_response(user_input))
Further optimization
Speeding things up with Flash Attention
To use this option you first have to install the flash-attn package, for example with pip install flash-attn --no-build-isolation, but for me the download was very slow and dropped the connection repeatedly.
These two articles cover the installation:
https://blog.csdn.net/A15216110998/article/details/144854255
https://blog.51cto.com/u_15344287/13120915
Since I am on Windows, I downloaded flash_attn-2.7.1.post1+cu124torch2.3.1cxx11abiFALSE-cp310-cp310-win_amd64.whl
In my own testing it does speed things up; I have not measured exactly by how much.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_flash_attention_2=True,
    # ...other arguments as in the script above
)
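Note that on recent transformers releases (including the 4.49 this environment ended up on) use_flash_attention_2=True is deprecated in favor of the attn_implementation argument; a minimal sketch of the equivalent call, assuming the same model_path as in the script above:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,                               # same local path as above
    torch_dtype=torch.float16,                # flash-attn kernels require fp16/bf16
    attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
)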
Errors encountered
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
Full traceback:
(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.11s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
File "D:\PythonProject\DeepSeek_8B\deepseek.py", line 16, in <module>
).to(device)
File "D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\accelerate\big_modeling.py", line 458, in wrapper
raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.
This error appears as soon as device_map="auto" is used together with .to(device): with device_map="auto", accelerate may offload some modules to the CPU (the log above says exactly that), and a model dispatched this way cannot be moved as a whole afterwards. For now I simply dropped the device_map argument; the two mutually exclusive loading styles are sketched below.
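A minimal sketch of the two loading styles, assuming the same model_path (pick one, never both):

from transformers import AutoModelForCausalLM
import torch

# Option 1: let accelerate place the layers (it may offload part of the model
# to CPU on an 8 GB card); do NOT call .to(device) on the result.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option 2: what the script above does - load without device_map and move the
# whole model to the GPU yourself.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
).to("cuda")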
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. (does not affect running)
This warning appears at generation time, but it does not affect the output, so I left it alone. If you want to silence it, see the sketch after the log below.
(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.41s/it]
用户:解释一些use_flash_attention_2=True能加速推理的原理
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\transformers\integrations\sdpa_attention.py:53: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
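The last two lines also show this run falling back to PyTorch's SDPA kernel, which matches the earlier note that the cu118 torch build does not ship with flash attention. As for the attention-mask warning itself, it can be silenced by passing an explicit mask and pad token id to generate(). A minimal sketch of an adjusted generate_response, assuming the tokenizer, model, and device from the script above (return_dict=True makes apply_chat_template return the mask alongside the ids; I have not re-benchmarked this variant):

def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True returns both input_ids and attention_mask
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(device)
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # explicit pad id; same as eos for this model
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)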