
DeepSeek R1 Deployment Notes (personal, framework-free deployment; slowly being updated)

Deployment

New machine with an RTX 4060 Ti (8 GB VRAM).
Planning to deploy DeepSeek-R1-Distill-Llama-8B.

Environment Setup

nvidia-smi reports CUDA 12.7 (the driver's maximum supported CUDA version, not an installed toolkit); Anaconda is installed.

Created a conda environment and started installing dependencies.
Downloads from the overseas index were slow, so I looked for a mirror.


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # CUDA build; cu118 is used here, though cu121 or newer is preferable, otherwise Flash Attention can't be used to speed up generation
# pip install torch torchvision torchaudio  # CPU build

# The CUDA wheels downloaded too slowly, so I tried the Aliyun mirror
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 -f https://mirrors.aliyun.com/pytorch-wheels/cu118
pip install transformers==4.37.0 accelerate sentencepiece       # 4.37 later proved broken (it threw errors); upgrading to 4.49 with pip install --upgrade transformers fixed it

# The mirror is maybe three times faster, but the download would still take an hour. I admit I got impatient and pasted the wheel URL straight into the browser, where it finished in five minutes:
https://mirrors.aliyun.com/pytorch-wheels/cu118/torch-2.3.1%2Bcu118-cp310-cp310-win_amd64.whl

Then go into the download folder and install with pip:
`pip install <filename>.whl`
Several other packages also either kept disconnecting or downloaded too slowly, so I fetched them manually the same way. Remember to put the wheels in the folder where you run the command, or cd into that folder first, as in the example below.
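For example (the download folder here is illustrative, not from my actual setup):

cd D:\Downloads
pip install torch-2.3.1+cu118-cp310-cp310-win_amd64.whl

# quick sanity check that the CUDA build actually sees the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"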

Model Download

Planning to use git lfs to download the original model from ModelScope for deployment.

pip install git-lfs
git lfs clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Llama-8B D:\PythonProject\DeepSeek_Project

The model files failed to download this way (if anyone knows why, please tell me). In the end I downloaded them manually, which still took well over an hour. Lesson learned.
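An alternative I haven't tried myself: ModelScope's Python SDK can download the model snapshot directly, which may be more robust than git lfs here. A minimal sketch (the cache_dir path is just an example):

# pip install modelscope
from modelscope import snapshot_download

# downloads the model files and returns the local directory that contains them
model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    cache_dir=r"D:\PythonProject"
)
print(model_dir)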

Running the Model

Basic Deployment and Usage

I wrote a simple Python script and run it with the command python deepseek.py.

The script here borrows from the code at https://blog.csdn.net/ddv_08/article/details/145412729.

# -*- coding: utf-8 -*-
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = r"D:\PythonProject\DeepSeek_8B"  # model path (raw string, so the backslashes are not treated as escapes)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16
    # device_map="auto"  # triggers the RuntimeError below when combined with .to(device)
).to(device)



# Generation function
def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(device)

    outputs = model.generate(
        inputs,
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9
    )
    # decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    print(len(response))  # debug output: character length of the response
    return response
 
# Test dialogue loop
if __name__ == "__main__":
    while True:
        user_input = input("User: ")
        if user_input.lower() == "exit":
            break
        print("Assistant:", generate_response(user_input))



Further Optimization

Accelerating with Flash Attention

To use this option you must first install the flash-attn package, e.g. with pip install flash-attn --no-build-isolation, but the download was very slow and disconnected repeatedly.

Installation is covered in these two articles:
https://blog.csdn.net/A15216110998/article/details/144854255
https://blog.51cto.com/u_15344287/13120915

Since I'm running on Windows, I downloaded flash_attn-2.7.1.post1+cu124torch2.3.1cxx11abiFALSE-cp310-cp310-win_amd64.whl.
It does give a real speedup in my testing, though I haven't measured exactly how much.


model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_flash_attention_2=True,
    # ...other arguments
)
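Note that on recent transformers releases (including the 4.49 I upgraded to), use_flash_attention_2 is deprecated in favor of the attn_implementation argument; the equivalent call should look like this:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # flash-attn requires fp16 or bf16 weights
    attn_implementation="flash_attention_2",
    # ...other arguments
)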

Errors Encountered

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

Full error output:

(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.11s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "D:\PythonProject\DeepSeek_8B\deepseek.py", line 16, in <module>
    ).to(device)
  File "D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\accelerate\big_modeling.py", line 458, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

This happens when device_map="auto" is combined with .to(device): with device_map="auto", accelerate dispatches the model's layers across devices via hooks (offloading some to CPU when VRAM runs short), and a model dispatched that way cannot then be moved wholesale with .to(). Use one placement mechanism or the other, not both, as sketched below.
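A minimal sketch of the two mutually exclusive placement options (my summary of the accelerate behavior, not code from the original post):

# Option 1: let accelerate place the layers (and offload to CPU if VRAM runs out);
# do NOT call .to(device) afterwards
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Option 2: load normally and move the whole model to the GPU yourself
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16
).to("cuda")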

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. (does not affect execution)

This warning appears at runtime but doesn't affect anything downstream, so I've left it alone for now; a possible fix is sketched after the log below.

(deepseek) D:\PythonProject\DeepSeek_8B>python deepseek.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.41s/it]
User: Explain how use_flash_attention_2=True speeds up inference
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
D:\ProgramData\anaconda3\envs\deepseek\lib\site-packages\transformers\integrations\sdpa_attention.py:53: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
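If you do want to silence these warnings, a minimal sketch of a fix (my own addition, not part of the referenced script): have apply_chat_template also return the attention_mask by passing return_dict=True, and set pad_token_id explicitly in generate:

def generate_response(prompt):
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True makes apply_chat_template return input_ids AND attention_mask
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(device)

    outputs = model.generate(
        **inputs,                             # forwards input_ids and attention_mask
        max_new_tokens=1000,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id   # silences the pad_token_id warning
    )
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return response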
