

wzjs 2025/9/5 9:37:14


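The following program calls a locally deployed LLaMA 3 model to generate a chat reply. It uses the Hugging Face Transformers API to load the model, preprocess the input, generate, and print the final answer.

The program uses the chat model format (a LLaMA 3 Instruct model). The prompt follows the model's own chat template, which wraps each message in <|start_header_id|>, <|end_header_id|> and <|eot_id|> markers (this is not the ChatML format), and it is built with apply_chat_template.

First, run the Python script below to download the model to local disk (the download location is set by the cache_dir parameter).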

# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('LLM-Research/Llama-3.2-1B-Instruct', cache_dir="/root/autodl-tmp/llm")

Then load the downloaded local model and run the remaining steps.

from transformers import AutoModelForCausalLM, AutoTokenizer

DEVICE = "cuda"

# Local model path: the root directory that holds the model's configuration files
model_dir = "/root/autodl-tmp/llm/LLM-Research/Llama-3___2-1B-Instruct"
# Load the model with transformers
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Call the model
# Define the prompt ("Hello, what is your name? Who created you?")
prompt = "你好，你叫什么名字？你是由谁创造的？"
# Wrap the prompt in a list of messages
messages = [
    {"role": "system", "content": "You are a helpful assistant system"},
    {"role": "user", "content": prompt},
]
# Use the tokenizer's apply_chat_template method to render the messages as the prompt string the chat model expects
# tokenize=False means the text is not tokenized at this step
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print("--------------")
print(text)
print("--------------")

# Tokenize the rendered text and convert it to the model's input tensors
model_inputs = tokenizer([text], return_tensors="pt").to(DEVICE)
# Feed the input to the model to get the output
generated = model.generate(model_inputs.input_ids, max_new_tokens=512)
print(generated)

# Decode the output token IDs back into text
response = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(response)

root@autodl-container-38c543b634-d7f7c9f4:~/autodl-tmp/demo_10# python llama3.2_test.py 
--------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 12 May 2025

You are a helpful assistant system<|eot_id|><|start_header_id|>user<|end_header_id|>

你好，你叫什么名字？你是由谁创造的？<|eot_id|><|start_header_id|>assistant<|end_header_id|>


--------------
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
tensor([[128000, 128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,
           2696,     25,   6790,    220,   2366,     18,    198,  15724,   2696,
             25,    220,    717,   3297,    220,   2366,     20,    271,   2675,
            527,    264,  11190,  18328,   1887, 128009, 128006,    882, 128007,
            271,  57668,  53901, 104660, 105424, 101879, 119395,  11571,  57668,
          21043,  68171, 112471, 104251,  67178,   9554,  11571, 128009, 128006,
          78191, 128007,    271,  37046,  21043,  16197,  44689,  15836,  18328,
           3922, 113230,  13372,  18184,    445,  81101,   1811,     43,  81101,
          55951,  16197,  44689,  48044,  27384, 121790,   9554, 109683, 120074,
          55642, 123123,   3922,  68438,  38129,  43240,  87502,  41007,  37507,
         111478,  34208,  23226,  42399,   1811, 128009]], device='cuda:0')
['system\n\nCutting Knowledge Date: December 2023\nToday Date: 12 May 2025\n\nYou are a helpful assistant systemuser\n\n你好，你叫什么名字？你是由谁创造的？assistant\n\n我是 Meta 的AI assistant，目前名为 Llama。Llama 是 Meta 的一个大规模的自然语言处理模型，通过使用多种方法来学习和改进。']

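The purpose of this program is to generate a Chinese chat reply with a locally deployed LLaMA 3 model. The script itself sends a single user turn, but the same messages format extends to multi-turn dialogue. It has four core parts: model loading, input construction, text generation, and output decoding.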

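The program starts by importing the required modules and setting the compute device to "cuda", that is, using the GPU to speed up inference. It then sets model_dir to the local directory of the model, which should contain the weights and configuration files in Hugging Face format. The model is loaded with AutoModelForCausalLM.from_pretrained using torch_dtype="auto" and device_map="auto". With these settings, transformers takes the data type from the checkpoint's configuration (usually a half-precision type such as bfloat16 or float16) and places the model on the available GPUs automatically (this requires the accelerate library). AutoTokenizer loads the matching tokenizer from the same path; it converts human-readable text into the token IDs the model works with.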
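Since loading fails with a fairly opaque error when model_dir points at the wrong level of the download tree, a quick pre-check can help. The following is a minimal sketch, not part of the original script: check_model_dir is a hypothetical helper, and the list of files it looks for (config.json, tokenizer_config.json, and at least one *.safetensors or *.bin weight file) is an assumption about a typical Hugging Face-format checkpoint.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in a local model directory (empty if none).

    Hypothetical helper: the expected file names are an assumption about a
    typical Hugging Face-format checkpoint, not an exhaustive specification.
    """
    root = Path(model_dir)
    problems = []
    # Configuration files that from_pretrained normally reads
    for name in ("config.json", "tokenizer_config.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    # Weights are usually stored as *.safetensors or *.bin files
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight file")
    return problems
```

Calling check_model_dir(model_dir) before from_pretrained and printing the returned list is enough to spot a wrong path early.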

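Next, the program prepares one user input: 你好，你叫什么名字？你是由谁创造的？ ("Hello, what is your name? Who created you?"). To build a standard chat input, it creates a messages list containing a "system" message (which sets the assistant's role) and a "user" question. This is the format that chat models such as LLaMA 3 Instruct support, similar to the conversation format used by ChatGPT. The tokenizer.apply_chat_template method then converts the messages into the plain-text prompt the model expects. Among its parameters, tokenize=False means the result is not yet converted to token IDs, and add_generation_prompt=True appends the assistant header at the end of the text, which cues the model to start writing its answer.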
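To make the structure of the rendered prompt easier to see, here is a simplified pure-Python sketch of what the template produces. It is an illustration only: render_llama3_prompt is a hypothetical function, and it leaves out the "Cutting Knowledge Date" and "Today Date" lines that the real template inserts into the system block (visible in the log above). In real code, tokenizer.apply_chat_template should be used.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in a simplified LLaMA 3 header format (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message: role header, blank line, content, end-of-turn marker
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to answer
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "You are a helpful assistant system"},
        {"role": "user", "content": "你好，你叫什么名字？你是由谁创造的？"},
    ]
    print(render_llama3_prompt(demo))
```

With add_generation_prompt=False the string ends at the user's <|eot_id|>, so the model has no cue that an assistant turn should follow.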

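Once the prompt is built, the program uses the tokenizer to convert the text into token IDs. return_tensors="pt" requests PyTorch tensors, and .to(DEVICE) moves them to the GPU so they can be used as model input. The program then calls model.generate to have the model produce a reply from this input. max_new_tokens=512 caps the newly generated content at 512 tokens. Because only input_ids is passed, without the attention mask, generate prints the attention-mask and pad-token warnings that appear in the log.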
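The log shows three side effects of this call: the attention-mask warnings, a doubled <|begin_of_text|> token (ID 128000 appears twice at the start of the tensor), and a decoded string that repeats the prompt. The sketch below is one possible way to avoid them, assuming model and tokenizer were loaded as in the script; generate_reply is a hypothetical helper name, not something the original program defines.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=512):
    """Generate one assistant reply and return only the newly generated text.

    Assumes `model` and `tokenizer` were loaded with from_pretrained as in the
    script above; the function name is hypothetical.
    """
    # Tokenizing inside apply_chat_template should avoid the second
    # <|begin_of_text|> token; return_dict=True also yields the attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    output_ids = model.generate(
        **inputs,  # passes input_ids and attention_mask together
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # set explicitly to silence the pad-token notice
    )
    # generate returns prompt tokens followed by new tokens; keep only the new ones
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling generate_reply(model, tokenizer, messages) would then return just the assistant's answer rather than the system text, the user text, and the answer joined together.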

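The model produces a sequence of token IDs, so the last step is to decode them with the tokenizer. tokenizer.batch_decode turns the generated token sequences back into human-readable text, and skip_special_tokens=True removes the control tokens. Note that generate returns the prompt tokens followed by the new tokens, which is why the decoded string in the output begins with the system and user text before the model's answer. Finally, the program prints the result, completing the path from the user's question to the model's reply.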
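The script stops after one exchange. Extending it to a multi-turn conversation mostly means keeping the history: after each reply, append it to messages as an "assistant" message before adding the next "user" message. The following is a minimal sketch of that bookkeeping, where chat and generate_fn are hypothetical names; generate_fn stands for any callable that takes the messages list and returns the reply text.

```python
def chat(generate_fn, user_inputs, system_prompt="You are a helpful assistant"):
    """Run several user turns in sequence, carrying the full history forward.

    `generate_fn` is a placeholder for whatever produces a reply from the
    messages list (for example, a wrapper around model.generate).
    """
    messages = [{"role": "system", "content": system_prompt}]
    replies = []
    for user_text in user_inputs:
        messages.append({"role": "user", "content": user_text})
        reply = generate_fn(messages)
        # Store the reply so the model sees it as context on the next turn
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return messages, replies


if __name__ == "__main__":
    # Stand-in generator that only reports how many messages it received
    history, answers = chat(lambda msgs: f"(saw {len(msgs)} messages)", ["Hi", "And then?"])
    for message in history:
        print(message["role"], ":", message["content"])
```

Since the whole history is rendered into the prompt on every turn, the prompt grows with the conversation, so long sessions may need older turns truncated or summarized.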
