06. AI Setup Preparation (transformers 02): Encoding Text with bert-base-chinese Using BertModel

news 2025/10/8 10:19:04

1. Download

google-bert/bert-base-chinese at main
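
The code later in this post loads the model from a local folder named ./bert-base-chinese. One possible way to fill that folder is sketched below. It assumes the huggingface_hub package is available (it is normally installed alongside transformers) and that the Hugging Face Hub, or a mirror of it, is reachable. The function name download_bert_base_chinese is made up for this illustration; downloading the files by hand from the page above works just as well.

```python
def download_bert_base_chinese(local_dir="./bert-base-chinese"):
    """Download the google-bert/bert-base-chinese repository into local_dir.

    Hypothetical helper for illustration; returns the path of the local folder.
    """
    # Imported lazily so that defining the function does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="google-bert/bert-base-chinese",  # repository named on the download page
        local_dir=local_dir,                      # folder the later code loads from
    )

# Usage (needs network access):
# path = download_bert_base_chinese()
# print(path)
```

The target folder matches the model_path used in section 4, so no path changes should be needed afterwards.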

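2. Introduction:

The model's main purpose is to produce a vector representation for each Chinese character. After fine-tuning, it can be applied to a variety of Simplified and Traditional Chinese tasks.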

3. Environment and hardware:

PyCharm: 2024

torch: 2.2.0+cu118

tensorflow: 2.6.0

python: 3.9

transformers: 4.32.0 (any 4.3x release should work); the mirror currently carries up to 4.50.0

Run a further check in the IDE:

import sys
import tensorflow as tf
import torch
import transformers

if __name__ == '__main__':
    print(sys.version)  # current Python version
    print(tf.test.is_built_with_cuda())  # whether this TensorFlow build was compiled with CUDA support
    print(tf.config.list_physical_devices('GPU'))  # GPUs visible to TensorFlow; an empty list means none were found
    print(torch.__version__)  # torch version
    x = torch.rand(5, 3)
    print(x)  # a simple torch operation
    print(torch.cuda.is_available())  # True if torch can use CUDA; False means CPU only
    print(transformers.__version__)  # transformers version

4. Hands-on, with explanations:

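import torch
from transformers import BertTokenizer, BertModel  # tokenizer and model classes

# Load the model and the tokenizer. from_pretrained() loads a pretrained model
# and its matching tokenizer, either from a local directory or, when given a
# model name, by downloading it first.
model_path = "./bert-base-chinese"  # path to the local model directory

tokenizer = BertTokenizer.from_pretrained(model_path)  # load the tokenizer
model = BertModel.from_pretrained(model_path)  # load the model


def encode_text_with_bert(text):
    """Encode text with bert-base-chinese and return the resulting tensor,
    which can be used in later machine learning or deep learning tasks."""
    # Encode the text with the tokenizer and drop the start and end markers ([CLS] and [SEP])
    encoded_text = tokenizer.encode(text)[1:-1]
    # Convert the list of token ids to a tensor with a batch dimension
    encoded_tensor = torch.tensor([encoded_text], dtype=torch.long)
    # Disable gradient computation
    with torch.no_grad():
        output = model(encoded_tensor)
    # Return the encoded tensor (last_hidden_state)
    return output[0]


if __name__ == '__main__':
    text1 = "床前明月光，"
    result = encode_text_with_bert(text1)
    print('Shape of the text1 encoding:', result.size())
    print('text1 encoding:\n', result)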
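
The returned tensor is laid out as (batch, tokens, hidden size). For "床前明月光，" the tokenizer should produce six tokens once [CLS] and [SEP] are dropped, and bert-base-chinese uses a hidden size of 768, so the printed shape should be torch.Size([1, 6, 768]). The sketch below shows one possible way to use such a tensor afterwards. It is not part of the original code: it uses a random stand-in tensor of the same shape so that it runs without the model files, and mean pooling is only one of several common ways to get a single vector for the whole text.

```python
import torch

# Stand-in for encode_text_with_bert("床前明月光，"): random values with the
# assumed layout (1 text, 6 tokens, hidden size 768). Not real BERT output.
torch.manual_seed(0)
hidden = torch.rand(1, 6, 768)

# One vector per character: drop the batch dimension.
char_vectors = hidden.squeeze(0)        # shape (6, 768)

# One vector for the whole text: average over the token dimension.
sentence_vector = hidden.mean(dim=1)    # shape (1, 768)

# Cosine similarity between the vectors of the first two characters.
similarity = torch.nn.functional.cosine_similarity(
    char_vectors[0], char_vectors[1], dim=0
)

print(char_vectors.shape, sentence_vector.shape, float(similarity))
```

With real model output, replace the stand-in with result from the code above; the rest stays the same.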