7.4-Creating data loaders for an instruction dataset
We just need to plug the InstructionDataset objects and the custom_collate_fn function into PyTorch data loaders.
First, we initialize the device setting with the following code:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Note:
# Uncommenting the following lines will allow the code to run on Apple Silicon
# chips, if applicable, which is much faster than on an Apple CPU
# (as measured on an M3 MacBook Air).
# However, the resulting loss values may be slightly different.

# if torch.cuda.is_available():
#     device = torch.device("cuda")
# elif torch.backends.mps.is_available():
#     device = torch.device("mps")
# else:
#     device = torch.device("cpu")

print("Device:", device)

"""Output"""
Device: cuda
```
Next, we use functools.partial to pre-set the device argument of custom_collate_fn to the device variable and allowed_max_length to 1024. This way, we no longer need to pass these two values manually whenever customized_collate_fn is called later.

```python
from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024
)
```
Next, we set up the data loaders, but this time we use our custom collate function for the batching process.
```python
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)
```
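custom_collate_fn itself was implemented in the previous section. For readers jumping in here, the following is a condensed sketch of what such a collate function can look like, assuming each dataset item is a plain Python list of token IDs; the name and details below are illustrative, not a verbatim copy of the earlier code:

```python
import torch

def collate_fn_sketch(batch, pad_token_id=50256, ignore_index=-100,
                      allowed_max_length=None, device="cpu"):
    # Pad to the longest sequence in the batch (+1 for the input/target shift)
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        padded = item + [pad_token_id] * (batch_max_length - len(item))
        inputs = torch.tensor(padded[:-1])   # all tokens except the last
        targets = torch.tensor(padded[1:])   # shifted right by one position
        # Keep the first <|endoftext|> token as a target; mask the rest with -100
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
        # Optionally truncate to the model's supported context length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return (torch.stack(inputs_lst).to(device),
            torch.stack(targets_lst).to(device))
```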
Let's take a look at what the dimensions of the input and target batches look like:
print("Train loader:") for inputs, targets in train_loader:print(inputs.shape, targets.shape)"""输出""" Train loader: torch.Size([8, 61]) torch.Size([8, 61]) torch.Size([8, 76]) torch.Size([8, 76]) ...... torch.Size([8, 69]) torch.Size([8, 69])
As these shapes show, all batches have a batch size of 8 but differ in length: [8, 61], for instance, means the batch contains 8 training examples, each padded to 61 tokens. Let's double-check that the inputs contain the <|endoftext|> padding tokens corresponding to token ID 50256 by printing the first training example of the inputs batch:

```python
print(inputs[0])

"""Output"""
tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,
          985,   576,    13,   198,   198, 21017, 23412,    25,   198,   464,
         5156,   318,   845, 13779,    13,   198,   198, 21017, 18261,    25,
          198,   464,  5156,   318,   355, 13779,   355,   257,  4936,    13,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
       device='cuda:0')
```
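As an additional sanity check, we can decode the token IDs back to text; this assumes tokenizer is the tiktoken GPT-2 tokenizer used throughout the chapter, whose decode method accepts a list of token IDs:

```python
# Convert the tensor back to a Python list and decode it to text;
# the trailing <|endoftext|> strings confirm the padding
print(tokenizer.decode(inputs[0].tolist()))
```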
Similarly, let's double-check that the targets contain the -100 placeholder tokens:

```python
print(targets[0])

"""Output"""
tensor([  318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,   257,
         2882,   326, 20431, 32543,   262,  2581,    13,   198,   198, 21017,
        46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,   985,
          576,    13,   198,   198, 21017, 23412,    25,   198,   464,  5156,
          318,   845, 13779,    13,   198,   198, 21017, 18261,    25,   198,
          464,  5156,   318,   355, 13779,   355,   257,  4936,    13, 50256,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
       device='cuda:0')
```
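Why -100 specifically? PyTorch's cross-entropy loss uses ignore_index=-100 by default, so any target position set to -100 is simply excluded from the loss. A minimal self-contained demonstration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5],
                       [2.0, 1.0, 0.5]])  # two identical predictions
targets_a = torch.tensor([0, 0])          # both positions contribute
targets_b = torch.tensor([0, -100])       # second position is ignored

print(F.cross_entropy(logits, targets_a))  # mean over two positions
print(F.cross_entropy(logits, targets_b))  # same value: the -100 entry is skipped
```

Because both rows of logits are identical, the two losses match exactly, confirming that the -100 position contributes nothing to the result.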