Paper notes (LLM distillation): Distilling Step-by-Step!
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
- 0. Abstract
- 1. Introduction
- 2. Related work
- 3. Distilling step-by-step
- 3.1. Extracting rationales from LLMs
- 3.2. Training smaller models with rationales
- 4. Experiments
- 4.1. Reducing training data
- 4.2. Reducing model size
- 4.3. Outperforming LLMs using minimum model size and least training data
- 4.4. Further ablation studies
- 5. Discussion
- Limitations
0. Abstract
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications.
Note: This points out the problems with deploying LLMs: low memory efficiency (many current LLMs have hundreds of billions of parameters) and heavy compute consumption (both training and inference require large amounts of matrix computation).
In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels.
Note: This states the current remedy, which is to substitute a smaller model for the LLM, trained either by finetuning on human-annotated labels or by knowledge distillation on LLM-generated labels.
However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs.
Note: The drawback of finetuning and knowledge distillation is that they require large amounts of training data.
We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so while using less training data than finetuning or distillation requires.
Note: This introduces the paper's method, Distilling step-by-step, and its two advantages: (a) the resulting small model outperforms the LLM; (b) it needs less training data than conventional finetuning or knowledge distillation.
Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework.
Note: This explains how the Distilling step-by-step method actually operates.
We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs.
Note: A summary of how the proposed method performs on the benchmarks.
Our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on a benchmark, whereas standard finetuning of the same T5 model struggles to match it even when using 100% of the dataset.
Note: Experimental numbers demonstrating the effectiveness of the proposed method.
1. Introduction
Despite the impressive few-shot ability offered by large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Smith et al., 2022b; Zhang et al., 2022), these models are challenging to deploy in real-world applications due to their sheer size. Serving a single 175-billion-parameter LLM requires at least 350GB of GPU memory using specialized infrastructure (Zheng et al., 2022). To make matters worse, today's state-of-the-art LLMs are composed of over 500B parameters (Chowdhery et al., 2022), requiring significantly more memory and compute. Such computational requirements are far beyond affordable for most product teams, especially for applications that require low latency.
To circumvent these deployment challenges of large models, practitioners often choose to deploy smaller specialized models instead. These smaller models are trained using one of two common paradigms: finetuning or distillation.
Finetuning updates a pretrained smaller model (e.g. BERT (Devlin et al., 2018) or T5 (Raffel et al., 2020)) using downstream human annotated data (Howard and Ruder, 2018).
Distillation trains the same smaller models with labels generated by a larger LLM (Tang et al., 2019; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).
Unfortunately, these paradigms reduce model size at a cost: to achieve comparable performance to LLMs, finetuning requires expensive human labels, and distillation requires large amounts of unlabeled data, which can be hard to obtain (Tang et al., 2019; Liang et al., 2020).
In this work, we introduce Distilling step-by-step, a new simple mechanism for training smaller models with less training data. Our mechanism reduces the amount of training data required for both finetuning and distillation of LLMs into smaller model sizes.
Core to our mechanism is changing our perspective from viewing LLMs as a source of noisy labels to viewing them as agents that can reason: LLMs can produce natural language rationales justifying their predicted labels (Wei et al., 2022; Kojima et al., 2022).
Note: This sentence summarizes the core idea of the paper's method.
For example, when asked "Jesse's room is 11 feet long and 15 feet wide. If she already has 16 square feet of carpet, how much more carpet does she need to cover the whole floor?", an LLM can be prompted with the chain-of-thought (CoT) technique (Wei et al., 2022) to provide an intermediate rationale, "Area = length × width. Jesse's room has 11 × 15 square feet.", that better connects the input to the final answer "(11 × 15) − 16". These rationales can contain relevant task knowledge, such as "Area = length × width", that may otherwise require large amounts of data for small task-specific models to learn.
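Plugging in the numbers makes the link between the rationale and the final answer explicit:

$$\text{Area} = 11 \times 15 = 165 \ \text{sq ft}, \qquad \text{carpet needed} = 165 - 16 = 149 \ \text{sq ft}.$$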
We thus utilize these extracted rationales as additional, richer information to train small models through a multi-task training setup, with both label prediction and rationale prediction tasks (Raffel et al., 2020; Narang et al., 2020).
因此,我们通过一个多任务训练框架,将这些提取出的推理过程(rationales)作为额外的、更丰富的信息,用于训练小型模型。该训练框架同时包括标签预测任务和推理过程预测任务(Raffel et al., 2020;Narang et al., 2020)。
Distilling step-by-step allows us to learn task-specific smaller models that outperform LLMs while using over 500× fewer model parameters, and it does so with far fewer training examples compared to traditional finetuning or distillation (Figure 1).
Note on Figure 1: the x-axis is the amount of training data required and the y-axis is accuracy on the task; circle size indicates model parameter count. Distilling step-by-step needs less data than traditional finetuning and knowledge distillation while reaching higher task accuracy, and its models have far fewer parameters than the LLM.
Our results show three promising empirical conclusions across 4 NLP benchmarks. First, compared to both finetuning and distillation, our resulting models achieve better performance with over 50% fewer training examples on average across datasets (and up to an over 85% reduction). Second, our models outperform LLMs with much smaller model sizes (up to 2000× smaller), drastically reducing the computation cost required for model deployment. Third, we simultaneously reduce the model size as well as the amount of data required to outperform LLMs. We surpass the performance of a 540B-parameter LLM using a 770M T5 model; this smaller model only uses 80% of a labeled dataset that would otherwise be required with an existing finetuning method. When only unlabeled data is present, our small models still perform on par with or better than LLMs. We outperform 540B PaLM's performance with only an 11B T5 model. We further show that when a smaller model performs worse than an LLM, Distilling step-by-step can leverage additional unlabeled data more efficiently than the standard distillation approach to match the LLM's performance.
2. Related work
Our work distills task-specific knowledge of LLMs into smaller specialist models by leveraging the emergent reasoning capabilities of today's LLMs. We draw on knowledge distillation research and methods that learn from both human-generated rationales and LLM-generated rationales.
Note: An overview of the lines of research this work builds on.
Knowledge distillation from large models.
Knowledge distillation has been successfully used to transfer knowledge from larger, more competent teacher models into smaller student models affordable for practical applications (Buciluǎ et al., 2006; Hinton et al., 2015; Beyer et al., 2022; West et al., 2021; Fu et al., 2023).
It supports learning from limited labeled data, since the larger teacher model is often used to generate a training dataset with noisy pseudo labels (Chen et al., 2020; Iliopoulos et al., 2022; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022; Agrawal et al., 2022). The one limitation that knowledge distillation often faces is its reliance on large amounts of unlabelled data required to create a useful noisy training dataset.
Note: Pseudo labels are predictions produced on unannotated data by an already-trained or more capable model; that data, together with the generated labels, is then used to train another model.
Although prior work has explored using data augmentation techniques to reduce this hunger for data (Tang et al., 2019; Liang et al., 2020; Srinivas and Fleuret, 2018; Milli et al., 2019), we propose an alternative approach: we reduce the need for large unlabeled data by distilling not just labels but also the teacher’s rationales.
Learning with human rationales.
While utilizing LLM-generated rationales is a new and exciting area of investigation, using human-generated rationales has a rich history (Hase and Bansal, 2021). For instance, human rationales can be used to regularize model behavior (Ross et al., 2017); as additional inputs to guide a model's predictions (Rajani et al., 2019); to improve overall model performance (Zaidan et al., 2007; Zhang et al., 2016; Camburu et al., 2018; Hancock et al., 2019; Pruthi et al., 2022); and as gold-standard labels that make models more interpretable by having them generate similar rationales (Wiegreffe et al., 2021; Narang et al., 2020; Eisenstein et al., 2022). Unfortunately, human rationales are expensive.
Learning with LLM-generated rationales.
Today's LLMs are capable of explaining their predictions by generating high-quality reasoning steps (Wei et al., 2022; Kojima et al., 2022). These reasoning steps have been used to augment input prompts to LLMs, improving their few-shot or zero-shot performance (Wei et al., 2022; Kojima et al., 2022; Wang et al., 2022b); reasoning steps have also been used as additional finetuning data to "self-improve" LLMs (Zelikman et al., 2022; Huang et al., 2022). Unfortunately, regardless of how LLMs are improved, their large size limits their utility in most test-time applications.
By contrast, we leverage generated rationales as informative supervision to train smaller task-specific models, i.e. models that can be deployed without incurring large computation or memory costs.
Several concurrent works have also proposed a similar idea to ours – that of using extracted rationales as supervision (Wang et al., 2022a; Ho et al., 2022; Magister et al., 2022; Li et al., 2023). Amongst them, PINTO (Wang et al., 2022a) relies on an LLM to generate rationales at test-time, and thus does not fully solve deployment challenges. Compared with Ho et al. (2022) and Magister et al. (2022), we go beyond their experiments to provide a granular study by varying training dataset size, exploring downstream model sizes, and demonstrating the effectiveness of our method on fully unlabeled datasets.
3. Distilling step-by-step
We propose a new paradigm, Distilling step-by-step, that leverages the ability of LLMs to reason about their predictions to train smaller models in a data-efficient way. Our overall framework is illustrated in Figure 2.
Our paradigm has two simple steps: First, given an LLM and an unlabeled dataset, we prompt the LLM to generate output labels along with rationales to justify the labels. Rationales are natural language explanations that provide support for the model’s predicted label (see Figure 2). Second, we leverage these rationales in addition to the task labels to train smaller downstream models. Intuitively, rationales provide richer, more detailed information about why an input is mapped to a specific output label, and often contain relevant task knowledge that may be hard to infer solely from the original inputs.
Note: Two steps: first prompt the LLM to generate labels and rationales for an unlabeled dataset; then use those labels and rationales together to train a smaller downstream model.
3.1. Extracting rationales from LLMs
Recent studies observe one intriguing emergent property of LLMs: their ability to generate rationales that support their predictions (Wei et al., 2022; Kojima et al., 2022). While those studies largely focus on how to elicit such reasoning capability from LLMs (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022), we use the generated rationales to train smaller downstream models.
Specifically, we utilize Chain-of-Thought (CoT) prompting (Wei et al., 2022) to elicit and extract rationales from LLMs. As illustrated in Figure 3, given an unlabeled dataset $x_i \in D$, we first curate a prompt template $p$ that articulates how the task should be solved. Each prompt is a triplet $(x_p, r_p, y_p)$, where $x_p$ is an example input, $y_p$ is its corresponding label, and $r_p$ is a user-provided rationale that explains why $x_p$ can be categorized as $y_p$. We append each input $x_i$ to $p$ and use it as an input to prompt the LLM to generate rationales and labels for each $x_i \in D$. With the demonstrations seen in $p$, the LLM is able to mimic the triplet demonstration to generate the rationale $\hat{r}_i$ and output $\hat{y}_i$ for $x_i$.
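The paper itself does not show code for this step, but the mechanics are simple to sketch. Below is a minimal Python illustration; `llm_generate` is a hypothetical stand-in for whatever text-completion API is used, and the "So the answer is" delimiter is an assumed demonstration format rather than the paper's exact template:

```python
# Minimal sketch of rationale extraction via few-shot CoT prompting.
# Hypothetical: `llm_generate` stands in for any LLM text-completion API,
# and "So the answer is" is an assumed rationale/label delimiter.

def build_prompt(demonstrations, x_i):
    """Format (x_p, r_p, y_p) demonstration triplets, then append the new input x_i."""
    parts = [
        f"Q: {x_p}\nA: {r_p} So the answer is {y_p}."
        for x_p, r_p, y_p in demonstrations
    ]
    parts.append(f"Q: {x_i}\nA:")
    return "\n\n".join(parts)

def extract_rationale_and_label(demonstrations, x_i, llm_generate):
    """Prompt the LLM once and split its completion into (r_hat, y_hat)."""
    completion = llm_generate(build_prompt(demonstrations, x_i))
    # The LLM mimics the demonstrations, so the completion should look like
    # "<rationale> So the answer is <label>."
    rationale, _, label = completion.partition("So the answer is")
    return rationale.strip(), label.strip(" .\n")
```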
3.2. Training smaller models with rationales
We first describe the current framework for learning task-specific models. With this framework in place, we extend it to incorporate rationales into the training process.
Formally, we denote a dataset as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ represents an input and $y_i$ is the corresponding desired output label. While our framework supports inputs and outputs of any modality, our experiments limit $x$ and $y$ to natural language. This text-to-text framework (Raffel et al., 2020) encompasses a variety of NLP tasks: classification, natural language inference, question answering, and more.
Standard finetuning and task distillation.
The most common practice for training a task-specific model is to finetune a pretrained model with supervised data (Howard and Ruder, 2018). In the absence of human-annotated labels, task-specific distillation (Hinton et al., 2015; Tang et al., 2019) uses an LLM teacher to generate noisy pseudo training labels $\hat{y}_i$ in place of $y_i$ (Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).
For both scenarios, the smaller model $f$ is trained to minimize the label prediction loss:
$$L_{label} = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), \hat{y}_i) \tag{1}$$
where $\ell$ is the cross-entropy loss between the predicted and target tokens. Note that, for ease of exposition, we overload $\hat{y}_i$ in Eq. 1 to denote either human-annotated labels $y_i$ in the standard finetuning case, or LLM-predicted labels in the model distillation case.
Multi-task learning with rationales.
To create a more explicit connection between the $x_i$'s and the $\hat{y}_i$'s, we use extracted rationales $\hat{r}_i$ as additional supervision. There are several ways to incorporate rationales into the downstream model's training process.
One straightforward approach is to feed $\hat{r}_i$ as an additional input, as proposed by other concurrent research (Rajani et al., 2019; Wang et al., 2022a). In other words, $f(x_i, \hat{r}_i) \rightarrow \hat{y}_i$ is trained with both text and rationale $[x_i, \hat{r}_i]$ as input:
$$L_{label} = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i, \hat{r}_i), \hat{y}_i) \tag{2}$$
Unfortunately, this design requires an LLM to first generate a rationale before $f$ can make a prediction. The LLM is still necessary during deployment, which limits deployability.
In this work, instead of using rationales as additional model inputs, we frame learning with rationales as a multi-task problem. Specifically, we train the model $f(x_i) \rightarrow (\hat{y}_i, \hat{r}_i)$ to not only predict the task labels but also generate the corresponding rationales given the text inputs:
$$L = L_{label} + \lambda L_{rationale} \tag{3}$$
where $L_{label}$ is the label prediction loss in Eq. 1 and $L_{rationale}$ is the rationale generation loss:
$$L_{rationale} = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), \hat{r}_i) \tag{4}$$
The rationale generation loss enables the model to learn to generate the intermediate reasoning steps for the prediction, and can therefore guide the model toward better predicting the resulting label. This is our proposed Distilling step-by-step. Compared with Eq. 2, the rationale $\hat{r}_i$ is not required at test time, which removes the need for an LLM during deployment.
We prepend "task prefixes" ([label], [rationale]) to the input examples and train the smaller model to output $\hat{y}_i$ when [label] is provided, and to produce $\hat{r}_i$ when [rationale] is provided (Raffel et al., 2020).
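To make the multi-task setup concrete, here is a minimal sketch of Eq. 3 with task prefixes, assuming a Hugging Face T5 as the small downstream model; this is an illustration of the setup described above, not the paper's released training code:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Sketch of Eq. 3: L = L_label + lambda * L_rationale, with task prefixes
# selecting which target the model is trained to produce.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
lam = 1.0  # weight on the rationale-generation loss (a hyperparameter)

def seq2seq_loss(prefix: str, x: str, target: str) -> torch.Tensor:
    """Token-level cross-entropy of generating `target` from `prefix + x` (Eqs. 1 and 4)."""
    enc = tokenizer(prefix + x, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    # T5 computes the cross-entropy internally when `labels` is provided.
    return model(**enc, labels=labels).loss

def multitask_loss(x: str, y_hat: str, r_hat: str) -> torch.Tensor:
    l_label = seq2seq_loss("[label] ", x, y_hat)          # predict the label
    l_rationale = seq2seq_loss("[rationale] ", x, r_hat)  # generate the rationale
    return l_label + lam * l_rationale                    # Eq. 3
```

At test time, only the "[label]" prefix is used, so no rationale is generated and no LLM is involved.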
Note: Comparing the two designs:
- Approach 1 (rationale as input): concatenate the LLM-generated rationale $\hat{r}_i$ onto the original input $x_i$ to form the model's new input.
- Approach 2 (Distilling step-by-step, multi-task): train the model to generate the rationale itself and predict the label, with no rationale as input.
The key difference is that approach 1 depends on an LLM-generated rationale at inference time, so the LLM must still be deployed; this defeats the original purpose of distillation, which is to replace the large model with a small one.
4. Experiments
We empirically validate the effectiveness of Distilling step-by-step. First, we show that, compared to standard finetuning and task distillation, Distilling step-by-step achieves better performance with far fewer training examples, substantially improving the data efficiency of learning small task-specific models (Sec. 4.1). Second, we show that Distilling step-by-step surpasses the performance of LLMs with much smaller model sizes, drastically lowering the deployment cost compared to LLMs (Sec. 4.2). Third, we investigate the minimum resources required, w.r.t. both the number of training examples and model size, for Distilling step-by-step to outperform LLMs. We show that Distilling step-by-step outperforms LLMs using less data and smaller models, simultaneously improving both data efficiency and deployability (Sec. 4.3). Finally, we perform ablation studies to understand the influence of different components and design choices in the Distilling step-by-step framework (Sec. 4.4).
Setup. In the experiments, we consider the 540B PaLM model (Chowdhery et al., 2022) as the LLM. For task-specific downstream models, we use T5 models (Raffel et al., 2020), where we initialize the models with pretrained weights obtained from publicly available sources. For CoT prompting, we follow Wei et al. (2022) when available, and curate our own examples for new datasets. We include more implementation details in Appendix A.1.
Datasets. We conduct the experiments on 4 popular benchmark datasets across 3 different NLP tasks: e-SNLI (Camburu et al., 2018) and ANLI (Nie et al., 2020) for natural language inference; CQA (Talmor et al., 2019; Rajani et al., 2019) for commonsense question answering; SVAMP (Patel et al., 2021) for arithmetic math word problems. We include more dataset details in Appendix A.2.
4.1. Reducing training data
Note: The experiments here show that Distilling step-by-step can match the traditional methods' results while substantially reducing the training data used, which requires comparison against standard supervised finetuning and standard knowledge distillation.
We compare Distilling step-by-step to the two most common methods for learning task-specific models: (1) STANDARD FINETUNING when human-labeled examples are available, and (2) STANDARD TASK DISTILLATION when only unlabeled examples are available. Specifically, standard finetuning refers to the prevailing pretrain-then-finetune paradigm that finetunes a model with ground-truth labels via standard label supervision (Howard and Ruder, 2018). On the other hand, when only unlabeled examples are available, standard task distillation learns the task-specific model by treating a teacher LLM's predicted labels as ground truth (Hinton et al., 2015; Chen et al., 2020; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).
In the following set of experiments, we fix the task-specific models to be 220M T5-Base models, and compare the task performance achieved by different methods under varying numbers of available training examples.
Distilling step-by-step outperforms standard finetuning with far fewer labeled examples. When finetuned with human-labeled examples, Figure 4 shows that Distilling step-by-step consistently achieves better performance than standard finetuning across varying numbers of labeled examples. Furthermore, Distilling step-by-step reaches the same performance as standard finetuning with far fewer labeled examples. In particular, by using only 12.5% of the full e-SNLI dataset, Distilling step-by-step can outperform standard finetuning trained with 100% of the full dataset. Similarly, we achieve 75%, 25%, and 20% reductions in the training examples required to outperform standard finetuning on ANLI, CQA, and SVAMP, respectively.
Distilling step-by-step outperforms standard distillation with far fewer unlabeled examples. When only unlabeled data is available, we compare Distilling step-by-step to standard task distillation. In Figure 5, we observe an overall trend similar to the finetuning setup: Distilling step-by-step outperforms standard task distillation on all 4 datasets under varying amounts of unlabeled data. We also see that Distilling step-by-step requires much less unlabeled data to outperform standard task distillation. For instance, we need only 12.5% of the full unlabeled dataset to exceed the performance achieved by standard task distillation using 100% of the training examples on the e-SNLI dataset.
4.2. Reducing model size
In the following set of experiments, we hold the training set size fixed (using 100% of the datasets), and compare varying sizes of small T5 models trained with Distilling step-by-step and standard approaches to LLMs. Specifically, we consider 3 different sizes of T5 models, i.e., 220M T5-Base, 770M T5-Large, and 11B T5-XXL. For LLMs, we include two baseline methods: (1) FEW-SHOT COT (Wei et al., 2022), and (2) PINTO TUNING (Wang et al., 2022a). Few-shot CoT directly utilizes CoT demonstrations to prompt the 540B PaLM to generate intermediate steps before predicting the final labels without any further finetuning of the LLM. PINTO tuning refers to our extension of Wang et al. (2022a) to handle tasks beyond question-answering, which are not studied by Wang et al. (2022a). Here, we finetune a 220M T5-Base model on top of the outputs generated from the PaLM model, which can be viewed as a finetuning method for LLMs with additional parameters (Zhang et al., 2020; Lester et al., 2021).
We present the experimental results under the two broad scenarios of having access to labeled datasets or unlabeled datasets in Figure 6 and Figure 7, respectively. We plot each method by their deployed model sizes for prediction (x-axis), and their corresponding task performances (y-axis).
Distilling step-by-step improves over standard baselines across varying model sizes used. In Figure 6 and Figure 7 respectively, we see that Distilling step-by-step consistently improves over standard finetuning and standard distillation across all sizes of T5 models. The improvements are most pronounced on ANLI, where Distilling step-by-step outperforms standard finetuning and distillation by an average of 8% and 13% on task accuracy respectively.
Distilling step-by-step outperforms LLMs by using much smaller task-specific models. In Figure 6, when human-labeled datasets are available, Distilling step-by-step always outperforms Few-shot CoT and PINTO tuning on all 4 datasets considered, using much smaller T5 models. For instance, we achieve better performance than the 540B PaLM model's Few-shot CoT with a 220M (over 2000× smaller) T5 model on e-SNLI, 770M (over 700× smaller) T5 models on ANLI and SVAMP, and an 11B (over 45× smaller) T5 model on CQA. These results hold even when further finetuning the 540B PaLM model on available labeled data with PINTO tuning. In Figure 7, using only unlabeled examples, Distilling step-by-step also outperforms the teacher LLM on 3 of the 4 datasets. Specifically, Distilling step-by-step surpasses the 540B PaLM model's Few-shot CoT performance with an 11B T5 model, less than 3% of PaLM's size. On SVAMP, where the distilled model underperforms, we hypothesize that the gap is due to the relatively small number of data points in the dataset (i.e., 800). In response, we propose augmenting the dataset with additional unlabeled examples to close the performance gap, as shown next.
Unlabeled data augmentation further improves Distilling step-by-step. We augment the SVAMP training set with unlabeled examples from the ASDiv dataset (Miao et al., 2020). ASDiv contains a total of 2,305 examples, where each example is a math word problem similar to those in SVAMP. In Figure 7, on SVAMP, we show the performance of Distilling step-by-step and standard task distillation using the 11B T5 model after augmenting the training set with ASDiv. The data augmentation much improves performance for both Distilling step-by-step and standard task distillation. However, even with the added unlabeled examples, standard task distillation still underperforms Few-shot CoT. Distilling step-by-step, on the other hand, exploits the value of the added examples much more efficiently, reaching the performance level of Few-shot CoT, again using a T5 model less than 3% the size of the 540B PaLM.
4.3. Outperforming LLMs using minimum model size and least training data
Note: This section establishes the minimum training data and model size at which the small model obtained via Distilling step-by-step can surpass the LLM.
4.4. Further ablation studies
Specifically, we study (1) how different LLMs, from which the rationales are extracted, affect the effectiveness of Distilling step-by-step, and (2) how the multi-task training approach compares to other potential design choices in training small task-specific models with LLM rationales. Here, we fix the small task-specific models to be 220M T5 models, and utilize 100% of the data on all datasets.
Distilling step-by-step works with different sizes of decently trained LLMs. In addition to using 540B PaLM as the LLM, here we consider a relatively smaller LLM, 20B GPT-NeoX model (Black et al., 2022), from which we extract rationales for Distilling step-by-step. In Table 1, we see that when coupled with LLMs of different sizes, Distilling step-by-step can still provide performance improvements compared to standard finetuning. However, the performance lift is smaller when rationales are extracted from the 20B GPT-NeoX model instead of from the 540B PaLM. This can be due to the fact that the larger PaLM model provides higher-quality rationales that are more beneficial for learning the task.
Multi-task training is much more effective than single-task rationale and label joint prediction. There are different possible ways to train task-specific models with LLM rationales as output supervision. One straightforward approach is to concatenate the rationale $\hat{r}_i$ and label $\hat{y}_i$ into a single sequence $[\hat{r}_i, \hat{y}_i]$ and treat the entire sequence as the target output when training small models, as considered in (Magister et al., 2022; Ho et al., 2022):
$$L = \frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), [\hat{r}_i, \hat{y}_i]) \tag{5}$$
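To make the contrast with the multi-task design concrete, here is a small sketch of how the training targets differ under the two schemes (the "So the answer is" glue text is an assumed format, not the paper's exact template):

```python
# How one (x, r_hat, y_hat) triple becomes training data under the two designs.

def single_task_example(x: str, r_hat: str, y_hat: str):
    # Eq. 5: one training pair whose target is the concatenation [r_hat, y_hat],
    # so label prediction is entangled with rationale generation.
    return (x, f"{r_hat} So the answer is {y_hat}.")

def multi_task_examples(x: str, r_hat: str, y_hat: str):
    # Distilling step-by-step: two training pairs distinguished by task prefixes,
    # so the label can be predicted without generating the rationale first.
    return [("[label] " + x, y_hat), ("[rationale] " + x, r_hat)]
```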
In Table 2, we compare this single-task training approach to our proposed multi-task training approach for utilizing LLM rationales. We see that not only does multi-task training consistently lead to better performance, but single-task training with LLM rationales can at times lead to worse performance than standard finetuning, e.g., on ANLI and CQA. In fact, similar results have been observed in (Wiegreffe et al., 2021; Magister et al., 2022; Ho et al., 2022): simply treating rationale and label prediction as a single joint task may harm the model's performance on label prediction. This validates our use of the multi-task training approach, and highlights the need to treat the rationales carefully so as to unleash their actual benefits.
5. Discussion
We propose Distilling step-by-step, which extracts rationales from LLMs as informative supervision for training small task-specific models. We show that Distilling step-by-step reduces the training data required to curate task-specific smaller models; it also reduces the model size required to achieve, and even surpass, the original LLM's performance. Distilling step-by-step thus offers a more resource-efficient training-to-deployment paradigm than existing methods. Further studies demonstrate the generalizability of Distilling step-by-step and examine its design choices. Finally, we discuss the limitations, future directions, and ethics statement of our work below.
Limitations
There are a number of limitations to our approach. First, we require users to produce a few example demonstrations (∼10-shot for all tasks) in order to use the few-shot CoT prompting mechanism (Wei et al., 2022). This limitation can be mitigated by recent advances suggesting that rationales can be elicited without any user-annotated demonstrations (Kojima et al., 2022). Second, training task-specific models with rationales incurs a slight training-time computation overhead. However, at test time, our multi-task design naturally avoids this overhead, since it allows one to predict labels without generating the rationales. Finally, while we observe success using LLM rationales, there is evidence that LLMs exhibit limited reasoning capability on more complex reasoning and planning tasks (Valmeekam et al., 2022). Future work should characterize how rationale quality affects Distilling step-by-step.