Self-Instruct - Aligning Language Model with Self Generated Instructions

This post summarizes the key points of the paper “Self-Instruct: Aligning Language Model with Self Generated Instructions” (2022.12)

Full paper

Self-Instruct: Aligning Language Model with Self Generated Instructions: https://arxiv.org/abs/2212.10560

Open Source

Self-Instruct GitHub: https://github.com/yizhongw/self-instruct

Description

Goal

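Reduce LLMs' heavy reliance on human-annotated data → let an LLM take over the annotation work "semi-automatically", an idea similar in spirit to teacher-student training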

Contributions

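Stanford Alpaca was fine-tuned from LLaMA 7B on a dataset that OpenAI's text-davinci-003 model generated with the self-instruct method
After the vanilla GPT3 model was fine-tuned on data it generated automatically through self-instruct, it improved on the original GPT3 by 33% and performed on par with InstructGPT_001, which was trained on large amounts of annotated data (only a 5% gap)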

Methodology

Self-instruct is a semi-automatic process: instruction signals obtained from an LLM are used to instruction-tune a pretrained LM

Step 1: Start from a seed task list, usually human-annotated, and prompt a strong LLM to generate new tasks so the task list keeps growing

Example prompt for generating new tasks with an LLM

Come up with a series of tasks:
Task 1: Given my personality and the job, tell me if I would be suitable.
Task 2: Replace the placeholders in the given text with appropriate named entities.
Task 3: Which exercises are best for reducing belly fat at home?
Task 4:
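
The prompt above can be assembled programmatically. The following is a minimal sketch, not the authors' implementation: the function name `build_task_prompt`, the `num_examples` default of 3, and the sampling strategy are assumptions chosen to reproduce the layout of the example prompt.

```python
import random


def build_task_prompt(seed_tasks, num_examples=3, rng=None):
    """Format sampled tasks as numbered lines and leave the next slot open.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if rng is None:
        rng = random.Random()
    # Sample in-context examples from the current task list.
    examples = rng.sample(seed_tasks, k=min(num_examples, len(seed_tasks)))
    lines = ["Come up with a series of tasks:"]
    for i, task in enumerate(examples, start=1):
        lines.append(f"Task {i}: {task}")
    # The open slot is what the LLM is expected to complete.
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)


seed_tasks = [
    "Given my personality and the job, tell me if I would be suitable.",
    "Replace the placeholders in the given text with appropriate named entities.",
    "Which exercises are best for reducing belly fat at home?",
]
task_prompt = build_task_prompt(seed_tasks, rng=random.Random(0))
print(task_prompt)
```

The completion returned by the LLM would then be parsed into one or more new tasks and passed on to the filtering step.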

Step 2: For each generated task, first run a binary classification to decide which kind of prompt to use, then use a strong LLM to generate input-output pairs

Example prompt for the binary classification check with an LLM

Can the following task be regarded as a classification task with finite output labels?
  
Task: Given my personality and the job, tell me if I would be suitable.
Is it classification?
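
As a rough illustration of how this check could be wired up, the sketch below builds the prompt and interprets the model's reply. Both function names are hypothetical, and the assumption that the LLM answers with a leading "Yes" or "No" is mine, not something stated in the post.

```python
def build_classification_prompt(task):
    """Build the binary 'is this a classification task?' prompt shown above."""
    return (
        "Can the following task be regarded as a classification task "
        "with finite output labels?\n\n"
        f"Task: {task}\n"
        "Is it classification?"
    )


def parse_is_classification(completion):
    """Assumed answer format: the completion starts with 'Yes' or 'No'."""
    return completion.strip().lower().startswith("yes")


clf_prompt = build_classification_prompt(
    "Given my personality and the job, tell me if I would be suitable."
)
print(clf_prompt)
print(parse_is_classification(" Yes"))
```

The boolean result decides which of the two instance-generation prompts is used next.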

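[Case 1] Use a strong LLM to generate input-output pairs for a classification task (output-first: the class label is fixed before the input is generated, so the results do not skew toward a single label)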

Given the classification task definition and the class labels, generate an input that
corresponds to each of the class labels. If the task doesn’t require input, just generate the
correct class label.
  
Task: Classify the sentiment of the sentence into positive, negative, or mixed.
Class label: mixed
Sentence:

[Case 2] Use a strong LLM to generate input-output pairs for a non-classification task (input-first)

Come up with examples for the following tasks. Try to generate multiple examples when possible.
If the task doesn’t require additional input, you can generate the output directly.
  
Task: Which exercises are best for reducing belly fat at home?
Output:
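
The two cases can be read as one dispatch on the classification result. Below is a minimal sketch under stated assumptions: `build_instance_prompt` and the `input_name` parameter are hypothetical, and the non-classification branch simply mirrors the example above, where the task needs no extra input and the prompt ends with `Output:`.

```python
CLASSIFICATION_HEADER = (
    "Given the classification task definition and the class labels, "
    "generate an input that\n"
    "corresponds to each of the class labels. If the task doesn't require "
    "input, just generate the\n"
    "correct class label.\n\n"
)
GENERATION_HEADER = (
    "Come up with examples for the following tasks. "
    "Try to generate multiple examples when possible.\n"
    "If the task doesn't require additional input, "
    "you can generate the output directly.\n\n"
)


def build_instance_prompt(task, is_classification, label=None, input_name="Input"):
    """Pick the output-first or input-first prompt (illustrative helper)."""
    if is_classification:
        if label is None:
            raise ValueError("output-first prompting needs a class label")
        # Output-first: the label is given, the LLM fills in a matching input.
        return f"{CLASSIFICATION_HEADER}Task: {task}\nClass label: {label}\n{input_name}:"
    # Input-first: mirrors the no-input example, so the LLM writes the output.
    return f"{GENERATION_HEADER}Task: {task}\nOutput:"


case1 = build_instance_prompt(
    "Classify the sentiment of the sentence into positive, negative, or mixed.",
    is_classification=True,
    label="mixed",
    input_name="Sentence",
)
case2 = build_instance_prompt(
    "Which exercises are best for reducing belly fat at home?",
    is_classification=False,
)
print(case1)
print(case2)
```

Calling the classification branch once per class label gives at least one example for every label.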

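Step 3: Build a Task Pool as the training source for instruction tuning, and apply several filters to remove low-quality or duplicate input-output pairs
- To keep the data diverse, each newly generated instruction is compared against the Task Pool and is added only if ROUGE-L(existing entry, new entry) < 0.7 for every existing entry
- To keep the data diverse, drop instances that share the same input but have different outputs
- To keep the data useful (processable by an LLM), drop instructions that contain certain keywords (e.g., images, pictures, graphs)
- The ROUGE-L metric
  - ROUGE-L is a metric originally designed to evaluate the quality of text summaries produced by NLP systems; here it serves as a similarity score between two texts.
  - ROUGE-L is computed from the Longest Common Subsequence (LCS).
  - Both texts are split into tokens (words), and the LCS algorithm measures how much of the candidate's token sequence matches the reference's token sequence in order; the ROUGE-L score is derived from that overlap.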
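
The ROUGE-L and keyword filters above can be sketched in a few lines. This is an illustrative implementation, not the repository's code: the word-level tokenizer, the choice of the F1 variant of ROUGE-L, the exact keyword set, and the helper names are all assumptions; only the 0.7 threshold and the example keywords come from the text.

```python
import re

BLOCKED_WORDS = {"image", "images", "picture", "pictures", "graph", "graphs"}


def tokenize(text):
    """Lowercase word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand) if ref and cand else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def should_add(new_instruction, pool, threshold=0.7):
    """Accept only instructions that pass the keyword and similarity filters."""
    if BLOCKED_WORDS & set(tokenize(new_instruction)):
        return False
    return all(rouge_l_f1(old, new_instruction) < threshold for old in pool)


pool = ["Classify the sentiment of the sentence into positive, negative, or mixed."]
candidates = [
    "Classify the sentiment of the sentence into positive or negative.",
    "Translate the following sentence into French.",
    "Describe the picture in one sentence.",
]
kept = [c for c in candidates if should_add(c, pool)]
print(kept)
```

Matching whole words instead of substrings keeps an instruction such as "Write a paragraph about cats." from being rejected just because "paragraph" contains "graph".
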
Step 4: Assemble each input-output pair into a prompt and feed it to the model for fine-tuning, with some randomness in the prefix
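
One way to picture the randomized prefix is to pick a template at random for every training example. The sketch below is hypothetical throughout: the three templates, the `encode_example` name, and the prompt/completion record layout are assumptions for illustration and are not taken from the paper or its repository.

```python
import random

# Hypothetical prefix templates: (instruction part, input part, output prefix).
TEMPLATES = [
    ("Task: {instruction}\n", "Input: {input}\n", "Output:"),
    ("{instruction}\n", "{input}\n", "Output:"),
    ("Instruction: {instruction}\n", "Input: {input}\n", "Answer:"),
]


def encode_example(instruction, input_text, output, rng):
    """Turn one input-output pair into a prompt/completion training record."""
    instruction_t, input_t, output_prefix = rng.choice(TEMPLATES)
    prompt = instruction_t.format(instruction=instruction)
    if input_text:  # Tasks without an input skip the input line entirely.
        prompt += input_t.format(input=input_text)
    prompt += output_prefix
    return {"prompt": prompt, "completion": " " + output}


rng = random.Random(0)
record = encode_example(
    "Classify the sentiment of the sentence into positive, negative, or mixed.",
    "The food was great but the service was slow.",
    "mixed",
    rng,
)
print(record["prompt"])
print(record["completion"])
```

Varying the prefix this way is meant to keep the fine-tuned model from depending on one fixed prompt format.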

A high-level overview of SELF-INSTRUCT

Conclusion

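This paper proposes a "semi-automatic" way to generate instructions, using an LLM to produce the training data
Perhaps because the LLM used at the time (text-davinci-001) was not strong enough, a manual check of sampled self-instruct data found an error rate as high as 46%
WizardLM was probably strongly inspired by this paper and improves on how the data is processed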

Aspect	GPT (self-instruct)	WizardLM
Instruction generation	text-davinci-001	text-davinci-003
Data filtering	ROUGE-L + lots of rules	LLM + lots of rules