2023 LLM Industry-Academia Technical Exchange Conference
These are lecture notes from the 2023 LLM Industry-Academia Technical Exchange Conference.
Keynote (Academia Sinica Academician 孔祥重 & MediaTek Vice President 林宗瑤)
Embedding vectors
- Large Language Models: text → Embedding
- Multimodal Models: (text, image) → text and image embeddings
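Below is a minimal sketch, assuming Hugging Face `transformers` and the public `openai/clip-vit-base-patch32` checkpoint (my choice of model, not one named in the talk), of how a multimodal model maps text and an image into a shared embedding space:

```python
# Hedged sketch: embed a caption and an image with CLIP into the same vector space.
# Assumes `pip install torch transformers pillow` and access to the public checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
inputs = processor(text=["an orange cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

text_emb = out.text_embeds    # shape (1, 512)
image_emb = out.image_embeds  # shape (1, 512)
# Cosine similarity between the two modalities in the shared embedding space
print(torch.nn.functional.cosine_similarity(text_emb, image_emb).item())
```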
Tackled by old master
- Forward Knowledge: Causes → Observed Effects
- Abundant data obtained through observation
- Inverse problems: Observed Effects (e.g., Defect Image) → Causes
- Difficult tasks & sparse data
- With Forward Knowledge, the work can be divided into three tasks
- Input Known or novel?
- Cluster seen images into embedding space
- Form medoid-text-prompt-defined classes
- Classify the input or create a new class
- Incorporate new classes into knowledge base
- Fine-tune the model (e.g., CLIP)
- Methods
- Static classifier (before 2021)
- Costly training & predict only seen classes
- Dynamic classifier (Recent)
- Pre-trained vision-language model (e.g., CLIP) & text-prompt-defined downstream classifier for any target classes
Sensitive to prompt details
- Chosen prompts may be misaligned with image class distributions
- Text Prompt Example
- “Orange cat wearing bowtie”: tie (20%), cat & tie (80%)
- “Orange cat wearing a bowtie”: cat & tie (100%)
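A hedged sketch of a text-prompt-defined (dynamic) classifier in the spirit of the example above: the class distribution is just a softmax over image-prompt similarities, so it shifts with prompt wording. The prompts, image path, and checkpoint are illustrative, not the speaker's exact setup:

```python
# Zero-shot classification with text-prompt-defined classes; compare two prompt wordings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("orange_cat_with_bowtie.jpg")  # placeholder path

for prompts in (["a bowtie", "orange cat wearing bowtie"],
                ["a bowtie", "orange cat wearing a bowtie"]):
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, num_prompts)
    probs = logits.softmax(dim=-1)[0]              # prompt-defined class probabilities
    print(dict(zip(prompts, probs.tolist())))
```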
Challenges of moving LLMs to the edge
- Analytic AI vs. Generative AI
- Parameter count: <10M vs. >1000M
- Inference compute: 1-10s of TOPS vs. 100s-1000s of TOPS
- e.g., LLaMA-7B needs about 40 TOPS for 512 words/sec
- Bandwidth: 70 GB/sec for 10 words/sec (rough arithmetic sketched below)
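Rough arithmetic behind the bandwidth figure, assuming every weight must be streamed once per generated token and roughly one byte per weight (my assumptions; the talk only quoted the final numbers):

```python
# Memory-bandwidth estimate for on-device LLaMA-7B decoding (toy arithmetic, not a benchmark).
params = 7e9                 # LLaMA-7B weight count
bytes_per_weight = 1         # assume ~INT8 storage; FP16 would double this
tokens_per_sec = 10          # target generation speed

bytes_per_token = params * bytes_per_weight          # every weight is read once per token
bandwidth = bytes_per_token * tokens_per_sec / 1e9   # GB/s
print(f"~{bandwidth:.0f} GB/s")                      # ~70 GB/s, matching the slide's figure
```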
- SW/HW co-optimization
- Method
- Pruning: take advantage of sparsity
- Quantization: enable low bit-widths, from FP32 down to INT4
- Compression: reduce the memory footprint; weights are decompressed on-the-fly in the APU
- Benefit
- >60% memory footprint and access reduction
- >3X performance improvement
- 3 token/sec → 10 token/sec
- Drawback
- Quality loss is an issue
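A toy sketch of the FP32 → INT4 quantization step using naive per-tensor symmetric rounding; production flows use per-group scales plus the on-the-fly APU decompression mentioned above, so this only illustrates the idea (and the source of the quality loss):

```python
# Naive symmetric INT4 quantization of FP32 weights (values clipped to [-8, 7]).
import numpy as np

def quantize_int4_symmetric(w: np.ndarray):
    scale = np.abs(w).max() / 7.0                             # map the largest magnitude to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit values stored in int8
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int4_symmetric(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```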
Invited Talks (NYCU Professor 陳添福 & NCU Professor 蔡宗翰 & NTU Professor 李宏毅)
LLM
- LLM Pipeline: LLM → Fine-tune → Optimization → Deployment
- Open LLM
- Falcon-7B
- GPU usage: ~15GB
- Training data: 1.5T tokens
- Extra technology: FlashAttention and multi-query attention
- LLaMA 2-7B
- GPU usage: ~10GB
- Training data: 2.0T tokens
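For reference, a hedged sketch of loading one of these open models in FP16 with Hugging Face `transformers`; the gated meta-llama checkpoint and generation settings are my assumptions. FP16 weights cost about 2 bytes per parameter, so a 7B model needs roughly 14 GB for weights alone:

```python
# Minimal sketch: load an open 7B chat model in FP16 and generate a short reply.
# Assumes access to the gated meta-llama weights on the Hub and `pip install accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # ~2 bytes per parameter
    device_map="auto",           # let accelerate place weights on the available GPU(s)
)

inputs = tokenizer("Explain multi-query attention in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```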
- TAIDE (Trustworthy AI Dialogue Engine)
- Dataset
- Training Dataset (3.1B tokens)
- rm-static-zh
- alpaca-zh by NTU
- 教育部國語辭典 (Ministry of Education Mandarin Chinese Dictionary)
- reliable_source_news
- oots_zh_wiki
- 科技大擂台_訓練資料集 & 測試資料集 (Formosa Grand Challenge training & test sets)
- Formosa Language Understanding Dataset (FLUD)
- Fine-tuning dataset (42w ≈ 420k examples)
- Model
- LLaMA2-13B-Chat → CP (continued pre-training, 3.1B tokens) → fine-tune (42w) → Taide-LLaMA2-13B-Chat
- Method
- multi-node training
- DeepSpeed
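The talk only named multi-node training with DeepSpeed; the following is a generic sketch of wrapping a model via `deepspeed.initialize` under a ZeRO config. All config values, the optimizer choice, and the model id are placeholders, not TAIDE's actual settings:

```python
# Generic DeepSpeed setup sketch; launch with e.g. `deepspeed --num_gpus=8 train.py`
# (multi-node runs additionally use a hostfile). Assumes access to the gated checkpoint.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {"stage": 2},   # ZeRO stage is a placeholder choice
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Inside the training loop (pseudo-batch):
# loss = engine(batch["input_ids"], labels=batch["labels"]).loss
# engine.backward(loss)
# engine.step()
```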
- Custom LLM == foundation model + custom fine-tuning data + target deployment scenario
- LLM + fine-tuning (PEFT) → Custom LLM + SFT data → Optimization → Deployment
- Training LLM efficiently
- LoRA (Low-Rank Adaptation)
- (IA)^3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
- UniPEFT (Unified PEFT)
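A hedged sketch of the LoRA route using the `peft` library; the base model, rank, and target modules are illustrative defaults rather than values from the talk:

```python
# LoRA fine-tuning setup: freeze the base model, train small low-rank adapters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                     # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style blocks
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count
# `model` can now go into a normal Trainer / training loop; only the adapters get gradients.
```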
- Compress LLM
- Knowledge distillation (a minimal loss sketch follows this list)
- Attention-Guided Distillation: force the student to pay more attention to what the teacher focuses on
- Intermediate representation Distillation: student learns teacher’s inference process and output distribution
- Pruning
- LLM-Pruner
- ZipLM
- Wanda
- Sparsity
- SparseGPT
- Sparse-Quantized Representation (SpQR)
- SqueezeLLM
- Low-precision inference (FP16, BF16)
- Quantization
- INT8, INT4 by PTQ, QAT
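A minimal sketch of the basic response-based distillation loss underlying these methods (attention-guided and intermediate-representation variants add extra matching terms); the temperature and weighting are illustrative:

```python
# Response-based knowledge distillation loss (soft targets with temperature).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL to the teacher's softened distribution."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy shapes: batch of 4, vocabulary of 10
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```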
- MoE (mixture-of-experts)
- Benefit
- Good for model parallelism
- Better on Knowledge-heavy tasks
- Drawback
- Worse on reasoning tasks
- slower at transferring knowledge
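A toy top-2 mixture-of-experts layer, sketched to show why MoE suits model parallelism: the gate routes each token to only a few experts, which can live on different devices. Dimensions and gating details are illustrative:

```python
# Minimal top-2 mixture-of-experts layer: a learned gate picks 2 experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize the kept gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topv[mask, slot, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```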
Speech ChatGPT
- Speech LM
- Unlike today's popular decoder-only LLM architecture, a decoder-only model cannot align speech effectively for speech tasks (e.g., different intonations should map to the same class), so an encoder-decoder architecture is used
- Text prompts enable a variety of speech classification tasks; the prompt can also be replaced with embeddings learned via gradient descent (see the sketch below)
- There is some transfer ability to unseen labels, but it is limited
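A hedged sketch of the learned-prompt idea: trainable prompt embeddings are prepended to a frozen encoder's input and optimized by gradient descent. The `nn.TransformerEncoder` backbone and all dimensions are stand-ins, not the actual speech model:

```python
# Soft-prompt tuning sketch: learn prompt vectors while keeping the backbone frozen.
import torch
import torch.nn as nn

d_model, prompt_len, n_classes = 256, 8, 5

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True), num_layers=2
)
for p in backbone.parameters():          # freeze the (pretrained-like) encoder
    p.requires_grad = False

soft_prompt = nn.Parameter(torch.randn(1, prompt_len, d_model) * 0.02)  # trainable "prompt"
classifier = nn.Linear(d_model, n_classes)
optimizer = torch.optim.Adam([soft_prompt, *classifier.parameters()], lr=1e-3)

speech_features = torch.randn(4, 50, d_model)     # placeholder acoustic features (B, T, D)
labels = torch.randint(0, n_classes, (4,))

# One training step: prepend the soft prompt, pool, classify, update only prompt + head.
optimizer.zero_grad()
x = torch.cat([soft_prompt.expand(4, -1, -1), speech_features], dim=1)
logits = classifier(backbone(x).mean(dim=1))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```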
Conclusion
- A major highlight of this talk was the first public release of training details for TAIDE, the self-proclaimed "Taiwanese ChatGPT": Meta's latest LLaMA2-13B-Chat model was trained for 100 days on dozens of V100 GPUs to obtain Taide-LLaMA2-13B-Chat, with the following claims
- Claimed to be a Taiwan-localized model (because the average probability it assigns to Traditional Chinese is higher than the average it assigns to Simplified Chinese)
- Claimed not to forget English after learning Chinese, unlike Taiwan-LLaMA2 (thanks to a large amount of ChatGPT-generated Chinese-English alignment data)
- Claimed to be a clean model that, unlike Taiwan-LLaMA2, will not answer drug-related questions (because Taiwan-LLaMA2 picked up assorted pornographic and loan-advertisement content from the zh_TW_c4 pre-training data)
- Claimed that a commercial model, TAIDE-LLaMA2-C, will be released in the future