Fine-tuning Large Models: Alpaca-LoRA

Preface

I spent two weeks fine-tuning LLaMA-7B with LoRA. This post records and shares the problems I ran into and sums up what I learned.

Introduction to LLaMA

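LLaMA, short for Large Language Model Meta AI, is a series of large language models developed and released by Meta. Many later models, such as Alpaca and Vicuna, were fine-tuned from LLaMA.
LLaMA currently has two main versions: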

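LLaMA-1 (released February 2023): four main parameter sizes, 7B, 13B, 33B, and 65B, trained on up to 1.4T tokens.
LLaMA-2 (released July 2023): three main parameter sizes, 7B, 13B, and 70B, plus chat fine-tuned versions (LLaMA-2-Chat). The training corpus was expanded to 2T tokens, with improved data quality and training methods. Both the pretrained and the fine-tuned weights were released under a license that permits research and commercial use, subject to Meta's license terms.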

Introduction to LoRA

to do…

Fine-tuning Dataset
Instruct Data

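Fine-tuning mainly uses Instruct data. Each Instruct record has three parts: instruction, input, and output. The input part may be empty, while the instruction and output parts are required.
The Instruct dataset used for this fine-tuning run is alpaca-cleaned, which was obtained by cleaning and filtering alpaca_data.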

Dataset link: https://huggingface.co/datasets/yahma/alpaca-cleaned
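To make the record format concrete, here is a minimal sketch of how one Instruct record could be validated and rendered into a training prompt. The prompt wording follows the template commonly used with Alpaca-style data and is an assumption here, not necessarily the exact text used in this run; the example record and the helper names `validate_record` and `build_prompt` are hypothetical.

```python
from typing import Dict

# Assumed Alpaca-style prompt templates (not confirmed by this post).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def validate_record(record: Dict[str, str]) -> None:
    """Raise ValueError if a required field is missing or empty."""
    for field in ("instruction", "output"):
        if not record.get(field, "").strip():
            raise ValueError(f"missing required field: {field}")


def build_prompt(record: Dict[str, str]) -> str:
    """Render one record into a prompt; the `input` field is optional."""
    validate_record(record)
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Hypothetical record, made up for illustration.
example = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
print(build_prompt(example) + example["output"])
```

During training, the prompt and the output are concatenated into one sequence, which is why the output is appended in the last line.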

Data Distribution

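alpaca_data was generated by GPT3 (specifically OpenAI's text-davinci-003, according to the Stanford Alpaca release) using a framework called Self-Instruct, which produces each piece of Instruct data semi-automatically. The dataset contains 52K pieces of Instruct data in total; the figure of 52K instructions with 82K instances is the one reported for the original Self-Instruct dataset.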

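The main distribution of the data is shown in the figure below:

The pie chart shows that the Instruct data generated by a large model such as GPT3 is fairly diverse, and the generation quality is also decent. So how exactly is it generated?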

Self-Instruct

The Self-Instruct pipeline is shown in the figure below. It consists of four main steps:

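Step 1: Instruction Generation
Step 2: Determine whether the instruction generated by the LM is a classification task
Step 3: Generate the corresponding instances: Output-first for classification tasks, Input-first for non-classification tasks
Step 4: Filtering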

Instruction Generation

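First, a batch of data is defined by hand (175 seed tasks). Each task is simply text with a fixed set of fields: id, name, instruction, instances (each with an input, which may be empty, and an output), and is_classification. Here is an example: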

{"id": "seed_task_0", "name": "breakfast_suggestion", "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?", "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."}], "is_classification": false}

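The 175 hand-written tasks are first placed into a task pool. In each iteration, 8 instructions are drawn from the pool: 6 are human-written and the remaining 2 were generated by the language model in earlier iterations.
In the first iteration there are no generated instructions yet, so all 8 are human-written.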

Finally, as shown in the figure above, these 8 tasks are fed to GPT3, which generates new tasks from the 8 examples (few-shot).
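The sampling rule described above (8 in-context instructions, at most 2 of them model-generated, with a fallback to all human-written ones in the first iteration) can be sketched as follows. This is an illustrative sketch, not the original Self-Instruct code; the function name `sample_prompt_instructions` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def sample_prompt_instructions(
    human_written: List[str],
    machine_generated: List[str],
    n_total: int = 8,
    n_machine: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick the in-context instructions for one generation step."""
    rng = rng or random.Random()
    # Use as many generated instructions as are available, up to n_machine.
    k_machine = min(n_machine, len(machine_generated))
    picked = rng.sample(machine_generated, k_machine)
    # Fill the remaining slots with human-written seed instructions.
    picked += rng.sample(human_written, n_total - k_machine)
    rng.shuffle(picked)
    return picked


seeds = [f"seed instruction {i}" for i in range(175)]
first_round = sample_prompt_instructions(seeds, [], rng=random.Random(0))
later_round = sample_prompt_instructions(
    seeds, ["gen a", "gen b", "gen c"], rng=random.Random(0)
)
print(len(first_round), sum(s.startswith("gen") for s in later_round))
```

In the first round the generated pool is empty, so all 8 slots come from the seed tasks; in later rounds exactly 2 slots are taken by generated instructions.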

Classification

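This step uses GPT3 to identify whether a generated instruction is a classification task, because classification and non-classification tasks are handled with two different methods in the next step.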

Can the following task be regarded as a classification task with finite output labels?
 Task: Given my personality and the job, tell me if I would be suitable.
 Is it classification? Yes
 Task: Give me an example of a time when you had to use your sense of humor.
 Is it classification? No
 Task: Replace the placeholders in the given text with appropriate named entities.
 Is it classification? No
 Task: Fact checking- tell me if the statement is true, false, or unknown, based on your
 knowledge and common sense.
 Is it classification? Yes
 ......
 Task: To make the pairs have the same analogy, write the fourth word.
 Is it classification? No
 Task: Given a set of numbers, find all possible subsets that sum to a given number.
 Is it classification? No
 Task: {instruction for the target task}

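As in the Instruction Generation step, the prompt template given to GPT3 is few-shot: the input contains many examples, and GPT3 is then asked to judge whether the generated instruction is a classification task.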
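A minimal sketch of how such a few-shot classification prompt could be assembled and how the model's answer could be read back. The trailing "Is it classification?" cue after the target task and the helper names are assumptions for illustration; the call to the language model itself is left out.

```python
from typing import List, Tuple

HEADER = (
    "Can the following task be regarded as a classification task "
    "with finite output labels?"
)


def build_classification_prompt(
    examples: List[Tuple[str, bool]], target_instruction: str
) -> str:
    """Join labeled few-shot examples and append the unlabeled target task."""
    lines = [HEADER]
    for task, is_classification in examples:
        lines.append(f"Task: {task}")
        lines.append(f"Is it classification? {'Yes' if is_classification else 'No'}")
    lines.append(f"Task: {target_instruction}")
    lines.append("Is it classification?")  # assumed cue for the model to complete
    return "\n".join(lines)


def parse_answer(completion: str) -> bool:
    """Interpret the model's completion as a Yes/No answer."""
    return completion.strip().lower().startswith("yes")


few_shot = [
    ("Given my personality and the job, tell me if I would be suitable.", True),
    ("Give me an example of a time when you had to use your sense of humor.", False),
]
prompt = build_classification_prompt(few_shot, "Detect if the tweet is sarcastic.")
print(prompt)
print(parse_answer(" Yes"))
```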

Instance Generation

This step generates instances for the instructions produced above. Classification and non-classification tasks use two different prompt construction methods.

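Input-first: mainly for generation tasks. The input X of the instance is generated first, followed by the corresponding output Y.
Output-first: mainly for classification tasks. Given the instruction, the output Y (the class label) is generated first, and then the input. This avoids the one-sided results that arise when the input is generated first and the output second. Put simply, when positive samples dominate, it keeps the model from blindly labeling every input as positive.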

The Input-first prompt template is as follows:

Come up with examples for the following tasks. Try to generate multiple examples when possible.
 If the task doesn’t require additional input, you can generate the output directly.
 Task: Converting 85 F to Celsius.
 Output: 85°F = 29.44°C
 Task: Sort the given list ascendingly.
 Example 1
 List: [10, 92, 2, 5,-4, 92, 5, 101]
 Output: [-4, 2, 5, 5, 10, 92, 92, 101]
 Example 2
 Input 2- List: [9.99, 10,-5,-1000, 5e6, 999]
 Output: [-1000,-5, 9.99, 10, 999, 5e6]
 Task: Suggest a better and more professional rephrasing of the following sentence.
 Example 1
 Sentence: This house is surprisingly not constructed very well, and you probably need more
 money to fix it after you buy it. If you ask me, I would suggest you to consider other
 candidates.
 Output: This house does not seem to be constructed well, so you may need to spend more money
 to fix it after you purchase it. I would suggest that you look at other properties.
 Example 2
 Sentence: Just so you know, we did an experiment last week and found really surprising results- language model can improve itself!
 Output: Our experiments last week demonstrated surprising results, proving that the language
 model can improve itself.
 ⋯
  Task: {Instruction for the target task}

The Output-first prompt template is as follows:

Given the classification task definition and the class labels, generate an input that
 corresponds to each of the class labels. If the task doesn’t require input, just generate the
 correct class label.
 Task: Classify the sentiment of the sentence into positive, negative, or mixed.
 Class label: mixed
 Sentence: I enjoy the flavor of the restaurant but their service is too slow.
 Class label: Positive
 Sentence: I had a great day today. The weather was beautiful and I spent time with friends.
 Class label: Negative
 Sentence: I was really disappointed by the latest superhero movie. I would not recommend it.
 Task: Given a dialogue, classify whether the user is satisfied with the service. You should
 respond with "Satisfied" or "Unsatisfied".
 Class label: Satisfied
 Dialogue:- Agent: Thank you for your feedback. We will work to improve our service in the future.- Customer: I am happy with the service you provided. Thank you for your help.
 Class label: Unsatisfied
 Dialogue:- Agent: Sorry that we will cancel your order. You will get a refund within 7 business days.- Customer: oh that takes too long. I want you to take quicker action on this.
 Task: Detect if the Reddit thread contains hate speech.
 Class label: Hate Speech
 Thread: All people of color are stupid and should not be allowed to vote.
 Class label: Not Hate Speech
 Thread: The best way to cook a steak on the grill.
 Task: Does the document supports the claim? Answer with "Support" or "Unsupport".
 Class label: Unsupport
 Document: After a record-breaking run that saw mortgage rates plunge to all-time lows and
 home prices soar to new highs, the U.S. housing market finally is slowing. While demand and
 price gains are cooling, any correction is likely to be a modest one, housing economists and
 analysts say. No one expects price drops on the scale of the declines experienced during the
 Great Recession.
 Claim: The US housing market is going to crash soon.
 Class label: Support
 Document: The U.S. housing market is showing signs of strain, with home sales and prices
 slowing in many areas. Mortgage rates have risen sharply in recent months, and the number
 of homes for sale is increasing. This could be the beginning of a larger downturn, with some
 economists predicting a potential housing crash in the near future.
 Claim: The US housing market is going to crash soon.
 ⋯
 Task: {instruction for the target task}
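With the Output-first template, the model's completion alternates between a class label and the input generated for that label. The sketch below shows one way such a completion could be turned back into (input, output) instances. It assumes each input fits on a single line of the form "Name: text", which does not hold for multi-line documents like some of the examples above; the function name and the sample completion are hypothetical.

```python
from typing import Dict, List


def parse_output_first(completion: str) -> List[Dict[str, str]]:
    """Pair each 'Class label:' line with the input line that follows it."""
    instances = []
    label = None
    for raw_line in completion.splitlines():
        line = raw_line.strip()
        if line.lower().startswith("class label:"):
            label = line.split(":", 1)[1].strip()
        elif label is not None and ":" in line:
            # e.g. "Sentence: ..." -> keep only the text after the field name.
            text = line.split(":", 1)[1].strip()
            instances.append({"input": text, "output": label})
            label = None
    return instances


sample_completion = (
    "Class label: mixed\n"
    "Sentence: I enjoy the flavor of the restaurant but their service is too slow.\n"
    "Class label: Positive\n"
    "Sentence: I had a great day today.\n"
)
for instance in parse_output_first(sample_completion):
    print(instance)
```

Because the label is fixed before the input is written, every class in the task definition gets at least one example, which is the point of generating the output first.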

Filtering

This step filters out low-quality Instruction data generated by GPT3. The filtering criteria are:

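A newly generated instruction is added to the task pool only if its ROUGE-L similarity with every instruction already in the pool is below 0.7;
Instructions that language models cannot handle are excluded, such as those involving images, pictures, or graphs;
When generating instances for an instruction, instances that are exact duplicates are filtered out.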

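ROUGE-L here is a metric for the similarity between a newly generated instruction and the instructions in the task pool. A score below 0.7 means the new instruction is not too similar to any existing one, which keeps the generated data diverse.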
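To illustrate the similarity filter, here is a simplified ROUGE-L sketch based on the longest common subsequence (LCS) of whitespace-separated, lowercased tokens. It is an assumption-laden stand-in: the original pipeline may use a library implementation with different tokenization, so exact scores can differ. The function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """Keep the candidate only if it is below the threshold against every pool entry."""
    return all(rouge_l_f1(candidate, existing) < threshold for existing in pool)


pool = ["Sort the given list ascendingly.", "Converting 85 F to Celsius."]
print(is_novel("Sort the given list ascendingly.", pool))
print(is_novel("Write a poem about autumn.", pool))
```

The first candidate duplicates a pool entry and is rejected; the second shares no tokens with the pool and is kept.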

That is the overall Self-Instruct pipeline. Through repeated iteration it eventually produces 52K pieces of Instruction data.

Training

Downloading LLaMA-7B

Fine-tuning first requires downloading the model. LLaMA-7B download link: https://huggingface.co/baffo32/decapoda-research-llama-7B-hf

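The downloaded model consists of 33 shards. Because the model is large, its weights are usually not saved as one huge file but split into multiple shard files, each holding part of the weights. This keeps individual files at a manageable size and lets them be downloaded and loaded one at a time. The model is split into multiple .bin files, for example: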

pytorch_model-00001-of-00033.bin
pytorch_model-00002-of-00033.bin

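This approach makes storage easier to manage and can lower peak memory use during loading, since shards can be read one after another, which matters for large models. Sharded storage also makes it easier to load the model across multiple devices or servers for distributed training and inference.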
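Sharded Hugging Face checkpoints normally ship with an index file (typically `pytorch_model.bin.index.json`) whose `weight_map` entry records which shard holds each parameter. The sketch below groups parameter names by shard. The `toy_index` dictionary is a made-up miniature for illustration; the real file lists every parameter and the actual assignment to shards will differ.

```python
from collections import defaultdict
from typing import Dict, List


def group_weights_by_shard(index: Dict) -> Dict[str, List[str]]:
    """Invert the weight_map: shard file name -> parameter names it stores."""
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    return dict(shards)


# Hypothetical miniature index; in practice, load the real one with json.load().
toy_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "pytorch_model-00001-of-00033.bin",
        "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00002-of-00033.bin",
        "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00002-of-00033.bin",
    },
}

for shard_file, params in sorted(group_weights_by_shard(toy_index).items()):
    print(shard_file, len(params))
```

A loader can walk through this mapping shard by shard, so only one shard's tensors need to be held in memory while they are copied into the model.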

Implementation Details

Since I was working on a Slurm cluster, I used the following script for multi-GPU training.

#!/bin/bash
#SBATCH --job-name=alpaca
#SBATCH -p qdagexclu10
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --gres=gpu:4
#SBATCH -o alpaca_offical.log

source /work/home/shiyan_dong/miniconda3/etc/profile.d/conda.sh
conda activate alpaca-lora

export PYTHONUNBUFFERED=1
torchrun --nproc_per_node 4 finetune.py

My fine-tuning hyperparameters were:

# model/data params
base_model: str = "./llama-7b",  # the only required argument
data_path: str = "alpaca_cleaned/alpaca_data_cleaned.json",
output_dir: str = "./lora-alpaca",
# training hyperparams
batch_size: int = 128,
micro_batch_size: int = 32,
num_epochs: int = 10,
learning_rate: float = 3e-4,
cutoff_len: int = 512,
val_set_size: int = 2000,
# lora hyperparams
lora_r: int = 16,
lora_alpha: int = 16,
lora_dropout: float = 0.05,
lora_target_modules: List[str] = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
],
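A few quantities follow directly from these settings, and the arithmetic is sketched below. Two assumptions go into it: that finetune.py derives gradient accumulation the way the upstream alpaca-lora script does (batch_size // micro_batch_size, divided again by the number of processes under DDP), and the LLaMA-7B dimensions (hidden size 4096, 32 decoder layers), which are not stated in this post.

```python
# Values taken from the hyperparameters and launch script above.
batch_size = 128
micro_batch_size = 32
world_size = 4  # torchrun --nproc_per_node 4
lora_r = 16
lora_alpha = 16
num_target_modules = 4  # q_proj, k_proj, v_proj, o_proj

# Assumed LLaMA-7B dimensions (not stated in the post).
hidden_size = 4096
num_layers = 32

# Assumed upstream alpaca-lora logic for gradient accumulation under DDP.
gradient_accumulation_steps = batch_size // micro_batch_size // world_size

# LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# Each adapted projection gets A (r x in) and B (out x r); here in == out.
params_per_module = lora_r * (hidden_size + hidden_size)
trainable_params = params_per_module * num_target_modules * num_layers

print(gradient_accumulation_steps)  # 1
print(lora_scaling)                 # 1.0
print(trainable_params)             # 16777216
```

Under these assumptions each GPU processes one micro-batch of 32 per optimizer step, and the adapter adds roughly 16.8M trainable parameters.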

I fine-tuned on 4 A800 80GB GPUs for 10 epochs, which took 18 hours 32 minutes in total.
My loss curve is shown below:

Inference

The training above produces LoRA fine-tuned weight files that look like this:

Next, use these weights to run inference as a check:

import os
import sys
import torch
import transformers
from peft import PeftModel
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

from utils.callbacks import Iteratorize, Stream
from utils.prompter import Prompter


if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

try:
    if torch.backends.mps.is_available():
        device = "mps"
except:  # noqa: E722
    pass


def main(
    load_8bit: bool = False,
    base_model: str = "./llama-7b",
    lora_weights: str = "./lora-alpaca",
    prompt_template: str = "",
    temperature: float = 0.1,
    top_p: float = 0.75,
    top_k: int = 40,
    num_beams: int = 4,
    max_new_tokens: int = 128,
    stream_output: bool = False,
):
    base_model = base_model or os.environ.get("BASE_MODEL", "")
    assert (
        base_model
    ), "Please specify a --base_model, e.g. --base_model='huggyllama/llama-7b'"

    prompter = Prompter(prompt_template)
    tokenizer = LlamaTokenizer.from_pretrained(base_model)

    if device == "cuda":
        model = LlamaForCausalLM.from_pretrained(
            base_model,
            load_in_8bit=load_8bit,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = PeftModel.from_pretrained(
            model,
            lora_weights,
            torch_dtype=torch.float16,
        )
    elif device == "mps":
        model = LlamaForCausalLM.from_pretrained(
            base_model,
            device_map={"": device},
            torch_dtype=torch.float16,
        )
        model = PeftModel.from_pretrained(
            model,
            lora_weights,
            device_map={"": device},
            torch_dtype=torch.float16,
        )
    else:
        model = LlamaForCausalLM.from_pretrained(
            base_model, device_map={"": device}, low_cpu_mem_usage=True
        )
        model = PeftModel.from_pretrained(
            model,
            lora_weights,
            device_map={"": device},
        )

    model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
    model.config.bos_token_id = 1
    model.config.eos_token_id = 2

    if not load_8bit:
        model.half()

    model.eval()
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)

    def evaluate(
        instruction,
        input=None,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        max_new_tokens=128,
        stream_output=False,
        **kwargs,
    ):
        prompt = prompter.generate_prompt(instruction, input)
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            **kwargs,
        )

        generate_params = {
            "input_ids": input_ids,
            "generation_config": generation_config,
            "return_dict_in_generate": True,
            "output_scores": True,
            "max_new_tokens": max_new_tokens,
        }

        if stream_output:
            def generate_with_callback(callback=None, **kwargs):
                kwargs.setdefault(
                    "stopping_criteria", transformers.StoppingCriteriaList()
                )
                kwargs["stopping_criteria"].append(
                    Stream(callback_func=callback)
                )
                with torch.no_grad():
                    model.generate(**kwargs)

            def generate_with_streaming(**kwargs):
                return Iteratorize(
                    generate_with_callback, kwargs, callback=None
                )

            with generate_with_streaming(**generate_params) as generator:
                for output in generator:
                    decoded_output = tokenizer.decode(output)

                    if output[-1] in [tokenizer.eos_token_id]:
                        break

                    yield prompter.get_response(decoded_output)
            return

        # Without streaming
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
        s = generation_output.sequences[0]
        output = tokenizer.decode(s)
        yield prompter.get_response(output)

    print("Model loaded. Entering interactive mode. Type 'exit' to quit.")

    while True:
        instruction = input("\nEnter instruction (or exit to quit):\n> ")
        if instruction.strip().lower() == "exit":
            print("Exiting.")
            break
        input_text = input("Enter input (may be left empty):\n> ")
        if input_text.strip().lower() == "exit":
            print("Exiting.")
            break
        print("\nGenerating...\n")

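        if stream_output:
            # Each streamed response contains all text generated so far,
            # so print only the part that has not been shown yet.
            shown = ""
            for response in evaluate(
                instruction,
                input=input_text if input_text else None,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                num_beams=num_beams,
                max_new_tokens=max_new_tokens,
                stream_output=True,
            ):
                if response.startswith(shown):
                    print(response[len(shown):], end="", flush=True)
                else:
                    # Earlier text changed (possible with beam search): reprint.
                    print("\n" + response, end="", flush=True)
                shown = response
            print()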
        else:
            for response in evaluate(
                instruction,
                input=input_text if input_text else None,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                num_beams=num_beams,
                max_new_tokens=max_new_tokens,
                stream_output=False,
            ):
                print(response)
                
if __name__ == "__main__":
    import fire
    fire.Fire(main)