[ggerganov/llama.cpp] llama : add llm_build helper functions

Code reuse via functions

Factor some of the common code out into separate functions:

[X] llm_build_inp_embd()
[X] llm_build_norm()
[X] llm_build_ffn()
[X] llm_build_k_shift()
[X] llm_build_kv_store()
[X] llm_build_qkv()

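Tensor offloading improvements

All of this is temporary, because we will soon integrate a new backend implementation that should handle tensor offloading automatically, so these changes can be seen as a preparatory step for migrating to the new interface.

Added a sanity check for offloading. Look for the following messages in the output: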

llama_new_context_with_model: kv self size  =    2.75 MB
llama_build_graph: non-view tensors processed: 510/510                              <--- this is good
llama_new_context_with_model: compute buffer total size = 22.75 MB

llama_new_context_with_model: kv self size  =  256,00 MB
llama_build_graph: non-view tensors processed: 708/740                              <--- this is bad
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 76,66 MB

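The latter indicates that some tensors in the graph have not been processed with the callback function. This can lead to inefficiency during inference and can be debugged by enabling LLAMA_OFFLOAD_DEBUG: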

https://github.com/ggerganov/llama.cpp/blob/fc5a26aadea54e2bcf6dd384e1ca0c846575bc0c/llama.cpp#L5075-L5077
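
The idea behind the check can be pictured roughly as follows. This is a minimal, self-contained sketch and not the llama.cpp implementation: `toy_tensor`, `offload_stats`, `count_processed`, and `report` are hypothetical names, and the `is_view` / `processed` flags stand in for whatever bookkeeping the real graph builder keeps.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; this is NOT the real ggml_tensor.
struct toy_tensor {
    std::string name;
    bool is_view;    // views share data with another tensor, so they are skipped
    bool processed;  // set once the build callback has seen this tensor
};

struct offload_stats {
    int n_non_view  = 0;
    int n_processed = 0;
    bool ok() const { return n_processed == n_non_view; }
};

// Count how many non-view tensors were handled by the callback.
inline offload_stats count_processed(const std::vector<toy_tensor> & graph) {
    offload_stats st;
    for (const auto & t : graph) {
        if (t.is_view) {
            continue;
        }
        st.n_non_view++;
        if (t.processed) {
            st.n_processed++;
        }
    }
    return st;
}

// Print a short report in the spirit of the log lines quoted above.
inline void report(const offload_stats & st) {
    std::printf("non-view tensors processed: %d/%d\n", st.n_processed, st.n_non_view);
    if (!st.ok()) {
        std::printf("not all non-view tensors have been processed with a callback\n");
    }
}
```

With this sketch, a graph in which every non-view tensor was visited reports matching counts (the "good" case), while any unvisited tensor makes `ok()` return false (the "bad" case).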

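Some observations

The Persimmon Q, K, V implementation is very cumbersome. There is a good chance that something is not right.

ALiBi models might not work well with the K-shift (this also happens on master):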

# MPT-7B, small context of 256 to trigger shift more often, starts generating junk after first shift
make -j main && ./main -m ./models/mpt-7b/ggml-model-q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -c 256 -n 512 -t 8 -ngl 999 -s 1

ggerganov

Have you considered a more object-oriented approach? Basically, create classes like this and then compose the models from the objects.

struct llm_norm {
    float eps;
    tensor weight;
    tensor bias;
    tensor forward(x) {
        return norm(x, eps) * weight + bias;
    }
};

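To be clear, I know this is nothing new; it is essentially the same as what PyTorch and other frameworks do. It would be a bigger refactor, but it could lead to a more modular and more easily extensible design.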

slaren

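Yes, I can see the benefits. We would initialize the modules at model load time, and the graph would then be a very short sequence of forward calls.

I think we can do this later: once these helper functions are in place, forward(x) can simply call them with the struct's members. So the effort spent on this refactor would likely not be wasted, and it would be a first step toward the object-oriented approach.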
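
A minimal sketch of that layering, assuming plain `std::vector<float>` data instead of ggml graph tensors: `build_norm` and `norm_module` are hypothetical names chosen for illustration, and the helper here computes an ordinary layer normalization rather than building graph nodes. The input is assumed to be non-empty.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using vec = std::vector<float>;

// Free helper function: layer normalization followed by scale and shift.
// Hypothetical stand-in for a graph-building helper; it works on plain floats.
inline vec build_norm(const vec & x, const vec & weight, const vec & bias, float eps) {
    assert(!x.empty() && weight.size() == x.size() && bias.size() == x.size());
    const float n = static_cast<float>(x.size());

    float mean = 0.0f;
    for (float v : x) {
        mean += v;
    }
    mean /= n;

    float var = 0.0f;
    for (float v : x) {
        var += (v - mean) * (v - mean);
    }
    var /= n;

    const float inv_std = 1.0f / std::sqrt(var + eps);
    vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = (x[i] - mean) * inv_std * weight[i] + bias[i];
    }
    return out;
}

// Thin module: owns the parameters and forwards to the helper.
struct norm_module {
    float eps;
    vec   weight;
    vec   bias;

    vec forward(const vec & x) const {
        return build_norm(x, weight, bias, eps);
    }
};
```

The struct adds no logic of its own, which is the point made above: the helper does the work and the object only binds parameters to it.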

ggerganov

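That is more like PyTorch. They have a functional API that does the actual work, and those functions can then be called statically or through class members, etc., which makes them more composable.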

monatis

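I tested by running perplexity on a few models:

Orca 3B - no problem, the results are exactly the same.

CausalLM 14B - no problem, the results are exactly the same.

Mistral Zephyr-beta: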

pr: ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

Dolphin-2.1 70B:

pr: ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 58718240)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

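In both cases the difference is small: the buffer is short by exactly 2,016 each time (29360128 - 29358112 and 58720256 - 58718240).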

KerfuffleV2

@KerfuffleV2 Awesome! Thanks for testing - I'll fix this after lunch.

ggerganov

@KerfuffleV2 Is this the repo for the first model?

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

Also, which command are you using?

ggerganov

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/

I tested with Q5_K_M, but I also tried another Mistral model quantized to Q2_K and got the same error:

ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

Using -ngl 0 doesn't seem to help either (although this was built with HIPBLAS).

The 70B is https://huggingface.co/TheBloke/Dolphin-2.1-70B-GGUF/

Command:

./perplexity.pr -f /raid/vantec/ai/wikitext-2-raw/wiki.test.raw -m /blah/openhermes-2-mistral-7b.q2_k.gguf -ngl 0 --log-disable

Compiled with:

make -j8 GPU_TARGETS=gfx1030 LLAMA_HIPBLAS=1 perplexity

The system is x86 Linux.

KerfuffleV2

The ggml-alloc issue should be resolved by 2926ef6

ggerganov

Another opportunity for refactoring might be the input part, which appears to be identical in every model:

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);

        embd = ggml_get_rows(ctx0, model.tok_embeddings, inp_tokens);
    } else {
#ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
#endif

        embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);

I noticed that llm_build_persimmon is missing the MPI assert, so creating a common function for this would reduce the chance of this kind of mistake.
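
To illustrate why a shared function helps, here is a rough, self-contained sketch. It does not use ggml: `inp_tensor`, `build_cb`, and `build_inp_embd` are hypothetical stand-ins, the runtime flag `use_mpi` replaces the compile-time `GGML_USE_MPI` check, and the embedding lookup is reduced to producing a tensor of the right shape.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph tensor: only a name and a shape.
struct inp_tensor {
    std::string name;
    std::vector<int> shape;
};

// Callback invoked for each tensor the builder creates (mirrors cb(...) above).
using build_cb = std::function<void(const inp_tensor &, const char *)>;

// Shared input builder: every model would call this instead of repeating the branch.
inline inp_tensor build_inp_embd(bool has_tokens, bool use_mpi,
                                 int n_embd, int n_tokens, const build_cb & cb) {
    inp_tensor embd;
    if (has_tokens) {
        inp_tensor inp_tokens{"inp_tokens", {n_tokens}};
        cb(inp_tokens, "inp_tokens");
        // Stands in for the embedding lookup: one row of n_embd values per token.
        embd = inp_tensor{"embd", {n_embd, n_tokens}};
    } else {
        // The check lives in exactly one place, so no model builder can omit it.
        assert(!use_mpi && "not implemented");
        embd = inp_tensor{"embd", {n_embd, n_tokens}};
    }
    cb(embd, "inp_embd");
    return embd;
}
```

Because the assert sits inside the shared function, a new architecture gets it automatically just by calling the helper.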

slaren

Added llm_build_inp_embd() in 7923b70

ggerganov

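This is getting very close to the point where you could load the model definition from a JSON file or something similar. In fact, it might already be possible with the current changes.

I was looking at the LLaMA definition and messed around a bit for my own amusement: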

pre_repeating {
  cond has_batch_token {
    new_tensor inp_tokens 1 i32 param:n_tokens
    ggml_op get_rows inp_tokens inpL model_tensor:tok_embeddings inp_tokens
  }
  not_cond has_batch_token {
    new_tensor inpL 2 f32 model_param:n_embed param:n_tokens 
  }
  new_tensor inp_pos 1 i32 param:n_tokens
  new_tensor KQ_scale 1 f32 1
  new_tensor KQ_mask 3 f32 param:n_kv param:n_tokens 1
  rope_shift_if_needed LLM_ROPE
}

repeating {
  bind_tensor inpL inpSA
  llm_op build_norm model_layers:attn_norm LLM_NORM_RMS
}

post_repeating {
}

It's not JSON, but it's easy to see how it could be written that way.
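
As a rough illustration of how statements in a format like the one above might be read, here is a small sketch. It only handles single statements such as `new_tensor inp_pos 1 i32 param:n_tokens`; the `{ ... }` blocks are not handled, and `graph_stmt`, `parse_stmt`, and `split_ref` are hypothetical names rather than anything from llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One parsed statement, e.g. "new_tensor inp_pos 1 i32 param:n_tokens".
struct graph_stmt {
    std::string op;                 // first word, e.g. "new_tensor"
    std::vector<std::string> args;  // remaining whitespace-separated words
};

// Split one line into an op and its arguments; returns false for blank lines.
inline bool parse_stmt(const std::string & line, graph_stmt & out) {
    std::istringstream iss(line);
    out = graph_stmt{};
    if (!(iss >> out.op)) {
        return false;
    }
    std::string word;
    while (iss >> word) {
        out.args.push_back(word);
    }
    return true;
}

// Split an argument like "param:n_tokens" into (kind, name).
// Plain words such as "i32" get an empty kind.
inline void split_ref(const std::string & arg, std::string & kind, std::string & name) {
    const std::size_t pos = arg.find(':');
    if (pos == std::string::npos) {
        kind.clear();
        name = arg;
    } else {
        kind = arg.substr(0, pos);
        name = arg.substr(pos + 1);
    }
}
```

A loader built on this would still need to map each `op` to a builder function and resolve `param:` / `model_tensor:` references against the loaded model, which is where most of the real work would be.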

KerfuffleV2
