Factor out some of the common code into separate functions (a minimal, hedged sketch of one such helper follows the list):
- [X] `llm_build_inp_embd()`
- [X] `llm_build_norm()`
- [X] `llm_build_ffn()`
- [X] `llm_build_k_shift()`
- [X] `llm_build_kv_store()`
- [X] `llm_build_qkv()`
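To make the shape of these helpers concrete, here is a minimal sketch in the spirit of `llm_build_norm()`, assuming the public ggml API; the enum and the exact parameter list are illustrative (the real helper also threads an offload callback through), not the PR's actual code:

```cpp
#include "ggml.h"

// Illustrative only: the norm-type enum is assumed for this sketch.
enum llm_norm_type { LLM_NORM, LLM_NORM_RMS };

// Apply LayerNorm or RMSNorm, then the optional weight/bias tensors.
static struct ggml_tensor * llm_build_norm_sketch(
        struct ggml_context * ctx,
         struct ggml_tensor * cur,
         struct ggml_tensor * mw,   // norm weight (may be NULL)
         struct ggml_tensor * mb,   // norm bias   (may be NULL)
          enum llm_norm_type   type,
                      float    eps) {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx, cur, eps); break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, eps); break;
    }
    if (mw) { cur = ggml_mul(ctx, cur, mw); }
    if (mb) { cur = ggml_add(ctx, cur, mb); }
    return cur;
}
```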
All of this is temporary, because we will soon integrate a new backend implementation that should handle tensor offloading automatically, so these changes can be viewed as a preparation step for the migration to the new interface.
Added a sanity check for the offloading. Look for the following messages in the output:
```
llama_new_context_with_model: kv self size = 2.75 MB
llama_build_graph: non-view tensors processed: 510/510 <--- this is good
llama_new_context_with_model: compute buffer total size = 22.75 MB
```

```
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 708/740 <--- this is bad
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 76.66 MB
```
The latter indicates that some tensors in the graph have not been processed with a callback. This can lead to inefficiency during inference, and it can be debugged by building with `LLAMA_OFFLOAD_DEBUG`. A rough sketch of the idea behind the check follows:
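This is only an illustration of the idea, assuming ggml's public `ggml_cgraph` layout; the function name `check_offload_coverage` and the `n_processed` parameter are hypothetical, not the code from this PR. The check boils down to counting the non-view tensors in the built graph and comparing against how many the build callback touched:

```cpp
#include <stdio.h>
#include "ggml.h"

// Hypothetical sketch: count the non-view tensors in a built graph and warn
// if the offload callback did not process all of them.
static void check_offload_coverage(const struct ggml_cgraph * gf, int n_processed) {
    int n_non_view = 0;
    for (int i = 0; i < gf->n_nodes; ++i) {
        // tensors with a view_src are views into other tensors; skip them
        if (gf->nodes[i]->view_src == NULL) {
            n_non_view++;
        }
    }
    fprintf(stderr, "llama_build_graph: non-view tensors processed: %d/%d\n",
            n_processed, n_non_view);
    if (n_processed != n_non_view) {
        fprintf(stderr, "llama_build_graph: not all non-view tensors have been processed with a callback\n");
    }
}
```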
- The Q, K, V implementation for Persimmon is very convoluted. It is quite possible that something there is not right
- ALiBi models might not work well with the K-shift (also broken on `master`); a hedged sketch of the K-shift idea follows this list:

  ```sh
  # MPT-7B, small context of 256 to trigger the shift more often; starts generating junk after the first shift
  make -j main && ./main -m ./models/mpt-7b/ggml-model-q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -c 256 -n 512 -t 8 -ngl 999 -s 1
  ```
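For context, here is a hedged sketch of the idea behind `llm_build_k_shift()`: the cached K tensors are rotated in place by the position deltas using RoPE, so past tokens look as if they had been computed at their new positions. Since ALiBi models encode positions as attention biases rather than rotations, re-roping their K cache presumably does not produce a valid shift. All names, parameters, and signatures below are approximations, not the PR's exact code:

```cpp
#include "ggml.h"

// Hedged sketch: shift the cached K tensors by re-applying RoPE with the
// per-position deltas in K_shift. Parameter names are assumptions.
static void llm_build_k_shift_sketch(
        struct ggml_context * ctx,
         struct ggml_cgraph * graph,
         struct ggml_tensor * k_cache,  // K cache for all layers
         struct ggml_tensor * K_shift,  // per-position rotation deltas
                        int   n_layer,
                        int   n_ctx,
                        int   n_head_kv,
                        int   n_embd_head,
                        int   n_rot,
                        int   rope_type,
                      float   freq_base,
                      float   freq_scale) {
    const size_t es         = ggml_element_size(k_cache);
    const int    n_embd_gqa = n_embd_head*n_head_kv;

    for (int il = 0; il < n_layer; ++il) {
        // view of layer il's slice of the K cache
        struct ggml_tensor * k = ggml_view_3d(ctx, k_cache,
                n_embd_head, n_head_kv, n_ctx,
                es*n_embd_head,
                es*n_embd_gqa,
                es*n_embd_gqa*n_ctx*il);

        // rotate the first n_rot dims by the shift; this is where ALiBi
        // models are problematic: their positions are attention biases,
        // not rotations, so there is nothing meaningful to "re-rope"
        struct ggml_tensor * shifted = ggml_rope_custom_inplace(ctx, k,
                K_shift, n_rot, rope_type, 0, freq_base, freq_scale);

        ggml_build_forward_expand(graph, shifted);
    }
}
```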