Please answer the following questions for yourself before submitting an issue.

llama.cpp with Python bindings, built from Git.

Expected Behavior

Inference works as it did before.

Current Behavior

Inference fails and llama.cpp crashes.

Environment and Context

Python 3.10 / CUDA 11.8

Failure Information (for bugs)
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0
CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1
Relevant Code

As you can see, I added some NVLink printfs, so the line numbers are slightly off, but here is the snippet around the failing location.
// copy src0, src1 to device if necessary
if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
if (id != g_main_device) {
if (convert_src1_to_q8_1) {
char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
// ****> failing call:
CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,
                           cudaMemcpyDeviceToDevice, stream));
} else {
float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
cudaMemcpyDeviceToDevice, stream));
}
}
One of the arguments to cudaMemcpyAsync is invalid; I haven't yet determined which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but a subsequent commit fixed that. I waited a while to see whether this one would also get fixed, since the code is so new. I can't access GitHub from that machine, so I had to bring the logs over here.

Both the P40 and the 3090 do this, and it happens whether or not I force MMQ.