Please answer the following questions for yourself before submitting an issue.

llama.cpp with Python bindings, built from Git.

Expected Behavior

Inference works as it did before.

Current Behavior

Inference fails and llama.cpp crashes.

Environment and Context

Python 3.10 / CUDA 11.8

Failure Information (for bugs)
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0
CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1
Relevant Code

As you can see, I added some NVLink printfs, so the line numbers are slightly off, but here is the snippet around the failing location.
// copy src0, src1 to device if necessary
if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
if (id != g_main_device) {
if (convert_src1_to_q8_1) {
char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
// ****> failing call:
CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,
                           cudaMemcpyDeviceToDevice, stream));
} else {
float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
cudaMemcpyDeviceToDevice, stream));
}
}
One of the arguments to cudaMemcpyAsync is invalid; I haven't yet determined which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but a subsequent commit fixed that. I waited a while to see whether this one would also get fixed, since the code is so new. I can't access GitHub from that machine, so I had to bring the logs over here.

Both the P40 and the 3090 do this, and it happens whether or not I force MMQ.