It should be possible to run on a machine without a CUDA runtime as long as model.n_gpu_layers = 0.
The current behavior in master throws the following error on non-CUDA machines when built with GGML_USE_CUBLAS=ON:
CPU
CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128
CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:5830: no CUDA-capable device is detected
current device: 48
This PR
CPU
CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128
llama_print_timings: load time = 198.51 ms
llama_print_timings: sample time = 100.40 ms / 128 runs ( 0.78 ms per token, 1274.95 tokens per second)
llama_print_timings: prompt eval time = 1383.23 ms / 20 tokens ( 69.16 ms per token, 14.46 tokens per second)
llama_print_timings: eval time = 9991.21 ms / 127 runs ( 78.67 ms per token, 12.71 tokens per second)
llama_print_timings: total time = 11540.18 ms
CPU, but requesting ngl > 0
CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999
CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:478: no CUDA-capable device is detected
CUDA
./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999
llama_print_timings: load time = 397.99 ms
llama_print_timings: sample time = 101.63 ms / 128 runs ( 0.79 ms per token, 1259.51 tokens per second)
llama_print_timings: prompt eval time = 54.91 ms / 20 tokens ( 2.75 ms per token, 364.23 tokens per second)
llama_print_timings: eval time = 1768.41 ms / 127 runs ( 13.92 ms per token, 71.82 tokens per second)
llama_print_timings: total time = 1979.19 ms