[THUDM/ChatGLM-6B][Document] Update Mac deployment instructions

2024-05-20
Update Mac deployment instructions
  • Type: Document
  • FILES: README.md; README_en.md
  • Keywords: OPENMP; MPS
What was updated

Taking the chatglm-6b-int4 quantized model as the example, the following configuration is documented:

  • steps for installing libomp;
  • gcc compile options for the quantized model, enabling OMP to speed up inference;
  • an explanation of what happens when you try to enable MPS for the quantized model (it fails).

Enabling OMP on a Mac involves changes to quantization.py in https://huggingface.co/THUDM/chatglm-6b-int4. Because it also requires manually installing some dependencies, the change is not committed separately; it is described directly in the README instead.
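
To make the compile step concrete: after libomp is installed (for example via Homebrew or conda, which is an assumption here, not something the PR prescribes), the kernel source gets built with Apple clang plus OpenMP flags. Below is a rough Python sketch of that compile call; the function name and structure are made up for illustration, and only the flags mirror the gcc command visible in the build output quoted in the replies further down.

```python
# Hypothetical sketch of the OpenMP-enabled kernel build on macOS; not the
# actual quantization.py diff. The flags follow the gcc line in the build
# output quoted below ("-Xclang -fopenmp -pthread -lomp").
import subprocess

def compile_parallel_kernel(src_path: str) -> str:
    """Compile quantization_kernels_parallel.c into a shared library with OpenMP."""
    out_path = src_path.replace(".c", ".so")
    cmd = [
        "gcc", "-O3", "-fPIC",
        "-Xclang", "-fopenmp",  # Apple clang only accepts -fopenmp via -Xclang
        "-pthread", "-lomp",    # link against the manually installed libomp
        "-std=c99",
        src_path, "-shared", "-o", out_path,
    ]
    subprocess.run(cmd, check=True)
    return out_path
```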

Verified environments:

  • Mac M1 Ultra, 128 GB
  • macOS: 13.3.1
  • GCC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
  • conda 23.3.1
  • torch (two versions, both with MPS):

  • '2.0.0';
  • '2.1.0.dev20230502'

Answers

6

My system is also macOS 13.3.1, and MPS computation in half precision works fine for me. What error do you get when you compute in half precision?

6
  1. In some cases half() needs to be changed to float(); this has already been discussed in other issues;
  2. Loading the quantized model and calling to("mps") does not work
```python
# e.g. web_demo.py
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
model = model.eval()
```

You also replied in issue-462 about the non-quantized model (chatglm-6b is fine; only chatglm-6b-int4 has the problem). The cause is that the quantization_code blob (a bz2-compressed ELF/.so file) ships NVIDIA-only kernels, so MPS currently does not work for it. Enabling MPS there would probably require substantial changes to the quantization code.

error log
```
--- Logging error ---
Traceback (most recent call last):
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 19, in from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/__init__.py", line 1, in from . import library
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/__init__.py", line 1, in from . import nvrtc
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/nvrtc.py", line 5, in nvrtc = Lib("nvrtc")
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 59, in __init__ raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 1100, in emit msg = self.format(record)
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 943, in format return fmt.format(record)
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 678, in format record.message = record.getMessage()
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 368, in getMessage msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 6, in model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/models/auto/auto_factory.py", line 463, in from_pretrained return model_class.from_pretrained(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/modeling_utils.py", line 2637, in from_pretrained model = cls(config, *model_args, **model_kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1061, in __init__ self.quantize(self.config.quantization_bit, self.config.quantization_embeddings, use_quantization_cache=True, empty_init=True)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1424, in quantize from .quantization import quantize, QuantizedEmbedding, QuantizedLinear, load_cpu_kernel
  File "", line 1027, in _find_and_load
  File "", line 1006, in _find_and_load_unlocked
  File "", line 688, in _load_unlocked
  File "", line 883, in exec_module
  File "", line 241, in _call_with_frames_removed
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 46, in logger.warning("Failed to load cpm_kernels:", exception)
Message: 'Failed to load cpm_kernels:'
Arguments: (RuntimeError('Unknown platform: darwin'),)
No compiled kernel found.
Compiling kernels : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c -shared -o /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Load kernel : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 10
Using quantization cache
Applying quantization to glm layers
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict output = await app.get_blocks().process_api(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api result = await self.call_function(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function prediction = await anyio.to_thread.run_sync(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread return await future
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run result = context.run(func, *args)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration return next(iterator)
  File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 61, in predict for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context response = gen.send(None)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1311, in stream_chat for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context response = gen.send(None)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1388, in stream_generate outputs = self(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1190, in forward transformer_outputs = self.transformer(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 996, in forward layer_ret = layer(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 627, in forward attention_outputs = self.attention(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 445, in forward mixed_raw_layer = self.query_key_value(hidden_states)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 391, in forward output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 56, in forward weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 274, in extract_weight_to_half func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'
```
8
The need to change .half() to .float() used to be caused by a bug in PyTorch's baddbmm implementation on the MPS backend; that should be fixed by now. Can you still reproduce the problem?
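
If you want to check whether your build is affected, a quick repro sketch (my own, assuming baddbmm is indeed the op in question) is to run a small fp16 baddbmm on MPS:

```python
# Quick check (illustrative): fp16 baddbmm on the MPS backend.
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"
inp = torch.randn(1, 4, 8, dtype=torch.float16, device="mps")
b1 = torch.randn(1, 4, 6, dtype=torch.float16, device="mps")
b2 = torch.randn(1, 6, 8, dtype=torch.float16, device="mps")
out = torch.baddbmm(inp, b1, b2)  # raises or misbehaves on affected builds
print(torch.__version__, out.shape, out.dtype)
```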

1

With the latest (2023/05) pytorch-nightly the problem is gone. Detailed version numbers and test results below:

| torch version | half() | float() |
| --- | --- | --- |
| 2.1.0.dev20230502 | ✓ | ✓ |
| 2.0.0 | ✗ | ✓ |

With torch==2.0.0 (the Anaconda mirrors widely used in mainland China have not synced pytorch-nightly, so an inattentive install will land on 2.0.0 and hit this), there is an MPS bug:

```
Python 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0'
>>> torch.backends.mps.is_available()
True
```

---------------- error logs ----------------

```
loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x5x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
```
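
As a practical guard, here is a small sketch (my own; the 2.1 cutoff is inferred only from the two versions tested above) that picks the dtype for MPS based on the installed torch version:

```python
# Illustrative version guard: fall back to float32 on MPS for torch < 2.1,
# based only on the 2.0.0 vs 2.1.0.dev20230502 results in the table above.
import torch

def pick_mps_dtype() -> torch.dtype:
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 1):
        return torch.float16  # nightly 2.1.0.dev20230502: half() worked
    return torch.float32      # 2.0.0: half() hit the broadcast error above

print(torch.__version__, pick_mps_dtype())
```
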
9

It runs, but very slowly. How can I fix this? Mac M1.

Only the CPU is being used.

As asked: the earlier part of this issue explains why calling the quantized model on MPS is problematic.

Not enough memory.

It is probably worth checking memory ("VRAM", i.e. unified memory). My M1 machines have 64 GB (MacBook Pro M1 Max) and 128 GB (Mac Studio); memory usage is indeed quite high, but as long as the (context) token count is not too large it is not a big problem.

Monitor the memory usage while it is running, for example:

```
while :; do clear; top -l 1 | grep "python" | awk '{print "MEM="$9 "\tRPRVT="$10}'; sleep 2; done
```

Replace python in the command with a keyword from the command you actually launch with; press Ctrl+C to stop. That shows how much memory is really being used.
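
If you prefer doing this from Python, here is a rough psutil-based equivalent (assumes psutil is installed; the process-name filter is just an example):

```python
# Rough Python equivalent of the top/awk loop above (requires `pip install psutil`).
import time
import psutil

def watch(keyword: str = "python", interval: float = 2.0) -> None:
    """Print the resident memory of matching processes every `interval` seconds."""
    while True:
        for proc in psutil.process_iter(["name", "memory_info"]):
            name = proc.info["name"] or ""
            mem = proc.info["memory_info"]
            if keyword in name and mem is not None:
                print(f"pid={proc.pid} name={name} RSS={mem.rss / 1024**3:.2f} GB")
        time.sleep(interval)

watch()  # Ctrl+C to stop, same as the shell loop
```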

As for the details, more logs would be needed. Running inference on a Mac is only a workable option, nothing more.

To sum up, the likely causes:

  • You can experiment with the int8/int4 quantized models, but on a Mac these currently run only on the CPU, so they are naturally slow;
  • Not enough memory, leading to frequent swapping (a rough pre-flight check is sketched below).
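
For the second point, a rough pre-flight check before loading the model (a sketch; the thresholds are my guesses, not measurements from this thread):

```python
# Rough pre-flight memory check (requires psutil; thresholds are guesses):
# fp32 weights of a ~6B-parameter model alone are on the order of 25 GB,
# the int4-quantized weights a few GB, plus activation/KV-cache overhead.
import psutil

avail_gb = psutil.virtual_memory().available / 1024 ** 3
print(f"available unified memory: {avail_gb:.1f} GB")
if avail_gb < 8:
    print("likely to swap even with chatglm-6b-int4; close other apps first")
elif avail_gb < 32:
    print("float32 chatglm-6b may swap; consider the int4 model on CPU")
```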