[THUDM/ChatGLM-6B][Document] Update Mac deployment instructions

2024-05-20
Update Mac deployment instructions
  • Type: Document
  • FILES: README.md; README_en.md
  • Keywords: OPENMP; MPS
What was updated

Taking the chatglm-6b-int4 quantized model as the example, the following configuration is documented:

  • steps for installing libomp;
  • gcc compile options for the quantized model, enabling OMP to speed up inference;
  • an explanation of what happens when you try to enable MPS for the quantized model (it fails).

Enabling OMP on a Mac involves changes to quantization.py in https://huggingface.co/THUDM/chatglm-6b-int4. Because it also requires manually installing some dependencies, the change is not committed separately; it is described directly in the README instead.
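
To make the compile step concrete: after libomp is installed (for example via Homebrew or conda, which is an assumption here, not something the PR prescribes), the kernel source gets built with Apple clang plus OpenMP flags. Below is a rough Python sketch of that compile call; the function name and structure are made up for illustration, and only the flags mirror the gcc command visible in the build output quoted in the replies further down.

```python
# Hypothetical sketch of the OpenMP-enabled kernel build on macOS; not the
# actual quantization.py diff. The flags follow the gcc line in the build
# output quoted below ("-Xclang -fopenmp -pthread -lomp").
import subprocess

def compile_parallel_kernel(src_path: str) -> str:
    """Compile quantization_kernels_parallel.c into a shared library with OpenMP."""
    out_path = src_path.replace(".c", ".so")
    cmd = [
        "gcc", "-O3", "-fPIC",
        "-Xclang", "-fopenmp",  # Apple clang only accepts -fopenmp via -Xclang
        "-pthread", "-lomp",    # link against the manually installed libomp
        "-std=c99",
        src_path, "-shared", "-o", out_path,
    ]
    subprocess.run(cmd, check=True)
    return out_path
```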

Verified environments:

  • Mac M1 Ultra, 128 GB
  • macOS: 13.3.1
  • GCC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
  • conda 23.3.1
  • torch (two versions, both with MPS):

  • '2.0.0';
  • '2.1.0.dev20230502'

Answers

6

My system is also macOS 13.3.1, and MPS computation in half precision works fine for me. What error do you get when you compute in half precision?

6
  1. In some cases half() needs to be changed to float(); this has already been discussed in other issues;
  2. Loading the quantized model and calling to("mps") does not work
```python
# e.g. web_demo.py
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
model = model.eval()
```

You also replied in issue-462 about the non-quantized model (chatglm-6b is fine; only chatglm-6b-int4 has the problem). The cause is that the quantization_code blob (a bz2-compressed ELF/.so file) ships NVIDIA-only kernels, so MPS currently does not work for it. Enabling MPS there would probably require substantial changes to the quantization code.

error log
```
--- Logging error ---
Traceback (most recent call last):
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 19, in from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/__init__.py", line 1, in from . import library
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/__init__.py", line 1, in from . import nvrtc
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/nvrtc.py", line 5, in nvrtc = Lib("nvrtc")
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 59, in __init__ raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 1100, in emit msg = self.format(record)
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 943, in format return fmt.format(record)
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 678, in format record.message = record.getMessage()
  File "/Users/yifanyang/miniconda3/lib/python3.10/logging/__init__.py", line 368, in getMessage msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 6, in model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/models/auto/auto_factory.py", line 463, in from_pretrained return model_class.from_pretrained(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/modeling_utils.py", line 2637, in from_pretrained model = cls(config, *model_args, **model_kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1061, in __init__ self.quantize(self.config.quantization_bit, self.config.quantization_embeddings, use_quantization_cache=True, empty_init=True)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1424, in quantize from .quantization import quantize, QuantizedEmbedding, QuantizedLinear, load_cpu_kernel
  File "", line 1027, in _find_and_load
  File "", line 1006, in _find_and_load_unlocked
  File "", line 688, in _load_unlocked
  File "", line 883, in exec_module
  File "", line 241, in _call_with_frames_removed
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 46, in logger.warning("Failed to load cpm_kernels:", exception)
Message: 'Failed to load cpm_kernels:'
Arguments: (RuntimeError('Unknown platform: darwin'),)
No compiled kernel found.
Compiling kernels : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c -shared -o /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Load kernel : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 10
Using quantization cache
Applying quantization to glm layers
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict output = await app.get_blocks().process_api(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api result = await self.call_function(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function prediction = await anyio.to_thread.run_sync(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread return await future
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run result = context.run(func, *args)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration return next(iterator)
  File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 61, in predict for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context response = gen.send(None)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1311, in stream_chat for outputs in self.stream_generate(**inputs, **gen_kwargs):
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context response = gen.send(None)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1388, in stream_generate outputs = self(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1190, in forward transformer_outputs = self.transformer(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 996, in forward layer_ret = layer(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 627, in forward attention_outputs = self.attention(
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 445, in forward mixed_raw_layer = self.query_key_value(hidden_states)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 391, in forward output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 56, in forward weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
  File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 274, in extract_weight_to_half func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'
```
8
The need to change .half() to .float() used to be caused by a bug in PyTorch's baddbmm implementation on the MPS backend; that should be fixed by now. Can you still reproduce the problem?
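
If you want to check whether your build is affected, a quick repro sketch (my own, assuming baddbmm is indeed the op in question) is to run a small fp16 baddbmm on MPS:

```python
# Quick check (illustrative): fp16 baddbmm on the MPS backend.
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"
inp = torch.randn(1, 4, 8, dtype=torch.float16, device="mps")
b1 = torch.randn(1, 4, 6, dtype=torch.float16, device="mps")
b2 = torch.randn(1, 6, 8, dtype=torch.float16, device="mps")
out = torch.baddbmm(inp, b1, b2)  # raises or misbehaves on affected builds
print(torch.__version__, out.shape, out.dtype)
```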

1

With the latest (2023/05) pytorch-nightly the problem is gone. Detailed version numbers and test results below:

| torch version | half() | float() |
| --- | --- | --- |
| 2.1.0.dev20230502 | ✓ | ✓ |
| 2.0.0 | ✗ | ✓ |

With torch==2.0.0 (the Anaconda mirrors widely used in mainland China have not synced pytorch-nightly, so an inattentive install will land on 2.0.0 and hit this), there is an MPS bug:

```
Python 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.0'
>>> torch.backends.mps.is_available()
True
```

---------------- error logs ----------------

```
loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x5x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
```
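
As a practical guard, here is a small sketch (my own; the 2.1 cutoff is inferred only from the two versions tested above) that picks the dtype for MPS based on the installed torch version:

```python
# Illustrative version guard: fall back to float32 on MPS for torch < 2.1,
# based only on the 2.0.0 vs 2.1.0.dev20230502 results in the table above.
import torch

def pick_mps_dtype() -> torch.dtype:
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    if (major, minor) >= (2, 1):
        return torch.float16  # nightly 2.1.0.dev20230502: half() worked
    return torch.float32      # 2.0.0: half() hit the broadcast error above

print(torch.__version__, pick_mps_dtype())
```
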
9

It runs, but very slowly. How can I fix this? Mac M1.

Only the CPU is being used.

As asked: the earlier part of this issue explains why calling the quantized model on MPS is problematic.

Not enough memory.

It is probably worth checking memory ("VRAM", i.e. unified memory). My M1 machines have 64 GB (MacBook Pro M1 Max) and 128 GB (Mac Studio); memory usage is indeed quite high, but as long as the (context) token count is not too large it is not a big problem.

Monitor the memory usage while it is running, for example:

```
while :; do clear; top -l 1 | grep "python" | awk '{print "MEM="$9 "\tRPRVT="$10}'; sleep 2; done
```

Replace python in the command with a keyword from the command you actually launch with; press Ctrl+C to stop. That shows how much memory is really being used.
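
If you prefer doing this from Python, here is a rough psutil-based equivalent (assumes psutil is installed; the process-name filter is just an example):

```python
# Rough Python equivalent of the top/awk loop above (requires `pip install psutil`).
import time
import psutil

def watch(keyword: str = "python", interval: float = 2.0) -> None:
    """Print the resident memory of matching processes every `interval` seconds."""
    while True:
        for proc in psutil.process_iter(["name", "memory_info"]):
            name = proc.info["name"] or ""
            mem = proc.info["memory_info"]
            if keyword in name and mem is not None:
                print(f"pid={proc.pid} name={name} RSS={mem.rss / 1024**3:.2f} GB")
        time.sleep(interval)

watch()  # Ctrl+C to stop, same as the shell loop
```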

As for the details, more logs would be needed. Running inference on a Mac is only a workable option, nothing more.

To sum up, the likely causes:

  • You can experiment with the int8/int4 quantized models, but on a Mac these currently run only on the CPU, so they are naturally slow;
  • Not enough memory, leading to frequent swapping (a rough pre-flight check is sketched below).
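
For the second point, a rough pre-flight check before loading the model (a sketch; the thresholds are my guesses, not measurements from this thread):

```python
# Rough pre-flight memory check (requires psutil; thresholds are guesses):
# fp32 weights of a ~6B-parameter model alone are on the order of 25 GB,
# the int4-quantized weights a few GB, plus activation/KV-cache overhead.
import psutil

avail_gb = psutil.virtual_memory().available / 1024 ** 3
print(f"available unified memory: {avail_gb:.1f} GB")
if avail_gb < 8:
    print("likely to swap even with chatglm-6b-int4; close other apps first")
elif avail_gb < 32:
    print("float32 chatglm-6b may swap; consider the int4 model on CPU")
```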