[THUDM/ChatGLM-6B][BUG/Help] OOM when running multi-GPU P-tuning fine-tuning on Windows

2024-05-20 516 views
5

OOM occurs when fine-tuning with multiple GPUs on Windows Server. Error message:

    OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 22.50 GiB total capacity; 19.86 GiB already allocated; 0 bytes free; 19.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

1. Reproduction

Fine-tuning batch file:

    cd ptuning
    SET CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    python main.py --do_train --train_file ..\answers.json --validation_file ..\dev.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path ..\model --output_dir ..\output --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 500 --logging_steps 10 --save_steps 50 --learning_rate 2e-2 --pre_seq_len 128
    pause

Training error: "Running tokenizer on train dataset" reaches 100%, and right after the train-set inputs are printed, the OOM error appears:

    OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 22.50 GiB total capacity; 19.86 GiB already allocated; 0 bytes free; 19.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

2. Troubleshooting

Fine-tuning batch file restricted to a single GPU:

    cd ptuning
    SET CUDA_VISIBLE_DEVICES=0
    python main.py --do_train --train_file ..\answers.json --validation_file ..\dev.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path ..\model --output_dir ..\output --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 500 --logging_steps 10 --save_steps 50 --learning_rate 2e-2 --pre_seq_len 128
    pause

With a single GPU, training runs normally with no OOM, just a bit slower.
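Following the allocator hint in the error message above, one low-effort thing to try is setting PYTORCH_CUDA_ALLOC_CONF before launching. A minimal sketch of the multi-GPU batch file with that variable added; the value 128 is only an illustrative choice, not something confirmed in this thread:

    cd ptuning
    REM Cap the caching allocator's split block size to reduce fragmentation (illustrative value).
    SET PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    SET CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    REM Same command and arguments as in the batch file above.
    python main.py --do_train --train_file ..\answers.json --validation_file ..\dev.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path ..\model --output_dir ..\output --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 500 --logging_steps 10 --save_steps 50 --learning_rate 2e-2 --pre_seq_len 128
    pause

This only addresses fragmentation; if GPU 0 is genuinely full it will not prevent the OOM.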

Environment
- OS:Windows Server 2019
- Python:3.10.9
- Transformers:4.27.1
- PyTorch:2.0.0+cu118
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True

Other information
- CUDA version: 11.8
- GPU: Tesla P40 × 8
- Using Anaconda: No
- Training set size: 34.7 MB

Answers

6

Use deepspeed to launch multi-GPU training.
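A sketch of what such a launch could look like with the original poster's arguments, assuming a deepspeed.json config next to main.py (one possible config is sketched after the next reply). This follows the pattern of the repo's ds_train_finetune.sh rather than a command confirmed in this thread, and the deepspeed launcher has only limited Windows support:

    deepspeed --num_gpus=8 main.py ^
      --deepspeed deepspeed.json ^
      --do_train ^
      --train_file ..\answers.json ^
      --validation_file ..\dev.json ^
      --prompt_column prompt ^
      --response_column response ^
      --overwrite_cache ^
      --model_name_or_path ..\model ^
      --output_dir ..\output ^
      --overwrite_output_dir ^
      --max_source_length 256 ^
      --max_target_length 256 ^
      --per_device_train_batch_size 1 ^
      --per_device_eval_batch_size 1 ^
      --gradient_accumulation_steps 16 ^
      --predict_with_generate ^
      --max_steps 500 ^
      --logging_steps 10 ^
      --save_steps 50 ^
      --learning_rate 2e-2 ^
      --pre_seq_len 128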

0

A100 80G × 2, OOM:

    deepspeed --num_gpus=2 --master_port $MASTER_PORT main.py \
        --deepspeed deepspeed.json \
        --do_train \
        --train_file ../data/2w.csv \
        --test_file ../data/2k.csv \
        --prompt_column prompts \
        --response_column output \
        --overwrite_cache \
        --model_name_or_path ../chatglm-6b \
        --output_dir ./output/xw-chatglm-6b-ft-$LR \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --predict_with_generate \
        --max_steps 10000 \
        --logging_steps 100 \
        --save_steps 5000 \
        --learning_rate $LR \
        --fp16
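The deepspeed.json referenced by this command is not shown in the thread. A minimal sketch of what it might contain, using standard DeepSpeed configuration keys (ZeRO stage 2 with optimizer offload to CPU is a common way to cut per-GPU memory); none of these values come from the original posters:

    {
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "fp16": {
        "enabled": true
      },
      "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
          "device": "cpu"
        },
        "overlap_comm": true,
        "contiguous_gradients": true
      },
      "zero_allow_untested_optimizer": true
    }

If OOM persists, the ZeRO stage and the offload settings are the usual knobs to adjust; the "auto" values defer batch-size bookkeeping to the Transformers/DeepSpeed integration so they stay consistent with the command-line arguments.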

> Use deepspeed to launch multi-GPU training.

4

Solved, thanks to both of you @zhanshijinwat @xiamaozi11

5

> Solved, thanks to both of you @zhanshijinwat @xiamaozi11

Did you manage to train successfully with deepspeed on Windows?

2

@0x0019 @zhanshijinwat @xiamaozi11 Could you tell me whether you used deepspeed or P-tuning for the fine-tuning? Was it the ds_train_finetune.sh script that you ran? I'm a beginner, so any guidance would be much appreciated, thanks.