OOM when fine-tuning with multiple GPUs on Windows Server

Error message: OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 22.50 GiB total capacity; 19.86 GiB already allocated; 0 bytes free; 19.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
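
The numbers in the error message can be checked against its own hint about `max_split_size_mb`. The sketch below only parses the message quoted above and does the arithmetic; the helper name `parse_oom` and the regular expressions are illustrative assumptions that match this PyTorch 2.0-style wording, not part of any PyTorch API.

```python
import re

# The error text exactly as reported above (PyTorch 2.0-style wording).
ERROR_TEXT = (
    "CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 22.50 GiB total capacity; "
    "19.86 GiB already allocated; 0 bytes free; 19.93 GiB reserved in total by PyTorch)"
)


def parse_oom(message: str) -> dict:
    """Hypothetical helper: pull the reported sizes out of the OOM message."""
    patterns = {
        "requested_mib": r"Tried to allocate ([\d.]+) MiB",
        "total_gib": r"([\d.]+) GiB total capacity",
        "allocated_gib": r"([\d.]+) GiB already allocated",
        "reserved_gib": r"([\d.]+) GiB reserved in total",
    }
    return {key: float(re.search(pat, message).group(1)) for key, pat in patterns.items()}


stats = parse_oom(ERROR_TEXT)

# Memory PyTorch has reserved but is not currently using (possible fragmentation).
cached_unused_gib = stats["reserved_gib"] - stats["allocated_gib"]
# Memory on GPU 0 that is not held by this process's caching allocator.
outside_pytorch_gib = stats["total_gib"] - stats["reserved_gib"]

print(f"reserved - allocated: {cached_unused_gib:.2f} GiB")
print(f"total - reserved:     {outside_pytorch_gib:.2f} GiB")
```

Reserved memory exceeds allocated memory by only about 0.07 GiB, so the "reserved >> allocated" condition does not hold here and tuning `max_split_size_mb` is unlikely to help much on its own. The remaining 2.57 GiB or so on GPU 0 is not held by this process's PyTorch allocator at all (for example CUDA context overhead or other processes), which may be worth looking into separately.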
I. Reproducing the error

1. Fine-tuning batch file

    cd ptuning
    SET CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    python main.py --do_train --train_file ..\answers.json --validation_file ..\dev.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path ..\model --output_dir ..\output --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 500 --logging_steps 10 --save_steps 50 --learning_rate 2e-2 --pre_seq_len 128
    pause

2. Error during training

After "Running tokenizer on train dataset" reaches 100% and the training-set inputs are printed, the OOM error occurs.

Error message: OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 22.50 GiB total capacity; 19.86 GiB already allocated; 0 bytes free; 19.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

II. Troubleshooting

1. Fine-tuning batch file (identical to the one above, except that only GPU 0 is visible)

    cd ptuning
    SET CUDA_VISIBLE_DEVICES=0
    python main.py --do_train --train_file ..\answers.json --validation_file ..\dev.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path ..\model --output_dir ..\output --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 500 --logging_steps 10 --save_steps 50 --learning_rate 2e-2 --pre_seq_len 128
    pause
With a single GPU, training runs normally and no OOM occurs; it is just slower.
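
Since the error names GPU 0 specifically, one thing worth checking is how memory is spread across the eight cards while the multi-GPU run is loading. A possible explanation, stated here only as an assumption, is that launching a single `python main.py` process with several visible GPUs makes the Transformers `Trainer` fall back to `torch.nn.DataParallel`, which tends to put extra load on GPU 0. The sketch below is a minimal diagnostic, not a fix; `gpu_memory_report` is a hypothetical helper name, and it relies on `torch.cuda.mem_get_info`, which reports device-wide free and total memory.

```python
def gpu_memory_report():
    """Return one (index, name, used_gib, total_gib) tuple per visible GPU.

    Returns an empty list when PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []  # PyTorch is not installed in this interpreter
    if not torch.cuda.is_available():
        return []

    gib = 1024 ** 3
    rows = []
    for index in range(torch.cuda.device_count()):
        # mem_get_info returns (free_bytes, total_bytes) for the whole device,
        # so it also reflects memory used by other processes.
        free_bytes, total_bytes = torch.cuda.mem_get_info(index)
        used_gib = (total_bytes - free_bytes) / gib
        rows.append((index, torch.cuda.get_device_name(index), used_gib, total_bytes / gib))
    return rows


report = gpu_memory_report()
for index, name, used_gib, total_gib in report:
    print(f"GPU {index} ({name}): {used_gib:.2f} / {total_gib:.2f} GiB in use")
```

Running this from a second console while the training process starts up should show whether GPU 0 fills up well before the others. Note that the script itself creates a small CUDA context on each visible GPU, and that it only sees the devices allowed by `CUDA_VISIBLE_DEVICES` in its own console.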
Environment
- OS: Windows Server 2019
- Python: 3.10.9
- Transformers: 4.27.1
- PyTorch: 2.0.0+cu118
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
Other information
- CUDA version: 11.8
- GPUs: 8 x Tesla P40
- Using Anaconda: No
- Training set size: 34.7 MB
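
For anyone trying to compare their setup with the one listed above, the version details can be gathered in one go. This is a small sketch using only the standard library; `package_version` is a hypothetical helper, and the package names `transformers` and `torch` are assumed to be the pip distribution names in use.

```python
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a pip distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


env = {
    "OS": platform.platform(),
    "Python": platform.python_version(),
    "Transformers": package_version("transformers"),
    "PyTorch": package_version("torch"),
}

# Print in the same "- key: value" style as the environment list.
for key, value in env.items():
    print(f"- {key}: {value}")
```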