Single-machine multi-GPU fine-tuning on 4×32 GB GPUs OOMs immediately. I changed CUDA_VISIBLE_DEVICES=0,1,2,3 in train.sh; a sketch of the resulting launch command is below.
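For reference, a minimal sketch of the launch command after that change, with flags assumed from the stock ChatGLM-6B ptuning/train.sh (model path, batch size, and accumulation steps are illustrative, not my exact values):

```bash
# Sketch of the modified train.sh; flags assumed from the stock
# ChatGLM-6B ptuning script, shown only to illustrate the setup.
PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py \
    --do_train \
    --model_name_or_path THUDM/chatglm-6b \
    --output_dir output/adgen-chatglm-6b-pt \
    --pre_seq_len $PRE_SEQ_LEN \
    --learning_rate $LR \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16
    # ...data arguments (train_file, prompt/response columns) as in the stock script
```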
Traceback (most recent call last):
File "main.py", line 431, in
main()
File "main.py", line 370, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 1635, in train
return inner_training_loop(
File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 1904, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 2665, in training_step
loss.backward()
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 34, in backward
return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 45, in forward
return comm.reduce_add_coalesced(grads, destination)
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
flat_result = reduce_add(flat_tensors, destination)
File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 95, in reduce_add
result = torch.empty_like(inputs[root_index])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 31.75 GiB total capacity; 30.53 GiB already allocated; 87.69 MiB free; 30.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
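Per the suggestion in the error message itself, one thing I can try is capping the allocator split size via PYTORCH_CUDA_ALLOC_CONF; the value below is an example, not a tuned setting:

```bash
# The error message suggests tuning max_split_size_mb to reduce
# fragmentation; 128 MB here is an arbitrary example value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash train.sh
```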
Environment
- OS: Ubuntu 18.04
- Python: 3.8.15
- Transformers: 4.28.1
- PyTorch: 1.13.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`):