Train on single H200 possible? #46

@JRGit4UE

For a "hard-core" test, is it also possible to train on, e.g., Spider (7000 sequences) with a single H200 system (141 GB GPU memory, 700 GB RAM, 32 cores)?

I had to adapt the code to run with Python 3.13, transformers 5, and flash-attn-4.
transformers 5 lacks tokenizer.batch_encode_plus(), so I adjusted the call to work with both transformers 4 and 5:

if hasattr(tokenizer, 'batch_encode_plus'):
  # use existing code
  result = tokenizer.batch_encode_plus(sequences, ..)
else:
  result = tokenizer(sequences, ..)

I also set num_processes: 1 in accelerate_config_7b.yaml.
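
For anyone hitting the same error, the tokenizer shim above can be written as a small self-contained helper. This is only a sketch: the name `encode_batch` and the two stand-in tokenizer classes are hypothetical and exist only so the example runs without transformers installed. The real call would pass the repository's own arguments.

```python
def encode_batch(tokenizer, sequences, **kwargs):
    """Encode a batch with whichever API the tokenizer object exposes."""
    if hasattr(tokenizer, "batch_encode_plus"):
        # Tokenizers that still provide the older method (e.g. transformers 4).
        return tokenizer.batch_encode_plus(sequences, **kwargs)
    # Otherwise fall back to calling the tokenizer directly.
    return tokenizer(sequences, **kwargs)


class LegacyTokenizer:
    """Hypothetical stand-in that only offers batch_encode_plus."""

    def batch_encode_plus(self, sequences, **kwargs):
        return {"input_ids": [[len(s)] for s in sequences],
                "via": "batch_encode_plus"}


class CallableTokenizer:
    """Hypothetical stand-in that is only callable."""

    def __call__(self, sequences, **kwargs):
        return {"input_ids": [[len(s)] for s in sequences],
                "via": "__call__"}


queries = ["SELECT 1", "SELECT name FROM t"]
legacy_result = encode_batch(LegacyTokenizer(), queries)
callable_result = encode_batch(CallableTokenizer(), queries)
```

Both calls go through the same helper, so the training code does not need to know which transformers version is installed.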

Unfortunately, after 63 batches the GPU runs out of memory during accelerator.backward(loss).
So my humble question is: what changes to the configs are needed to keep training within GPU memory?
Or, alternatively, what changes (apart from use_cpu: true) are needed to switch to CPU training?
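
As a rough sanity check on how tight the memory budget is, here is a back-of-the-envelope estimate. It rests on assumptions that are not confirmed by the repository: a 7B-parameter model (guessed from the config file name), full fine-tuning with an Adam-style optimizer, and mixed precision with fp32 master weights. The function name and byte counts are illustrative; the real numbers depend on the actual training setup.

```python
def training_state_gb(n_params,
                      bytes_weights=2,   # bf16/fp16 working weights
                      bytes_grads=2,     # bf16/fp16 gradients
                      bytes_master=4,    # fp32 master copy of the weights
                      bytes_adam=8):     # fp32 first and second Adam moments
    """Approximate memory for weights, gradients and optimizer state, in GB.

    Activations and temporary buffers are not included.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_master + bytes_adam
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9   # assumption: "7b" in accelerate_config_7b.yaml
GPU_GB = 141     # H200 memory as stated above

state_gb = training_state_gb(N_PARAMS)
headroom_gb = GPU_GB - state_gb

print(f"model + optimizer state: ~{state_gb:.0f} GB")
print(f"left for activations:    ~{headroom_gb:.0f} GB")
```

Under these assumptions only a small part of the GPU is left for activations, so a single unusually long sequence in a batch could be enough to trigger the out-of-memory error.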
