
Add parallel eval runner for understanding benchmarks (with multiple GPUs, the CLI can run understanding benchmarks with multi-GPU parallel inference) #5

Open
MqLeet wants to merge 1 commit into AIFrontierLab:main from MqLeet:feat/distributed-eval-runner
Open

Add parallel eval runner for understanding benchmarks (如果有多卡的话,可以用cli多卡并行推在理解benchmark)#5
MqLeet wants to merge 1 commit into
AIFrontierLab:mainfrom
MqLeet:feat/distributed-eval-runner

Conversation

@MqLeet MqLeet commented Apr 27, 2026

Hi @ApiaoSamaa, thanks for open-sourcing this excellent work! I'm a fan of Jindong's work. I've been using torchumm to benchmark understanding tasks over the past few days and adapted it myself for parallel inference in a multi-GPU environment, so I'm submitting this PR to support the project~

Summary

Introduces a small, focused runner for distributed sharded inference and refactors the 5 understanding-class eval CLIs (mmbench, mme, mmmu, mathvista, mmvet) to use it. Lifts the distributed-init / round-robin sharding / per-rank JSONL checkpoint / rank-0 merge boilerplate into one place.

After this PR, all 5 CLIs can run under torchrun --nproc_per_node=N for data-parallel evaluation, and single-card behavior is preserved (final user-facing output files are byte-equivalent for benchmarks that already had a defined output format).

PYTHONPATH=src torchrun --nproc_per_node="${GPUS}" --master_port="${MASTER_PORT}" -m umm.cli.main eval --config configs/eval/mmbench/mmbench_show_o.yaml

What's added?

distributed runner changes

  • src/umm/eval/distributed.py — DistInfo dataclass, dist init/barrier/all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy torch import so single-card callers pay no import cost.
  • src/umm/eval/runner.py — run_sharded_inference(): round-robin sample assignment by sample_idx, per-rank JSONL shard append (flush+fsync), resume via caller-supplied done_ids, optional global max_samples cap. Accepts an infer_fn callable so the runner is unit-testable without a real model.
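
To make the round-robin / resume / fsync behavior concrete, here is a minimal sketch of what such a runner might look like. Names, signatures, and the record format are illustrative, not the actual src/umm/eval/runner.py code; it also copies done_ids defensively rather than mutating the caller's set.

```python
# Illustrative sketch of a sharded-inference runner (not the PR's actual code).
import json
import os


def run_sharded_inference(samples, infer_fn, shard_path,
                          rank=0, world_size=1,
                          done_ids=None, max_samples=None):
    done = set(done_ids or ())           # copy: don't mutate the caller's set
    n_done = 0
    with open(shard_path, "a") as f:
        for idx, sample in enumerate(samples):
            if max_samples is not None and idx >= max_samples:
                break                    # global cap on the iteration index
            if idx % world_size != rank:
                continue                 # round-robin shard assignment
            if sample["id"] in done:
                continue                 # resume: skip already-finished samples
            result = infer_fn(sample)
            f.write(json.dumps({"id": sample["id"], "sample_idx": idx,
                                "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())         # checkpoint survives a mid-run crash
            n_done += 1
    return n_done
```

Because infer_fn is an ordinary callable, the sharding logic can be exercised with a stub function and no model loaded.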

per-cli changes

All 5 CLIs (mmbench, mme, mmmu, mathvista, mmvet) gain parallel support and share the same shape:

  dist_info = maybe_init_distributed()
  shard = rank_shard_path(checkpoint, dist_info.rank, dist_info.world_size)
  done_ids = {... from load_shard_items(shard)} if resume else set()                                                                                  
  n = run_sharded_inference(infer_fn=pipeline.run, ...)                                                                                               
  barrier(dist_info)                                                                                                                                  
  if dist_info.rank == 0:                                                                                                                             
      merged = merge_shards(checkpoint)                                                                                                               
      # benchmark-specific output formatting
      cleanup_shards(checkpoint)

Usage

  • Single-GPU (unchanged)
    PYTHONPATH=src python -m umm.cli.main eval --config <cfg>.yaml

  • Multi-GPU
    PYTHONPATH=src torchrun --nproc_per_node=8 -m umm.cli.main eval --config <cfg>.yaml

Tips

I've adapted mme, mmbench, mmmu, mathvista, and mmvet, but haven't adapted the generation tasks. There may also be some follow-up PRs:

Backbone LOCAL_RANK adaptation. Each backbone's adapter.py (under backbones/) needs GPU device assignment:

    def _get_runtime_device(self):
        import os
        import torch

        if not torch.cuda.is_available():
            return torch.device("cpu")
        # Pin this process to the GPU torchrun assigned it via LOCAL_RANK.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if 0 <= local_rank < torch.cuda.device_count():
            torch.cuda.set_device(local_rank)
            return torch.device(f"cuda:{local_rank}")
        return torch.device("cuda")

and replace

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with

device = self._get_runtime_device()

Introduces two small modules under src/umm/eval/ that lift the
distributed-init / sharding / shard-merge boilerplate out of each
understanding-class eval CLI.

* src/umm/eval/distributed.py — DistInfo dataclass, dist init/barrier/
  all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy
  torch import so single-card callers pay no import cost.
* src/umm/eval/runner.py — run_sharded_inference(): round-robin sample
  assignment by sample_idx, per-rank JSONL shard append (flush+fsync),
  resume via caller-supplied done_ids, optional global max_samples cap.
  Accepts an infer_fn callable so the runner is unit-testable without a
  real model.

Refactors mmbench/mme/mmmu/mathvista/mmvet eval CLIs to use the runner.
mathvista and mmvet gain parallel support; the others have their
duplicated dist plumbing replaced.

The runner does only "shard inference + shard merge". Each CLI keeps
its post-processing (Excel/JSON output, calculation.py invocation,
mathvista's LLM extraction) behind `if rank == 0:`.

Behavior in single-card mode: final user-facing outputs are identical.
Mid-run checkpoint format changes (mme TSV→JSONL during run, mathvista
and mmvet dict-JSON→JSONL) so a partial run with the prior code cannot
be resumed by this code; fresh runs work identically.

Out of scope (follow-up PRs): each backbone adapter's LOCAL_RANK
handling — only show_o currently honors LOCAL_RANK, others default to
cuda:0 or device_map="auto" and need adaptation before they work
correctly under torchrun multi-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ApiaoSamaa ApiaoSamaa self-assigned this Apr 27, 2026
Collaborator

@ApiaoSamaa ApiaoSamaa left a comment


Thanks for the parallel-eval contribution — the distributed.py / runner.py
structure is clean and the single-card backward compatibility is preserved.
However, I'd like to request changes before this can merge, based on
end-to-end testing on Modal with bagel + 2× A100-80GB. Several issues
need to be addressed:

Blockers

1. mmbench_eval.py is a large functional regression

The diff removes ~600 lines of VLMEvalKit-compatible scoring infrastructure
that exists in main:

  • _can_infer* exact-match extraction (mmbench_eval.py:119-159)
  • _JudgeBundle Qwen3-32B LLM judge (mmbench_eval.py:229+)
  • _prefetch_circular_group / _eval_circular_group (circular evaluation,
    4 rotations must all be correct)
  • _report_acc (per-split, per-l2-category accuracy)
  • mode: full/generate/score two-phase flow + llm_extract config block
  • MMBench_TEST_EN_V11 / MMBench_TEST_CN_V11 dataset entries

After this PR, mmbench would only produce generation output with no
accuracy numbers and no V11 test splits.

Suggested fix: keep the two-phase architecture; shard only the
generation phase, leave scoring (judge / circular eval) intact on rank 0.

2. Backbone adapters are not LOCAL_RANK-aware

The PR description acknowledges this as a follow-up item. In practice it
means that on main today, only bagel works under multi-card —
bagel/adapter.py:356-368 is the only place that reads LOCAL_RANK and
pins memory to the assigned GPU. Every other backbone (show_o v1/v2,
janus_pro, emu3, omnigen2, emu3_5, mmada, ovis_u1, blip3o,
deepgen, tokenflow, janus_flow) either hardcodes cuda:0 /
.cuda() or defaults device_map="auto" without checking LOCAL_RANK.
Under torchrun --nproc_per_node=N, all N ranks load to cuda:0 and OOM.

This needs to be resolved before merge, and we see two reasonable paths
— happy to go with either:

  • Extend this PR (or a follow-up PR from you) to cover the adapter side.
    The cleanest one-shot fix is a single early
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"] in
    umm/__init__.py (guarded on WORLD_SIZE > 1), which makes every
    hardcoded-cuda:0 adapter automatically pin to its rank's GPU. No
    per-adapter changes required. Bagel's existing clamp logic at
    adapter.py:359 already handles the single-visible-GPU case correctly.
  • Or we land the adapter-side fix in our fork first, then come back to
    merge this PR once that's in. In that case I'd ask you to either pause
    this PR or rebase on top of our adapter changes before re-requesting
    review, so the multi-card path is end-to-end testable when it lands.

Let us know which you prefer.

3. Scoring-phase GPU allocation conflict

The PR's pattern (gen sharded, rank 0 runs scoring in-process) breaks
two-phase eval configs that rely on vLLM tensor parallelism for the
judge model (e.g. wise, geneval, imgedit, gedit, ueval,
unified_bench with score_gpu: A100-80GB:3 + Qwen2.5-VL-72B). When the
launcher uses torchrun for the score phase, N copies of the 72B judge
are loaded in parallel, OOMing immediately. The mathvista in-process
LLM extraction has a similar issue: rank 0's CUDA_VISIBLE_DEVICES
is already pinned to one GPU, so device_map="auto" for the extractor
can't spread across the original GPU set.

Suggested fix: torchrun must only wrap the generation phase of
sharded benchmarks. Scoring phases that use vLLM/HF auto-sharding must
launch as a single-process job that sees all GPUs.

4. merge_shards dedup is a no-op

In distributed.py, the dedup key is (key, id(item)). Since id(item) is
unique per Python object, the dedup set never matches, so re-running with
a smaller world_size than a prior run (the documented use case in the
docstring) still produces duplicates. Fix: dedup on key alone.
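
A sketch of the suggested fix, deduplicating on the stable per-item key. The shard naming scheme and key field are illustrative, not the actual distributed.py code:

```python
# Illustrative merge with key-based dedup (not the actual distributed.py).
import glob
import json


def merge_shards(checkpoint, key_field="sample_idx"):
    """Merge per-rank JSONL shards, keeping the first record per key.
    Duplicates can appear when re-running with a smaller world_size
    than a prior run, since rank assignments shift."""
    merged, seen = [], set()
    for shard in sorted(glob.glob(f"{checkpoint}.rank*.jsonl")):
        with open(shard) as f:
            for line in f:
                item = json.loads(line)
                key = item[key_field]   # dedup on the key alone, not id(item)
                if key in seen:
                    continue
                seen.add(key)
                merged.append(item)
    return merged
```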

5. init_process_group missing device_id

PyTorch 2.x emits a "rank-to-GPU mapping currently unknown" warning at
every dist.barrier() call and can hang if mapping is incorrect. Easy
fix: pass device_id=torch.device(f"cuda:{local_rank}") to
init_process_group.

Nits

  • max_samples semantics changed silently: previously per-process
    count, now global iteration index. Multi-card with max_samples=10
    now means ~10 total samples, not 10 per rank. Worth a note in the
    README.
  • run_sharded_inference mutates the caller's done_ids set in place.
    Consider done = set(done_ids) for safety.
  • tqdm progress only on rank 0 displays iteration count (including
    skipped samples), not actual inferences. Slightly misleading.

Recommendation

I'd suggest splitting this into smaller, mergeable pieces:

  1. PR-A: distributed.py + runner.py + the 4 sharded CLIs
    (mathvista / mme / mmmu / mmvet), with the dedup and
    device_id fixes. No mmbench changes.
  2. PR-B: A separate CUDA_VISIBLE_DEVICES = LOCAL_RANK early-init hook
    so all adapters work multi-card without per-adapter patches.
  3. PR-C: Either drop the mmbench changes entirely, or rework them to
    preserve the circular-eval / LLM-judge scoring infrastructure.
