Add parallel eval runner for understanding benchmarks (with multiple GPUs, the CLI can run understanding benchmarks with multi-GPU parallel inference) #5
Introduces two small modules under src/umm/eval/ that lift the distributed-init / sharding / shard-merge boilerplate out of each understanding-class eval CLI.

* src/umm/eval/distributed.py — `DistInfo` dataclass, dist init/barrier/all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy torch import so single-card callers pay no import cost.
* src/umm/eval/runner.py — `run_sharded_inference()`: round-robin sample assignment by sample_idx, per-rank JSONL shard append (flush+fsync), resume via caller-supplied done_ids, optional global max_samples cap. Accepts an `infer_fn` callable so the runner is unit-testable without a real model.

Refactors the mmbench/mme/mmmu/mathvista/mmvet eval CLIs to use the runner. mathvista and mmvet gain parallel support; the others have their duplicated dist plumbing replaced. The runner does only "shard inference + shard merge". Each CLI keeps its post-processing (Excel/JSON output, calculation.py invocation, mathvista's LLM extraction) behind `if rank == 0:`.

Behavior in single-card mode: final user-facing outputs are identical. Mid-run checkpoint formats change (mme TSV→JSONL during the run; mathvista and mmvet dict-JSON→JSONL), so a partial run started with the prior code cannot be resumed by this code; fresh runs work identically.

Out of scope (follow-up PRs): each backbone adapter's LOCAL_RANK handling — only show_o currently honors LOCAL_RANK; the others default to cuda:0 or device_map="auto" and need adaptation before they work correctly under torchrun multi-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
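For reference, a minimal sketch of the runner loop described above; the exact signature, argument names, and record fields are assumptions, not the code in this PR:

```python
import json
import os

def run_sharded_inference(samples, infer_fn, rank, world_size,
                          shard_path, done_ids=frozenset(), max_samples=None):
    """Sketch: each rank answers its round-robin share and appends to its own shard."""
    with open(shard_path, "a") as f:
        for sample_idx, sample in enumerate(samples):
            if max_samples is not None and sample_idx >= max_samples:
                break                            # global cap on the iteration index
            if sample_idx % world_size != rank:
                continue                         # round-robin assignment by sample_idx
            if sample_idx in done_ids:
                continue                         # resume: skip already-answered samples
            record = {"sample_idx": sample_idx, "answer": infer_fn(sample)}
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())                 # crash-safe per-rank JSONL checkpoint
```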
ApiaoSamaa
left a comment
Thanks for the parallel-eval contribution — the distributed.py / runner.py
structure is clean and the single-card backward compatibility is preserved.
However, I'd like to request changes before this can merge, based on
end-to-end testing on Modal with bagel + 2× A100-80GB. Several issues
need to be addressed:
Blockers
1. mmbench_eval.py is a large functional regression
The diff removes ~600 lines of VLMEvalKit-compatible scoring infrastructure
that exists in main:
- `_can_infer*` exact-match extraction (mmbench_eval.py:119-159)
- `JudgeBundle` Qwen3-32B LLM judge (mmbench_eval.py:229+)
- `_prefetch_circular_group` / `_eval_circular_group` (circular evaluation, all 4 rotations must be correct)
- `_report_acc` (per-split, per-l2-category accuracy)
- `mode: full/generate/score` two-phase flow + `llm_extract` config block
- `MMBench_TEST_EN_V11` / `MMBench_TEST_CN_V11` dataset entries
After this PR, mmbench would only produce generation output with no
accuracy numbers and no V11 test splits.
Suggested fix: keep the two-phase architecture; shard only the
generation phase, leave scoring (judge / circular eval) intact on rank 0.
2. Backbone adapters are not LOCAL_RANK-aware
The PR description acknowledges this as a follow-up item. In practice it
means that on main today, only bagel works under multi-card —
bagel/adapter.py:356-368 is the only place that reads LOCAL_RANK and
pins memory to the assigned GPU. Every other backbone (show_o v1/v2,
janus_pro, emu3, omnigen2, emu3_5, mmada, ovis_u1, blip3o,
deepgen, tokenflow, janus_flow) either hardcodes cuda:0 /
.cuda() or defaults device_map="auto" without checking LOCAL_RANK.
Under torchrun --nproc_per_node=N, all N ranks load to cuda:0 and OOM.
This needs to be resolved before merge, and we see two reasonable paths
— happy to go with either:
- Extend this PR (or a follow-up PR from you) to cover the adapter side. The cleanest one-shot fix is a single early `os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]` in `umm/__init__.py` (guarded on `WORLD_SIZE > 1`), which makes every hardcoded-`cuda:0` adapter automatically pin to its rank's GPU. No per-adapter changes required. Bagel's existing clamp logic at `adapter.py:359` already handles the single-visible-GPU case correctly.
- Or we land the adapter-side fix in our fork first, then come back to merge this PR once that's in. In that case I'd ask you to either pause this PR or rebase on top of our adapter changes before re-requesting review, so the multi-card path is end-to-end testable when it lands.

Let us know which you prefer. A minimal sketch of the proposed hook is below.
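A minimal sketch of that early-init hook, assuming it runs in `umm/__init__.py` before any CUDA initialization:

```python
import os

# Guarded on WORLD_SIZE > 1 so single-process runs are untouched.
if int(os.environ.get("WORLD_SIZE", "1")) > 1 and "LOCAL_RANK" in os.environ:
    # Pin each rank to a single visible GPU; adapters that hardcode cuda:0
    # then transparently land on their rank's device.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]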
3. Scoring-phase GPU allocation conflict
The PR's pattern (gen sharded, rank 0 runs scoring in-process) breaks
two-phase eval configs that rely on vLLM tensor parallelism for the
judge model (e.g. wise, geneval, imgedit, gedit, ueval,
unified_bench with score_gpu: A100-80GB:3 + Qwen2.5-VL-72B). When the
launcher uses torchrun for the score phase, N copies of the 72B judge
are loaded in parallel, OOMing immediately. The mathvista in-process
LLM extraction has a similar issue: rank 0's CUDA_VISIBLE_DEVICES
is already pinned to one GPU, so device_map="auto" for the extractor
can't spread across the original GPU set.
Suggested fix: torchrun must only wrap the generation phase of
sharded benchmarks. Scoring phases that use vLLM/HF auto-sharding must
launch as a single-process job that sees all GPUs.
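As an illustration of the failure mode only (not a replacement for the launcher-side fix above), a hypothetical defensive guard a scoring phase could add; `run_llm_judge` is an illustrative placeholder for the vLLM / HF auto-sharded judge call:

```python
import os

def score_phase(records, cfg, run_llm_judge):
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1:
        # Launched under torchrun with N ranks: every rank would load its own
        # copy of the judge (e.g. a 72B model) and OOM immediately.
        raise RuntimeError(
            "Scoring uses vLLM / device_map='auto'; launch it as a "
            "single-process job without torchrun so it can see all GPUs."
        )
    return run_llm_judge(records, cfg)
```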
4. merge_shards dedup is a no-op
In distributed.py, dedup = (key, id(item)) — id(item) is unique
per Python object, so the dedup set never matches, even when re-running
with a smaller world_size than a prior run (the documented use case
in the docstring). Fix: dedup = key.
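A hedged sketch of the corrected merge (the function shape and field names are assumptions; only the dedup-on-key change matters):

```python
import glob
import json

def merge_shards(pattern, key_field="sample_idx"):
    merged, seen = [], set()
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                item = json.loads(line)
                key = item[key_field]   # dedup on the sample key alone, not
                if key in seen:          # (key, id(item)); id() is unique per
                    continue             # object, so that pair never repeats
                seen.add(key)
                merged.append(item)
    return merged
```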
5. init_process_group missing device_id
PyTorch 2.x emits a "rank-to-GPU mapping currently unknown" warning at
every dist.barrier() call and can hang if mapping is incorrect. Easy
fix: pass device_id=torch.device(f"cuda:{local_rank}") to
init_process_group.
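A minimal sketch of the suggested call (backend and env-var handling are assumptions):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    # Binding the rank to its GPU up front avoids the
    # "rank-to-GPU mapping currently unknown" warning on every barrier.
    device_id=torch.device(f"cuda:{local_rank}"),
)
```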
Nits
- `max_samples` semantics changed silently: previously a per-process count, now a global iteration index. Multi-card with `max_samples=10` now means ~10 total samples, not 10 per rank. Worth a note in the README.
- `run_sharded_inference` mutates the caller's `done_ids` set in place. Consider `done = set(done_ids)` for safety.
- tqdm progress on rank 0 shows the iteration count (including skipped samples), not actual inferences. Slightly misleading.
Recommendation
I'd suggest splitting this into smaller, mergeable pieces:
- PR-A: `distributed.py` + `runner.py` + the 4 sharded CLIs (mathvista/mme/mmmu/mmvet), with the dedup and `device_id` fixes. No mmbench changes.
- PR-B: A separate `CUDA_VISIBLE_DEVICES = LOCAL_RANK` early-init hook so all adapters work multi-card without per-adapter patches.
- PR-C: Either drop the mmbench changes entirely, or rework them to preserve the circular-eval / LLM-judge scoring infrastructure.
Hi @ApiaoSamaa, thanks for open-sourcing this excellent work! I'm a fan of Professor Jindong's work. A few days ago I was using umm to run understanding benchmarks, adapted it myself for parallel inference on a multi-GPU setup, and am submitting this PR to support the project~
Summary
Introduces a small, focused runner for distributed sharded inference and refactors the 5 understanding-class eval CLIs (mmbench, mme, mmmu, mathvista, mmvet) to use it. Lifts the distributed-init / round-robin sharding / per-rank JSONL checkpoint / rank-0 merge boilerplate into one place.
After this PR, all 5 CLIs can run under torchrun --nproc_per_node=N for data-parallel evaluation, and single-card behavior is preserved (final user-facing output files are byte-equivalent for benchmarks that already had a defined output format).
What's added?
Distributed runner changes
Per-CLI changes
All 5 CLIs (mmbench, mme, mmmu, mathvista, mmvet) gain parallel support and share the same shape:
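A hedged sketch of that shared shape (`run_sharded_inference`, `DistInfo` and `merge_shards` come from this PR; the argument names and helper callables are illustrative):

```python
from umm.eval.distributed import merge_shards
from umm.eval.runner import run_sharded_inference

def run_eval(cfg, dist_info, load_samples, infer_fn, postprocess):
    samples = load_samples(cfg)           # benchmark-specific loader
    done_ids = set()                      # filled from existing shards on resume

    # Every rank answers its round-robin share, appending to its own JSONL shard.
    run_sharded_inference(
        samples=samples,
        infer_fn=infer_fn,
        dist_info=dist_info,
        done_ids=done_ids,
        max_samples=cfg.get("max_samples"),
    )

    # Only rank 0 merges shards and runs the benchmark's own post-processing
    # (Excel/JSON output, calculation.py, mathvista's LLM extraction, ...).
    if dist_info.rank == 0:
        records = merge_shards(f"{cfg['output_dir']}/*.rank*.jsonl")
        postprocess(records, cfg)
```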
Usage
Single-GPU (unchanged)
PYTHONPATH=src python -m umm.cli.main eval --config <cfg>.yaml
Multi-GPU
PYTHONPATH=src torchrun --nproc_per_node=8 -m umm.cli.main eval --config <cfg>.yaml
TIPS
I have adapted mme, mmbench, mmmu, mathvista, and mmvet, but have not adapted the generation-task benchmarks. Some possible follow-up PRs:
Backbone LOCAL_RANK adaptation. GPU device assignment needs to be handled in every backbone's adapter.py file:
replacing the hardcoded device placement (cuda:0 / .cuda() / device_map="auto") with LOCAL_RANK-aware placement; a sketch of the before/after is below.
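A hedged before/after sketch of the kind of change each adapter needs (exact lines differ per backbone; `load_model` is an illustrative placeholder):

```python
import os
import torch

def place_model(load_model):
    # Before (current adapters): every rank loads onto the same device and
    # OOMs under torchrun:
    #     model = load_model().to("cuda:0")
    # After: honor LOCAL_RANK when launched via torchrun, fall back to cuda:0.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    return load_model().to(device)
```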