Add parallel eval runner for understanding benchmarks (with multiple GPUs, the CLI can run understanding benchmarks with multi-GPU parallel inference) #5
Introduces two small modules under src/umm/eval/ that lift the distributed-init / sharding / shard-merge boilerplate out of each understanding-class eval CLI.

* src/umm/eval/distributed.py — `DistInfo` dataclass, dist init/barrier/all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy torch import so single-card callers pay no import cost.
* src/umm/eval/runner.py — `run_sharded_inference()`: round-robin sample assignment by sample_idx, per-rank JSONL shard append (flush+fsync), resume via caller-supplied done_ids, optional global max_samples cap. Accepts an `infer_fn` callable so the runner is unit-testable without a real model.

Refactors the mmbench/mme/mmmu/mathvista/mmvet eval CLIs to use the runner. mathvista and mmvet gain parallel support; the others have their duplicated dist plumbing replaced. The runner does only "shard inference + shard merge". Each CLI keeps its post-processing (Excel/JSON output, calculation.py invocation, mathvista's LLM extraction) behind `if rank == 0:`.

Behavior in single-card mode: final user-facing outputs are identical. Mid-run checkpoint formats change (mme TSV→JSONL during the run; mathvista and mmvet dict-JSON→JSONL), so a partial run started with the prior code cannot be resumed by this code; fresh runs work identically.

Out of scope (follow-up PRs): each backbone adapter's LOCAL_RANK handling — only show_o currently honors LOCAL_RANK; the others default to cuda:0 or device_map="auto" and need adaptation before they work correctly under torchrun multi-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
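For reference, a minimal sketch of the runner loop described above; the exact signature, argument names, and record fields are assumptions, not the code in this PR:

```python
import json
import os

def run_sharded_inference(samples, infer_fn, rank, world_size,
                          shard_path, done_ids=frozenset(), max_samples=None):
    """Sketch: each rank answers its round-robin share and appends to its own shard."""
    with open(shard_path, "a") as f:
        for sample_idx, sample in enumerate(samples):
            if max_samples is not None and sample_idx >= max_samples:
                break                            # global cap on the iteration index
            if sample_idx % world_size != rank:
                continue                         # round-robin assignment by sample_idx
            if sample_idx in done_ids:
                continue                         # resume: skip already-answered samples
            record = {"sample_idx": sample_idx, "answer": infer_fn(sample)}
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())                 # crash-safe per-rank JSONL checkpoint
```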
ApiaoSamaa
left a comment
Thanks for the parallel-eval contribution — the distributed.py / runner.py
structure is clean and the single-card backward compatibility is preserved.
However, I'd like to request changes before this can merge, based on
end-to-end testing on Modal with bagel + 2× A100-80GB. Several issues
need to be addressed:
Blockers
1. mmbench_eval.py is a large functional regression
The diff removes ~600 lines of VLMEvalKit-compatible scoring infrastructure
that exists in main:
- `_can_infer*` exact-match extraction (mmbench_eval.py:119-159)
- `JudgeBundle` Qwen3-32B LLM judge (mmbench_eval.py:229+)
- `_prefetch_circular_group` / `_eval_circular_group` (circular evaluation, all 4 rotations must be correct)
- `_report_acc` (per-split, per-l2-category accuracy)
- `mode: full/generate/score` two-phase flow + `llm_extract` config block
- `MMBench_TEST_EN_V11` / `MMBench_TEST_CN_V11` dataset entries
After this PR, mmbench would only produce generation output with no
accuracy numbers and no V11 test splits.
Suggested fix: keep the two-phase architecture; shard only the
generation phase, leave scoring (judge / circular eval) intact on rank 0.
2. Backbone adapters are not LOCAL_RANK-aware
The PR description acknowledges this as a follow-up item. In practice it
means that on main today, only bagel works under multi-card —
bagel/adapter.py:356-368 is the only place that reads LOCAL_RANK and
pins memory to the assigned GPU. Every other backbone (show_o v1/v2,
janus_pro, emu3, omnigen2, emu3_5, mmada, ovis_u1, blip3o,
deepgen, tokenflow, janus_flow) either hardcodes cuda:0 /
.cuda() or defaults device_map="auto" without checking LOCAL_RANK.
Under torchrun --nproc_per_node=N, all N ranks load to cuda:0 and OOM.
This needs to be resolved before merge, and we see two reasonable paths
— happy to go with either:
- Extend this PR (or a follow-up PR from you) to cover the adapter side. The cleanest one-shot fix is a single early `os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]` in `umm/__init__.py` (guarded on `WORLD_SIZE > 1`), which makes every hardcoded-`cuda:0` adapter automatically pin to its rank's GPU. No per-adapter changes required. Bagel's existing clamp logic at `adapter.py:359` already handles the single-visible-GPU case correctly.
- Or we land the adapter-side fix in our fork first, then come back to merge this PR once that's in. In that case I'd ask you to either pause this PR or rebase on top of our adapter changes before re-requesting review, so the multi-card path is end-to-end testable when it lands.

Let us know which you prefer. A minimal sketch of the proposed hook is below.
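A minimal sketch of that early-init hook, assuming it runs in `umm/__init__.py` before any CUDA initialization:

```python
import os

# Guarded on WORLD_SIZE > 1 so single-process runs are untouched.
if int(os.environ.get("WORLD_SIZE", "1")) > 1 and "LOCAL_RANK" in os.environ:
    # Pin each rank to a single visible GPU; adapters that hardcode cuda:0
    # then transparently land on their rank's device.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]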
3. Scoring-phase GPU allocation conflict
The PR's pattern (gen sharded, rank 0 runs scoring in-process) breaks
two-phase eval configs that rely on vLLM tensor parallelism for the
judge model (e.g. wise, geneval, imgedit, gedit, ueval,
unified_bench with score_gpu: A100-80GB:3 + Qwen2.5-VL-72B). When the
launcher uses torchrun for the score phase, N copies of the 72B judge
are loaded in parallel, OOMing immediately. The mathvista in-process
LLM extraction has a similar issue: rank 0's CUDA_VISIBLE_DEVICES
is already pinned to one GPU, so device_map="auto" for the extractor
can't spread across the original GPU set.
Suggested fix: torchrun must only wrap the generation phase of
sharded benchmarks. Scoring phases that use vLLM/HF auto-sharding must
launch as a single-process job that sees all GPUs.
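As an illustration of the failure mode only (not a replacement for the launcher-side fix above), a hypothetical defensive guard a scoring phase could add; `run_llm_judge` is an illustrative placeholder for the vLLM / HF auto-sharded judge call:

```python
import os

def score_phase(records, cfg, run_llm_judge):
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size > 1:
        # Launched under torchrun with N ranks: every rank would load its own
        # copy of the judge (e.g. a 72B model) and OOM immediately.
        raise RuntimeError(
            "Scoring uses vLLM / device_map='auto'; launch it as a "
            "single-process job without torchrun so it can see all GPUs."
        )
    return run_llm_judge(records, cfg)
```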
4. merge_shards dedup is a no-op
In distributed.py, dedup = (key, id(item)) — id(item) is unique
per Python object, so the dedup set never matches, even when re-running
with a smaller world_size than a prior run (the documented use case
in the docstring). Fix: dedup = key.
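A hedged sketch of the corrected merge (the function shape and field names are assumptions; only the dedup-on-key change matters):

```python
import glob
import json

def merge_shards(pattern, key_field="sample_idx"):
    merged, seen = [], set()
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                item = json.loads(line)
                key = item[key_field]   # dedup on the sample key alone, not
                if key in seen:          # (key, id(item)); id() is unique per
                    continue             # object, so that pair never repeats
                seen.add(key)
                merged.append(item)
    return merged
```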
5. init_process_group missing device_id
PyTorch 2.x emits a "rank-to-GPU mapping currently unknown" warning at
every dist.barrier() call and can hang if mapping is incorrect. Easy
fix: pass device_id=torch.device(f"cuda:{local_rank}") to
init_process_group.
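A minimal sketch of the suggested call (backend and env-var handling are assumptions):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    # Binding the rank to its GPU up front avoids the
    # "rank-to-GPU mapping currently unknown" warning on every barrier.
    device_id=torch.device(f"cuda:{local_rank}"),
)
```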
Nits
- `max_samples` semantics changed silently: previously a per-process count, now a global iteration index. Multi-card with `max_samples=10` now means ~10 total samples, not 10 per rank. Worth a note in the README.
- `run_sharded_inference` mutates the caller's `done_ids` set in place. Consider `done = set(done_ids)` for safety.
- tqdm progress on rank 0 shows the iteration count (including skipped samples), not actual inferences. Slightly misleading.
Recommendation
I'd suggest splitting this into smaller, mergeable pieces:
- PR-A: `distributed.py` + `runner.py` + the 4 sharded CLIs (mathvista/mme/mmmu/mmvet), with the dedup and `device_id` fixes. No mmbench changes.
- PR-B: A separate `CUDA_VISIBLE_DEVICES = LOCAL_RANK` early-init hook so all adapters work multi-card without per-adapter patches.
- PR-C: Either drop the mmbench changes entirely, or rework them to preserve the circular-eval / LLM-judge scoring infrastructure.
Hi @ApiaoSamaa, thanks for open-sourcing this excellent work! I'm a fan of Professor Jindong's work. A few days ago I was using umm to run understanding benchmarks, adapted it myself for parallel inference on a multi-GPU setup, and am submitting this PR to support the project~
Summary
Introduces a small, focused runner for distributed sharded inference and refactors the 5 understanding-class eval CLIs (mmbench, mme, mmmu, mathvista, mmvet) to use it. Lifts the distributed-init / round-robin sharding / per-rank JSONL checkpoint / rank-0 merge boilerplate into one place.
After this PR, all 5 CLIs can run under torchrun --nproc_per_node=N for data-parallel evaluation, and single-card behavior is preserved (final user-facing output files are byte-equivalent for benchmarks that already had a defined output format).
What's added?
Distributed runner changes
Per-CLI changes
All 5 CLIs (mmbench, mme, mmmu, mathvista, mmvet) gain parallel support and share the same shape:
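A hedged sketch of that shared shape (`run_sharded_inference`, `DistInfo` and `merge_shards` come from this PR; the argument names and helper callables are illustrative):

```python
from umm.eval.distributed import merge_shards
from umm.eval.runner import run_sharded_inference

def run_eval(cfg, dist_info, load_samples, infer_fn, postprocess):
    samples = load_samples(cfg)           # benchmark-specific loader
    done_ids = set()                      # filled from existing shards on resume

    # Every rank answers its round-robin share, appending to its own JSONL shard.
    run_sharded_inference(
        samples=samples,
        infer_fn=infer_fn,
        dist_info=dist_info,
        done_ids=done_ids,
        max_samples=cfg.get("max_samples"),
    )

    # Only rank 0 merges shards and runs the benchmark's own post-processing
    # (Excel/JSON output, calculation.py, mathvista's LLM extraction, ...).
    if dist_info.rank == 0:
        records = merge_shards(f"{cfg['output_dir']}/*.rank*.jsonl")
        postprocess(records, cfg)
```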
Usage
Single-GPU (unchanged)
PYTHONPATH=src python -m umm.cli.main eval --config <cfg>.yaml
Multi-GPU
PYTHONPATH=src torchrun --nproc_per_node=8 -m umm.cli.main eval --config <cfg>.yaml
TIPS
I have adapted mme, mmbench, mmmu, mathvista, and mmvet, but have not adapted the generation-task benchmarks. Some possible follow-up PRs:
Backbone LOCAL_RANK adaptation. GPU device assignment needs to be handled in every backbone's adapter.py file:
replacing the hardcoded device placement (cuda:0 / .cuda() / device_map="auto") with LOCAL_RANK-aware placement; a sketch of the before/after is below.
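A hedged before/after sketch of the kind of change each adapter needs (exact lines differ per backbone; `load_model` is an illustrative placeholder):

```python
import os
import torch

def place_model(load_model):
    # Before (current adapters): every rank loads onto the same device and
    # OOMs under torchrun:
    #     model = load_model().to("cuda:0")
    # After: honor LOCAL_RANK when launched via torchrun, fall back to cuda:0.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    return load_model().to(device)
```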