VSCD: Video-based Scene Change Detection in Unaligned Scenes

This repository provides the official PyTorch implementation of the paper
"VSCD: Video-based Scene Change Detection in Unaligned Scenes" (ICML 2026).


Jiae Yoon · Ue-Hwan Kim
ICML 2026

VSCD introduces a video-based scene change detection setting with unconstrained camera motion, strong cross-view misalignment, and multiple object-level changes.

Highlights

  • Video-based Scene Change Detection (VSCD): predicts a dense change mask for each query frame given an unaligned reference video and query video.
  • Large-scale VSCD benchmark: provides reference-query video pairs with query-aligned pixel-wise change masks.
  • Synthetic and real-world evaluation: includes a large synthetic benchmark and a real-world test set for sim-to-real evaluation.
  • VSCDNet: a query-centric multi-reference model with frame-level alignment, patch-level correspondence, confidence-aware feature fusion, and query-guided high-resolution decoding.
  • Real-world validation: demonstrated on a mobile robot for visual surveillance and object incremental learning.

Installation

1. Clone this repository

git clone https://github.com/AutoCompSysLab/VSCD.git
cd VSCD

2. Install dependencies

pip install -r requirements.txt

3. Prepare Segment Anything

VSCDNet uses a frozen SAM ViT-B image encoder. Please clone the official Segment Anything repository and download the SAM ViT-B checkpoint.

git clone https://github.com/facebookresearch/segment-anything.git

The training and evaluation commands expect the following paths:

SAM_ROOT=/path/to/segment-anything
SAM_CKPT=/path/to/sam_vit_b_01ec64.pth
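
Before launching a long run, it can help to confirm that both paths resolve. The following is a minimal sketch, not part of this repository: the helper name `check_sam_paths` is hypothetical, and the check only relies on the official Segment Anything repository keeping its Python package in a `segment_anything/` subdirectory.

```python
from pathlib import Path


def check_sam_paths(sam_root: str, sam_ckpt: str) -> list[str]:
    """Return a list of problems; an empty list means both paths look usable.

    Hypothetical helper for illustration; it is not provided by this repository.
    """
    problems = []
    root = Path(sam_root).expanduser()
    ckpt = Path(sam_ckpt).expanduser()
    # The cloned repository should contain the `segment_anything/` package folder.
    if not (root / "segment_anything").is_dir():
        problems.append(f"SAM_ROOT has no segment_anything/ package: {root}")
    # The checkpoint should be an existing .pth file.
    if not ckpt.is_file():
        problems.append(f"SAM_CKPT is not a file: {ckpt}")
    elif ckpt.suffix != ".pth":
        problems.append(f"SAM_CKPT does not end in .pth: {ckpt}")
    return problems
```

Calling `check_sam_paths(os.environ["SAM_ROOT"], os.environ["SAM_CKPT"])` and printing the returned list is one way to fail fast before training starts.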

Training

Run:

DATA_ROOT=/path/to/vscd_dataset \
SAM_ROOT=/path/to/segment-anything \
SAM_CKPT=/path/to/sam_vit_b_01ec64.pth \
bash scripts/train_default.sh

Or directly:

python train.py \
  --data_root /path/to/vscd_dataset \
  --sam_root /path/to/segment-anything \
  --sam_ckpt /path/to/sam_vit_b_01ec64.pth \
  --out_dir ./runs/vscd_default \
  --device cuda \
  --epochs 80 \
  --seed 0 \
  --batch_size 2 \
  --num_workers 2 \
  --num_frames_fixed 32 \
  --backbone_chunk 2 \
  --amp \
  --topk_ref_per_t 4 \
  --max_ref_cands_per_t 6 \
  --local_k 5 \
  --max_msp_len 5 \
  --lr 1e-4 \
  --weight_decay 0.01 \
  --val_thr 0.5 \
  --softmax_temp 0.5
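
For repeated runs (for example, several seeds), the command above can be assembled programmatically. This is a sketch under assumptions: `build_train_cmd` and the `./runs/vscd_seed{N}` output directories are hypothetical, and only flags that appear in the command above are used. `--amp` is treated as a bare switch because the command passes it without a value.

```python
import shlex
import sys


def build_train_cmd(options: dict, script: str = "train.py") -> list[str]:
    """Turn an options dict into an argv list for the training script.

    True becomes a bare flag (e.g. --amp), False/None is skipped, and any
    other value is passed as `--name value`. Hypothetical helper.
    """
    cmd = [sys.executable, script]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Shared settings, copied from the command shown above.
base = {
    "data_root": "/path/to/vscd_dataset",
    "sam_root": "/path/to/segment-anything",
    "sam_ckpt": "/path/to/sam_vit_b_01ec64.pth",
    "device": "cuda",
    "epochs": 80,
    "batch_size": 2,
    "amp": True,
}

# One command per seed, each with its own (hypothetical) output directory.
commands = [
    build_train_cmd({**base, "seed": seed, "out_dir": f"./runs/vscd_seed{seed}"})
    for seed in (0, 1, 2)
]

if __name__ == "__main__":
    for cmd in commands:
        # Print only; pass `cmd` to subprocess.run(cmd, check=True) to launch.
        print(shlex.join(cmd))
```

The identity checks (`value is True`) keep integer options such as `--seed 0` from being mistaken for a disabled flag.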

Checkpoints are saved under:

./runs/vscd_default/
  ckpt_current.pt
  ckpt_best.pt
  ckpt_last.pt

Evaluation

Evaluate a trained checkpoint:

DATA_ROOT=/path/to/vscd_dataset \
SAM_ROOT=/path/to/segment-anything \
SAM_CKPT=/path/to/sam_vit_b_01ec64.pth \
MODEL_CKPT=./runs/vscd_default/ckpt_best.pt \
bash scripts/eval_default.sh

Or directly:

python eval.py \
  --data_root /path/to/vscd_dataset \
  --split test \
  --sam_root /path/to/segment-anything \
  --sam_ckpt /path/to/sam_vit_b_01ec64.pth \
  --model_ckpt ./runs/vscd_default/ckpt_best.pt \
  --num_frames_fixed 32 \
  --batch_size 1 \
  --num_workers 2 \
  --amp \
  --thr 0.5

To save predicted masks:

python eval.py \
  --data_root /path/to/vscd_dataset \
  --split test \
  --sam_root /path/to/segment-anything \
  --sam_ckpt /path/to/sam_vit_b_01ec64.pth \
  --model_ckpt ./runs/vscd_default/ckpt_best.pt \
  --save_dir ./outputs/vscd_default \
  --save_pred

Dataset

The VSCD dataset will be released on Hugging Face:

https://huggingface.co/datasets/jiae1234/vscd

The benchmark contains reference-query video pairs from the same indoor environment, captured at different times and along different camera trajectories. For each query video, pixel-wise change masks are provided for object-level changes such as appearances, disappearances, and relocations.

Dataset Structure

The code expects the dataset to be organized as follows:

<DATA_ROOT>/
  train/
    <space_name>/
      scene0_frames/
        scene0_0000.jpg
        scene0_0001.jpg
        ...
      scene1_frames/
        scene1_0000.jpg
        scene1_0001.jpg
        ...
      scene2_frames/
      scene3_frames/
      scene4_frames/
      change_mask_scene0_to_scene1_frames/
        change_mask_scene0_to_scene1_0000.jpg
        change_mask_scene0_to_scene1_0001.jpg
        ...
      change_mask_scene1_to_scene2_frames/
      change_mask_scene2_to_scene3_frames/
      change_mask_scene3_to_scene4_frames/
      change_mask_scene4_to_scene0_frames/

  test/
    <space_name>/
      scene0_frames/
      scene1_frames/
      scene2_frames/
      scene3_frames/
      scene4_frames/
      change_mask_scene0_to_scene1_frames/
      change_mask_scene1_to_scene2_frames/
      change_mask_scene2_to_scene3_frames/
      change_mask_scene3_to_scene4_frames/
      change_mask_scene4_to_scene0_frames/

The default directed scene pairs are:

0 -> 1
1 -> 2
2 -> 3
3 -> 4
4 -> 0
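
The pairs above form a cycle over the five scenes, so a downloaded `<space_name>/` folder can be checked against the expected layout. The sketch below is illustrative only; `directed_pairs` and `missing_dirs` are hypothetical names, not functions from this repository.

```python
from pathlib import Path


def directed_pairs(num_scenes: int = 5) -> list[tuple[int, int]]:
    """Cyclic (reference, query) scene indices: 0->1, 1->2, ..., last->0."""
    return [(a, (a + 1) % num_scenes) for a in range(num_scenes)]


def missing_dirs(space_dir: str, num_scenes: int = 5) -> list[str]:
    """Expected sub-directory names that are absent from one <space_name>/ folder."""
    expected = [f"scene{i}_frames" for i in range(num_scenes)]
    expected += [
        f"change_mask_scene{a}_to_scene{b}_frames"
        for a, b in directed_pairs(num_scenes)
    ]
    root = Path(space_dir)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list from `missing_dirs` means the folder contains all five frame directories and all five mask directories.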

Each training/evaluation sample consists of:

  • a reference video from scene{a}_frames/,
  • a query video from scene{b}_frames/,
  • query-aligned change masks from change_mask_scene{a}_to_scene{b}_frames/.
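
The three parts of a sample can be gathered from disk roughly as follows. This is a sketch, not the repository's data loader: `frames_by_index` and `build_sample` are hypothetical, and it assumes that the trailing number in a mask filename matches the trailing number of the query frame it annotates.

```python
import re
from pathlib import Path

# Trailing frame index, e.g. "0012" in "scene1_0012.jpg".
_INDEX = re.compile(r"_(\d+)\.jpg$")


def frames_by_index(folder: Path) -> dict[int, Path]:
    """Map each trailing frame index to its .jpg file in `folder`."""
    out = {}
    for path in sorted(folder.glob("*.jpg")):
        match = _INDEX.search(path.name)
        if match:
            out[int(match.group(1))] = path
    return out


def build_sample(space_dir: str, a: int, b: int) -> dict:
    """Collect reference frames and (query frame, mask) pairs for scene a -> b."""
    root = Path(space_dir)
    ref = frames_by_index(root / f"scene{a}_frames")
    query = frames_by_index(root / f"scene{b}_frames")
    masks = frames_by_index(root / f"change_mask_scene{a}_to_scene{b}_frames")
    # Assumption: mask index i annotates query frame i.
    shared = sorted(query.keys() & masks.keys())
    return {
        "reference": [ref[i] for i in sorted(ref)],
        "query": [(query[i], masks[i]) for i in shared],
        # Query frame indices without a mask, useful for spotting incomplete data.
        "unpaired_query": sorted(query.keys() - masks.keys()),
    }
```

The reference list is kept separate from the query pairs because the two videos follow different camera trajectories and need not have the same number of frames.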

Citation

If you find this repository or dataset useful, please cite:

@inproceedings{yoon2026vscd,
  title     = {VSCD: Video-based Scene Change Detection in Unaligned Scenes},
  author    = {Yoon, Jiae and Kim, Ue-Hwan},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Acknowledgements

This implementation builds on Segment Anything. We thank its authors for releasing their code and pretrained models.