
CheXficient

This repository provides the implementation of the paper A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling. CheXficient is a chest X-ray (CXR) foundation model developed within a contrastive language–image pretraining (CLIP) framework. Instead of relying on aggressive scaling, it emphasizes more effective use of the training data to improve both data and computational efficiency. Through active data-curated pretraining, CheXficient achieves competitive performance while requiring substantially less data and compute, offering a practical approach to scalable medical imaging foundation models.


Quick Links

Overview · Getting Started · Model List · Evaluation · Training · Citation

Overview

CheXficient incorporates a prototype-driven online data curator during pretraining (a). A set of prototypes (i.e., prototypical centroids) is leveraged to approximate the underlying data manifold, enabling dynamic prioritization of informative CXR image–report data pairs for model optimization using the InfoNCE contrastive loss. Concretely, training samples that lie farther from the prototypes (corresponding to under-represented but informative regions of the data distribution) are emphasized, while samples near the prototypes, which tend to contain redundant information, are down-weighted and under-sampled (b). The prototypes are updated concurrently with model training to reflect the evolving data distribution.
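
A minimal sketch of this idea is shown below. It is illustrative only: the nearest-prototype distance weighting, the EMA prototype update, and all names are assumptions made for exposition, not the repository's exact implementation (see the training code for the actual procedure).

import torch
import torch.nn.functional as F

def curated_infonce(image_embeds, text_embeds, prototypes, temperature=0.07, momentum=0.99):
    """Illustrative prototype-weighted InfoNCE; not the repository's exact code.

    image_embeds, text_embeds: (B, D) L2-normalized embeddings of paired samples.
    prototypes: (K, D) L2-normalized prototypical centroids.
    """
    # Distance of each image embedding to its nearest prototype (1 - cosine similarity).
    dist, nearest = (1.0 - image_embeds @ prototypes.t()).min(dim=1)

    # Samples far from all prototypes (under-represented regions) get larger weights;
    # samples near a prototype (redundant regions) are down-weighted.
    weights = dist / (dist.sum() + 1e-8) * dist.numel()

    # Symmetric InfoNCE over the image-text similarity matrix, re-weighted per sample.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    loss = (weights * (loss_i2t + loss_t2i) / 2).mean()

    # Update prototypes concurrently with training (simple EMA toward assigned samples).
    with torch.no_grad():
        for k in nearest.unique():
            assigned = image_embeds[nearest == k].mean(dim=0)
            prototypes[k] = F.normalize(momentum * prototypes[k] + (1 - momentum) * assigned, dim=0)

    return loss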

Getting Started

This codebase is designed with minimal dependencies (tested under Python 3.10.17, PyTorch 2.1.1, and Transformers 4.52.4); see here for full details. Installation typically takes less than one hour.

pip install transformers==4.52.4

Prepare Encoder

CheXficient builds on pre-trained vision and text encoders, such as DINO-v2 and Bio_ClinicalBERT, and can be flexibly extended to other pre-trained vision and language models. The corresponding model checkpoints are downloaded automatically during training.
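
For reference, comparable public checkpoints can be loaded directly with transformers. The checkpoint IDs below are assumptions for illustration and may differ from the exact encoders the training code downloads.

from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# Illustrative checkpoint IDs; the exact encoders fetched by the training code may differ.
vision_encoder = AutoModel.from_pretrained("facebook/dinov2-base")
image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text_tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")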

Download Pretrained CheXficient

pip install gdown
gdown --folder https://drive.google.com/drive/folders/1ISHSL8wf6upI_dRigMFroTaUPbozNFmS

Check Model List for other models.

Use CheXficient with PyTorch

For PyTorch-based usage, you can use the following code to load a pretrained CheXficient model (similar to resuming training in main.py):

import torch
import torchvision.transforms as transforms
from PIL import Image
import run_configs
from models_clip import CheXficient

# Build the model from the default config and move it to the GPU.
config_name = "chexficient"
args = getattr(run_configs, config_name)()
model = CheXficient(image_size=args.image_size)
model.to(torch.device('cuda:0'))
tokenizer = model.text_encoder.tokenizer
image_transform = transforms.Compose([
    transforms.Resize(args.image_size, interpolation=Image.BICUBIC),
    transforms.CenterCrop(args.image_size),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
])

# Load the pretrained weights.
state_dict = torch.load("pretrained_models/pytorch_model.pth", map_location='cpu')['model']
res = model.load_state_dict(state_dict, strict=False)
model.eval()

# Encode candidate text prompts and one CXR image, then compare them via cosine similarity.
inputs_text = tokenizer(["Pneumonia", "no Pneumonia"], padding="longest", truncation=True, max_length=args.max_bert_length, return_tensors="pt")
for key in inputs_text:
    inputs_text[key] = inputs_text[key].to(next(model.parameters()).device, non_blocking=True)
inputs_image = image_transform(Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")).unsqueeze(0)
inputs_image = inputs_image.to(next(model.parameters()).device, non_blocking=True)
with torch.no_grad():
    text_embeds = model.encode_text(inputs_text)
    image_embeds = model.encode_image(inputs_image)
    cosine = image_embeds @ text_embeds.t()
print('prob:', cosine.softmax(dim=1))

Use CheXficient with Hugging Face

Please run the following to load the checkpoints from Hugging Face.

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

repo_id = "StanfordAIMI/CheXficient"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True
).to(device)

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)

model.eval()

image = Image.open("./CXR/images/5AF3BB6C1BCC83C.png").convert("RGB")
text = ["Pneumonia", "no Pneumonia"]

image_inputs = image_processor(images=image, return_tensors="pt").to(device)
text_inputs = tokenizer(text, padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(
        pixel_values=image_inputs["pixel_values"],
        text_tokens=text_inputs,
    )
print(outputs)

Model List

Our released models are listed below. You can load them by following the Getting Started and Evaluation sections above.

Model              Vision Encoder   Text Encoder
vit-b14-clip-378   ViT-B/14         BERT-base

More models coming soon.

Evaluation

Non-adapted Evaluation

Please refer to ./clipeval/eval_zeroshot.py for non-adapted zero-shot evaluation on:

  1. Findings classification;
  2. Cross-modal retrieval.

Downstream-adapted Evaluation

Please refer to ./models_clip.py for downstream tasks such as:

  1. Classification (linear probing; see the sketch after this list);
  2. Segmentation (U-Net decoding);
  3. Radiology report generation: we adopt the VLM framework from Microsoft’s LLaVA-Rad, replacing its original image encoder with our pre-trained vision encoder.
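
For item 1, the snippet below is a minimal linear-probing sketch, not the actual heads in ./models_clip.py. The embedding dimension (512), class count (14), and train_loader are placeholders, and model is the pretrained CheXficient loaded in the PyTorch example above.

import torch
import torch.nn as nn

# Minimal linear-probing sketch; dimensions, class count, and the data loader are placeholders.
# Assumes `model` is the pretrained CheXficient loaded earlier and `train_loader` yields
# (images, multi_hot_labels) batches on the same device as the model.
probe = nn.Linear(512, 14).to(next(model.parameters()).device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()          # multi-label findings classification

model.eval()                                # backbone stays frozen
for images, labels in train_loader:
    with torch.no_grad():
        feats = model.encode_image(images)  # frozen image embeddings
    loss = criterion(probe(feats), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()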

Training

Data

An extensive training corpus of over 1.235 million CXR image–report pairs was collected from 13 public datasets: MIMIC-CXR; ReXGradient-160K; CheXpert-Plus; PadChest; BIMCV-COVID19; CANDID-PTX; CASIA-CXR; Open-I; NIH ChestX-ray14; BRAX; VinDr-CXR; VinDr-PCXR; ChestDR.

Note that some of these datasets (NIH ChestX-ray14, BRAX, VinDr-CXR, VinDr-PCXR, and ChestDR) do not provide free-text reports but instead include structured diagnostic labels (e.g., pleural effusion, cardiomegaly, atelectasis). We generate pseudo-reports for them using the template-based report synthesis strategy introduced in LLaVA-Rad, sketched below.
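
As a rough illustration of this strategy (the sentence templates below are simplified placeholders, not the actual LLaVA-Rad templates):

POSITIVE = "There is {finding}."
NEGATIVE = "There is no {finding}."

def labels_to_report(labels):
    """labels: dict mapping a finding name to 1 (present) or 0 (absent)."""
    sentences = []
    for finding, present in labels.items():
        template = POSITIVE if present else NEGATIVE
        sentences.append(template.format(finding=finding.lower()))
    return " ".join(sentences)

print(labels_to_report({"Pleural Effusion": 1, "Cardiomegaly": 0, "Atelectasis": 1}))
# -> "There is pleural effusion. There is no cardiomegaly. There is atelectasis."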

We preprocess all contributing datasets using simple filtering rules (e.g., excluding samples with empty CXR reports or invalid image–text pairs). Implementation details can be found in the ./preprocess folder.
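
A hypothetical filtering pass might look like the sketch below; the real rules and file layout are defined in ./preprocess and may differ.

from pathlib import Path

def keep_pair(image_path, report_text):
    """Return True if an image-report pair passes the basic filters."""
    if report_text is None or not report_text.strip():
        return False                      # drop samples with empty reports
    if not Path(image_path).is_file():
        return False                      # drop invalid pairs whose image file is missing
    return True

# `raw_pairs` is a placeholder for an iterable of (image_path, report_text) tuples.
pairs = [(p, r) for p, r in raw_pairs if keep_pair(p, r)]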

Training scripts

All example configuration settings are provided in configs.py. Data curation and model training are performed concurrently during the training process.

python main.py    # local training with the default setup on multiple GPUs

Single GPU Training

Training can be performed on a single GPU using an embedding accumulation strategy (coming soon).

Curated Data

The curated subset from the raw training set is stored within the model checkpoint under the key "subset". For example:

import torch

# Load the checkpoint and retrieve the curated subset stored under the key "subset".
state_dict = torch.load("pretrained_models/pytorch_model.pth", map_location='cpu')
subset = state_dict['subset']

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Chong Wang (chongwa@stanford.edu).

Citation

Please cite our paper below if CheXficient contributes to your work:

@article{wang2026data,
  title={A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling},
  author={Wang, Chong and Zhang, Yabin and Gao, Yunhe and Varma, Maya and Mottez, Clemence and Patsatzi, Faidra and Liu, Jiaming and Long, Jin and Delbrouck, Jean-Benoit and Gatidis, Sergios and others},
  journal={arXiv preprint arXiv:2602.22843},
  year={2026}
}
