CVPR 2026 Highlight

A Frame is Worth One Token:
Efficient Generative World Modeling with Delta Tokens

¹Amazon   ²Eindhoven University of Technology   ³Johns Hopkins University
*Work done while at Amazon. **Equal advising.

  • 1 token per frame
  • >35× fewer parameters
  • 2,000× fewer FLOPs

Introduction

World models are heavy. They don't need to be. Each video frame is typically encoded as many (e.g., 1,024) spatial tokens. What if it were just one?

In this work, we compress frames into "delta" tokens for efficient generative world modeling.

Outline of DeltaWorld

DeltaTok

Consecutive video frames are highly redundant.

We train an autoencoder to compress the difference in patch tokens from a frozen vision foundation model (e.g., DINOv3) between consecutive frames into a single delta token.
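To make the idea concrete, here is a minimal NumPy sketch of compressing a frame-to-frame feature difference into one token and using it to reconstruct the next frame's features. It is an illustration under stated assumptions, not the DeltaTok architecture: the attention-pooling encoder, the linear residual decoder, the function names (`encode_delta`, `decode_delta`), the token dimension, and the random stand-ins for the frozen backbone's patch tokens are all hypothetical, and the weights are untrained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_delta(prev_tokens, next_tokens, query, w_k, w_v):
    """Pool the per-patch feature difference into a single delta token.

    prev_tokens, next_tokens: (num_patches, dim) patch tokens of consecutive frames.
    query: (dim,) hypothetical learned query; w_k, w_v: (dim, dim) projections.
    """
    diff = next_tokens - prev_tokens                      # (num_patches, dim)
    keys, values = diff @ w_k, diff @ w_v                 # (num_patches, dim) each
    attn = softmax(keys @ query / np.sqrt(query.shape[0]))  # (num_patches,)
    return attn @ values                                  # (dim,): one token per frame

def decode_delta(prev_tokens, delta_token, w_dec):
    """Reconstruct next-frame patch tokens from the previous frame and one delta token."""
    num_patches = prev_tokens.shape[0]
    tiled = np.tile(delta_token, (num_patches, 1))        # broadcast the token to every patch
    cond = np.concatenate([prev_tokens, tiled], axis=1)   # (num_patches, 2 * dim)
    return prev_tokens + cond @ w_dec                     # residual update, (num_patches, dim)

# Random stand-ins for frozen foundation-model features (assumed sizes).
rng = np.random.default_rng(0)
num_patches, dim = 1024, 32
prev_tokens = rng.normal(size=(num_patches, dim))
next_tokens = prev_tokens + 0.05 * rng.normal(size=(num_patches, dim))

# Untrained, hypothetical parameters; a real autoencoder would learn these.
query = rng.normal(size=dim)
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_dec = rng.normal(size=(2 * dim, dim)) / np.sqrt(2 * dim)

delta_token = encode_delta(prev_tokens, next_tokens, query, w_k, w_v)
reconstruction = decode_delta(prev_tokens, delta_token, w_dec)
print(delta_token.shape, reconstruction.shape)
```

In this sketch the decoder only has to model what changed, since the previous frame's tokens are passed through unchanged; two identical frames yield an all-zero delta token.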

Overview of DeltaTok

DeltaWorld

We use these semantic delta tokens for next-token prediction.

DeltaWorld generates many plausible futures in parallel. During training, only the prediction closest to ground truth is supervised. At inference, this gives diverse predictions in a single forward pass.
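The training rule described above, where only the prediction closest to the ground truth is supervised, can be sketched as a best-of-many loss. The example below assumes mean squared error as the distance and uses random vectors as stand-ins for the predicted delta tokens; the function name `best_of_k_loss` and all sizes are hypothetical, and how the parallel hypotheses are produced is not covered here.

```python
import numpy as np

def best_of_k_loss(predictions, target):
    """Return the loss of the closest hypothesis and its index.

    predictions: (k, dim) candidate next delta tokens.
    target: (dim,) ground-truth next delta token.
    Only the winning hypothesis contributes to the loss, so only it
    would receive a gradient in a real training loop.
    """
    errors = ((predictions - target) ** 2).mean(axis=1)  # (k,) per-hypothesis MSE
    best = int(np.argmin(errors))
    return errors[best], best

rng = np.random.default_rng(0)
k, dim = 8, 16
target = rng.normal(size=dim)
predictions = rng.normal(size=(k, dim))
predictions[3] = target + 0.01   # make hypothesis 3 nearly correct

loss, best = best_of_k_loss(predictions, target)
print(best, loss)
```

Because the other hypotheses are not penalized for differing from this particular ground truth, they remain free to cover other plausible futures, which is what lets a single forward pass return diverse predictions at inference.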

Overview of DeltaWorld

Qualitative Examples

DeltaWorld generalizes zero-shot to unseen scenes.

Below, given four context frames, DeltaWorld samples multiple plausible futures that differ in pedestrian motion and camera trajectory.

Diverse sampled futures on VSPW

Performance Comparison

Despite its simplicity, DeltaWorld performs accurate short- to mid-term forecasting at a fraction of the cost of existing models:

  • 1,024–2,048× fewer tokens (vs. DINO-based models)
  • >35× fewer parameters (vs. generative models)
  • 2,000× fewer FLOPs (vs. generative models)

Performance comparison

Citation

@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}