CVPR 2026 Highlight

A Frame is Worth One Token:
Efficient Generative World Modeling with Delta Tokens

Tommie Kerssies^1,2,*, Gabriele Berton^1,*, Ju He^1, Qihang Yu^1, Wufei Ma^1,3,*,
Daan de Geus^2,**, Gijs Dubbelman^2,**, Liang-Chieh Chen^1,*

^1 Amazon, ^2 Eindhoven University of Technology, ^3 Johns Hopkins University

^* Work done while at Amazon. ^** Equal advising.


1 token per frame
>35× fewer parameters
2,000× fewer FLOPs

Introduction

World models are heavy. They don't need to be. Each video frame is typically encoded as many (e.g., 1,024) spatial tokens. What if it were just one?

In this work, we compress frames into "delta" tokens for efficient generative world modeling.

DeltaTok

Consecutive video frames are highly redundant.

We train an autoencoder to compress the difference in patch tokens from a frozen vision foundation model (e.g., DINOv3) between consecutive frames into a single delta token.
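The data flow described above can be sketched in a few lines. This is a minimal shape-level illustration, not the authors' architecture: the weights are random and untrained, the feature width `DIM`, the pooling-based encoder, and the names `encode_delta` and `decode_delta` are all assumptions made for the example, and random arrays stand in for the frozen backbone's patch tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PATCHES = 1024  # patch tokens per frame (the example count used in the text)
DIM = 64          # hypothetical feature width, kept small for this sketch

# Stand-ins for frozen-backbone patch tokens of two consecutive frames.
prev_tokens = rng.normal(size=(N_PATCHES, DIM))
cur_tokens = prev_tokens + 0.01 * rng.normal(size=(N_PATCHES, DIM))

# Hypothetical, untrained weights; a real autoencoder would learn these.
w_pool = rng.normal(size=N_PATCHES) / N_PATCHES
w_enc = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
w_dec = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
w_unpool = rng.normal(size=N_PATCHES)


def encode_delta(prev, cur):
    """Compress the patch-token difference between two frames into one vector."""
    diff = cur - prev          # (N_PATCHES, DIM)
    pooled = w_pool @ diff     # weighted pooling over patches -> (DIM,)
    return pooled @ w_enc      # single delta token, shape (DIM,)


def decode_delta(prev, delta):
    """Predict the current frame's patch tokens from the previous ones and a delta token."""
    update = np.outer(w_unpool, delta @ w_dec)  # (N_PATCHES, DIM)
    return prev + update


delta_token = encode_delta(prev_tokens, cur_tokens)
reconstruction = decode_delta(prev_tokens, delta_token)
print(delta_token.shape, reconstruction.shape)
```

The point of the sketch is the interface: the encoder sees both frames and emits one vector, while the decoder only needs the previous frame's tokens plus that vector. Training such a pair would minimize a reconstruction error between `reconstruction` and `cur_tokens`.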

DeltaWorld

We use these semantic delta tokens for next-token prediction.

DeltaWorld generates many plausible futures in parallel. During training, only the prediction closest to ground truth is supervised. At inference, this gives diverse predictions in a single forward pass.
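Supervising only the closest prediction is commonly called a best-of-many (or winner-takes-all) loss. The sketch below illustrates the selection step under stated assumptions: mean squared error as the distance measure and the helper name `best_of_many_loss` are choices made for this example, and the source does not specify which distance DeltaWorld uses.

```python
def mean_squared_error(prediction, target):
    """Average squared difference between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def best_of_many_loss(predictions, target):
    """Return (loss, index) for the candidate closest to the target.

    Only this candidate would receive a training signal; the others are
    left unsupervised for this sample (assumed MSE distance).
    """
    losses = [mean_squared_error(p, target) for p in predictions]
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return losses[best_index], best_index


# Three hypothetical candidate futures for one ground-truth delta token.
candidates = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]]
ground_truth = [1.0, 1.0]

loss, winner = best_of_many_loss(candidates, ground_truth)
print(loss, winner)  # 0.125 1
```

Because the remaining candidates are never pulled toward the same target, they are free to cover other plausible outcomes, which is what makes the parallel predictions diverse at inference time.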

Qualitative Examples

DeltaWorld generalizes zero-shot to unseen scenes.

Below, given four context frames, DeltaWorld samples multiple plausible futures that differ in pedestrian motion and camera trajectory.

Performance Comparison

Despite its simplicity, DeltaWorld performs accurate short- to mid-term forecasting at a fraction of the cost of existing models:

1,024–2,048× fewer tokens (vs. DINO-based models)
>35× fewer parameters (vs. generative models)
2,000× fewer FLOPs (vs. generative models)

Citation

@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}