RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Imperial College London

TL;DR

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler–Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
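To make the CM-GRPO idea above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a rectified-flow style interpolation x_s = (1 - s) * x0 + s * eps, so that re-noising a predicted clean endpoint to noise level s is a Gaussian transition with mean (1 - s) * x0 and standard deviation s. It also assumes a standard PPO-style clipped surrogate with group-standardised rewards. All function names and parameters (`consistency_step`, `gaussian_transition_logprob`, `group_relative_advantages`, `clipped_surrogate`, `clip`) are hypothetical and chosen only for illustration.

```python
import math
import random


def consistency_step(x0_pred, s, rng):
    """Re-noise a predicted clean endpoint to noise level s (0 < s <= 1).

    Assumption: x_s = (1 - s) * x0 + s * eps with eps ~ N(0, I).
    """
    return [(1.0 - s) * x0 + s * rng.gauss(0.0, 1.0) for x0 in x0_pred]


def gaussian_transition_logprob(x_next, x0_pred, s):
    """Log-density of x_next under N((1 - s) * x0_pred, s^2 I)."""
    d = len(x_next)
    sq = sum((xn - (1.0 - s) * x0) ** 2 for xn, x0 in zip(x_next, x0_pred))
    return -0.5 * sq / (s * s) - d * math.log(s) - 0.5 * d * math.log(2.0 * math.pi)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for a single transition (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rng = random.Random(0)
    x0_pred = [0.5, -1.0, 2.0]          # toy "clean endpoint" prediction
    x_next = consistency_step(x0_pred, s=0.5, rng=rng)
    logp = gaussian_transition_logprob(x_next, x0_pred, s=0.5)
    adv = group_relative_advantages([0.1, 0.4, 0.9, 0.2])  # toy rewards
    print(logp, adv, clipped_surrogate(logp, logp, adv[0]))
```

Because the transition is an explicit Gaussian, its log-probability is available in closed form, which is what a policy-gradient method needs for the likelihood ratio. Under the assumptions above, no separate stochastic differential equation discretisation is required to obtain it.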

Contributions

Comparison with Baselines


Ablation Studies


User Study

We conduct a user study on 100 long and detailed prompts drawn from the qualitative showcases of the existing baselines, generating 4 samples per prompt for each method. The study covers the four baselines designed for 5-second short video generation, namely CausVid, Self Forcing, Reward Forcing, and Causal Forcing. For each sample pair, an individual user compares a RAVEN clip with its baseline counterpart, shown in randomized order, on three dimensions: Quality, Semantic, and Overall. Aggregate preference rates are reported below. RAVEN is preferred on every dimension against all four baselines, with a more pronounced lead on Semantic than on Quality and a clear margin on Overall.

User study preference rates on Quality, Semantic, and Overall.

Citation

If you find this work useful, please cite RAVEN.

@article{lu2026raven,
  title = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
  author = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
  year = 2026,
  journal = {arXiv preprint arXiv:2605.15190}
}