TL;DR
Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self-rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning (RL) directly to this kernel, avoiding the Euler–Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic-degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
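To make the repacking concrete, below is a minimal PyTorch sketch of the interleaving and the extrapolation-style attention pattern it induces. The function names (`repack_rollout`, `build_attention_mask`), the tensor layout, and the exact masking rule are illustrative assumptions of ours, not RAVEN's released code.

```python
# Illustrative sketch only; shapes, names, and the masking rule are assumptions.
import torch

def repack_rollout(clean_chunks, noisy_chunks):
    """Interleave clean endpoints and noisy denoising states chunk by chunk.

    Both arguments are lists of [n_tokens, dim] tensors, one entry per chunk.
    Returns (packed_tokens, chunk_ids, noisy_flags).
    """
    tokens, chunk_ids, noisy_flags = [], [], []
    for i, (clean, noisy) in enumerate(zip(clean_chunks, noisy_chunks)):
        for block, is_noisy in ((clean, False), (noisy, True)):
            n = block.shape[0]
            tokens.append(block)
            chunk_ids.append(torch.full((n,), i))
            noisy_flags.append(torch.full((n,), is_noisy, dtype=torch.bool))
    return torch.cat(tokens), torch.cat(chunk_ids), torch.cat(noisy_flags)

def build_attention_mask(chunk_ids, noisy_flags):
    """Boolean [N, N] mask; True where a query may attend to a key.

    Clean endpoints attend causally to clean endpoints; the noisy states of
    chunk i attend to the clean endpoints of earlier chunks and to themselves,
    mirroring the context available when extrapolating at inference.
    """
    q_chunk, k_chunk = chunk_ids[:, None], chunk_ids[None, :]
    q_noisy, k_noisy = noisy_flags[:, None], noisy_flags[None, :]
    clean_to_clean = ~q_noisy & ~k_noisy & (k_chunk <= q_chunk)
    noisy_to_history = q_noisy & ~k_noisy & (k_chunk < q_chunk)
    noisy_to_self = q_noisy & k_noisy & (k_chunk == q_chunk)
    return clean_to_clean | noisy_to_history | noisy_to_self
```

Because the noisy states of later chunks attend to earlier clean endpoints inside one packed sequence, losses on those chunks backpropagate through the history tokens, which is the end-to-end history supervision described above.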
Contributions
- We identify a history supervision gap in autoregressive video diffusion distillation: existing methods are either optimized under history distributions that differ from those encountered at inference, or conditioned on rollout history without end-to-end supervision.
- We introduce RAVEN, a training-time test framework that repacks self-rollouts into an interleaved sequence of clean historical endpoints and noisy denoising states, allowing supervision to propagate through the history representations used during extrapolation.
- We propose CM-GRPO, which reformulates a consistency sampling step as a conditional Gaussian transition kernel and applies Group Relative Policy Optimization directly to this kernel, matching the sampler interface used at inference (see the sketch after this list).
- We demonstrate that RAVEN surpasses recent causal video distillation baselines and that CM-GRPO provides complementary gains when combined with RAVEN.
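To ground the CM-GRPO bullet above, here is a minimal sketch assuming a one-step consistency transition of the form x_s = f_theta(x_t, t) + sigma_s * z with z ~ N(0, I), so that each step is a conditional Gaussian with a tractable log-density. All names (`consistency_fn`, `transition_logprob`, `cm_grpo_loss`) and the PPO-style clipping are our illustrative assumptions, not the paper's exact objective.

```python
# Hypothetical sketch of CM-GRPO; the transition form and loss details are assumptions.
import torch

def transition_logprob(x_s, x_t, t, sigma_s, consistency_fn):
    """Log-density of x_s under N(f_theta(x_t, t), sigma_s^2 I), per sample."""
    mean = consistency_fn(x_t, t)
    var = torch.as_tensor(sigma_s) ** 2
    elementwise = -0.5 * ((x_s - mean) ** 2 / var + torch.log(2 * torch.pi * var))
    return elementwise.flatten(1).sum(dim=1)  # sum over all non-batch dims

def cm_grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Group-relative policy gradient over one group of rollouts per prompt.

    Advantages are rewards normalized within the group; the ratio of new to
    old transition likelihoods is clipped as in PPO.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because the sampler's own Gaussian kernel supplies the likelihood, the policy being optimized is exactly the few-step sampler used at inference, with no Euler–Maruyama auxiliary process.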
Comparison with Baselines
Ablation Studies
User Study
We conduct a user study on 100 long, detailed prompts drawn from the qualitative showcases of the existing baselines, generating 4 samples per prompt for each method. The study covers the four baselines designed for 5-second short-video generation, namely CausVid, Self Forcing, Reward Forcing, and Causal Forcing. For each sample pair, a participant rates a RAVEN clip against its baseline counterpart, presented in randomized order, along three axes: Quality, Semantic, and Overall. Aggregate preference rates are reported below: RAVEN is preferred on every dimension against all four baselines, with a more pronounced lead on Semantic than on Quality and a clear margin on Overall.
Citation
If you find this work useful, please cite RAVEN:
```bibtex
@article{lu2026raven,
  title   = {RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO},
  author  = {Lu, Yanzuo and Zuo, Ronglai and Deng, Jiankang},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.15190}
}
```