World-action models usually predict future video. We argue the right target is 3D motion. FloMo renders 3D scene flow into the latent space of a pretrained video generator and fine-tunes a single backbone to predict motion and action. Predicting motion instead of pixels gives better in-distribution success, far stronger out-of-distribution generalization, and higher sample efficiency.
Abstract
Pretrained video generation models distill internet-scale motion priors into robot policies, but full video is a redundant signal for action prediction. We argue that dense 3D motion is the medium through which video pretraining best transfers to robot actions, and that predicting motion rather than video yields better sample efficiency for action prediction. FloMo is a 5B-parameter video generation model fine-tuned to predict 3D scene flow and action tokens. Rendering scene flow as RGB and encoding it into the same latent space as video lets the pretrained video generation backbone adapt efficiently to scene-flow prediction. Co-trained on action-free human video and action-labeled robot teleoperation data, our model benefits from human-video pretraining and generalizes to unseen objects, motion primitives, and tasks.
Removing the scene flow prediction objective degrades out-of-distribution generalization, while adding a video prediction target alongside motion lowers policy success rather than improving it. When motion, video, and action are predicted jointly, action tokens attend far more to motion than to video, suggesting that predicting video wastes model capacity. Together, these results indicate that motion mediates the transfer from video pretraining to action.
Contributions
We render 3D scene flow as RGB and encode it with a frozen video VAE, so motion tokens share the video latent space. A single backbone flow-matches motion and action, with no new tokenizer, and the video priors live in the pretrained weights, not the prediction target.
Across real-robot manipulation, predicting motion yields higher in-distribution success and stronger OOD generalization than predicting video or jointly predicting video and motion. Adding a video target interferes with action learning rather than complementing it.
FloMo transfers to objects, motion primitives, and tasks seen only in action-free human video, such as pulling a string, a primitive never present in robot training data.
Method
The action that produces a future state depends on how the scene moves, not how it looks. 2D optical flow captures motion but conflates camera ego-motion with task-relevant motion, so FloMo predicts 3D scene flow in a canonical frame: the cumulative 3D displacement of each tracked point, expressed in the first frame's camera, which isolates object and hand motion while discarding appearance.
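To make the canonical-frame definition concrete, here is a minimal sketch of how the cumulative displacement of one tracked point could be computed. It assumes each frame comes with a camera-to-world pose `(R, c)` and the point's position in that camera's coordinates; the function names and the pose convention are illustrative assumptions, not FloMo's actual pipeline.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Pose = Tuple[Mat3, Vec3]  # assumed convention: world = R @ p_cam + c


def matvec(m: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def transpose(m: Mat3) -> List[List[float]]:
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def to_first_camera(point_cam_t: Vec3, pose_t: Pose, pose_0: Pose) -> Vec3:
    """Re-express a point given in camera t's coordinates in camera 0's coordinates."""
    r_t, c_t = pose_t
    r_0, c_0 = pose_0
    world = tuple(a + b for a, b in zip(matvec(r_t, point_cam_t), c_t))
    shifted = tuple(a - b for a, b in zip(world, c_0))
    return matvec(transpose(r_0), shifted)


def canonical_scene_flow(track_cam: Sequence[Vec3], poses: Sequence[Pose]) -> List[Vec3]:
    """Cumulative displacement of one tracked point, in the first camera's frame.

    track_cam[t] is the point's position in camera t's coordinates.
    The displacement at t = 0 is zero by construction.
    """
    start = to_first_camera(track_cam[0], poses[0], poses[0])
    flow = []
    for point, pose in zip(track_cam, poses):
        in_first = to_first_camera(point, pose, poses[0])
        flow.append(tuple(a - b for a, b in zip(in_first, start)))
    return flow


# A static point seen from a camera that slides 1 unit along x:
# the point appears to move in camera coordinates, but its canonical flow is zero.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
static_flow = canonical_scene_flow(
    [(0.0, 0.0, 2.0), (-1.0, 0.0, 2.0)],
    [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (1.0, 0.0, 0.0))],
)
```

In this toy case the camera translation cancels out, which is the property that separates scene flow in a fixed frame from 2D optical flow.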
We render this 3D vector field as an RGB video and encode it with the frozen Wan 2.2 VAE. Because rendered scene flow lands in the same latent distribution as video, the pretrained DiT operates on it without re-training the tokenizer; we fine-tune only the backbone (LoRA, rank 64) and keep the VAE and text encoder frozen.
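The text does not specify how displacement vectors are mapped to colors, so the following is only one plausible encoding, sketched under stated assumptions: each axis is clipped to a symmetric range `[-max_disp, max_disp]` and scaled linearly into an 8-bit channel, so zero motion renders as mid-grey. The names `flow_to_rgb`, `rgb_to_flow`, and `max_disp` are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
RGB = Tuple[int, int, int]


def flow_to_rgb(flow: Vec3, max_disp: float = 0.5) -> RGB:
    """Map one 3D displacement to an 8-bit RGB triple (assumed linear encoding).

    Each axis is clipped to [-max_disp, max_disp], then scaled so that
    -max_disp -> 0, zero motion -> mid-grey, +max_disp -> 255.
    """
    channels = []
    for d in flow:
        clipped = max(-max_disp, min(max_disp, d))
        unit = (clipped + max_disp) / (2.0 * max_disp)  # now in [0, 1]
        channels.append(int(round(unit * 255)))
    return tuple(channels)


def rgb_to_flow(rgb: RGB, max_disp: float = 0.5) -> Vec3:
    """Invert flow_to_rgb, up to 8-bit quantization error."""
    return tuple((c / 255.0) * 2.0 * max_disp - max_disp for c in rgb)


# Round trip for a small displacement; the error is bounded by the quantization step.
original = (0.1, -0.2, 0.3)
recovered = rgb_to_flow(flow_to_rgb(original))
```

A video rendered this way is an ordinary RGB clip, which is what allows a frozen video VAE to encode it without changes. The clipping range trades dynamic range against quantization resolution, so a real pipeline would need to choose it to match the scale of motion in its data.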
FloMo is co-trained on a mixture of egocentric human video and bimanual robot teleoperation, in two stages.
Stage 1: Train on the full mixture. Human video supervises the scene-flow head only; robot data supervises both scene flow and action. This builds a broad motion prior at scale before any action-specific tuning.
Stage 2: Train on robot data only, dropping the human-video stream. This sharpens action prediction without diluting the gradient with action-free samples.
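The supervision rule above (human video trains scene flow only, robot data trains both heads) amounts to masking the action loss for action-free samples. The sketch below illustrates that idea with precomputed per-sample losses; the dictionary keys, the `action_weight` parameter, and the averaging scheme are assumptions for illustration, not the paper's training code.

```python
from typing import Dict, List, Optional

Sample = Dict[str, Optional[float]]


def select_samples(samples: List[Sample], robot_only: bool) -> List[Sample]:
    """Keep the full mixture, or only action-labeled (robot) samples."""
    if robot_only:
        return [s for s in samples if s["action_loss"] is not None]
    return list(samples)


def batch_loss(samples: List[Sample], action_weight: float = 1.0) -> float:
    """Mean scene-flow loss over all samples plus mean action loss over labeled ones.

    Samples whose 'action_loss' is None (action-free human video) contribute
    to the scene-flow term only.
    """
    flow_mean = sum(s["flow_loss"] for s in samples) / len(samples)
    labeled = [s["action_loss"] for s in samples if s["action_loss"] is not None]
    action_mean = sum(labeled) / len(labeled) if labeled else 0.0
    return flow_mean + action_weight * action_mean


batch = [
    {"flow_loss": 1.0, "action_loss": None},  # human video: no action labels
    {"flow_loss": 3.0, "action_loss": 2.0},   # robot teleoperation
]
mixture_loss = batch_loss(select_samples(batch, robot_only=False))
robot_loss = batch_loss(select_samples(batch, robot_only=True))
```

Training on the full mixture corresponds to `robot_only=False`, and the robot-only regime to `robot_only=True`; in the latter every sample contributes to both terms.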
At inference, given an initial image and a language instruction, FloMo jointly denoises its output heads for 10 UniPC steps, then executes the full predicted chunk of 16 actions before re-planning from the next observation.
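The inference procedure is a chunked, open-loop control scheme: predict a chunk, execute all of it, then observe and predict again. A minimal sketch of that loop follows. `predict_chunk` and `step_env` are hypothetical callables standing in for the model (one call would run the 10 denoising steps) and the robot interface, and the early exit on task completion and the `max_steps` cap are assumptions added for the example.

```python
from typing import Any, Callable, Sequence, Tuple


def run_episode(
    predict_chunk: Callable[[Any, str], Sequence[Any]],
    step_env: Callable[[Any], Tuple[Any, bool]],
    first_obs: Any,
    instruction: str,
    max_steps: int = 64,
    chunk_size: int = 16,
) -> Tuple[int, int]:
    """Execute whole action chunks, re-planning only between chunks.

    Returns (actions executed, number of planning calls).
    """
    obs = first_obs
    executed = 0
    plans = 0
    while executed < max_steps:
        chunk = predict_chunk(obs, instruction)  # plan from the latest observation
        plans += 1
        for action in chunk[:chunk_size]:
            obs, done = step_env(action)  # no re-planning inside a chunk
            executed += 1
            if done or executed >= max_steps:
                return executed, plans
    return executed, plans


# Toy stand-ins: a constant policy and an environment that finishes after 40 steps.
counter = {"steps": 0}


def fake_step(action: Any) -> Tuple[int, bool]:
    counter["steps"] += 1
    return counter["steps"], counter["steps"] >= 40


result = run_episode(lambda obs, text: [0.0] * 16, fake_step, 0, "close the cabinet")
```

With a chunk of 16 actions, the 40-step toy episode needs three planning calls, which shows how chunked execution amortizes the cost of each denoising pass.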
Results
On three contact-rich bimanual tasks (Oven, Toolbox, Corn), FloMo achieves the highest success on every task, averaging 87%, versus 47% for Joint Video-Action, 40% for FloMo w/ Video, and 7% for a from-scratch behavior-cloning baseline.
The gap is largest on Oven, where success hinges on precisely aligning the gripper with the handle: 3D motion supervision targets the geometry of that contact directly, whereas a video target must also reconstruct task-irrelevant reflections and textures.
FloMo outperforms every baseline across all three OOD axes: an unseen motion primitive (Close Cabinet), an unseen task (Pull String), and an unseen object (Highlighter). Every method that predicts pixels degrades sharply.
| Method | Close Cabinet SR ↑ | Close Cabinet TP ↑ | Pull String SR ↑ | Pull String TP ↑ | Highlighter SR ↑ | Highlighter TP ↑ |
|---|---|---|---|---|---|---|
| BC (no pretraining) | 3/5 | 80 | 0/5 | 0 | 1/5 | 50 |
| Joint Video-Action | 0/5 | 40 | 1/5 | 20 | 2/5 | 40 |
| FloMo w/ Video | 0/5 | 0 | 0/5 | 0 | 0/5 | 0 |
| AMPLIFY | 2/5 | 60 | 0/5 | 0 | 0/5 | 0 |
| DreamZero | 2/5 | 80 | 1/5 | 30 | 1/5 | 30 |
| FloMo (ours) | 5/5 | 100 | 3/5 | 60 | 4/5 | 80 |
SR = successful rollouts / total · TP = avg. fraction of subtasks completed (%).
Generalization from human video
3D scene flow is an embodiment-agnostic representation: a human hand and a robot gripper produce a similar motion field. FloMo learns motions from action-free human video and carries them to the robot, generalizing to objects, motion primitives, and tasks seen only in human data.
| Training data | Close Cabinet SR | Close Cabinet TP | Pull String SR | Pull String TP | Highlighter SR | Highlighter TP | Average SR | Average TP |
|---|---|---|---|---|---|---|---|---|
| FloMo | 5/5 | 100 | 3/5 | 60 | 4/5 | 80 | 80% | 80 |
| FloMo (no human video) | 0/5 | 20 | 1/5 | 50 | 0/5 | 20 | 7% | 30 |
Both models predict 3D motion and differ only in whether the ~81h EgoVerse human-video subset is in training. SR = successful rollouts / total; TP = avg. fraction of subtasks completed (%).
Holding the model and the motion-prediction target fixed and removing only the ~81h EgoVerse human-video subset drops average out-of-distribution success from 80% to 7%. The generalization is inherited from the human video, made transferable by the embodiment-agnostic motion representation.
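The averages in the table can be reproduced from the per-task entries. The short sketch below pools successes over all rollouts and averages task progress across tasks; the helper names are illustrative, and because every task uses five rollouts, pooling gives the same result as averaging per-task rates.

```python
from typing import List, Tuple


def average_success(rollouts: List[Tuple[int, int]]) -> float:
    """Pooled success rate in percent from (successes, trials) pairs."""
    successes = sum(s for s, _ in rollouts)
    trials = sum(n for _, n in rollouts)
    return 100.0 * successes / trials


def average_progress(task_progress: List[float]) -> float:
    """Mean task progress (TP, in percent) across tasks."""
    return sum(task_progress) / len(task_progress)


# Per-task entries copied from the table: Close Cabinet, Pull String, Highlighter.
flomo_sr = average_success([(5, 5), (3, 5), (4, 5)])
flomo_tp = average_progress([100, 60, 80])
no_human_sr = average_success([(0, 5), (1, 5), (0, 5)])
no_human_tp = average_progress([20, 50, 20])

print(round(flomo_sr), round(flomo_tp), round(no_human_sr), round(no_human_tp))
```

The no-human-video model succeeds in 1 of 15 rollouts, about 6.7%, which the table rounds to 7%.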
Rollouts
Bimanual YAM rollouts on in-distribution and out-of-distribution tasks.
The same unseen task, same conditions. FloMo succeeds where pixel-predicting baselines fail.