FloMo: 3D Scene Flow as a World-Action Model Intermediate

TL;DR

World-action models usually predict future video. We argue the right target is 3D motion. FloMo renders 3D scene flow into the latent space of a pretrained video generator and fine-tunes a single backbone to predict motion and action. Predicting motion instead of pixels gives better in-distribution success, far stronger out-of-distribution generalization, and higher sample efficiency.

**FloMo architecture.** Given an image and a text prompt, FloMo jointly predicts dense 3D motion (scene flow) and robot actions with a pretrained video-generation backbone (Wan 2.2 5B DiT) using flow matching. To tokenize scene flow, instantaneous 3D velocities are rendered as RGB videos and encoded by the pretrained video backbone's VAE, so motion shares the video latent space and needs no new tokenizer.

Abstract

Motion mediates the transfer from video to action

Pretrained video generation models distill internet-scale motion priors into robot policies, but full video is a redundant signal for action prediction. We argue that dense 3D motion is the medium through which video pretraining best transfers to robot actions, and that predicting motion rather than video makes action prediction more sample-efficient. FloMo is a 5B video-generation model fine-tuned to predict 3D scene flow and action tokens. Rendering scene flow as RGB and encoding it into the same latent space as video lets a pretrained video-generation backbone adapt efficiently to predicting 3D scene flow. Co-trained on action-free human video and action-labeled robot teleoperation data, our model benefits from human-video pretraining and generalizes to unseen objects, motion primitives, and tasks.

Removing the scene flow prediction objective degrades out-of-distribution generalization, while adding a video prediction target alongside motion lowers policy success rather than improving it. When motion, video, and action are predicted jointly, action tokens attend far more to motion than to video, suggesting that predicting video wastes model capacity. Together, these results indicate that motion mediates the transfer from video pretraining to action.

Contributions

What we show

1

Motion as the WAM prediction target

We render 3D scene flow as RGB and encode it with a frozen video VAE, so motion tokens share the video latent space. A single backbone is trained with flow matching on both motion and action, with no new tokenizer. The video priors enter through the pretrained weights, not through the prediction target.

2

Motion is the most effective target

Across real-robot manipulation, predicting motion yields higher in-distribution success and stronger OOD generalization than predicting video or jointly predicting video and motion. Adding a video target interferes with action learning rather than complementing it.

3

Generalization from human video

FloMo transfers to objects, motion primitives, and tasks seen only in action-free human video, such as pulling a string, a task never present in robot training data.

Method

3D scene flow as the video-action intermediate

The action that produces a future state depends on how the scene moves, not how it looks. 2D optical flow captures motion but conflates camera ego-motion with task-relevant motion, so FloMo predicts 3D scene flow in a canonical frame: the cumulative 3D displacement of each tracked point, expressed in the first frame's camera, which isolates object and hand motion while discarding appearance.
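The canonical-frame definition above can be sketched in a few lines. This is a minimal illustration, not FloMo's pipeline: it assumes tracked points are already available as world-frame 3D positions and that the first frame's camera pose is known as a rotation and translation. The names `to_camera` and `cumulative_scene_flow` are hypothetical.

```python
def to_camera(point_world, rotation, translation):
    """Map a world-frame 3D point into a camera frame: R @ p + t."""
    return tuple(
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def cumulative_scene_flow(tracks_world, rotation0, translation0):
    """tracks_world[t][i] is the world-frame position of point i at frame t.

    Returns flow[t][i]: position at frame t minus position at frame 0, with
    both positions expressed in the FIRST frame's camera. Later camera poses
    are never used, so camera ego-motion does not enter the result.
    """
    in_cam0 = [
        [to_camera(p, rotation0, translation0) for p in frame]
        for frame in tracks_world
    ]
    start = in_cam0[0]
    return [
        [tuple(p[k] - p0[k] for k in range(3)) for p, p0 in zip(frame, start)]
        for frame in in_cam0
    ]


# Toy example (made-up numbers): one static point, one point moving along +y.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tracks = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # frame 0
    [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)],  # frame 1
]
flow = cumulative_scene_flow(tracks, identity, (0.0, 0.0, 1.0))
```

Because the flow is a difference of two positions in the same frame, the camera translation cancels and only the first camera's orientation affects the result.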

We render this 3D vector field as an RGB video and encode it with the frozen Wan 2.2 VAE. Because rendered scene flow lands in the same latent distribution as video, the pretrained DiT operates on it without re-training the tokenizer; we fine-tune only the backbone (LoRA, rank 64) and keep the VAE and text encoder frozen.
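One plausible way to render a 3D vector as a pixel colour is sketched below. The text does not specify the rendering, so everything here is an assumption: the x, y, z components are mapped to the R, G, B channels, each axis is clipped to a symmetric range, and the range is scaled linearly to 8 bits. The function name `flow_to_rgb` and the `max_disp` parameter are hypothetical.

```python
def flow_to_rgb(vector, max_disp=0.5):
    """Map a 3D displacement to an 8-bit RGB triple.

    ASSUMPTION (not from the source): x, y, z go to R, G, B; each axis is
    clipped to [-max_disp, max_disp] and mapped linearly to [0, 255], so
    zero motion lands at mid-grey.
    """
    rgb = []
    for component in vector:
        clipped = max(-max_disp, min(max_disp, component))
        scaled = (clipped + max_disp) / (2 * max_disp)  # now in [0, 1]
        rgb.append(int(scaled * 255 + 0.5))  # round half up to an integer
    return tuple(rgb)


def render_flow_frame(flow_frame, max_disp=0.5):
    """Render one frame of per-pixel 3D flow (rows of vectors) as RGB rows."""
    return [[flow_to_rgb(v, max_disp) for v in row] for row in flow_frame]


# Toy 1x2 frame: one pixel at the clip limits, one pixel with no motion.
frame_rgb = render_flow_frame([[(0.5, -0.5, 0.0), (0.0, 0.0, 0.0)]])
```

A sequence of such frames forms an ordinary RGB video, which is what allows a frozen video VAE to encode it without changes.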

**Scene-flow tokenization.** Each row is a timestep from human video. FloMo recovers the dominant task-relevant 3D motion field (rendered as RGB) while staying invariant to texture changes that confound pixel-space predictors.

Two-stage co-training

FloMo is co-trained on a mixture of egocentric human video and bimanual robot teleoperation.

Stage 1

Mid-training

Train on the full mixture. Human video supervises the scene-flow head only; robot data supervises both scene flow and action. This builds a broad motion prior at scale before any action-specific tuning.

Stage 2

Fine-tuning

Train on robot data only, dropping the human-video stream. This sharpens action prediction without diluting the gradient with action-free samples.
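The supervision rules of the two stages can be written as a small masking function. This is a sketch under stated assumptions: per-head losses are taken as already computed, and they are summed without weights, which the text does not specify. The names `supervised_heads` and `batch_loss` are hypothetical.

```python
def supervised_heads(source, stage):
    """Return the heads that receive a loss for one sample."""
    if source == "robot":
        return ("scene_flow", "action")  # robot data supervises both heads
    if source == "human" and stage == 1:
        return ("scene_flow",)  # action-free video: scene flow only
    return ()  # the human-video stream is dropped in stage 2


def batch_loss(samples, stage):
    """Sum the per-head losses that are active for this stage.

    ASSUMPTION: an unweighted sum; the source does not state loss weights.
    """
    total = 0.0
    for sample in samples:
        for head in supervised_heads(sample["source"], stage):
            total += sample["loss"][head]
    return total


# Toy batch with made-up loss values.
batch = [
    {"source": "human", "loss": {"scene_flow": 0.5}},
    {"source": "robot", "loss": {"scene_flow": 0.25, "action": 1.0}},
]
stage1_loss = batch_loss(batch, stage=1)
stage2_loss = batch_loss(batch, stage=2)
```

In stage 2 the human sample contributes nothing, so every gradient step is driven by action-labeled data.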

At inference, given an initial image and a language instruction, FloMo jointly denoises its output heads for 10 UniPC steps, then executes the full predicted chunk of 16 actions before re-planning from the next observation.
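The execute-then-re-plan loop described above might look like the following sketch. Only the constants 10 and 16 come from the text. The model and the robot are replaced by stand-in functions, the UniPC sampler itself is not implemented, and all names (`run_episode`, `predict_chunk`, `env_step`) are hypothetical.

```python
NUM_DENOISE_STEPS = 10  # sampler steps per prediction (from the text)
CHUNK_LEN = 16  # actions executed before re-planning (from the text)


def run_episode(predict_chunk, env_step, observation, instruction, max_actions):
    """Predict a chunk, execute all of it open-loop, then re-plan."""
    executed = 0
    replans = 0
    while executed < max_actions:
        chunk = predict_chunk(observation, instruction, NUM_DENOISE_STEPS)
        replans += 1
        for action in chunk[:CHUNK_LEN]:
            observation = env_step(action)  # newest observation feeds the next plan
            executed += 1
    return executed, replans


# Stand-ins so the sketch runs without a model or a robot.
def dummy_predict_chunk(observation, instruction, num_steps):
    return [0.0] * CHUNK_LEN  # placeholder for the denoised action chunk


state = {"t": 0}


def dummy_env_step(action):
    state["t"] += 1
    return state["t"]  # placeholder observation


executed, replans = run_episode(
    dummy_predict_chunk, dummy_env_step, 0, "open the oven", max_actions=48
)
```

With a budget of 48 actions the loop calls the model three times, so the cost of denoising is amortized over 16 control steps per call.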

Results

Predicting motion beats predicting pixels

**In-distribution success.** Predicting motion (ours) is the best target on every task; predicting video alone, or video together with motion, lowers success.

In-distribution real-robot evaluation

On three contact-rich bimanual tasks (Oven, Toolbox, Corn), FloMo achieves the highest success on every task, averaging 87%, versus 47% for Joint Video-Action, 40% for FloMo w/ Video, and 7% for a from-scratch behavior-cloning baseline.

The gap is largest on Oven, where success hinges on precisely aligning the gripper with the handle: 3D motion supervision targets the geometry of that contact directly, whereas a video target must also reconstruct task-irrelevant reflections and textures.

Out-of-distribution generalization

FloMo outperforms every baseline across all three OOD axes: an unseen motion primitive (Close Cabinet), an unseen task (Pull String), and an unseen object (Highlighter). Every method that predicts pixels degrades sharply.

Method	Close Cabinet SR ↑	Close Cabinet TP (%) ↑	Pull String SR ↑	Pull String TP (%) ↑	Highlighter SR ↑	Highlighter TP (%) ↑
BC (no pretraining)	3/5	80	0/5	0	1/5	50
Joint Video-Action	0/5	40	1/5	20	2/5	40
FloMo w/ Video	0/5	0	0/5	0	0/5	0
AMPLIFY	2/5	60	0/5	0	0/5	0
DreamZero	2/5	80	1/5	30	1/5	30
FloMo (ours)	5/5	100	3/5	60	4/5	80

SR = successful rollouts / total · TP = avg. fraction of subtasks completed (%).

**Sample efficiency (inverse-dynamics scaling).** Held-out action error vs. action-data scale for two *ground-truth* conditioning signals, motion and video (log-log). Recovering actions from motion is more sample-efficient than from video, with ~22% lower error at 1% of the data.

**Action attention concentrates on motion.** Share of action-query attention to each input modality across DiT layers. Attention to motion rises through the trunk and dominates; action queries attend to motion roughly 3× more than to video.
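An attention share of this kind can be computed by grouping each action query's attention weights by the modality of the key token. The sketch below assumes softmax-normalized attention rows for a single layer and uses made-up numbers; the function name `attention_share` and the modality labels are hypothetical, and the authors' exact aggregation is not given in the text.

```python
def attention_share(attn_rows, key_modalities):
    """Mean attention mass per modality over a set of action queries.

    attn_rows: one list of attention weights per action query, each summing
    to 1 over the keys. key_modalities: the modality label of each key token.
    """
    totals = {}
    for row in attn_rows:
        for weight, modality in zip(row, key_modalities):
            totals[modality] = totals.get(modality, 0.0) + weight
    return {m: total / len(attn_rows) for m, total in totals.items()}


# Toy layer: two action queries attending over four key tokens.
key_modalities = ["motion", "motion", "video", "other"]
attn_rows = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
share = attention_share(attn_rows, key_modalities)
ratio = share["motion"] / share["video"]
```

Repeating this per layer gives one curve per modality. Modalities with more tokens can collect more mass under this measure, so a per-token normalization may also be worth reporting.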

Generalization from human video

Scene flow transfers from human video

3D scene flow is an embodiment-agnostic representation: a human hand and a robot gripper produce a similar motion field. FloMo learns motions from action-free human video and carries them to the robot, generalizing to objects, motion primitives, and tasks seen only in human data.

80% avg. OOD success with human video

7% avg. OOD success without human video

Training data	Close Cabinet SR	Close Cabinet TP (%)	Pull String SR	Pull String TP (%)	Highlighter SR	Highlighter TP (%)	Average SR	Average TP (%)
FloMo	5/5	100	3/5	60	4/5	80	80%	80
FloMo (no human video)	0/5	20	1/5	50	0/5	20	7%	30

Both models predict 3D motion and differ only in whether the ~81h EgoVerse human-video subset is in training. SR = successful rollouts / total; TP = avg. fraction of subtasks completed (%).

Holding the model and the motion-prediction target fixed and removing only the ~81h EgoVerse human-video subset drops average out-of-distribution success from 80% to 7%. The generalization is inherited from the human video, made transferable by the embodiment-agnostic motion representation.
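The averages in the table follow directly from the per-task entries. The short check below recomputes them, assuming the average SR is the unweighted mean of the three per-task success rates and the average TP is the unweighted mean of the three TP values. The variable and function names are illustrative.

```python
from fractions import Fraction

# Per-task entries copied from the table: (successes, trials) and TP in %.
rows = {
    "FloMo": {"sr": [(5, 5), (3, 5), (4, 5)], "tp": [100, 60, 80]},
    "FloMo (no human video)": {"sr": [(0, 5), (1, 5), (0, 5)], "tp": [20, 50, 20]},
}


def averages(row):
    """Return (average SR in %, average TP in %), rounded to integers."""
    mean_sr = sum(Fraction(s, n) for s, n in row["sr"]) / len(row["sr"])
    mean_tp = Fraction(sum(row["tp"]), len(row["tp"]))
    return round(mean_sr * 100), round(mean_tp)


with_human = averages(rows["FloMo"])
without_human = averages(rows["FloMo (no human video)"])
```

With human video, 12 of 15 rollouts succeed (80%). Without it, 1 of 15 succeeds, which is 6.7% and rounds to the reported 7%.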

Rollouts