World Models + JEPA Reading Notes

Reading Notes: World Models + JEPA

Read date: 2026-06-01

Scope: 14 downloaded papers in this folder. I read the abstracts, introductions, method sections, result/conclusion sections, and the key ablations/limitations where they affected planning or embodied use. I did not verify every proof appendix line by line.

One-line Synthesis

The JEPA world-model line is moving from "predict useful representations instead of pixels" toward "train stable latent dynamics that can plan": I-JEPA and V-JEPA establish representation-space prediction; V-JEPA 2 adds large-scale video pretraining plus action-conditioned robot planning; LeJEPA and LeWorldModel make end-to-end JEPA training stable with Gaussian regularization; the 2026 variants add value-aware planning, probabilistic uncertainty, object-centric causal structure, and subspace regularization.

Mental Map

  1. Foundation idea: JEPA learns by predicting embeddings of missing/future observations, not reconstructing pixels. This avoids spending capacity on high-entropy details that do not help semantics or control.

  2. The core engineering problem: A pure embedding-prediction loss collapses easily. Earlier systems use EMA teachers, stop-gradients, VCReg, or frozen encoders. LeJEPA/LeWorldModel argue that explicitly forcing embeddings toward an isotropic Gaussian via SIGReg can prevent collapse with fewer heuristics.

  3. The planning problem: A good predictor is not automatically a good planner. Planning success depends on the geometry of the latent space, the optimizer, context length, rollout training, proprioception, and whether latent distance to a goal is meaningful.

  4. The likely embodied-navigation lesson: For ObjectNav, JEPA is most useful as a latent transition model and surprise/novelty signal, but it should be paired with object-centric or dense features and a value-shaped latent distance. Raw latent L2 to a goal is often too weak for long-horizon navigation.
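To make the collapse problem in item 2 concrete, here is a minimal sketch in plain Python. The names `constant_encoder` and `identity_predictor` are hypothetical and do not come from any of the papers; the only point is that a constant encoder drives a pure embedding-prediction loss to zero while carrying no information about the input.

```python
# Toy illustration of representation collapse under a pure embedding-prediction loss.
# All names are hypothetical; this is not code from any of the papers.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def constant_encoder(observation):
    """A collapsed encoder: every observation maps to the same embedding."""
    return [0.0, 0.0, 0.0]

def identity_predictor(context_embedding):
    """Predicts the target embedding as a copy of the context embedding."""
    return list(context_embedding)

def jepa_prediction_loss(encoder, predictor, pairs):
    """Average loss for predicting target embeddings from context embeddings."""
    losses = [mse(predictor(encoder(ctx)), encoder(tgt)) for ctx, tgt in pairs]
    return sum(losses) / len(losses)

# (context, target) observation pairs with clearly different content.
pairs = [([1.0, 2.0], [5.0, -3.0]), ([0.5, 0.1], [9.0, 4.0])]
collapsed_loss = jepa_prediction_loss(constant_encoder, identity_predictor, pairs)
print(collapsed_loss)  # 0.0
```

The loss is perfect and the representation is useless, which is why every method in these notes adds something on top: an EMA teacher, a stop-gradient, a frozen encoder, or an explicit distribution regularizer.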

Per-paper Notes

1. A Path Towards Autonomous Machine Intelligence

Core idea: LeCun's blueprint for autonomous agents combines perception, a configurable predictive world model, cost modules, actor/planner, short-term memory, and intrinsic objectives.

Why it matters here: It is the conceptual parent of JEPA world models. The paper argues that agents should learn abstract predictive models of the world and plan in representation space rather than model every pixel.

ObjectNav relevance: The architecture maps naturally to ObjectNav: perception builds latent state; world model predicts consequences of movement; cost module encodes object-goal progress, collision risk, novelty, and map uncertainty; planner searches over actions.

Main caveat: It is a position paper, not a concrete training recipe.

2. I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Core idea: Predict representations of large target blocks from context blocks in the same image. The model learns semantic image features without hand-crafted view augmentations or pixel reconstruction.

Key mechanism: The target encoder defines the representation-space target; large multi-block masking pushes the prediction task toward semantic content.
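A rough sketch of multi-block masking on a patch grid may help here. The grid size, block sizes, and number of targets below are illustrative assumptions, not I-JEPA's actual scale and aspect-ratio settings, and `sample_block` and `multiblock_masks` are hypothetical helper names.

```python
import random

def sample_block(grid, min_side, max_side, rng):
    """Sample a rectangular block of patch indices on a grid x grid patch layout."""
    height = rng.randint(min_side, max_side)
    width = rng.randint(min_side, max_side)
    top = rng.randint(0, grid - height)
    left = rng.randint(0, grid - width)
    return {row * grid + col
            for row in range(top, top + height)
            for col in range(left, left + width)}

def multiblock_masks(grid=14, num_targets=4, seed=0):
    """Return (context_patches, target_blocks) with no overlap between them."""
    rng = random.Random(seed)
    targets = [sample_block(grid, 3, 6, rng) for _ in range(num_targets)]
    context = sample_block(grid, 10, 14, rng)
    for block in targets:
        context -= block  # the context must not see patches it has to predict
    return context, targets

context, targets = multiblock_masks()
print(len(context), [len(block) for block in targets])
```

The context encoder would see only the `context` patches, and the predictor would be asked for the target-encoder embeddings at each target block's positions.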

Takeaway: Predicting in representation space is much better for semantic representation learning than predicting pixels.

ObjectNav relevance: Useful as a foundation encoder idea, but image-only I-JEPA does not model temporal dynamics or actions.

3. V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video

Core idea: Extend JEPA-style feature prediction to video. V-JEPA learns visual representations from video using feature prediction only, without text, reconstruction, negatives, or pretrained image encoders.

Key result: Frozen V-JEPA features perform strongly on action recognition, spatio-temporal detection, and image classification, especially tasks requiring motion understanding.

Takeaway: Video feature prediction learns temporal abstractions and motion-sensitive representations.

ObjectNav relevance: Better than image-only features for moving-agent perception, but still not action-conditioned planning.

4. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Core idea: Scale V-JEPA pretraining to internet-scale video, then post-train an action-conditioned predictor on robot trajectories. The resulting V-JEPA 2-AC performs robot manipulation via latent-space MPC.

Important details:

  • Pretraining uses over 1 million hours of internet video plus images.
  • Action-conditioned post-training uses under 62 hours of DROID robot data.
  • Planning is done with image goals, without task-specific reward or lab-specific robot data.

Takeaway: This is the strongest direct evidence in the set that JEPA representations can support real-world planning when combined with a small amount of interaction data.

Limitations: Current planning horizons are short; long-horizon tasks require subgoals or hierarchical world models. Goals are image-based; language goals remain future work.

ObjectNav relevance: Very relevant. A navigation version would likely need action-conditioned post-training on robot/bag/sim trajectories plus hierarchical subgoal planning.

5. seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Core idea: Process sequences of action-observation pairs and predict the next observation representation. The architecture separates equivariant per-view representations from invariant aggregate representations.

Why interesting: It addresses a common tradeoff: classification wants invariance, while control and localization often need equivariance.

ObjectNav relevance: Strong conceptual fit. ObjectNav needs invariant object semantics and equivariant geometry/motion. seq-JEPA suggests a way to keep both instead of collapsing everything into semantic invariance.

Limitations: Experiments are mostly transformation/view-sequence representation learning, not full embodied navigation.

6. LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

Core idea: LeJEPA combines a JEPA predictive/alignment loss with SIGReg, a sketched isotropic Gaussian regularizer, to prevent collapse without stop-gradients, EMA teachers, or asymmetric tricks.

Key claim: Isotropic Gaussian embeddings are theoretically optimal for broad downstream prediction risk; SIGReg enforces this distribution efficiently.
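The "sketched" part of the regularizer can be illustrated with a simplified stand-in: project the embeddings onto random unit directions and penalize each one-dimensional projection for deviating from a standard normal. The version below compares the empirical characteristic function with exp(-t^2/2) on a small grid. The weighting, the grid, and the function names are my assumptions, so treat this as a sketch of the idea rather than the paper's exact statistic.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gaussianity_penalty_1d(samples, t_max=3.0, num_t=13):
    """Weighted squared gap between the empirical characteristic function
    of `samples` and that of N(0, 1), summed over a grid of t values."""
    n = len(samples)
    step = 2.0 * t_max / (num_t - 1)
    total = 0.0
    for k in range(num_t):
        t = -t_max + k * step
        real = sum(math.cos(t * x) for x in samples) / n
        imag = sum(math.sin(t * x) for x in samples) / n
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        # Assumption: a Gaussian window is used as the weight.
        total += ((real - target) ** 2 + imag ** 2) * target * step
    return total

def sketched_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average the 1-D penalty over random projection directions."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        u = random_unit_vector(dim, rng)
        projections = [sum(a * b for a, b in zip(e, u)) for e in embeddings]
        total += gaussianity_penalty_1d(projections)
    return total / num_directions

data_rng = random.Random(1)
gaussian_batch = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed_batch = [[0.0] * 8 for _ in range(256)]
gaussian_penalty = sketched_gaussian_penalty(gaussian_batch)
collapsed_penalty = sketched_gaussian_penalty(collapsed_batch)
print(gaussian_penalty < collapsed_penalty)  # True
```

A collapsed batch is penalized heavily while an isotropic Gaussian batch is not, which is the anti-collapse behavior the notes attribute to SIGReg.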

Practical value: A single main tradeoff hyperparameter, stable training across architectures/domains, and simple implementation.

ObjectNav relevance: This is the anti-collapse foundation for training a JEPA-like encoder directly on navigation observations.

Main caveat: The paper focuses on SSL representation learning, not necessarily action-conditioned control.

7. What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Core idea: A systematic study of what makes JEPA world models work for physical planning.

Important findings:

  • Good rollout prediction does not automatically imply good planning.
  • The cross-entropy method (CEM) with an L2 latent cost performs best overall.
  • Nevergrad performs similarly to CEM on real-world manipulation data and needs less tuning.
  • Gradient planners work on smooth landscapes but fail on non-greedy navigation/contact-rich tasks due to local minima.
  • Adding proprioceptive input substantially improves planning.
  • A multistep rollout loss helps align training with planning.
  • The context must be long enough to infer velocity; an overly long context can hurt by reducing the number of usable training slices.
  • DINO-style image encoders can beat V-JEPA encoders for manipulation/navigation because fine object segmentation matters.
  • Scaling model size helps real-world robotics more than simple simulation.
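For reference, CEM planning with an L2 latent cost, as named in the findings above, can be sketched in a few lines. The `toy_predictor` is a hypothetical stand-in for a learned latent dynamics model, and the horizon, sample count, and elite count are arbitrary assumptions rather than the paper's settings.

```python
import math
import random

def toy_predictor(z, a):
    """Hypothetical stand-in for a learned action-conditioned latent predictor."""
    return [zi + 0.1 * ai for zi, ai in zip(z, a)]

def rollout_cost(z0, actions, z_goal, predictor):
    """L2 distance between the final predicted latent and the goal latent."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return math.sqrt(sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal)))

def cem_plan(z0, z_goal, predictor, horizon=5, action_dim=2,
             samples=64, elites=8, iterations=5, seed=0):
    """Cross-entropy method: sample action sequences, keep the best, refit."""
    rng = random.Random(seed)
    mean = [[0.0] * action_dim for _ in range(horizon)]
    std = [[1.0] * action_dim for _ in range(horizon)]
    for _ in range(iterations):
        candidates = []
        for _ in range(samples):
            # Sample a sequence and clip each action component to [-1, 1].
            seq = [[max(-1.0, min(1.0, rng.gauss(mean[t][d], std[t][d])))
                    for d in range(action_dim)] for t in range(horizon)]
            candidates.append((rollout_cost(z0, seq, z_goal, predictor), seq))
        candidates.sort(key=lambda c: c[0])
        best = [seq for _, seq in candidates[:elites]]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            for d in range(action_dim):
                vals = [seq[t][d] for seq in best]
                m = sum(vals) / elites
                var = sum((v - m) ** 2 for v in vals) / elites
                mean[t][d] = m
                std[t][d] = math.sqrt(var) + 1e-3
    return mean

z_start, z_goal = [0.0, 0.0], [0.3, -0.2]
plan = cem_plan(z_start, z_goal, toy_predictor)
start_cost = rollout_cost(z_start, [], z_goal, toy_predictor)
final_cost = rollout_cost(z_start, plan, z_goal, toy_predictor)
print(final_cost < start_cost)  # True
```

In an MPC loop only the first action of `plan` would be executed before replanning from the newly encoded observation.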

ObjectNav relevance: Very high. This paper is practically useful for choosing rollout loss, context window, planner, encoder, and evaluation setup.
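On the rollout-loss choice specifically, the sketch below shows one plausible form of a multistep rollout loss: the predictor is fed its own outputs for several steps and compared against encoded observations at each step. The encoder, predictor, and data are toy assumptions, and a real training loop would typically treat the targets as fixed when computing gradients.

```python
def multistep_rollout_loss(encoder, predictor, observations, actions, steps):
    """Average squared latent error over `steps` autoregressive prediction steps."""
    z = encoder(observations[0])
    total = 0.0
    for k in range(steps):
        z = predictor(z, actions[k])           # feed the prediction back in
        target = encoder(observations[k + 1])  # latent of the real next frame
        total += sum((a - b) ** 2 for a, b in zip(z, target)) / len(z)
    return total / steps

def identity_encoder(obs):
    """Toy encoder: the observation already is the latent."""
    return list(obs)

def biased_predictor(z, a):
    """Toy predictor that underestimates the effect of each action by 10%."""
    return [zi + 0.9 * ai for zi, ai in zip(z, a)]

observations = [[0.0], [1.0], [2.0], [3.0]]
actions = [[1.0], [1.0], [1.0]]
one_step = multistep_rollout_loss(identity_encoder, biased_predictor,
                                  observations, actions, steps=1)
three_step = multistep_rollout_loss(identity_encoder, biased_predictor,
                                    observations, actions, steps=3)
print(one_step < three_step)  # True
```

The small one-step bias compounds over the rollout, which is the kind of error a one-step loss never sees but a planner does.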

Design warning: Do not assume a better representation benchmark score means better planning.

8. Value-guided Action Planning with JEPA World Models

Core idea: Shape the latent representation so distance or quasi-distance between state embeddings approximates the negative goal-conditioned value function.

Result: IQL-inspired value-function representations, especially quasi-distance versions, improve planning over standard JEPA or simple contrastive/regressive baselines in wall/maze control tasks.
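To show what distinguishes a quasi-distance from an ordinary distance, here is a small illustrative example. The specific form (sum of positive coordinate differences) is my choice for illustration, not the paper's parameterization. It keeps the triangle inequality but drops symmetry, so "reaching B from A" can cost more or less than "reaching A from B".

```python
def quasi_distance(x, y):
    """Asymmetric distance: only coordinates where y exceeds x contribute."""
    return sum(max(0.0, yi - xi) for xi, yi in zip(x, y))

def planning_cost(predicted_latent, goal_latent):
    """Use the quasi-distance as a stand-in for the negative goal-conditioned value."""
    return quasi_distance(predicted_latent, goal_latent)

a, b, c = [0.0, 0.0], [1.0, -2.0], [2.0, 1.0]
print(quasi_distance(a, b), quasi_distance(b, a))  # 1.0 2.0
print(quasi_distance(a, c) <= quasi_distance(a, b) + quasi_distance(b, c))  # True
```

Asymmetry matters for navigation-like tasks where going one way is cheap and coming back is expensive, such as one-way doors or drops.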

ObjectNav relevance: Very high. ObjectNav goals are not just "match this image"; they are reachability/value problems. A value-shaped latent distance is more promising than raw embedding distance.

Limitations: Small/simple environments; IQL-style value learning can be biased in stochastic environments; distant-state relationships remain hard.

9. VJEPA: Variational JEPA as Probabilistic World Models

Core idea: Replace deterministic JEPA regression with a predictive distribution over future latent states. This makes JEPA compatible with uncertainty-aware planning and stochastic control.

Why it matters: Deterministic MSE predicts a conditional mean, which can be invalid in multimodal futures. Probabilistic latent prediction can represent multiple possible futures.
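A tiny worked example of the conditional-mean problem, using made-up numbers: if the next latent is -1 or +1 with equal probability, the prediction that minimizes expected squared error is 0, a state that never actually occurs.

```python
# Two equally likely futures for a one-dimensional latent (toy values).
futures = [-1.0, 1.0]

def expected_mse(prediction):
    """Expected squared error of a single deterministic prediction."""
    return sum((prediction - f) ** 2 for f in futures) / len(futures)

# Search a grid of candidate predictions in [-1, 1].
candidates = [i / 10 for i in range(-10, 11)]
best = min(candidates, key=expected_mse)
print(best, best in futures)  # 0.0 False
```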

ObjectNav relevance: Important for real navigation because observations are partial, occluded, and multimodal. Uncertainty should feed exploration and safety decisions.

Status caveat: The paper is more theoretical/framework-oriented than a mature robotics benchmark result.

10. Causal-JEPA: Learning World Models through Object-Level Latent Masking

Core idea: Use object-centric representations and mask object-level latents so the model must infer masked object states from surrounding objects and temporal context.

Key contribution: Object-level masking acts like a counterfactual query and encourages interaction reasoning rather than patch-level shortcut learning.
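Object-level masking can be sketched as follows, under the assumption that the latent state is a table `slots[t][k]` holding the latent of object `k` at time `t`. The choice to leave the first frame visible and to mask whole objects from then on is mine for illustration; the paper's exact masking schedule may differ, and `mask_object_slots` is a hypothetical name.

```python
import random

def mask_object_slots(slots, mask_token, num_masked=1, first_masked_step=1, seed=0):
    """Replace the latents of randomly chosen objects with a mask token
    from `first_masked_step` onward; other objects stay visible."""
    rng = random.Random(seed)
    num_objects = len(slots[0])
    masked_ids = sorted(rng.sample(range(num_objects), num_masked))
    masked = []
    for t, frame in enumerate(slots):
        masked.append([
            list(mask_token) if (k in masked_ids and t >= first_masked_step) else list(vec)
            for k, vec in enumerate(frame)
        ])
    return masked, masked_ids

# Four time steps, three objects, two-dimensional toy latents.
slots = [[[float(t), float(k)] for k in range(3)] for t in range(4)]
masked, masked_ids = mask_object_slots(slots, mask_token=[-1.0, -1.0])
print(masked_ids)
```

The training target would then be the original latents of the masked object, so the model can only succeed by reasoning from the other objects and the visible history.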

Reported benefit: Efficient predictive control with far fewer tokens than patch-based world models and much faster MPC, while improving counterfactual visual reasoning.

ObjectNav relevance: Very high for semantic navigation. ObjectNav is object-centric by nature; object slots are a more natural state unit than image patches.

Limitations: Depends on the quality of the frozen object-centric encoder; richer causal graph validation remains future work.

11. V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Core idea: Improve V-JEPA 2 dense spatial features with dense predictive loss, deep self-supervision, image/video tokenizers, and better scaling/data curation.

Why it matters: V-JEPA 2 had strong global understanding, but dense localization features were weaker. V-JEPA 2.1 improves depth, tracking, segmentation, object-interaction anticipation, action anticipation, and robot grasping.

ObjectNav relevance: Dense features are essential for localization, obstacle reasoning, object grounding, and map updates. This paper suggests that a world-model encoder should not only be globally semantic; it must preserve local spatial structure.

12. LeWorldModel: Stable End-to-End JEPA from Pixels

Core idea: Train an action-conditioned latent world model end-to-end from pixels using only a next-embedding prediction loss plus SIGReg. No frozen encoder, no EMA, no stop-gradient, no reconstruction, no reward.

Important details:

  • Compact 15M-parameter model.
  • Trainable on a single GPU in a few hours.
  • Up to about 48x faster planning than DINO-WM in their setup.
  • Competitive across 2D and 3D control tasks.
  • Latent space encodes physical quantities and supports violation-of-expectation surprise detection.
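The surprise signal in the last bullet can be approximated as latent prediction error compared against a threshold calibrated on ordinary transitions. The sketch below uses made-up scores and a mean-plus-three-standard-deviations rule, both of which are assumptions rather than the paper's procedure.

```python
import math

def surprise(predicted_latent, observed_latent):
    """Squared prediction error in latent space."""
    return sum((p - o) ** 2 for p, o in zip(predicted_latent, observed_latent))

def calibrate_threshold(normal_scores, num_std=3.0):
    """Mean plus a few standard deviations of surprise on ordinary transitions."""
    mean = sum(normal_scores) / len(normal_scores)
    var = sum((s - mean) ** 2 for s in normal_scores) / len(normal_scores)
    return mean + num_std * math.sqrt(var)

# Toy surprise scores collected on transitions the model predicts well.
normal_scores = [0.010, 0.012, 0.008, 0.011, 0.009]
threshold = calibrate_threshold(normal_scores)

expected = surprise([0.50, 0.20], [0.52, 0.21])    # small prediction error
violation = surprise([0.50, 0.20], [1.40, -0.60])  # object "teleports"
print(expected > threshold, violation > threshold)  # False True
```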

Takeaway: LeWorldModel is the cleanest "small, trainable, end-to-end JEPA world model" recipe in this pack.

Limitations:

  • Planning remains short-horizon.
  • Needs sufficient offline interaction coverage.
  • SIGReg can be mismatched in very low-dimensional/simple environments.
  • Still depends on action labels.

ObjectNav relevance: Excellent prototype candidate for a local navigation world model, especially if trained from RGB/RGB-D plus egomotion/action data.

13. Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models

Core idea: LeWorldModel's full-space isotropic Gaussian prior can be too strong when real latent dynamics lie on a lower-dimensional manifold. Sub-JEPA applies Gaussian regularization in multiple random orthogonal subspaces.

Result: Improves LeWM across four continuous-control environments and gives more coherent latent trajectories/open-loop rollouts.

Interpretation: Subspace regularization relaxes the bias of full Gaussian regularization while preserving anti-collapse behavior.
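One way to picture "multiple random orthogonal subspaces" is to draw a random orthonormal basis and split it into groups, then regularize each group's projection separately. How Sub-JEPA actually constructs and sizes its subspaces is not covered in these notes, so the Gram-Schmidt construction and the equal split below are assumptions.

```python
import math
import random

def random_orthonormal_basis(dim, rng):
    """Build an orthonormal basis from Gaussian vectors via Gram-Schmidt."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # skip the rare degenerate draw
            basis.append([x / norm for x in v])
    return basis

def split_into_subspaces(basis, num_subspaces):
    """Partition the basis vectors into equally sized, mutually orthogonal groups."""
    size = len(basis) // num_subspaces
    return [basis[i * size:(i + 1) * size] for i in range(num_subspaces)]

def project(embedding, subspace):
    """Coordinates of `embedding` in the given subspace."""
    return [sum(e * b for e, b in zip(embedding, vec)) for vec in subspace]

rng = random.Random(0)
basis = random_orthonormal_basis(8, rng)
subspaces = split_into_subspaces(basis, num_subspaces=4)
embedding = [1.0, -2.0, 0.5, 0.0, 3.0, 1.5, -1.0, 2.0]
# A Gaussian penalty would be applied to each of these projections separately.
projections = [project(embedding, s) for s in subspaces]
print([len(p) for p in projections])  # [2, 2, 2, 2]
```

Each subspace sees only a low-dimensional slice of the embedding, so the Gaussian prior constrains every slice without forcing the full space to be isotropic.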

ObjectNav relevance: Strong. Local navigation dynamics often have low intrinsic dimensionality even though they are observed through high-dimensional images. Subspace regularization may therefore suit map/pose/object state manifolds better than full SIGReg.

Risk: Extra projection design choices; too-small subspaces can hurt stability.

14. When Does LeJEPA Learn a World Model?

Core idea: Gives a theoretical answer: LeJEPA learns a world model when its representation linearly recovers the true latent variables from nonlinear observations. Under stationary additive-noise worlds with Gaussian latents, LeJEPA is linearly identifiable up to rotation.

Why it matters: It ties the Gaussian regularization story to actual world-model structure, not just anti-collapse.

Practical implication:

  • Data distribution matters. Isotropic/random-walk-like exploration supports identifiability better than narrow goal-directed data.
  • Encoder dimension matters; wrong latent dimensionality is an open problem.
  • Linear identifiability is a state-side guarantee; action-conditioned transition learning is still separate.

ObjectNav relevance: Important for data collection. If lifelong ObjectNav only records goal-directed successful routes, it may not learn a faithful world model. Exploration diversity matters.

Cross-paper Takeaways For ObjectNav

A. Use JEPA For Latent Dynamics, Not Pixel Simulation

The strongest common argument is that pixel prediction is a bad default for navigation. It wastes capacity on texture, lighting, and stochastic details. For ObjectNav, the useful state should preserve:

  • object identity and affordance,
  • spatial layout,
  • agent/object relative geometry,
  • traversability,
  • motion and occlusion cues,
  • uncertainty or surprise.

JEPA-style latent prediction is a good fit for this target.

B. Raw Latent Distance Is Not Enough

Multiple papers imply that planning cost is the fragile part. If latent distance does not correspond to reachability, the planner can get stuck.

Recommended direction:

  • learn or calibrate a goal-conditioned value/quasi-distance over latent states;
  • use latent prediction for rollout, but use value-shaped cost for planning;
  • evaluate planning success directly, not only representation quality.

C. Dense/Object-centric Features Matter

For manipulation and navigation, DINO-style image encoders sometimes beat V-JEPA-style video encoders because fine object segmentation matters. V-JEPA 2.1 and Causal-JEPA both point toward preserving local and object-level structure.

Recommended direction:

  • consider an object-centric latent state for ObjectNav,
  • keep dense features for map/object grounding,
  • avoid only using a single global embedding for planning.

D. Short Horizon Is The Current Wall

V-JEPA 2, LeWorldModel, and value-guided JEPA all still struggle with long-horizon planning. They work best with MPC, subgoals, or short rollouts.

Recommended direction:

  • combine JEPA world model with hierarchical navigation,
  • use high-level semantic subgoals from a map or LLM/planner,
  • use JEPA locally for rollout, surprise, and short-horizon action scoring.

E. Offline Data Coverage Is Crucial

The theory and empirical papers agree: if the training data does not cover relevant dynamics, the model will not recover a useful world state.

Recommended direction:

  • train from diverse trajectories, including random/exploratory behavior;
  • include failed routes and recovery behaviors;
  • include changes in viewpoint, lighting, occlusion, and object arrangements;
  • log actions/odometry/proprioception where available.

F. Proprioception / Egomotion Should Be Included

The planning study shows that proprioception matters. For a robot navigation stack, the equivalent signals are odometry, IMU, previous velocity/action, and possibly depth.

Recommended input boundary:

  • image or RGB-D observation,
  • previous action/control,
  • odometry/egomotion,
  • object detections or object slots,
  • optional semantic map context.
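As a sketch of how this input boundary might look in code, the dataclass below bundles the listed signals into one record. The class name, field names, and shapes are all hypothetical; nothing here is prescribed by the papers.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NavObservation:
    """One time step of input to a navigation world model (hypothetical layout)."""
    rgb: List[List[List[float]]]             # H x W x 3 image
    previous_action: List[float]             # last control command
    odometry: List[float]                    # e.g. [dx, dy, dyaw] since last step
    depth: Optional[List[List[float]]] = None          # H x W, if RGB-D is available
    object_slots: List[List[float]] = field(default_factory=list)  # per-object latents
    semantic_map: Optional[List[List[int]]] = None     # optional map context

# A 1 x 1 "image" is enough to show construction and defaults.
obs = NavObservation(rgb=[[[0.1, 0.2, 0.3]]],
                     previous_action=[0.5, 0.0],
                     odometry=[0.05, 0.0, 0.01])
print(obs.depth is None, obs.object_slots)  # True []
```

Keeping the optional signals explicit makes it easy to ablate them one at a time, which is how the planning study isolated the effect of proprioception.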

Suggested Research Direction For This Repo

If we were to turn this into an ObjectNav research thread, I would not start with a giant V-JEPA 2 reproduction. I would start smaller:

  1. Build a local latent dynamics model. Train LeWorldModel/Sub-JEPA-style next-latent prediction from recorded/simulated navigation trajectories.

  2. Use object/dense anchors. Encode both a dense visual feature stream and an object-slot stream. This matches the "dual-anchor" idea better than a single frame embedding.

  3. Add value-shaped goal cost. Train a goal-conditioned distance/quasi-distance for "can I reach this object/subgoal from here?" and use it as the MPC cost.

  4. Evaluate directly on navigation. Metrics should include ObjectNav success/SPL, short-rollout prediction error, surprise on scene changes, and whether latent distance correlates with geodesic progress.

  5. Keep it hierarchical. Use semantic map/object memory for long horizon, JEPA world model for local rollout and replanning.
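Two of the metrics in step 4 are easy to pin down in code. SPL is the standard success-weighted-by-path-length metric: the mean over episodes of success times shortest path length divided by the larger of the agent's path length and the shortest path length. The rank correlation below is a simple Spearman implementation that ignores ties. The episode tuples and distance values are made-up toy data.

```python
import math

def spl(episodes):
    """episodes: list of (success, shortest_path_length, agent_path_length)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

def ranks(values):
    """Rank positions of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order):
        result[i] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

episodes = [(True, 5.0, 5.0), (True, 4.0, 8.0), (False, 3.0, 10.0)]
print(spl(episodes))  # 0.5

latent_distance = [0.9, 0.7, 0.8, 0.3]    # toy latent distances to the goal
geodesic_distance = [6.0, 4.0, 5.0, 1.0]  # toy ground-truth path lengths
print(spearman(latent_distance, geodesic_distance))  # 1.0
```

A rank correlation near 1 would indicate that latent distance orders states the same way geodesic distance does, which is the property a latent planning cost needs.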

Priority Reading Order

If time is short, read in this order:

  1. 2026_LeWorldModel_Stable_End-to-End_JEPA_from_Pixels.pdf
  2. 2025_JEPA-WM_Physical_Planning_Drivers.pdf
  3. 2025_V-JEPA_2_Self-Supervised_Video_Models_Enable_Understanding_Prediction_and_Planning.pdf
  4. 2026_Value-Guided_Action_Planning_with_JEPA_World_Models.pdf
  5. 2026_Causal-JEPA_Object-Level_Latent_Interventions.pdf
  6. 2026_When_Does_LeJEPA_Learn_a_World_Model.pdf
  7. 2026_Sub-JEPA_Subspace_Gaussian_Regularization_for_World_Models.pdf
  8. 2026_V-JEPA_2-1_Unlocking_Dense_Features_in_Video_SSL.pdf

The rest are foundation/context.
