World Models and Human–Object Interaction (HOI)
Author: ChatGPT
Here are several key research papers that explore the intersection of world models and Human–Object Interaction (HOI)—especially ones that build structured, object-centric representations from videos or use world-model-based learning to plan object-rich interactions.
🧠 1. FOCUS: Object‑Centric World Models for Robotic Manipulation (Jul 2023)
Proposes FOCUS, a model-based RL agent that builds a structured world model by encoding each object into its own latent vector. The object-centric representation guides exploration toward object interaction and enables efficient task learning in simulated environments such as ManiSkill2 and Robosuite, as well as on real Franka robot hardware, improving exploration and sample efficiency in sparse‑reward manipulation tasks. (arXiv, Frontiers)
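As a rough illustration of the object-centric idea (not the paper's actual architecture), the PyTorch-style sketch below keeps one latent "slot" per object, rolls the slots forward with a learned dynamics model, and scores exploration by how much the object latents are predicted to change. All module names, sizes, and the interaction bonus are assumptions.

```python
# Minimal sketch of an object-centric latent world model (not the official FOCUS code).
# Assumptions: per-object observations are already segmented, a shared encoder maps each
# object to a latent slot, and a dynamics model rolls slots forward given the action.
import torch
import torch.nn as nn

class ObjectCentricWorldModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32, action_dim=7, num_objects=3):
        super().__init__()
        self.num_objects = num_objects
        # Shared per-object encoder: one latent vector ("slot") per object.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        # Latent dynamics: predicts the next slot from (slot, action).
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))

    def encode(self, object_obs):           # object_obs: (B, K, obs_dim)
        return self.encoder(object_obs)     # -> (B, K, latent_dim)

    def predict_next(self, slots, action):  # slots: (B, K, D), action: (B, A)
        a = action.unsqueeze(1).expand(-1, self.num_objects, -1)
        return self.dynamics(torch.cat([slots, a], dim=-1))

def interaction_bonus(slots_t, slots_tp1):
    # Crude proxy for "object interaction": reward predicted change in object latents,
    # which steers exploration toward actions that actually move objects.
    return (slots_tp1 - slots_t).pow(2).sum(dim=(-1, -2))
```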
🔧 2. Structured World Models from Human Videos (SWIM, RSS ’23)
SWIM pre-trains world models on human video data. Its affordance-based, human-centric structured action space lets robots learn diverse manipulation skills from only about 30 minutes of real robot experience, and the affordance-grounded world model supports goal-based planning and policy execution despite the embodiment gap between humans and robots. (Medium)
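A hedged sketch of what an affordance-structured action space might look like in code; the field names, the approach offset, and the interpolation step are illustrative assumptions rather than SWIM's actual interface.

```python
# Sketch of an affordance-structured action: a contact point plus a coarse post-contact
# motion, expanded into a dense end-effector trajectory. All names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class AffordanceAction:
    contact_point: np.ndarray           # (3,) where to grasp/touch
    post_contact_waypoints: np.ndarray  # (T, 3) coarse motion after contact

def to_robot_trajectory(act: AffordanceAction, approach_offset=0.05):
    """Expand a high-level affordance into keyframes: approach above the contact
    point, move to contact, then follow the post-contact waypoints."""
    approach = act.contact_point + np.array([0.0, 0.0, approach_offset])
    keyframes = np.vstack([approach, act.contact_point, act.post_contact_waypoints])
    # Linear interpolation between keyframes; a real system would use a planner.
    dense = [np.linspace(keyframes[i], keyframes[i + 1], num=10)
             for i in range(len(keyframes) - 1)]
    return np.concatenate(dense, axis=0)
```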
🖐️ 3. Human‑Object Interaction with Vision‑Language Model Guided Relative Movement Dynamics (RMD‑HOI) (Mar 2025)
Introduces a framework where vision-language models translate free-form instructions into Relative Movement Dynamics (RMD) guiding language‑conditioned reinforcement learning. The model allows long‑horizon, multi-round HOI planning—even with dynamic and articulated objects. It couples semantic instruction, perception, and motion planning. (arXiv)
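The snippet below is only an illustration of how a relative-movement target could be turned into an RL reward; the RMDTarget fields and the reward shaping are assumptions, and the paper's actual RMD representation is richer (phases, multiple body and object parts).

```python
# Illustrative encoding of a relative-movement target as an RL reward (assumption,
# not the paper's formulation).
from dataclasses import dataclass
import numpy as np

@dataclass
class RMDTarget:
    agent_part: str                # e.g. "right_hand"
    object_part: str               # e.g. "drawer_handle"
    desired_distance: float        # target relative distance for this phase
    desired_direction: np.ndarray  # unit vector for desired relative motion

def rmd_reward(agent_pos, object_pos, prev_agent_pos, target: RMDTarget):
    rel = object_pos - agent_pos
    dist = np.linalg.norm(rel)
    # Reward progress toward the desired relative distance...
    distance_term = -abs(dist - target.desired_distance)
    # ...and motion of the agent part along the desired relative direction.
    step = agent_pos - prev_agent_pos
    direction_term = float(np.dot(step, target.desired_direction))
    return distance_term + direction_term
```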
🌍 4. OpenHOI: Open‑World HOI Synthesis with Multimodal LLM (May 2025)
OpenHOI brings together affordance grounding, language decomposition, and an affordance-driven diffusion model with physics-based refinement. It enables generation of long-horizon hand-object interactions from language commands over novel objects. This is essentially world-model-informed HOI synthesis grounded in affordance and physics. (arXiv)
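The pseudocode below only sketches the stage ordering described above; every function it calls is a placeholder (assumption), not OpenHOI's real API.

```python
# High-level pseudocode for an OpenHOI-style pipeline, written as plain Python.
# Stages: language decomposition -> affordance grounding -> affordance-driven
# diffusion -> physics-based refinement. All callees are placeholders.
def synthesize_hoi(instruction: str, object_mesh, mllm, affordance_model,
                   diffusion_model, physics_refiner):
    # 1. Multimodal LLM decomposes the command into sub-tasks ("grasp lid", "lift", ...).
    subtasks = mllm.decompose(instruction, object_mesh)
    motions = []
    for task in subtasks:
        # 2. Ground where on the object this sub-task acts (affordance region).
        affordance = affordance_model.ground(object_mesh, task)
        # 3. Affordance-conditioned diffusion samples a hand-object motion segment.
        segment = diffusion_model.sample(task, affordance, object_mesh)
        # 4. Physics-based refinement fixes penetration / contact violations.
        motions.append(physics_refiner.refine(segment, object_mesh))
    return motions
```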
🔄 5. Vision-Based Manipulation from a Single Human Video (ORION)
Learns manipulation policies from a single RGB-D human demonstration using Open-world Object Graphs (OOGs): structured, object- and hand-centric representations. From the OOGs, ORION constructs manipulation plans that generalize across spatial layouts, backgrounds, and unseen object instances. (arXiv)
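A minimal sketch of an object- and hand-centric graph in the spirit of OOGs, assuming per-keyframe 3D positions are available; the node and edge attributes and the contact threshold are illustrative, not ORION's actual data structure.

```python
# Sketch of a per-keyframe object/hand graph (assumed attributes, not ORION's format).
import numpy as np
import networkx as nx

def _dist(p, q):
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def build_object_graph(frame, contact_thresh=0.03):
    """frame: {'hand_pos': (3,), 'object_pos': {name: (3,)}} for one keyframe."""
    g = nx.DiGraph()
    g.add_node("hand", pos=frame["hand_pos"])
    for name, pos in frame["object_pos"].items():
        g.add_node(name, pos=pos)
        # Add a contact edge when the hand is within an assumed distance threshold.
        if _dist(frame["hand_pos"], pos) < contact_thresh:
            g.add_edge("hand", name, relation="in_contact")
    return g
```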
📚 6. World Model Foundations
- Ha & Schmidhuber (2018) gave the canonical formulation: a VAE for perception (V), a recurrent dynamics model (M, an MDN-RNN) for prediction, and a small controller (C) for control; see the sketch after this list.
- LeCun (2022) frames world models as neural “mental simulation” for commonsense reasoning, often embedded in embodied agents. (Wikipedia)
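A minimal PyTorch-style sketch of the V/M/C layout; the module sizes and the dense (rather than convolutional) encoder are simplifications for illustration, not the paper's exact components.

```python
# Sketch of the Ha & Schmidhuber (2018) world-model layout: V compresses observations,
# M predicts latent dynamics, C maps (latent, hidden state) to an action.
import torch
import torch.nn as nn

class V(nn.Module):  # perception: observation -> latent z
    def __init__(self, obs_dim=1024, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, z_dim))
    def forward(self, obs):
        return self.net(obs)

class M(nn.Module):  # dynamics: (z sequence, action sequence) -> hidden states
    def __init__(self, z_dim=32, action_dim=3, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + action_dim, hidden, batch_first=True)
    def forward(self, z_seq, a_seq, state=None):
        out, state = self.rnn(torch.cat([z_seq, a_seq], dim=-1), state)
        return out, state

class C(nn.Module):  # controller: [z, h] -> action
    def __init__(self, z_dim=32, hidden=256, action_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + hidden, action_dim)
    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], dim=-1)))
```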
📊 Summary Table
| Paper / Model | Domain | World-Model Structure | HOI Aspect |
|---|---|---|---|
| FOCUS | RL / robotics | Object-centric latent dynamics | Focused exploration, object manipulation |
| SWIM (Structured WM) | Pre‑training RL | Affordance action world model | From human videos → robot affordance plans |
| RMD‑HOI | HOI / RL | Language-guided dynamics model | Vision-language → sequential HOI planning |
| OpenHOI | Multimodal HOI | Affordance + diffusion + world model | Open-world HOI synthesis with physics |
| ORION | Imitation from video | Object-graph world plan extraction | Single-demo generalizable HOI policies |
💡 Why These Matter
- Object-centric representations in world models (FOCUS, SWIM, ORION) let a model represent and reason about individual objects and their interactions, improving sample efficiency and generalization.
- Affordance-guided structures bridge perception and action, enabling tasks to be grounded even from limited data.
- Language-guided dynamics planning (RMD‑HOI, OpenHOI) allows long-horizon sequential HOI planning from natural instructions.
- These methods enable zero-/few-shot generalization to new objects, instructions, or environments.