World Models and Human–Object Interaction (HOI)

Author: ChatGPT
Below are several research papers at the intersection of world models and Human–Object Interaction (HOI). The emphasis is on work that builds structured, object-centric representations from video or uses world-model-based learning to plan object-rich interactions.


🧠 1. FOCUS: Object‑Centric World Models for Robotic Manipulation (Jul 2023)

FOCUS is a model-based RL agent that builds a structured world model by encoding each object into its own latent vector. The agent steers exploration toward object interactions, which improves exploration and sample efficiency on sparse-reward manipulation tasks. It is evaluated in simulated environments such as ManiSkill2 and Robosuite, and on real Franka robot hardware. (arXiv, Frontiers)
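To make the idea of one latent vector per object concrete, here is a minimal Python sketch. It is not the FOCUS architecture or its exploration objective: the names `ObjectCentricState` and `interaction_bonus`, and the use of latent displacement as an exploration signal, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List


@dataclass
class ObjectCentricState:
    """Hypothetical world-model state: one latent vector per object."""
    latents: Dict[str, List[float]]


def interaction_bonus(prev: ObjectCentricState, curr: ObjectCentricState) -> float:
    """Toy intrinsic reward: total movement of object latents between two steps.

    Assumption: an object whose latent changed was interacted with.
    """
    bonus = 0.0
    for name, z_prev in prev.latents.items():
        z_curr = curr.latents[name]
        bonus += sqrt(sum((a - b) ** 2 for a, b in zip(z_curr, z_prev)))
    return bonus


before = ObjectCentricState({"cube": [0.0, 0.0], "drawer": [1.0, 1.0]})
after = ObjectCentricState({"cube": [3.0, 4.0], "drawer": [1.0, 1.0]})
print(interaction_bonus(before, after))  # only the cube latent moved
```

Because each object has its own slot, the bonus can be attributed to a specific object, which is harder to do with a single entangled latent.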


🔧 2. Structured World Models from Human Videos (RSS’23)

Also known as SWIM, this approach pre-trains a world model on human video data. Its affordance-based, human-centric structured action space lets robots learn diverse manipulation skills with only about 30 minutes of real-robot experience, and it lets knowledge transfer across embodiments instead of being tied to one robot. (Medium)


🎛️ 3. Structured World Models from Human Videos (same paper as item 2)

This entry refers to the same paper as item 2. The additional point is that human video is used to learn affordance-grounded world models that encode object interactions, which supports goal-based planning and policy execution even with limited robot experience.
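Goal-based planning with a world model over a structured action space can be sketched as follows. This is a toy under stated assumptions, not the method from the paper: `AffordanceAction`, `toy_world_model`, and `plan` are hypothetical names, the state is a 2-D object position instead of an image, and the planner is plain random shooting.

```python
import random
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class AffordanceAction:
    """Hypothetical structured action: where to grasp, then where to move."""
    grasp: Point
    waypoint: Point


def toy_world_model(obj_pos: Point, action: AffordanceAction) -> Point:
    """Stand-in for a learned model: the object moves only if the grasp lands on it."""
    gx, gy = action.grasp
    hit = abs(gx - obj_pos[0]) < 0.1 and abs(gy - obj_pos[1]) < 0.1
    return action.waypoint if hit else obj_pos


def plan(obj_pos: Point, goal: Point, n_samples: int = 500, seed: int = 0) -> AffordanceAction:
    """Random-shooting planner: sample actions, imagine outcomes, keep the best."""
    rng = random.Random(seed)
    best_action, best_cost = None, float("inf")
    for _ in range(n_samples):
        # Sample grasps near the object; an affordance model would propose these in practice.
        grasp = (obj_pos[0] + rng.uniform(-0.2, 0.2), obj_pos[1] + rng.uniform(-0.2, 0.2))
        waypoint = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
        action = AffordanceAction(grasp, waypoint)
        pred = toy_world_model(obj_pos, action)
        cost = (pred[0] - goal[0]) ** 2 + (pred[1] - goal[1]) ** 2
        if cost < best_cost:
            best_action, best_cost = action, cost
    return best_action


start, goal = (0.5, 0.5), (0.9, 0.1)
best = plan(start, goal)
print(best, toy_world_model(start, best))
```

The planner never touches the real environment while searching; it only queries the model, which is why pre-training that model on cheap human video can reduce the amount of robot experience needed.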


🖐️ 4. Human‑Object Interaction with Vision‑Language Model Guided Relative Movement Dynamics (RMD‑HOI) — Mar 2025

Introduces a framework in which vision-language models translate free-form instructions into Relative Movement Dynamics (RMD), which then guide language-conditioned reinforcement learning. The approach supports long-horizon, multi-round HOI planning, including with dynamic and articulated objects, and couples semantic instruction, perception, and motion planning. (arXiv)
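As a rough intuition for describing an interaction through relative movement, the sketch below labels each time step by how the distance between a body part and an object part changes. This is a simplified illustration and not the paper's RMD definition; the function name `relative_movement`, the three labels, and the tolerance are assumptions.

```python
from math import dist
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def relative_movement(part_traj: List[Vec3], obj_traj: List[Vec3], tol: float = 1e-3) -> List[str]:
    """Label each step by how the body-part-to-object distance changes."""
    labels = []
    for t in range(1, len(part_traj)):
        d_prev = dist(part_traj[t - 1], obj_traj[t - 1])
        d_curr = dist(part_traj[t], obj_traj[t])
        if d_curr < d_prev - tol:
            labels.append("approach")
        elif d_curr > d_prev + tol:
            labels.append("retreat")
        else:
            labels.append("hold")
    return labels


# The hand reaches a handle, then the two move together.
hand = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
handle = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
print(relative_movement(hand, handle))
```

A sequence of such labels is independent of absolute position, which hints at why a relative description can serve as a compact target for a language-conditioned policy.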


🌍 5. OpenHOI: Open‑World HOI Synthesis with Multimodal LLM — May 2025

OpenHOI combines affordance grounding, language-based task decomposition, and an affordance-driven diffusion model with physics-based refinement. It generates long-horizon hand-object interactions from language commands, including for novel objects. It can be read as world-model-informed HOI synthesis grounded in affordance and physics. (arXiv)


🔄 6. Vision-Based Manipulation from Single Human Video (ORION)

ORION learns manipulation policies from a single RGB-D human demonstration using Open-world Object Graphs (OOGs), which are structured, object- and hand-centric representations. It constructs manipulation plans that generalize across spatial layouts, backgrounds, and unseen object instances. (arXiv)
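A graph-structured, object- and hand-centric representation can be sketched as a small data structure. The classes and fields below (`Node`, `ObjectGraph`, `contact_changes`) are hypothetical and far simpler than the OOGs described for ORION; they only illustrate how contact relations between keyframes can expose subgoals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    """Hypothetical graph node: an object or the demonstrator's hand."""
    name: str
    kind: str  # "object" or "hand"
    position: Tuple[float, float, float]


@dataclass
class ObjectGraph:
    """Toy per-keyframe graph; an edge marks contact between two nodes."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    contacts: Set[frozenset] = field(default_factory=set)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def add_contact(self, a: str, b: str) -> None:
        self.contacts.add(frozenset((a, b)))


def contact_changes(prev: ObjectGraph, curr: ObjectGraph) -> List[Tuple[str, Tuple[str, ...]]]:
    """List contacts made or broken between two keyframes, a crude subgoal signal."""
    made = [("made", tuple(sorted(c))) for c in curr.contacts - prev.contacts]
    broken = [("broken", tuple(sorted(c))) for c in prev.contacts - curr.contacts]
    return sorted(made + broken)


def keyframe(contacts: List[Tuple[str, str]]) -> ObjectGraph:
    """Build a three-node graph with placeholder positions and the given contacts."""
    g = ObjectGraph()
    for name, kind in (("hand", "hand"), ("mug", "object"), ("table", "object")):
        g.add_node(Node(name, kind, (0.0, 0.0, 0.0)))
    for a, b in contacts:
        g.add_contact(a, b)
    return g


resting = keyframe([("mug", "table")])
lifted = keyframe([("hand", "mug")])
print(contact_changes(resting, lifted))
```

The contact changes do not depend on where the mug or table sit in the scene, which is one intuition for why relational representations can generalize across layouts.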


📚 7. World Model Foundations

  • Ha & Schmidhuber (2018), the original formulation: a VAE for perception, an RNN for dynamics, and a small controller for action selection.
  • LeCun (2022): world models as a neural form of “mental simulation” for commonsense reasoning, often incorporated in embodied agents. (Wikipedia)
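The perception, dynamics, and control split in the Ha & Schmidhuber formulation can be sketched as a loop. The functions below are toy deterministic stand-ins, not trained networks: `encode` replaces the VAE, `dynamics` replaces the RNN, and the weights in `controller` are arbitrary values chosen for illustration.

```python
from math import tanh
from typing import List


def encode(obs: List[float]) -> List[float]:
    """V: stand-in for the VAE encoder; compresses 4 values to a 2-D latent."""
    return [(obs[0] + obs[1]) / 2, (obs[2] + obs[3]) / 2]


def dynamics(h: List[float], z: List[float], action: float) -> List[float]:
    """M: stand-in for the recurrent model; updates the hidden state."""
    return [tanh(0.5 * hi + zi + 0.1 * action) for hi, zi in zip(h, z)]


def controller(z: List[float], h: List[float]) -> float:
    """C: small linear policy on the concatenated latent and hidden state."""
    weights = [0.3, -0.2, 0.5, 0.1]  # arbitrary illustrative weights
    return sum(w * x for w, x in zip(weights, z + h))


def rollout(observations: List[List[float]]) -> List[float]:
    """Run the V -> C -> M loop over a sequence of observations."""
    h = [0.0, 0.0]
    actions = []
    for obs in observations:
        z = encode(obs)         # perceive
        a = controller(z, h)    # act on the latent and the memory
        h = dynamics(h, z, a)   # update memory with what happened
        actions.append(a)
    return actions


print(rollout([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]))
```

Keeping the controller small and putting capacity in the perception and dynamics parts is the design choice that the object-centric and affordance-based models above build on.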

📊 Summary Table

| Paper / Model | Domain | World-Model Structure | HOI Aspect |
| --- | --- | --- | --- |
| FOCUS | RL / robotics | Object-centric latent dynamics | Focused exploration, object manipulation |
| SWIM (Structured WM) | Pre-training RL | Affordance action world model | From human videos → robot affordance plans |
| RMD-HOI | HOI / RL | Language-guided dynamics model | Vision-language → sequential HOI planning |
| OpenHOI | Multimodal HOI | Affordance + diffusion + world model | Open-world HOI synthesis with physics |
| ORION | Imitation from video | Object-graph world plan extraction | Single-demo generalizable HOI policies |

💡 Why These Matter

  • Object-centric representations in world models (like FOCUS, SWIM, ORION) enable models to capture and reason about interactions more efficiently and generalize better.
  • Affordance-guided structures bridge perception and action, enabling tasks to be grounded even from limited data.
  • Language-guided dynamics planning (RMD‑HOI, OpenHOI) allows long-horizon sequential HOI planning from natural instructions.
  • Taken together, these methods target zero-shot or few-shot generalization to new objects, instructions, or environments.