
[General Agent Benchmark] Paper Sharing No. 12: LLF-Bench

news 2025/7/6 12:59:59

Paper title: LLF-Bench: Benchmark for Interactive Learning from Language Feedback

Paper: https://arxiv.org/abs/2312.06853

Institution: Microsoft

GitHub: https://github.com/microsoft/LLF-Bench

Project page: https://microsoft.github.io/LLF-Bench/

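Introduction

LLF-Bench is an interactive learning benchmark proposed by Microsoft Research and collaborators. It evaluates how well an AI Agent can learn from natural-language feedback and instructions. Its core design idea is to mimic the way humans speed up learning through verbal guidance, rather than relying on the numeric reward signal of traditional reinforcement learning.

The benchmark covers several types of sequential decision-making tasks, including user recommendation, poem writing, navigation, and robot control. Evaluation follows the OpenAI Gym API convention and is organized around the interaction loop between the Agent and the Environment.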

Benchmark tasks


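LLF-Bench contains 8 task sets, each with its own characteristics:

llf-bandit: a language-described version of the classic multi-armed bandit problem. It tests how well the Agent balances exploration and exploitation in an unfamiliar environment.
llf-poem: a poem-writing task. It checks how well the Agent understands and reasons about a set of constraints.
llf-reco-movie: a movie recommendation task. It simulates a user interacting with a recommender system and checks whether the Agent can understand the user's preferences and make suitable recommendations.
llf-optimization: a function optimization task. It tests the Agent's ability to optimize against an unknown loss function.
llf-parking: an automated parking task. It examines the Agent's learning and planning ability in a continuous control setting.
llf-gridworld: a gridworld navigation task. It tests the Agent's spatial reasoning and route planning.
llf-alfworld: a household task set in a text-described environment. It checks whether the Agent can flexibly apply what it has learned in a complex environment.
llf-metaworld: a robot manipulation task. It mainly tests the Agent's ability to control a robot arm.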
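
For readers unfamiliar with the exploration/exploitation trade-off behind llf-bandit, here is a minimal sketch of the classical numeric bandit solved with epsilon-greedy. It is only a reference point: the function name, the arm probabilities, and the parameters are illustrative assumptions, and in llf-bandit the Agent receives language feedback rather than the numeric reward used here.

```python
import random


def epsilon_greedy(arm_means, steps=2000, epsilon=0.1, seed=0):
    """Classical epsilon-greedy on Bernoulli arms (illustrative, not LLF-Bench code)."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)       # how often each arm was pulled
    estimates = [0.0] * len(arm_means)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(arm_means)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy([0.1, 0.5, 0.9])
print(counts, [round(e, 2) for e in estimates])
```

A larger epsilon spends more pulls on exploration; a smaller one commits earlier to whichever arm currently looks best.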

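Usage

The GitHub page already gives detailed usage instructions; the notes below add a few details.

Step 1: Task configuration and initialization

Select the task type (e.g. llf-parking) in a Python script and set the feedback mode (e.g. enable only "suggestion"-type feedback). A random seed can be set when the environment is initialized so that experiments are reproducible.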
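
The sketch below illustrates why a fixed seed makes a run repeatable. It does not call LLF-Bench: `TaskConfig`, `feedback_kinds`, and `sample_start_positions` are hypothetical names invented for this example, and the real option names should be taken from the repository's README. Under the Gymnasium convention a seed is normally passed as `env.reset(seed=...)`; that LLF-Bench environments honor it the same way is an assumption here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical experiment settings; not LLF-Bench's real schema."""
    env_name: str = "llf-gridworld-v0"       # id taken from the example code in this post
    feedback_kinds: tuple = ("suggestion",)  # illustrative label only
    seed: int = 0


def sample_start_positions(config, n=3):
    """Stand-in for environment randomness driven by the configured seed."""
    rng = random.Random(config.seed)
    return [(rng.randrange(5), rng.randrange(5)) for _ in range(n)]


config = TaskConfig(seed=42)
first_run = sample_start_positions(config)
second_run = sample_start_positions(config)
# Same seed, same sequence: this is what makes an experiment repeatable.
print(first_run == second_run)
```

Keeping the seed inside the configuration object means it is logged together with the task name and feedback mode, so a result can be traced back to the exact setup that produced it.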

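Step 2: Agent interaction flow

The Environment returns a natural-language description of the task, e.g. "Please write a five-character quatrain about spring."
The Agent takes an Action, e.g. outputs a first draft of the poem.
The Environment returns structured feedback based on the result of the Action, including a performance score and language suggestions.
The Agent iteratively refines its Action until the task is completed or the step limit is reached.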
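
The four steps above can be sketched with a toy environment. `GuessNumberEnv` and `run_episode` are made up for illustration and are not part of LLF-Bench; they only mimic the Gymnasium-style `reset`/`step` interface and the observation dict with 'observation', 'instruction', and 'feedback' keys. The agent here never reads the reward, only the feedback text.

```python
class GuessNumberEnv:
    """Toy language-feedback environment (illustrative, not LLF-Bench code)."""

    def __init__(self, target, max_steps=10):
        self.target = target
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        obs = {
            "observation": "No guess yet.",
            "instruction": "Guess the hidden integer between 0 and 100.",
            "feedback": None,
        }
        return obs, {}

    def step(self, action):
        self.steps += 1
        success = action == self.target
        if success:
            feedback = "Correct."
        elif action < self.target:
            feedback = "Too low, try a larger number."
        else:
            feedback = "Too high, try a smaller number."
        obs = {"observation": f"You guessed {action}.", "instruction": None, "feedback": feedback}
        reward = 1.0 if success else 0.0
        terminated = success
        truncated = (not success) and self.steps >= self.max_steps
        return obs, reward, terminated, truncated, {}


def run_episode(env):
    """Agent that narrows its search range using only the feedback text."""
    low, high = 0, 100
    obs, info = env.reset()
    done, total_reward, guess = False, 0.0, None
    while not done:
        feedback = obs["feedback"] or ""
        if "Too low" in feedback:
            low = guess + 1
        elif "Too high" in feedback:
            high = guess - 1
        guess = (low + high) // 2
        obs, reward, terminated, truncated, info = env.step(guess)
        total_reward += reward  # kept for evaluation, never shown to the agent
        done = terminated or truncated
    return total_reward, env.steps


print(run_episode(GuessNumberEnv(target=37)))  # (1.0, 3)
```

The feedback string carries direction as well as success or failure, which is why the agent converges in a handful of steps; a bare 0/1 reward would give it nothing to narrow the range with.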

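Step 3: Custom feedback and evaluation

Modify the configuration file to adjust the information density of the feedback (e.g. whether explanatory text is included), and plug in custom evaluation metrics (e.g. human ratings of a poem's creativity).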
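
One way to picture this step is a thin wrapper that trims feedback before the Agent sees it, plus a standalone metric function. Everything below is a hypothetical sketch: `StubPoemEnv`, `FeedbackFilter`, `keep_explanations`, and `line_count_metric` are invented names, and LLF-Bench's own mechanism for controlling feedback may look quite different.

```python
class StubPoemEnv:
    """Minimal stand-in environment, used only to exercise the wrapper."""

    def reset(self):
        obs = {"observation": "", "instruction": "Write a four-line poem.", "feedback": None}
        return obs, {}

    def step(self, action):
        obs = {
            "observation": action,
            "instruction": None,
            "feedback": "The poem needs four lines. Count the line breaks before submitting.",
        }
        return obs, 0.0, False, False, {}


class FeedbackFilter:
    """Hypothetical wrapper that can drop explanatory sentences from feedback."""

    def __init__(self, env, keep_explanations=True):
        self.env = env
        self.keep_explanations = keep_explanations

    def _filter(self, obs):
        obs = dict(obs)  # copy so the wrapped environment's dict is not mutated
        feedback = obs.get("feedback")
        if feedback and not self.keep_explanations:
            # Keep only the first sentence as the short verdict.
            obs["feedback"] = feedback.split(". ")[0].rstrip(".") + "."
        return obs

    def reset(self):
        obs, info = self.env.reset()
        return self._filter(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._filter(obs), reward, terminated, truncated, info


def line_count_metric(poem, expected_lines=4):
    """Custom metric: 1.0 if the poem has the expected number of non-empty lines."""
    lines = [line for line in poem.splitlines() if line.strip()]
    return 1.0 if len(lines) == expected_lines else 0.0


env = FeedbackFilter(StubPoemEnv(), keep_explanations=False)
env.reset()
obs, *_ = env.step("spring rain\nsoft wind")
print(obs["feedback"])                  # The poem needs four lines.
print(line_count_metric("a\nb\nc\nd"))  # 1.0
```

A subjective metric such as a human creativity rating would replace `line_count_metric` with a function that looks up a recorded score; the wrapper and the evaluation loop would not need to change.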

Example code

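import llfbench as gym

# Environments in the benchmark are registered following
# the naming convention of llf-*
env = gym.make('llf-gridworld-v0')

done = False
cumulative_reward = 0.0

# First observation is acquired by resetting the environment
observation, info = env.reset()

while not done:
    # Observation is dict having 'observation', 'instruction', 'feedback'
    # Here we print the observation and ask the user for an action
    action = input(observation['observation'] + '\n' +
                   observation['instruction'] + '\n' +
                   observation['feedback'] + '\n' +
                   'Action: ')

    # Gridworld has a text action space, so TextWrapper is not needed
    # to parse a valid action from the input string
    observation, reward, terminated, truncated, info = env.step(action)

    # reward is never revealed to the agent; only used for evaluation
    cumulative_reward += reward

    # terminated and truncated follow the same semantics as in Gymnasium
    done = terminated or truncated

print(f'Episode reward: {cumulative_reward}')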
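
The example code concatenates the three observation fields directly, which would raise a TypeError if any of them were None. Assuming a field can be absent at some steps (this post does not say either way), a small helper like the hypothetical `format_observation` below builds the prompt more defensively. It operates on plain dicts and does not depend on LLF-Bench.

```python
def format_observation(observation):
    """Join the available text fields, skipping any that are None or empty."""
    keys = ("observation", "instruction", "feedback")
    parts = [str(observation[k]) for k in keys if observation.get(k)]
    return "\n".join(parts)


# Example with a missing instruction, as might occur after the first step.
sample = {
    "observation": "You are in room A.",
    "instruction": None,
    "feedback": "You moved away from the goal.",
}
print(format_observation(sample))
```

In the interactive loop, the prompt passed to `input` could then be written as `format_observation(observation) + '\nAction: '`.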
cumulative_reward = 0.0# First observation is acquired by resetting the environmentobservation, info = env.reset()while not done:# Observation is dict having 'observation', 'instruction', 'feedback'# Here we print the observation and ask the user for an actionaction = input( observation['observation'] + '\n' +observation['instruction'] + '\n' +observation['feedback'] + '\n' +'Action: ' )# Gridworld has a text action space, so TextWrapper is not needed# to parse a valid action from the input stringobservation, reward, terminated, truncated, info = env.step(action)# reward is never revealed to the agent; only used for evaluationcumulative_reward += reward# terminated and truncated follow the same semantics as in Gymnasiumdone = terminated or truncatedprint(f'Episode reward: {cumulative_reward}')