
smolagents Study Notes Series (10): Examples - Web Browser Automation with Agents

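This article covers Web Browser Automation with Agents from the Examples chapter of the official tutorial. It shows how to build an Agent-driven web browsing feature that also uses the vision modality, with the following capabilities:

  1. Navigate to web pages: go to a specified page;
  2. Click on elements: click objects on the page;
  3. Search within pages: search for text inside the current page;
  4. Handle popups and modals: deal with pop-up content on the page;
  5. Extract information: pull information out of the page;
  • Official link: https://huggingface.co/docs/smolagents/v1.9.2/en/examples/web_browser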

Install the following dependencies:

$ pip install smolagents selenium helium pillow -q
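
Before running the full script it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the official example: the helper name `missing_packages` and the module-to-package mapping are assumptions made for illustration. Note that `pillow` is imported as `PIL`, and that the script further down also imports `dotenv` (the `python-dotenv` package).

```python
from importlib.util import find_spec

# Import name -> pip package name. This mapping is an assumption for illustration.
REQUIRED_MODULES = {
    "smolagents": "smolagents",
    "selenium": "selenium",
    "helium": "helium",
    "PIL": "pillow",            # pillow is imported as PIL
    "dotenv": "python-dotenv",  # imported by the script further down
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of the top-level modules that cannot be found."""
    return [pip_name for module, pip_name in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up the modules and does not import them, so it does not start a browser.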

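To implement the features above, the following steps are needed:

  1. Define tools that operate on the web page, including Ctrl+F search, going back, and closing pop-ups;
  2. Configure the browser engine; the official example uses Chrome;
  3. Define the Agent and the model;
  4. Write the operating instructions (the prompt);
  5. Have the Agent execute the prompt;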

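The complete code is as follows:

[Note]: The official example uses the meta-llama/Llama-3.3-70B-Instruct model, but access to this model has to be paid for. If you change the code to use the default Qwen-Coder model, as in the earlier articles of this series, the run stops at some intermediate step, because the default free model does not accept inputs longer than 10000 tokens. Readers who are able to can purchase some credits to try the full functionality.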
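
One way to switch between the two models without editing the script is to read the model id from an environment variable. This is only a sketch, separate from the full listing: the variable name `BROWSER_AGENT_MODEL_ID` and the helper `model_kwargs` are hypothetical and do not come from the official example.

```python
import os


def model_kwargs(env_var: str = "BROWSER_AGENT_MODEL_ID") -> dict:
    """Build keyword arguments for HfApiModel from an environment variable.

    Returns {"model_id": ...} when the (hypothetical) variable is set to a
    non-empty value, otherwise {} so that the default model is used.
    """
    model_id = os.environ.get(env_var, "").strip()
    return {"model_id": model_id} if model_id else {}


# Intended use inside the script (requires smolagents to be installed):
#   from smolagents import HfApiModel
#   model = HfApiModel(**model_kwargs())
```

With the variable unset, the call is equivalent to `HfApiModel()`; with it set to `meta-llama/Llama-3.3-70B-Instruct`, the paid model is selected.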

from io import BytesIO
from time import sleep

import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

from smolagents import CodeAgent, tool
from smolagents.agents import ActionStep
from smolagents import HfApiModel

load_dotenv()

#----------------------------------------------------------------# 
# Step 1. Define the web page tools
@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
    This does not work on cookie consent banners.
    """
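    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    # Return a short message so the tool matches its declared `-> str` return type
    return "Sent ESCAPE to dismiss any open pop-up."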

#----------------------------------------------------------------# 
# Step 2. Configure the Chrome engine

# Configure Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--force-device-scale-factor=1")
chrome_options.add_argument("--window-size=1000,1350")
chrome_options.add_argument("--disable-pdf-viewer")
chrome_options.add_argument("--window-position=0,0")

# Initialize the browser
driver = helium.start_chrome(headless=False, options=chrome_options)

# Set up screenshot callback
def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = memory_step.step_number
    if driver is not None:
        for previous_memory_step in agent.memory.steps:  # Remove previous screenshots for lean processing
            if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:
                previous_memory_step.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        memory_step.observations_images = [image.copy()]  # Create a copy to ensure it persists

    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    memory_step.observations = (
        url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
    )
    
#----------------------------------------------------------------# 
# Step 3. Define the Agent

# Initialize the model
# If you have paid access to the model below, use the following two lines
# model_id = "meta-llama/Llama-3.3-70B-Instruct"
# model = HfApiModel(model_id)
# If you only have the free token, use the following line
model = HfApiModel()

# Create the agent
agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

# Import helium for the agent
agent.python_executor("from helium import *", agent.state)

#----------------------------------------------------------------# 
# Step 4. Write the operating instructions (the prompt)

helium_instructions = """
You can use helium to access websites. Don't bother about the helium driver, it's already managed.
We've already ran "from helium import *"
Then you can go to pages!
Code:
```py
go_to('github.com/trending')
```<end_code>

You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>

If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>

If you try to interact with an element and it's not found, you'll get a LookupError.
In general stop your action after each button click to see what happens on your screenshot.
Never try to login in a page.

To scroll up or down, use scroll_down or scroll_up with as an argument the number of pixels to scroll from.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>

When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>

You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>
"""


search_request = """
Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word "1992" that mentions a construction accident.
"""

#----------------------------------------------------------------# 
# Step 5. The Agent executes the prompt
agent_output = agent.run(search_request + helium_instructions)
print("Final output:")
print(agent_output)
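
One caveat about `search_item_ctrl_f` in the listing above: the search text is interpolated into the XPath inside single quotes, so a term containing an apostrophe (for example O'Hare) produces an invalid expression. A possible workaround, which is not part of the official example, is to build the XPath string literal separately. The helper names `xpath_literal` and `build_contains_xpath` below are hypothetical.

```python
def xpath_literal(text: str) -> str:
    """Return `text` as an XPath 1.0 string literal.

    XPath 1.0 has no escape character, so a string containing both quote
    kinds has to be assembled with concat().
    """
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    pieces = []
    for i, part in enumerate(text.split("'")):
        if i > 0:
            pieces.append('"\'"')  # a lone single quote, wrapped in double quotes
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


def build_contains_xpath(text: str) -> str:
    """Build the same contains(text(), ...) query the tool uses, with safe quoting."""
    return f"//*[contains(text(), {xpath_literal(text)})]"


if __name__ == "__main__":
    print(build_contains_xpath("Chicago"))
    print(build_contains_xpath("O'Hare"))
```

Inside the tool, the result of `build_contains_xpath(text)` would be passed to `driver.find_elements(By.XPATH, ...)` in place of the f-string.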

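Running with the free token gives the result below. The Agent stalls at some intermediate step, and which one is a matter of luck: sometimes the token-limit error appears right after the page opens, before any scrolling, and sometimes only after many scrolls: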

$ python demo.py
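
The token overrun is tied to how much context each step carries. The `save_screenshot` callback in the listing already limits this by clearing screenshots from steps that are two or more steps old. The rule can be checked in isolation with the sketch below. `FakeStep` is a hypothetical stand-in for `ActionStep` with only the two fields used here, and the loop is an assumed simplification of how the agent calls the callback once per step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for smolagents' ActionStep (only the fields used here)."""
    step_number: int
    observations_images: Optional[list] = None


def prune_old_screenshots(steps: List[FakeStep], current_step: int) -> None:
    """Same condition as in save_screenshot: clear images from steps <= current_step - 2."""
    for step in steps:
        if step.step_number <= current_step - 2:
            step.observations_images = None


def simulate(num_steps: int) -> List[int]:
    """Run the pruning once per step and return the step numbers that still hold an image."""
    steps: List[FakeStep] = []
    for n in range(1, num_steps + 1):
        current = FakeStep(step_number=n)
        steps.append(current)
        prune_old_screenshots(steps, n)
        current.observations_images = [f"screenshot-{n}"]  # placeholder for a PIL image
    return [s.step_number for s in steps if s.observations_images is not None]


if __name__ == "__main__":
    print(simulate(5))  # prints [4, 5]
```

At any point only the current and the previous step keep a screenshot, so the image part of the context stays bounded. The text observations and the page content still grow with each step.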
