Mobile-Agent Competition Analysis and Code Walkthrough Notes (DataWhale AI Summer Camp)


Preface

Hi, I'm GISer Liu, a GIS developer passionate about AI technology. This article is my study-note summary of the competition analysis and code walkthrough for the Mobile-Agent track, the final session of the DataWhale 2024 AI Summer Camp. I will also share some of my own ideas along the way.


This article is a summary and extension of the original video, available here: MobileAgent解析

I. Competition Background

1. What is Mobile-Agent?

Large-model agents are the future of LLM (large language model) applications. As AI luminaries Andrej Karpathy and Bill Gates have predicted, they will completely change how we interact with computers and disrupt the entire software industry.

Andrej Karpathy once said:
"If a paper proposes a different training method, our internal Slack would sneer at it as something we had already tried. But when a new AI Agents paper comes out, we discuss it seriously and excitedly. AI Agents will not only change how everyone interacts with computers; they will also upend the software industry and bring the biggest computing revolution since we moved from typing commands to clicking icons."

Bill Gates has likewise said that AI Agents hold enormous potential.

① Concept

Mobile-Agent concept diagram

So what is Mobile-Agent?

  • Multi-agent architecture: Mobile-Agent is a multi-agent architecture built to handle the hard problem of navigating long-context, interleaved image-and-text input.
  • Enhanced visual perception module: it ships with an enhanced visual perception module that raises operation accuracy, especially in visual recognition and information processing.
  • Powered by GPT-4o: backed by GPT-4o's processing power, Mobile-Agent gains significant improvements in both operation quality and speed.

You can see the framework in action in the v3 demo:

v3 demo screenshot

② Features

So how does Mobile-Agent differ from other agents?

Mobile-Agent's key features include:

  • Pure-vision approach: it relies on visual perception alone, not on system data, which gives it clear advantages in privacy and system compatibility.
  • Cross-app operation: it can operate seamlessly across multiple apps, with strong task-transfer ability.
  • Perception, planning, and reflection combined: integrating perception, planning, and reflection modules lets it understand and complete complex tasks more reliably.
  • Plug-and-play, no training required: it needs no pre-training, only simple configuration, which makes it well suited to rapid deployment.

③ Application Examples

Mobile-Agent application examples

Mobile-Agent has already been validated in several real-world scenarios, such as:

  • Checking the weather: opening a weather app and automatically retrieving and analyzing weather information.
  • Browsing and liking short videos: scrolling through short videos and liking them according to given rules.
  • Searching for a video and commenting: searching for specified content on a video platform and posting a comment.
  • Map navigation: automatically issuing and executing navigation commands, e.g. setting a destination in a map app and starting navigation.

2. Analysis of the Two Versions

Problems with V1

The early version of Mobile-Agent (V1) had the following problem:

  • Hard to track task progress: because the operation history is long and interleaves images with text, a single agent struggles to track task progress efficiently, which easily leads to wrong operations or task failure.

V1 problem illustration

Improvements in V2

V2 brought major improvements to Mobile-Agent, including:

  • Multi-agent architecture: the first use of a multi-agent architecture for mobile-device operation tasks, greatly improving execution accuracy and efficiency.
  • Still a pure-vision approach: it keeps the advantages of the pure-vision design, so the system stays independent of underlying system data and remains broadly compatible.
  • More effective progress tracking: with each agent handling its own role, task progress is tracked efficiently, and both the memory of task-relevant information and the ability to reflect on operations are strengthened.
  • Stronger instruction decomposition: improved decomposition of complex instructions enables complex operations across different apps and supports multilingual scenarios.
  • Validated in real applications: V2 has been verified on real tasks, such as ordering knife-cut noodles and hailing a ride with Amap (高德地图).

V2 demo

3. Structure of Mobile-Agent V2

Why is Mobile-Agent so capable? Here is a diagram of the V2 framework:

Mobile-Agent V2 framework diagram

Mobile-Agent V2 consists of three main agents:

  1. Planning Agent (task planning)

    • Role: responsible for overall task planning; it derives the follow-up plan from the user's instruction and the steps completed so far, answering questions like "What should I do first?", "What have I already done?", and "What should I do next?".
  2. Decision Agent (execution decisions)

    • Role: makes and executes concrete step-level decisions based on the Planning Agent's output and the current visual input (e.g. a screenshot). The Decision Agent outputs a chain of thought (a text description), an action command (Action), and key information that may be needed later (Memory).
  3. Reflection Agent (reflection)

    • Role: after the Decision Agent acts, the Reflection Agent judges whether the operation was correct. If it was, it signals success and the system proceeds to the next step; if it failed, it reports the failure details and prompts the Decision Agent to decide again. Since the Decision Agent can never reach 100% accuracy, the Reflection Agent plays a crucial part in keeping the whole system reliable.

This framework not only gives Mobile-Agent an edge on complex tasks, it also makes the system more robust and adaptable. The overall loop can be sketched as follows.
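Below is a minimal control-flow sketch of the plan-decide-reflect loop. The function names (perceive, decide, execute, reflect, plan) are hypothetical placeholders for readability, not the real run.py API:

def mobile_agent_v2(instruction):
    completed = ""   # Planning Agent's running summary of finished sub-tasks
    memory = ""      # key on-screen information saved for later steps
    error = False
    while True:
        screen = perceive()                                             # screenshot + OCR + icon captions
        action = decide(instruction, completed, memory, screen, error)  # Decision Agent
        if action == "Stop":
            break
        execute(action)                                                 # tap / swipe / type via ADB
        error = not reflect(screen, perceive(), action)                 # Reflection Agent: before vs. after
        if not error:
            completed = plan(instruction, completed, action)            # Planning Agent updates progress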

4. Understanding the Competition Tasks

This Mobile-Agent application challenge has two tracks:

  • Track 1: Design and implement a phone-side agent for a specific application scenario based on the Mobile-Agent framework
  • Track 2: Design and implement an agent for other terminal devices based on the Mobile-Agent framework

Theme: make the Agent your super assistant
Goal: explore how the Mobile-Agent framework can be applied and developed across different scenarios

① Phone-Side Agent

Key points

  • Centered on the phone: this track focuses on building innovative application scenarios on the smartphone platform.
  • Aimed at "specific" and "innovative" scenarios, so think big: entrants should pick a concrete phone-side scenario and design an intelligent, automated solution for it.
  • Keep the architecture largely intact; modify or add prompts and call other tools: on top of the existing Mobile-Agent framework, entrants can customize prompts, integrate third-party tools, and so on to extend the agent's capability and adaptability.
  • Improve inference speed: to guarantee a good user experience in practice, the agent's inference speed (i.e. execution efficiency) is an important consideration.

Task approach

Phone-side agent task approach diagram

When designing a phone-side agent, consider these steps:

  1. Pick the scenario: first decide the concrete problem to solve or service to provide, e.g. office automation, smart-home control, or health monitoring.
  2. Design the agent's functions: based on the scenario, design the agent's core functional modules, e.g. text recognition, voice interaction, or data processing.
  3. Tune the framework: adapt the Mobile-Agent framework's structure to your scenario and optimize inference speed so it runs smoothly on a mobile device.
  4. Integrate tools: as needed, bring in third-party APIs or tools, such as machine-learning models or cloud services, to strengthen the agent.

Scoring criteria

Scoring criteria diagram

Scoring will likely cover the following aspects:

  1. Innovation: whether the agent's design is novel and creative.
  2. Practicality: whether the agent runs effectively in a real scenario and solves a real problem.
  3. Technical implementation: code quality and the difficulty and sophistication of the implementation.
  4. User experience: whether inference and system response are fast enough and the interface is friendly.

② Agents for Other Terminal Devices

Key points

  • Centered on "other", non-phone devices: this track focuses on building agents for non-phone devices, such as PCs, smart-home appliances, and IoT devices.
  • Extend the framework around the chosen terminal's characteristics: extend and adapt the Mobile-Agent framework as needed to fit the specific device.
  • Modify or add prompts and call other tools: as in the phone-side track, entrants can customize prompts and integrate tools to strengthen the agent.
  • Improve inference speed: the agent's inference speed on the target device again matters for a good user experience.

Task approach

When designing an agent for another terminal device, consider these steps:

  1. Choose the target device: decide which terminal to target, e.g. a smartwatch, smart speaker, or smart TV.
  2. Analyze the device: study its hardware and software characteristics, such as compute power, input/output methods, and operating system, so the framework can be adapted.
  3. Design the agent's functions: based on the device's characteristics, design the core functional modules; a smart TV, for example, may need voice recognition and content recommendation.
  4. Optimize and test: optimize for the device's characteristics and verify through real testing that the agent runs efficiently on it.

Scoring criteria

Other-terminal agent task approach diagram

The scoring criteria are similar to the phone-side track's, focusing mainly on innovation, practicality, technical implementation, and user experience.

On top of that, the prize money is quite attractive 😀.

II. Mobile-Agent Framework Walkthrough

1. Setup and Configuration

Getting started

❗ Currently only Android and HarmonyOS (version ≤ 4) support tool-based debugging. Other systems such as iOS cannot use Mobile-Agent yet.

Install the dependencies

pip install -r requirements.txt

Connect your mobile device via ADB

  1. Download the Android Debug Bridge (ADB).
  2. Enable "USB debugging" or "ADB debugging" on your device. This usually means turning on Developer Options and switching it on there. On HyperOS you must also enable "USB debugging (Security settings)".
  3. Connect the device to your computer with a data cable and select "File transfer" in the phone's connection options.
  4. Test the connection with the following command (a small connectivity check in Python follows this list):
    /path/to/adb devices
    
    If the printed device list is not empty, the connection works.
  5. On MacOS or Linux, first make ADB executable:
    sudo chmod +x /path/to/adb
    
    /path/to/adb is xx/xx/adb.exe on Windows and xx/xx/adb on MacOS or Linux
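You can run the same check from Python as well; a small sketch, assuming adb_path points at your ADB binary:

    import subprocess
    
    adb_path = "/path/to/adb"  # adb.exe on Windows
    result = subprocess.run(f"{adb_path} devices", capture_output=True, text=True, shell=True)
    # skip the "List of devices attached" header; keep lines that end with "device"
    devices = [line for line in result.stdout.splitlines()[1:] if line.strip().endswith("device")]
    print("connected devices:", devices)  # a non-empty list means the connection works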

Install the ADB keyboard on your mobile device

  1. Download the ADB keyboard APK.
  2. Tap the APK on the device to install it.
  3. In system settings, switch the default input method to "ADB Keyboard".

Choose how to run

  1. Edit your settings starting at line 22 of run.py, entering your ADB path, instruction, GPT-4 API URL, and token.

  2. Choose how the icon-caption model is called, whichever suits your device:

    • If your machine has a high-performance GPU, the "local" method, i.e. deploying the icon-caption model locally, is recommended and usually more efficient.
    • If your machine cannot run a 7B-scale LLM, choose the "api" method. Calls are parallelized to keep it efficient.
  3. Choose the icon-caption model:

    • With the "local" method, choose between "qwen-vl-chat" and "qwen-vl-chat-int4". "qwen-vl-chat" needs more GPU memory but outperforms "qwen-vl-chat-int4". In this case "qwen_api" can be left empty.
    • With the "api" method, choose between "qwen-vl-plus" and "qwen-vl-max". "qwen-vl-max" costs more but outperforms "qwen-vl-plus". You also need to apply for a Qwen-VL API key and enter it as "qwen_api".
  4. You can add operation knowledge in "add_info" (e.g. the specific steps required to complete your instruction) to help the agent operate the device more accurately.

  5. To squeeze out more efficiency, you can set "reflection_switch" and "memory_switch" to "False" (a settings sketch follows this list):

    • "reflection_switch" controls whether the "Reflection Agent" joins the loop. Disabling it may let the run fall into a dead loop, though you can mitigate that by adding operation knowledge to "add_info".
    • "memory_switch" controls whether the "memory unit" joins the loop. If your instruction never needs information from earlier screens in later steps, you can turn it off.
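For reference, a hypothetical run.py settings block using these switches (all values are illustrative placeholders):

    # run.py settings, around line 22; values here are placeholders
    adb_path = "/path/to/adb"
    instruction = "Open Settings and turn on Bluetooth."
    add_info = "If a permission dialog appears, tap Allow first."  # operation knowledge
    reflection_switch = True  # False skips the Reflection Agent: faster, but riskier
    memory_switch = True      # False if no earlier screen info is needed in later steps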

Run

python run.py
2. run.py Main Script Analysis
  • The project's main goal is to read, analyze, and operate on a mobile device's screen content by combining multimodal large models with image-processing techniques
  • It talks to the device through the Android Debug Bridge (ADB), grabs screenshots, runs various models for image recognition, text recognition, and operation decisions, and finally executes the user's instruction

The project is made up of several modules, each with a specific role. Here is the module breakdown with the corresponding code analysis:

① Environment setup and initialization
  • Purpose: configure the ADB path, the user instruction, the model selection, and how the API is called.
  • Code
    # Your ADB path
    adb_path = "C:/Users/<username>/AppData/Local/Android/Sdk/platform-tools/adb.exe"
    
    # Your instruction
    instruction = "Read the Screen, tell me what day it is today. Then open Play Store."
    
    # Choose between "api" and "local". api: use the qwen api. local: use the local qwen checkpoint
    caption_call_method = "api"
    
    # Choose between "qwen-vl-plus" and "qwen-vl-max" if use api method. Choose between "qwen-vl-chat" and "qwen-vl-chat-int4" if use local method.
    caption_model = "qwen-vl-plus"
    
    # If you choose the api caption call method, input your Qwen api here
    qwen_api = "<your api key>"
    
    # Other settings...
  • Approach: before anything runs, the script initializes the runtime environment by setting the ADB path, the user instruction, the API call method, and the chosen models.
② Chat history initialization
  • Purpose: initialize the different conversation histories (operation, reflection, and memory histories) used in later interactions.

  • Code

    def init_action_chat():
        operation_history = []
        sysetm_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user's instruction."
        operation_history.append({'role': 'system', 'content': [{'text': sysetm_prompt}]})
        return operation_history
    
  • Approach: separate initialization functions build the operation, reflection, and memory conversation histories respectively, so each stage can reuse its own record when generating decisions.

③ Image processing and information extraction
  • Purpose: capture the phone screen, run OCR, detect icons, and process coordinates.

  • Code

    def get_perception_infos(adb_path, screenshot_file):
        get_screenshot(adb_path)
        width, height = Image.open(screenshot_file).size
        text, coordinates = ocr(screenshot_file, ocr_detection, ocr_recognition)
        text, coordinates = merge_text_blocks(text, coordinates)
        center_list = [[(coordinate[0]+coordinate[2])/2, (coordinate[1]+coordinate[3])/2] for coordinate in coordinates]
        draw_coordinates_on_image(screenshot_file, center_list)
        perception_infos = []
        for i in range(len(coordinates)):
            perception_info = {"text": "text: " + text[i], "coordinates": coordinates[i]}
            perception_infos.append(perception_info)
        # Detect icons...
        # Add icon descriptions to perception_infos...
        return perception_infos, width, height
    
  • Approach: this module pulls the useful information, text and icons, out of the phone screenshot and turns it into the input for the subsequent operations.

④ Model loading and inference
  • Purpose: load and initialize the required deep-learning models and process the user's instruction.

  • Code

    device = "cpu"
    torch.manual_seed(1234)
    if caption_call_method == "local":# Load local models...
    elif caption_call_method == "api":# Use API for models...
    
  • Approach: depending on the user's choice, the project loads either local models or API-backed ones for image captioning, text recognition, and icon detection. Swapping models and APIs lets it fit different scenarios and hardware.

⑤ Action execution
  • Purpose: execute the corresponding phone operation (tap, swipe, back, and so on) according to the action the model outputs.

  • Code

    if "Open app" in action:# Open a specific app...
    elif "Tap" in action:# Tap on a specific coordinate...
    elif "Swipe" in action:# Swipe from one coordinate to another...
    elif "Type" in action:# Type text...
    elif "Back" in action:back(adb_path)
    elif "Home" in action:home(adb_path)
    elif "Stop" in action:break
    
  • Approach: this is the core logic of the project. It executes the phone operation matching the decided action in order to carry out the user's instruction; a sketch of how an action string can be parsed into a concrete call follows.
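For example, a "Tap (x, y)" action has to be turned into coordinates before tap() can run. A minimal parsing sketch (the real run.py parsing may differ in detail):

    import re
    
    def parse_tap(action):
        # pull integer coordinates out of a string like "Tap (540, 1200)"
        match = re.search(r"Tap \((\d+),\s*(\d+)\)", action)
        return (int(match.group(1)), int(match.group(2))) if match else None
    
    coords = parse_tap("Tap (540, 1200)")
    if coords:
        tap(adb_path, *coords)  # tap() comes from MobileAgent/controller.py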

⑥ Reflection and memory modules
  • Purpose: reflect on the previous operation's result to adjust the next step's strategy, and store valuable information in memory.

  • Code

    if reflection_switch:
        prompt_reflect = get_reflect_prompt(...)
        chat_reflect = init_reflect_chat()
        chat_reflect = add_response_two_image("user", prompt_reflect, chat_reflect, [last_screenshot_file, screenshot_file])
        # note: the original passes chat_action here, which may be a typo for chat_reflect
        output_reflect = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
        reflect = output_reflect.split("### Answer ###")[-1].replace("\n", " ").strip()
        chat_reflect = add_response("system", output_reflect, chat_reflect)
        if 'A' in reflect:
            thought_history.append(thought)
            summary_history.append(summary)
            action_history.append(action)
        # Other conditions...
    
  • Approach: through the reflection module, the system judges from the previous result whether the strategy needs adjusting, and stores important information in the memory module for later steps. The three possible reflection answers are sketched below.
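The "# Other conditions..." line above elides the handling of answers B and C from the reflection prompt (wrong page, or no change). A hedged sketch of how those cases can be handled, reusing the names from the snippet above:

    if 'A' in reflect:        # the operation met expectations: keep the step in history
        thought_history.append(thought)
        summary_history.append(summary)
        action_history.append(action)
    elif 'B' in reflect:      # wrong page: go back and let the Decision Agent retry
        error_flag = True
        back(adb_path)
    elif 'C' in reflect:      # nothing changed on screen: flag the error and retry
        error_flag = True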

⑦ Main loop and termination
  • Purpose: run multiple rounds of operations in a main loop and stop when a termination condition is met.

  • Code

    while True:
        iter += 1
        # First iteration...
        # Action decision...
        # Memory update...
        # Reflection...
        if "Stop" in action:
            break
        time.sleep(5)
    
  • Approach: the project runs in a loop until the task finishes or a termination condition triggers. Each round updates the operation from the latest screenshot and the user instruction, reflecting and adjusting strategy when appropriate.

⑧ Task summary
  • Purpose: summarize what has been completed, extracting the key content to confirm the goal was reached.

  • Code

    completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()
    
  • Approach: by summarizing the completed work, this part verifies the execution result and confirms the user's goal was met.


3. api.py Analysis

The original code is as follows:

import base64
import requests


# read an image file and return its base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# send the chat history to an OpenAI-style endpoint and retry until a reply arrives
def inference_chat(chat, model, api_url, token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"
    }
    data = {
        "model": model,
        "messages": [],
        "max_tokens": 2048,
        'temperature': 0.0,
        "seed": 1234
    }
    for role, content in chat:
        data["messages"].append({"role": role, "content": content})
    while True:
        try:
            res = requests.post(api_url, headers=headers, json=data)
            res_json = res.json()
            res_content = res_json['choices'][0]['message']['content']
        except:
            print("Network Error:")
            try:
                print(res.json())
            except:
                print("Request Failed")
        else:
            break
    return res_content
4. chat.py Analysis

The source code is as follows:

import copy
from MobileAgent.api import encode_image


# each init_*_chat() builds a fresh history seeded with a system prompt
def init_action_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user\'s instruction."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


def init_reflect_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


def init_memory_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


# append a turn (text plus an optional screenshot) without mutating the old history
def add_response(role, prompt, chat_history, image=None):
    new_chat_history = copy.deepcopy(chat_history)
    if image:
        base64_image = encode_image(image)
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
        ]
    else:
        content = [
            {"type": "text", "text": prompt},
        ]
    new_chat_history.append([role, content])
    return new_chat_history


# same as add_response, but attaches two screenshots (before/after an operation)
def add_response_two_image(role, prompt, chat_history, image):
    new_chat_history = copy.deepcopy(chat_history)
    base64_image1 = encode_image(image[0])
    base64_image2 = encode_image(image[1])
    content = [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image2}"}},
    ]
    new_chat_history.append([role, content])
    return new_chat_history


# print the history, replacing image payloads with "<image>" placeholders
def print_status(chat_history):
    print("*"*100)
    for chat in chat_history:
        print("role:", chat[0])
        print(chat[1][0]["text"] + "<image>"*(len(chat[1])-1) + "\n")
    print("*"*100)
5. controller.py Analysis

The source code is as follows:

import os
import time
import subprocess
from PIL import Image


# capture the screen via ADB, pull it to ./screenshot, and convert PNG -> JPEG
def get_screenshot(adb_path):
    command = adb_path + " shell rm /sdcard/screenshot.png"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    time.sleep(0.5)
    command = adb_path + " shell screencap -p /sdcard/screenshot.png"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    time.sleep(0.5)
    command = adb_path + " pull /sdcard/screenshot.png ./screenshot"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    image_path = "./screenshot/screenshot.png"
    save_path = "./screenshot/screenshot.jpg"
    image = Image.open(image_path)
    image.convert("RGB").save(save_path, "JPEG")
    os.remove(image_path)


def tap(adb_path, x, y):
    command = adb_path + f" shell input tap {x} {y}"
    subprocess.run(command, capture_output=True, text=True, shell=True)


# type text character by character; plain ASCII goes through "input text",
# anything else is broadcast to the ADB Keyboard app ("_" stands in for Enter)
def type(adb_path, text):
    text = text.replace("\\n", "_").replace("\n", "_")
    for char in text:
        if char == ' ':
            command = adb_path + f" shell input text %s"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif char == '_':
            command = adb_path + f" shell input keyevent 66"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif 'a' <= char <= 'z' or 'A' <= char <= 'Z' or char.isdigit():
            command = adb_path + f" shell input text {char}"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif char in '-.,!?@\'°/:;()':
            command = adb_path + f" shell input text \"{char}\""
            subprocess.run(command, capture_output=True, text=True, shell=True)
        else:
            command = adb_path + f" shell am broadcast -a ADB_INPUT_TEXT --es msg \"{char}\""
            subprocess.run(command, capture_output=True, text=True, shell=True)


def slide(adb_path, x1, y1, x2, y2):
    command = adb_path + f" shell input swipe {x1} {y1} {x2} {y2} 500"
    subprocess.run(command, capture_output=True, text=True, shell=True)


def back(adb_path):
    command = adb_path + f" shell input keyevent 4"
    subprocess.run(command, capture_output=True, text=True, shell=True)


def home(adb_path):
    command = adb_path + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME"
    subprocess.run(command, capture_output=True, text=True, shell=True)
6. crop.py Analysis

The source code is as follows:

import math
import cv2
import numpy as np
from PIL import Image, ImageDraw
import clip
import torch


# rectify a quadrilateral text region into an axis-aligned crop via a perspective transform
def crop_image(img, position):
    def distance(x1, y1, x2, y2):
        return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))

    position = position.tolist()
    for i in range(4):
        for j in range(i+1, 4):
            if position[i][0] > position[j][0]:
                tmp = position[j]
                position[j] = position[i]
                position[i] = tmp
    if position[0][1] > position[1][1]:
        tmp = position[0]
        position[0] = position[1]
        position[1] = tmp
    if position[2][1] > position[3][1]:
        tmp = position[2]
        position[2] = position[3]
        position[3] = tmp

    x1, y1 = position[0][0], position[0][1]
    x2, y2 = position[2][0], position[2][1]
    x3, y3 = position[3][0], position[3][1]
    x4, y4 = position[1][0], position[1][1]

    corners = np.zeros((4, 2), np.float32)
    corners[0] = [x1, y1]
    corners[1] = [x2, y2]
    corners[2] = [x4, y4]
    corners[3] = [x3, y3]

    img_width = distance((x1+x4)/2, (y1+y4)/2, (x2+x3)/2, (y2+y3)/2)
    img_height = distance((x1+x2)/2, (y1+y2)/2, (x4+x3)/2, (y4+y3)/2)

    corners_trans = np.zeros((4, 2), np.float32)
    corners_trans[0] = [0, 0]
    corners_trans[1] = [img_width - 1, 0]
    corners_trans[2] = [0, img_height - 1]
    corners_trans[3] = [img_width - 1, img_height - 1]

    transform = cv2.getPerspectiveTransform(corners, corners_trans)
    dst = cv2.warpPerspective(img, transform, (int(img_width), int(img_height)))
    return dst


def calculate_size(box):
    return (box[2]-box[0]) * (box[3]-box[1])


# standard intersection-over-union between two [x1, y1, x2, y2] boxes
def calculate_iou(box1, box2):
    xA = max(box1[0], box2[0])
    yA = max(box1[1], box2[1])
    xB = min(box1[2], box2[2])
    yB = min(box1[3], box2[3])
    interArea = max(0, xB - xA) * max(0, yB - yA)
    box1Area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2Area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    unionArea = box1Area + box2Area - interArea
    iou = interArea / unionArea
    return iou


def crop(image, box, i, text_data=None):
    image = Image.open(image)
    if text_data:
        draw = ImageDraw.Draw(image)
        draw.rectangle(((text_data[0], text_data[1]), (text_data[2], text_data[3])), outline="red", width=5)
        # font_size = int((text_data[3] - text_data[1])*0.75)
        # font = ImageFont.truetype("arial.ttf", font_size)
        # draw.text((text_data[0]+5, text_data[1]+5), str(i), font=font, fill="red")
    cropped_image = image.crop(box)
    cropped_image.save(f"./temp/{i}.jpg")


def in_box(box, target):
    if (box[0] > target[0]) and (box[1] > target[1]) and (box[2] < target[2]) and (box[3] < target[3]):
        return True
    else:
        return False


# crop the box only if it falls inside the requested screen region ("left", "top right", ...)
def crop_for_clip(image, box, i, position):
    image = Image.open(image)
    w, h = image.size
    if position == "left":
        bound = [0, 0, w/2, h]
    elif position == "right":
        bound = [w/2, 0, w, h]
    elif position == "top":
        bound = [0, 0, w, h/2]
    elif position == "bottom":
        bound = [0, h/2, w, h]
    elif position == "top left":
        bound = [0, 0, w/2, h/2]
    elif position == "top right":
        bound = [w/2, 0, w, h/2]
    elif position == "bottom left":
        bound = [0, h/2, w/2, h]
    elif position == "bottom right":
        bound = [w/2, h/2, w, h]
    else:
        bound = [0, 0, w, h]
    if in_box(box, bound):
        cropped_image = image.crop(box)
        cropped_image.save(f"./temp/{i}.jpg")
        return True
    else:
        return False


# score each icon crop against a text prompt with CLIP and return the best match's index
def clip_for_icon(clip_model, clip_preprocess, images, prompt):
    image_features = []
    for image_file in images:
        image = clip_preprocess(Image.open(image_file)).unsqueeze(0).to(next(clip_model.parameters()).device)
        image_feature = clip_model.encode_image(image)
        image_features.append(image_feature)
    image_features = torch.cat(image_features)
    text = clip.tokenize([prompt]).to(next(clip_model.parameters()).device)
    text_features = clip_model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=0).squeeze(0)
    _, max_pos = torch.max(similarity, dim=0)
    pos = max_pos.item()
    return pos
7. icon_localization.py Analysis

The source code is as follows:

from MobileAgent.crop import calculate_size, calculate_iou
from PIL import Image
import torch


# drop boxes that are too large (over 5% of the screen) or that overlap an
# earlier box beyond the IoU threshold
def remove_boxes(boxes_filt, size, iou_threshold=0.5):
    boxes_to_remove = set()
    for i in range(len(boxes_filt)):
        if calculate_size(boxes_filt[i]) > 0.05*size[0]*size[1]:
            boxes_to_remove.add(i)
        for j in range(len(boxes_filt)):
            if calculate_size(boxes_filt[j]) > 0.05*size[0]*size[1]:
                boxes_to_remove.add(j)
            if i == j:
                continue
            if i in boxes_to_remove or j in boxes_to_remove:
                continue
            iou = calculate_iou(boxes_filt[i], boxes_filt[j])
            if iou >= iou_threshold:
                boxes_to_remove.add(j)
    boxes_filt = [box for idx, box in enumerate(boxes_filt) if idx not in boxes_to_remove]
    return boxes_filt


# detect icons matching `caption` with GroundingDINO and return pixel-space boxes
def det(input_image_path, caption, groundingdino_model, box_threshold=0.05, text_threshold=0.5):
    image = Image.open(input_image_path)
    size = image.size
    caption = caption.lower()
    caption = caption.strip()
    if not caption.endswith('.'):
        caption = caption + '.'
    inputs = {
        'IMAGE_PATH': input_image_path,
        'TEXT_PROMPT': caption,
        'BOX_TRESHOLD': box_threshold,
        'TEXT_TRESHOLD': text_threshold
    }
    result = groundingdino_model(inputs)
    boxes_filt = result['boxes']
    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2   # center/size -> top-left corner
        boxes_filt[i][2:] += boxes_filt[i][:2]       # width/height -> bottom-right corner
    boxes_filt = boxes_filt.cpu().int().tolist()
    filtered_boxes = remove_boxes(boxes_filt, size)  # [:9]
    coordinates = []
    for box in filtered_boxes:
        coordinates.append([box[0], box[1], box[2], box[3]])
    return coordinates
8. prompt.py Analysis

The source code is as follows:

def get_action_prompt(instruction, clickable_infos, width, height, keyboard, summary_history, action_history, last_summary, last_action, add_info, error_flag, completed_content, memory):
    prompt = "### Background ###\n"
    prompt += f"This image is a phone screenshot. Its width is {width} pixels and its height is {height} pixels. The user\'s instruction is: {instruction}.\n\n"

    prompt += "### Screenshot information ###\n"
    prompt += "In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. "
    prompt += "This information consists of two parts: coordinates; content. "
    prompt += "The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively. "
    prompt += "The information is as follow:\n"
    for clickable_info in clickable_infos:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Please note that this information is not necessarily accurate. You need to combine the screenshot to understand."
    prompt += "\n\n"

    prompt += "### Keyboard status ###\n"
    prompt += "We extract the keyboard status of the current screenshot and it is whether the keyboard of the current screenshot is activated.\n"
    prompt += "The keyboard status is as follow:\n"
    if keyboard:
        prompt += "The keyboard has been activated and you can type."
    else:
        prompt += "The keyboard has not been activated and you can\'t type."
    prompt += "\n\n"

    if add_info != "":
        prompt += "### Hint ###\n"
        prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
        prompt += add_info
        prompt += "\n\n"

    if len(action_history) > 0:
        prompt += "### History operations ###\n"
        prompt += "Before reaching this page, some operations have been completed. You need to refer to the completed operations to decide the next operation. These operations are as follow:\n"
        for i in range(len(action_history)):
            prompt += f"Step-{i+1}: [Operation: " + summary_history[i].split(" to ")[0].strip() + "; Action: " + action_history[i] + "]\n"
        prompt += "\n"

    if completed_content != "":
        prompt += "### Progress ###\n"
        prompt += "After completing the history operations, you have the following thoughts about the progress of user\'s instruction completion:\n"
        prompt += "Completed contents:\n" + completed_content + "\n\n"

    if memory != "":
        prompt += "### Memory ###\n"
        prompt += "During the operations, you record the following contents on the screenshot for use in subsequent operations:\n"
        prompt += "Memory:\n" + memory + "\n"

    if error_flag:
        prompt += "### Last operation ###\n"
        prompt += f"You previously wanted to perform the operation \"{last_summary}\" on this page and executed the Action \"{last_action}\". But you find that this operation does not meet your expectation. You need to reflect and revise your operation this time."
        prompt += "\n\n"

    prompt += "### Response requirements ###\n"
    prompt += "Now you need to combine all of the above to perform just one action on the current page. You must choose one of the six actions below:\n"
    prompt += "Open app (app name): If the current page is desktop, you can use this action to open the app named \"app name\" on the desktop.\n"
    prompt += "Tap (x, y): Tap the position (x, y) in current page.\n"
    prompt += "Swipe (x1, y1), (x2, y2): Swipe from position (x1, y1) to position (x2, y2).\n"
    if keyboard:
        prompt += "Type (text): Type the \"text\" in the input box.\n"
    else:
        prompt += "Unable to Type. You cannot use the action \"Type\" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.\n"
    prompt += "Home: Return to home page.\n"
    prompt += "Stop: If you think all the requirements of user\'s instruction have been completed and no further operation is required, you can choose this action to terminate the operation process."
    prompt += "\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output consists of the following three parts:\n"
    prompt += "### Thought ###\nThink about the requirements that have been completed in previous operations and the requirements that need to be completed in the next one operation.\n"
    prompt += "### Action ###\nYou can only choose one from the six actions above. Make sure that the coordinates or text in the \"()\".\n"
    prompt += "### Operation ###\nPlease generate a brief natural language description for the operation in Action based on your Thought."

    return prompt


def get_reflect_prompt(instruction, clickable_infos1, clickable_infos2, width, height, keyboard1, keyboard2, summary, action, add_info):
    prompt = f"These images are two phone screenshots before and after an operation. Their widths are {width} pixels and their heights are {height} pixels.\n\n"
    prompt += "In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. "
    prompt += "The information consists of two parts, consisting of format: coordinates; content. "
    prompt += "The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively "
    prompt += "The keyboard status is whether the keyboard of the current page is activated."
    prompt += "\n\n"

    prompt += "### Before the current operation ###\n"
    prompt += "Screenshot information:\n"
    for clickable_info in clickable_infos1:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Keyboard status:\n"
    if keyboard1:
        prompt += f"The keyboard has been activated."
    else:
        prompt += "The keyboard has not been activated."
    prompt += "\n\n"

    prompt += "### After the current operation ###\n"
    prompt += "Screenshot information:\n"
    for clickable_info in clickable_infos2:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Keyboard status:\n"
    if keyboard2:
        prompt += f"The keyboard has been activated."
    else:
        prompt += "The keyboard has not been activated."
    prompt += "\n\n"

    prompt += "### Current operation ###\n"
    prompt += f"The user\'s instruction is: {instruction}. You also need to note the following requirements: {add_info}. In the process of completing the requirements of instruction, an operation is performed on the phone. Below are the details of this operation:\n"
    prompt += "Operation thought: " + summary.split(" to ")[0].strip() + "\n"
    prompt += "Operation action: " + action
    prompt += "\n\n"

    prompt += "### Response requirements ###\n"
    prompt += "Now you need to output the following content based on the screenshots before and after the current operation:\n"
    prompt += "Whether the result of the \"Operation action\" meets your expectation of \"Operation thought\"?\n"
    prompt += "A: The result of the \"Operation action\" meets my expectation of \"Operation thought\".\n"
    prompt += "B: The \"Operation action\" results in a wrong page and I need to return to the previous page.\n"
    prompt += "C: The \"Operation action\" produces no changes."
    prompt += "\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output format is:\n"
    prompt += "### Thought ###\nYour thought about the question\n"
    prompt += "### Answer ###\nA or B or C"

    return prompt


def get_memory_prompt(insight):
    if insight != "":
        prompt  = "### Important content ###\n"
        prompt += insight
        prompt += "\n\n"
        prompt += "### Response requirements ###\n"
        prompt += "Please think about whether there is any content closely related to ### Important content ### on the current page? If there is, please output the content. If not, please output \"None\".\n\n"
    else:
        prompt  = "### Response requirements ###\n"
        prompt += "Please think about whether there is any content closely related to user\'s instrcution on the current page? If there is, please output the content. If not, please output \"None\".\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output format is:\n"
    prompt += "### Important content ###\nThe content or None. Please do not repeatedly output the information in ### Memory ###."

    return prompt


def get_process_prompt(instruction, thought_history, summary_history, action_history, completed_content, add_info):
    prompt = "### Background ###\n"
    prompt += f"There is an user\'s instruction which is: {instruction}. You are a mobile phone operating assistant and are operating the user\'s mobile phone.\n\n"

    if add_info != "":
        prompt += "### Hint ###\n"
        prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
        prompt += add_info
        prompt += "\n\n"

    if len(thought_history) > 1:
        prompt += "### History operations ###\n"
        prompt += "To complete the requirements of user\'s instruction, you have performed a series of operations. These operations are as follow:\n"
        for i in range(len(summary_history)):
            operation = summary_history[i].split(" to ")[0].strip()
            prompt += f"Step-{i+1}: [Operation thought: " + operation + "; Operation action: " + action_history[i] + "]\n"
        prompt += "\n"

        prompt += "### Progress thinking ###\n"
        prompt += "After completing the history operations, you have the following thoughts about the progress of user\'s instruction completion:\n"
        prompt += "Completed contents:\n" + completed_content + "\n\n"

        prompt += "### Response requirements ###\n"
        prompt += "Now you need to update the \"Completed contents\". Completed contents is a general summary of the current contents that have been completed based on the ### History operations ###.\n\n"

        prompt += "### Output format ###\n"
        prompt += "Your output format is:\n"
        prompt += "### Completed contents ###\nUpdated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### History operations ###."
    else:
        prompt += "### Current operation ###\n"
        prompt += "To complete the requirements of user\'s instruction, you have performed an operation. Your operation thought and action of this operation are as follows:\n"
        prompt += f"Operation thought: {thought_history[-1]}\n"
        operation = summary_history[-1].split(" to ")[0].strip()
        prompt += f"Operation action: {operation}\n\n"

        prompt += "### Response requirements ###\n"
        prompt += "Now you need to combine all of the above to generate the \"Completed contents\".\n"
        prompt += "Completed contents is a general summary of the current contents that have been completed. You need to first focus on the requirements of user\'s instruction, and then summarize the contents that have been completed.\n\n"

        prompt += "### Output format ###\n"
        prompt += "Your output format is:\n"
        prompt += "### Completed contents ###\nGenerated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### Current operation ###.\n"
        prompt += "(Please use English to output)"

    return prompt
9. text_localization.py Analysis

The source code is as follows:

import cv2
import numpy as np
from MobileAgent.crop import crop_image


# sort the four corner points of a detected text polygon into a stable order
def order_point(coor):
    arr = np.array(coor).reshape([4, 2])
    sum_ = np.sum(arr, 0)
    centroid = sum_ / arr.shape[0]
    theta = np.arctan2(arr[:, 1] - centroid[1], arr[:, 0] - centroid[0])
    sort_points = arr[np.argsort(theta)]
    sort_points = sort_points.reshape([4, -1])
    if sort_points[0][0] > centroid[0]:
        sort_points = np.concatenate([sort_points[3:], sort_points[:3]])
    sort_points = sort_points.reshape([4, 2]).astype('float32')
    return sort_points


# despite the name, the DP below (taking max on the else branch) computes the
# longest common *subsequence* length rather than a contiguous substring
def longest_common_substring_length(str1, str2):
    m = len(str1)
    n = len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


# detect text regions, crop each one, recognize its text, and return the
# recognized texts together with their [x1, y1, x2, y2] boxes
def ocr(image_path, ocr_detection, ocr_recognition):
    text_data = []
    coordinate = []
    image_full = cv2.imread(image_path)
    det_result = ocr_detection(image_full)
    det_result = det_result['polygons']
    for i in range(det_result.shape[0]):
        pts = order_point(det_result[i])
        image_crop = crop_image(image_full, pts)
        try:
            result = ocr_recognition(image_crop)['text'][0]
        except:
            continue
        box = [int(e) for e in list(pts.reshape(-1))]
        box = [box[0], box[1], box[4], box[5]]
        text_data.append(result)
        coordinate.append(box)
    else:
        return text_data, coordinate

III. Optimization and Improvement Strategies

  • Icon localization (icon_localization.py) and text localization (text_localization.py) rely on mature detection and OCR techniques, so they leave little room for improvement, while the Prompt and the Controller (i.e. expanding the action space) leave plenty. Our optimization strategies are therefore as follows:
1. Expanding the Action Space

Mobile-Agent v2's original action space is defined in get_action_prompt (see the prompt.py walkthrough above); the relevant excerpt:

prompt += "### Response requirements ###\n"
prompt += "Now you need to combine all of the above to perform just one action on the current page. You must choose one of the six actions below:\n"
prompt += "Open app (app name): If the current page is desktop, you can use this action to open the app named \"app name\" on the desktop.\n"
prompt += "Tap (x, y): Tap the position (x, y) in current page.\n"
prompt += "Swipe (x1, y1), (x2, y2): Swipe from position (x1, y1) to position (x2, y2).\n"
if keyboard:
    prompt += "Type (text): Type the \"text\" in the input box.\n"
else:
    prompt += "Unable to Type. You cannot use the action \"Type\" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.\n"
prompt += "Home: Return to home page.\n"
prompt += "Stop: If you think all the requirements of user\'s instruction have been completed and no further operation is required, you can choose this action to terminate the operation process."

Actions worth adding include:

  • Long press: LongTap (x, y, t): tap and hold position (x, y) for t seconds on the current page.
  • Extended tap: Tap_scale (x1, y1, x2, y2): touch down at position (x1, y1) and lift at position (x2, y2) on the current page.
  • Composite actions built from these atomic ones, e.g. DoubleTap (Tap + Tap), Search (Tap + Type), and so on; a controller sketch follows this list.
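A hedged controller-side sketch of two such actions, written in the style of MobileAgent/controller.py (the function names long_tap and double_tap are suggestions; note that ADB's input swipe takes a duration in milliseconds, so seconds are converted):

import subprocess

def long_tap(adb_path, x, y, t=1):
    # ADB has no dedicated long-press event; a zero-distance swipe held for
    # the given duration is the usual trick
    command = adb_path + f" shell input swipe {x} {y} {x} {y} {int(t * 1000)}"
    subprocess.run(command, capture_output=True, text=True, shell=True)

def double_tap(adb_path, x, y):
    # composite action: two taps in quick succession
    for _ in range(2):
        command = adb_path + f" shell input tap {x} {y}"
        subprocess.run(command, capture_output=True, text=True, shell=True)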
2. Modifying the Prompt

Mobile-Agent v2 supports injecting extra knowledge or information, for example via the Hint section built from add_info (from Mobile-Agent-v2's prompt code, invoked in run.py):

if add_info != "":
    prompt += "### Hint ###\n"
    prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
    prompt += add_info
    prompt += "\n\n"

(The full get_process_prompt function appears in the prompt.py walkthrough above.)

For a specific task, you can put manual-style instructions into add_info so that the key steps are emphasized, e.g. for the card game Dou Dizhu (斗地主) shown below:

Dou Dizhu example screenshot
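A hypothetical add_info for this example (the wording is purely illustrative):

add_info = (
    "You are playing the card game Dou Dizhu. "
    "Your cards are at the bottom of the screen; tap a card to select it. "
    "Tap the Play button to play the selected cards, or the Pass button to skip this turn."
)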

3. Adapting the Prompt to a Specific App

Rewrite the prompt to be more application-specific, starting from code like this:

def get_action_prompt(instruction, clickable_infos, width, height, keyboard, summary_history, action_history, last_summary, last_action, add_info, error_flag, completed_content, memory):
    prompt = "### Background ###\n"
    prompt += f"This image is a phone screenshot. Its width is {width} pixels and its height is {height} pixels. The user\'s instruction is: {instruction}.\n\n"
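For instance, the generic background could be specialized for a single app. A hedged illustration (the wording is an assumption, not the project's own prompt):

prompt = "### Background ###\n"
prompt += (f"This image is a screenshot of the Amap (高德地图) app. "
           f"Its width is {width} pixels and its height is {height} pixels. "
           f"The user\'s instruction is: {instruction}.\n\n")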
4. Other Approaches
  • Integrate external tools, such as RAG or tools better suited to UI detection
  • Handle ads and pop-ups more robustly

IV. PC-Agent Framework Walkthrough

Project repository: PC_Agent

1. Overview

Introduction
PC-Agent is a multi-agent collaboration system that can automatically control computer software (such as Chrome, Word, and WeChat) from user instructions. Its visual perception module, designed for high-resolution screens, suits the PC platform better. Through a "planning-decision-reflection" framework, PC-Agent raises the success rate of its operations.

🔧 Quick Start

Installation
PC-Agent supports MacOS and Windows.

On MacOS:

pip install -r requirements.txt

On Windows:

pip install -r requirements_win.txt

Test it on your computer
Run run.py with your instruction and your GPT-4o API token. For example:

python run.py --instruction="Create a new doc on Word, write a brief introduction of Alibaba, and save the document." --api_token='Your GPT-4o API token.'

You can add task-specific operation knowledge via the --add_info option to help PC-Agent operate more accurately.

To further improve PC-Agent's efficiency, you can pass --disable_reflection to skip the reflection step, though note this may lower the success rate.

2. Framework

PC-Agent framework diagram

The PC-Agent framework builds on Mobile-Agent v2, with a number of changes to fit the characteristics of the PC platform. Each module is explained below:

① Adapting the Debugging Tools

Mobile-Agent v2 uses the Android Debug Bridge (ADB) to control and debug mobile devices. In PC-Agent the target moves from a phone to a PC, so the tooling changes accordingly: ADB is replaced by the Python libraries PyAutoGUI and Pyperclip.

  • PyAutoGUI: simulates mouse and keyboard operations, including clicking, dragging, and typing, which is enough for basic control of most PC software.
  • Pyperclip: handles clipboard content, supporting copy and paste of text.

With this adaptation, PC-Agent can perform on a PC the same kind of automated operations it performs on a mobile device; a small sketch of these primitives follows.
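A minimal sketch of the PC-side primitives (pyautogui and pyperclip are real libraries; the coordinates and text here are placeholders):

import pyautogui
import pyperclip

pyautogui.click(640, 360)                  # single left click
pyautogui.doubleClick(640, 360)            # double click
pyautogui.rightClick(640, 360)             # right click
pyautogui.moveTo(640, 360)
pyautogui.dragTo(800, 500, duration=0.5)   # drag from the current position

pyperclip.copy("Hello from PC-Agent")      # put text on the clipboard
pyautogui.hotkey("ctrl", "v")              # paste into the focused input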

② Adapting the Visual Perception Module

PC-Agent's visual perception module is optimized for high-resolution PC screens. Compared with a phone, a PC screen has higher resolution and richer detail, so it demands more thorough visual perception.

  • Image partitioning: the screenshot is first split up so that the system can focus on the content of specific regions.
  • OCR (Optical Character Recognition): recognizes on-screen text. PC-Agent relies on OCR to parse information on screen, such as document text or button labels.
  • SAM (Segment Anything Model): performs semantic segmentation of the image, helping the system understand the meaning and position of different on-screen elements and thereby improving interaction accuracy.

Combining these tools, PC-Agent can precisely identify and operate all kinds of elements in a complex PC environment.

③ Adapting the Action Space

PC input devices (mouse and keyboard) support more complex operations than a phone's touchscreen, so the action space must be adapted to their characteristics.

Action space diagram

  • Mouse operations: single click, double click, triple click, right click, drag, and other complex operations, enough for full control of PC applications.
  • Keyboard input: beyond ordinary text entry, this includes many keyboard shortcuts such as Ctrl+C and Ctrl+V. Shortcuts raise efficiency but also add operational complexity.

With the action space tuned this way, PC-Agent can carry out user instructions more efficiently.

④ Optimizing the Reflection Agent

Reflection Agent diagram

In the "planning-decision-reflection" framework, the Reflection Agent is a key component: it evaluates the result of each step and proposes fixes when a step fails. For PC-Agent it was optimized as follows:

  • Error detection and recovery: the Reflection Agent detects whether the previous step failed; if so, it generates a new operation strategy and feeds it back to the Action Agent for re-execution.
  • Dynamic adjustment: the Reflection Agent keeps adjusting and refining its behavior based on the feedback from each operation, gradually raising the success rate.

These improvements let the Reflection Agent handle the problems that can arise on the PC side more effectively, keeping tasks on track.

⑤ Making Efficient Use of Keyboard Shortcuts

To raise efficiency, PC-Agent makes heavy use of the PC platform's keyboard shortcuts, which cut down the number of steps and speed up task completion. In document handling, for example, copying, pasting, and saving via shortcuts is far faster than clicking through menus; a sketch follows.
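As a sketch, the same document-handling shortcuts via pyautogui (a real API; the sequence itself is illustrative):

import pyautogui

pyautogui.hotkey("ctrl", "a")   # select all
pyautogui.hotkey("ctrl", "c")   # copy
pyautogui.hotkey("ctrl", "s")   # save: one call instead of several menu clicks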

In summary, PC-Agent builds on Mobile-Agent v2 and, by thoroughly adapting the debugging tools, the visual perception module, the action space, the Reflection Agent, and its use of shortcuts, successfully extends the strengths of the multi-agent collaboration system to the PC platform, giving users much stronger automation capabilities.

References

  • Alibaba Bailian (阿里百炼)
  • Source code
  • Demo


If this article helped you, a like, comment, share, and follow are the biggest encouragement for my writing! A star 🌟 works too 😂.
