Mobile-Agent Competition Analysis and Code Walkthrough Notes (DataWhale AI Summer Camp)


Preface

Hi, I'm GISer Liu, a GIS developer passionate about AI technology. This article is my study-note summary of the competition analysis and code walkthrough for the Mobile-Agent track, the final session of the DataWhale 2024 AI Summer Camp. I will also share some of my own ideas along the way.


This article is a summary and extension of the original video, available here: MobileAgent解析

I. Competition Background

1. What is Mobile-Agent?

Large-model agents are the future of LLM (large language model) applications. As AI luminaries Andrej Karpathy and Bill Gates have predicted, they will completely change how we interact with computers and disrupt the entire software industry.

Andrej Karpathy once said:
"If a paper proposes a different training method, our internal Slack would sneer at it as something we had already tried. But when a new AI Agents paper comes out, we discuss it seriously and excitedly. AI Agents will not only change how everyone interacts with computers; they will also upend the software industry and bring the biggest computing revolution since we moved from typing commands to clicking icons."

Bill Gates has likewise said that AI Agents hold enormous potential.

① Concept

Mobile-Agent concept diagram

So what is Mobile-Agent?

  • Multi-agent architecture: Mobile-Agent is a multi-agent architecture built to handle the hard problem of navigating long-context, interleaved image-and-text input.
  • Enhanced visual perception module: it ships with an enhanced visual perception module that raises operation accuracy, especially in visual recognition and information processing.
  • Powered by GPT-4o: backed by GPT-4o's processing power, Mobile-Agent gains significant improvements in both operation quality and speed.

You can see the framework in action in the v3 demo:

v3 demo screenshot

② Features

So how does Mobile-Agent differ from other agents?

Mobile-Agent's key features include:

  • Pure-vision approach: it relies on visual perception alone, not on system data, which gives it clear advantages in privacy and system compatibility.
  • Cross-app operation: it can operate seamlessly across multiple apps, with strong task-transfer ability.
  • Perception, planning, and reflection combined: integrating perception, planning, and reflection modules lets it understand and complete complex tasks more reliably.
  • Plug-and-play, no training required: it needs no pre-training, only simple configuration, which makes it well suited to rapid deployment.

③ Application Examples

Mobile-Agent application examples

Mobile-Agent has already been validated in several real-world scenarios, such as:

  • Checking the weather: opening a weather app and automatically retrieving and analyzing weather information.
  • Browsing and liking short videos: scrolling through short videos and liking them according to given rules.
  • Searching for a video and commenting: searching for specified content on a video platform and posting a comment.
  • Map navigation: automatically issuing and executing navigation commands, e.g. setting a destination in a map app and starting navigation.

2. Analysis of the Two Versions

Problems with V1

The early version of Mobile-Agent (V1) had the following problem:

  • Hard to track task progress: because the operation history is long and interleaves images with text, a single agent struggles to track task progress efficiently, which easily leads to wrong operations or task failure.

V1 problem illustration

Improvements in V2

V2 brought major improvements to Mobile-Agent, including:

  • Multi-agent architecture: the first use of a multi-agent architecture for mobile-device operation tasks, greatly improving execution accuracy and efficiency.
  • Still a pure-vision approach: it keeps the advantages of the pure-vision design, so the system stays independent of underlying system data and remains broadly compatible.
  • More effective progress tracking: with each agent handling its own role, task progress is tracked efficiently, and both the memory of task-relevant information and the ability to reflect on operations are strengthened.
  • Stronger instruction decomposition: improved decomposition of complex instructions enables complex operations across different apps and supports multilingual scenarios.
  • Validated in real applications: V2 has been verified on real tasks, such as ordering knife-cut noodles and hailing a ride with Amap (高德地图).

V2 demo

3. Structure of Mobile-Agent V2

Why is Mobile-Agent so capable? Here is a diagram of the V2 framework:

Mobile-Agent V2 framework diagram

Mobile-Agent V2 consists of three main agents:

  1. Planning Agent (task planning)

    • Role: responsible for overall task planning; it derives the follow-up plan from the user's instruction and the steps completed so far, answering questions like "What should I do first?", "What have I already done?", and "What should I do next?".
  2. Decision Agent (execution decisions)

    • Role: makes and executes concrete step-level decisions based on the Planning Agent's output and the current visual input (e.g. a screenshot). The Decision Agent outputs a chain of thought (a text description), an action command (Action), and key information that may be needed later (Memory).
  3. Reflection Agent (reflection)

    • Role: after the Decision Agent acts, the Reflection Agent judges whether the operation was correct. If it was, it signals success and the system proceeds to the next step; if it failed, it reports the failure details and prompts the Decision Agent to decide again. Since the Decision Agent can never reach 100% accuracy, the Reflection Agent plays a crucial part in keeping the whole system reliable.

This framework not only gives Mobile-Agent an edge on complex tasks, it also makes the system more robust and adaptable. The overall loop can be sketched as follows.
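Below is a minimal control-flow sketch of the plan-decide-reflect loop. The function names (perceive, decide, execute, reflect, plan) are hypothetical placeholders for readability, not the real run.py API:

def mobile_agent_v2(instruction):
    completed = ""   # Planning Agent's running summary of finished sub-tasks
    memory = ""      # key on-screen information saved for later steps
    error = False
    while True:
        screen = perceive()                                             # screenshot + OCR + icon captions
        action = decide(instruction, completed, memory, screen, error)  # Decision Agent
        if action == "Stop":
            break
        execute(action)                                                 # tap / swipe / type via ADB
        error = not reflect(screen, perceive(), action)                 # Reflection Agent: before vs. after
        if not error:
            completed = plan(instruction, completed, action)            # Planning Agent updates progress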

4. Understanding the Competition Tasks

This Mobile-Agent application challenge has two tracks:

  • Track 1: Design and implement a phone-side agent for a specific application scenario based on the Mobile-Agent framework
  • Track 2: Design and implement an agent for other terminal devices based on the Mobile-Agent framework

Theme: make the Agent your super assistant
Goal: explore how the Mobile-Agent framework can be applied and developed across different scenarios

① Phone-Side Agent

Key points

  • Centered on the phone: this track focuses on building innovative application scenarios on the smartphone platform.
  • Aimed at "specific" and "innovative" scenarios, so think big: entrants should pick a concrete phone-side scenario and design an intelligent, automated solution for it.
  • Keep the architecture largely intact; modify or add prompts and call other tools: on top of the existing Mobile-Agent framework, entrants can customize prompts, integrate third-party tools, and so on to extend the agent's capability and adaptability.
  • Improve inference speed: to guarantee a good user experience in practice, the agent's inference speed (i.e. execution efficiency) is an important consideration.

Task approach

Phone-side agent task approach diagram

When designing a phone-side agent, consider these steps:

  1. Pick the scenario: first decide the concrete problem to solve or service to provide, e.g. office automation, smart-home control, or health monitoring.
  2. Design the agent's functions: based on the scenario, design the agent's core functional modules, e.g. text recognition, voice interaction, or data processing.
  3. Tune the framework: adapt the Mobile-Agent framework's structure to your scenario and optimize inference speed so it runs smoothly on a mobile device.
  4. Integrate tools: as needed, bring in third-party APIs or tools, such as machine-learning models or cloud services, to strengthen the agent.

Scoring criteria

Scoring criteria diagram

Scoring will likely cover the following aspects:

  1. Innovation: whether the agent's design is novel and creative.
  2. Practicality: whether the agent runs effectively in a real scenario and solves a real problem.
  3. Technical implementation: code quality and the difficulty and sophistication of the implementation.
  4. User experience: whether inference and system response are fast enough and the interface is friendly.

② Agents for Other Terminal Devices

Key points

  • Centered on "other", non-phone devices: this track focuses on building agents for non-phone devices, such as PCs, smart-home appliances, and IoT devices.
  • Extend the framework around the chosen terminal's characteristics: extend and adapt the Mobile-Agent framework as needed to fit the specific device.
  • Modify or add prompts and call other tools: as in the phone-side track, entrants can customize prompts and integrate tools to strengthen the agent.
  • Improve inference speed: the agent's inference speed on the target device again matters for a good user experience.

Task approach

When designing an agent for another terminal device, consider these steps:

  1. Choose the target device: decide which terminal to target, e.g. a smartwatch, smart speaker, or smart TV.
  2. Analyze the device: study its hardware and software characteristics, such as compute power, input/output methods, and operating system, so the framework can be adapted.
  3. Design the agent's functions: based on the device's characteristics, design the core functional modules; a smart TV, for example, may need voice recognition and content recommendation.
  4. Optimize and test: optimize for the device's characteristics and verify through real testing that the agent runs efficiently on it.

Scoring criteria

Other-terminal agent task approach diagram

The scoring criteria are similar to the phone-side track's, focusing mainly on innovation, practicality, technical implementation, and user experience.

On top of that, the prize money is quite attractive 😀.

II. Mobile-Agent Framework Walkthrough

1. Setup and Configuration

Getting started

❗ Currently only Android and HarmonyOS (version ≤ 4) support tool-based debugging. Other systems such as iOS cannot use Mobile-Agent yet.

Install the dependencies

pip install -r requirements.txt

Connect your mobile device via ADB

  1. Download the Android Debug Bridge (ADB).
  2. Enable "USB debugging" or "ADB debugging" on your device. This usually means turning on Developer Options and switching it on there. On HyperOS you must also enable "USB debugging (Security settings)".
  3. Connect the device to your computer with a data cable and select "File transfer" in the phone's connection options.
  4. Test the connection with the following command (a small connectivity check in Python follows this list):
    /path/to/adb devices
    
    If the printed device list is not empty, the connection works.
  5. On MacOS or Linux, first make ADB executable:
    sudo chmod +x /path/to/adb
    
    /path/to/adb is xx/xx/adb.exe on Windows and xx/xx/adb on MacOS or Linux
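You can run the same check from Python as well; a small sketch, assuming adb_path points at your ADB binary:

    import subprocess
    
    adb_path = "/path/to/adb"  # adb.exe on Windows
    result = subprocess.run(f"{adb_path} devices", capture_output=True, text=True, shell=True)
    # skip the "List of devices attached" header; keep lines that end with "device"
    devices = [line for line in result.stdout.splitlines()[1:] if line.strip().endswith("device")]
    print("connected devices:", devices)  # a non-empty list means the connection works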

Install the ADB keyboard on your mobile device

  1. Download the ADB keyboard APK.
  2. Tap the APK on the device to install it.
  3. In system settings, switch the default input method to "ADB Keyboard".

Choose how to run

  1. Edit your settings starting at line 22 of run.py, entering your ADB path, instruction, GPT-4 API URL, and token.

  2. Choose how the icon-caption model is called, whichever suits your device:

    • If your machine has a high-performance GPU, the "local" method, i.e. deploying the icon-caption model locally, is recommended and usually more efficient.
    • If your machine cannot run a 7B-scale LLM, choose the "api" method. Calls are parallelized to keep it efficient.
  3. Choose the icon-caption model:

    • With the "local" method, choose between "qwen-vl-chat" and "qwen-vl-chat-int4". "qwen-vl-chat" needs more GPU memory but outperforms "qwen-vl-chat-int4". In this case "qwen_api" can be left empty.
    • With the "api" method, choose between "qwen-vl-plus" and "qwen-vl-max". "qwen-vl-max" costs more but outperforms "qwen-vl-plus". You also need to apply for a Qwen-VL API key and enter it as "qwen_api".
  4. You can add operation knowledge in "add_info" (e.g. the specific steps required to complete your instruction) to help the agent operate the device more accurately.

  5. To squeeze out more efficiency, you can set "reflection_switch" and "memory_switch" to "False" (a settings sketch follows this list):

    • "reflection_switch" controls whether the "Reflection Agent" joins the loop. Disabling it may let the run fall into a dead loop, though you can mitigate that by adding operation knowledge to "add_info".
    • "memory_switch" controls whether the "memory unit" joins the loop. If your instruction never needs information from earlier screens in later steps, you can turn it off.
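For reference, a hypothetical run.py settings block using these switches (all values are illustrative placeholders):

    # run.py settings, around line 22; values here are placeholders
    adb_path = "/path/to/adb"
    instruction = "Open Settings and turn on Bluetooth."
    add_info = "If a permission dialog appears, tap Allow first."  # operation knowledge
    reflection_switch = True  # False skips the Reflection Agent: faster, but riskier
    memory_switch = True      # False if no earlier screen info is needed in later steps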

Run

python run.py
2. run.py Main Script Analysis
  • The project's main goal is to read, analyze, and operate on a mobile device's screen content by combining multimodal large models with image-processing techniques
  • It talks to the device through the Android Debug Bridge (ADB), grabs screenshots, runs various models for image recognition, text recognition, and operation decisions, and finally executes the user's instruction

The project is made up of several modules, each with a specific role. Here is the module breakdown with the corresponding code analysis:

① Environment setup and initialization
  • Purpose: configure the ADB path, the user instruction, the model selection, and how the API is called.
  • Code
    # Your ADB path
    adb_path = "C:/Users/<username>/AppData/Local/Android/Sdk/platform-tools/adb.exe"
    
    # Your instruction
    instruction = "Read the Screen, tell me what day it is today. Then open Play Store."
    
    # Choose between "api" and "local". api: use the qwen api. local: use the local qwen checkpoint
    caption_call_method = "api"
    
    # Choose between "qwen-vl-plus" and "qwen-vl-max" if use api method. Choose between "qwen-vl-chat" and "qwen-vl-chat-int4" if use local method.
    caption_model = "qwen-vl-plus"
    
    # If you choose the api caption call method, input your Qwen api here
    qwen_api = "<your api key>"
    
    # Other settings...
  • Approach: before anything runs, the script initializes the runtime environment by setting the ADB path, the user instruction, the API call method, and the chosen models.
② Chat history initialization
  • Purpose: initialize the different conversation histories (operation, reflection, and memory histories) used in later interactions.

  • Code

    def init_action_chat():
        operation_history = []
        sysetm_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user's instruction."
        operation_history.append({'role': 'system', 'content': [{'text': sysetm_prompt}]})
        return operation_history
    
  • Approach: separate initialization functions build the operation, reflection, and memory conversation histories respectively, so each stage can reuse its own record when generating decisions.

③ Image processing and information extraction
  • Purpose: capture the phone screen, run OCR, detect icons, and process coordinates.

  • Code

    def get_perception_infos(adb_path, screenshot_file):
        get_screenshot(adb_path)
        width, height = Image.open(screenshot_file).size
        text, coordinates = ocr(screenshot_file, ocr_detection, ocr_recognition)
        text, coordinates = merge_text_blocks(text, coordinates)
        center_list = [[(coordinate[0]+coordinate[2])/2, (coordinate[1]+coordinate[3])/2] for coordinate in coordinates]
        draw_coordinates_on_image(screenshot_file, center_list)
        perception_infos = []
        for i in range(len(coordinates)):
            perception_info = {"text": "text: " + text[i], "coordinates": coordinates[i]}
            perception_infos.append(perception_info)
        # Detect icons...
        # Add icon descriptions to perception_infos...
        return perception_infos, width, height
    
  • Approach: this module pulls the useful information, text and icons, out of the phone screenshot and turns it into the input for the subsequent operations.

④ Model loading and inference
  • Purpose: load and initialize the required deep-learning models and process the user's instruction.

  • Code

    device = "cpu"
    torch.manual_seed(1234)
    if caption_call_method == "local":# Load local models...
    elif caption_call_method == "api":# Use API for models...
    
  • Approach: depending on the user's choice, the project loads either local models or API-backed ones for image captioning, text recognition, and icon detection. Swapping models and APIs lets it fit different scenarios and hardware.

⑤ Action execution
  • Purpose: execute the corresponding phone operation (tap, swipe, back, and so on) according to the action the model outputs.

  • Code

    if "Open app" in action:# Open a specific app...
    elif "Tap" in action:# Tap on a specific coordinate...
    elif "Swipe" in action:# Swipe from one coordinate to another...
    elif "Type" in action:# Type text...
    elif "Back" in action:back(adb_path)
    elif "Home" in action:home(adb_path)
    elif "Stop" in action:break
    
  • Approach: this is the core logic of the project. It executes the phone operation matching the decided action in order to carry out the user's instruction; a sketch of how an action string can be parsed into a concrete call follows.
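For example, a "Tap (x, y)" action has to be turned into coordinates before tap() can run. A minimal parsing sketch (the real run.py parsing may differ in detail):

    import re
    
    def parse_tap(action):
        # pull integer coordinates out of a string like "Tap (540, 1200)"
        match = re.search(r"Tap \((\d+),\s*(\d+)\)", action)
        return (int(match.group(1)), int(match.group(2))) if match else None
    
    coords = parse_tap("Tap (540, 1200)")
    if coords:
        tap(adb_path, *coords)  # tap() comes from MobileAgent/controller.py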

⑥ Reflection and memory modules
  • Purpose: reflect on the previous operation's result to adjust the next step's strategy, and store valuable information in memory.

  • Code

    if reflection_switch:
        prompt_reflect = get_reflect_prompt(...)
        chat_reflect = init_reflect_chat()
        chat_reflect = add_response_two_image("user", prompt_reflect, chat_reflect, [last_screenshot_file, screenshot_file])
        # note: the original passes chat_action here, which may be a typo for chat_reflect
        output_reflect = call_with_local_file(chat_action, api_key=qwen_api, model='qwen-vl-plus')
        reflect = output_reflect.split("### Answer ###")[-1].replace("\n", " ").strip()
        chat_reflect = add_response("system", output_reflect, chat_reflect)
        if 'A' in reflect:
            thought_history.append(thought)
            summary_history.append(summary)
            action_history.append(action)
        # Other conditions...
    
  • Approach: through the reflection module, the system judges from the previous result whether the strategy needs adjusting, and stores important information in the memory module for later steps. The three possible reflection answers are sketched below.
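The "# Other conditions..." line above elides the handling of answers B and C from the reflection prompt (wrong page, or no change). A hedged sketch of how those cases can be handled, reusing the names from the snippet above:

    if 'A' in reflect:        # the operation met expectations: keep the step in history
        thought_history.append(thought)
        summary_history.append(summary)
        action_history.append(action)
    elif 'B' in reflect:      # wrong page: go back and let the Decision Agent retry
        error_flag = True
        back(adb_path)
    elif 'C' in reflect:      # nothing changed on screen: flag the error and retry
        error_flag = True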

⑦ Main loop and termination
  • Purpose: run multiple rounds of operations in a main loop and stop when a termination condition is met.

  • Code

    while True:
        iter += 1
        # First iteration...
        # Action decision...
        # Memory update...
        # Reflection...
        if "Stop" in action:
            break
        time.sleep(5)
    
  • Approach: the project runs in a loop until the task finishes or a termination condition triggers. Each round updates the operation from the latest screenshot and the user instruction, reflecting and adjusting strategy when appropriate.

⑧ Task summary
  • Purpose: summarize what has been completed, extracting the key content to confirm the goal was reached.

  • Code

    completed_requirements = output_planning.split("### Completed contents ###")[-1].replace("\n", " ").strip()
    
  • Approach: by summarizing the completed work, this part verifies the execution result and confirms the user's goal was met.


3. api.py Analysis

The original code is as follows:

import base64
import requests


# read an image file and return its base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# send the chat history to an OpenAI-style endpoint and retry until a reply arrives
def inference_chat(chat, model, api_url, token):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}"
    }
    data = {
        "model": model,
        "messages": [],
        "max_tokens": 2048,
        'temperature': 0.0,
        "seed": 1234
    }
    for role, content in chat:
        data["messages"].append({"role": role, "content": content})
    while True:
        try:
            res = requests.post(api_url, headers=headers, json=data)
            res_json = res.json()
            res_content = res_json['choices'][0]['message']['content']
        except:
            print("Network Error:")
            try:
                print(res.json())
            except:
                print("Request Failed")
        else:
            break
    return res_content
4. chat.py Analysis

The source code is as follows:

import copy
from MobileAgent.api import encode_image


# each init_*_chat() builds a fresh history seeded with a system prompt
def init_action_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user\'s instruction."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


def init_reflect_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


def init_memory_chat():
    operation_history = []
    sysetm_prompt = "You are a helpful AI mobile phone operating assistant."
    operation_history.append(["system", [{"type": "text", "text": sysetm_prompt}]])
    return operation_history


# append a turn (text plus an optional screenshot) without mutating the old history
def add_response(role, prompt, chat_history, image=None):
    new_chat_history = copy.deepcopy(chat_history)
    if image:
        base64_image = encode_image(image)
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
        ]
    else:
        content = [
            {"type": "text", "text": prompt},
        ]
    new_chat_history.append([role, content])
    return new_chat_history


# same as add_response, but attaches two screenshots (before/after an operation)
def add_response_two_image(role, prompt, chat_history, image):
    new_chat_history = copy.deepcopy(chat_history)
    base64_image1 = encode_image(image[0])
    base64_image2 = encode_image(image[1])
    content = [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image2}"}},
    ]
    new_chat_history.append([role, content])
    return new_chat_history


# print the history, replacing image payloads with "<image>" placeholders
def print_status(chat_history):
    print("*"*100)
    for chat in chat_history:
        print("role:", chat[0])
        print(chat[1][0]["text"] + "<image>"*(len(chat[1])-1) + "\n")
    print("*"*100)
5. controller.py Analysis

The source code is as follows:

import os
import time
import subprocess
from PIL import Image


# capture the screen via ADB, pull it to ./screenshot, and convert PNG -> JPEG
def get_screenshot(adb_path):
    command = adb_path + " shell rm /sdcard/screenshot.png"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    time.sleep(0.5)
    command = adb_path + " shell screencap -p /sdcard/screenshot.png"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    time.sleep(0.5)
    command = adb_path + " pull /sdcard/screenshot.png ./screenshot"
    subprocess.run(command, capture_output=True, text=True, shell=True)
    image_path = "./screenshot/screenshot.png"
    save_path = "./screenshot/screenshot.jpg"
    image = Image.open(image_path)
    image.convert("RGB").save(save_path, "JPEG")
    os.remove(image_path)


def tap(adb_path, x, y):
    command = adb_path + f" shell input tap {x} {y}"
    subprocess.run(command, capture_output=True, text=True, shell=True)


# type text character by character; plain ASCII goes through "input text",
# anything else is broadcast to the ADB Keyboard app ("_" stands in for Enter)
def type(adb_path, text):
    text = text.replace("\\n", "_").replace("\n", "_")
    for char in text:
        if char == ' ':
            command = adb_path + f" shell input text %s"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif char == '_':
            command = adb_path + f" shell input keyevent 66"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif 'a' <= char <= 'z' or 'A' <= char <= 'Z' or char.isdigit():
            command = adb_path + f" shell input text {char}"
            subprocess.run(command, capture_output=True, text=True, shell=True)
        elif char in '-.,!?@\'°/:;()':
            command = adb_path + f" shell input text \"{char}\""
            subprocess.run(command, capture_output=True, text=True, shell=True)
        else:
            command = adb_path + f" shell am broadcast -a ADB_INPUT_TEXT --es msg \"{char}\""
            subprocess.run(command, capture_output=True, text=True, shell=True)


def slide(adb_path, x1, y1, x2, y2):
    command = adb_path + f" shell input swipe {x1} {y1} {x2} {y2} 500"
    subprocess.run(command, capture_output=True, text=True, shell=True)


def back(adb_path):
    command = adb_path + f" shell input keyevent 4"
    subprocess.run(command, capture_output=True, text=True, shell=True)


def home(adb_path):
    command = adb_path + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME"
    subprocess.run(command, capture_output=True, text=True, shell=True)
6. crop.py Analysis

The source code is as follows:

import math
import cv2
import numpy as np
from PIL import Image, ImageDraw
import clip
import torch


# rectify a quadrilateral text region into an axis-aligned crop via a perspective transform
def crop_image(img, position):
    def distance(x1, y1, x2, y2):
        return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))

    position = position.tolist()
    for i in range(4):
        for j in range(i+1, 4):
            if position[i][0] > position[j][0]:
                tmp = position[j]
                position[j] = position[i]
                position[i] = tmp
    if position[0][1] > position[1][1]:
        tmp = position[0]
        position[0] = position[1]
        position[1] = tmp
    if position[2][1] > position[3][1]:
        tmp = position[2]
        position[2] = position[3]
        position[3] = tmp

    x1, y1 = position[0][0], position[0][1]
    x2, y2 = position[2][0], position[2][1]
    x3, y3 = position[3][0], position[3][1]
    x4, y4 = position[1][0], position[1][1]

    corners = np.zeros((4, 2), np.float32)
    corners[0] = [x1, y1]
    corners[1] = [x2, y2]
    corners[2] = [x4, y4]
    corners[3] = [x3, y3]

    img_width = distance((x1+x4)/2, (y1+y4)/2, (x2+x3)/2, (y2+y3)/2)
    img_height = distance((x1+x2)/2, (y1+y2)/2, (x4+x3)/2, (y4+y3)/2)

    corners_trans = np.zeros((4, 2), np.float32)
    corners_trans[0] = [0, 0]
    corners_trans[1] = [img_width - 1, 0]
    corners_trans[2] = [0, img_height - 1]
    corners_trans[3] = [img_width - 1, img_height - 1]

    transform = cv2.getPerspectiveTransform(corners, corners_trans)
    dst = cv2.warpPerspective(img, transform, (int(img_width), int(img_height)))
    return dst


def calculate_size(box):
    return (box[2]-box[0]) * (box[3]-box[1])


# standard intersection-over-union between two [x1, y1, x2, y2] boxes
def calculate_iou(box1, box2):
    xA = max(box1[0], box2[0])
    yA = max(box1[1], box2[1])
    xB = min(box1[2], box2[2])
    yB = min(box1[3], box2[3])
    interArea = max(0, xB - xA) * max(0, yB - yA)
    box1Area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2Area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    unionArea = box1Area + box2Area - interArea
    iou = interArea / unionArea
    return iou


def crop(image, box, i, text_data=None):
    image = Image.open(image)
    if text_data:
        draw = ImageDraw.Draw(image)
        draw.rectangle(((text_data[0], text_data[1]), (text_data[2], text_data[3])), outline="red", width=5)
        # font_size = int((text_data[3] - text_data[1])*0.75)
        # font = ImageFont.truetype("arial.ttf", font_size)
        # draw.text((text_data[0]+5, text_data[1]+5), str(i), font=font, fill="red")
    cropped_image = image.crop(box)
    cropped_image.save(f"./temp/{i}.jpg")


def in_box(box, target):
    if (box[0] > target[0]) and (box[1] > target[1]) and (box[2] < target[2]) and (box[3] < target[3]):
        return True
    else:
        return False


# crop the box only if it falls inside the requested screen region ("left", "top right", ...)
def crop_for_clip(image, box, i, position):
    image = Image.open(image)
    w, h = image.size
    if position == "left":
        bound = [0, 0, w/2, h]
    elif position == "right":
        bound = [w/2, 0, w, h]
    elif position == "top":
        bound = [0, 0, w, h/2]
    elif position == "bottom":
        bound = [0, h/2, w, h]
    elif position == "top left":
        bound = [0, 0, w/2, h/2]
    elif position == "top right":
        bound = [w/2, 0, w, h/2]
    elif position == "bottom left":
        bound = [0, h/2, w/2, h]
    elif position == "bottom right":
        bound = [w/2, h/2, w, h]
    else:
        bound = [0, 0, w, h]
    if in_box(box, bound):
        cropped_image = image.crop(box)
        cropped_image.save(f"./temp/{i}.jpg")
        return True
    else:
        return False


# score each icon crop against a text prompt with CLIP and return the best match's index
def clip_for_icon(clip_model, clip_preprocess, images, prompt):
    image_features = []
    for image_file in images:
        image = clip_preprocess(Image.open(image_file)).unsqueeze(0).to(next(clip_model.parameters()).device)
        image_feature = clip_model.encode_image(image)
        image_features.append(image_feature)
    image_features = torch.cat(image_features)
    text = clip.tokenize([prompt]).to(next(clip_model.parameters()).device)
    text_features = clip_model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=0).squeeze(0)
    _, max_pos = torch.max(similarity, dim=0)
    pos = max_pos.item()
    return pos
7. icon_localization.py Analysis

The source code is as follows:

from MobileAgent.crop import calculate_size, calculate_iou
from PIL import Image
import torch


# drop boxes that are too large (over 5% of the screen) or that overlap an
# earlier box beyond the IoU threshold
def remove_boxes(boxes_filt, size, iou_threshold=0.5):
    boxes_to_remove = set()
    for i in range(len(boxes_filt)):
        if calculate_size(boxes_filt[i]) > 0.05*size[0]*size[1]:
            boxes_to_remove.add(i)
        for j in range(len(boxes_filt)):
            if calculate_size(boxes_filt[j]) > 0.05*size[0]*size[1]:
                boxes_to_remove.add(j)
            if i == j:
                continue
            if i in boxes_to_remove or j in boxes_to_remove:
                continue
            iou = calculate_iou(boxes_filt[i], boxes_filt[j])
            if iou >= iou_threshold:
                boxes_to_remove.add(j)
    boxes_filt = [box for idx, box in enumerate(boxes_filt) if idx not in boxes_to_remove]
    return boxes_filt


# detect icons matching `caption` with GroundingDINO and return pixel-space boxes
def det(input_image_path, caption, groundingdino_model, box_threshold=0.05, text_threshold=0.5):
    image = Image.open(input_image_path)
    size = image.size
    caption = caption.lower()
    caption = caption.strip()
    if not caption.endswith('.'):
        caption = caption + '.'
    inputs = {
        'IMAGE_PATH': input_image_path,
        'TEXT_PROMPT': caption,
        'BOX_TRESHOLD': box_threshold,
        'TEXT_TRESHOLD': text_threshold
    }
    result = groundingdino_model(inputs)
    boxes_filt = result['boxes']
    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2   # center/size -> top-left corner
        boxes_filt[i][2:] += boxes_filt[i][:2]       # width/height -> bottom-right corner
    boxes_filt = boxes_filt.cpu().int().tolist()
    filtered_boxes = remove_boxes(boxes_filt, size)  # [:9]
    coordinates = []
    for box in filtered_boxes:
        coordinates.append([box[0], box[1], box[2], box[3]])
    return coordinates
8. prompt.py Analysis

The source code is as follows:

def get_action_prompt(instruction, clickable_infos, width, height, keyboard, summary_history, action_history, last_summary, last_action, add_info, error_flag, completed_content, memory):
    prompt = "### Background ###\n"
    prompt += f"This image is a phone screenshot. Its width is {width} pixels and its height is {height} pixels. The user\'s instruction is: {instruction}.\n\n"

    prompt += "### Screenshot information ###\n"
    prompt += "In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. "
    prompt += "This information consists of two parts: coordinates; content. "
    prompt += "The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively. "
    prompt += "The information is as follow:\n"
    for clickable_info in clickable_infos:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Please note that this information is not necessarily accurate. You need to combine the screenshot to understand."
    prompt += "\n\n"

    prompt += "### Keyboard status ###\n"
    prompt += "We extract the keyboard status of the current screenshot and it is whether the keyboard of the current screenshot is activated.\n"
    prompt += "The keyboard status is as follow:\n"
    if keyboard:
        prompt += "The keyboard has been activated and you can type."
    else:
        prompt += "The keyboard has not been activated and you can\'t type."
    prompt += "\n\n"

    if add_info != "":
        prompt += "### Hint ###\n"
        prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
        prompt += add_info
        prompt += "\n\n"

    if len(action_history) > 0:
        prompt += "### History operations ###\n"
        prompt += "Before reaching this page, some operations have been completed. You need to refer to the completed operations to decide the next operation. These operations are as follow:\n"
        for i in range(len(action_history)):
            prompt += f"Step-{i+1}: [Operation: " + summary_history[i].split(" to ")[0].strip() + "; Action: " + action_history[i] + "]\n"
        prompt += "\n"

    if completed_content != "":
        prompt += "### Progress ###\n"
        prompt += "After completing the history operations, you have the following thoughts about the progress of user\'s instruction completion:\n"
        prompt += "Completed contents:\n" + completed_content + "\n\n"

    if memory != "":
        prompt += "### Memory ###\n"
        prompt += "During the operations, you record the following contents on the screenshot for use in subsequent operations:\n"
        prompt += "Memory:\n" + memory + "\n"

    if error_flag:
        prompt += "### Last operation ###\n"
        prompt += f"You previously wanted to perform the operation \"{last_summary}\" on this page and executed the Action \"{last_action}\". But you find that this operation does not meet your expectation. You need to reflect and revise your operation this time."
        prompt += "\n\n"

    prompt += "### Response requirements ###\n"
    prompt += "Now you need to combine all of the above to perform just one action on the current page. You must choose one of the six actions below:\n"
    prompt += "Open app (app name): If the current page is desktop, you can use this action to open the app named \"app name\" on the desktop.\n"
    prompt += "Tap (x, y): Tap the position (x, y) in current page.\n"
    prompt += "Swipe (x1, y1), (x2, y2): Swipe from position (x1, y1) to position (x2, y2).\n"
    if keyboard:
        prompt += "Type (text): Type the \"text\" in the input box.\n"
    else:
        prompt += "Unable to Type. You cannot use the action \"Type\" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.\n"
    prompt += "Home: Return to home page.\n"
    prompt += "Stop: If you think all the requirements of user\'s instruction have been completed and no further operation is required, you can choose this action to terminate the operation process."
    prompt += "\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output consists of the following three parts:\n"
    prompt += "### Thought ###\nThink about the requirements that have been completed in previous operations and the requirements that need to be completed in the next one operation.\n"
    prompt += "### Action ###\nYou can only choose one from the six actions above. Make sure that the coordinates or text in the \"()\".\n"
    prompt += "### Operation ###\nPlease generate a brief natural language description for the operation in Action based on your Thought."

    return prompt


def get_reflect_prompt(instruction, clickable_infos1, clickable_infos2, width, height, keyboard1, keyboard2, summary, action, add_info):
    prompt = f"These images are two phone screenshots before and after an operation. Their widths are {width} pixels and their heights are {height} pixels.\n\n"
    prompt += "In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. "
    prompt += "The information consists of two parts, consisting of format: coordinates; content. "
    prompt += "The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively "
    prompt += "The keyboard status is whether the keyboard of the current page is activated."
    prompt += "\n\n"

    prompt += "### Before the current operation ###\n"
    prompt += "Screenshot information:\n"
    for clickable_info in clickable_infos1:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Keyboard status:\n"
    if keyboard1:
        prompt += f"The keyboard has been activated."
    else:
        prompt += "The keyboard has not been activated."
    prompt += "\n\n"

    prompt += "### After the current operation ###\n"
    prompt += "Screenshot information:\n"
    for clickable_info in clickable_infos2:
        if clickable_info['text'] != "" and clickable_info['text'] != "icon: None" and clickable_info['coordinates'] != (0, 0):
            prompt += f"{clickable_info['coordinates']}; {clickable_info['text']}\n"
    prompt += "Keyboard status:\n"
    if keyboard2:
        prompt += f"The keyboard has been activated."
    else:
        prompt += "The keyboard has not been activated."
    prompt += "\n\n"

    prompt += "### Current operation ###\n"
    prompt += f"The user\'s instruction is: {instruction}. You also need to note the following requirements: {add_info}. In the process of completing the requirements of instruction, an operation is performed on the phone. Below are the details of this operation:\n"
    prompt += "Operation thought: " + summary.split(" to ")[0].strip() + "\n"
    prompt += "Operation action: " + action
    prompt += "\n\n"

    prompt += "### Response requirements ###\n"
    prompt += "Now you need to output the following content based on the screenshots before and after the current operation:\n"
    prompt += "Whether the result of the \"Operation action\" meets your expectation of \"Operation thought\"?\n"
    prompt += "A: The result of the \"Operation action\" meets my expectation of \"Operation thought\".\n"
    prompt += "B: The \"Operation action\" results in a wrong page and I need to return to the previous page.\n"
    prompt += "C: The \"Operation action\" produces no changes."
    prompt += "\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output format is:\n"
    prompt += "### Thought ###\nYour thought about the question\n"
    prompt += "### Answer ###\nA or B or C"

    return prompt


def get_memory_prompt(insight):
    if insight != "":
        prompt  = "### Important content ###\n"
        prompt += insight
        prompt += "\n\n"
        prompt += "### Response requirements ###\n"
        prompt += "Please think about whether there is any content closely related to ### Important content ### on the current page? If there is, please output the content. If not, please output \"None\".\n\n"
    else:
        prompt  = "### Response requirements ###\n"
        prompt += "Please think about whether there is any content closely related to user\'s instrcution on the current page? If there is, please output the content. If not, please output \"None\".\n\n"

    prompt += "### Output format ###\n"
    prompt += "Your output format is:\n"
    prompt += "### Important content ###\nThe content or None. Please do not repeatedly output the information in ### Memory ###."

    return prompt


def get_process_prompt(instruction, thought_history, summary_history, action_history, completed_content, add_info):
    prompt = "### Background ###\n"
    prompt += f"There is an user\'s instruction which is: {instruction}. You are a mobile phone operating assistant and are operating the user\'s mobile phone.\n\n"

    if add_info != "":
        prompt += "### Hint ###\n"
        prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
        prompt += add_info
        prompt += "\n\n"

    if len(thought_history) > 1:
        prompt += "### History operations ###\n"
        prompt += "To complete the requirements of user\'s instruction, you have performed a series of operations. These operations are as follow:\n"
        for i in range(len(summary_history)):
            operation = summary_history[i].split(" to ")[0].strip()
            prompt += f"Step-{i+1}: [Operation thought: " + operation + "; Operation action: " + action_history[i] + "]\n"
        prompt += "\n"

        prompt += "### Progress thinking ###\n"
        prompt += "After completing the history operations, you have the following thoughts about the progress of user\'s instruction completion:\n"
        prompt += "Completed contents:\n" + completed_content + "\n\n"

        prompt += "### Response requirements ###\n"
        prompt += "Now you need to update the \"Completed contents\". Completed contents is a general summary of the current contents that have been completed based on the ### History operations ###.\n\n"

        prompt += "### Output format ###\n"
        prompt += "Your output format is:\n"
        prompt += "### Completed contents ###\nUpdated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### History operations ###."
    else:
        prompt += "### Current operation ###\n"
        prompt += "To complete the requirements of user\'s instruction, you have performed an operation. Your operation thought and action of this operation are as follows:\n"
        prompt += f"Operation thought: {thought_history[-1]}\n"
        operation = summary_history[-1].split(" to ")[0].strip()
        prompt += f"Operation action: {operation}\n\n"

        prompt += "### Response requirements ###\n"
        prompt += "Now you need to combine all of the above to generate the \"Completed contents\".\n"
        prompt += "Completed contents is a general summary of the current contents that have been completed. You need to first focus on the requirements of user\'s instruction, and then summarize the contents that have been completed.\n\n"

        prompt += "### Output format ###\n"
        prompt += "Your output format is:\n"
        prompt += "### Completed contents ###\nGenerated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### Current operation ###.\n"
        prompt += "(Please use English to output)"

    return prompt
9. text_localization.py Analysis

The source code is as follows:

import cv2
import numpy as np
from MobileAgent.crop import crop_image


# sort the four corner points of a detected text polygon into a stable order
def order_point(coor):
    arr = np.array(coor).reshape([4, 2])
    sum_ = np.sum(arr, 0)
    centroid = sum_ / arr.shape[0]
    theta = np.arctan2(arr[:, 1] - centroid[1], arr[:, 0] - centroid[0])
    sort_points = arr[np.argsort(theta)]
    sort_points = sort_points.reshape([4, -1])
    if sort_points[0][0] > centroid[0]:
        sort_points = np.concatenate([sort_points[3:], sort_points[:3]])
    sort_points = sort_points.reshape([4, 2]).astype('float32')
    return sort_points


# despite the name, the DP below (taking max on the else branch) computes the
# longest common *subsequence* length rather than a contiguous substring
def longest_common_substring_length(str1, str2):
    m = len(str1)
    n = len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


# detect text regions, crop each one, recognize its text, and return the
# recognized texts together with their [x1, y1, x2, y2] boxes
def ocr(image_path, ocr_detection, ocr_recognition):
    text_data = []
    coordinate = []
    image_full = cv2.imread(image_path)
    det_result = ocr_detection(image_full)
    det_result = det_result['polygons']
    for i in range(det_result.shape[0]):
        pts = order_point(det_result[i])
        image_crop = crop_image(image_full, pts)
        try:
            result = ocr_recognition(image_crop)['text'][0]
        except:
            continue
        box = [int(e) for e in list(pts.reshape(-1))]
        box = [box[0], box[1], box[4], box[5]]
        text_data.append(result)
        coordinate.append(box)
    else:
        return text_data, coordinate

III. Optimization and Improvement Strategies

  • Icon localization (icon_localization.py) and text localization (text_localization.py) rely on mature detection and OCR techniques, so they leave little room for improvement, while the Prompt and the Controller (i.e. expanding the action space) leave plenty. Our optimization strategies are therefore as follows:
1. Expanding the Action Space

Mobile-Agent v2's original action space is defined in get_action_prompt (see the prompt.py walkthrough above); the relevant excerpt:

prompt += "### Response requirements ###\n"
prompt += "Now you need to combine all of the above to perform just one action on the current page. You must choose one of the six actions below:\n"
prompt += "Open app (app name): If the current page is desktop, you can use this action to open the app named \"app name\" on the desktop.\n"
prompt += "Tap (x, y): Tap the position (x, y) in current page.\n"
prompt += "Swipe (x1, y1), (x2, y2): Swipe from position (x1, y1) to position (x2, y2).\n"
if keyboard:
    prompt += "Type (text): Type the \"text\" in the input box.\n"
else:
    prompt += "Unable to Type. You cannot use the action \"Type\" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.\n"
prompt += "Home: Return to home page.\n"
prompt += "Stop: If you think all the requirements of user\'s instruction have been completed and no further operation is required, you can choose this action to terminate the operation process."

Actions worth adding include:

  • Long press: LongTap (x, y, t): tap and hold position (x, y) for t seconds on the current page.
  • Extended tap: Tap_scale (x1, y1, x2, y2): touch down at position (x1, y1) and lift at position (x2, y2) on the current page.
  • Composite actions built from these atomic ones, e.g. DoubleTap (Tap + Tap), Search (Tap + Type), and so on; a controller sketch follows this list.
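A hedged controller-side sketch of two such actions, written in the style of MobileAgent/controller.py (the function names long_tap and double_tap are suggestions; note that ADB's input swipe takes a duration in milliseconds, so seconds are converted):

import subprocess

def long_tap(adb_path, x, y, t=1):
    # ADB has no dedicated long-press event; a zero-distance swipe held for
    # the given duration is the usual trick
    command = adb_path + f" shell input swipe {x} {y} {x} {y} {int(t * 1000)}"
    subprocess.run(command, capture_output=True, text=True, shell=True)

def double_tap(adb_path, x, y):
    # composite action: two taps in quick succession
    for _ in range(2):
        command = adb_path + f" shell input tap {x} {y}"
        subprocess.run(command, capture_output=True, text=True, shell=True)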
2. Modifying the Prompt

Mobile-Agent v2 supports injecting extra knowledge or information, for example via the Hint section built from add_info (from Mobile-Agent-v2's prompt code, invoked in run.py):

if add_info != "":
    prompt += "### Hint ###\n"
    prompt += "There are hints to help you complete the user\'s instructions. The hints are as follow:\n"
    prompt += add_info
    prompt += "\n\n"

(The full get_process_prompt function appears in the prompt.py walkthrough above.)

For a specific task, you can put manual-style instructions into add_info so that the key steps are emphasized, e.g. for the card game Dou Dizhu (斗地主) shown below:

Dou Dizhu example screenshot
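A hypothetical add_info for this example (the wording is purely illustrative):

add_info = (
    "You are playing the card game Dou Dizhu. "
    "Your cards are at the bottom of the screen; tap a card to select it. "
    "Tap the Play button to play the selected cards, or the Pass button to skip this turn."
)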

3. Adapting the Prompt to a Specific App

Rewrite the prompt to be more application-specific, starting from code like this:

def get_action_prompt(instruction, clickable_infos, width, height, keyboard, summary_history, action_history, last_summary, last_action, add_info, error_flag, completed_content, memory):
    prompt = "### Background ###\n"
    prompt += f"This image is a phone screenshot. Its width is {width} pixels and its height is {height} pixels. The user\'s instruction is: {instruction}.\n\n"
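For instance, the generic background could be specialized for a single app. A hedged illustration (the wording is an assumption, not the project's own prompt):

prompt = "### Background ###\n"
prompt += (f"This image is a screenshot of the Amap (高德地图) app. "
           f"Its width is {width} pixels and its height is {height} pixels. "
           f"The user\'s instruction is: {instruction}.\n\n")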
4. Other Approaches
  • Integrate external tools, such as RAG or tools better suited to UI detection
  • Handle ads and pop-ups more robustly

IV. PC-Agent Framework Walkthrough

Project repository: PC_Agent

1. Overview

Introduction
PC-Agent is a multi-agent collaboration system that can automatically control computer software (such as Chrome, Word, and WeChat) from user instructions. Its visual perception module, designed for high-resolution screens, suits the PC platform better. Through a "planning-decision-reflection" framework, PC-Agent raises the success rate of its operations.

🔧 Quick Start

Installation
PC-Agent supports MacOS and Windows.

On MacOS:

pip install -r requirements.txt

On Windows:

pip install -r requirements_win.txt

Test it on your computer
Run run.py with your instruction and your GPT-4o API token. For example:

python run.py --instruction="Create a new doc on Word, write a brief introduction of Alibaba, and save the document." --api_token='Your GPT-4o API token.'

You can add task-specific operation knowledge via the --add_info option to help PC-Agent operate more accurately.

To further improve PC-Agent's efficiency, you can pass --disable_reflection to skip the reflection step, though note this may lower the success rate.

2. Framework

PC-Agent framework diagram

The PC-Agent framework builds on Mobile-Agent v2, with a number of changes to fit the characteristics of the PC platform. Each module is explained below:

① Adapting the Debugging Tools

Mobile-Agent v2 uses the Android Debug Bridge (ADB) to control and debug mobile devices. In PC-Agent the target moves from a phone to a PC, so the tooling changes accordingly: ADB is replaced by the Python libraries PyAutoGUI and Pyperclip.

  • PyAutoGUI: simulates mouse and keyboard operations, including clicking, dragging, and typing, which is enough for basic control of most PC software.
  • Pyperclip: handles clipboard content, supporting copy and paste of text.

With this adaptation, PC-Agent can perform on a PC the same kind of automated operations it performs on a mobile device; a small sketch of these primitives follows.
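A minimal sketch of the PC-side primitives (pyautogui and pyperclip are real libraries; the coordinates and text here are placeholders):

import pyautogui
import pyperclip

pyautogui.click(640, 360)                  # single left click
pyautogui.doubleClick(640, 360)            # double click
pyautogui.rightClick(640, 360)             # right click
pyautogui.moveTo(640, 360)
pyautogui.dragTo(800, 500, duration=0.5)   # drag from the current position

pyperclip.copy("Hello from PC-Agent")      # put text on the clipboard
pyautogui.hotkey("ctrl", "v")              # paste into the focused input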

② Adapting the Visual Perception Module

PC-Agent's visual perception module is optimized for high-resolution PC screens. Compared with a phone, a PC screen has higher resolution and richer detail, so it demands more thorough visual perception.

  • Image partitioning: the screenshot is first split up so that the system can focus on the content of specific regions.
  • OCR (Optical Character Recognition): recognizes on-screen text. PC-Agent relies on OCR to parse information on screen, such as document text or button labels.
  • SAM (Segment Anything Model): performs semantic segmentation of the image, helping the system understand the meaning and position of different on-screen elements and thereby improving interaction accuracy.

Combining these tools, PC-Agent can precisely identify and operate all kinds of elements in a complex PC environment.

③ Adapting the Action Space

PC input devices (mouse and keyboard) support more complex operations than a phone's touchscreen, so the action space must be adapted to their characteristics.

Action space diagram

  • Mouse operations: single click, double click, triple click, right click, drag, and other complex operations, enough for full control of PC applications.
  • Keyboard input: beyond ordinary text entry, this includes many keyboard shortcuts such as Ctrl+C and Ctrl+V. Shortcuts raise efficiency but also add operational complexity.

With the action space tuned this way, PC-Agent can carry out user instructions more efficiently.

④ Optimizing the Reflection Agent

Reflection Agent diagram

In the "planning-decision-reflection" framework, the Reflection Agent is a key component: it evaluates the result of each step and proposes fixes when a step fails. For PC-Agent it was optimized as follows:

  • Error detection and recovery: the Reflection Agent detects whether the previous step failed; if so, it generates a new operation strategy and feeds it back to the Action Agent for re-execution.
  • Dynamic adjustment: the Reflection Agent keeps adjusting and refining its behavior based on the feedback from each operation, gradually raising the success rate.

These improvements let the Reflection Agent handle the problems that can arise on the PC side more effectively, keeping tasks on track.

⑤ Making Efficient Use of Keyboard Shortcuts

To raise efficiency, PC-Agent makes heavy use of the PC platform's keyboard shortcuts, which cut down the number of steps and speed up task completion. In document handling, for example, copying, pasting, and saving via shortcuts is far faster than clicking through menus; a sketch follows.
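As a sketch, the same document-handling shortcuts via pyautogui (a real API; the sequence itself is illustrative):

import pyautogui

pyautogui.hotkey("ctrl", "a")   # select all
pyautogui.hotkey("ctrl", "c")   # copy
pyautogui.hotkey("ctrl", "s")   # save: one call instead of several menu clicks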

In summary, PC-Agent builds on Mobile-Agent v2 and, by thoroughly adapting the debugging tools, the visual perception module, the action space, the Reflection Agent, and its use of shortcuts, successfully extends the strengths of the multi-agent collaboration system to the PC platform, giving users much stronger automation capabilities.

References

  • Alibaba Bailian (阿里百炼)
  • Source code
  • Demo


If this article helped you, a like, comment, share, and follow are the biggest encouragement for my writing! A star 🌟 works too 😂.
