Don't worry, you're not losing your job yet: a quick hands-on with AppAgent

Posted by 恒温 on 2023-12-24

Every time a new automation tool appears, a crowd of people cries that testers are about to lose their jobs. A few days ago, when AppAgent came out, the quick-reacting content accounts started reposting it and pointed the blade straight at test engineers, so once again we were "out of work". Since our team also has a real need for app automation testing, I got it running on my own machine right away to see what it can do.

Installation steps

Installation is simple. I'm on Windows 11 64-bit, with the Android tooling already set up (really all you need is adb) and a working Python environment (I use conda; look it up if you need setup instructions). Then download the code and install the dependencies with pip install -r requirements.txt, and you're ready to go.
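If you want to double-check the environment before going further, a tiny sanity check like the following (my own sketch, not part of AppAgent) confirms that adb is on the PATH and can see a device:

import shutil
import subprocess

# Is adb installed, and is a device attached? AppAgent drives everything through adb.
adb_path = shutil.which("adb")
if adb_path is None:
    print("adb not found on PATH - install the Android platform-tools first")
else:
    print(subprocess.run([adb_path, "devices"], capture_output=True, text=True).stdout)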

If your English is up to it, just read https://github.com/mnotgod96/AppAgent directly.

Pre-run configuration

The reason is simply that it uses OpenAI's gpt-4-vision-preview model, so you need a paid OpenAI account and the corresponding OPENAI_API_KEY. These go into AppAgent's configuration file, config.yaml:

...
OPENAI_API_BASE: "https://api.openai.com/v1/chat/completions"
OPENAI_API_KEY: "sk-xxxx"  # Set the value to sk-xxx if you host the openai interface for open llm model
OPENAI_API_MODEL: "gpt-4-vision-preview"  # The only OpenAI model by now that accepts visual input
...

These parameters are read and used in model.py.

The ask_gpt4v method: this is the method that talks to OpenAI.

import requests

# `configs` (loaded from config.yaml) and `print_with_color` are defined
# elsewhere in the AppAgent code base.

def ask_gpt4v(content):
    # A plain OpenAI chat-completions request, authenticated with the key from config.yaml
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {configs['OPENAI_API_KEY']}"
    }
    payload = {
        "model": configs["OPENAI_API_MODEL"],
        "messages": [
            {
                "role": "system",
                "content": content
            }
        ],
        "temperature": configs["TEMPERATURE"],
        "max_tokens": configs["MAX_TOKENS"]
    }
    response = requests.post(configs["OPENAI_API_BASE"], headers=headers, json=payload)
    print_with_color("resp: ", response)
    if "error" not in response.json():
        # Estimate the cost of the request from the token usage
        # (gpt-4-vision-preview pricing: $0.01 / 1K prompt tokens, $0.03 / 1K completion tokens)
        usage = response.json()["usage"]
        prompt_tokens = usage["prompt_tokens"]
        completion_tokens = usage["completion_tokens"]
        print_with_color(f"Request cost is "
                         f"${'{0:.2f}'.format(prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03)}",
                         "yellow")
    return response.json()

The data coming back from OpenAI is parsed in parse_explore_rsp, which I think is the most important method. It relies on OpenAI's Thought/Action/Action Input/Observation mechanism: the model is asked to answer in that structured form, and the method parses the structured reply. Many agents of this kind are built on the same mechanism; OpenAI handles it well and sticks to the pattern on every reply, which is part of why only OpenAI has really gotten a plugin ecosystem off the ground (credit to 挺神 for that observation). There is an interesting wrinkle here: since OpenAI is expensive (each AppAgent call costs around $0.02), I wanted to switch to Alibaba Cloud's Tongyi Qianwen (通義千問), but after going through its docs I couldn't find an equivalent Thought/Action/Action Input/Observation mechanism. I'm not an expert on this, so corrections are welcome.
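To give a feel for what that parsing amounts to, here is a minimal sketch that pulls the Observation/Thought/Action/Summary sections out of a reply string; it is only an illustration under my own assumptions (the helper name parse_sections is mine), not the repo's actual parse_explore_rsp.

import re

def parse_sections(rsp_text):
    # Split a structured GPT reply into its labelled sections (sketch only)
    sections = {}
    for key in ("Observation", "Thought", "Action", "Summary"):
        # Capture everything between this label and the next label (or the end of the string)
        match = re.search(rf"{key}:(.*?)(?=\n(?:Observation|Thought|Action|Summary):|$)",
                          rsp_text, re.S)
        if match:
            sections[key] = match.group(1).strip()
    return sections

# e.g. parse_sections("Observation: ...\nThought: ...\nAction: tap(25)\nSummary: ...")
# -> {"Observation": "...", "Thought": "...", "Action": "tap(25)", "Summary": "..."}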

So, coming back around: you still have to pay OpenAI for this; otherwise you would have to heavily rewrite AppAgent's code.

Running it

Running it is simple: per the official docs, learn first, then run. I'll use CSDN as the example. Open CSDN on the phone first, then run python .\learn.py

Here I chose human demonstration; I didn't have time to run autonomous exploration. Type 2 in the terminal and press Enter, and it moves on to the next step:

What is the name of the target app?

CSDN
Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
List of devices attached:
['42954ffb']

Device selected: 42954ffb

Screen resolution of 42954ffb: 1440x3216

At this point adb commands are used to pull back the device information. AppAgent wraps adb itself: a tap, for example, is just adb shell input tap plus coordinates, which is fairly primitive (I had half expected it to wrap something like Appium). This lives in and_controller.py. Once this information is printed, it immediately asks you to describe the actions you are about to demonstrate. I entered "search for testerhome" and pressed Enter, and a window pops up.
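As a quick aside, a minimal sketch of driving a tap purely through adb, in the spirit of and_controller.py, looks something like this; the function is my own illustration, not the repo's actual controller class:

import subprocess

def adb_tap(device, x, y):
    # Tap the screen at (x, y) by shelling out to adb, the same primitive AppAgent relies on
    subprocess.run(["adb", "-s", device, "shell", "input", "tap", str(x), str(y)], check=True)

# e.g. adb_tap("42954ffb", 720, 1600) taps roughly the middle of this 1440x3216 screen

Back to the learn session. The tool first asks for the goal of the demonstration: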

Please state the goal of your following demo actions clearly, e.g. send a message to John

search for testerhome
(Press Enter and a window pops up. As the English output explains, red-tagged elements are clickable and blue-tagged elements are scrollable; see the screenshot below.) All interactive elements on the screen are labeled with red and blue numeric tags. Elements labeled with red tags are clickable elements; elements labeled with blue tags are scrollable elements.
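To make that labeling step concrete, here is a toy sketch using Pillow; this is only an illustration under my own assumptions, not AppAgent's actual drawing code, and the element list is made up:

from PIL import Image, ImageDraw

def label_screenshot(screenshot_path, elements, out_path):
    # Draw a numeric tag at each element's position: red for clickable, blue for scrollable
    img = Image.open(screenshot_path)
    draw = ImageDraw.Draw(img)
    for tag, x, y, clickable in elements:  # elements: (tag, x, y, clickable) tuples
        draw.text((x, y), str(tag), fill=(255, 0, 0) if clickable else (0, 0, 255))
    img.save(out_path)

# e.g. label_screenshot("screen.png", [(25, 1300, 140, True)], "screen_labeled.png")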

Once the image window has focus, press Enter and it closes; the prompt then lets us act on the clickable elements. For example, the search button here is labeled 25, so I need to tap element 25.

Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

tap
Which element do you want to tap? Choose a numeric tag from 1 to 83:

25

At this point the tap has succeeded, and it takes another screenshot of the screen that appears after tapping the search button.

The remaining steps work the same way; there are 5 steps in total.

Which element do you want to tap? Choose a numeric tag from 1 to 14:

3
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

text
Which element do you want to input the text string? Choose a numeric tag from 1 to 14:

3
Enter your input text below:

testerhome
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

tap
Which element do you want to tap? Choose a numeric tag from 1 to 15:

4
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop

stop
Demonstration phase completed. 5 steps were recorded.

After that, ChatGPT gets to work:

Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Starting to generate documentations for the app CSDN based on the demo demo_CSDN_2023-12-24_20-46-47

Waiting for GPT-4V to generate documentation for the element net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt

Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2

resp:

Request cost is $0.00

Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2.txt

Documentation generation phase completed. 4 docs generated.

What finally gets generated looks like this:

Here task_desc is the "search for testerhome" we entered earlier, record is the concatenation of the commands from each step, and there are also the labeled screenshots and so on.

That completes the learning phase. Next comes execution: python run.py

Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Welcome to the deployment phase of AppAgent!
Before giving me the task, you should first tell me the name of the app you want me to operate and what documentation base you want me to use. I will try my best to complete the task without your intervention. First, please enter the main interface of the app on your phone and provide the following information.

What is the name of the target app?

CSDN
Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Documentations generated from human demonstration were found for the app CSDN. The doc base is selected automatically.

List of devices attached:
['42954ffb']

Device selected: 42954ffb

Screen resolution of 42954ffb: 1440x3216

Please enter the description of the task you want me to complete in a few sentences:

search for testerhome
Round 1

Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '25':
This UI element is clickable. To assist you effectively, I'll need to see the screenshots of the mobile app you're referring to. Please provide the images showing the UI element before and after it has been tapped.



Thinking about what to do in the next step...

resp:

Request cost is $0.01

Observation:

The image shows a smartphone screen with a Google search bar at the top. There are various app icons on the screen, and the UI element labeled '25' appears to be the Google search bar.

Thought:

To complete the given task, I should tap on the Google search bar to bring up the keyboard and enter the search term "testerhome."

Action:

tap(25)

Summary:

I have tapped on the Google search bar to initiate a search for "testerhome."

Round 2

Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '3':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots of the mobile app before and after tapping the UI element, I would be able to help you describe its functionality.

This UI element can receive text input. The text input is used for the following purposes: I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help describe the functionality of the UI element labeled with the number 3.

Documentation of UI element labeled with the numeric tag '4':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help you with the description of the UI element.



Thinking about what to do in the next step...

resp:

Request cost is $0.01

Observation:

I'm sorry, but I cannot provide observations without an image being displayed. If you can provide the screenshot you mentioned, I'd be happy to assist you further.

Thought:

Since I cannot see the image, I am unable to determine the next step to complete the given task.

Action:

FINISH

Summary:

I have tapped on the Google search bar to initiate a search for "testerhome," but without the image, I cannot proceed further with the task.

Task completed successfully

This whole process takes the information recorded during the earlier learn phase, assembles it into a prompt template, and calls ChatGPT again. The code is in the screenshot below; the image_url in it is the labeled screenshot. Each step's action, together with its corresponding image, is submitted to GPT.
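In essence it is a gpt-4-vision-preview call with an image attached. A minimal sketch of that kind of call looks roughly like the following; the prompt text and file path are made up, and this is not AppAgent's exact code:

import base64

def build_vision_content(prompt_text, screenshot_path):
    # Pack a text prompt plus a base64-encoded screenshot into the `content`
    # structure that gpt-4-vision-preview accepts
    with open(screenshot_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")
    return [
        {"type": "text", "text": prompt_text},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}}
    ]

# This list would then be passed as `content` to ask_gpt4v(), which wraps it in
# the messages payload shown earlier.
content = build_vision_content("Observation/Thought/Action prompt goes here...",
                               "labeled_screenshot.png")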

In my run of run.py above, the first step did successfully send the labeled image and the tap action to ChatGPT, and GPT answered tap(25). But as you read further down, GPT starts making things up, so unfortunately the actions from my learn session were not replayed during run.

Summary

That's a full pass through AppAgent. As I told the group: the demo is sexy, reality is harsh; ChatGPT clearly doesn't know CSDN well enough. As I see it, AppAgent at this stage is just a client-side record-and-replay tool, and a very crude one at that. But the idea is promising, and my own team plans to adapt it and see whether we can actually put it to use.
