MM-Navigator demonstrates that large multimodal models can perform zero-shot GUI navigation, combining accurate screen interpretation with precise action localization. The system achieved 91% accuracy in generating reasonable action descriptions and 75% accuracy in executing correct actions on single-step instructions.
These findings highlight the substantial improvements of MM-Navigator over previous GUI navigators and establish a foundation for future research on the GUI navigation task.
This work underscores the potential of GPT-4V to effectively interpret smartphone GUIs and fulfill user instructions accurately, improving user interactions with devices.