Oppo’s Multi-X Team has released X-OmniClaw, an open-source Android AI agent framework that keeps its core logic on-device and calls cloud-based language models only for heavy reasoning tasks. Most mobile AI agents run on cloud servers that host virtualized Android instances; X-OmniClaw instead executes directly on the user’s physical device, so it retains access to the phone’s camera, photos, and local files.
X-OmniClaw operates through three interconnected components that work as one continuous loop, according to Oppo’s technical documentation.
Omni Perception combines camera feeds, screen content, and voice input into a single pipeline. A vision-language model interprets the scene before the agent takes action. For example, if a user points their camera at a product and asks for its price, the agent first identifies what it’s viewing, then opens the relevant shopping app and begins searching without requiring manual input.
Omni Memory distinguishes X-OmniClaw from one-shot chatbots by maintaining context across tasks, app switches, and sessions. The agent builds long-term semantic memory from the user’s photo gallery, converting raw images into structured notes about objects, scenes, and events. According to the report, “runtime continuity is what lets X-OmniClaw operate as an ongoing device agent rather than a one-shot response system.”
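To make the idea of turning gallery images into structured notes more concrete, here is a minimal Python sketch. It is not taken from the X-OmniClaw codebase: the `PhotoNote` schema, the `GalleryMemory` class, and the plain keyword index are all assumptions made for illustration. In the real system the notes would presumably come from a vision-language model, and retrieval may well be embedding-based rather than keyword-based.

```python
from dataclasses import dataclass, field


@dataclass
class PhotoNote:
    """Structured note for one gallery image (hypothetical schema)."""
    path: str
    objects: set[str] = field(default_factory=set)
    scene: str = ""
    event: str = ""


class GalleryMemory:
    """Toy long-term memory: an inverted index from keyword to photo paths."""

    def __init__(self) -> None:
        self._notes: dict[str, PhotoNote] = {}
        self._index: dict[str, set[str]] = {}

    def add(self, note: PhotoNote) -> None:
        """Store a note and index it under its objects, scene, and event."""
        self._notes[note.path] = note
        keywords = set(note.objects)
        keywords.update(word for word in (note.scene, note.event) if word)
        for word in keywords:
            self._index.setdefault(word.lower(), set()).add(note.path)

    def search(self, *keywords: str) -> list[str]:
        """Return sorted paths of photos whose notes match every keyword."""
        if not keywords:
            return []
        hits = [self._index.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*hits))
```

With notes stored this way, a request such as "find my parrot photos" becomes a lookup like `memory.search("parrot")` against data that persists across sessions, rather than a fresh scan of raw images each time.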
Omni Action handles execution by combining XML interface data with on-device visual models and optical character recognition (OCR) to determine exactly what to tap, even on cluttered screens. The framework includes a behavior cloning feature that allows users to record a navigation path once, then replay it instantly via Android deeplink shortcuts in future sessions, bypassing multi-step app navigation.
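The following Python sketch illustrates one way structured interface data and OCR output could be combined to pick a tap point. It assumes the XML resembles an Android `uiautomator` hierarchy dump, where each `node` carries `text`, `content-desc`, `clickable`, and `bounds="[x1,y1][x2,y2]"` attributes; the article does not specify the format X-OmniClaw actually uses. The function name `find_tap_target` and the `(text, box)` shape of the OCR results are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]

# Matches the "[x1,y1][x2,y2]" bounds format of a uiautomator-style dump.
BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def center(bounds: str) -> Optional[Point]:
    """Return the center of a bounds string, or None if it is malformed."""
    match = BOUNDS.fullmatch(bounds)
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return (x1 + x2) // 2, (y1 + y2) // 2


def find_tap_target(
    ui_xml: str,
    label: str,
    ocr_boxes: Iterable[Tuple[str, Box]] = (),
) -> Optional[Point]:
    """Prefer a clickable XML node with a matching label; else fall back to OCR."""
    wanted = label.strip().lower()
    for node in ET.fromstring(ui_xml).iter("node"):
        text = (node.get("text") or node.get("content-desc") or "").strip().lower()
        if text == wanted and node.get("clickable") == "true":
            point = center(node.get("bounds", ""))
            if point is not None:
                return point
    # Fallback: text seen on screen by OCR but absent from the XML tree.
    for text, (x1, y1, x2, y2) in ocr_boxes:
        if text.strip().lower() == wanted:
            return (x1 + x2) // 2, (y1 + y2) // 2
    return None
```

Trying the XML tree first and OCR second is a design choice of this sketch: structured data gives exact element bounds when it is available, while OCR covers custom-drawn or image-based controls that expose no text in the hierarchy.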
Oppo demonstrated several practical applications of X-OmniClaw:
Product identification and pricing: The agent identifies a physical product via camera, opens Taobao, scrolls through results, and returns a price summary without requiring any typing.
Educational assistance: A floating on-screen companion helps users work through math exercises step by step, autonomously reading screen content, processing each question, and advancing when complete.
Video creation from gallery: When asked to assemble a highlight video from parrot-themed photos, the system scans the gallery using semantic memory to find matching images, opens CapCut’s video editor via deeplink, batch-selects files, and generates the video. The report indicates this process, which previously required “a few minutes or longer,” is reduced to a handful of automated steps.
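The video example depends on jumping straight into an app screen via a deeplink instead of tapping through menus. A minimal Python sketch of that record-once, replay-later idea follows. The `Step` and `ShortcutStore` names are hypothetical and nothing here reflects X-OmniClaw’s actual implementation; any URI passed in is a placeholder, not a real app deeplink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent action (hypothetical representation)."""
    action: str  # e.g. "tap" or "open_deeplink"
    target: str  # a UI label for taps, a URI for deeplinks


class ShortcutStore:
    """Maps a task name to a deeplink recorded during an earlier session."""

    def __init__(self) -> None:
        self._deeplinks: dict[str, str] = {}

    def record(self, task: str, deeplink: str) -> None:
        """Remember the deeplink that reaches this task's destination screen."""
        self._deeplinks[task] = deeplink

    def plan(self, task: str, manual_steps: list[Step]) -> list[Step]:
        """Use the recorded deeplink if one exists, else the manual path."""
        link = self._deeplinks.get(task)
        if link is None:
            return list(manual_steps)
        return [Step("open_deeplink", link)]
```

Before a path is recorded, `plan` returns the full multi-step navigation; afterwards the same task collapses to a single `open_deeplink` step, which is the shortcut effect the article describes.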
X-OmniClaw extends an architecture pioneered by OpenClaw, an open-source agent framework that reached over 373,000 GitHub stars and was eventually backed by OpenAI. Hermes Agent by Nous Research advanced the concept further with a self-improving learning loop that compounds capabilities over time. Both projects ran primarily on desktop hardware. X-OmniClaw adapts this architecture for smartphones: it builds on the open-source HermesApp codebase, draws on OpenClaw’s structured skill model for inspiration, and customizes the result for the multimodal, always-on nature of mobile devices.
The code is available on GitHub, and Oppo has committed to releasing all assets and to updating the project as the system evolves.