AI startup OpenAGI has unveiled ‘Lux’, a computer-use agent that the company claims solves the reliability issues plaguing current AI agents. OpenAGI asserts an 83.6% success rate on the Online-Mind2Web benchmark, a score that would leapfrog flagship models from OpenAI and Anthropic by over 20 percentage points.
Unlike traditional Large Language Models (LLMs) trained on static text, the Lux foundation model utilizes ‘Agentic Active Pre-training,’ learning directly from screenshots and action sequences. By processing visual data, the model reportedly controls native desktop applications like Excel and Slack at one-tenth the inference cost of frontier competitors.
Addressing the ‘illusion of progress’ cited by researchers regarding web agents, the startup has also announced a partnership with Intel to optimize Lux for local execution on edge devices.
Shattering the ‘Illusion of Progress’
While the industry has been flooded with demonstrations of autonomous agents, independent research suggests a significant gap between marketing claims and operational reality. A recent study by researchers at Ohio State University and UC Berkeley exposed how many agents were overfitting to static, cached datasets rather than navigating the chaotic, dynamic nature of the live web.
The researchers argued that this disparity between controlled demos and real-world failure rates has fueled premature celebration of current agent capabilities.
Huan Sun, a researcher with the OSU NLP Group, stated, “It seemed that highly capable and practical agents were maybe indeed just months away. However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents.”
To rigorously test these systems, the research team developed a new evaluation framework designed to break agents that rely on memorization. This suite forces models to interact with live websites where elements shift, pop-ups appear, and workflows change unpredictably.
The methodology centers on ‘Online-Mind2Web’, a newly introduced benchmark designed to simulate the breadth of the modern internet. Comprising 300 distinct tasks across 136 real-world websites, the dataset moves beyond static pages to test agents in live environments.
Following a manual evaluation of five frontier models, the results painted a concerning picture of current capabilities: with the notable exception of OpenAI’s Operator, most recent agents failed to outperform ‘SeeAct’, a rudimentary model released back in January 2024.
Under these harsher conditions, the performance of established market leaders dropped precipitously. OpenAI’s Operator, which debuted with significant fanfare in January, managed a success rate of 61.3%. Anthropic’s offering, widely covered following the release of its Computer Use feature, scored 61.0%.
Lux’s reported score of 83.6% represents a generational leap over these incumbents, suggesting that its underlying architecture handles the noise of the open web more effectively than models adapted from standard LLMs. Even simple tasks like booking flights or filtering e-commerce results have historically tripped up “highly capable” agents, a trend OpenAGI aims to reverse.
Scientific rigor remains the primary hurdle for validating these claims. Self-reported benchmarks often diverge from independent reproduction, particularly when the evaluation environment involves variable network conditions and live site updates.
Sun further warned, “As a scientific field, we must caution against over-optimism, especially when the supporting data may be insufficient or biased.”
Architectural Shift: Actions Over Text
Driving this performance leap is a fundamental rethinking of how models learn to interact with interfaces. Most current agents are essentially text-prediction engines forced to interpret visual user interfaces (UIs) as code or accessibility trees. OpenAGI argues that this translation layer introduces latency and error.
Explaining the divergence from standard training methodologies, the company highlighted the limitations of corpus-based learning.
Zengyi Qin, CEO of OpenAGI, explained, “Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text. By contrast, our model learns to produce actions.”
This “Agentic Active Pre-training” creates a self-reinforcing feedback loop. Instead of passively ingesting data, the model interacts with environments during its training phase, learning the consequences of clicks, scrolls, and keystrokes in real-time.
Describing how the system improves itself through usage, the CEO noted the compounding value of autonomous interaction. “The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training,” Qin told VentureBeat.
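The exploration-feedback loop Qin describes can be sketched in miniature. The toy environment, action names, and update rule below are illustrative assumptions, not OpenAGI’s actual training pipeline; the sketch only shows the shape of the loop: act, observe the consequence, and feed the experience back as training data.

```python
import random

class ToyDesktopEnv:
    """Stand-in environment: a hypothetical UI with a few clickable
    elements. Purely illustrative, not the actual Lux setup."""
    def __init__(self):
        self.elements = ["button_ok", "menu_file", "field_search"]

    def observe(self):
        # A screenshot stand-in: here, just the visible element names.
        return tuple(self.elements)

    def step(self, action):
        # Reward clicking the element that completes this toy task.
        return 1.0 if action == "button_ok" else 0.0

def active_pretraining(env, rounds=300, seed=0):
    """Sketch of an exploration-feedback loop: the agent acts, records
    (observation, action, outcome) tuples, and that experience becomes
    new training data for a toy 'model' of per-action value."""
    rng = random.Random(seed)
    experience = []     # grows into the training corpus
    action_values = {}  # running mean reward per action
    for _ in range(rounds):
        obs = env.observe()
        action = rng.choice(list(obs))            # explore
        reward = env.step(action)                 # consequence
        experience.append((obs, action, reward))  # fed back as data
        n, mean = action_values.get(action, (0, 0.0))
        action_values[action] = (n + 1, mean + (reward - mean) / (n + 1))
    return experience, action_values
```

After enough rounds, the action with the highest estimated value is the one that actually completed the task, knowledge the model generated for itself rather than ingested from a static corpus.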
The practical application of this architecture is segmented into three distinct operational modes, each optimized for different types of enterprise workflows. This segmentation allows developers to balance speed against reasoning depth depending on the complexity of the task at hand.
According to the official product documentation, the modes are defined as follows:
Tasker: Strictly follows step-by-step instructions, with ultra-stable, controllable execution.
Actor: Ideal for immediate tasks, completing actions at near-instant speed.
Thinker: Understands vague, complex goals, performing hour-long executions.
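A developer balancing speed against reasoning depth might route tasks along these lines. The `Task` fields and `choose_mode` heuristics below are hypothetical and are not part of the Lux SDK; only the three mode names come from the product documentation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    has_explicit_steps: bool  # e.g., a scripted, checklist-style workflow
    is_vague: bool            # e.g., "research this market segment"

def choose_mode(task: Task) -> str:
    """Hypothetical dispatcher over the three documented modes."""
    if task.has_explicit_steps:
        return "Tasker"   # ultra-stable, step-by-step execution
    if task.is_vague:
        return "Thinker"  # deep reasoning, hour-long runs
    return "Actor"        # fast, immediate actions
```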
Scope of control is another critical differentiator. While early iterations of Microsoft’s Researcher agent and Google’s Gemini 2.5 Computer Use focused heavily on browser-based workflows, Lux is designed to operate native desktop applications.
This capability extends to complex software suites like Adobe Creative Cloud and Microsoft Excel, where proprietary interfaces often confuse standard web agents.
The Edge Frontier & Market Skepticism
Beyond raw performance metrics, the startup is betting on a hybrid deployment model to court enterprise clients wary of cloud costs and data privacy. By optimizing for edge devices, OpenAGI aims to move inference from centralized servers to local hardware, reducing the latency penalties that make remote desktop agents feel sluggish.
Validating this hardware-centric approach, the company announced a strategic collaboration to ensure the model runs efficiently on consumer-grade silicon. “We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model,” Qin confirmed.
Cost efficiency is central to this strategy. OpenAGI claims Lux operates at one-tenth the inference cost of frontier models for equivalent tasks. This reduction is critical for “Agentic AI” workflows, which often require hundreds of inference steps to complete a single objective, such as researching a market segment or reconciling a spreadsheet.
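The arithmetic behind that claim is straightforward: because agentic workflows chain many model calls, per-step savings multiply across the whole task. The step count and absolute prices below are invented for illustration; only the tenfold ratio reflects OpenAGI’s claim.

```python
def task_cost(steps: int, cost_per_step: float) -> float:
    """Total inference cost for one agentic task."""
    return steps * cost_per_step

# Hypothetical figures: a 400-step task at $0.01 vs. $0.001 per step.
frontier = task_cost(steps=400, cost_per_step=0.01)   # ~$4.00 per task
lux_like = task_cost(steps=400, cost_per_step=0.001)  # ~$0.40 per task
```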
Safety mechanisms have also been prioritized to address the risks of autonomous execution. Previous incidents, such as when ChatGPT Agent was observed bypassing security CAPTCHAs, have highlighted the potential for agents to act unpredictably. Lux reportedly includes internal reasoning steps that force the model to pause and refuse sensitive requests—such as copying bank details—rather than blindly executing the user’s prompt.
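A guardrail of this kind can be sketched as a check that runs before execution. The pattern list and function below are a minimal illustration of the pause-and-refuse idea, not Lux’s actual safety implementation, which reportedly operates inside the model’s reasoning steps rather than as an external filter.

```python
SENSITIVE_PATTERNS = ("bank", "password", "credit card")

def guarded_execute(prompt: str, execute):
    """Illustrative guardrail: refuse requests touching sensitive data
    instead of executing the user's prompt blindly."""
    if any(p in prompt.lower() for p in SENSITIVE_PATTERNS):
        return "refused: request involves sensitive data"
    return execute(prompt)
```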
Despite the impressive specifications, skepticism remains vital. The 83.6% success rate is currently a self-reported metric found in press materials and has not yet been independently verified on the public Online-Mind2Web leaderboard. Until third-party developers can reproduce these results using the Lux SDK, the claim of “crushing” OpenAI and Anthropic stands as a bold, yet unproven, assertion.

