CEO-Bench

Can agents play the long game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu
Princeton University
  • Today, agents execute individual tasks. Tomorrow, agents steer organizations toward long-term goals.
  • We introduce CEO-Bench to measure this steering intelligence. In CEO-Bench, agents operate a simulated AI startup for 500 days.
We use cash balance as the performance metric. This plot shows cash balance over time for the best run of each model.

A Story

Cupertino, 1997.

Apple was ninety days from bankruptcy. Inside a conference room at headquarters, the company's leaders faced the possibility that Apple might not survive.

Steve Jobs walked to the whiteboard and drew a simple grid: Consumer and Professional, Desktop and Portable. Four boxes to hold the whole company. He made the decision: Apple would build only for those four boxes.

It was a painful cut. Products disappeared. Teams were broken apart. But the decision gave Apple something it had lost: focus. The iMac came next. Then the iPod. Then the iPhone. A company near collapse became one of the most valuable companies in the world.

Comic illustration of Apple's 1997 pivot from crisis to focused product strategy

The Next Frontier: Steering Intelligence

Steve Jobs showed a kind of strategic intelligence that has appeared throughout history, driving some of humanity's most monumental achievements.

This kind of intelligence is fundamentally different from intelligence in AI agents today. Today, we build agents that get rapidly better at performing individual tasks like coding and writing. To contribute more value, agents tomorrow need to steer organizations toward long-term goals.

We build CEO-Bench as a first step toward measuring Steering Intelligence.

Today, we measure how well AI agents perform isolated tasks. The next frontier is measuring how well they steer systems across long horizons toward distant goals.

In CEO-Bench, we aim to measure the combination of four core skills needed to steer systems through real-world challenges:

  1. Navigating long horizons amid uncertainty
  2. Acquiring information in noisy environments
  3. Adapting to a changing world
  4. Orchestrating multiple moving parts toward a coherent goal

We evaluate on a canonical real-world task: operating a simulated startup for 500 days.

We give agents $1M in starting cash and use the cash balance at the end of the simulation as the performance metric. The agent operates through a programmable interface with access to business databases, company management tools, and social media. Outcomes are driven by a partially observable, noisy, and evolving market with delayed and coupled consequences.

How does CEO-Bench work?

How an agent makes money. The agent makes profits through customer subscription payments or in-product ad monetization. There are 26 customer segments, with customers in each segment sharing similar price and quality preferences. Six customer segments are initially visible, and the agent must pay for market research to discover the rest. We abstract the company product that customers subscribe to as a numerical quality measure. Each customer subscribes to a subscription plan if the quality for that plan is higher than the customer's minimum accepted quality at the plan's price. For enterprise customers, agents and customers negotiate the price in multiple turns.
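The subscription rule described above can be sketched as a simple threshold check. This is a minimal illustration, not the simulator's implementation: the `Customer` class, its two parameters, and the linear form of `min_quality_at` are all assumptions made for the example, since the text does not say how minimum accepted quality depends on price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # Hypothetical preference parameters (not taken from CEO-Bench).
    base_quality: float        # quality demanded even at a price of zero
    quality_per_dollar: float  # extra quality demanded per dollar of price

    def min_quality_at(self, price: float) -> float:
        """Minimum accepted quality at a given price (assumed to be linear)."""
        return self.base_quality + self.quality_per_dollar * price


def subscribes(customer: Customer, plan_quality: float, plan_price: float) -> bool:
    """Rule from the text: the customer subscribes if the plan's quality is
    higher than the customer's minimum accepted quality at the plan's price."""
    return plan_quality > customer.min_quality_at(plan_price)


if __name__ == "__main__":
    customer = Customer(base_quality=40.0, quality_per_dollar=0.5)
    # At a price of 20 the threshold is 50, so a quality of 55 is accepted.
    print(subscribes(customer, plan_quality=55.0, plan_price=20.0))
    # At a price of 40 the threshold is 60, so the same quality is rejected.
    print(subscribes(customer, plan_quality=55.0, plan_price=40.0))
```

Under this toy model, raising the price without raising quality eventually pushes a customer below the threshold, which is the basic tension between pricing and product spending.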

Improving product quality. The agent has a wide range of options to control product quality. The most direct way is to spend on daily product development or on larger improvements through research projects. The agent can also target specific customer segments for cheaper product improvement. Other factors such as model tiers, customer support spending, in-product ad monetization, and usage quotas can also affect product quality differently across customer segments. Throughout the simulation, competitors intermittently raise customers' quality expectations, forcing agents to continually spend on product quality improvement.

Acquiring customers. Agents acquire new customers by spending on marketing channels. Different marketing channels have different acquisition rates across customer segments. Reputation in each customer segment also affects acquisition speed. Each customer's satisfaction changes its segment's reputation, and reputation propagates through different groups at different rates. An agent can indirectly monitor reputation through a simulated social media platform. An agent can also post or reply on social media, and customers' simulated reactions to social media affect the acquisition rate.

What can an agent do?

Each day, the agent can take an unlimited number of turns, using any of 34 tools in the categories listed in the table below. Each tool accepts fine-grained structured arguments, so agents can compose a combinatorially large space of possible actions.

| Category | Actions | Example tools |
| --- | --- | --- |
| Database query | Query 19 business SQL databases and conduct data analytics | query |
| Pricing and monetization | Set prices, usage quotas, discounts, and in-product ads | pricing.set_prices, pricing.set_usage_quotas |
| Growth and market expansion | Allocate targeted advertising spend and promotion across channels and customer groups | marketing.set_targeted_ad_spend, marketing.set_lead_promotion |
| Product quality and R&D | Choose model tiers, fund day-to-day development, and launch research projects | pricing.set_model_tiers, research.start_research_project |
| Operations and reliability | Buy infrastructure capacity and fund customer support | infrastructure.set_capacity_tier, analytics.set_targeted_ops_spend |
| Enterprise sales | Conduct multi-turn negotiations over price and plan with enterprise prospects and renewals | enterprise.send_enterprise_deal, enterprise.reject_enterprise_deal |
| Information acquisition | Pay for market research to discover new customer groups and learn more about existing groups | market.research_market, market.research_group |
| Public communication | Monitor social media for customer complaints, competitor news, and economic trends, then post or reply to influence growth | marketing.post_social_media, analytics.get_social_posts |

How does an agent interact with the simulator?

We design the interface as a programmable operating surface, so agents can effectively manage granular action spaces and organize them into custom workflows.

Composable action interface in Python. We make it easy to evaluate terminal-based computer-use agents on CEO-Bench by exposing the action surface through a Python package, novamind_api. An agent manages the company by calling functions from novamind_api in a Python script and executing the script in its terminal. This design gives an agent maximum flexibility to build its own infrastructure on top of the API.

Granular action spaces. We allow agents to act at fine granularity to create a rich space of strategic tradeoffs, failure modes, and opportunities for adaptation. Although the interface contains a finite set of tools, each tool accepts fine-grained structured arguments, so agents can compose a combinatorially large space of possible actions.

Large-scale and realistic databases. We give the agent access to a 19-table operational database covering orders, contracts, subscriptions, the cash ledger, the social-media feed, configuration history, ad-channel attribution, and support tickets, among others. The schema mirrors what a real software company's analytics stack would expose.

Social media. The agent can read a simulated public feed of customer complaints, competitor announcements, and macroeconomic trends. Agents can also reply and post on social media. Reactions to the agent's posts on social media can influence the rate of new customer acquisition.

How do we make the simulator robust and challenging?

Maximize realism with granular simulation. The simulator models 26 customer segments and individual customers within each segment rather than only aggregate demand. Each customer has its own acquisition path, subscription state, price exposure, usage, satisfaction, and churn trajectory. Customers are also organized into diverse groups with different needs, budgets, price sensitivities, ad channel effectiveness, support expectations, and behavioral patterns.

Robust simulation with mechanistic rules. The world emulates real business behavior while maintaining stable cause-and-effect relationships. Almost all simulator outcomes are generated by explicit mechanisms rather than by using an LLM as an opaque judge.

Consistent simulation under stochasticity. While we inject stochasticity into world dynamics, we maintain consistency across runs with independent random number generators for different simulator components. Under the same random seed, after calling the market research tool multiple times, the agent always discovers the same sequence of new market segments, independent of actions in other areas.
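One common way to get this kind of consistency is to derive a separate random stream for each simulator component from a single master seed. The sketch below illustrates the idea only; `ToyWorld`, `component_rng`, the component names, and the segment labels are hypothetical and are not taken from the CEO-Bench code.

```python
import hashlib
import random

# Hypothetical labels for the 20 segments that start out hidden.
HIDDEN_SEGMENTS = [f"segment_{i:02d}" for i in range(7, 27)]


def component_rng(master_seed: int, component: str) -> random.Random:
    """Derive a reproducible RNG that belongs to a single component."""
    digest = hashlib.sha256(f"{master_seed}:{component}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


class ToyWorld:
    def __init__(self, seed: int) -> None:
        # Each component owns its stream, so draws in one never shift another.
        self.research_rng = component_rng(seed, "market_research")
        self.demand_rng = component_rng(seed, "demand_noise")
        self.hidden = list(HIDDEN_SEGMENTS)

    def research_market(self) -> str:
        """Reveal one hidden segment using only the research stream."""
        index = self.research_rng.randrange(len(self.hidden))
        return self.hidden.pop(index)

    def daily_demand_noise(self) -> float:
        """Consume randomness from an unrelated stream."""
        return self.demand_rng.gauss(0.0, 1.0)


def discovery_sequence(seed: int, noise_draws_between: int) -> list[str]:
    """Run five research calls, with other random activity in between."""
    world = ToyWorld(seed)
    found = []
    for _ in range(5):
        for _ in range(noise_draws_between):
            world.daily_demand_noise()
        found.append(world.research_market())
    return found


if __name__ == "__main__":
    # Same seed, different amounts of unrelated activity: same discoveries.
    print(discovery_sequence(42, 0) == discovery_sequence(42, 10))
```

With a single shared generator, the extra demand draws would shift every later research draw; with per-component streams, the discovery order depends only on the seed and the number of research calls.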

Hidden information and indirect feedback. CEO-Bench tests whether agents can gather information in a partially observable world. The agent receives only information that a real operator could plausibly observe: dashboards, database records, social-media posts, research reports, and negotiation history. It does not observe true customer satisfaction, latent willingness to pay, churn propensity, competitor schedules, or demand parameters.

Interconnected world dynamics. We design the simulated world to make it difficult to isolate a single causal relationship and hill-climb on it. Every decision can influence many other parts of the market. Reputation propagates across related groups, so a quality failure in one enterprise segment can spill into nearby segments and eventually affect consumer demand.

Delayed and uncertain consequences. Many actions have delayed and uncertain effects, forcing long-horizon decision making under uncertainty. Costs may appear immediately, while corresponding revenue, retention, research, or reputation effects arrive weeks later.
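The delay described above can be illustrated with a small event queue: the cost is charged on the day of the decision, while the payoff is scheduled for a later day. This is a toy sketch that covers only the delay, not the uncertainty; `ToyLedger` and its methods are hypothetical names and do not describe how the simulator is implemented.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    cash: float
    day: int = 0
    # Min-heap of (due_day, sequence_number, payoff) for effects not yet applied.
    _pending: list = field(default_factory=list)
    _counter: int = 0

    def spend_with_delayed_return(self, cost: float, payoff: float, delay_days: int) -> None:
        """Charge the cost today and schedule the payoff for a later day."""
        self.cash -= cost
        heapq.heappush(self._pending, (self.day + delay_days, self._counter, payoff))
        self._counter += 1

    def advance_day(self) -> None:
        """Move one day forward and apply every effect that has come due."""
        self.day += 1
        while self._pending and self._pending[0][0] <= self.day:
            _, _, payoff = heapq.heappop(self._pending)
            self.cash += payoff


if __name__ == "__main__":
    ledger = ToyLedger(cash=1000.0)
    ledger.spend_with_delayed_return(cost=300.0, payoff=500.0, delay_days=21)
    for _ in range(20):
        ledger.advance_day()
    print(ledger.day, ledger.cash)  # the payoff has not arrived yet
    ledger.advance_day()
    print(ledger.day, ledger.cash)  # the payoff lands on day 21
```

For the first three weeks the decision looks like a pure loss, which is why an agent that judges actions only by their immediate effect on cash would tend to underinvest.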

Non-stationary environment. Agents must continually gather new information and adapt because the environment changes over the course of a simulation. Competitors place adaptive pressure on product quality, customer behavior drifts over time, and macroeconomic trends affect willingness to pay and enterprise seat counts.

How to run CEO-Bench

Check out our code repo. We offer three ways of running the benchmark:

  1. Simply paste a line into any coding agent capable of using a terminal, and it can download the game and start playing.
  2. Replicate our experiment with a minimal bash script using the agent harness.
  3. Customize parameters of the simulator.

Most state-of-the-art models struggle to complete the simulation without going bankrupt. Four models (GPT-5.5, Claude Opus 4.7, Kimi K2.6, and Claude Sonnet 4.6) end with positive cash on their best run, but only GPT-5.5 finishes above its $1M starting balance. This preliminary evaluation suggests that GPT-5.5 demonstrates high-upside strategic behavior, Claude Opus 4.7 survives more conservatively, and most other models fail to coordinate growth, quality, and cash flow.

Cash balance over time for all runs of each model.
| Model | Bankruptcy | Max final cash | Max survival days | Mean survival days | Turns/week | Best run API cost |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 2/3 | $21,297,707 | 500 | 333.7 ± 229.7 | 34.7 | $200.49 |
| Claude Opus 4.7 | 0/3 | $389,959 | 500 | 500.0 ± 0.0 | 14.6 | $128.72 |
| Kimi K2.6 | 1/3 | $98,050 | 500 | 343.0 ± 110.0 | 30.5 | -- |
| Claude Sonnet 4.6 | 2/3 | $69,766 | 500 | 282.3 ± 136.0 | 13.3 | $82.84 |
| GLM 5.1 | 3/3 | $0 | 324 | 214.7 ± 91.1 | 51.5 | -- |
| Claude Haiku 4.5 | 3/3 | $0 | 231 | 144.7 ± 70.5 | 23.1 | $6.68 |
| Gemini 3 Flash | 3/3 | $0 | 226 | 154.0 ± 37.0 | 18.5 | $2.98 |
| DeepSeek V4 Pro | 3/3 | $0 | 176 | 114.3 ± 38.6 | 19.3 | -- |
| Grok 4.20 | 3/3 | $0 | 37 | 28.3 ± 8.5 | 8.2 | $0.75 |
| Estimated max final cash | -- | $2,200,000,000 | -- | -- | -- | -- |
Additional details of benchmark results.

We have released all agent trajectories in our interactive trajectory viewer.

We conducted a preliminary analysis of the agent trajectories. Below are some examples of our findings. Read more findings in our paper.

Example 1: GPT-5.5 explores a larger strategy space; Claude Opus 4.7 limits itself to a narrow strategy space. GPT-5.5 adapts frequently as conditions change, trying a range of strategies such as scaling acquisition, adjusting model tiers, modifying promotions, and reallocating support or development spend. In contrast, Claude Opus 4.7 tends to respond to setbacks by repeatedly cutting spend and preserving cash, which may help it survive until the final days but can also limit recovery and contribute to its weaker performance. Quantitatively, GPT-5.5 distributes its actions more evenly across tools than Claude Opus 4.7 does.

Example memos written by GPT-5.5 (top) and Claude Opus 4.7 (bottom) in their workspaces during the best trajectory of each model. GPT-5.5 actively explores and adjusts across diverse strategies, while Claude Opus 4.7 largely confines its decisions to a single strategic direction.
Average per-week tool usage frequency for GPT-5.5 and Claude Opus 4.7 (top 10 tools per model). GPT-5.5 takes actions more frequently and distributes actions more evenly across tools.
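One way to put a number on "distributes actions more evenly across tools" is the normalized Shannon entropy of the tool-call distribution. The text does not say which measure was used for the comparison above, so the choice of entropy here is an assumption, and the two call logs in the example are invented for illustration rather than taken from the released trajectories.

```python
import math
from collections import Counter


def normalized_entropy(tool_calls: list[str]) -> float:
    """Shannon entropy of the tool-call distribution, scaled to [0, 1].

    1.0 means calls are spread evenly across the tools that were used;
    values near 0 mean most calls went to a single tool.
    """
    counts = Counter(tool_calls)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


if __name__ == "__main__":
    # Invented call logs over four tool names from the table of tools.
    even_log = ["query", "pricing.set_prices", "market.research_market",
                "marketing.post_social_media"] * 5
    skewed_log = ["query"] * 17 + ["pricing.set_prices", "market.research_market",
                                   "marketing.post_social_media"]
    print(round(normalized_entropy(even_log), 3))
    print(round(normalized_entropy(skewed_log), 3))
```

Because the score is normalized by the number of distinct tools actually used, it measures evenness rather than breadth: an agent that splits its calls equally between only two tools also scores 1.0, so the number of distinct tools is worth reporting alongside it.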

Example 2: GPT-5.5 frequently anticipates the future in its memos. We found that GPT-5.5 frequently plans for future contingencies when making decisions. When it takes an action, it tends to spell out possible future changes and the conditions that would trigger them. It also uses the keyword "if" more frequently than other models.

Examples of planning in GPT-5.5's memos. The agent frequently anticipates scenarios and solutions with "if-then" contingencies.
Frequency of "if" in agent's memos.
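A keyword count of this kind is simple to reproduce on any memo text. The sketch below counts whole-word occurrences of "if" and normalizes per 1,000 words; the tokenization and the per-1,000-words normalization are assumptions, since the text does not describe how the frequency was computed.

```python
import re

# Treat runs of letters and apostrophes as words (a simplifying assumption).
WORD_RE = re.compile(r"[A-Za-z']+")


def if_rate_per_1000_words(memo_text: str) -> float:
    """Occurrences of the standalone word "if" per 1,000 words of memo text."""
    words = WORD_RE.findall(memo_text.lower())
    if not words:
        return 0.0
    if_count = sum(1 for word in words if word == "if")
    return 1000.0 * if_count / len(words)


if __name__ == "__main__":
    # Invented memo text, not a quote from any agent trajectory.
    memo = "If churn rises next week, cut ad spend. If cash recovers, resume research."
    print(round(if_rate_per_1000_words(memo), 1))
```

Matching whole words keeps "shift" or "gift" from counting as "if", and normalizing by length keeps longer memos from looking more contingent simply because they contain more words.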

Example 3: GPT-5.5 writes sophisticated code to reason about future cash and customer preferences. In its best trajectory, the GPT-5.5 agent wrote its own code files to probe the simulator and negotiation history: running simulations to forecast cash under different scenarios and inferring latent enterprise-customer price and quality preferences from noisy negotiation outcomes.

Example code files written by the GPT-5.5 agent during its best trajectory. (a) The agent runs its own simulation to forecast cash under different scenarios. (b) The agent infers latent enterprise-customer price and quality preferences by mining noisy negotiation outcomes.
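To give a sense of what scenario-based cash forecasting can look like, here is a minimal Monte Carlo sketch. It is not the code the GPT-5.5 agent wrote: the function name, the Gaussian revenue noise, the fixed daily cost, and the choice to record a bankrupt run as ending at zero cash are all assumptions made for illustration.

```python
import random
from statistics import mean


def forecast_cash(
    start_cash: float,
    daily_revenue: float,
    daily_cost: float,
    revenue_noise: float,
    days: int,
    trials: int = 1000,
    seed: int = 0,
) -> tuple[float, float]:
    """Return (mean final cash, fraction of trials that went bankrupt)."""
    rng = random.Random(seed)
    finals = []
    bankruptcies = 0
    for _ in range(trials):
        cash = start_cash
        for _ in range(days):
            # Noisy revenue arrives each day; the cost is fixed.
            cash += rng.gauss(daily_revenue, revenue_noise) - daily_cost
            if cash < 0:
                bankruptcies += 1
                cash = 0.0  # assumption: a bankrupt run ends at zero cash
                break
        finals.append(cash)
    return mean(finals), bankruptcies / trials


if __name__ == "__main__":
    # Two hypothetical scenarios with invented numbers: cautious vs. aggressive spend.
    for label, revenue, cost in [("cautious", 9_000.0, 8_000.0),
                                 ("aggressive", 16_000.0, 15_000.0)]:
        avg_cash, p_bankrupt = forecast_cash(
            start_cash=100_000.0, daily_revenue=revenue, daily_cost=cost,
            revenue_noise=20_000.0, days=90, trials=2000, seed=1,
        )
        print(label, round(avg_cash), round(p_bankrupt, 3))
```

Comparing scenarios on both expected cash and bankruptcy risk, rather than on expected cash alone, is what makes a forecast like this useful when a single bad stretch ends the run.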

CEO-Bench shows a gap between existing models' local tool competence and crucial sustained strategic skills: agents built on existing models can take plausible actions but fail when those actions must compound under delayed feedback, hidden state, and non-stationarity. To develop agents beyond isolated task executors, we need evaluations that ask whether they can organize evolving systems toward distant goals. CEO-Bench is one step toward that future: building agents and training models that do not merely answer requests, but help steer long-running organizations through uncertainty.

Read more in our paper.

Citation
@misc{ceobench2026,
  title = {CEO-Bench},
  author = {Chen, Haozhe and Narasimhan, Karthik and Liu, Zhuang},
  year = {2026},
  note = {Citation forthcoming}
}