In autumn 2025, Nof1.ai ran a competition, Alpha Arena, whose results will be required reading for the next several months. Six frontier large language models each received $10,000 in USDC and were told to autonomously trade crypto perpetual futures on Hyperliquid. No human oversight. No safety wrappers. Identical prompts. Pure model against market.
The results are clear enough to anchor an entire debate. They're also uncomfortable, because they land exactly where the entire AI industry's marketing has been making increasingly ambitious promises since 2024: autonomous AI agents as the next application wave.
The numbers
| Model | Provider | Final return |
|---|---|---|
| Qwen 3 Max | Alibaba | +22.87% |
| DeepSeek V3.1 | DeepSeek | +4–5% |
| Grok 4 | xAI | −60%+ |
| Claude 4.5 Sonnet | Anthropic | −60%+ |
| Gemini 2.5 Pro | Google | −60%+ |
| GPT-5 | OpenAI | −62.66% |
Four of six models burned more than 60 percent of their starting capital. The worst, GPT-5, lost $6,266 in two weeks. The two that finished positive both came from China. That's a finding that needs explanation — but not in the direction one's instincts go.
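The dollar figure follows directly from the percentage and the $10,000 starting stake. As a minimal sketch of that arithmetic (the function names are illustrative, not taken from Nof1.ai), the same calculation also shows how large a gain the remaining capital would need just to get back to the start:

```python
START_CAPITAL = 10_000.0  # USDC per model, as stated above


def final_equity(return_pct, start=START_CAPITAL):
    """Account value after a given percentage return."""
    return start * (1 + return_pct / 100)


def gain_to_break_even(return_pct):
    """Percentage gain needed on the remaining capital to recover a loss."""
    remaining = 1 + return_pct / 100
    return (1 / remaining - 1) * 100


gpt5_equity = final_equity(-62.66)
gpt5_loss = START_CAPITAL - gpt5_equity
gpt5_recovery = gain_to_break_even(-62.66)

print(f"GPT-5 final equity: ${gpt5_equity:,.0f}")
print(f"GPT-5 loss: ${gpt5_loss:,.0f}")
print(f"Gain needed to break even: {gpt5_recovery:.1f}%")
```

The asymmetry is the point: a loss of this size needs a gain far larger than the loss percentage to undo.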
Why this doesn't mean "China wins AI"
Before we get to the actual lesson, a necessary caveat. The temptation to construct a geopolitical thesis from this result is real. It would also be wrong. Here are the more likely explanations:
Training data bias. The Chinese models were likely trained on significantly more Chinese financial content. Crypto markets are driven in non-trivial part by Asian traders. That could give a marginal advantage in pattern-matching certain volume profiles.
Sample size. We're talking about one two-week competition, six models, one market regime. Statistically, that's a single observation. Run it again in a different regime — say, a clear bull run instead of the volatile Q4 2025 period — and results might look different.
Per-model behavioral biases. The organizer, Jay Azhang, observed that each model showed an "investing personality." Grok, GPT-5, and Gemini wanted to short frequently. Claude Sonnet almost never shorted. That's not market understanding; those are training artifacts. The two Chinese models happened to have the right personality constellation for this specific regime.
The honest reading isn't "Qwen is better than Claude at trading." The honest reading is: all six models were structurally unsuited for autonomous trading. Two got lucky with their biases.
What actually happened
LLMs, technically, are probability engines over tokens. They predict the next token in a sequence based on statistical patterns from their training data. That's an extraordinarily useful capability for tasks like translation, code generation, summarization. It is not the capability a trader needs.
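A toy sketch of what "probability engine over tokens" means in practice. The candidate tokens and their scores below are invented for illustration; a real model scores tens of thousands of tokens with a learned network, but the final step has the same shape: turn scores into probabilities, then sample.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical scores for the token following "I see a bull ..."
toy_logits = {"flag": 3.0, "market": 2.0, "run": 1.5, "trap": 0.5}

probs = softmax(toy_logits)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
next_token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

most_likely = max(probs, key=probs.get)
print(f"most likely continuation: {most_likely} ({probs[most_likely]:.0%})")
print(f"sampled continuation: {next_token}")
```

Nothing in this step refers to prices, positions, or profit. The distribution is over words.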
A trader needs:
- Probability estimates over future market moves, calibrated against real market history
- Risk management handling position size, drawdown, tail risk
- Cost awareness for slippage, funding rates, spread, fees
- Regime awareness — recognition that markets change their statistical properties
- A reward mechanism durably correlated with reality
An LLM has none of these. It has remarkable language competence with which it can articulate plausible trading explanations. It can write "I see a bull flag on the 4-hour chart" without anything in its weights actually meaning "bull flag." It delivers narrative that passes for analysis, and actions that follow from that narrative, without the narrative generator ever having been trained to interact with markets.
Convert a language model's output directly into trades and you get exactly what Alpha Arena produced: confident, well-articulated trading decisions that on average burn money.
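To make the cost-awareness and risk-management items in the list above concrete, here is a minimal sketch of two calculations a trading system has to get right before any narrative matters. All rates and parameters are hypothetical round numbers, not Hyperliquid's actual fee schedule, and the cost model is simplified: exit notional is approximated by entry notional, and the long side is assumed to pay funding.

```python
def position_notional(equity, risk_fraction, stop_distance):
    """Notional size such that hitting the stop loses `risk_fraction` of equity.

    `stop_distance` is the adverse price move to the stop, as a fraction.
    """
    return equity * risk_fraction / stop_distance


def net_return_on_margin(price_move, leverage, taker_fee, slippage,
                         funding_rate, funding_periods):
    """Return on margin for a long perp position after costs.

    All rates are fractions of notional.
    """
    gross = price_move * leverage
    # Entry and exit each pay fee plus slippage on the leveraged notional.
    trading_costs = 2 * (taker_fee + slippage) * leverage
    funding_costs = funding_rate * funding_periods * leverage
    return gross - trading_costs - funding_costs


size = position_notional(equity=10_000, risk_fraction=0.01, stop_distance=0.02)
net = net_return_on_margin(price_move=0.005, leverage=10, taker_fee=0.0005,
                           slippage=0.0005, funding_rate=0.0001,
                           funding_periods=3)

print(f"notional for 1% risk with a 2% stop: ${size:,.0f}")
print(f"gross 5.0% on margin becomes {net:.1%} after costs")
```

With these made-up numbers, costs consume close to half of the gross return on a correct call. A system that articulates the trade well but ignores this line of arithmetic loses money on average even when it is often right.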
The additional $441,000 data point
In parallel to Alpha Arena, there was another case that illustrates the problem from a different angle. An AI trading bot built by an OpenAI employee misread a social media post and sent $441,000 worth of tokens to a stranger's wallet. That's not a market performance problem. That's an out-of-distribution failure: an input the model hadn't seen quite like that in training, and the model derived a plausible-looking action from it that was totally wrong.
That's the class of failure mode that gets mentioned in no sales pitch but is existential for live trading systems. Backtests don't catch it. It only happens once real money is involved.
What reinforcement learning bots do better — and also don't solve
LLMs aren't the only AI architecture being tried for trading. The more serious research line uses reinforcement learning, often combined with memory architectures like xLSTM or transformer-based sequence models.
The Amertume project (xLSTM + PPO gold trading bot) is a good example of what honest RL research looks like. Published Sharpe ratio 6.94. Methodologically clean documentation. But the failure modes the author transparently reports are revealing:
Overtrading through reward hacking: "Run 1 executed 1981 trades in training because transaction costs were invisible (0.00004 vs 0.01 log returns)." The agent learned to trade excessively because transaction costs in the reward signal were too small to matter. It didn't hack the market, it hacked the reward system.
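A sketch of the scale problem, reading the quoted figures as a per-trade cost of 0.00004 against per-step log returns on the order of 0.01. The reward function and the `cost_weight` parameter are illustrative assumptions, not Amertume's actual code; up-weighting the cost term is one common response, and whether the author used it is not stated here.

```python
COST_PER_TRADE = 0.00004    # transaction cost in log-return units (quoted above)
TYPICAL_LOG_RETURN = 0.01   # order of magnitude of a step return (quoted above)
TRADES_IN_RUN_1 = 1981      # trade count reported for Run 1


def step_reward(log_return, traded, cost=COST_PER_TRADE, cost_weight=1.0):
    """Per-step reward: log return minus an optionally re-weighted trading cost."""
    penalty = cost_weight * cost if traded else 0.0
    return log_return - penalty


cost_to_signal = COST_PER_TRADE / TYPICAL_LOG_RETURN
cumulative_cost = TRADES_IN_RUN_1 * COST_PER_TRADE

print(f"cost is {cost_to_signal:.1%} of a typical step return")
print(f"cumulative cost over {TRADES_IN_RUN_1} trades: {cumulative_cost:.5f}")
```

Per step, the cost is well under one percent of the return signal, so the gradient barely registers it. Summed over the whole run, it is several times a typical step return. The agent only ever sees the per-step view.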
Hold exploit: "Run 2-3 learned to hold positions for exactly 60 bars (max time limit) instead of exiting naturally." The agent learned to hold positions exactly to the maximum time limit because that produced the highest reward in training. In the real world, 60 bars is arbitrary and has nothing to do with market behavior.
That's the core problem with reinforcement learning for trading: the agent optimizes the reward signal, not the market. If the reward signal is only approximately correlated with long-term trading success, the agent learns to exploit the gap between signal and reality.
This isn't unsolvable. It's just unsolved.
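Exploits like the hold-to-limit behavior are at least cheap to detect after the fact. A minimal diagnostic sketch, assuming you log how many bars each trade was held; the holding periods below are invented for illustration, and only the 60-bar limit comes from the quoted report.

```python
MAX_HOLD_BARS = 60  # time limit quoted in the Amertume report


def time_limit_exit_share(hold_lengths, max_bars=MAX_HOLD_BARS):
    """Fraction of trades that were closed by the time limit, not by the policy."""
    if not hold_lengths:
        return 0.0
    forced = sum(1 for bars in hold_lengths if bars >= max_bars)
    return forced / len(hold_lengths)


# Hypothetical holding periods (in bars) from a training run.
hypothetical_holds = [60, 60, 12, 60, 60, 60, 33, 60, 60, 60]

share = time_limit_exit_share(hypothetical_holds)
print(f"{share:.0%} of trades exited at the time limit")
```

A share close to 100% suggests the agent has learned the environment's boundary, not an exit rule. The same idea applies to any hard limit in the simulator: check whether behavior clusters on it.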
The serious 2026 research line
Behind the public AI trading hype runs a more serious research line getting much less attention:
- xLSTM + PPO as the current combination of choice. xLSTM solves classical LSTM problems without transformers' quadratic memory overhead.
- Multi-agent ensembles with specialized sub-agents (trend-following agent, mean-reversion agent, etc.) and adaptive weight learning.
- Meta-learning RL, where the agent doesn't learn a strategy but learns to learn strategies. That's the 2026 research front.
These approaches are methodologically serious and produce interesting results in controlled environments. They are also all very far from "buy a lifetime badge and our AI trades for you." The gap between research state and product marketing in this industry is substantial.
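To give the "adaptive weight learning" idea in the list above some shape, here is a minimal sketch using the textbook multiplicative-weights (Hedge) update. This is a stand-in chosen for brevity, not the method any of the cited research uses; the agent names, returns, and learning rate are hypothetical.

```python
import math


def update_weights(weights, agent_returns, learning_rate=10.0):
    """Multiplicative-weights update: shift weight toward agents that just did well."""
    raw = {
        name: weight * math.exp(learning_rate * agent_returns[name])
        for name, weight in weights.items()
    }
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def ensemble_signal(weights, signals):
    """Weighted average of sub-agent signals in [-1, 1] (short to long)."""
    return sum(weights[name] * signals[name] for name in weights)


weights = {"trend": 0.5, "mean_reversion": 0.5}
last_returns = {"trend": 0.02, "mean_reversion": -0.01}  # hypothetical
signals = {"trend": 1.0, "mean_reversion": -1.0}         # hypothetical

new_weights = update_weights(weights, last_returns)
combined = ensemble_signal(new_weights, signals)

print({name: round(w, 3) for name, w in new_weights.items()})
print(f"combined signal: {combined:+.3f}")
```

Note that this update chases recent performance by construction, which is exactly the kind of behavior that needs out-of-sample testing across regimes before anyone trusts it.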
LLMs as coding assistants — the other, sensible line
What LLMs can usefully do in a trading context is a completely different task: generate code. If you have a rule-based strategy and want to accelerate implementation, current coding assistants give enormous productivity gains. That's not "LLM trades autonomously." That's "LLM translates human strategy intuition into runnable code, which the human then reviews and backtests."
This platform — Backtesting Arena — is itself built with substantial LLM coding support. That works. But the LLM doesn't make trading decisions here. It writes code that then gets tested against historical market data. The trade path is:
Human has idea → LLM helps write code → Code is backtested on historical data
→ Human decides based on backtest results whether strategy goes live
→ Strategy runs as deterministic rule-based bot, no LLM involvement
That's the useful, methodologically clean application of language models in a trading context. It's boring because it doesn't lend itself to "autonomous AI trader" marketing. But it works.
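What "deterministic rule-based bot" means in code, as a minimal sketch. The moving-average crossover rule, the window lengths, and the price series are hypothetical examples, not a Backtesting Arena strategy or API. The property that matters is that the signal at each bar depends only on prices up to that bar, so the same input always yields the same trades.

```python
def sma(values, window):
    """Simple moving average of the last `window` values, or None if too few."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def signal(prices, fast=3, slow=5):
    """1 = long, 0 = flat. Uses only the prices passed in, no randomness."""
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    if fast_ma is None or slow_ma is None:
        return 0
    return 1 if fast_ma > slow_ma else 0


def backtest(prices, fast=3, slow=5):
    """Apply the signal from bar t to the return from bar t to bar t+1."""
    equity = 1.0
    for t in range(len(prices) - 1):
        position = signal(prices[: t + 1], fast, slow)  # no look-ahead
        equity *= 1 + position * (prices[t + 1] / prices[t] - 1)
    return equity


hypothetical_prices = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
result = backtest(hypothetical_prices)
print(f"final equity multiple: {result:.4f}")
```

Costs are deliberately left out here to keep the rule itself visible; a backtest meant to inform a live decision needs them.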
What this means for your backtesting practice
Three direct conclusions if this post got you here:
First: If anyone is selling you an AI trading bot that "autonomously trades for you," whether as a $5,000 lifetime license or a $99/month subscription, the Alpha Arena results are the best empirical argument for not spending the money. Four out of six frontier models lost 60+ percent. If those can't trade, neither can a white-label product built on top of them.
Second: AI as a tool for strategy development is a different question. If a tool helps you translate your own ideas into testable strategies faster, that's valuable. Backtesting Arena's concrete recommendation: define the rule clearly, backtest it honestly, then decide. AI can help with defining and translating — deciding stays your job.
Third: If you want to know whether a strategy works, there's only one honest path: clear rules, clean out-of-sample test, realistic costs. No magic. No autonomous agents. That's an old-fashioned, boring, but working answer to a question the market is currently offering all kinds of dazzling answers to.
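A minimal sketch of that path: pick the rule's parameter on the earlier part of the data only, then judge it on the later part with costs charged on every position change. The momentum rule, the candidate lookbacks, the cost level, and the synthetic random-walk prices are all illustrative assumptions, not a recommendation.

```python
import random


def strategy_returns(prices, lookback, cost_per_trade):
    """Per-bar net returns: long if price is above its level `lookback` bars ago."""
    returns, position = [], 0
    for t in range(lookback, len(prices) - 1):
        target = 1 if prices[t] > prices[t - lookback] else 0
        cost = cost_per_trade if target != position else 0.0  # pay on every change
        position = target
        returns.append(position * (prices[t + 1] / prices[t] - 1) - cost)
    return returns


def total_return(returns):
    """Compound a list of per-bar returns into one total return."""
    equity = 1.0
    for r in returns:
        equity *= 1 + r
    return equity - 1


def out_of_sample_check(prices, lookbacks, cost_per_trade, split=0.7):
    """Choose the lookback in-sample, then report its out-of-sample return."""
    cut = int(len(prices) * split)
    in_sample, held_out = prices[:cut], prices[cut:]
    best = max(lookbacks, key=lambda lb: total_return(
        strategy_returns(in_sample, lb, cost_per_trade)))
    return best, total_return(strategy_returns(held_out, best, cost_per_trade))


# Synthetic random-walk prices, seeded for reproducibility (not market data).
rng = random.Random(42)
prices = [100.0]
for _ in range(299):
    prices.append(prices[-1] * (1 + rng.gauss(0, 0.01)))

best_lookback, oos_return = out_of_sample_check(prices, [5, 10, 20], 0.0005)
print(f"chosen lookback: {best_lookback}, out-of-sample return: {oos_return:+.2%}")
```

On a random walk there is no edge to find, so any in-sample "winner" here is noise. That is the reason the held-out number, after costs, is the only one worth looking at.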
Summary in one paragraph
Alpha Arena is the clearest public proof so far that autonomous LLM trading agents don't work yet in 2026 — not because the models aren't clever enough, but because they're structurally built for the wrong task. Language models generate convincing narratives about markets; that isn't the same as interacting with markets. Until that changes (and it will, eventually), the methodologically honest position is: rule-based strategies, transparently backtested, human-decided, mechanically executed by code. Boring, yes. Working.
FAQ
Does Backtesting Arena have an AI mode? No, deliberately not. We build tools for rule-based strategy development. If someone does that with AI assistance — defining the rule, writing the implementation, interpreting the results — that's their free choice. But the trading decision itself isn't made by a model, it's made by a human.
Should I stop using ChatGPT for trading analysis? Don't stop. But frame the task cleanly. ChatGPT can help you interpret a backtest output, write code, structure a strategy. That's valuable. ChatGPT shouldn't tell you whether to buy BTC now. That's the task the Alpha Arena results show doesn't work today.
Won't this change in 12–24 months? Maybe. There's serious research on better architectures (xLSTM, multi-agent ensembles, meta-learning). When those become production-ready, the landscape changes. But today, in May 2026, the state of the art for autonomous AI trading is: −60% in two weeks. We'd rather wait for better data before changing the architecture of our platform.
Where do I find the Alpha Arena results themselves? Directly at Nof1.ai, the organizer. Secondary reports exist at protos.com and several industry outlets. We don't link directly because URLs change — a search for "Alpha Arena Nof1 LLM crypto" finds current sources.