RuneBench

Agent Benchmark on RuneScape Gameplay Tasks

RuneBench evaluates AI coding agents on their ability to play RuneScape via a TypeScript SDK, accomplishing tasks in the game world. Measuring agents' behavior in an "orient, decide, act" loop provides interesting insights into the multi-step planning and problem-solving capabilities of coding agents.

How Agents See the World

TypeScript SDK

Agents play the game by writing and executing TypeScript snippets against an emulated game server. The SDK provides access to reading game state and performing actions.
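
As a rough illustration, an agent-written snippet might look like the sketch below. The interfaces and method names here (GameClient, findNearest, interact, and so on) are assumptions made up for illustration, not the actual rs-sdk surface.

```ts
// Hypothetical sketch of an agent-written snippet. The types below are
// assumptions about what an SDK like rs-sdk could expose, not its real API.
interface Player {
  skills: { woodcutting: { level: number; xp: number } };
  inventory: { name: string }[];
}

interface GameClient {
  player(): Promise<Player>;                        // read current game state
  findNearest(objectName: string): Promise<{ id: number } | null>;
  interact(objectId: number, action: string): Promise<void>;
  waitTicks(n: number): Promise<void>;
}

// One "orient, decide, act" iteration: read state, pick a target, act on it.
async function chopOneLog(client: GameClient): Promise<void> {
  const me = await client.player();
  if (me.inventory.length >= 28) return;            // inventory full, nothing to do

  const tree = await client.findNearest("Tree");
  if (!tree) return;                                // no tree nearby

  await client.interact(tree.id, "Chop down");
  await client.waitTicks(5);                        // let the action play out
}
```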

Game Knowledge

Each agent is given a folder of markdown files extracted from the game wiki — skill guides, item stats, NPC locations, and quest walkthroughs. They can search the files to inform their strategy.
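
A minimal version of that search is just a keyword scan over the folder. The sketch below assumes a local wiki/ directory of .md files and illustrates the idea rather than the harness's actual tooling.

```ts
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Simple keyword search over the wiki dump: return every markdown line that
// mentions the query so the agent can pull the relevant passages into context.
function searchWiki(wikiDir: string, query: string): string[] {
  const needle = query.toLowerCase();
  const hits: string[] = [];
  for (const file of readdirSync(wikiDir)) {
    if (!file.endsWith(".md")) continue;
    for (const line of readFileSync(join(wikiDir, file), "utf8").split("\n")) {
      if (line.toLowerCase().includes(needle)) hits.push(`${file}: ${line.trim()}`);
    }
  }
  return hits;
}

// e.g. searchWiki("./wiki", "willow") -> lines about willow trees, locations, levels
```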

Task Design

Scoring: XP Rates as the goal

The original task was to gain as much XP as possible for a skill within a fixed time window, but we found this approach punished exploration: the winning strategies were often a simple grind with as little stopping as possible. Because we wanted to reward interesting strategies and exploration, we landed on measuring the maximum XP rate over any 15-second window. By focusing on XP rate, we reward agents that discover higher-level strategies beyond pure time-on-task. It was great seeing winning runs use many locations, tricks, and methods as they level up; models are incredible optimizers.
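
Concretely, the metric can be computed from a log of timestamped XP samples. The sketch below shows one way to do it; the sample shape and the conversion to an hourly rate are assumptions for illustration, only the 15-second window comes from the task definition.

```ts
interface XpSample {
  timeMs: number; // milliseconds since run start
  xp: number;     // cumulative XP in the target skill
}

// Maximum XP gained inside any sliding 15-second window, scaled to XP/hour.
// Two-pointer sweep over chronologically ordered samples.
function maxXpRate(samples: XpSample[], windowMs = 15_000): number {
  let best = 0;
  let start = 0;
  for (let end = 0; end < samples.length; end++) {
    while (samples[end].timeMs - samples[start].timeMs > windowMs) start++;
    best = Math.max(best, samples[end].xp - samples[start].xp);
  }
  return best * (3_600_000 / windowMs); // e.g. 1,000 XP in 15 s -> 240,000 XP/hr
}
```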

Limitations

The biggest remaining issue with this task design is the long runtime. I experimented with shorter task durations, but that punished the models' tendency to front-load planning (spending many minutes reading docs) as well as the slow inference rates of some models. The complexity of the environment and the low sample size also led to noise and false negatives that hurt the numerical accuracy of comparisons between models.

There is room to design interesting micro-tasks with shorter runtimes that would make the benchmark easier to run and iterate on.

Harness Development

Growing an API

The rs-sdk TypeScript library was developed through cycles of automated error analysis: I would run a batch of agents on a task, categorize failures and missing features, and use those to inform harness improvements. This was a really interesting way to "grow" an API layer between the agents and the game server.
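
The analysis step itself can be very simple; below is a sketch of the kind of per-category failure tally that drove SDK changes (the record shape and category names are illustrative assumptions).

```ts
// Illustrative failure record produced by a batch of agent runs.
interface RunFailure {
  model: string;
  category: string; // e.g. "missing-sdk-method", "pathfinding", "wiki-gap"
  message: string;
}

// Count failures per category so the most common gaps drive the next round
// of SDK and documentation improvements.
function tallyFailures(failures: RunFailure[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const f of failures) {
    counts.set(f.category, (counts.get(f.category) ?? 0) + 1);
  }
  return counts;
}
```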

Future Work

Multi-Agent Collaboration

Can two bots outperform a single agent? For example, one agent could gather raw materials while another processes them, splitting the supply chain to achieve a higher combined XP rate. It would be fascinating to test coordination and communication around a shared optimization task. This might take the form of one LLM agent scripting multiple player characters, or separate LLMs communicating through in-game chat.

Knowledge Transfer

How well can an agent write a guide for another agent? Given one agent's experience training a skill, can it produce instructions that meaningfully improve a second agent's performance?

Contribution

We would love to see more people experimenting with rs-sdk! Join the Discord to get involved; there are many people doing interesting small-scale experiments with harnesses and techniques.


Thank you

Thanks to Rob Haisfield, Sean Lee, Christopher Settles, Alex Duffy, and Erik Quintanilla for critical feedback and input, and to the LostCity and Harbor RL communities for providing a critical open-source ecosystem.