Ozone Labs

A place to share emerging technologies and AI experiments with our publishing and marketing partners.

Register your interest for a Labs Live hackathon.

Research

LLM Arena: Who answers best?

Ozone Labs has built a simulation platform that helps publishers understand how their content performs inside AI-powered answer engines. This gives us a real-world read on which models handle current affairs best.

This first experiment shows how several models perform on topical news questions. We used human and AI judges in a head-to-head “which answer is better” format, and rated the models using ELO, a rating system for zero-sum games that is popular in the chess community:

Three AI judges independently scored 2,052 blind pairwise comparisons across 19 current-affairs questions. ELO ratings were computed via Bradley-Terry maximum-likelihood estimation. A higher ELO rating indicates better performance.

ELO ratings by model: Claude Sonnet 4 at 1774, GPT-4o at 1538, Gemini 2.5 Flash at 1188

Note on Gemini 2.5 Flash results: Gemini calculates token usage differently from Anthropic and OpenAI models, which may have affected how prompts were truncated or processed. We're reviewing whether this created an unfair comparison. Treat the Gemini scores as provisional until we can confirm parity in input handling.
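One way to read the ratings above is as implied head-to-head preference rates. The sketch below assumes the conventional Elo logistic curve (base 10, 400-point scale); the text does not state which scale was used, so the resulting probabilities are illustrative only. The function name `expected_score` is ours, not part of the platform.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo curve.

    Assumption: base-10 logistic with a 400-point scale. A tie counts as
    half a point, so this is an expected score rather than a pure win rate.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings as published in the chart above.
ratings = {
    "Claude Sonnet 4": 1774,
    "GPT-4o": 1538,
    "Gemini 2.5 Flash": 1188,
}

names = list(ratings)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        p = expected_score(ratings[a], ratings[b])
        print(f"{a} vs {b}: expected score {p:.2f}")
```

Under that assumption, a gap of roughly 240 points corresponds to the higher-rated model being preferred in about four out of five comparisons.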

Methodology

We ran a blind side-by-side evaluation of three popular large language models. Each model answered 19 current-affairs questions generated from trend data drawn from social media, search, and other sources. Every pairwise combination (36 pairs per question) was independently judged by three frontier models with web search grounding: Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro. Each judge evaluated pairs on accuracy, completeness, clarity, and helpfulness, then either voted for the stronger response or declared a tie. This produced 2,052 pairwise judgements in total.
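The total follows directly from the counts given above: questions × pairs per question × judges. The short check below uses a hypothetical helper, `total_judgements`, to make the arithmetic explicit. The final line notes that 36 is the number of unordered pairs among nine items; the text does not say how the 36 pairs were formed, so that reading is an assumption.

```python
from math import comb


def total_judgements(questions: int, pairs_per_question: int, judges: int) -> int:
    """Each judge scores every pair for every question exactly once."""
    return questions * pairs_per_question * judges


total = total_judgements(questions=19, pairs_per_question=36, judges=3)
print(total)

# Assumption, not stated in the text: 36 unordered pairs would arise
# from nine responses per question, since C(9, 2) = 36.
print(comb(9, 2))
```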

Performance ratings were computed with the Bradley-Terry model, a maximum-likelihood method for estimating relative strengths from pairwise data, and reported on an ELO scale. To mitigate provider self-serving bias, we deliberately chose judge models from all three providers. No systematic bias was observed: each judge ranked its own provider's model in line with the other judges.
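To illustrate how pairwise votes can become ratings, here is a minimal Bradley-Terry sketch. It is not the platform's code and rests on several assumptions: strengths are fitted with the standard iterative (minorisation-maximisation) update, a tie counts as half a win for each side, and ratings are mapped to an Elo-style scale as 1500 + 400·log10(strength) with the average rating anchored at 1500. The three published ratings do average 1500, which is consistent with such an anchor, but the text does not confirm it. The vote data below is made up.

```python
import math
from collections import defaultdict


def fit_bradley_terry(results, iterations=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (a, b, score_a) triples.

    score_a is 1.0 if a won, 0.0 if b won, 0.5 for a tie (assumption:
    ties are split evenly). Returns strengths with geometric mean 1.
    """
    score = defaultdict(float)   # total points per model
    games = defaultdict(float)   # comparisons per unordered pair
    models = set()
    for a, b, score_a in results:
        models.update((a, b))
        score[a] += score_a
        score[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0
    if any(score[m] == 0.0 for m in models):
        raise ValueError("a model with no points has no finite estimate")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        new = {}
        for m in models:
            denom = sum(n / (strength[m] + strength[q])
                        for pair, n in games.items() if m in pair
                        for q in pair - {m})
            new[m] = score[m] / denom
        # Fix the overall scale: geometric mean of strengths equals 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        new = {m: v / math.exp(log_mean) for m, v in new.items()}
        delta = max(abs(new[m] - strength[m]) for m in models)
        strength = new
        if delta < tol:
            break
    return strength


def to_elo(strength, anchor=1500.0, scale=400.0):
    """Map strengths to an Elo-style scale (anchor and scale are assumptions)."""
    return {m: anchor + scale * math.log10(s) for m, s in strength.items()}


# Made-up votes for three hypothetical models, ten comparisons per pair.
votes = ([("A", "B", 1.0)] * 8 + [("A", "B", 0.0)] * 2 +
         [("B", "C", 1.0)] * 7 + [("B", "C", 0.0)] * 3 +
         [("A", "C", 1.0)] * 9 + [("A", "C", 0.0)] * 1)

elo = to_elo(fit_bradley_terry(votes))
for model, rating in sorted(elo.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because the fit uses all comparisons at once, the result does not depend on the order in which votes are processed, unlike the game-by-game Elo updates used in chess.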

ELO is a rating system for calculating relative skill levels in zero-sum competitions. Scores were computed via Bradley-Terry maximum-likelihood estimation.

Labs Live

Where impossible ideas get built

As a business originally built from great collaboration, we truly believe that the best ideas come alive when we work together with our brand, agency and publisher partners. By introducing our engineers and data scientists to the mix, we join up to solve tricky problems – from first concepts through to finished products.

Pitch ideas

Anyone can throw an idea on the table — a hunch, a frustration, a “what if.”

Form teams

Self-organise around the ideas that excite you. Engineers, commercial, data — all welcome.

Build fast

Focused sprint, usually 1–2 days. Working prototypes, not presentations.

Ship

Proven experiments graduate into products. What doesn't ship still teaches us something.

Projects

Concepts, prototypes, open source projects and stuff

Not every great idea makes it through our product development process, but exploring an idea always uncovers lessons that can be used elsewhere. In our Projects, we will show that thinking and make it available to anyone who might be able to spark another idea from what we've discovered.

Experimenting with AdCP

Last updated: 2026-03-30

AI-powered ad planning that connects Claude directly to Ozone's audience intelligence. Paste a brief, get a media plan back with real data.

ADVERTISING | PLANNING | ACTIVATION

Topics

Last updated: 2026-03-30

An exploration into how certain topics are categorised across different publisher titles, and how often companies, people and locations are referenced.

Coming soon | Content

Stories

Last updated: 2026-03-24

Stories is an exploration into how similar articles across publisher titles organise together into related sets.

Coming soon | Content