Large Language Model Evaluation


The rapid proliferation of large language models (LLMs) has created an urgent need for rigorous, independent evaluation. Commercial benchmarks provided by model developers are inherently conflicted; academic benchmarks frequently fail to capture real-world task performance; and the pace of model releases has outstripped the research community's ability to assess them systematically.

This program conducts ongoing, structured evaluation of the frontier LLMs available via public API — with a particular focus on tasks relevant to knowledge work, professional services, and research assistance.

Models Under Evaluation

Claude (Anthropic): Opus 4, Sonnet 4
Gemini (Google DeepMind): 2.0, 2.5 Pro
GPT-4o (OpenAI): GPT-4o, o3
Grok (xAI): Grok 3
Perplexity (Perplexity AI): Sonar Pro

Benchmark Methodology

Our evaluation draws on a combination of established public benchmarks and proprietary task sets developed specifically for this program. Public benchmarks provide comparability with published research; proprietary tasks allow us to evaluate performance on the real-world, messy, ambiguous queries that characterise knowledge work rather than the clean, well-formed questions of academic datasets.

Public Benchmarks Used

Benchmark	Domain	What It Tests
MMLU	General knowledge	57-subject academic knowledge test spanning law, medicine, history, STEM
HumanEval	Code generation	Python function synthesis from docstrings; pass@1 metric
MATH	Mathematical reasoning	12,500 competition-level maths problems across 7 subjects and 5 difficulty levels
GPQA	Expert reasoning	Graduate-level questions in biology, chemistry, and physics
HellaSwag	Commonsense inference	Sentence completion requiring world knowledge and inference
TruthfulQA	Factual accuracy	Questions that tempt models into repeating common human misconceptions
MT-Bench	Multi-turn dialogue	Two-turn conversations scored by an LLM judge; tests whether the model retains context in its follow-up answer
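
The pass@1 metric listed for HumanEval is usually reported with the unbiased pass@k estimator: generate n samples per problem, count the c samples that pass every unit test, and estimate the chance that at least one of k randomly drawn samples is correct. The sketch below illustrates that estimator only; the function names and the (n, c) input layout are assumptions made for illustration, not a description of this program's evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: sampling budget being scored (k = 1 for pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), the estimate reduces to the plain fraction of problems solved on the first attempt.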

Proprietary Task Sets

Beyond standard benchmarks, we have developed five proprietary task sets targeting the specific use cases most relevant to our research community:

LegalDraft-CA — Drafting and analysis of Canadian legal documents; evaluated by practising lawyers.
PolicyAnalysis-GOC — Summarisation and critique of Canadian federal policy documents.
CyberThreatReasoning — Analysis of cybersecurity incident reports and threat intelligence.
AcademicSynth — Synthesising conflicting academic literature into coherent literature reviews.
FinancialModelling-CA — Building and explaining financial projections in Canadian regulatory context.

Preliminary Findings (Q1 2025)

Our Q1 2025 evaluation round covered all five model families on MMLU, HumanEval, MATH, and MT-Bench, plus our LegalDraft-CA and CyberThreatReasoning proprietary tasks. Key observations:

Reasoning tasks (MATH, GPQA): OpenAI o3 and Claude Opus 4 lead the field, with Gemini 2.5 Pro performing strongly on mathematical tasks. Grok 3 improves markedly on earlier Grok versions.
Code generation (HumanEval): Claude Sonnet 4 and GPT-4o are the most reliable for production-quality code. All models degrade significantly on tasks requiring knowledge of Canada-specific libraries and regulatory APIs.
Factual accuracy (TruthfulQA): All frontier models show meaningful hallucination rates on domain-specific Canadian legal and regulatory questions — a finding with significant implications for any professional use case.
Multi-turn dialogue (MT-Bench): Claude models demonstrate the strongest context retention across long exchanges. This is a material advantage for research assistance workflows.
CyberThreatReasoning (proprietary): Claude and Gemini outperform GPT-4o on structured threat analysis tasks. All models perform poorly on attribution tasks requiring adversarial reasoning.

Full results and methodology details are available in Working Paper WP-2025-01, accessible through the Research Portal.
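
Comparative statements such as "outperforms" are easier to interpret when the gap between two models is checked against sampling noise. This page does not say how differences were tested, so the following is only a sketch of one common option, a paired bootstrap over per-item scores. The function name, the input layout, the resample count, and the 95% interval are all assumptions chosen for illustration.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Mean score difference (A minus B) with a 95% percentile interval.

    scores_a and scores_b hold per-item scores for two models on the SAME
    items in the same order (for example 1 for correct, 0 for incorrect).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists over the same items")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Resample items with replacement and recompute the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)
```

If the interval excludes zero, the observed gap is unlikely to be an artefact of which items happened to be in the task set. Pairing by item matters because item difficulty varies far more than model differences on most benchmarks.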

Safety Evaluation

Beyond capability evaluation, we maintain an ongoing safety and alignment assessment track. This examines how models respond to adversarial prompts, edge cases in professional domains (legal, medical, financial), and requests that sit at the boundary of the models' operational guidelines.

We are particularly interested in practical safety — not the toy examples frequently used in alignment research, but the genuine risks that arise when non-expert users deploy frontier LLMs for consequential tasks without appropriate oversight.

Access the Research Portal

Compare all five models in real time and read our working papers on LLM evaluation methodology.
