The rapid proliferation of large language models (LLMs) has created an urgent need for rigorous, independent evaluation. Commercial benchmarks provided by model developers are inherently conflicted; academic benchmarks frequently fail to capture real-world task performance; and the pace of model releases has outstripped the research community's ability to assess them systematically.
This program conducts ongoing, structured evaluation of the frontier LLMs available via public API, with a particular focus on tasks relevant to knowledge work, professional services, and research assistance.
Models Under Evaluation
Claude
Anthropic · Opus 4, Sonnet 4
Gemini
Google DeepMind · 2.0, 2.5 Pro
GPT-4o
OpenAI · GPT-4o, o3
Grok
xAI · Grok 3
Perplexity
Perplexity AI · Sonar Pro
Benchmark Methodology
Our evaluation draws on a combination of established public benchmarks and proprietary task sets developed specifically for this program. Public benchmarks provide comparability with published research; proprietary tasks allow us to evaluate performance on the real-world, messy, ambiguous queries that characterise knowledge work rather than the clean, well-formed questions of academic datasets.
Public Benchmarks Used
| Benchmark | Domain | What It Tests |
|---|---|---|
| MMLU | General knowledge | 57-subject academic knowledge test spanning law, medicine, history, STEM |
| HumanEval | Code generation | Python function synthesis from docstrings; pass@1 metric |
| MATH | Mathematical reasoning | 12,500 competition-level maths problems across 7 subjects and 5 difficulty levels |
| GPQA | Expert reasoning | Graduate-level questions in biology, chemistry, and physics |
| HellaSwag | Commonsense inference | Sentence completion requiring world knowledge and inference |
| TruthfulQA | Factual accuracy | Questions designed to elicit false answers that repeat common human misconceptions |
| MT-Bench | Multi-turn dialogue | Multi-turn questions scored by an LLM judge; tests instruction following and context retention across turns |
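The pass@1 metric listed for HumanEval is commonly computed with the unbiased pass@k estimator published alongside that benchmark: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The following is a minimal sketch of that calculation. The function names and the per-problem `(n, c)` input format are illustrative assumptions and are not taken from this program's evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no pass,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs (hypothetical format)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, so pass@1 from a single sample per problem is simply the share of problems solved on the first attempt.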
Proprietary Task Sets
Beyond standard benchmarks, we have developed five proprietary task sets targeting the specific use cases most relevant to our research community:
- LegalDraft-CA: Drafting and analysis of Canadian legal documents; evaluated by practising lawyers.
- PolicyAnalysis-GOC: Summarisation and critique of Canadian federal policy documents.
- CyberThreatReasoning: Analysis of cybersecurity incident reports and threat intelligence.
- AcademicSynth: Synthesising conflicting academic literature into coherent literature reviews.
- FinancialModelling-CA: Building and explaining financial projections in Canadian regulatory context.
Preliminary Findings (Q1 2025)
Our Q1 2025 evaluation round covered all five models on MMLU, HumanEval, MATH, and MT-Bench, plus our LegalDraft-CA and CyberThreatReasoning proprietary tasks. Key observations:
- Reasoning tasks (MATH, GPQA): OpenAI o3 and Claude Opus 4 lead the field, with Gemini 2.5 Pro performing strongly on mathematical tasks. Grok 3 shows a marked improvement over prior versions.
- Code generation (HumanEval): Claude Sonnet 4 and GPT-4o are the most reliable for production-quality code. All models degrade significantly on tasks requiring knowledge of Canada-specific libraries and regulatory APIs.
- Factual accuracy (TruthfulQA): All frontier models show meaningful hallucination rates on domain-specific Canadian legal and regulatory questions, a finding with significant implications for any professional use case.
- Multi-turn dialogue (MT-Bench): Claude models demonstrate the strongest context retention across long exchanges. This is a material advantage for research assistance workflows.
- CyberThreatReasoning (proprietary): Claude and Gemini outperform GPT-4o on structured threat analysis tasks. All models perform poorly on attribution tasks requiring adversarial reasoning.
Full results and methodology details are available in Working Paper WP-2025-01, accessible through the Research Portal.
Safety Evaluation
Beyond capability evaluation, we maintain an ongoing safety and alignment assessment track. This examines how models respond to adversarial prompts, edge cases in professional domains (legal, medical, financial), and requests that sit at the boundary of the models' operational guidelines.
We are particularly interested in practical safety: not the toy examples frequently used in alignment research, but the genuine risks that arise when non-expert users deploy frontier LLMs for consequential tasks without appropriate oversight.
Access the Research Portal
Compare all five models in real time and read our working papers on LLM evaluation methodology.
Open Research Portal →