ODE
AGI Benchmark Suite
Llewellyn SystemsPlatform

AGI Benchmark

Suite

Self-assessment intelligence testing across four frontier benchmarks. All evidence preserved. Not third-party verified.

Composite Index

0 of 4 benchmarks tested

These benchmarks use GPT-4o-mini to both answer questions and judge its own answers. This is not an independent evaluation. Scores should not be cited as verified AGI metrics. Third-party verification by independent researchers is required for external publication.

Benchmarks

Distribution

GPQAHLEARCFM20%40%60%80%100%

Scores

GPQA
HLE
ARC
FM

Methodology & Limitations

Testing Protocol

Questions are submitted to GPT-4o-mini via the OpenAI API. The model generates reasoned answers which are then evaluated against reference answers by the same model acting as a judge.

Self-Judging Limitation

The same model that answers also judges. This creates inherent bias: a model may rate its own incorrect reasoning as valid. Independent human evaluation is required for credible benchmark scores.

Path to Credibility

(1) Submit to official benchmark leaderboards. (2) Engage independent evaluators. (3) Publish methodology for peer review. Until complete, all scores remain internal self-assessments.