Master's Thesis in Mathematics and Philosophy: A Unified Theory of AI Evaluation
Title
Towards a Unified Theory of AI Evaluation: Integrating Mathematics, Epistemology and Formal Logic to Evaluate AI Agents
Objective
Design, justify, and prototype a holistic evaluation framework for AI agents and Large Language Models (LLMs) that:
- Maps and formalises current state-of-the-art evaluation metrics used in industry and academia.
- Reorganises and extends these metrics using concepts from epistemology and formal logic, treating AI systems as epistemic agents (systems that hold, revise, and communicate “beliefs”).
The choice of epistemological and logical traditions (e.g. Bayesian epistemology, coherentism, virtue epistemology, phenomenology, epistemic logic, non-monotonic logic, belief revision) is open: the student is expected to select, motivate, and defend their own approach.
Scope & Challenges
You will work at the intersection of mathematics, philosophy, and AI, with significant freedom to propose your own evaluation perspective.
The work includes:
- Surveying and formalising state-of-the-art evaluation methods for LLMs and AI agents (e.g. benchmark metrics, robustness and safety evaluations, human preference / alignment metrics, agent-specific benchmarks).
- Providing precise mathematical descriptions and a structured taxonomy of these metrics, and analysing what each one really measures and where it falls short.
- Selecting one or more traditions in epistemology and one or more formal/logical frameworks, arguing why they are appropriate lenses for evaluating AI agents, and extracting a set of epistemic desiderata (e.g. truth-tracking, coherence, rational revision, justified explanation, handling of uncertainty).
- Designing a multi-dimensional, philosophy-informed evaluation framework that organises existing metrics in a coherent way and introduces new or refined metrics or protocols where current practice is lacking.
- Implementing and testing parts of this framework on real LLMs/agents (via APIs), and analysing how current systems behave under your criteria.
Key challenges include bridging abstract philosophical concepts with computable metrics, covering enough of the evaluation landscape without losing focus, and making epistemological and logical tools understandable and convincing for technically oriented readers.
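To illustrate the first of these challenges, one possible way (among many) to turn the desideratum "handles uncertainty well" into a computable quantity is probabilistic calibration. The binning scheme and notation below are a common textbook choice, offered as a sketch rather than a prescription:

```latex
% Sketch: calibration as one computable reading of "handles uncertainty well".
% Elicit a confidence for each of n evaluated claims, partition the claims into
% B confidence bins S_1, ..., S_B, and compute the expected calibration error
\[
  \mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|S_b|}{n}\,
  \bigl|\,\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\,\bigr|,
\]
% where acc(S_b) is the fraction of claims in bin S_b that turn out to be true
% and conf(S_b) is the agent's average stated confidence on that bin.
% An epistemically well-behaved agent should drive ECE towards zero.
```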
Deliverables
By the end of the thesis, you are expected to deliver:
- A structured, mathematically rigorous survey and taxonomy of current evaluation methods for LLMs and AI agents, with formal, philosophically grounded definitions and critical commentary.
- A clear conceptual and formal presentation of your chosen epistemological and logical perspectives, and a set of epistemic desiderata for AI evaluation derived from them.
- A holistic evaluation framework that organises existing metrics along epistemic dimensions and proposes new or refined metrics or protocols where needed.
- A prototype implementation (e.g. in Python) that computes selected metrics and runs small-scale experiments on at least one real LLM/agent, plus an empirical and philosophical analysis of the results and their limitations (see the illustrative sketch after this list).
- A well-structured written thesis suitable for readers in ML/CS/logic and philosophers interested in the epistemology of AI.
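To make the prototype deliverable above more concrete, the following is a minimal, purely illustrative Python sketch of one crude coherence proxy: how often an agent returns the same answer to semantically equivalent prompts. The function name `pairwise_agreement` and the hard-coded answers are placeholders; in the thesis the answers would be collected from a real LLM via its API.

```python
# Illustrative sketch only: a crude "coherence" proxy measuring how often an
# agent gives the same answer to paraphrases of a single question.
from itertools import combinations

def pairwise_agreement(answers: list[str]) -> float:
    """Fraction of answer pairs that agree exactly after simple normalisation."""
    normalised = [a.strip().lower() for a in answers]
    pairs = list(combinations(normalised, 2))
    if not pairs:  # fewer than two answers: trivially coherent
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Placeholder answers for three paraphrases of one question; in practice
    # these would come from LLM API calls.
    answers = ["Paris", "paris", "Lyon"]
    print(f"Coherence proxy: {pairwise_agreement(answers):.2f}")
```

Exact string matching is of course far too blunt for real evaluation; the sketch only shows the shape of metric the prototype would compute.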
Requirements
We are looking for a student who has:
- Academic background: Master’s-level student in Mathematics and Philosophy, or a closely related field, with strong logic/maths skills.
- Technical skills: solid mathematical maturity (formal definitions, proofs, basic probability) and basic programming skills (preferably Python) to interact with LLM APIs, implement evaluation metrics, and run and analyse experiments.
- Philosophical background: genuine passion for epistemology and/or formal logic, prior coursework or significant reading in these areas, and willingness to engage seriously with philosophical texts and connect them to technical work.
- Personal qualities: enjoys working across disciplines and synthesising ideas; comfortable with open-ended problems where you define part of the agenda yourself; able to argue clearly, write precisely, and defend a chosen perspective while remaining critically reflective.
If you’re excited by the question “What does it mean for an AI system to be epistemically good, and how can we actually measure that?” and you enjoy both math and philosophy, this project is for you.
About Hypertype
We’re building the next generation of AI products and AI Agents in the Customer Support space. Focused on manufacturing, but with clients across other sectors such as fintech and health tech, we give hundreds of companies across the globe the capacity to transform their customer journey with the latest advancements in AI. Our customers proudly recognise us for delivering the "highest quality answers in the market... outsmarting Gemini, Fin, Microsoft Copilot etc.", a testament to our unwavering commitment to excellence and precision in every interaction.
We are a group of people who go the extra mile together, fearlessly push boundaries to new heights, and deeply own what they deliver. We look for those who are ambitious, humble, and fast, ready to build together things humankind has not yet witnessed.
- Department: Hyperbrain AI
- Locations: Stockholm