1.1 Introduction to Evaluation: Why It Matters
1.2 Generative versus Understanding Tasks
1.3 Key Metrics for Common Tasks
2.1 Evaluating Multiple-Choice Tasks
2.2 Evaluating Free Text Response Tasks
2.3 AIs Supervising AIs: LLM as a Judge
3.1 Evaluating Embedding Tasks
3.2 Evaluating Classification Tasks
3.3 Building an LLM Classifier with BERT and GPT
4.1 The Role of Benchmarks
4.2 Interrogating Common Benchmarks
4.3 Evaluating LLMs with Benchmarks
5.1 Probing LLMs for Knowledge
5.2 Probing LLMs to Play Games
6.1 Fine-Tuning Objectives
6.2 Metrics for Fine-Tuning Success
6.3 Practical Demonstration: Evaluating Fine-Tuning
6.4 Evaluating and Cleaning Data
7.1 Evaluating AI Agents: Task Automation and Tool Integration
7.2 Measuring Retrieval-Augmented Generation (RAG) Systems
7.3 Building and Evaluating a Recommendation Engine Using LLMs
7.4 Using Evaluation to Combat AI Drift
7.5 Time-Series Regression
8.1 When and How to Evaluate
8.2 Looking Ahead: Trends in LLM Evaluation
Evaluating Large Language Models (LLMs): Introduction
Evaluating Large Language Models (LLMs): Summary