Classifying news articles
Classification is a core natural language processing (NLP) task that large language models are good at, but building reliable systems is still challenging. In this cookbook, we'll walk through how to improve an LLM-based classification system that sorts news articles by category.
Getting started
Before getting started, make sure you have a Braintrust account and an API key for OpenAI. Make sure to plug the OpenAI key into your Braintrust account's AI provider configuration.
Once you have your Braintrust account set up with an OpenAI API key, install the following dependencies:
Next, we'll import the libraries we need and load the ag_news dataset from Hugging Face. Once the dataset is loaded, we'll extract the category names to build a map from indices to names, allowing us to compare expected categories with model outputs. Then, we'll shuffle the dataset with a fixed seed, trim it to 20 data points, and restructure it into a list where each item includes the article text as input and its expected category name.
To authenticate with Braintrust, export your BRAINTRUST_API_KEY as an environment variable:
Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
Once the API key is set, we initialize the OpenAI client using the AI proxy:
Writing the initial prompts
We'll start by testing classification on a single article. We'll select it from the dataset to examine its input and expected output:
Now that we've verified what's in our dataset and initialized the OpenAI client, it's time to try writing a prompt and classifying a title. We'll define a classify_article function that takes an input title and returns a category:
Running an evaluation
We've tested our prompt on a single article, so now we can test across the rest of the dataset using the Eval function. Behind the scenes, Eval will in parallel run the classify_article function on each article in the dataset, and then compare the results to the ground truth labels using a simple Levenshtein scorer. When it finishes running, it will print out the results with a link to dig deeper.
Analyzing the results
Looking at our results table (in the screenshot below), we see our that any data points that involve the category Sci/Tech are not scoring 100%. Let's dive deeper.

Reproducing an example
First, let's see if we can reproduce this issue locally. We can test an article corresponding to the Sci/Tech category and reproduce the evaluation:
Fixing the prompt
Have you spotted the issue? It looks like we misspelled one of the categories in our prompt. The dataset's categories are World, Sports, Business and Sci/Tech - but we are using Sci-Tech in our prompt. Let's fix it:
Evaluate the new prompt
The model classified the correct category Sci/Tech for this example. But, how do we know it works for the rest of the dataset? Let's run a new experiment to evaluate our new prompt:
Conclusion
Select the new experiment, and check it out. You should notice a few things:
- Braintrust will automatically compare the new experiment to your previous one.
- You should see the eval scores increase and you can see which test cases improved.
- You can also filter the test cases by improvements to know exactly why the scores changed.

Next steps
- I ran an eval. Now what?
- Add more custom scorers.
- Try other models like xAI's Grok 2 or OpenAI's o1. To learn more about comparing evals across multiple AI models, check out this cookbook.
