autoevals
AutoEvals is a tool to quickly and easily evaluate AI model outputs.
Quickstart
Example
Use AutoEvals to model-grade an example LLM completion using the factuality prompt.
Interfaces
Namespaces
- AnswerCorrectness
- AnswerRelevancy
- AnswerSimilarity
- Battle
- ClosedQA
- ContextEntityRecall
- ContextPrecision
- ContextRecall
- ContextRelevancy
- EmbeddingSimilarity
- ExactMatch
- Factuality
- Faithfulness
- Humor
- JSONDiff
- Levenshtein
- LevenshteinScorer
- ListContains
- Moderation
- NumericDiff
- Possible
- Security
- Sql
- Summary
- Translation
- ValidJSON
Functions
AnswerCorrectness
▸ AnswerCorrectness(args): Score | Promise<Score>
Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, { answerSimilarity?: Scorer<string, {}> ; answerSimilarityWeight?: number ; azureOpenAi?: AzureOpenAiAuth ; client?: OpenAI ; context?: string \| string[] ; factualityWeight?: number ; input?: string ; maxTokens?: number ; model?: string ; openAiApiKey?: string ; openAiBaseUrl?: string ; openAiDangerouslyAllowBrowser?: boolean ; openAiDefaultHeaders?: Record<string, string> ; openAiOrganizationId?: string ; temperature?: number }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
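The weighted average mentioned above can be pictured with a minimal sketch. The helper name `weightedAverage` and the example weights are illustrative assumptions, not the library's code or its defaults; `factualityWeight` and `answerSimilarityWeight` from the parameter table would play the role of the weights.

```typescript
// Hypothetical helper, not part of autoevals: combines sub-scores into one
// value using a weighted average.
function weightedAverage(scores: number[], weights: number[]): number {
  if (scores.length !== weights.length || scores.length === 0) {
    throw new Error("scores and weights must be non-empty and the same length");
  }
  const totalWeight = weights.reduce((sum, w) => sum + w, 0);
  if (totalWeight === 0) {
    throw new Error("weights must not sum to zero");
  }
  const weightedSum = scores.reduce((sum, s, i) => sum + s * weights[i], 0);
  return weightedSum / totalWeight;
}

// Example: a factuality score of 1.0 weighted 0.75 and a similarity score of
// 0.5 weighted 0.25 (weights chosen for illustration only).
const combined = weightedAverage([1.0, 0.5], [0.75, 0.25]);
console.log(combined); // 0.875
```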
AnswerRelevancy
▸ AnswerRelevancy(args): Score | Promise<Score>
Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, { azureOpenAi?: AzureOpenAiAuth ; client?: OpenAI ; context?: string \| string[] ; embeddingModel?: string ; input?: string ; maxTokens?: number ; model?: string ; openAiApiKey?: string ; openAiBaseUrl?: string ; openAiDangerouslyAllowBrowser?: boolean ; openAiDefaultHeaders?: Record<string, string> ; openAiOrganizationId?: string ; strictness?: number ; temperature?: number }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
AnswerSimilarity
▸ AnswerSimilarity(args): Score | Promise<Score>
Scores the semantic similarity between the generated answer and ground truth.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, RagasArgs> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Battle
▸ Battle(args): Score | Promise<Score>
Test whether an output follows the instructions better than the original (expected) value.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ instructions: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ClosedQA
▸ ClosedQA(args): Score | Promise<Score>
Test whether an output answers the input using knowledge built into the model.
You can specify criteria to further constrain the answer.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ criteria: any ; input: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ContextEntityRecall
▸ ContextEntityRecall(args): Score | Promise<Score>
Estimates context recall by estimating true positives (TP) and false negatives (FN) from the annotated answer and the retrieved context.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, { azureOpenAi?: AzureOpenAiAuth ; client?: OpenAI ; context?: string \| string[] ; input?: string ; maxTokens?: number ; model?: string ; openAiApiKey?: string ; openAiBaseUrl?: string ; openAiDangerouslyAllowBrowser?: boolean ; openAiDefaultHeaders?: Record<string, string> ; openAiOrganizationId?: string ; pairwiseScorer?: Scorer<string, {}> ; temperature?: number }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ContextPrecision
▸ ContextPrecision(args): Score | Promise<Score>
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, RagasArgs> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ContextRecall
▸ ContextRecall(args): Score | Promise<Score>
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, RagasArgs> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ContextRelevancy
▸ ContextRelevancy(args): Score | Promise<Score>
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, RagasArgs> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
EmbeddingSimilarity
▸ EmbeddingSimilarity(args): Score | Promise<Score>
A scorer that uses cosine similarity to compare two strings.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, { azureOpenAi?: AzureOpenAiAuth ; client?: OpenAI ; expectedMin?: number ; model?: string ; openAiApiKey?: string ; openAiBaseUrl?: string ; openAiDangerouslyAllowBrowser?: boolean ; openAiDefaultHeaders?: Record<string, string> ; openAiOrganizationId?: string ; prefix?: string }> |
Returns
Score | Promise<Score>
A score between 0 and 1, where 1 is a perfect match.
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
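As a rough illustration of the cosine-similarity step, the sketch below compares two already-computed vectors. It assumes the strings have been embedded elsewhere (the scorer would obtain embeddings from a model); the function name and the hard-coded vectors are hypothetical, and any rescaling via `expectedMin` is not shown.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as having no similarity.
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stand-in "embeddings" for illustration only.
console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```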
ExactMatch
▸ ExactMatch(args): Score | Promise<Score>
A simple scorer that tests whether two values are equal. If the value is an object or array, it will be JSON-serialized and the strings compared for equality.
Parameters
| Name | Type |
|---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
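The serialize-then-compare behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the library source: `serializeForComparison` and `exactMatchSketch` are hypothetical names, and non-object values are simply converted with `String()`.

```typescript
// Hypothetical sketch: objects and arrays are JSON-serialized, everything
// else is converted with String().
function serializeForComparison(value: unknown): string {
  return typeof value === "object" && value !== null
    ? JSON.stringify(value)
    : String(value);
}

// Returns 1 when the serialized forms are identical, otherwise 0.
function exactMatchSketch(output: unknown, expected: unknown): number {
  return serializeForComparison(output) === serializeForComparison(expected) ? 1 : 0;
}

console.log(exactMatchSketch({ a: 1 }, { a: 1 })); // 1
console.log(exactMatchSketch("yes", "no")); // 0
```

Because the comparison in this sketch is on serialized strings, two objects with the same keys in a different order would not match.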
Factuality
▸ Factuality(args): Score | Promise<Score>
Test whether an output is factual, compared to an original (expected) value.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ expected?: string ; input: string ; output: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Faithfulness
▸ Faithfulness(args): Score | Promise<Score>
Measures factual consistency of the generated answer with the given context.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, RagasArgs> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Humor
▸ Humor(args): Score | Promise<Score>
Test whether an output is funny.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{}>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
JSONDiff
▸ JSONDiff(args): Score | Promise<Score>
A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<any, { numberScorer?: Scorer<number, object> ; preserveStrings?: boolean ; stringScorer?: Scorer<string, object> }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
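One way to picture a recursive JSON comparison with pluggable string and number scorers is sketched below. Everything here is an assumption made for illustration: the names, the averaging over keys and array positions, and the exact-equality stand-ins used in place of the Levenshtein and NumericDiff defaults mentioned above.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type PairScorer<T> = (output: T, expected: T) => number;

// Stand-in defaults (exact equality) to keep the sketch short.
const equalStrings: PairScorer<string> = (a, b) => (a === b ? 1 : 0);
const equalNumbers: PairScorer<number> = (a, b) => (a === b ? 1 : 0);

function isJsonObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function jsonDiffSketch(
  output: Json,
  expected: Json,
  stringScorer: PairScorer<string> = equalStrings,
  numberScorer: PairScorer<number> = equalNumbers,
): number {
  const recurse = (o: Json, e: Json): number =>
    jsonDiffSketch(o, e, stringScorer, numberScorer);

  if (typeof output === "string" && typeof expected === "string") {
    return stringScorer(output, expected);
  }
  if (typeof output === "number" && typeof expected === "number") {
    return numberScorer(output, expected);
  }
  if (Array.isArray(output) && Array.isArray(expected)) {
    const longest = Math.max(output.length, expected.length);
    if (longest === 0) return 1;
    let total = 0;
    // Positions present in only one array contribute 0 to the average.
    for (let i = 0; i < Math.min(output.length, expected.length); i++) {
      total += recurse(output[i], expected[i]);
    }
    return total / longest;
  }
  if (isJsonObject(output) && isJsonObject(expected)) {
    const keys = Object.keys(expected).concat(
      Object.keys(output).filter((k) => !(k in expected)),
    );
    if (keys.length === 0) return 1;
    let total = 0;
    // Keys present on only one side contribute 0 to the average.
    for (const key of keys) {
      if (key in output && key in expected) {
        total += recurse(output[key], expected[key]);
      }
    }
    return total / keys.length;
  }
  // Booleans, nulls, and mismatched types fall back to strict equality.
  return output === expected ? 1 : 0;
}

console.log(jsonDiffSketch({ a: 1, b: "x" }, { a: 1, b: "y" })); // 0.5
```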
LLMClassifierFromSpec
▸ LLMClassifierFromSpec<RenderArgs>(name, spec): Scorer<any, LLMClassifierArgs<RenderArgs>>
Type parameters
| Name |
|---|
RenderArgs |
Parameters
| Name | Type |
|---|---|
name | string |
spec | Object |
spec.choice_scores | Record<string, number> |
spec.model? | string |
spec.prompt | string |
spec.temperature? | number |
spec.use_cot? | boolean |
Returns
Scorer<any, LLMClassifierArgs<RenderArgs>>
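To make the role of `choice_scores` concrete, here is a minimal sketch of how a chosen label could be turned into a numeric score. The interface mirrors the `spec` fields listed above, but the prompt text, the `{{output}}` placeholder (which assumes mustache-style templating), and the helper `scoreForChoice` are hypothetical and not taken from the library.

```typescript
// Mirrors the `spec` fields listed above; the values are made up.
interface SpecSketch {
  prompt: string;
  choice_scores: Record<string, number>;
  model?: string;
  temperature?: number;
  use_cot?: boolean;
}

const politenessSpec: SpecSketch = {
  // Assumption: placeholders such as {{output}} are filled in before the
  // prompt is sent to the model.
  prompt: "Is the following answer polite?\n\n{{output}}\n\n(Y) Yes\n(N) No",
  choice_scores: { Y: 1, N: 0 },
  use_cot: true,
};

// Hypothetical helper: look up the score for the label the model picked.
// Returns null when the label is not one of the declared choices.
function scoreForChoice(spec: SpecSketch, choice: string): number | null {
  const known = Object.prototype.hasOwnProperty.call(spec.choice_scores, choice);
  return known ? spec.choice_scores[choice] : null;
}

console.log(scoreForChoice(politenessSpec, "Y")); // 1
console.log(scoreForChoice(politenessSpec, "maybe")); // null
```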
LLMClassifierFromSpecFile
▸ LLMClassifierFromSpecFile<RenderArgs>(name, templateName): Scorer<any, LLMClassifierArgs<RenderArgs>>
Type parameters
| Name |
|---|
RenderArgs |
Parameters
| Name | Type |
|---|---|
name | string |
templateName | "battle" \| "closed_q_a" \| "factuality" \| "humor" \| "possible" \| "security" \| "sql" \| "summary" \| "translation" |
Returns
Scorer<any, LLMClassifierArgs<RenderArgs>>
LLMClassifierFromTemplate
▸ LLMClassifierFromTemplate<RenderArgs>(«destructured»): Scorer<string, LLMClassifierArgs<RenderArgs>>
Type parameters
| Name |
|---|
RenderArgs |
Parameters
| Name | Type |
|---|---|
«destructured» | Object |
› choiceScores | Record<string, number> |
› model? | string |
› name | string |
› promptTemplate | string |
› temperature? | number |
› useCoT? | boolean |
Returns
Scorer<string, LLMClassifierArgs<RenderArgs>>
Levenshtein
▸ Levenshtein(args): Score | Promise<Score>
A simple scorer that uses the Levenshtein distance to compare two strings.
Parameters
| Name | Type |
|---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
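For readers unfamiliar with the metric, the following sketch computes the Levenshtein (edit) distance with the standard dynamic-programming recurrence and turns it into a 0 to 1 score. The normalization by the longer string's length is an assumption made for illustration, and both function names are hypothetical.

```typescript
// Number of single-character insertions, deletions, or substitutions needed
// to turn `a` into `b`, computed row by row.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] is the distance between the processed prefix of `a` and the
  // first j characters of `b`.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means entirely different.
function levenshteinScoreSketch(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  return maxLen === 0 ? 1 : 1 - levenshteinDistance(output, expected) / maxLen;
}

console.log(levenshteinDistance("kitten", "sitting")); // 3
console.log(levenshteinScoreSketch("abc", "abc")); // 1
```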
LevenshteinScorer
▸ LevenshteinScorer(args): Score | Promise<Score>
Parameters
| Name | Type |
|---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ListContains
▸ ListContains(args): Score | Promise<Score>
A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string[], { allowExtraEntities?: boolean ; pairwiseScorer?: Scorer<string, {}> }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
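The pairing step can be illustrated on small lists. The sketch below swaps in a brute-force search over one-to-one pairings in place of a real Linear Sum Assignment solver, uses case-insensitive equality as a stand-in pairwise scorer, and assumes the total is normalized by the longer list's length. All names are hypothetical, and none of this is the library's code.

```typescript
type PairwiseScorer = (output: string, expected: string) => number;

// Stand-in pairwise scorer; a semantic or edit-distance scorer could be
// passed instead.
const sameIgnoringCase: PairwiseScorer = (a, b) =>
  a.toLowerCase() === b.toLowerCase() ? 1 : 0;

// Best total similarity over all one-to-one pairings of rows (outputs) to
// columns (expected items). Exponential time, so only suitable for tiny lists.
function bestAssignmentTotal(sim: number[][], row: number, usedCols: boolean[]): number {
  if (row === sim.length) return 0;
  // Option 1: leave this output item unmatched.
  let best = bestAssignmentTotal(sim, row + 1, usedCols);
  // Option 2: pair it with any expected item that is still free.
  for (let col = 0; col < usedCols.length; col++) {
    if (usedCols[col]) continue;
    usedCols[col] = true;
    best = Math.max(best, sim[row][col] + bestAssignmentTotal(sim, row + 1, usedCols));
    usedCols[col] = false;
  }
  return best;
}

function listContainsSketch(
  output: string[],
  expected: string[],
  pairwiseScorer: PairwiseScorer = sameIgnoringCase,
): number {
  if (output.length === 0 && expected.length === 0) return 1;
  const sim = output.map((o) => expected.map((e) => pairwiseScorer(o, e)));
  const total = bestAssignmentTotal(sim, 0, expected.map(() => false));
  // Assumed normalization: missing or extra items lower the score.
  return total / Math.max(output.length, expected.length);
}

console.log(listContainsSketch(["a", "b"], ["B", "A"])); // 1
console.log(listContainsSketch(["a", "x"], ["a", "b"])); // 0.5
```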
Moderation
▸ Moderation(args): Score | Promise<Score>
A scorer that uses OpenAI's moderation API to determine whether an AI response contains any flagged content.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, { azureOpenAi?: AzureOpenAiAuth ; client?: OpenAI ; openAiApiKey?: string ; openAiBaseUrl?: string ; openAiDangerouslyAllowBrowser?: boolean ; openAiDefaultHeaders?: Record<string, string> ; openAiOrganizationId?: string ; threshold?: number }> |
Returns
Score | Promise<Score>
A score between 0 and 1, where 1 means content passed all moderation checks.
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
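The pass/fail decision might be sketched as follows. The result shape (an overall `flagged` boolean plus per-category scores) and the way `threshold` is applied are assumptions made for illustration; the interface and function names are hypothetical, and no API call is made here.

```typescript
// Assumed shape of a single moderation result.
interface ModerationResultSketch {
  flagged: boolean;
  category_scores: Record<string, number>;
}

// Returns 1 when the content passes and 0 when it is flagged.
// Assumption: without a threshold the overall flag decides; with a threshold,
// any category score above it counts as flagged.
function moderationScoreSketch(result: ModerationResultSketch, threshold?: number): number {
  if (threshold === undefined) {
    return result.flagged ? 0 : 1;
  }
  const scores = result.category_scores;
  const anyOver = Object.keys(scores).some((category) => scores[category] > threshold);
  return anyOver ? 0 : 1;
}

// Made-up result for illustration.
const sampleResult: ModerationResultSketch = {
  flagged: false,
  category_scores: { harassment: 0.02, violence: 0.3 },
};
console.log(moderationScoreSketch(sampleResult)); // 1
console.log(moderationScoreSketch(sampleResult, 0.1)); // 0
```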
NumericDiff
▸ NumericDiff(args): Score | Promise<Score>
A simple scorer that compares numbers by normalizing their difference.
Parameters
| Name | Type |
|---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
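One plausible way to normalize a numeric difference into a 0 to 1 score is shown below. The specific formula (absolute difference divided by the sum of absolute values) is an assumption chosen for illustration, and `numericDiffSketch` is a hypothetical name.

```typescript
// Assumed normalization: 1 - |expected - output| / (|expected| + |output|).
// Two zeros are treated as a perfect match to avoid dividing by zero.
function numericDiffSketch(output: number, expected: number): number {
  if (output === 0 && expected === 0) {
    return 1;
  }
  return 1 - Math.abs(expected - output) / (Math.abs(expected) + Math.abs(output));
}

console.log(numericDiffSketch(10, 10)); // 1
console.log(numericDiffSketch(1, 3)); // 0.5
console.log(numericDiffSketch(5, -5)); // 0
```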
OpenAIClassifier
▸ OpenAIClassifier<RenderArgs, Output>(args): Promise<Score>
Type parameters
| Name |
|---|
RenderArgs |
Output |
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<Output, OpenAIClassifierArgs<RenderArgs>> |
Returns
Promise<Score>
Possible
▸ Possible(args): Score | Promise<Score>
Test whether an output is a possible solution to the challenge posed in the input.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ input: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Security
▸ Security(args): Score | Promise<Score>
Test whether an output is malicious.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{}>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Sql
▸ Sql(args): Score | Promise<Score>
Test whether a SQL query is semantically the same as a reference (output) query.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ input: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Summary
▸ Summary(args): Score | Promise<Score>
Test whether an output is a better summary of the input than the original (expected) value.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ input: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
Translation
▸ Translation(args): Score | Promise<Score>
Test whether an output is as good a translation of the input into the specified language as an expert (expected) value.
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<string, LLMClassifierArgs<{ input: string ; language: string }>> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
ValidJSON
▸ ValidJSON(args): Score | Promise<Score>
A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create).
Parameters
| Name | Type |
|---|---|
args | ScorerArgs<any, { schema?: any }> |
Returns
Score | Promise<Score>
Defined in
node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
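The binary validity check, without the optional schema step, could be sketched like this. It assumes that only parsed objects and arrays count as valid JSON output (bare primitives score 0); that rule and the name `validJsonSketch` are assumptions, and JSON Schema validation is deliberately left out.

```typescript
// Returns 1 when the output parses to a JSON object or array, otherwise 0.
function validJsonSketch(output: unknown): number {
  let value: unknown = output;
  if (typeof output === "string") {
    try {
      value = JSON.parse(output);
    } catch (_err) {
      return 0; // Not parseable as JSON.
    }
  }
  // Assumption: primitives such as 42 or "text" do not count as valid output.
  return typeof value === "object" && value !== null ? 1 : 0;
}

console.log(validJsonSketch('{"name": "Ada"}')); // 1
console.log(validJsonSketch("not json")); // 0
```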
buildClassificationTools
▸ buildClassificationTools(useCoT, choiceStrings): ChatCompletionTool[]
Parameters
| Name | Type |
|---|---|
useCoT | boolean |
choiceStrings | string[] |
Returns
ChatCompletionTool[]
init
▸ init(«destructured»?): void
Parameters
| Name | Type |
|---|---|
«destructured» | Object |
› client? | OpenAI |
Returns
void
makePartial
▸ makePartial<Output, Extra>(fn, name?): ScorerWithPartial<Output, Extra>
Type parameters
| Name |
|---|
Output |
Extra |
Parameters
| Name | Type |
|---|---|
fn | Scorer<Output, Extra> |
name? | string |
Returns
ScorerWithPartial<Output, Extra>
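The general idea of a scorer with a `partial` method, pre-filling some arguments and returning a scorer that needs only the rest, can be sketched with simplified types. Everything below is a hypothetical re-creation: the types are cut down, the sketch requires all extra arguments to be preset at once, and it is not the library's implementation.

```typescript
type ScoreSketch = { name: string; score: number };
type BaseArgs<Output> = { output: Output; expected?: Output };
type ScorerSketch<Output, Extra> = (args: BaseArgs<Output> & Extra) => ScoreSketch;
type WithPartialSketch<Output, Extra> = ScorerSketch<Output, Extra> & {
  partial: (preset: Extra) => (args: BaseArgs<Output>) => ScoreSketch;
};

// Wraps a scorer and attaches a `partial` method that fixes the extra
// arguments ahead of time.
function makePartialSketch<Output, Extra extends object>(
  fn: ScorerSketch<Output, Extra>,
): WithPartialSketch<Output, Extra> {
  const wrapped = ((args: BaseArgs<Output> & Extra) => fn(args)) as WithPartialSketch<Output, Extra>;
  wrapped.partial = (preset: Extra) => (args: BaseArgs<Output>) =>
    fn({ ...args, ...preset });
  return wrapped;
}

// Toy scorer for illustration: does the output begin with the given prefix?
const prefixMatch: ScorerSketch<string, { prefix: string }> = ({ output, prefix }) => ({
  name: "PrefixMatch",
  score: output.indexOf(prefix) === 0 ? 1 : 0,
});

const scorer = makePartialSketch<string, { prefix: string }>(prefixMatch);
const startsWithHello = scorer.partial({ prefix: "Hello" });
console.log(startsWithHello({ output: "Hello world" }).score); // 1
```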
normalizeValue
▸ normalizeValue(value, maybeObject): string
Parameters
| Name | Type |
|---|---|
value | unknown |
maybeObject | boolean |
Returns
string
Type Aliases
LLMArgs
Ƭ LLMArgs: { maxTokens?: number ; temperature?: number } & OpenAIAuth
LLMClassifierArgs
Ƭ LLMClassifierArgs<RenderArgs>: { model?: string ; useCoT?: boolean } & LLMArgs & RenderArgs
Type parameters
| Name |
|---|
RenderArgs |
ModelGradedSpec
Ƭ ModelGradedSpec: z.infer<typeof modelGradedSpecSchema>
OpenAIClassifierArgs
Ƭ OpenAIClassifierArgs<RenderArgs>: { cache?: ChatCache ; choiceScores: Record<string, number> ; classificationTools: ChatCompletionTool[] ; messages: ChatCompletionMessageParam[] ; model: string ; name: string } & LLMArgs & RenderArgs
Type parameters
| Name |
|---|
RenderArgs |
Variables
DEFAULT_MODEL
• Const DEFAULT_MODEL: "gpt-4o"
Evaluators
• Const Evaluators: { label: string ; methods: AutoevalMethod[] }[]
modelGradedSpecSchema
• Const modelGradedSpecSchema: ZodObject<{ choice_scores: ZodRecord<ZodString, ZodNumber> ; model: ZodOptional<ZodString> ; prompt: ZodString ; temperature: ZodOptional<ZodNumber> ; use_cot: ZodOptional<ZodBoolean> }, "strip", ZodTypeAny, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>
templates
• Const templates: Record<"battle" | "closed_q_a" | "factuality" | "humor" | "possible" | "security" | "sql" | "summary" | "translation", { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>