| name | using-braintrust |
| description | Enables AI agents to use Braintrust for LLM evaluation, logging, and observability. Includes scripts for querying logs with SQL, running evals, and logging data. |
| version | 1.0.0 |
Using Braintrust
Braintrust is a platform for evaluating, logging, and monitoring LLM applications.
Querying logs with SQL
Use the query_logs.py script to run SQL queries against Braintrust logs.
Always share the SQL query you used when reporting results, so the user understands what was executed.
Script location: scripts/query_logs.py (relative to this file)
Run it from the user's project directory (the one containing a `.env` file with `BRAINTRUST_API_KEY`):
uv run /path/to/scripts/query_logs.py --project "Project Name" --query "SQL_QUERY"
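For example, a hypothetical invocation that counts the last day's logs for a project named "Chatbot" (the project name is a placeholder; keep the script path as it exists on your machine):

```
uv run /path/to/scripts/query_logs.py \
  --project "Chatbot" \
  --query "SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day"
```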
Common queries
Count logs from the last 24 hours:
SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day
Get recent logs:
SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 10
Filter by metadata:
SELECT input, output FROM logs WHERE metadata.user_id = 'user123' LIMIT 20
Filter by time range:
SELECT * FROM logs WHERE created > now() - interval 7 day LIMIT 50
Aggregate by field:
SELECT metadata.model, count(*) as count FROM logs GROUP BY metadata.model
Group by hour:
SELECT hour(created) as hr, count(*) as count FROM logs GROUP BY hour(created)
SQL quirks in Braintrust
- Time functions: Use `hour()`, `day()`, `month()`, `year()` instead of `date_trunc()`
  - ✅ `hour(created)`
  - ❌ `date_trunc('hour', created)`
- Intervals: Use `interval 1 day`, `interval 7 day`, `interval 1 hour` (no quotes, singular unit)
- Nested fields: Use dot notation: `metadata.user_id`, `scores.Factuality`, `metrics.duration`
- Table name: Always use `FROM logs` (the script handles project scoping)
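Putting these quirks together, a sketch of an hourly breakdown for a single user over the last day (it assumes your logs actually contain a `metadata.user_id` field):

```sql
-- Hypothetical example: hourly log counts for one user over the last day.
SELECT hour(created) AS hr, count(*) AS count
FROM logs
WHERE created > now() - interval 1 day
  AND metadata.user_id = 'user123'
GROUP BY hour(created)
```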
SQL reference
Operators:
- `=`, `!=`, `>`, `<`, `>=`, `<=`
- `IS NULL`, `IS NOT NULL`
- `LIKE 'pattern%'`
- `AND`, `OR`, `NOT`
Aggregations:
- `count(*)`, `count(field)`
- `avg(field)`, `sum(field)`
- `min(field)`, `max(field)`
Time filters:
- `created > now() - interval 1 day`
- `created > now() - interval 7 day`
- `created > now() - interval 1 hour`
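These pieces combine in the same way. A sketch that averages a score per model over the last week (it assumes a `Factuality` score and a `metadata.model` field exist in your project, which may not be the case):

```sql
-- Hypothetical example: average Factuality score and log count per model, last 7 days.
SELECT metadata.model, avg(scores.Factuality) AS avg_factuality, count(*) AS n
FROM logs
WHERE created > now() - interval 7 day
  AND scores.Factuality IS NOT NULL
GROUP BY metadata.model
```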
Logging data
Use scripts/log_data.py to log data to a project:
uv run /path/to/scripts/log_data.py --project "Project Name" --input "query" --output "response"
With metadata:
--input "query" --output "response" --metadata '{"user_id": "123"}'
Batch from JSON:
--data '[{"input": "a", "output": "b"}, {"input": "c", "output": "d"}]'
Running evaluations
Use scripts/run_eval.py to run evaluations:
uv run /path/to/scripts/run_eval.py --project "Project Name" --data '[{"input": "test", "expected": "test"}]'
From file:
--data-file test_cases.json --scorer factuality
Setup
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
Writing evaluation code (SDK)
For custom evaluation logic, use the SDK directly.
IMPORTANT: First argument to Eval() is the project name (positional).
import braintrust
from autoevals import Factuality
braintrust.Eval(
    "My Project",  # Project name (required, positional)
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),
    scores=[Factuality],
)
Common mistakes:
- ❌ `Eval(project_name="My Project", ...)` - Wrong!
- ❌ `Eval(name="My Project", ...)` - Wrong!
- ✅ `Eval("My Project", data=..., task=..., scores=...)` - Correct!
Writing logging code (SDK)
import braintrust
logger = braintrust.init_logger(project="My Project")
logger.log(input="query", output="response", metadata={"user_id": "123"})
logger.flush() # Always flush!
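Scores and other fields can be logged alongside the input and output. A minimal sketch, assuming `logger.log` also accepts a `scores` dict of 0-1 values (check this against the SDK version you have installed; the score name here is made up):

```python
import braintrust

logger = braintrust.init_logger(project="My Project")
logger.log(
    input="query",
    output="response",
    metadata={"user_id": "123"},
    scores={"helpfulness": 0.9},  # Hypothetical score name; values are typically between 0 and 1
)
logger.flush()  # Still required so the event is sent before the process exits
```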
Common issues
- "Eval() got an unexpected keyword argument 'project_name'": Use positional argument
- Logs not appearing: Call
logger.flush()after logging - Authentication errors: Create
.envfile withBRAINTRUST_API_KEY=your-key