Observability Overview
Production-grade observability for the Spice Framework using OpenTelemetry.
What is Observability?
Observability gives you deep insights into your AI agent systems:
- Traces: See the complete journey of requests through your swarm
- Metrics: Track performance, costs, and usage patterns
- Monitoring: Real-time dashboards and alerting
Why Observability?
1. Performance Optimization
Find bottlenecks instantly:
- Which agents are slow?
- Which LLM calls take longest?
- Where should you add caching?
2. Cost Management
Track LLM spending:
- Monitor token usage per model
- Estimate costs in real-time
- Identify expensive operations
3. Error Tracking
Debug distributed systems:
- Trace errors across multiple agents
- See exact failure points
- Understand error patterns
4. Capacity Planning
Plan for growth:
- Monitor agent load
- Track memory usage
- Predict scaling needs
Quick Start
1. Add Dependencies
OpenTelemetry is included in spice-core:
dependencies {
implementation("io.github.no-ai-labs:spice-core:0.2.1")
}
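If you are starting from an empty project, a minimal build.gradle.kts could look like the sketch below. It assumes the artifact resolves from Maven Central and that you use the Kotlin JVM plugin (the plugin version is only an example):
plugins {
    kotlin("jvm") version "2.0.0"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("io.github.no-ai-labs:spice-core:0.2.1")
}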
2. Initialize Observability
import io.github.noailabs.spice.observability.*
// At application startup
ObservabilityConfig.initialize(
ObservabilityConfig.Config(
serviceName = "my-ai-app",
serviceVersion = "1.0.0",
otlpEndpoint = "http://localhost:4317",
enableTracing = true,
enableMetrics = true
)
)
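In real deployments you usually read these values from the environment instead of hard-coding them. A small sketch that reuses only the configuration options shown above; the environment variable names are examples, not something the framework defines:
import io.github.noailabs.spice.observability.*

fun initObservability() {
    ObservabilityConfig.initialize(
        ObservabilityConfig.Config(
            serviceName = System.getenv("SERVICE_NAME") ?: "my-ai-app",
            serviceVersion = System.getenv("SERVICE_VERSION") ?: "1.0.0",
            // Fall back to a local Jaeger collector when the variable is unset
            otlpEndpoint = System.getenv("OTLP_ENDPOINT") ?: "http://localhost:4317",
            enableTracing = true,
            enableMetrics = true
        )
    )
}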
3. Add Tracing to Agents
val agent = buildAgent {
name = "Research Agent"
handle { comm ->
// Your agent logic
SpiceResult.success(comm.reply("Result", id))
}
}.traced() // Add this line!
4. Run Jaeger
docker run -d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:latest
5. View Traces
Open http://localhost:16686 and see your traces!
What Gets Tracked?
Agent Operations
- Request duration
- Success/failure rates
- Agent-to-agent communication
- Tool executions
LLM Usage
- API calls per provider
- Token consumption
- Estimated costs
- Response times
Swarm Coordination
- Strategy execution
- Agent participation
- Consensus building
- Result aggregation
System Health
- Memory usage
- Active agents
- Error rates
- Throughput
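Because spice-core bundles OpenTelemetry, you can also record your own metrics next to the built-in ones. The sketch below uses the standard OpenTelemetry API directly and assumes ObservabilityConfig.initialize registers the global OpenTelemetry instance; the metric and attribute names are just examples:
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes

// Custom counter recorded alongside the metrics Spice emits automatically
val meter = GlobalOpenTelemetry.getMeter("my-ai-app")
val documentsProcessed = meter.counterBuilder("app.documents.processed")
    .setDescription("Documents handled by the research swarm")
    .build()

fun recordDocument(agentName: String) {
    // Tag the agent name so the metric can be broken down per agent
    documentsProcessed.add(1L, Attributes.of(AttributeKey.stringKey("agent.name"), agentName))
}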
Example: Traced Swarm
val swarm = buildSwarmAgent {
name = "Research Swarm"
quickSwarm {
// Each agent automatically traced
val researcher = buildAgent { ... }.traced()
val analyst = buildAgent { ... }.traced()
val expert = buildAgent { ... }.traced()
addAgent("researcher", researcher)
addAgent("analyst", analyst)
addAgent("expert", expert)
}
config {
debug(true)
}
}
// Execute and view complete trace
val result = swarm.processComm(comm)
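To see the whole request as a single tree in Jaeger, you can open a parent span of your own around the call, so the spans from each traced agent nest under one root. This sketch uses the plain OpenTelemetry tracing API and assumes the global OpenTelemetry instance is registered and that traced agents join the current context:
import io.opentelemetry.api.GlobalOpenTelemetry

val tracer = GlobalOpenTelemetry.getTracer("my-ai-app")

// Root span for the whole swarm run; agent spans should appear beneath it
val rootSpan = tracer.spanBuilder("research-swarm.run").startSpan()
val scope = rootSpan.makeCurrent()
try {
    val result = swarm.processComm(comm)
    // ... inspect result here
} finally {
    scope.close()
    rootSpan.end()
}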
Next Steps
- Setup Guide: Detailed configuration
- Tracing: Distributed tracing guide
- Metrics: Metrics collection
- Visualization: Dashboards and alerting