Canary Deployments
Aegis supports canary traffic splitting, a deployment strategy where a percentage of live traffic is routed to a new upstream version (the “canary”) while the rest continues to the current primary. Aegis monitors error rates and latency for both groups independently and can automatically roll back to the primary if the canary starts failing.
How It Works
- You deploy a new version of your backend alongside the current version
- Mark the new upstream as canary in Aegis
- Set a traffic percentage (e.g., 10% to canary)
- Aegis routes traffic according to the split and tracks metrics for both groups
- If the canary is healthy, gradually increase the percentage
- If the canary fails, Aegis automatically rolls back to 100% primary
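The non-sticky split described above can be pictured as a per-request weighted coin flip. The following is a minimal sketch under that assumption, not Aegis's actual routing code; the function name `choose_group` and the seeded generator are illustrative only.

```python
import random

def choose_group(canary_percent: float, rng: random.Random) -> str:
    """Return "canary" for roughly canary_percent% of calls, else "primary"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # rng.random() is uniform on [0.0, 1.0), so scaling by 100 gives [0, 100).
    # 0 never selects the canary; 100 always does.
    return "canary" if rng.random() * 100 < canary_percent else "primary"

# Simulate 10,000 requests at a 10% split (seeded only to make the run repeatable).
rng = random.Random(42)
picks = [choose_group(10, rng) for _ in range(10_000)]
canary_share = picks.count("canary") / len(picks)
print(f"canary share: {canary_share:.1%}")
```

Because every request is decided independently, the observed share only approaches the configured percentage over many requests.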
Upstream Roles
Each upstream for a proxy host has a role:
| Role | Description |
|---|---|
| Primary | The current stable version. Receives the majority of traffic. Default role for all upstreams. |
| Canary | The new version under evaluation. Receives the configured traffic percentage. |
Configuration
| Setting | Default | Description |
|---|---|---|
| Enable Canary | false | Master toggle for canary routing |
| Traffic Percentage | 0 | Percentage of requests routed to canary upstreams (0-100) |
| Sticky Routing | false | Route the same client IP consistently to the same group |
| Auto-Rollback | true | Automatically set canary traffic to 0% when the error or latency threshold is exceeded |
| Error Threshold | 10% | Canary error rate (5xx responses) that triggers rollback |
| Latency Threshold | 0 (disabled) | Canary P95 latency in ms that triggers rollback |
| Evaluation Window | 300 seconds | Sliding window for computing metrics |
| Min Sample Size | 20 | Minimum canary requests before evaluating thresholds |
Where to Configure
- Admin UI → Hosts → edit a proxy host → Upstream section → set upstream roles → Canary Deployment card
Sticky Canary Routing
Without sticky routing, each request is routed independently at random, so a user might hit primary on one request and canary on the next. This can cause inconsistent behavior for stateful applications. With sticky routing enabled, routing is deterministic per client IP.
| Scenario | Recommended |
|---|---|
| User-facing web apps, SPAs | Sticky on |
| Stateless APIs, webhooks | Sticky off |
| Microservice-to-microservice | Sticky off |
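One common way to make routing deterministic per client IP is to hash the address into a fixed bucket and compare the bucket with the traffic percentage. The sketch below assumes that approach; the source does not state which hash or bucketing scheme Aegis uses, and `sticky_group` is a hypothetical name.

```python
import hashlib

def sticky_group(client_ip: str, canary_percent: int) -> str:
    """Map a client IP to "canary" or "primary" the same way on every call."""
    # Hash the IP and reduce it to a stable bucket in 0..99.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets below the percentage go to the canary group.
    return "canary" if bucket < canary_percent else "primary"

# The same IP always lands in the same group for a given percentage.
first = sticky_group("203.0.113.7", 10)
second = sticky_group("203.0.113.7", 10)
print(first == second)
```

A side effect of bucketing this way is that raising the percentage only adds clients to the canary group; a client already on the canary stays there as the split grows.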
Metrics and Monitoring
Aegis tracks metrics independently for primary and canary groups:
| Metric | Description |
|---|---|
| Request count | Total requests routed to each group |
| Error rate | Percentage of 5xx responses |
| P95 latency | 95th percentile response time |
| Health status | Upstream health check results |
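To make the error-rate and P95 definitions concrete, here is a small sliding-window tracker for a single group. It is a sketch under assumptions: the class name `GroupMetrics`, the nearest-rank percentile method, and counting requests within the window are illustrative choices, not details confirmed by the source.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class GroupMetrics:
    """Per-group request samples kept for a sliding time window."""
    window_seconds: float = 300.0
    # Each sample is (timestamp_seconds, http_status, latency_ms).
    samples: deque = field(default_factory=deque, init=False, repr=False)

    def record(self, timestamp: float, status: int, latency_ms: float) -> None:
        self.samples.append((timestamp, status, latency_ms))

    def snapshot(self, now: float) -> dict:
        # Drop samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        count = len(self.samples)
        if count == 0:
            return {"requests": 0, "error_rate": 0.0, "p95_ms": 0.0}
        errors = sum(1 for _, status, _ in self.samples if status >= 500)
        latencies = sorted(latency for _, _, latency in self.samples)
        # Nearest-rank P95: the value at rank ceil(0.95 * count), 1-indexed.
        rank = (95 * count + 99) // 100
        return {
            "requests": count,
            "error_rate": errors / count,
            "p95_ms": latencies[rank - 1],
        }

canary = GroupMetrics(window_seconds=300)
for i in range(20):
    canary.record(timestamp=i, status=500 if i in (3, 7) else 200, latency_ms=10 * (i + 1))
stats = canary.snapshot(now=19)
print(stats)
```

One tracker per group (primary and canary) keeps the two sets of numbers independent, which is what makes a side-by-side comparison possible.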
Live Dashboard
The canary dashboard shows a side-by-side comparison of the primary and canary groups.
Auto-Rollback
When auto-rollback is enabled, Aegis continuously evaluates canary metrics:
- Wait until the canary has received at least min_sample_size requests
- Compute the canary error rate over the evaluation window
- If the error rate exceeds the threshold → rollback
- Compute the canary P95 latency (if latency threshold is set)
- If P95 exceeds the threshold → rollback
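The evaluation steps above can be sketched as a single decision function. The defaults mirror the Configuration table; the names `RollbackPolicy` and `rollback_reason` are hypothetical, and treating "exceeds" as strictly greater than the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RollbackPolicy:
    error_threshold: float = 0.10    # 10% of canary requests returning 5xx
    latency_threshold_ms: float = 0  # 0 disables the latency check
    min_sample_size: int = 20

def rollback_reason(policy: RollbackPolicy, requests: int,
                    error_rate: float, p95_ms: float) -> Optional[str]:
    """Return why the canary should be rolled back, or None to keep it running."""
    # Too few canary requests in the window: do not evaluate thresholds yet.
    if requests < policy.min_sample_size:
        return None
    if error_rate > policy.error_threshold:
        return "error_rate"
    # The latency check only applies when a threshold is configured.
    if policy.latency_threshold_ms > 0 and p95_ms > policy.latency_threshold_ms:
        return "latency"
    return None

default_policy = RollbackPolicy()
print(rollback_reason(default_policy, requests=50, error_rate=0.15, p95_ms=120.0))
```

The minimum sample size guards against a single early failure: with 2 requests and 1 error, the error rate is 50%, but the function returns `None` until 20 requests have been observed.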
Rollback Alert
When auto-rollback triggers:
- A warning banner appears in the admin UI
- The event is logged at WARN level
- If notifications are configured, an alert is sent
Canary Lifecycle
Typical Workflow
- Deploy the new version to a separate server
- Add it as an upstream with role canary
- Enable canary routing at 5%
- Monitor for 10-30 minutes
- Increase to 25%, then 50%, then 100%
- Promote the canary to primary (swap roles)
- Remove the old primary upstream
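The ramp-up part of this workflow could be automated along the following lines. This is only a sketch: `set_percent` and `is_healthy` are hypothetical callbacks that the caller supplies (for example, wrappers around the admin UI or API plus a monitoring pause), and promotion and cleanup are left as manual steps.

```python
from typing import Callable, Iterable

def ramp_canary(stages: Iterable[int],
                set_percent: Callable[[int], None],
                is_healthy: Callable[[], bool]) -> bool:
    """Step through traffic stages; drop to 0% and stop at the first unhealthy check."""
    for percent in stages:
        set_percent(percent)
        # is_healthy is expected to include the monitoring period for this stage.
        if not is_healthy():
            set_percent(0)
            return False
    return True

# Dry run with stand-in callbacks that only record what would happen.
applied = []
completed = ramp_canary([5, 25, 50, 100], applied.append, lambda: True)
print(applied, completed)
```

Keeping the health check as a callback means the same loop works whether the check is a human confirmation or an automated metrics query.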
Manual Actions
| Action | Description |
|---|---|
| Increase / Decrease | Adjust traffic percentage |
| Promote | Swap canary and primary roles — the canary becomes the new primary |
| Rollback | Set traffic to 0% — all traffic goes to primary |
| Reset Metrics | Clear counters and start fresh |
Graceful Degradation
If all canary upstreams become unhealthy (health checks fail), Aegis automatically routes 100% of traffic to primary upstreams. This is transparent: no rollback is triggered, and when canary upstreams recover, traffic splitting resumes. If all primary upstreams become unhealthy, Aegis routes to canary upstreams as a fallback (same behavior as standard load balancing failover).
Difference from Load Balancing
| | Load Balancing | Canary |
|---|---|---|
| Purpose | Distribute load for capacity | Compare versions for safety |
| Upstreams | All run the same code | Primary and canary run different code |
| Metrics | Aggregate across all upstreams | Tracked separately per group |
| Rollback | Not applicable | Automatic based on error/latency thresholds |
| Traffic split | Based on policy (round-robin, etc.) | Based on configured percentage |
API Reference
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/hosts/{id}/canary | Get canary config and live metrics |
| PUT | /api/v1/hosts/{id}/canary | Update canary config (traffic percent, thresholds) |
| POST | /api/v1/hosts/{id}/canary/promote | Promote canary to primary (swap roles) |
| POST | /api/v1/hosts/{id}/canary/rollback | Manual rollback to 0% canary traffic |
| POST | /api/v1/hosts/{id}/canary/reset | Reset metrics counters |
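As an illustration of how these endpoints might be called, the sketch below builds requests with Python's standard library. The paths and methods come from the table above; the base URL, the host ID `42`, and the `traffic_percent` field name are assumptions, and authentication is omitted because the source does not describe it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://aegis.example.com"  # assumption: replace with your Aegis address

def canary_request(host_id: int, action: str = "", method: str = "GET",
                   payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the canary endpoints."""
    path = f"/api/v1/hosts/{host_id}/canary"
    if action:
        path += f"/{action}"  # "promote", "rollback", or "reset"
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)

get_req = canary_request(42)
# "traffic_percent" is a hypothetical field name; check the real request schema.
update_req = canary_request(42, method="PUT", payload={"traffic_percent": 10})
promote_req = canary_request(42, "promote", method="POST")
print(promote_req.get_method(), promote_req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send the request; building them separately makes it easy to inspect the URL and body first.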

