Webhook Reliability Monitoring
Webhooks are the backbone of modern integrations. When your webhook endpoints go down, you miss critical events — failed payments, new orders, user actions. APIAssert helps ensure your webhooks are always ready to receive.
The Problem
Webhook failures are silent killers:
- Missed events — Payment provider sends event, your endpoint is down
- Processing failures — Endpoint returns 200 but doesn't process
- Queue buildup — Events accumulate while you're unaware
- Retry exhaustion — Providers give up after failed retries
Unlike APIs you call, webhooks are pushed to you. If your endpoint is down, you might not know until data is missing.
How APIAssert Helps
Monitor Endpoint Availability
Ensure your webhook endpoint is reachable:
Monitor: Stripe Webhook Endpoint
URL: POST /webhooks/stripe
Headers:
Content-Type: application/json
Body: {"type": "ping"}
Assertions:
✓ Status code == 200 or 400
✓ Response time < 1000ms
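For readers who want to see what this check amounts to in code, here is a minimal sketch. It is not APIAssert's implementation; the function names `evaluateProbe` and `probeWebhook` are hypothetical, and the built-in `fetch` assumes Node 18 or later.

```javascript
// Pure check that mirrors the two assertions in the monitor above.
function evaluateProbe({ status, elapsedMs }) {
  const failures = [];
  // 200 = accepted; 400 = the handler is up but rejected the unsigned ping.
  if (status !== 200 && status !== 400) {
    failures.push(`unexpected status ${status}`);
  }
  if (elapsedMs >= 1000) {
    failures.push(`slow response: ${elapsedMs}ms`);
  }
  return { ok: failures.length === 0, failures };
}

// Sends the ping body from the monitor and times the round trip.
async function probeWebhook(url) {
  const started = Date.now();
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ type: 'ping' }),
  });
  return evaluateProbe({
    status: response.status,
    elapsedMs: Date.now() - started,
  });
}
```

Accepting 400 alongside 200 reflects that a handler which rejects an unsigned test payload is still reachable.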
Monitor Processing Health
Check that webhook processing is working:
Monitor: Webhook Queue Health
URL: GET /api/webhooks/health
Assertions:
✓ $.queue_size < 100
✓ $.processing == true
✓ $.last_processed within 5 minutes
✓ $.error_rate < 0.05
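The four assertions above can be read as simple predicates over the health response. The sketch below assumes the field names from the monitor (`queue_size`, `processing`, `last_processed`, `error_rate`) and that `last_processed` is an ISO 8601 timestamp; `evaluateHealth` is a hypothetical name.

```javascript
// Evaluates a health response against the four assertions above.
function evaluateHealth(health, now = new Date()) {
  const fiveMinutesMs = 5 * 60 * 1000;
  // An unparseable timestamp yields NaN, which fails the comparison below.
  const ageMs = now.getTime() - new Date(health.last_processed).getTime();
  const checks = {
    queue_size: health.queue_size < 100,
    processing: health.processing === true,
    last_processed: ageMs <= fiveMinutesMs,
    error_rate: health.error_rate < 0.05,
  };
  const failed = Object.keys(checks).filter((name) => !checks[name]);
  return { healthy: failed.length === 0, failed };
}
```

Returning the names of the failed checks, rather than a single boolean, makes it easier to see which part of the pipeline is unhealthy.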
Monitor Multiple Endpoints
Different services, different endpoints:
Monitors:
✓ POST /webhooks/stripe
✓ POST /webhooks/github
✓ POST /webhooks/slack
✓ POST /webhooks/shopify
Common Webhook Sources
Payment Providers
| Provider | Common Events |
|---|---|
| Stripe | payment_intent.succeeded, customer.subscription.updated |
| PayPal | PAYMENT.CAPTURE.COMPLETED, BILLING.SUBSCRIPTION.ACTIVATED |
| Square | payment.updated, refund.created |
E-commerce
| Platform | Common Events |
|---|---|
| Shopify | orders/create, products/update, inventory_levels/update |
| WooCommerce | order.created, product.updated |
| BigCommerce | store/order/created, store/product/updated |
Developer Tools
| Service | Common Events |
|---|---|
| GitHub | push, pull_request, issues |
| GitLab | Push Hook, Merge Request Hook |
| Jira | jira:issue_created, jira:issue_updated |
Communication
| Service | Common Events |
|---|---|
| Slack | message, app_mention, reaction_added |
| Twilio | message.received, call.completed |
| SendGrid | delivered, open, bounce |
Real-World Example
The Scenario
A marketplace app receives order webhooks from Shopify. When its webhook endpoint returns intermittent 502 errors, orders are not recorded in its system. Customers receive products, but the marketplace has no record of the sale.
The APIAssert Solution
Monitor 1: Webhook Endpoint
URL: POST /webhooks/shopify
Headers: X-Shopify-Topic: orders/create
Body: {"test": true}
Assertions:
✓ Status code == 200
✓ Response time < 2000ms
Interval: 1 minute
Monitor 2: Order Processing Health
URL: GET /api/orders/sync-status
Assertions:
✓ $.pending_webhooks < 50
✓ $.last_processed within 5 minutes
✓ $.success_rate > 0.99
Interval: 2 minutes
The Outcome
- Endpoint monitoring caught 502 errors within 1 minute
- Alert triggered before significant order backlog
- Root cause: memory leak in webhook handler
- Fix deployed with zero missed orders
What to Assert
Response Codes
Your webhook endpoint should respond correctly:
✓ 200 — Event processed successfully
✓ 202 — Event accepted for processing
✓ 400 — Invalid payload (your code, not downtime)
✗ 500 — Server error (problem!)
✗ 502 — Gateway error (problem!)
✗ 503 — Service unavailable (problem!)
Response Time
Webhook providers have timeout limits:
| Provider | Timeout |
|---|---|
| Stripe | 20 seconds |
| GitHub | 10 seconds |
| Shopify | 5 seconds |
| Slack | 3 seconds |
Set assertions below these thresholds:
Assertion: Response time < 3000ms
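One way to pick a threshold per provider is to take a fraction of its timeout. The sketch below uses the timeouts from the table above; the 0.5 safety factor and the name `responseTimeThreshold` are assumptions for illustration, not recommendations from any provider.

```javascript
// Provider timeouts from the table above, in milliseconds.
const PROVIDER_TIMEOUT_MS = new Map([
  ['stripe', 20000],
  ['github', 10000],
  ['shopify', 5000],
  ['slack', 3000],
]);

// Assumption: alert at a fraction of the provider timeout to leave headroom.
function responseTimeThreshold(provider, safetyFactor = 0.5) {
  const timeout = PROVIDER_TIMEOUT_MS.get(provider);
  if (timeout === undefined) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  return Math.floor(timeout * safetyFactor);
}
```

With the default factor, a Slack endpoint would be asserted at 1500ms, well inside the 3 second limit.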
Health Metrics
If you expose a health endpoint, it might return something like this:
// GET /api/webhooks/health
{
"status": "healthy",
"queue_size": 12,
"processing_rate": 150,
"error_rate": 0.001,
"last_event": "2024-12-11T14:30:00Z"
}
Assertions:
- $.status equals "healthy"
- $.queue_size less than threshold
- $.error_rate less than 0.05
- $.last_event within expected timeframe
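If you do not have a health endpoint yet, the payload shown above could be assembled from a few counters your handler already tracks. This is a sketch under assumptions: `buildHealth` and its input names are hypothetical, `processing_rate` is taken to mean events processed in the last minute, and the 0.05 cut-off simply reuses the error-rate assertion.

```javascript
// Builds a health payload in the shape shown above from simple counters.
function buildHealth({ queueSize, processedLastMinute, failedLastMinute, lastEventAt }) {
  const total = processedLastMinute + failedLastMinute;
  // Avoid dividing by zero when no events arrived in the window.
  const errorRate = total === 0 ? 0 : failedLastMinute / total;
  return {
    status: errorRate < 0.05 ? 'healthy' : 'degraded',
    queue_size: queueSize,
    processing_rate: processedLastMinute,
    error_rate: errorRate,
    last_event: lastEventAt.toISOString(),
  };
}
```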
Best Practices
Respond Fast, Process Later
Your webhook endpoint should:
- Validate the request (signature, structure)
- Queue the event for processing
- Return 200 immediately
This keeps slow processing from pushing the response past the provider's timeout.
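The three steps above can be sketched without any web framework. The in-memory array stands in for a real queue or broker, and `receiveWebhook` and `drainQueue` are hypothetical names; signature checking is reduced to a boolean passed in by the caller.

```javascript
// In-memory queue standing in for a real broker (an assumption for this sketch).
const pendingEvents = [];

// Returns the HTTP status the endpoint should send back.
function receiveWebhook(rawBody, signatureIsValid) {
  if (!signatureIsValid) return 400;
  let event;
  try {
    event = JSON.parse(rawBody);
  } catch {
    return 400; // malformed JSON
  }
  pendingEvents.push(event); // defer the slow work
  return 200;
}

// Runs outside the request path, for example on a timer or in a worker.
function drainQueue(processEvent) {
  let processed = 0;
  while (pendingEvents.length > 0) {
    processEvent(pendingEvents.shift());
    processed += 1;
  }
  return processed;
}
```

Because `receiveWebhook` only validates and enqueues, its latency stays roughly constant no matter how expensive `processEvent` is.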
Monitor the Queue, Not Just the Endpoint
The endpoint can be up while processing is stuck:
Endpoint Up + Queue Growing = Problem
Endpoint Up + Queue Stable = Healthy
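The rule above can be made concrete by comparing recent queue-size samples. This sketch treats a queue as growing when the last few samples are strictly increasing; that definition, the default of three samples, and the names `queueIsGrowing` and `diagnose` are all assumptions.

```javascript
// samples: queue sizes ordered oldest to newest.
function queueIsGrowing(samples, minSamples = 3) {
  if (samples.length < minSamples) return false;
  const recent = samples.slice(-minSamples);
  // Growing = every sample is strictly larger than the one before it.
  return recent.every((size, i) => i === 0 || size > recent[i - 1]);
}

// Combines endpoint reachability with the queue trend.
function diagnose(endpointUp, samples) {
  if (!endpointUp) return 'endpoint down';
  return queueIsGrowing(samples) ? 'problem: queue growing' : 'healthy';
}
```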
Use Signature Verification
Most providers sign webhooks. Verify in your handler:
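// Stripe example: `body` must be the raw request body (not parsed JSON),
// and `signature` is the value of the Stripe-Signature header.
// constructEvent throws if the signature does not match.
const event = stripe.webhooks.constructEvent(
  body,
  signature,
  webhookSecret
);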
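For providers without an SDK helper, verification usually comes down to recomputing an HMAC over the raw body. The sketch below follows the shape of GitHub's `X-Hub-Signature-256` header (`sha256=` followed by a hex digest); other providers use different headers and encodings, so treat `verifySignature` as a hypothetical starting point and check each provider's documentation.

```javascript
const crypto = require('crypto');

// Recomputes the HMAC-SHA256 of the raw body and compares it to the header.
function verifySignature(rawBody, signatureHeader, secret) {
  const expected =
    'sha256=' + crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(String(signatureHeader || ''));
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.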
Implement Idempotency
Webhooks can be retried. Use event IDs to prevent duplicate processing:
if (await wasProcessed(event.id)) {
return res.status(200).send('Already processed');
}
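The `wasProcessed` helper in the snippet above is not defined in this guide. One hypothetical way to back it is a set of seen event IDs, shown here in memory for brevity; a real deployment would more likely use a database or cache so the record survives restarts.

```javascript
// In-memory stand-in for a durable store (an assumption for this sketch).
const processedEventIds = new Set();

function wasProcessed(eventId) {
  return processedEventIds.has(eventId);
}

// Processes an event at most once per event ID.
function handleOnce(event, processEvent) {
  if (wasProcessed(event.id)) return 'duplicate';
  processEvent(event);
  // Mark only after success so a failed attempt can be retried.
  processedEventIds.add(event.id);
  return 'processed';
}
```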
Log Everything
When debugging webhook issues, logs are essential:
console.log('Webhook received:', {
type: event.type,
id: event.id,
timestamp: new Date().toISOString()
});
Alert Configuration
Critical (Immediate)
Condition: Endpoint returns 5xx
Action: Page on-call engineer
Reason: Events are being dropped
Warning (Slack)
Condition: Response time > 2000ms
Action: Notify #engineering
Reason: Risk of timeout, needs investigation
Monitoring (Email)
Condition: Queue size > threshold
Action: Email team
Reason: Processing may be falling behind
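The three tiers above can be expressed as a small routing function. This is an illustrative sketch rather than APIAssert configuration; `routeAlert`, its input shape, and the default queue threshold of 100 are assumptions.

```javascript
// Maps a check result to one of the three alert tiers above, or null.
function routeAlert({ status, responseTimeMs, queueSize }, queueThreshold = 100) {
  if (status >= 500 && status <= 599) {
    return { severity: 'critical', action: 'page on-call engineer' };
  }
  if (responseTimeMs > 2000) {
    return { severity: 'warning', action: 'notify #engineering' };
  }
  if (queueSize > queueThreshold) {
    return { severity: 'monitoring', action: 'email team' };
  }
  return null; // nothing to alert on
}
```

Checking the most severe condition first means a slow 5xx response pages on-call instead of only posting a warning.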
Getting Started
- List your webhook endpoints — What services send you events?
- Create monitors per endpoint — POST to each with test payload
- Add health monitoring — If you have health endpoints
- Set appropriate assertions — Response code + time
- Configure alerts — Critical events to PagerDuty
Related Use Cases
- Payment API Monitoring — Payment webhook monitoring
- Third-party API Monitoring — Monitor APIs you depend on