Free Webhook Handling Design Spec – A Scalable, Secure, and Reliable Approach

Posted on the 5th of February, 2025

Karl the Koala being an architect.

Introduction

The Problem with Webhooks

Nowadays, it is very common for businesses to rely heavily on webhooks to receive real-time notifications that update states within their systems and trigger workflows between systems.

As the CTO and founder of Imburse, a payments company, I handled webhooks from various third parties. While it was easy at first, real challenges started to plague the team regarding maintaining the system and adding new services. It also meant I had many more services within the scope of compliance and audit checks.

The problem ultimately came down to the following facts:

  • It is, in fact, an integration, and therefore each integration will require its own flavour of customisation.
  • Significant upfront investment, and considering what the service does, it is difficult to amortise and make money on it. It is, however, a "necessary evil" for empowering the system.
  • Maintenance: after the initial investment, these types of services always require 10-30% ongoing maintenance effort, and they still degrade over time.
  • No IP: this is not a service that produces intellectual property. I would rather have had my engineers focus on features we could actually sell, and pay them bonuses for it.

Why It Matters

Handling webhooks is a necessary evil service. Although we can most likely never profit from them, they are still required to power the services that will. A robust webhook management solution ensures real-time event delivery, reduces integration complexity, and maintains system reliability.

As microservices, SaaS applications, and event-driven architectures have become more common over the years, companies need a scalable and secure approach to handling webhooks efficiently.

The reality is that, without a robust webhook handling solution, you risk issues such as:

  • Missed events requiring additional polling and manual loading of data.
  • Having your services and systems taken down during traffic spikes, effectively being DDoS'd and compromised.
  • Team morale deterioration: when a service is a thorn in the side of engineering teams, they tend to over-focus on it as the first suspect whenever something fails.
  • Failing compliance audits.
  • Storing data that eventually becomes costly.

Chasing these issues wastes time, especially when a different service turns out to be at fault.

I wrote this document to share the thoughts and considerations that went into building a secure, scalable, and reliable webhook handling system, for those who wish to develop their own. It is intended to be educational and informative for those who think webhooks are uncomplicated; at face value, they are.

Business Needs and Impact

1.1. Real-Time Event Processing

Business Value:

  • Ensure timely updates for users and business operations. Real-time data is invaluable, but one must be realistic about what real-time means.
  • It reduces the need to build polling services, which incur high bandwidth and compute costs and offer little reward. Polling services should be considered harmful value-adds: they only cost and never add real value.
  • Improves responsiveness of applications, leading to better user experience. Happy users are key to low churn.

Technical Considerations:

  • An event-driven architecture approach makes sense for implementing "real-time" notifications. This approach uses explicit message queues (Kafka, RabbitMQ, SQS) to offload the initial event from the handler and decouple event processing.
  • Define what "real-time" means to you. It will be part of your SLA, and answering what a realistic processing time is, is critical, as it affects overall architectural cost; e.g. an SLA that requires an event to be processed and delivered in < 10 ms will cost drastically more to achieve than an SLA that allows up to 60 seconds. The answer should be framed in the context of your needs.
  • To achieve near-100% uptime, a multi-regional deployment and a global API gateway for routing are required. This typically doubles infrastructure cost and complicates any "real-time" database state changes needed to handle events where idempotency requirements are high.
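
As a minimal sketch of the offload pattern described above, the handler below only parses and enqueues, returning immediately, while a background worker drains the queue. An in-process `queue.Queue` stands in for Kafka, RabbitMQ, or SQS, and the function and event names are illustrative:

```python
import json
import queue
import threading

event_queue: "queue.Queue[dict]" = queue.Queue()
processed: list[dict] = []

def handle_webhook(raw_body: str) -> int:
    """Simulated handler: parse, enqueue, and return 202 immediately."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # reject malformed payloads before they reach the queue
    event_queue.put(event)
    return 202  # accepted for asynchronous processing

def worker() -> None:
    """Background worker: the only place real processing happens."""
    while True:
        event = event_queue.get()
        if event is None:  # sentinel: shut the worker down
            break
        processed.append(event)  # real work: transform, deliver, log
        event_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
status = handle_webhook('{"type": "payment.settled", "id": "evt_1"}')
event_queue.join()     # wait until the worker has drained the queue
event_queue.put(None)  # stop the worker
t.join()
```

Because the handler does no processing itself, its response time stays flat under load; the queue absorbs spikes and the workers scale independently.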

1.2. Support for Multiple Webhook Sources

Business Value:

  • Allows the business to partner and expand its value proposition and decouples from vendor lock-in.
  • Allows a business to scale in volume and into new markets and verticals.
  • Future-proofs a business for a fast-changing landscape, where agility to adapt or die is critical.

Technical Considerations:

  • Make use of an adapter pattern to handle different webhook payload structures.
  • Implement payload validation using JSON Schema, Avro, or Protobuf and a schema registry to govern any breaking changes.
  • Support multiple common authentication types, such as API keys, OAuth, HMAC, mutual TLS, and Basic authentication.
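
A minimal sketch of the adapter pattern: each provider adapter maps that provider's payload shape onto one internal event structure. The provider names and payload fields below are illustrative, not real provider schemas.

```python
from typing import Any, Callable

def provider_a_adapter(payload: dict[str, Any]) -> dict[str, Any]:
    # provider A nests the event body under "data" and calls the kind "type"
    return {"type": payload["type"], "id": payload["id"], "data": payload["data"]}

def provider_b_adapter(payload: dict[str, Any]) -> dict[str, Any]:
    # provider B uses "event_name"/"event_id" and a flat "body" field
    return {"type": payload["event_name"], "id": payload["event_id"], "data": payload["body"]}

ADAPTERS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "provider_a": provider_a_adapter,
    "provider_b": provider_b_adapter,
}

def normalise(source: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Route a raw payload to the right adapter for a uniform internal event."""
    return ADAPTERS[source](payload)
```

Adding a new webhook source then means writing one adapter function and registering it, leaving every downstream consumer untouched.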

1.3. Reliability and Guaranteed Delivery

Business Value:

  • Prevents lost events, ensuring system consistency and trust. A single lost event could cost a company a lot. For Imburse, this meant that we would have our clients write to the regulators because a payment was missed.
  • It reduces operational effort: fewer support staff, fewer tickets, and less product and engineering time spent troubleshooting failed event deliveries.
  • Business continuity by ensuring events are processed, ticking the compliance boxes one at a time!

Technical Considerations:

  • As mentioned, the infrastructure must have well-defined SLAs and SLOs, multi-regional deployments across fault domains, and "real-time" data consistency.
  • Use an idempotency strategy using unique event identifiers that prevent replays (including replay attacks).
  • Use dead-letter queues to capture failed events so they can later be retried through manual intervention.
  • Allow subscribers to define their retry policies, giving systems time to recover under load. An example is exponential backoff times and retry count.
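
A minimal idempotency sketch, assuming each event carries a unique identifier: a seen-ID store rejects duplicate deliveries and replays. In production this would be a database unique constraint or a Redis set with a TTL, with the check-and-insert made atomic; the in-memory set here is purely illustrative.

```python
from typing import Callable

seen_ids: set[str] = set()

def process_once(event_id: str, handler: Callable[[], None]) -> bool:
    """Run the handler only for unseen event IDs; return False on a replay."""
    if event_id in seen_ids:
        return False  # duplicate or replayed event: safely ignored
    seen_ids.add(event_id)  # in production: an atomic insert-if-absent
    handler()
    return True
```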

1.4. Security and Compliance

Business Value:

  • Security threats are real, and compliance requirements must be met. Doing so protects sensitive data and prevents security breaches and fines. If designed well, the system should also reduce the number of control points in an audit.
  • It ensures ongoing compliance with the relevant standards and regulations, such as GDPR, PCI DSS, SOC 2, ISO 27001, and HIPAA.
  • Customer confidence: a secure posture boosts customers' trust and strengthens their own security posture.

Technical Considerations:

  • Enforcing a webhook signature verification (HMAC, JWT, or signed payloads) when possible.
  • Implement fine-grained Role-Based Access Control (RBAC) to restrict webhook configurations and, in multi-tenanted systems, ensure strong data partitioning across tenants.
  • Encrypt data in transit and at rest.
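
A sketch of HMAC signature verification, the most common of the schemes above: the sender signs the raw request body with a shared secret, and the receiver recomputes the digest and compares in constant time. Header names and secret handling vary by provider; the values here are illustrative.

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison against the expected HMAC-SHA256 digest."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # hmac.compare_digest avoids timing side channels on the comparison
    return hmac.compare_digest(expected, signature_hex)
```

Note that verification must run against the raw bytes as received; re-serialising a parsed payload will usually change the digest.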

1.5. Operations and Telemetry

Business Value:

  • Real-time alerting and incident response reduce downtime.
  • Captured metrics can provide insights into webhook traffic patterns for capacity planning and trends.
  • Engineering becomes empowered with better debugging tools, reducing mean time to resolution (DORA MTTR).

Technical Considerations:

  • Make use of OTel for logs, metrics and traces.
  • Use paid or open-source tools to capture and process the OTel data, and build monitoring and observability of system health on top of it.
  • Define your internal SLAs and SLOs that you want to achieve.

2. Architectural Needs and Business Considerations

2.1. Scalable Webhook Ingestion

Business Impact:

  • Handling increasing webhook traffic without performance degradation builds confidence and allows a higher volume of traffic, resulting in more positive cash flow (if the volume is billable).
  • Support for high availability and disaster recovery strategies.
  • Spikes can bring down systems during peak usage, so handling them is essential. Doing so gives downstream consumers a good experience and avoids accidental DDoS or system overload.

Technical Considerations:

  • Use event streaming technologies for high-throughput ingestion.
  • Use the cloud-native elastic capabilities for auto-scaling to scale on demand.
  • Store events only when needed, using databases that excel at handling and querying large numbers of rows with little operational effort.

2.2. Event Normalisation and Transformation

Business Impact:

  • Providing a consistent data format reduces complexity for downstream service handlers and results in less maintenance and upfront investment in new services.
  • Engineers save time through automated data transformation and reduced testing requirements.
  • Interoperability across different platforms and services for extensibility is vastly improved, leading to easier expansion of new features and services.

Technical Considerations:

  • Transform as early on in the process as possible and send using the CloudEvent standard.
  • Support dynamic field mapping to extract values from payloads and headers.
  • Store normalised events using a structured schema (CloudEvents format preferred), using the JSON format if possible, as it's easier to query at rest.
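
A sketch of wrapping a raw provider payload in a CloudEvents 1.0 JSON envelope, as recommended above; `specversion`, `id`, `source`, and `type` are the required attributes. The source URI and event type below are illustrative.

```python
import uuid
from datetime import datetime, timezone

def to_cloudevent(source: str, event_type: str, data: dict) -> dict:
    """Wrap a normalised payload in the CloudEvents 1.0 envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),  # unique ID, also usable for idempotency
        "source": source,         # URI-reference identifying the producer
        "type": event_type,       # reverse-DNS style event type
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }
```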

2.3. Asynchronous Processing and Queueing

Business Impact:

  • Enhancing system resilience and decoupling event processing lowers infrastructure and opex costs.
  • Reducing the risk of system overload during high-traffic periods means no system is compromised and scaling does not require costly infrastructure.
  • It enables better scalability with distributed event processing and, therefore, better volume handling in the future.

Technical Considerations:

  • For critical events, make use of worker queues with priority scheduling.
  • Enable retry mechanisms with dynamic time-based backoff based on the downstream subscriber's needs.
  • Process webhook events in parallel using worker pools.
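
The subscriber-defined backoff idea can be sketched as follows. The policy field names (`base_seconds`, `max_retries`, `cap_seconds`) are illustrative, not a standard schema:

```python
import time

def backoff_schedule(policy: dict) -> list[float]:
    """Delays of base * 2^n seconds, capped, one per allowed retry."""
    return [
        min(policy["base_seconds"] * (2 ** n), policy["cap_seconds"])
        for n in range(policy["max_retries"])
    ]

def deliver_with_retries(send, policy: dict, sleep=time.sleep) -> bool:
    """Attempt delivery, waiting per the subscriber's schedule between failures."""
    for delay in [0.0] + backoff_schedule(policy):
        sleep(delay)
        if send():  # send() returns True on a successful delivery
            return True
    return False  # exhausted: hand over to the dead-letter queue
```

Capping the delay matters: without it, a long outage pushes retries so far apart that recovery is effectively silent.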

2.4. Deliver Events and Subscriber Management

Business Impact:

  • Allows the distribution of events to power different workflows that unlock operational efficiency or workflows that power other business processes, e.g., billing and payment triggering.
  • Improves efficiency by filtering and transforming events before delivery.
  • Enhances developer experience by offering fine-grained subscription controls.

Technical Considerations:

  • Allow subscribers to define event filtering rules (e.g., by event type, source).
  • Support webhook replays for debugging and recovery.
  • Implement multiple delivery mechanisms (HTTP, WebSockets, Kafka, gRPC).
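
A sketch of the subscriber-side filtering rules: a subscription optionally restricts the event types and sources it receives. The rule field names are illustrative, not a standard schema.

```python
def matches(subscription: dict, event: dict) -> bool:
    """Return True if the event passes the subscription's filter rules."""
    allowed_types = subscription.get("event_types")
    allowed_sources = subscription.get("sources")
    if allowed_types and event["type"] not in allowed_types:
        return False
    if allowed_sources and event["source"] not in allowed_sources:
        return False
    return True  # no rule set means "deliver everything"
```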

3. Handling Success and Failure Scenarios

3.1. Happy Path (Successful Execution)

  1. A webhook request is received and authenticated.
  2. The request is validated and transformed.
  3. The events are written to an event stream or queue for asynchronous processing.
  4. Background workers process the event and invoke necessary actions.
  5. The events are delivered to one or more subscribers.
  6. The system logs event metadata for traceability.
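
The six steps above can be collapsed into one in-process sketch. Authentication and transformation are stubbed, steps 3 and 4 are folded into a direct call for brevity, and the function names are illustrative:

```python
from typing import Callable

def pipeline(raw: dict, subscribers: list[Callable[[dict], object]], log: list) -> list:
    # 1-2. authenticate + validate/transform (stubbed as a shape check)
    if "type" not in raw:
        raise ValueError("invalid event")
    event = {"type": raw["type"], "data": raw}
    # 3-4. enqueue + background processing, collapsed into a direct pass here
    deliveries = []
    # 5. deliver to each subscriber callback
    for deliver in subscribers:
        deliveries.append(deliver(event))
    # 6. log event metadata for traceability
    log.append({"type": event["type"], "delivered_to": len(subscribers)})
    return deliveries
```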

3.2. Sad Paths (Failures and Recovery Mechanisms)

Sad paths are where developers spend most of their time. Instead of listing the common sad paths here, I have created a separate post covering the sad paths and challenges of handling webhooks.

4. Performance Considerations Under Load

  • Make use of cloud-native capabilities for horizontal scaling to handle increased webhook traffic.
  • Use event streaming technologies over queue-based technologies to prevent event backlogs. You can still achieve dead lettering using change detection techniques.
  • Implement circuit breakers to avoid cascading failures.
  • Prioritise mission-critical events over lower-priority events.
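
A minimal circuit-breaker sketch for the point above: after a threshold of consecutive failures the breaker opens and short-circuits further calls, protecting a struggling downstream. Half-open probing and reset timeouts are omitted for brevity; the class and names are illustrative.

```python
from typing import Callable

class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0   # consecutive failure count
        self.open = False   # open = stop calling the downstream

    def call(self, fn: Callable[[], object]):
        if self.open:
            raise RuntimeError("circuit open")  # fail fast, no downstream call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True  # trip: downstream is considered unhealthy
            raise
        self.failures = 0  # any success resets the streak
        return result
```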

5. Data Retention and Security

  • To avoid costly hot and cold storage, only store data that is needed. A data warehouse store should be opt-in rather than the default.
  • Since payloads may contain PII or other sensitive data, the locality of data storage is a hard requirement and must be taken seriously to avoid compliance pain later.
  • Implement access-controlled event storage and audit logs for compliance.
  • Use fine-grained RBAC for managing webhook subscribers.
  • Automate secret rotation for webhook authentication.
  • Define data retention policies to minimise storage overhead.

Conclusion

A well-designed webhook handling system drives high business value and delivers developer efficiency, system reliability, and business scalability.

Implementing asynchronous processing, smart retries, security best practices, and observability tools can help organisations build a future-proof, high-performance webhook processing platform.

This document aims to provide a comprehensive, developer-focused technical foundation for designing a scalable, secure, and resilient webhook management solution that is optimised for real-world use cases.