Handling Webhooks – The Sad Paths

Posted on the 5th of February, 2025

Karl the Koala being an architect.

Introduction


In a previous post, "Free Webhook Handling Design Spec – A Scalable, Secure, and Reliable Approach", I listed the considerations behind a webhook handling service, covering both the business value and the technical trade-offs. In it, I mentioned sad paths - the failure modes most likely to occur and to hurt you operationally in cost, time and damaged reputation. This post covers some of the key pain points and sad paths you have to acknowledge and handle if you were to build your own webhook handling service.

1. Payload Parsing Purgatory

Webhooks are technically an integration, which means there is a level of customisation in handling each one. Every provider has its own idea of what a webhook payload should look like. Although JSON is the de facto standard, you'll also encounter older services that send XML, form-encoded data, binary data, and sometimes just plain custom weirdness. You'll spend weeks trying to write generic parsers, handling unique edge cases and optional values. Worse, they'll break the moment a provider updates their API without testing backwards compatibility, which they will, and often without notice. The sad path in these scenarios is constant maintenance from engineering teams and mistrust born of brittle integrations. You and your team will forever chase and map payload changes, fix bugs, and try to keep your system from exploding.
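To make the dispatch problem concrete, here is a minimal parsing sketch in Python. The function name and the set of supported content types are illustrative, not from any particular provider, and a real parser would need far more defensive handling per integration:

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

def parse_webhook_body(content_type: str, body: bytes) -> dict:
    """Dispatch on the Content-Type header; each branch is its own
    maintenance burden once providers start changing formats."""
    ctype = content_type.split(";")[0].strip().lower()
    if ctype == "application/json":
        return json.loads(body)
    if ctype in ("application/xml", "text/xml"):
        # Flatten one level of children; real XML payloads nest deeper.
        root = ET.fromstring(body)
        return {child.tag: child.text for child in root}
    if ctype == "application/x-www-form-urlencoded":
        return {k: v[0] for k, v in parse_qs(body.decode()).items()}
    raise ValueError(f"unsupported content type: {ctype!r}")
```

Even this toy version already has three code paths to keep in sync with providers who feel no obligation to tell you when they change.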

2. Authentication Challenges

Each provider has its own authentication scheme, even when it is built on a standardised method. Common ones are API keys, OAuth, signatures, and mutual TLS, but you will also come across custom headers. Handling each type's storage and security requirements is a security and integration nightmare. You must implement and manage all these different flavours, and getting one wrong could lead to a security breach. The sad path is security vulnerabilities and the constant fear of missing something. You must ask if it is worth managing API keys and certificates under compliance requirements such as certificate rotation and access controls for dozens of integrations.

3. Data Storage Despair

Webhooks can come in fast, with substantial spikes in usage. Your initial database design, which looked elegant on paper, now has to perform under that load. Without proper content maintenance, archiving and purging, your queries will slow down and writes will time out. You'll end up forming fixed squads to swarm on scaling your database while trying to avoid data loss. The outcome here is usually painful migrations and stringent data retention and purging patterns. The sad path is performance bottlenecks and the constant threat of database outages. And don't even get me started on data retention policies and compliance scope for things like GDPR, ISO 27001, SOC 2 and PCI DSS. All these stores were in scope, and each came with a dozen controls at audit time, taking weeks to prove compliance.

4. Destination Deterioration

Destination endpoints go down or struggle to keep up with demand. It's physics. Your webhook service needs to handle this gracefully, but "gracefully" is complex and always an "it depends" answer. The idea here is to talk to your destinations, not shout at them. If they stop listening, you must take a bubble bath break (something my wife and I use when an argument gets too heated). You'll need to implement intelligent retry mechanisms with exponential backoff and reduced rate limits for struggling endpoints, but even those can fail. And what happens when the target endpoint returns a 500 error? Do you retry? Do you notify the user? Do you give up? The sad path is lost webhooks, eventually broken workflows, and angry clients or users who blame you for the destination endpoint's problems.
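A minimal sketch of the backoff decisions in Python, assuming illustrative limits (the attempt cap, base delay, and status-code policy are choices you would have to make, not fixed answers):

```python
import random

MAX_ATTEMPTS = 6
BASE_DELAY = 2.0    # seconds; illustrative values
MAX_DELAY = 300.0

def next_delay(attempt: int) -> float:
    """Exponential backoff, capped, with full jitter so a burst of
    failed deliveries does not retry in lock-step against an
    already-struggling endpoint."""
    capped = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, capped)

def should_retry(status_code: int, attempt: int) -> bool:
    """Retry 5xx and 429 responses; treat other 4xx as permanent
    failures that retrying cannot fix."""
    if attempt >= MAX_ATTEMPTS:
        return False
    return status_code >= 500 or status_code == 429
```

Even with this in place, you still have to decide what happens after the final attempt: dead-letter the event, notify the user, or drop it, and each choice has a sad path of its own.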

5. Transformation Trauma

Clients and users will inevitably want to transform webhook payloads before sending them to their endpoints; this makes sense, as you cannot expect every consumer to conform to one universal schema. Clients and users have their own requirements, and their workflows may require different data. It sounds simple enough, but it's a recipe for disaster, especially when detecting schema changes caused by the transformation. It is also yet another failure point on the delivery journey. You'll either write custom transformation logic for every integration or pass raw payloads to the subscribers, who then manage the transformations themselves, adding to their integration costs and a poor developer journey. It also makes you easier to replace in the future, as switching to a different provider is simpler once they have the mapping in place. The sad path is unmaintainable code, integration failures, easier provider switching, and the constant need to rewrite transformations as providers change their APIs.
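One common middle ground is a declarative field mapping rather than hand-written code per integration. Here is a sketch in Python; the dotted-path convention and field names are assumptions for the example, and a real system would also need defaults, type coercion, and schema-change detection:

```python
from functools import reduce

def get_path(payload: dict, dotted: str):
    """Walk a dotted path like 'order.total' through nested dicts."""
    return reduce(lambda obj, key: obj[key], dotted.split("."), payload)

def transform(payload: dict, mapping: dict[str, str]) -> dict:
    """Build the subscriber's desired shape from a declarative
    destination-field -> source-path mapping."""
    return {dest: get_path(payload, src) for dest, src in mapping.items()}
```

For example, `transform({"order": {"total": 42, "buyer": {"email": "a@b.c"}}}, {"amount": "order.total", "customer": "order.buyer.email"})` yields `{"amount": 42, "customer": "a@b.c"}`. The mapping lives as data, not code, which is exactly what makes it portable when a client decides to switch providers.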

6. Monitoring Blackbox

Without the right metrics and logs in place, you'll capture too much or too little, either costing you an arm and a leg to host the logs and metrics (looking at you, Datadog) or leaving you unable to solve operational issues due to a lack of insight. You'll also get false alerts for every minor blip that would auto-recover, and you'll likely miss the slow degradation that eventually leads to a significant outage. You'll spend time chasing phantom issues and figuring out why things broke without accurate data. The sad path here is delayed problem detection, prolonged downtime, and the frustration of not knowing what's happening - costing developer and operational time instead of letting the team focus on other critical requirements the company needs to deliver.
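The "blip versus degradation" distinction can be sketched as a rolling failure-rate check in Python. The window size and threshold below are illustrative; in practice these would be rules in your monitoring stack rather than application code:

```python
from collections import deque

class DeliveryMonitor:
    """Track the last N delivery outcomes and flag sustained
    degradation, rather than alerting on every single blip."""

    def __init__(self, window: int = 100, failure_threshold: float = 0.2):
        self.outcomes = deque(maxlen=window)
        self.failure_threshold = failure_threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, so one early failure
        # out of two samples does not page anyone.
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.failure_rate >= self.failure_threshold)
```

Tuning the window and threshold is the hard part: too tight and you page on blips, too loose and you sleep through the slow degradation.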

7. Security Scares

Security is critical, but it's also incredibly complex. You'll need to worry about authentication, authorisation, data encryption, origin IP filtering, payload scanning and validation, and a dozen other things. One missed requirement or hiccup can lead to a data breach, a nightmare you don't want to experience. The sad path is data breaches, fines, reputational damage, and legal liabilities.

8. Testing Tribulations

Testing a webhook service is fine when you control a few simple payloads, but it is not when hundreds of permutations can occur. You need to simulate different providers, payload formats, and error scenarios. You'll spend a lot of time writing test cases, and you'll still miss edge cases that come back to you as support tickets and bugs on the backlog, requiring further engineering and maintenance time. The sad path is integration failures, unexpected behaviours, and a constant lack of confidence that your service can deliver on the promise you made.
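The permutation problem can be sketched as a small test matrix in Python. The handler, the payload samples, and the required `event` key are all hypothetical stand-ins; a real suite would cross every provider, format, and error scenario:

```python
import json

# Hypothetical handler under test: accepts JSON bytes, requires an "event" key.
def handle(body: bytes) -> str:
    data = json.loads(body)
    return data["event"]

GOOD_PAYLOADS = [b'{"event": "ping"}', b'{"event": "push", "extra": 1}']
BAD_PAYLOADS = [b"", b"not json", b'{"no_event": true}']

def run_matrix() -> dict:
    """Exercise good and bad payload permutations, counting how the
    handler classifies each; the bad cases must be rejected cleanly."""
    results = {"ok": 0, "rejected": 0}
    for body in GOOD_PAYLOADS:
        handle(body)
        results["ok"] += 1
    for body in BAD_PAYLOADS:
        try:
            handle(body)
        except (json.JSONDecodeError, KeyError):
            results["rejected"] += 1
    return results
```

Five cases are trivial to enumerate; hundreds of provider-by-format-by-error combinations are not, and the ones you skip are the ones that surface as support tickets.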

Conclusion

These are just some of the high-level, common sad paths you'll have to handle if you decide to build your own webhook handling service. It's a complex undertaking, far bigger than the face-value time estimate you'd make at the outset, and it's easy to underestimate the challenges. So, before you jump in, ask yourself: Is it worth it for my business? Is this where the company needs to spend time and attention to unlock its financial or delivery goals? Or would it be better to let someone else handle the effort?