Hello Javier,
Thank you for the notice.
We have struggled with this feature in production since it was released. Disabling a webhook after only three failures can cause unintended outages: for example, a brief Kubernetes node restart can permanently cut off events. Worse, a human has to re-enable the webhook every time. Could you consider a more tolerant retry policy (e.g. exponential backoff with a longer failure window) before disabling the webhook?
In the meantime, are there any recommended best practices or alternate approaches to ensure webhook reliability in a production environment?
Best regards,
Thomas
Hi @thomas_achache,
I have talked with the team, and we agree that the following are worth evaluating:
- An exponential backoff with a longer failure window
- Notifications when your webhook is disabled so that you can immediately take action
- A means of automatically re-enabling your webhook
I suggest voting on this idea to help our Product team prioritize. Unfortunately, we don’t have alternate approaches at the moment beyond the monitoring you are doing on your end. If you suffer downtime or suspect your webhook might be invalidated, you can contact our Support team with the name of your company and app UID so they can check the status of your app’s webhooks for you. We will work to improve this in the future.
FYI, that idea is no longer active because it got merged, so I can’t vote on it.
I can’t express how flabbergasted I am at this policy. Not only is the limit ridiculously low, but the failure is completely silent. THEN, if you have a partnered/published app, you can’t even fix this yourself. You have to message the support team and wait.
This is going to cause us and our customers (who are also Front customers) massive headaches. I’m trying to do the uptime math, and I’m realizing that it’s basically impossible for most companies NOT to have an annual outage with webhooks.
If my math is right, a single user of the app doing just 3.5 actions per hour is enough to pretty much ensure we’ll have at least one annual outage with our webhooks.
- Four nines of uptime allows for 52.6 minutes of downtime annually
- If that downtime is one continuous block, requests arriving every ~17.1 minutes mean the 3rd consecutive failure hits just before the system recovers (at ~51.4 minutes)
Obviously, that example is a bit contrived, but I’m trying to show how little usage it takes to run into major uptime concerns. At something like 1 request per minute, you already need six nines of uptime to stay safe, and at 1 request per second even seven nines (about 3 seconds of downtime per year) is right at the edge.
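Here is the back-of-the-envelope check behind those numbers, assuming the whole annual downtime budget lands as one continuous block and requests arrive at a perfectly steady rate (both are simplifications, obviously):

```python
# Rough check: does one continuous downtime block of `downtime_min` minutes
# contain enough evenly spaced requests to rack up `threshold` consecutive
# webhook failures? Requests land every 60/rate minutes, so the Nth
# consecutive failure is guaranteed once N * interval fits inside the block.
def outage_guaranteed(requests_per_hour, downtime_min, threshold=3):
    interval_min = 60.0 / requests_per_hour
    return threshold * interval_min <= downtime_min

FOUR_NINES_MIN = 52.6  # 99.99% uptime -> ~52.6 minutes of downtime per year

# 3.5 requests/hour -> one request every ~17.1 minutes -> 3rd failure at ~51.4 min
print(outage_guaranteed(3.5, FOUR_NINES_MIN))  # True
```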
------
I get there are dev constraints here, but can we get this bumped up at all? Even at 10 consecutive failures, the uptime situation changes pretty drastically.
- Four nines puts us at ~11 requests per hour
- Five nines gets us to ~2 requests per minute
- Six nines gets us to ~20 requests per minute
- Seven nines gets us to ~3 requests per second
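For reference, here is how I got those numbers (same assumptions as before: one continuous downtime block, evenly spaced requests):

```python
# Largest steady request rate that avoids `threshold` consecutive failures
# inside one continuous downtime block of `downtime_min` minutes: we need
# threshold * (60 / rate) > downtime_min, i.e. rate < threshold * 60 / downtime_min.
def max_safe_rate_per_hour(downtime_min, threshold=10):
    return threshold * 60.0 / downtime_min

for nines, downtime_min in [(4, 52.6), (5, 5.26), (6, 0.526), (7, 0.0526)]:
    print(f"{nines} nines: ~{max_safe_rate_per_hour(downtime_min):.0f} requests/hour")
# 4 nines: ~11 requests/hour
# 5 nines: ~114 requests/hour  (~2/minute)
# 6 nines: ~1141 requests/hour (~19/minute)
# 7 nines: ~11407 requests/hour (~3/second)
```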
That’s not even considering the fact that a single successful request resets the window. It’s drastically easier to avoid 10 consecutive failures than just 3 failures.
Surely, if the goal is to kill off long running stale dev instances firing off bad requests, it’s okay to wait just a tiny bit longer to block them.
Hi @wesley_harding_conveyor,
Please vote on https://front.ideas.aha.io/ideas/PRD-I-7941, which is the idea that the other votes got merged into. The team is evaluating how this can be improved.
One point of clarification: webhooks that belong to published apps do not get disabled automatically, so if your app is intended for our App Store, it won’t have this problem. We recognize that many webhook apps in private customer instances serve a production purpose rather than a testing one, so the improvements should still be evaluated for those cases.
Thanks for the clarification!
FYI, I don’t have access to that link, either.
@wesley_harding_conveyor thanks for letting me know. Our Product team is investigating the configuration of the idea, but in the meantime I cast a proxy vote for you in the backend.
Hi @Javier,
We’ve experienced another production incident due to the webhook disabling policy. While adding notifications is a step in the right direction, it doesn’t fully address the issue.
The current failure window is too narrow, and it will eventually have to be extended, for example by disabling webhooks only after several hours without a 200 response, possibly combined with exponential backoff.
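For illustration only (the numbers below are made up, not a proposal for specific values), a doubling retry delay capped at 30 minutes already spans roughly three and a half hours of continuous failure across a dozen retries:

```python
# Illustrative only: a doubling retry delay, capped at 30 minutes, keeps a
# failing webhook in the retry loop for several hours before it would be disabled.
delay_s = 30      # delay before the first retry
elapsed_s = 0     # time since the initial failure
for attempt in range(1, 13):
    elapsed_s += delay_s
    print(f"retry {attempt:2d} at {elapsed_s / 3600:4.1f} h after the initial failure")
    delay_s = min(delay_s * 2, 30 * 60)  # double each time, cap at 30 minutes
# retry 12 lands ~3.5 hours after the initial failure
```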
I understand that implementing this change will take time, but in the meantime, it’s difficult to justify repeated interruptions to production workloads.
I’d strongly recommend completely rolling back the current disabling mechanism until a more balanced policy is in place. Please let us know what can be done.
Thanks,
Thomas
Hi @thomas_achache
Thanks for the thoughtful feedback. I’m really sorry to hear this has caused another production incident. We're working on improvements that include extending the retry window before disabling webhooks and adding proactive notifications. While we can’t fully roll back the current policy, I think we can find a solution that will meet your needs.