# Refactor Webhook System for Improved Reliability and Scalability
## Timeline
| Time | Description |
| --- | --- |
| 2024-10-31 12:00:00 Europe/Lisbon | Creation of the document |
## Status

Proposed
## Context
The current webhook system has several issues that impact its reliability and maintainability:
- Error Handling and Retries: Events are processed inside the same handler that receives them, which makes error handling and retries difficult and results in poor error recovery and unreliable event processing.
- Infrastructure Complexity: Each event type is published to its own Pub/Sub topic, leading to unnecessary complexity and additional infrastructure overhead.
- Event Storage Issue: Events are only stored after being successfully processed through Pub/Sub, which introduces the risk of losing events if they fail during processing.
- Dependencies on GCP services: The system relies on GCP-specific services, which prevents migrating it to another cloud provider without code changes.
The new architecture aims to address these shortcomings by decoupling event storage and processing, thereby improving scalability, reliability, and maintainability.
## Decision
The decision is to refactor the current webhook system to use a new architecture that includes the following key changes:
- Remove Pub/Sub: The new design eliminates the use of Pub/Sub for publishing and subscribing to events, simplifying the infrastructure.
- Immediate Event Storage: Store raw events immediately upon receipt in a Cloud Storage archive, ensuring that events are recorded before processing. This prevents event loss during processing (see the storage sketch after this list).
- Decoupled Event Processing: Event processing is separated from the receiving step, enabling asynchronous processing. Multiple event handlers can process events independently, enhancing flexibility.
- Retry Mechanism: Introduce retry capabilities for event processing by using a database-driven event handler mechanism. This will allow better error handling and reprocessing of failed events.
- Scalability Improvements: The new design provides better scalability properties, allowing any number of handlers to process events concurrently and making it easier to manage high traffic.
- Replay Support: Events can be replayed from the database if needed, providing more control over event reprocessing and auditing.
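As a minimal sketch of the receive-then-store step, assuming events are recorded in the Postgres database described below, with an illustrative `webhook_events` table and the `psycopg2` driver (none of these names are mandated by this decision):

```python
import json

import psycopg2

# Illustrative schema: each event carries enough processing metadata
# (status, attempts, timestamps) to drive retries and replays from the database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS webhook_events (
    id           BIGSERIAL PRIMARY KEY,
    event_type   TEXT        NOT NULL,
    payload      JSONB       NOT NULL,
    status       TEXT        NOT NULL DEFAULT 'pending',  -- pending | processed | failed
    attempts     INTEGER     NOT NULL DEFAULT 0,
    received_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ
);
"""


def store_raw_event(conn, event_type: str, payload: dict) -> int:
    """Persist the raw event before any processing happens.

    The receive handler only validates and stores; processing runs later,
    so a processing failure can no longer lose the event.
    """
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO webhook_events (event_type, payload) VALUES (%s, %s) RETURNING id",
            (event_type, json.dumps(payload)),
        )
        event_id = cur.fetchone()[0]
    conn.commit()
    return event_id
```

With this in place, the receive endpoint can acknowledge the sender as soon as the row is committed; everything after that point is asynchronous.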
## Consequences
- Improved Reliability: Events are stored immediately, reducing the risk of losing them. Errors in processing do not result in event loss, as retries are possible.
- Simplified Infrastructure: Removing Pub/Sub reduces the complexity of the infrastructure and the maintenance overhead, leading to a cleaner and more manageable system.
- Asynchronous Processing: Decoupling event storage from processing allows event handlers to operate independently, improving reliability and making the system more resilient to failures.
- Scalability: The system can scale more effectively, as multiple instances of the event handler can process events concurrently (see the worker sketch after this list).
- Replay Support: The ability to replay events from the database adds robustness, allowing failed events to be retried or historical events to be reprocessed for auditing purposes.
- Implementation Effort: The refactoring requires implementation changes to decouple the event receiving, storage, and processing steps, which may initially require more development time and testing.
- Only Dependency Is Postgres: The architecture does not depend on any vendor-specific services; it relies solely on a Postgres database.
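To illustrate how concurrent handlers, retries, and replays could be driven from Postgres alone, the following sketch uses `SELECT ... FOR UPDATE SKIP LOCKED` so that any number of worker instances can claim pending events without contending for the same rows. The table name, status values, and retry limit are assumptions carried over from the sketch above, not part of the decision itself:

```python
import time

import psycopg2

MAX_ATTEMPTS = 5  # illustrative retry limit


def claim_next_event(conn):
    """Atomically claim one pending event; SKIP LOCKED lets many workers poll concurrently."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, event_type, payload
            FROM webhook_events
            WHERE status = 'pending' AND attempts < %s
            ORDER BY received_at
            FOR UPDATE SKIP LOCKED
            LIMIT 1
            """,
            (MAX_ATTEMPTS,),
        )
        return cur.fetchone()  # None when there is nothing to do


def mark_result(conn, event_id, ok):
    """Record the outcome; failures stay 'pending' until the retry limit is reached."""
    with conn.cursor() as cur:
        if ok:
            cur.execute(
                "UPDATE webhook_events SET status = 'processed', attempts = attempts + 1,"
                " processed_at = now() WHERE id = %s",
                (event_id,),
            )
        else:
            cur.execute(
                "UPDATE webhook_events SET attempts = attempts + 1, status = CASE"
                " WHEN attempts + 1 >= %s THEN 'failed' ELSE 'pending' END WHERE id = %s",
                (MAX_ATTEMPTS, event_id),
            )


def replay_failed_events(conn):
    """Replay support: flip failed events back to pending so workers pick them up again."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE webhook_events SET status = 'pending', attempts = 0 WHERE status = 'failed'"
        )
    conn.commit()


def worker_loop(dsn, handle_event):
    """Poll for work; each iteration is one transaction around claim, handle, and mark."""
    conn = psycopg2.connect(dsn)
    while True:
        row = claim_next_event(conn)
        if row is None:
            conn.commit()  # end the empty transaction before sleeping
            time.sleep(1)
            continue
        event_id, event_type, payload = row
        try:
            handle_event(event_type, payload)
            mark_result(conn, event_id, ok=True)
        except Exception:
            mark_result(conn, event_id, ok=False)  # stays in the table for retry or replay
        conn.commit()  # releases the row lock taken by FOR UPDATE
```

A real implementation would likely add a backoff column and structured error logging, but the point of the sketch is that coordination, retries, and replays all live in the same Postgres table.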
## Architecture diagrams
### Current architecture

### Proposed architecture