Sunday Morning - Monday Morning:
- A final lingering impact: some POS systems and card readers required manual restarts to pull the latest infrastructure changes, and some shops also needed to update to the latest V3 card reader. By end of day Monday, all systems were fully stable.
Root Cause Summary
The outage was caused by instability within our Redis cluster running on AWS, which handles real-time socket communication. When the physical hardware failed, a feedback loop formed: our systems attempted to sync at high frequency under degraded conditions, overwhelming the servers and preventing POS devices from maintaining a stable connection.
A socket is essentially a communication endpoint; think of it like a phone jack. When two programs want to communicate, each opens a socket, they connect, and data flows between them. One side typically listens and the other initiates.
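To make the analogy concrete, here is a minimal, self-contained Python example (purely illustrative, not Dripos production code) showing one program listening and another initiating, with data flowing between them once connected:

```python
import socket
import threading

# One side listens (the "phone jack" waiting for a call)...
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0 = let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

def answer():
    conn, _ = server.accept()        # block until a client dials in
    data = conn.recv(1024)           # read the incoming message
    conn.sendall(b"ack: " + data)    # reply over the same connection
    conn.close()

t = threading.Thread(target=answer)
t.start()

# ...and the other side initiates the connection.
client = socket.create_connection((host, port))
client.sendall(b"sale:4.50")
reply = client.recv(1024)
print(reply.decode())                # -> ack: sale:4.50
client.close()
t.join()
server.close()
```

In our architecture, the POS tablet plays the initiating role and our servers play the listening role; when the servers were degraded, that connection could not be held open reliably.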
What went wrong with our fallback: Our offline mode is designed to be a safety net. When working as expected, offline mode allows offline swipe, tap, and dip payments via our updated V3 card reader connection. However, the specific failure mode of our servers prevented many tablets from successfully transitioning into offline mode during the outage. We recognize that a fallback is only useful if it works when the main system is down.
What We're Doing Next
Beyond fixing the symptoms, we are also changing the architecture itself. Our immediate and long-term roadmap now centers around:
- Strengthened Offline Reliability: We are re-engineering portions of our offline infrastructure to ensure POS systems can reliably transition into Offline Mode.
- Monitoring & Communication: We are expanding our automated recovery systems and strengthening our internal incident response protocols to provide faster, more transparent communication during a crisis. We are introducing quarterly disaster recovery drills, including offline-mode-only operation tests and all-hands-on-deck exercises. Additionally, we recommend subscribing to status.dripos.com if you have not already.
- Updated Support Platform: We are transitioning to Zendesk in the coming months to replace our current phone support platform, Quo, which was overwhelmed by Saturday’s volume, preventing our team from answering many incoming calls. This upgrade will expand our capacity to handle high-traffic events and ensure our support lines remain reliable during critical situations.
- Regional Redundancy: Over the coming months, we are migrating critical real-time systems to a multi-region architecture so a single data center failure cannot take down the platform. For example, we will have fallback systems to switch our AWS instances from their US-East data centers to US-West in the case of an AWS outage.
- Reduced Socket Dependency: We are reducing our reliance on high-frequency socket operations for non-critical tasks to lower infrastructure strain.
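One standard pattern for breaking the kind of retry feedback loop described above is exponential backoff with jitter between reconnect attempts. The sketch below is a generic illustration of that technique under assumed parameters, not our actual implementation:

```python
import random
import time

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Yield reconnect delays that grow exponentially, with "full
    jitter" so thousands of clients don't all retry in lockstep."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0.0, ceiling)

def connect_with_backoff(try_connect, **kwargs):
    """Call try_connect() until it succeeds, sleeping a jittered,
    exponentially growing delay after each failure."""
    for delay in backoff_delays(**kwargs):
        if try_connect():
            return True
        time.sleep(delay)
    return False  # give up and fall back to offline mode
```

Compared with retrying at a fixed high frequency, this spreads reconnection attempts out over time, so a degraded server sees a trickle of traffic instead of a synchronized stampede.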
Checklist to Ensure Stability in Your Shop
To ensure your POS is running on our most stable architecture, please verify the following settings on your tablets:
1. Transition to V3 Reader (for iPad users only)
- Disable V1: Go to POS Settings → Advanced Settings → Toggle “Enable V1 Reader” OFF.
- Migrate: Go to Card Reader Settings on the POS → Click the pencil icon next to your reader → Select “Migrate to V3.”
- Restart: Restart both your POS app and card reader.
- Reconnect: Once restarted, select your host device on the reader to pair.
- The V3 Reader allows offline card-present payments, has an improved UI, and processes transactions faster.
2. Reset Offline Status
- Go to POS Settings → Advanced Settings → Toggle Off “Enter Offline Mode.”
3. Standardize Fire Ticket Settings
- Go to POS Settings → Tickets & KDS → Fire Ticket Settings → Select “Checkout is Complete (Default).”
- The “Fire Ticket when Added to Cart” option will be fully restored this week.
Additional Note: Supply Chain & Inventory Update
- We have disabled real-time supply chain tracking during checkout to prioritize transaction stability and speed. This will remain in effect for the next week while we refactor the feature.
Closing Notes
We understand that an apology alone by no means makes up for the disruption this caused to your businesses, your staff, or your customers. Many of you were forced to troubleshoot operational issues in real time during Mother’s Day weekend, and we recognize the stress, loss, and frustration that created. It is our responsibility as a platform to own these mistakes.
Over the past 48 hours, our teams have worked around the clock not only to restore stability, but to identify the underlying failures that allowed this incident to escalate as it did. This incident is driving critical upgrades to our infrastructure, offline systems, and operational readiness going forward.
Our responsibility is not just to recover quickly when issues arise, but to build systems resilient enough for your shops to continue operating reliably even during broader infrastructure failures. This incident made it very clear that we still have work to do, and that work is already underway.
If you have any lingering issues or pending transactions that failed to process, please reach out to support@dripos.com. Our team is actively prioritizing all follow-up cases related to the incident and will work directly with you to resolve them as quickly as possible.
Thank you for your patience, your feedback, and for continuing to hold us to a higher standard.
Sincerely,
Jack Pawlik
CEO, Dripos