July 1, 2025

Mastering Operational Excellence: Preventing and Managing Outages Effectively

Production outages on popular online game platforms like My11Circle and RummyCircle can dent revenues and damage reputation, especially during peak season. At Games24x7, we aim to ensure at least 99.98% availability of our critical systems, which means the accumulated downtime across all our products cannot exceed 1 hour 45 minutes in a year. While this is largely achieved by building resilient, fault-tolerant, and highly available systems, robust processes play an equally important role: they help prevent production issues, reduce the time to detect and fix them, and ensure we learn from past outages. This blog gives you insights into the engineering and operational practices we follow at Games24x7 to keep outages at bay.
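
To put that target in perspective, here is a minimal sketch of the error-budget arithmetic behind the 1 hour 45 minutes figure (a back-of-the-envelope illustration, not one of our SLA tools):

```python
# Downtime budget implied by an availability target over one year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Maximum tolerable accumulated downtime (in minutes) per year."""
    return (1 - availability) * MINUTES_PER_YEAR

budget = downtime_budget_minutes(0.9998)
hours, minutes = divmod(budget, 60)
print(f"{budget:.1f} minutes per year (~{int(hours)} h {int(minutes)} m)")
# -> 105.1 minutes per year (~1 h 45 m)
```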

Prevention:

  1. Having the right attitude: Much has been written about best practices to prevent production issues, but ultimately it is the mindset that plays the biggest role. At Games24x7, we try to instil a sense of ownership in every engineer. In different forums, our leaders often talk about how rapidly we have grown over the years and continue to grow. With such a large user base, even one minute of downtime can be extremely detrimental to the business.
  2. Effective engineering methods: Our engineers are trained to optimize for scale and design for resiliency in every piece of code they write or test. Developers make sure critical flows are not impacted by auxiliary functionality, through appropriate error handling or by using separate thread pools (a minimal sketch of this isolation pattern appears right after this list). We also segregate different sets of components at the infrastructure level to keep them isolated and independent, and we encourage on-the-fly configurable switches such as feature/functionality-level toggles, the ability to bypass slave DBs, and so on. This, coupled with other good practices such as stringent code reviews, test case walkthroughs, cross-team architecture review forums, a minimum of 80% unit test coverage, functional test automation, performance tests, user acceptance tests, regression tests, canary deployments, and deployment plan approvals, helps us catch most functional and non-functional issues before they reach production.
  3. Planning for major events: To prepare even better before major events, we do dry runs on production. This blog by one of my colleagues can help you deep dive into the philosophy of dry runs. On our fantasy platform My11Circle, where we see sudden traffic spikes, we don't rely on ASG or HPA autoscaling alone. Instead, we have built our own Data Science powered scaling system, which predicts traffic based on the type of event and the current liquidity and pre-scales our systems accordingly (the second sketch after this list illustrates this pre-scaling step).
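
To illustrate the isolation mentioned in point 2, here is a minimal Python sketch of the bulkhead idea: auxiliary work runs on its own bounded thread pool behind a feature toggle, so a slow or failing auxiliary feature cannot hurt the critical flow. All names and the in-memory flag store are illustrative assumptions, not our actual codebase:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical runtime feature toggle (in practice this would be backed by a
# config service, not a module-level dict).
FEATURE_FLAGS = {"send_promo_notification": True}

# Auxiliary work gets its own small, bounded pool so it can never starve
# the threads serving the critical flow.
aux_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="aux")

def reserve_seat(user_id: str, game_id: str) -> None:
    """Stand-in for the critical step (illustrative)."""
    print(f"seat reserved for {user_id} in {game_id}")

def send_promo_notification(user_id: str, game_id: str) -> None:
    """Stand-in for an auxiliary feature (illustrative)."""
    print(f"promo sent to {user_id} for {game_id}")

def safe_notify(user_id: str, game_id: str) -> None:
    # Errors in the auxiliary path are logged and swallowed; they never
    # propagate back into the critical flow.
    try:
        send_promo_notification(user_id, game_id)
    except Exception as exc:
        print(f"promo notification failed, ignoring: {exc}")

def join_game(user_id: str, game_id: str) -> None:
    """Critical flow: must succeed even if auxiliary features misbehave."""
    reserve_seat(user_id, game_id)
    if FEATURE_FLAGS.get("send_promo_notification"):
        aux_pool.submit(safe_notify, user_id, game_id)  # fire-and-forget

join_game("user-42", "game-7")
```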

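The second sketch shows how a predicted player count could be turned into a pre-scaling call on an Auto Scaling Group ahead of a big match. The prediction function, capacity numbers, and ASG name are assumptions for illustration and only stand in for our actual Data Science powered system:

```python
import math
import boto3

PLAYERS_PER_INSTANCE = 500   # assumed capacity per instance
SAFETY_HEADROOM = 1.3        # assumed 30% buffer over the prediction

def predict_peak_players(event_type: str, current_liquidity: int) -> int:
    """Placeholder for the prediction model; returns an illustrative number."""
    return int(current_liquidity * (3.0 if event_type == "marquee" else 1.5))

def prescale(asg_name: str, event_type: str, current_liquidity: int) -> int:
    predicted = predict_peak_players(event_type, current_liquidity)
    desired = math.ceil(predicted * SAFETY_HEADROOM / PLAYERS_PER_INSTANCE)

    # Pre-scale ahead of the traffic spike instead of waiting for reactive
    # autoscaling to catch up.
    autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    return desired

# e.g. prescale("my11circle-game-engine-asg", "marquee", current_liquidity=200_000)
```
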
Monitoring and alerting:

Having an exhaustive monitoring and alerting system in place is crucial. At Games24x7, we monitor several dimensions of our systems:

  1. Infrastructure metrics: This mainly covers resource usage such as CPU, memory, and disk space, along with process crashes. Tools used: AWS CloudWatch, Prometheus, Grafana.
  2. Application metrics: This covers the throughput and latencies of the different APIs and event consumers of our microservices. We also monitor the different error codes emitted by our load balancers and target groups (a minimal sketch of how a service emits such metrics appears after this list). Tools used: AWS CloudWatch, Prometheus, StatsD, Graphite, Grafana.
  3. Exceptions/Errors: This covers the exceptions or errors occurring in our application code. Tools used: OverOps, Harness, GlitchTip, Sentry.
  4. Critical business metrics: We plot critical business metrics such as user logins, user registrations, and game joins from our microservices, and we detect abnormal trends in traffic patterns using tools like Anodot and our in-house dynamic monitoring tool.
  5. Centralized logs dashboard: We use Last9, which ingests logs from different sources and lets our engineers search for patterns in the logs for quicker debugging.
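
As an example of the application-metrics dimension, here is a minimal sketch of a service emitting latency and error counters through a StatsD client (the statsd Python package, metric names, and handler are assumptions for illustration; our actual instrumentation may differ):

```python
import time
import statsd  # pip install statsd

# Assumed local StatsD agent forwarding to Graphite; the prefix is illustrative.
metrics = statsd.StatsClient("localhost", 8125, prefix="game_engine.join_api")

def process_join(payload: dict) -> dict:
    """Stand-in for the real request handler (illustrative)."""
    return {"status": "ok", "game_id": payload.get("game_id")}

def handle_join_request(payload: dict) -> dict:
    start = time.monotonic()
    try:
        response = process_join(payload)
        metrics.incr("success")
        return response
    except Exception:
        metrics.incr("error")                     # error-rate alerts hang off this
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        metrics.timing("latency_ms", elapsed_ms)  # plotted in Grafana via Graphite

handle_join_request({"game_id": "game-7"})
```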

Most of the above metrics have corresponding alerts in place; critical alerts are routed to PagerDuty, while warnings go to Slack or email. A minimal sketch of this severity-based routing follows.
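
Here is that sketch, using the PagerDuty Events API v2 and a Slack incoming webhook; the routing key, webhook URL, and example alert are placeholders, not our actual integration:

```python
import requests

PAGERDUTY_ROUTING_KEY = "<pagerduty-events-v2-routing-key>"        # placeholder
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/<placeholder>"

def route_alert(summary: str, severity: str, source: str) -> None:
    """Page for critical alerts; post warnings to Slack."""
    if severity == "critical":
        # PagerDuty Events API v2: trigger an incident.
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": PAGERDUTY_ROUTING_KEY,
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "source": source,
                    "severity": "critical",
                },
            },
            timeout=5,
        )
    else:
        # Warnings go to a Slack channel via an incoming webhook.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f":warning: [{severity}] {source}: {summary}"},
            timeout=5,
        )

# e.g. route_alert("p99 latency above 800 ms", "warning", "join-api")
```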

Nipping the issue in the bud:

It is crucial not to overlook any symptom and to give every alert its due, so that a small issue doesn't snowball into an outage. We have classified production issues into different categories and defined a Standard Operating Procedure for tackling each of them:

  1. Email/Slack alerts: These are non-critical alerts that are not urgent but can be early signs of a major issue, or can hamper the experience for some users if not resolved in time. The on-call engineer evaluates the alert, does an impact analysis, and creates a Jira ticket with the right severity, which helps prioritize the ticket appropriately.
  2. PagerDuty Alerts: These alerts usually indicate that something is wrong in production and needs to be addressed before it turns into an outage. They also go to the on-call engineer, who creates a higher-priority Jira ticket and starts investigation and mitigation. Runbooks for common issues that have happened in the past are kept handy in our Confluence workspace.
  3. Production Issues raised by users: These are Jira tickets raised either by our customer support team based on complaints from our end users or by our business teams. They indicate that a set of users is having a bad experience on the app. The on-call engineer gathers inputs from the customer support/business teams to gauge the blast radius of the issue and the functionality impacted, and assigns a severity to the Jira ticket accordingly. Every severity has a defined SLA for mitigation and for a permanent fix (see the sketch after this list).
  4. Outages: These are issues that impact a large number of end users and/or adversely impact revenue. A Jira ticket is raised for every outage, and through the Jira-PagerDuty integration an alert is generated for all the tech leaders. A Slack channel with a Zoom link and the Jira ticket details is created automatically, and the entire team jumps onto the Zoom call to address the outage without delay. In such scenarios the focus is on mitigation rather than on a permanent fix; once the issue is mitigated, the next natural step is a detailed RCA and a plan for the permanent fix.
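
To make the severity-driven handling concrete, here is a minimal sketch of a severity-to-SLA mapping and a triage helper. The severity names, SLA durations, and thresholds are purely illustrative assumptions, not our actual SLA policy:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeveritySla:
    mitigation: timedelta     # time allowed to restore normal service
    permanent_fix: timedelta  # time allowed to ship the root-cause fix

# Illustrative numbers only; the real SLAs are defined per severity internally.
SLA_BY_SEVERITY = {
    "SEV1": SeveritySla(mitigation=timedelta(minutes=30), permanent_fix=timedelta(days=3)),
    "SEV2": SeveritySla(mitigation=timedelta(hours=4),    permanent_fix=timedelta(days=7)),
    "SEV3": SeveritySla(mitigation=timedelta(days=1),     permanent_fix=timedelta(days=14)),
}

def triage(blast_radius_users: int, revenue_impacted: bool) -> str:
    """Map the estimated impact to a severity (illustrative thresholds)."""
    if revenue_impacted or blast_radius_users > 100_000:
        return "SEV1"
    if blast_radius_users > 1_000:
        return "SEV2"
    return "SEV3"

severity = triage(blast_radius_users=5_000, revenue_impacted=False)
print(severity, SLA_BY_SEVERITY[severity])  # -> SEV2 and its SLAs
```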

Alert Dashboards and Reviews:

As you might have observed in the previous section, every production issue needs a corresponding Jira ticket for tracking. Each Jira ticket has an owner and an ETA as per its severity. These tickets are grouped into different Jira projects, and dashboards are built from those projects. Tech leadership reviews all these dashboards weekly to make sure things are moving as per the SLAs and that any delays or blockers are called out. These reviews keep us accountable. The review meeting is also the forum to discuss the Correction of Errors (COE), which I will cover in the next section.

Learning from mistakes:

Mistakes are part and parcel of our lives, but learning from them and not repeating them is what makes us better than our past selves. Every outage or rollback on production needs a COE (Correction of Errors) document. This document captures all the details of the outage, the root cause analysis, the learnings, and the action items (again tracked as Jira tickets) so that we improve our systems and don't repeat the same mistakes. The COE is created as soon as the outage is mitigated and is reviewed with the tech leaders and architects to ensure it is comprehensive. We follow a standard template for the COE doc so that no question is left unanswered. The template consists of the following sections:

  1. Summary
  2. Customer/Revenue Impact
  3. Detailed Timeline of the issue and the mitigation
  4. Questions around the issue discovery:
    1. How was the issue discovered?
    2. How long did it take to know about the issue from the first impact?
    3. Was there a known bug that caused the issue?
    4. Were there existing backlog items for this issue?
    5. Was this a known failure mode?
    6. Was the issue caused by deployment or build release?
    7. Was it caused by manual deployment or a configuration change?
    8. As a thought experiment what could be done to reduce the time to detection by half?
  5. Learnings
  6. Corrective actions

Closing Note:

In conclusion, building a resilient team capable of preventing and efficiently responding to outages is not just a technical necessity; it is a cultural commitment. At Games24x7, we recognize that our success hinges on our ability to maintain an exceptional user experience, especially during critical times. By fostering a mindset of ownership and accountability among our engineers, implementing robust monitoring systems, and learning from every incident, we continuously improve our processes. With a focus on prevention and proactive engagement, we aim to uphold our promise of 99.98% system availability, ensuring that our platforms remain reliable and enjoyable for all users. Together, we can turn challenges into opportunities for growth, solidifying our reputation as leaders in the online gaming industry.