No one enjoys being woken up in the middle of the night or having a weekend interrupted because of a major incident disrupting application reliability or performance. When an application is truly down and impacts business operations, few desire the pressure of the war room. Agile developers should focus on their sprint commitments and spend as little time as possible investigating the root causes of major incidents. Yet responding to major incidents, providing support to resolve issues, and participating in root-cause analysis is everyone’s responsibility.
In the best of circumstances, operations teams have monitoring systems that detect issues, raise alerts, and resolve problems automatically. The reality is that operating environments can have problems outside of everyone’s control, such as security breaches, major cloud outages, third-party service trouble, or major infrastructure failures that disrupt operations. Even the most robust agile processes, software development lifecycles, or devops best practices can’t guarantee that applications are risk-free and 100 percent reliable.
Operations and site reliability engineers can often fix common issues without impacting the development team. Common problems can be cleared up with automation or by maintaining runbooks that prescribe how to address them. But developers are likely needed to help unravel more complex or less frequent mishaps, and there are many ways they can help prevent operational problems from occurring in the first place.
Incident management is a critical business process
Many organizations today develop software applications as part of customer-facing products, customer experiences to support business services, or workflows to enable employees to fulfill their jobs. When these applications fail or underperform, it can have significant business implications, such as revenue loss, unbudgeted costs, brand reputation impacts, project delays, and poor employee morale.
When applications experience frequent or lengthy outages, poor performance, or unexpected errors, it also reflects poorly on the agile software development teams. IT departments that survey employees and measure customer satisfaction are unlikely to receive high scores if unreliable applications impact people’s work. It’s also harder for IT management to get budget increases, training, added compensation, or other benefits if the organization feels that the software development teams can’t release new capabilities reliably.
Development teams must take proactive steps to prevent problems, provide support during incidents, participate in the analysis of root causes, and prioritize work to address critical defects.
Let’s look at these responsibilities in more detail.
Prioritize quality when developing and releasing applications
Agile development teams often focus their efforts on developing and releasing new features, enhancing user experiences, and addressing technical debt. Teams instituting devops practices such as CI/CD (continuous integration/continuous delivery) pipelines must also shift-left their testing practices and automate most testing to ensure that new code doesn’t break software builds and that automated tests all pass.
Developers and quality assurance testers should shift-left security and institute coding practices to ensure the reliability of applications. Development teams should also partner with operations teams on infrastructure configuration, automation, and monitoring. Best practices include:
- Standardize and centralize application logging and exception handling to ensure that application issues are traceable.
- Minimize application and database locking, which can create bottlenecks, especially under heavier loads.
- Configure applications, services, and databases for high reliability, and load-balance them across multiple cloud zones.
- Centralize monitoring and alerts and proactively look for longitudinal performance variances.
- Automate procedures that restart, scale up, and shut down services based on demand.
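As one illustration of the first practice above, here is a minimal sketch of standardized logging and exception handling in Python. It assumes a team wants every service to emit one JSON object per log line so that a central log store can index it. The names `JsonFormatter`, `get_logger`, `handle_request`, and the `request_id` field are hypothetical choices for this example, not part of any particular framework.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is a hypothetical correlation field for tracing one request
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            # Include the full traceback so responders can trace the failure
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def get_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines; configure it only once per name."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def handle_request(logger, work):
    """Run `work`, logging success or failure with a shared request ID."""
    request_id = str(uuid.uuid4())
    try:
        result = work()
    except Exception:
        # Log the traceback in the standard format, then re-raise for the caller
        logger.exception("request failed", extra={"request_id": request_id})
        raise
    logger.info("request succeeded", extra={"request_id": request_id})
    return result
```

Because every service would log through the same formatter and wrapper, an incident responder could search one store by `request_id` rather than learning a different log layout for each application.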
Lastly, it’s critically important to document the application’s architecture and code because it’s highly likely that people who weren’t involved in the application’s development will have responsibilities to support it. Even when code is modular or uses microservices, it’s vital to leave documentation for developers and site reliability engineers to resolve issues and improve applications.
Be prepared to support incident response teams
Before incidents happen, software development teams should establish protocols and processes to better support incident response teams and site reliability engineers:
- Ensure that software developers understand that providing off-hour support to incident response teams is part of their job. Develop policies with human resources, especially if there are regulations on working off-hours or if overtime is required.
- Publish on-call schedules and provide the proper tools and devices so that developers are reachable when needed.
- Identify and document who the subject matter experts are, by application, service, database, and other software components.
- Prescribe what developers should or should not do to resolve major incidents. For example, most organizations want developers to help diagnose, suggest workarounds, and resolve incidents, but fixing and deploying code is usually not recommended or allowed as part of incident response.
- Clarify and standardize what, where, and how developers must communicate during and after an incident.
Resolve incidents and participate in war rooms
During an incident, software developers should aid in fixing the issue and restoring service in minimal time. Once the developers are called in, the assumption must be that operational engineers have already reviewed and possibly ruled out infrastructure-related concerns, and that site reliability engineers have already explored a list of common problems with the application.
When there is a major incident, incident managers will often set up bridge calls, chat sessions, and physical war rooms to assemble a multidisciplinary team to work through the problem collaboratively. Developers who are called in should know and follow the incident response and communications protocols established for these war rooms.
In the war room, developers should act as the application experts. After reviewing monitors, log files, and alerts, they should recommend courses of action. It’s essential to use specific language and separate fact from speculation. Try to avoid the wrong turns and added delays that occur when response teams chase symptoms that turn out to be dead ends.
Developers should participate in this collaboration until the incident manager closes the issue or rules out the need for their participation and excuses them.
Identify root causes and resolve application defects
Major incidents are closed once the application or service is back to normal operating conditions. At this point, in ITIL (Information Technology Infrastructure Library), incidents are assigned to problems so that teams can identify root causes. The goal is to perform a full diagnostic of all the underlying issues and circumstances. What caused the incident? What factors defined the severity and magnitude of the business impact? What conditions, factoring in the duration and the expense, were required to resolve the issue?
Once the root cause is determined, agile development teams should assign one or more defects that address root causes, lower risks, or lessen business impacts. Development teams may have different definitions and processes around defects in their agile process and software development lifecycle. What’s most critical is that when known issues repeatedly create problems or cause major business interruptions, agile development teams and their product owners receive this feedback and prioritize making improvements.
After all, delivering new capabilities through software is only part of a developer’s responsibilities. Ensuring that applications are reliable, secure, perform well, and have positive user experiences is where teams truly deliver on business needs.
This story, “How agile teams can support incident management” was originally published by