How agile teams can support incident management

No one enjoys being woken up in the middle of the night or having a weekend interrupted because a major incident is disrupting application reliability or performance. When an application is truly down and business operations are affected, few desire the pressure of the war room. Agile developers should focus on their sprint commitments and spend as little time as possible investigating the root causes of major incidents. Yet responding to major incidents, providing support to resolve issues, and participating in root-cause analysis are everyone’s responsibility.

In the best of circumstances, operations teams have monitoring systems that detect issues, alert the right people, and resolve problems automatically. The reality is that operating environments can have problems outside of everyone’s control, such as security breaches, major cloud outages, third-party service trouble, or major infrastructure failures that disrupt operations. Even the most robust agile processes, software development lifecycles, or devops best practices can’t ensure that applications are risk-free and 100 percent reliable.

Operations and site reliability engineers can often fix common issues without impacting the development team. Common problems can be cleared up with automation or by maintaining runbooks that prescribe how to address them. But developers are likely needed to help unravel more complex or less frequent mishaps, and there are many ways they can help prevent operational problems from occurring in the first place.

Incident management is a critical business process

Many organizations today develop software applications as part of customer-facing products, customer experiences to support business services, or workflows to enable employees to fulfill their jobs. When these applications fail or underperform, the business consequences can be significant: revenue loss, unbudgeted costs, damage to brand reputation, project delays, and poor employee morale.

When applications experience frequent or lengthy outages, poor performance, or unexpected errors, it also reflects poorly on the agile software development teams. IT departments that survey employees and measure customer satisfaction are unlikely to receive high scores if unreliable applications impact people’s work. It’s also harder for IT management to get budget increases, training, added compensation, or other benefits if the organization feels that the software development teams can’t release new capabilities reliably.

Development teams must take proactive steps to prevent problems, provide support during incidents, participate in the analysis of root causes, and prioritize work to address critical defects.
