At your workplace, you might have noticed that people have different perceptions of DevOps. Some say it’s a cultural shift, some believe it’s all about automation and tooling, and others think it simply brings Dev and Ops together. With so many different viewpoints, any definition of DevOps success is going to vary. In this blog, I am going to focus on key DevOps metrics that provide the information necessary to meet business goals. I prefer to categorize DevOps metrics into four groups: Release Confidence, Velocity, Quality, and Stability. If you have implemented efficient DevOps tooling and best practices, then deriving these metrics is straightforward using Splunk, Athena, or a BI reporting platform.
Let’s start with the first group, “Velocity,” which covers the key elements contributing to speed.
Build duration (hours): The amount of time it takes to build a package. This indicates the relative levels of automated and manual processes involved, and thus the potential to improve speed and reliability.
Test duration / QA cycle time (hours): The total time for a built package to go through the QA cycle, including unit testing, functional testing, performance testing, and security testing. You can also track cycle time as a separate metric per testing phase. This helps me understand how long it takes a build to go through an entire QA cycle and how to improve speed and shift testing left in the development cycle.
Deployment duration (hours): The amount of time a deployment takes. My team uses this to understand whether deployments involve excessive delays, bulky run sheets, or manual processes. Speedy deployments go a long way toward reducing total release cycle time and positively impact release frequency.
Deployment frequency: The number of deployments per application in a specific period, indicating how frequently code is deployed into SIT, pre-prod, prod, etc.
Environment provisioning duration (hours): The time taken to provision a new copy of an environment, such as Dev or an integrated test setup. This helps me understand the effort of creating environments, the manual processes involved, the wait time to get a stable environment, and how environment availability impacts project work.
You can fetch the above duration data from Jenkins automated job runs and manual process time windows, or from tools such as Maven, Docker, AWS, Chef, Puppet, Selenium, and JMeter, recording durations with custom code snippets.
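As a minimal sketch of the kind of custom snippet I mean: Jenkins exposes per-build data through its JSON API (each build record carries a `duration` in milliseconds and a `result`). Assuming the build records have already been fetched into plain dicts, summarizing them into an hours-based metric looks like this:

```python
# Sketch: summarizing build durations from Jenkins' JSON API.
# Jenkins build records carry "duration" in milliseconds and "result"
# (None while the build is still running); we assume they have already
# been fetched into plain dicts.

def build_duration_hours(builds):
    """Return (average, maximum) build duration in hours for completed builds."""
    durations = [b["duration"] / 3_600_000  # ms -> hours
                 for b in builds if b.get("result") is not None]
    if not durations:
        return 0.0, 0.0
    return sum(durations) / len(durations), max(durations)

builds = [
    {"number": 101, "duration": 5_400_000, "result": "SUCCESS"},  # 1.5 h
    {"number": 102, "duration": 9_000_000, "result": "FAILURE"},  # 2.5 h
    {"number": 103, "duration": 1_800_000, "result": None},       # still running
]
avg_h, max_h = build_duration_hours(builds)
print(f"avg={avg_h:.1f}h max={max_h:.1f}h")  # avg=2.0h max=2.5h
```

The same shape works for test, deployment, and provisioning durations; only the data source changes.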
Change volume: The volume of code changes, focusing on the new lines of code deployed per build. The key is to understand the risk of a given deployment: more user stories and more lines of code mean more risk.
Commits per day/Committers per day: The number of check-ins (commits) per day or per build. This helps me understand how frequently developers are pushing their code to the codebase. From it, I can also derive the number of active developers and the branches receiving check-ins.
Merge frequency: The number of merges recorded for a feature branch over a period, to ensure that developers are integrating their work regularly and developing against the latest code set. I also measure merge duration to identify the effort required to accept changes and the scope for improvement. I highly recommend running a complete set of checks automatically on every merge request event.
You can collect this data from Jenkins pipelines (SCM checkout, SCM audit, on-commit, and on-merge) and from version control tools such as Git, Mercurial, Subversion, Bitbucket, RTC, and ClearCase, recording the above information with custom code snippets.
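For instance, commits per day and committers per day fall straight out of the version control log. A sketch, assuming the input is `git log --pretty=format:"%h|%an|%ad" --date=short` output (the sample log below is illustrative):

```python
# Sketch: commits-per-day and committers-per-day from git log output
# produced with --pretty=format:"%h|%an|%ad" --date=short.
from collections import defaultdict

def commit_stats(log_text):
    """Map each date to (commit count, distinct committer count)."""
    per_day = defaultdict(lambda: [0, set()])
    for line in log_text.strip().splitlines():
        _sha, author, date = line.split("|")
        per_day[date][0] += 1
        per_day[date][1].add(author)
    return {d: (n, len(authors)) for d, (n, authors) in per_day.items()}

sample_log = """\
a1b2c3d|alice|2024-05-01
d4e5f6a|bob|2024-05-01
b7c8d9e|alice|2024-05-02"""
stats = commit_stats(sample_log)
print(stats)  # {'2024-05-01': (2, 2), '2024-05-02': (1, 1)}
```

Counting merge commits per branch (for merge frequency) is the same parse with `--merges` added to the log command.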
Just focusing on speed won’t help, though. You need to give equal attention to quality. If you can release changes biweekly but your package is full of defects and bad code quality, customers are not going to be happy. Focus on shifting QA checks early into development and measure whether your product is ticking all the boxes.
Code quality - Bugs, vulnerabilities, technical debt, duplications, pass rate (%): If you have implemented SonarQube and integrated it with your orchestration tool (such as Jenkins) or your developers’ IDEs, then measuring these quality concerns and blocking bad code is straightforward. SonarQube can give you a report card with a line-by-line analysis of issues, making a developer’s life much easier.
Test coverage (%): The amount of code associated with automated test scripts. You can further break it down into Unit Test Coverage, Automated Test Coverage (the amount of the test model covered with automated tests), and more. The higher the percentage, the lower the risk of performing refactoring exercises. You can use tools such as Cobertura or coverage.py.
Test pass rate (%): The percentage of tests that pass per build. The overall pass rate combines the success rates of the unit, functional, performance, and security tests. You can get this data from an orchestration tool such as Jenkins or from your respective test tools such as Selenium, JMeter, and JUnit.
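One subtlety when combining phases: weighting by test count avoids a tiny security suite skewing the overall number. A sketch with illustrative phase names and counts:

```python
# Sketch: combining per-phase (passed, total) results into one overall
# test pass rate, weighted by test count. Figures are illustrative.

def pass_rate(phases):
    """Overall pass rate (%) across all test phases."""
    passed = sum(p for p, _ in phases.values())
    total = sum(t for _, t in phases.values())
    return 100.0 * passed / total if total else 0.0

phases = {
    "unit":        (480, 500),
    "functional":  (95, 100),
    "performance": (18, 20),
    "security":    (9, 10),
}
print(f"{pass_rate(phases):.1f}%")  # 95.6%
```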
Defect density (%)/Defects per story points delivered in release: Most of the time, Dev and Ops silos arise because it isn’t possible to identify where a problem is coming from. You can relate development quality to defects escaped to production by counting the defects raised against the work delivered, in story points, per release. If defect density is consistently bad, focus on identifying the root cause: for example, whether the test and prod environments have different configurations, or whether the package being deployed is actually the one that was tested and approved.
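The calculation itself is a simple ratio per release; normalizing to defects per 100 story points makes releases of different sizes comparable. A sketch with illustrative figures:

```python
# Sketch: defect density as escaped production defects per 100 story
# points delivered, tracked per release. Figures are illustrative.

def defect_density(defects, story_points):
    """Escaped defects per 100 story points delivered."""
    return 100.0 * defects / story_points if story_points else 0.0

releases = {"R1": (6, 120), "R2": (14, 110), "R3": (4, 130)}
for name, (bugs, points) in sorted(releases.items()):
    print(name, f"{defect_density(bugs, points):.1f} defects / 100 pts")
```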
Defect leakage/Defect reintroduction rate: Measures the effectiveness of developers’ local unit testing. It is the number of defects whose fixes break other functionality and cause further defects to be raised.
Defect ageing: The number of days a defect remains open. This helps me understand how quickly fixes are being made. You can fetch defect count and ageing data from tracking tools such as JIRA, ServiceNow, or Remedy. I recommend integrating your orchestration tool with JIRA so that even development defects are created automatically on test failure and tracked along with the feature story.
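Computing ageing from tracker exports is just date arithmetic: resolved defects age from creation to resolution, open ones age against today. A sketch, assuming records carry created/resolved dates:

```python
# Sketch: defect ageing from issue-tracker records (e.g. a JIRA export).
# Resolved defects age from created to resolved; open defects age
# against a reference date (defaulting to today).
from datetime import date

def defect_age_days(created, resolved=None, today=None):
    """Days a defect has been (or was) open."""
    end = resolved or today or date.today()
    return (end - created).days

closed_age = defect_age_days(date(2024, 5, 1), resolved=date(2024, 5, 9))
open_age = defect_age_days(date(2024, 5, 1), today=date(2024, 5, 15))
print(closed_age, open_age)  # 8 14
```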
Now we have speed and quality built into development. But what about the situation on the other side of the fence? Yes, Ops. One of the objectives of DevOps is breaking the silos between Dev and Ops through shared responsibility, common objectives, use of the same tools and, most importantly, collaboration. To achieve this, we need to measure whether our processes and automation are reliable, repeatable, and consistent, using the metrics below.
Build pass rate (%): The percentage of successful builds per release.
Deployment pass rate (%): The percentage of successful deployments per release.
Env. provisioning pass rate (%): The percentage of successful environments created per release.
Incidents/Deployment: The number of incidents and defects reported for application releases over a period.
MTTR (mean time to recover): The overall time it takes to restore a service when an incident is reported, i.e. to develop a fix and roll it out to production when the disruption is due to a defect introduced by a deployment.
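MTTR is straightforward to derive once you have reported/restored timestamps for each incident, which change management tools record. A sketch with illustrative incident data:

```python
# Sketch: MTTR as the average time from incident report to service
# restoration, computed from (reported, restored) timestamp pairs.
from datetime import datetime

def mttr_hours(incidents):
    """Mean time to recover, in hours, across resolved incidents."""
    spans = [(restored - reported).total_seconds() / 3600
             for reported, restored in incidents]
    return sum(spans) / len(spans) if spans else 0.0

incidents = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 11, 0)),   # 2 h
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 18, 0)),   # 4 h
]
print(f"MTTR = {mttr_hours(incidents):.1f} h")  # MTTR = 3.0 h
```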
Change success rate (%): The percentage of changes that are deployed to production successfully.
Hotfixes: The number of incidents for which emergency hot-fixes are performed. You can get the number of hotfixes and total number of incidents from change management tools such as ServiceNow, Remedy, and JIRA.
Deployment downtime (hours): The amount of time the service is completely unavailable during a deployment. This helps us understand the cost of a deployment in terms of lost service.
Uptime (availability) (%): The percentage of continuous availability of service over a period. You can get downtime and availability data from your data centers or cloud services such as AWS or Azure.
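Availability over a period reduces to total downtime against total time. A sketch, assuming downtime windows are recorded in minutes over a 30-day month:

```python
# Sketch: availability (%) over a period from recorded downtime windows.
# Window lengths are in minutes; the period here is a 30-day month.

def uptime_percent(downtime_minutes, period_days=30):
    """Percentage of the period the service was available."""
    total = period_days * 24 * 60
    return 100.0 * (total - sum(downtime_minutes)) / total

outages = [43.2, 43.2, 43.2, 43.2, 43.2]  # five ~43-minute outages
print(f"{uptime_percent(outages):.2f}%")  # 99.50%
```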
Once you have focused on the three main pillars of delivery (velocity, quality, and stability), you will automatically gain the confidence to push releases into production.
Release readiness metrics
Release confidence (%): During PI planning and sprint planning, how many teams are confident that they can deliver high-quality work on time, without any defects, deployment failures, or post-production incidents? Very few, because most of them are not measuring what went wrong last time.
Release confidence % = average over the last 3 releases of (Test pass rate + Build pass rate + Deployment pass rate + Env provisioning pass rate + Code quality pass rate + Change success rate) / 6
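In code, this is the mean of the six pass rates taken across the last three releases. A sketch with illustrative figures (the key names are mine, not a standard schema):

```python
# Sketch: release confidence as the mean of the six pass rates,
# averaged over the last three releases. Key names and figures
# are illustrative.

RATE_KEYS = ("test", "build", "deployment", "env_provisioning",
             "code_quality", "change_success")

def release_confidence(releases):
    """Mean of the six pass rates (%) across the given releases."""
    rates = [r[k] for r in releases for k in RATE_KEYS]
    return sum(rates) / len(rates)

last_three = [
    {"test": 96, "build": 98, "deployment": 94, "env_provisioning": 90,
     "code_quality": 92, "change_success": 97},
    {"test": 94, "build": 99, "deployment": 96, "env_provisioning": 93,
     "code_quality": 91, "change_success": 95},
    {"test": 97, "build": 97, "deployment": 98, "env_provisioning": 95,
     "code_quality": 94, "change_success": 98},
]
print(f"Release confidence = {release_confidence(last_three):.1f}%")
```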
Release cycle time: The time from check-in to deployment, i.e. from when new code starts development to when it is successfully deployed into production. Cycle time is an important indicator of an efficient process. Lean tools such as value stream mapping can be applied to the release cycle to identify waste to remove and steps to automate.
Release frequency: How frequently you are releasing your changes in production.
Release days remaining: The number of days remaining until the release.
Release (%) completed: The percentage of story points completed for the release.
Release size (story points): The total number of story points per release.
End-to-end traceability (%): The percentage of requirements linked with tests. To achieve end-to-end traceability, requirements should be linked with test cases. I also recommend connecting requirements in JIRA with ongoing development by having Jenkins, GIT, and SonarQube push updates to JIRA. You can easily get the above release-related metrics from tracking tools such as JIRA.
If you want to drive your DevOps journey in the right direction, start focusing on the precise set of DevOps metrics above. Collect them using the tools mentioned and push them to an analytics platform such as Splunk, Athena, or a BI reporting tool to generate useful dashboards. Once you start measuring, you can quickly identify the bottlenecks keeping you from matching business pace. In turn, you will foster a culture of continuous improvement.