Dynatrace: Service Level Objectives (SLOs) at Scale (Tips and Tricks)

As our customers adopt agile software development and continuous delivery to drive value faster, they face new risks that could impact availability, performance, and business KPIs. These new risks have driven our customers to adopt Site Reliability Engineering (SRE) teams to help create reliable and scalable software systems without slowing down. However, adding more stakeholders can also run the risk of silos developing between internal organizations.
At Dynatrace, we recognized this increased need for SRE teams and the need to break down silos. The solution is a cohesive platform that enables SRE teams, application developers, support teams, and other stakeholders to work together from the same intelligence.
To meet this need for cross-team collaboration, the Dynatrace Software Intelligence Platform provides a place for SREs to define Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs). By defining a common set of SLOs in a unified observability platform, stakeholders can work together to build reliable and scalable software systems that automatically meet agreed-upon service levels without slowing down.
Because of the ever-evolving nature of continuous delivery, it’s not enough to just create a single SLO. Teams need to continuously define SLOs. To define SLOs at scale, Dynatrace provides an all-in-one SLO API. We’ll explore the process of defining SLOs and using the API to scale-out SLOs.
An SLI is a measurement of the performance or availability of an entity. The measurement equates to a metric that captures expected results. The first step in defining an SLO is to identify the success metric. Dynatrace provides many built-in metrics you can use, or you can create your own calculated metrics for any of the following entities:
Here is an example of a calculated metric for a service. This calculated metric is a request count of all requests to a service with a response time of less than 4 seconds (see red boxes). Later, we’ll create an example SLO for this calculated metric (to see it, skip to the measure section under creating SLOs):
Tip: The calculated metric should be a count/value metric, which counts all the instances that meet the filter criteria, in this case, Response time ≤ 4s. This is a good practice because the objective of SLIs is to measure the expected results. In this scenario, the expectation is that requests will be processed in 4 seconds or less.
Once you identify the success metric, you can create the SLO. In this example, the success metric is a performance measure of request response times of less than 4 seconds, which can result in the following SLO: “95% of service requests will respond in less than 4 seconds.” To further automate this process, you can use the calculated metric APIs to create a template of a calculated metric.
Tip: When generating calculated metrics via API, use a naming convention. This makes it easier to keep track of your calculated metrics. One example of a naming convention can be {Environment}_{App Name}_{Component}_{Measure type}, where:
Below is a list of possible SLI measurements, sorted by entity and measure. Use this list to copy and create your SLOs. This list contains metrics used in the numerator of the SLO.
Calculated Service Metric
Now that we’ve identified the SLI, we can create the SLO. There are four components of an SLO: the measure, the entity selector, the target, and the evaluation timeframe.
You can define the measure as either a rate or metric expression.
A rate measure simply consists of a single rate metric, for example, the availability rate for HTTP monitors. The metric expression consists of a numerator, the SLI we identified above, and the denominator (the total count). To learn more about how to use metric expressions in SLOs, see Level up your SLOs by adding math to the equation. Once you’ve identified the numerator and respective denominator, the metric expression is complete. In our example, we’ll use the following metric expression for all SLOs – (100)*((numerator)/(denominator)).
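As a rough illustration of the arithmetic behind that expression, the sketch below computes an SLO status from two counts. It is plain Python rather than Dynatrace metric expression syntax, and the function name and sample counts are hypothetical.

```python
def slo_status(numerator: float, denominator: float) -> float:
    """Return (100)*((numerator)/(denominator)) as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator (total count) must be positive")
    return 100 * (numerator / denominator)


# Hypothetical counts: 9,620 of 10,000 requests responded in under 4 seconds.
fast_requests = 9_620
total_requests = 10_000
print(f"{slo_status(fast_requests, total_requests):.2f}%")  # prints 96.20%
```

With these made-up counts the status sits above a 95% target, so the SLO would be met.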
Below is a list of the Overall Count metric sorted by entity (denominator) and the metric used for each:
Tip: SLO w/Metric Expression Example (Service Performance):
The entity selector identifies the entities (applications, services, user actions, and so on) the SLO should apply to. SLOs have problem indicators to let us know there’s an active problem with the identified entities. Dynatrace uses the entity selector during problem analysis. As such, all SLOs should have an entity selector to allow Dynatrace to apply problem analysis. We’ve identified the typical entity selectors, but to study more combinations, see Environment API v2 – Entity Selector.
The SLO target can be considered as the SLO requirement. An SLO requirement is the agreed-upon threshold for the metric and equates to some amount of acceptable downtime. The SLO target is made up of a target and warning. An example of an SLO requirement can be “95% of service requests will respond in less than 4 seconds”. The target will be 95%. The warning is a way to be aware of when the measurement is still in an acceptable range but is approaching the target (for example, 96% would be displayed in yellow text). Any percentage above the warning threshold will be shown in green text (for example, 97% or greater). The chart below breaks down possible SLO targets and their downtime.
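To make the link between a target and its acceptable downtime concrete, here is a minimal sketch of the arithmetic, together with the color logic described above. It assumes a 30-day evaluation period, a warning threshold of 97%, and that values below the target are shown in red; the function names and the exact handling of boundary values are assumptions, not Dynatrace behavior.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float, period: timedelta) -> timedelta:
    """Time within `period` that may miss the objective while still meeting the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return period * ((100 - target_percent) / 100)


def status_color(value: float, target: float, warning: float) -> str:
    """Classify an SLO value (assumed rule: red < target <= yellow < warning <= green)."""
    if value < target:
        return "red"
    if value < warning:
        return "yellow"
    return "green"


month = timedelta(days=30)  # assumed evaluation period
for target in (95.0, 99.0, 99.9):
    print(f"{target}% target allows {allowed_downtime(target, month)} of downtime")

for value in (94.0, 96.0, 97.0):
    print(f"{value}% is shown in {status_color(value, target=95.0, warning=97.0)}")
```

For a 95% target over 30 days, the allowance works out to 1.5 days; tightening the target to 99.9% shrinks it to roughly 43 minutes.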
The evaluation timeframe is the agreed-upon period over which you evaluate the SLO. You can define the evaluation timeframe using Dynatrace timeframe selector expressions. A typical evaluation timeframe is a full month, which you can express with the timeframe selector ‘-1M/M to now/M’. This selector reads as follows: start one month ago, rounded down to the start of that month (to always get the start of the previous month), and end now, rounded down to the start of the current month, so the timeframe always covers the last full calendar month.
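The window that this selector is understood to describe, the last complete calendar month, can be sketched in plain Python. This is not Dynatrace code; the function name is hypothetical, and it assumes that `/M` rounds down to the start of the month.

```python
from datetime import datetime, timezone


def previous_full_month(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the last complete calendar month before `now`."""
    # "now/M": round the current time down to the start of the current month.
    end = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # "-1M/M": step back one month, wrapping from January to the prior December.
    if end.month == 1:
        start = end.replace(year=end.year - 1, month=12)
    else:
        start = end.replace(month=end.month - 1)
    return start, end


start, end = previous_full_month(datetime(2022, 1, 18, 14, 39, tzinfo=timezone.utc))
print(start.isoformat(), "to", end.isoformat())
```

Evaluated on 18 January 2022, the window runs from the start of December 2021 to the start of January 2022.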
Once you generate the SLO, you can use the SLO API to get the JSON template of the SLO to automate SLOs for similar measures.
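One way to reuse such a template is to fill it in from code. The sketch below only assembles and prints a JSON body; it makes no API call. The field names, the metric keys, and the entity selector are assumptions for illustration, so compare them with the JSON template that the SLO API returns in your own environment before relying on them.

```python
import json


def build_slo_payload(
    name: str,
    numerator_metric: str,
    denominator_metric: str,
    entity_selector: str,
    target: float = 95.0,
    warning: float = 97.0,
    timeframe: str = "-1M/M to now/M",
) -> dict:
    """Assemble an SLO definition as a JSON-serializable dict.

    All field names below are assumptions, not a confirmed API schema.
    """
    # Mirrors the article's description: the warning sits above the target.
    if not target < warning <= 100:
        raise ValueError("expected target < warning <= 100")
    return {
        "name": name,
        "metricExpression": f"(100)*(({numerator_metric})/({denominator_metric}))",
        "evaluationType": "AGGREGATE",
        "filter": entity_selector,
        "target": target,
        "warning": warning,
        "timeframe": timeframe,
        "enabled": True,
    }


payload = build_slo_payload(
    name="Prod_Shop_Checkout_Performance",  # hypothetical SLO name
    numerator_metric="calc:service.prod_shop_checkout_performance",  # hypothetical key
    denominator_metric="builtin:service.requestCount.total",  # assumed built-in key
    entity_selector='type("SERVICE")',  # assumed selector
)
print(json.dumps(payload, indent=2))
```

Keeping the template in one function means that similar measures differ only in the arguments passed in, which is what makes it practical to generate many SLOs.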
Tip: When generating SLOs using the SLO API, apply a naming convention. For example, you can use the naming convention we expressed earlier: {Environment}_{App Name}_{Component}_{Measure Type}.
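A small helper can keep generated names consistent with this convention. The sketch below is one possible approach; the function name, the sample values, and the rule of replacing spaces and other special characters with hyphens are assumptions, not Dynatrace requirements.

```python
import re


def slo_name(environment: str, app_name: str, component: str, measure_type: str) -> str:
    """Build a name following {Environment}_{App Name}_{Component}_{Measure Type}."""
    cleaned = []
    for part in (environment, app_name, component, measure_type):
        # Assumed rule: collapse anything that is not a letter or digit into "-",
        # so the underscore remains an unambiguous separator between the parts.
        token = re.sub(r"[^A-Za-z0-9]+", "-", part.strip()).strip("-")
        if not token:
            raise ValueError(f"empty name part: {part!r}")
        cleaned.append(token)
    return "_".join(cleaned)


print(slo_name("Prod", "Web Shop", "Checkout Service", "Performance"))
# prints Prod_Web-Shop_Checkout-Service_Performance
```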
Start utilizing SLOs TODAY to take control of your agile software development and continuous delivery practices. First, start small and end big: start with a single application and identify all the key user actions, key service requests, and so on, and define SLOs for each. Eventually, set a requirement for each application to have a preset of SLOs defined. You can then work your way to full-on SLO automation.
Tip: This community tool can help automate SLOs – MONACO. MONACO is a CLI tool to automate the deployment of Dynatrace monitoring configuration.
As you build out your SLO implementation, use the Dynatrace Community SLO Forum to read more about upcoming SLO features.
Happy tracking SLOs!
To learn more about how Dynatrace does SLOs, check out the on-demand performance clinic, Getting started with SLOs in Dynatrace.

Dynatrace Inc. published this content on 18 January 2022 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 18 January 2022 14:39:01 UTC.

