How to identify the golden metrics for SRE

29 Apr 2023

News, Observability

This is part 1 of the 3-part series “The path to your first SLO”.

When building an observability practice, many of the customers we have talked to struggle to decide what to observe, and are often frustrated by alarm storms and false alarms. ITOps teams are concerned with centralized monitoring and with gathering metrics from different systems for proactive monitoring. App owners are interested in fast root cause analysis and end-to-end tracing. Usually ITOps acts as the first tier of monitoring, watching the vital health signals of different systems and alerting the right app teams for in-depth diagnostics.

The requirements are clear: applications need to supply the right metrics to ITOps. These may range from simple up/down availability metrics and K8s metrics to CPU/memory utilization and disk consumption. The challenge usually lies in gathering application metrics from the app layer. Whatever monitoring tools you use, the following four golden signals are usually what you need.

Latency – the request response time

Traffic – the number of requests per second

Errors – the number of errors or error rate

Saturation – how close a resource is to its limit as load increases

There are in-depth explanations of each of these all over the web, so we will not repeat the details here. The important thing is to observe all four signals for each “user journey” or “service endpoint”. For an ecommerce application, for example, the user journeys would be “Login”, “Browse Catalog”, “Add to Cart” and “Checkout”.
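As a rough illustration of tracking the four signals per user journey, the Python sketch below summarises a batch of request records for one time window. Everything in it is an assumption made for the example: the `Request` record, the `golden_signals` function, the choice of 95th percentile latency, and the use of peak CPU utilisation as a stand-in for saturation. In practice these numbers would come from your monitoring tool rather than from hand-written code.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request against a user journey."""
    journey: str        # e.g. "Login", "Checkout"
    duration_ms: float  # request response time
    ok: bool            # False if the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, cpu_samples):
    """Summarise the four signals per journey for one time window.

    cpu_samples maps journey -> CPU utilisation readings (0.0 to 1.0)
    from the service behind that journey; an assumed saturation proxy.
    """
    by_journey = defaultdict(list)
    for r in requests:
        by_journey[r.journey].append(r)

    report = {}
    for journey, reqs in by_journey.items():
        durations = [r.duration_ms for r in reqs]
        errors = sum(1 for r in reqs if not r.ok)
        report[journey] = {
            "latency_p95_ms": percentile(durations, 95),      # Latency
            "traffic_rps": len(reqs) / window_seconds,        # Traffic
            "error_rate": errors / len(reqs),                 # Errors
            "saturation": max(cpu_samples.get(journey, [0.0])),  # Saturation
        }
    return report


# Made-up sample data for a 60-second window.
sample = [
    Request("Login", 120, True),
    Request("Login", 340, True),
    Request("Login", 95, False),
    Request("Checkout", 800, True),
    Request("Checkout", 1500, False),
]
report = golden_signals(
    sample,
    window_seconds=60,
    cpu_samples={"Login": [0.41, 0.55], "Checkout": [0.72, 0.91]},
)
for journey, signals in report.items():
    print(journey, signals)
```

Grouping by journey first means a slow or failing “Checkout” cannot be hidden behind a healthy “Login”, which is the reason for observing all four signals per journey instead of per system.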

These high-level metrics are what we call “Work metrics”. They are combined with the lower-level system metrics, the “Resource metrics”. From here, organizations can define their important SLOs (Service Level Objectives) and decide how to monitor and meet those SLOs with the selected SLIs (Service Level Indicators). These SLIs are the metrics to set alerts on, so that you observe what matters most to your organization. In the next article we will talk about common practices for gathering these metrics from leading monitoring tools.
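To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes an availability-style SLI (the share of good requests) and a 99.5% target for a “Checkout” journey. The `SLO` class, the function names, the target and the request counts are all hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical objective for one user journey."""
    name: str
    target: float  # e.g. 0.995 means 99.5% of requests must be good


def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that were good in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return good_requests / total_requests


def error_budget_remaining(slo, good_requests, total_requests):
    """Fraction of the allowed bad requests that is still unspent."""
    allowed_bad = (1 - slo.target) * total_requests
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return 1 - actual_bad / allowed_bad


# Made-up numbers for illustration.
checkout_slo = SLO("Checkout availability", target=0.995)
sli = availability_sli(good_requests=99_700, total_requests=100_000)
budget = error_budget_remaining(checkout_slo, 99_700, 100_000)

print(f"SLI: {sli:.3%}, SLO met: {sli >= checkout_slo.target}")
print(f"Error budget remaining: {budget:.0%}")
```

Alerting on how quickly the remaining budget shrinks, rather than on every individual error, is one common way to cut down on the alarm storms and false alarms described at the start of this article.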

New to SLO?
#SLOconf is a free, virtual event focused on #SLOs! 🔥
Whether you work in SRE, SLOs, DevOps, Ops, or Dev, SLOconf is the perfect platform to share insights and ideas on the latest trends and developments in SRE/SLO.
Vsceptre is a sponsor at SLOconf 2023, hosted by Nobl9! 📢
For more details & speaker lineup, register here: 👇
www.sloconf.com
