best-practice – Service management for applications and cloud services

Posted 6 September 2018


Typically, the SLAs (Service Level Agreements) of applications, cloud services and networks are based on availability and limited to the boundaries that are under the direct control of a specific provider. The accompanying monthly service level reports show that most of the time things are in the green.

But what about the real user experience? What if, more often than not, the status bar of your browser shows “Waiting for <service>” for 10 seconds (or longer)?

In this blog you will learn how service management benefits from TCP-session-based SLOs (Service Level Objectives) for applications and cloud services.

This is complemented by an easy-to-understand analysis for when an SLO is not met, including the business impact in terms of affected clients and where they are coming from.

While perhaps somewhat outside the box, ITOM teams that have these kinds of SLOs experience them as a key element of a constructive monthly service management review. This is because any conversation about user complaints and slow services is now based on real user data and facts.

Context and definitions

Regardless of what your SLAs look like, users evaluate performance on 2 aspects:

(1) – the things that are happening on the screen and

(2) – if these are not what is expected: the time it takes to get back to “normal”.

This means the single most important SLO is the End User Response Time (from here on EURT).

Most apps and cloud services are made available through secure, encrypted connections. Because of this, the actual application transactions that users start are not visible. Therefore, we define EURT as the aggregated result of the Server Response Time (from here on SRT), the Data Transfer Time (from here on DTT) and the network RTT (i.e. Round Trip Time).

The SRT is the time between the first packet sent as a result of a user click and the first packet of the response coming from the server/service; this is sometimes referred to as first-byte response.

The DTT is the time it takes to transfer the data of a transaction. Besides the well-known TCP events RST and FIN | FIN-ACK that end a transaction, it is assumed that the pause between 2 consecutive packets belonging to the same transaction never exceeds 1 second. If it does, the next packet is considered part of a data transfer belonging to a new transaction within the existing connection.
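The 1-second rule can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular product implements it: the function names are made up for this example, the input is assumed to be the sorted packet timestamps (in seconds) of the data packets of one connection, and the RST and FIN | FIN-ACK handling is left out.

```python
from typing import List

# From the text: a pause of more than 1 second starts a new transaction.
IDLE_GAP_SECONDS = 1.0


def split_transactions(timestamps: List[float],
                       idle_gap: float = IDLE_GAP_SECONDS) -> List[List[float]]:
    """Group ascending packet timestamps of one connection into transactions.

    Hypothetical helper: RST/FIN handling is deliberately not modelled here.
    """
    transactions: List[List[float]] = []
    for ts in timestamps:
        # Same transaction as long as the pause since the previous packet
        # does not exceed the idle gap.
        if transactions and ts - transactions[-1][-1] <= idle_gap:
            transactions[-1].append(ts)
        else:
            transactions.append([ts])
    return transactions


def data_transfer_times(timestamps: List[float]) -> List[float]:
    """DTT per transaction, taken as last packet time minus first packet time."""
    return [t[-1] - t[0] for t in split_transactions(timestamps)]


if __name__ == "__main__":
    packets = [0.0, 0.25, 0.5, 2.0, 2.25]
    print(split_transactions(packets))
    print(data_transfer_times(packets))
```

In this example the pause between 0.5 and 2.0 is longer than 1 second, so the five packets are treated as two transactions within the same connection.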

The network RTT is defined as the time between any packet and its corresponding TCP acknowledgment; packets with RST or FIN flags are excluded.
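To make the relation between the three metrics concrete, here is a small Python sketch. It reads “aggregated result” as a plain sum, which is an assumption made for this example; the class and method names are hypothetical as well.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TransactionTiming:
    """Timing of one transaction, all values in seconds (hypothetical type)."""
    srt: float  # Server Response Time
    dtt: float  # Data Transfer Time
    rtt: float  # network Round Trip Time

    @property
    def eurt(self) -> float:
        # Assumption: the "aggregated result" is modelled as a simple sum.
        return self.srt + self.dtt + self.rtt

    def shares(self) -> Dict[str, float]:
        """Fraction of the EURT contributed by each metric."""
        total = self.eurt
        if total <= 0:
            raise ValueError("EURT must be positive to compute shares")
        return {
            "SRT": self.srt / total,
            "DTT": self.dtt / total,
            "RTT": self.rtt / total,
        }


if __name__ == "__main__":
    timing = TransactionTiming(srt=0.25, dtt=1.0, rtt=0.25)
    print(timing.eurt)
    print(timing.shares())
```

The shares are the kind of ratio that helps to see at a glance which of the three metrics dominates the user experience.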

Defining thresholds

The first step is defining a threshold by measuring the EURT for at least 5 consecutive working days; ideally including a breakdown by client and server/cloud zone (figure 1).

In this example we are defining the thresholds for the real users (i.e. the “Office”-zone) and the business-critical apps labelled as “G-suite”, “IaaS” and “mail-Online”. To be specific: the thresholds are set at 1.2, 2.1 and 1.5 seconds for G-suite, IaaS and mail-Online respectively.

figure 1 – The EURT (End User Response Time) for different client- and cloud zones
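How a threshold is derived from the baseline measurements is not prescribed here. One possible way, shown purely as an assumption, is to take a high percentile of all EURT samples collected over the 5 (or more) working days. The function names and the 95th percentile default below are illustrative choices, not part of the method described in this blog.

```python
import math
from typing import List, Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct between 0 and 100)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Multiply before dividing to keep the rank exact for typical inputs.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def propose_threshold(daily_samples: List[List[float]], pct: float = 95.0) -> float:
    """Propose an EURT threshold (seconds) from per-day EURT samples.

    Assumption: the threshold is a percentile of the baseline, and the
    baseline must cover at least 5 working days, as recommended in the text.
    """
    if len(daily_samples) < 5:
        raise ValueError("need at least 5 working days of measurements")
    all_samples = [s for day in daily_samples for s in day]
    return percentile_nearest_rank(all_samples, pct)


if __name__ == "__main__":
    # Five working days of (made-up) EURT samples for one client/cloud zone pair.
    baseline = [[1.0, 1.1, 1.2, 1.3] for _ in range(5)]
    print(propose_threshold(baseline))
```

Running this per combination of client zone and cloud zone gives one candidate threshold per cell of the breakdown; the final values remain a judgment call.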

Once these thresholds are set, your dashboard with business-critical apps should look like the one in figure 2.

As you have probably guessed, the colored areas are the results during daytime, while the gray areas are the results during nighttime, when there are few users (if any at all).

With such a dashboard, it is easy to see which apps and cloud services require your attention and to do a quick problem analysis of their condition, ideally including the business impact.

figure 2 – Once the thresholds are set…

Problem analysis and business impact

Analyzing their condition starts with selecting the troublesome period based on the user complaints. In our example we picked 2018-08-31 from 8 AM to 6 PM (i.e. August 31st; figure 3). Here you see that users of “mail-Online” are suffering throughout the day.

figure 3 – Start with problem analysis for mail-Online

With a drill-down to the app dashboard you see the ratios of the 3 contributing EURT metrics (figure 4). You also see which hosts are involved in executing the user requests and where these users are coming from. Both breakdowns indicate that the DTT is the biggest contributor to the EURT.

figure 4 – Breakdown by EURT metrics, hosts and client zones

To understand the business impact, click on the DTT area of the client zone for a list of affected clients, their IP addresses and contextual aspects of their experience for the specified reporting period (figure 5).

figure 5 – Business impact based on affected clients
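If you export the measurements, a list of affected clients like the one above can be approximated with a few lines of Python. This is only a sketch: the record layout (client IP, client zone, EURT in seconds), the function name and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (client IP, client zone, EURT in seconds).
Record = Tuple[str, str, float]


def affected_clients(records: Iterable[Record], threshold: float) -> Dict[str, List[str]]:
    """Return, per client zone, the clients with at least one EURT above the threshold."""
    by_zone = defaultdict(set)
    for client_ip, zone, eurt in records:
        if eurt > threshold:
            by_zone[zone].add(client_ip)
    # Sort for stable, readable output.
    return {zone: sorted(ips) for zone, ips in by_zone.items()}


if __name__ == "__main__":
    samples = [
        ("10.0.0.5", "Office", 2.4),
        ("10.0.0.5", "Office", 1.1),
        ("10.0.0.9", "Office", 1.9),
        ("10.0.1.7", "Office", 0.8),
    ]
    # 1.5 seconds is the mail-Online threshold used in this blog.
    print(affected_clients(samples, threshold=1.5))
```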

Since the ratios between DTT and SRT/RTT appear to be in the hundreds-plus range, the prime suspect for the root-cause is either the client, the throughput of the network path or a combination of these.

If the ratios between DTT and SRT/RTT were well below the hundreds-plus range, the root cause would most likely be on the server side.

Something similar applies to SRT and RTT: if both are in the same (milli)seconds range, the root cause is most likely related to the network.
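These rules of thumb can be written down as a small Python helper. The cut-off values are assumptions chosen for this sketch: “hundreds-plus” is read as a ratio of 100 or more, and “the same range” as SRT and RTT differing by less than a factor of 10. The function name and the hint texts are made up as well, and the outcome is a hypothesis to verify, not a verdict.

```python
from typing import List


def root_cause_hints(srt: float, dtt: float, rtt: float,
                     high_ratio: float = 100.0,
                     same_range_factor: float = 10.0) -> List[str]:
    """Translate the SRT/DTT/RTT ratios (all in seconds) into root-cause hints.

    Assumptions: high_ratio and same_range_factor are illustrative cut-offs.
    """
    if srt <= 0 or rtt <= 0:
        raise ValueError("SRT and RTT must be positive")

    hints: List[str] = []
    if dtt / srt >= high_ratio and dtt / rtt >= high_ratio:
        # DTT dwarfs both other metrics.
        hints.append("client or network path throughput")
    else:
        # Simplification: everything below the cut-off is treated as
        # "well below the hundreds-plus range".
        hints.append("server side")

    if max(srt, rtt) / min(srt, rtt) < same_range_factor:
        # SRT and RTT are in the same range.
        hints.append("network")
    return hints


if __name__ == "__main__":
    print(root_cause_hints(srt=0.005, dtt=2.0, rtt=0.004))
    print(root_cause_hints(srt=0.5, dtt=1.0, rtt=0.01))
```

Because the rules can overlap, the helper returns a list of hints rather than a single answer.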

A deep dive into the details of the TCP application flows is recommended to substantiate either of these hypotheses. Use our best practice for a detailed, 5-step approach to identifying the healthy and not-so-healthy TCP connections.

Conclusion

In today’s hybrid application and infrastructure landscape it can become a real challenge to define and monitor performance-related SLOs (Service Level Objectives).

A network-oriented approach gives you a shortcut, as it covers the relevant KPIs, the definition of thresholds for these SLOs, problem analysis (including business impact) and TCP-connection deep dives into all the flows.

The approach works for all your business-critical cloud services, applications and hybrid infrastructures, regardless of whether sessions are encrypted.

How about giving it a try? It is free, fun and above all, educational! Click here to get started.