Reduce Azure Log Analytics Cost by Tweaking Health Checks Logging

Aug 31, 2023ยท

8 min read

Open any book on building software & you will eventually spot a piece of wisdom resembling the following:

Observability is important

I agree with that statement. It's crucial to be able to ascertain that a system is functioning correctly, and when implemented well, it eases the troubleshooting process for when something inevitably goes awry.

Unfortunately, oftentimes the above statement gets translated into the following:

Log everything

In a world with free infinite storage, I say go ahead. But that world & our world are not the same.

Cloud providers are acutely aware that observability is important & will reinforce that fact throughout their documentation as a best practice. They will also charge for that observability accordingly. Couple that with the log-everything mantra & that is a recipe for a woefully melancholic invoice at the end of the month.

We experienced this melancholy recently with a simple ASP.NET Core Web API deployed to Azure Container Apps. The application had some supporting resources such as Azure SQL Database, Key Vault, App Configuration, Front Door, and because we strive to be good observability citizens, Application Insights & Log Analytics.

We deployed the API into a development environment, left it running for a couple of days & upon analysing our costs noticed the following:

Upon this realisation, the question that came to mind was:

"Why the **** is Log Analytics' cost so high?"

Keep in mind that this was in a development environment with no real user traffic apart from us doing some initial tests to verify that the app was working. This was while the application was just sitting there. So, we started digging.

We checked the log categories that we were sending to Log Analytics to see if we could spot some offenders:

CategoryUsage
AppTraces5,72 GB
AppRequests690 MB
AppDependencies460 MB
AzureDiagnostics330 MB
AppPerformanceCounters60 MB
AACHttpRequest40 MB
AzureMetrics20 MB

That's a lot of AppTraces. Digging into what these AppTraces were, we determined that these were generated by another thing that is supposed to be good practice - health checks. A lot of health checks.

The API we deployed included two kinds of health checks: liveness and readiness. These terms are widely used in the Kubernetes world to determine whether an application is running (live) & ready to receive requests (ready) & the Kubernetes API uses these checks to determine if pods should be restarted and if traffic should be sent to them. In our API's case, the liveness check was a /health/live endpoint which returned a simple 200 OK response to indicate that the API was running, whereas the readiness check was a /health/ready endpoint which queried critical dependencies to check that all of them were happy & healthy as well.

Digging into the logging for our API, a call to the /health/live endpoint generated the following logs:

[09:23:32 INF] Request starting HTTP/2 GET https://localhost:5001/health/live - -
[09:23:32 INF] Executing endpoint 'Health checks'
[09:23:32 INF] Executed endpoint 'Health checks'
[09:23:32 INF] HTTP GET /health/live responded 200 in 355.8161 ms
[09:23:32 INF] Request finished HTTP/2 GET https://localhost:5001/health/live - - - 200 - text/plain 417.8957ms

Calling the /health/ready endpoint (which is a more involved check) generated the following logs:

[09:23:45 INF] Request starting HTTP/2 GET https://localhost:5001/health/ready - -
[09:23:45 INF] Executing endpoint 'Health checks'
[09:23:45 INF] Start processing HTTP request GET https://login.microsoftonline.com/<tenantId>/v2.0/.well-known/openid-configuration
[09:23:45 INF] Sending HTTP request GET https://login.microsoftonline.com/<tenantId>/v2.0/.well-known/openid-configuration
[09:23:46 INF] Received HTTP response headers after 1102.9673ms - 200
[09:23:46 INF] End processing HTTP request after 1131.027ms - 200
[09:23:47 INF] Executed endpoint 'Health checks'
[09:23:47 INF] HTTP GET /health/ready responded 200 in 1286.3275 ms
[09:23:47 INF] Request finished HTTP/2 GET https://localhost:5001/health/ready - - - 200 - application/json 1297.1640ms

Keen eyes may notice that we're using Serilog which has been configured to use the Serilog.Sinks.ApplicationInsights package to send logs to Application Insights as TraceTelemetry which explains where all the AppTraces in Log Analytics came from.

That is a non-trivial amount of logging for something that is purely infrastructural. One could argue that generating all these logs for successful calls to health check endpoints is unnecessarily verbose. Of course, I don't expect everyone to share the same opinion. If these logs were collected & analysed for response time trends or something similar, they would be necessary. I only really care when health checks fail and when they do, I would like to be notified that something blew up. It's not like I'm going to send myself PagerDuty alerts every five minutes stating that nothing has blown up yet. So, we came to a collective agreement that we didn't want to store these successful health check logs until we needed them.

Another question we asked was:

"Why is the sheer volume of logs so high?"

The answer to that question has some nuance to it.

Azure Container Apps has an awesome feature called Health Probes. These probes are configured on a container level. Container Apps can also scale out your app to multiple containers based on a Scale Rule.

See where I'm going with this?

Our API was scaled to three containers, each with its own liveness and readiness probes, configured to call their respective endpoints every 10 seconds. That's 14 log entries x 3 containers x 6 = 252 log entries per minute being sent to Log Analytics - just for health checks.

That was not the only contributing factor though. The API is protected by an Azure Front Door resource which has its own Health Probes feature to check if the origin is healthy. In our case, this was configured to probe the backend API's /health/live endpoint every 30 seconds on top of the built-in probes in Container Apps.

And that is why the volume of logs was so high.

With more wisdom gained, we set out on a path of optimisation. The first port of call was to implement the guidance from Andrew Lock's excellent blog post on Excluding health check endpoints from Serilog request logging. This involved setting the log level for log entries generated by the health check endpoints to Verbose by using a custom GetLevel function for the Serilog UseSerilogRequestLogging middleware as shown in Andrew's example repo.

The pertinent code is shown here for reference:

public static Func<HttpContext, double, Exception, LogEventLevel> GetLevel(LogEventLevel traceLevel, params string[] traceEndpointNames)
{
    if (traceEndpointNames is null || traceEndpointNames.Length == 0)
    {
        throw new ArgumentNullException(nameof(traceEndpointNames));
    }

    return (ctx, _, ex) => 
        IsError(ctx, ex) 
        ? LogEventLevel.Error
        : IsTraceEndpoint(ctx, traceEndpointNames)
            ? traceLevel
            : LogEventLevel.Information;
}

static bool IsError(HttpContext ctx, Exception ex) 
    => ex != null || ctx.Response.StatusCode > 499;

static bool IsTraceEndpoint(HttpContext ctx, string[] traceEndpoints)
{
    var endpoint = ctx.GetEndpoint();
    if (endpoint is object)
    {
        for (var i = 0; i < traceEndpoints.Length; i++)
        {
            if (string.Equals(traceEndpoints[i], endpoint.DisplayName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
    }
    return false;
}

This GetLevel method is wired up in Startup.cs/Program.cs as follows:

app.UseSerilogRequestLogging(opts => {
    opts.EnrichDiagnosticContext = LogHelper.EnrichFromRequest;
    opts.GetLevel = LogHelper.GetLevel(LogEventLevel.Verbose, "Health checks");
});

Implementing the above resulted in the logs generated from health check endpoints to be reduced to:

That's right. Zero. Zip. Nada.

It's worthwhile to note that these logs are only filtered out when calls to the health check endpoints are successful. If an error happens to occur, the log level will be set to Error and those logs will still be emitted.

But we can do more!

The amount of AppTraces logs decreased, though there were some more low-hanging fruits in the form of AppRequests and AppDependencies. These log categories are sent by the Application Insights SDK for ASP.NET Core whenever requests reached our API & when a dependency was called, respectively. Therefore, we were generating AppRequests telemetry for each call to the /health/live and /health/ready endpoints, as well as AppDependencies telemetry when the /health/ready health check pinged critical dependencies such as SQL Server, Microsoft Entra ID, Key Vault, etc.

These sets of telemetry also constituted a fair amount of data in Log Analytics:

For these, we could not reach for Serilog as this was outside of its realm. Fortunately, the Application Insights SDK we used in the API project allows for filtering and preprocessing telemetry using ITelemetryProcessor.

With this revelation in mind, we settled on the following:

public class HealthCheckRequestFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;
    private readonly IHttpContextAccessor _httpContextAccessor;
    public HealthCheckRequestFilter(ITelemetryProcessor next, IHttpContextAccessor httpContextAccessor)
    {
        _next = next ?? throw new ArgumentNullException(nameof(next));
        _httpContextAccessor = httpContextAccessor ?? throw new ArgumentNullException(nameof(httpContextAccessor));
    }
    public void Process(ITelemetry item)
    {
        var excludedHealthCheckPaths = new[]
        {
            "/health/live",
            "/health/ready"
        };
        if (excludedHealthCheckPaths.Any(path =>
                _httpContextAccessor.HttpContext?.Request.Path.Value?.Equals(path,
                    StringComparison.OrdinalIgnoreCase) ?? false)) return;
        _next. Process(item);
    }
}

The HealthCheckRequestFilter uses IHttpContextAccessor to determine if the incoming request path matches one of our predefined health check endpoints, and if it does, we tell the Application Insights SDK to skip processing the telemetry.

Then the filter is wired up in Startup.cs/Program.cs:

services.AddApplicationInsightsTelemetryProcessor<HealthCheckRequestFilter>();

After deploying this change, graphs trended downwards:

We managed to reduce our daily log ingestion from ยฑ1.1GB to under 50MB. This resulted in cost savings by cutting our Log Analytics cost from a touch over $3/day to 25c/day.

Seinfeld gif. Jerry sits with his feet on a table and a cigar in his mouth, smiling and nodding as Julia Louis-Dreyfus counts out cash in front of him, looking annoyed.

Moral of the story: checking if your app is healthy can come at a cost. Especially if those checks are performed often. But it doesn't have to be - some considered optimisation can bring down your cloud bill without sacrificing best practices.

Through this effort, we gained a better understanding of exactly how our application generates logs, the triggers for these logs, and the various mechanisms to fine-tune exactly which logs get stored.

I hope this post leads to some cash savings on a couple of invoices. ๐Ÿ™‚

ย