What startups should know about observability


If your tech team is tiny, what do you prioritize? Here are some vital signs that you should always keep an eye on.

By Merlin Carter

Generally speaking, observability is about measuring the health of your application’s internal organs. But if your tech team is tiny, you can’t observe everything under the sun. Nevertheless, there are some vital signs that you should always keep an eye on. I talk to our CTO, Stephan, about how he approaches observability when advising startups.

If you prefer to read rather than listen, I’ve included a full transcript further down.


What’s the difference between observability and monitoring?

Merlin: Today, we want to talk about observability. It’s a relatively new term that people sometimes confuse with monitoring, so let’s first clarify the difference between the two.

Stephan, in your opinion, what’s the difference between monitoring and observability?


Monitoring is watching from the outside, observability is watching from the inside

Stephan: It’s not an easy question, but I think the key difference is the level of granularity.

When you’re monitoring a system, it doesn’t matter if it’s a black box.

You don’t know how it works internally, you’re just listening for signs that the system is “healthy”.

For example, your web application is displaying product details in an acceptable time frame.

On the other hand, with observability, you’re looking into the internals of the system and checking for more than just vital signs.

You’re looking for ways you can improve and optimize your system.

For example, maybe your web application is making a lot of database requests to display product details.

As a developer, you could optimize the code so that all the necessary information is returned in one request.

That way, you can reduce complexity and maybe free up some system resources.

You’re not going to get these kinds of insights with monitoring.

  • Monitoring is looking at your system from the outside, from the perspective of an end-user.
  • Whereas observability is looking at your system from the inside, from the perspective of an engineer.

I’m not sure if that matches what you’ve read elsewhere?

Merlin: Actually, it matches pretty well. The most concise definition I’ve seen is that monitoring tells you when something unexpected happened, and observability tells you why.

So, that lines up with your explanation, right?


Observability is essential for running an effective post-mortem

Stephan: Yes, that makes sense. I would say it like this: monitoring lets you know something is going wrong.

Observability allows you to travel back in time. But this only works if you have all of the necessary information and data.

Take the example that I mentioned previously — the one with the database queries.

If you have the right technology, it’s easier to trace technical problems back to their source. You can find any ugly unoptimized database queries that could bring your application down. You can look at all the events that preceded the incident and build a picture of what really happened.

For example, maybe there was a bot that suddenly hit your site with a lot of traffic. And that specific part of the site was running a lot of unoptimized queries.

So observability metrics can tell you that there was an unusual amount of stress on your system, but it wasn’t caused by any real end-users. It was all because of a bot.

Nevertheless, you found out that there’s a part of your site that you still need to optimize.

And all of this information and data is also crucial when you’ve had an incident, and you’re trying to run a post-mortem.

You need this data to understand what happened and arrive at the right conclusions to prevent the same incident in the future.


How do you figure out what to observe?

Merlin: Yes, it’s generally better to have too much data rather than too little data — but still, don’t you run the risk of being overwhelmed? There’s a universe of data points that you could collect. Where do you store it all? And at what level of detail?

How do you make these kinds of decisions?


It all depends on your application and business

Stephan: Well, the thing about storage is that it’s relatively cheap. So you can store all your event data in an S3 bucket and retain it for a few weeks in its original format.

And after that, it gets compressed. Even if you need to investigate something further back in time, you can rehydrate that compressed data from storage.
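To make this concrete, here is a minimal sketch of that retention pattern, assuming an AWS S3 bucket managed with boto3; the bucket name, prefix, and retention windows are placeholders:

```python
# A sketch of the retention pattern described above, using boto3 (the AWS SDK
# for Python). Bucket name, prefix, and retention windows are illustrative.
import boto3

s3 = boto3.client("s3")

# Keep raw event data in its original format for ~30 days, then move it to a
# cheaper archival storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-event-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw-events/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)

# If an investigation later needs older data, "rehydrate" an archived object
# back into S3 for a limited number of days.
s3.restore_object(
    Bucket="my-event-archive",
    Key="raw-events/2023/10/01/events.json.gz",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)
```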

The more difficult problem is deciding what data to collect and observe. Of course, you can get more insights if you have more data.

But you should consider this carefully because it can have a huge impact on costs. You could also choose to have different levels of detail for different systems.

On a production system, maybe you just log critical signals like application errors or warnings. And maybe you only enable debug messages for a specific part of your application — a part that you need to understand better in production.
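For instance, with Python’s standard logging module, that differentiated setup looks roughly like this (the logger names are placeholders):

```python
# "Different levels of detail for different systems": warnings and errors by
# default, debug output only for the one component under investigation.
# The logger names ("myapp.*") are placeholders.
import logging

logging.basicConfig(level=logging.WARNING)  # production default: critical signals only

# Enable debug messages only for the part of the application you need to
# understand better in production.
logging.getLogger("myapp.checkout").setLevel(logging.DEBUG)

logging.getLogger("myapp.search").info("suppressed in production")
logging.getLogger("myapp.checkout").debug("still recorded for this component")
```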

Personally, I prefer this more differentiated approach rather than logging a huge amount of information indiscriminately. But there are pros and cons on both sides.

What I also find interesting is tracking product-related metrics.

For example, you can use observability tools to look at the timing of different events and create a metric such as “successful checkouts” or “successful or failed logins”.

The idea behind this is that you can look at these metrics in a specific time frame and compare them to the same time frame last week.

That way, you can look for specific trends and see how your application is performing.
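As a rough sketch of the idea, assuming you already have a way to query your checkout events (the count_events helper below is hypothetical):

```python
# A sketch of a "successful checkouts" product metric compared against the
# same window one week earlier. count_events() is a hypothetical helper that
# queries wherever your events are stored.
from datetime import datetime, timedelta

def checkout_trend(count_events, now, window=timedelta(hours=1)):
    current = count_events("checkout.success", now - window, now)
    week_ago = now - timedelta(days=7)
    last_week = count_events("checkout.success", week_ago - window, week_ago)
    # Relative change; None when last week's window had no checkouts at all.
    change = (current - last_week) / last_week if last_week else None
    return current, last_week, change

# Usage (illustrative): flag a drop of more than 30% versus the same hour last week.
# current, last_week, change = checkout_trend(count_events, datetime.utcnow())
# if change is not None and change < -0.30:
#     notify("Successful checkouts dropped sharply week-over-week")
```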

Merlin: That’s interesting — what you’re talking about is a business metric. Checkouts have a direct impact on your bottom line. So you pay closer attention to metrics like that rather than database requests or I/O operations?

Stephan: Your focus should really depend on what you want to achieve. Certainly, developers should and will still be interested in the more technical metrics.

But these product metrics can also give you an indication that your application has a problem, even when it isn’t a strictly technical issue.

For example, let’s say you’re implementing a new feature and want to see any impact that it might have on the user experience.

You could track the number of successful logins versus the number of failed logins.

If you see a spike in the number of failed logins, you might have to roll back the feature because it’s really affecting your users.

This is a reminder that you’re running a business and not just a system.

An e-commerce business doesn’t care about I/O operations — it cares about sales and purchases.

So, I would say that developers need to monitor both the technical data points AND the business KPIs.

Merlin: OK, so there’s no universal list of data points that all developers should monitor. You really need to think about the metrics that are important for your specific application and business.

Stephan: Exactly, it all comes down to the problem that you’d like to solve.

Let’s look at a concrete example.

We’re supporting an e-commerce company that is migrating its application from bare metal servers to a Kubernetes setup in the cloud.

Because this migration affects production systems, there is a lot of business risk involved. In this case, we decided to track successful checkouts alongside a couple of other metrics.

Being able to compare the number of successful checkouts from the current day with the ones from last week or yesterday gave us a pretty good safety net that prevented us from running into any hidden problems.

And as we had only limited information available, that metric was easy to track because we just looked at the number of users who saw the “thank you” page at the end of the checkout process.

These kinds of tools also help you to filter out any misleading data, such as cases when users fail to check out because of an ad blocker or similar.


If your metrics change suddenly, how do you tell whether it was due to an external event versus an internal event?

Merlin: Actually, that’s a good point — when these metrics change dramatically, how do you differentiate between an application failure and something that’s beyond your control? For example, people tune in to a football game instead of shopping online.

Stephan: The main thing is that you never focus on a single KPI.

Ideally, you have a set of indicators — a mixture of technical KPIs and product-focused KPIs. You can also ping your product team and see if they have any explanations for a sudden change in the metrics.

Maybe they’re running some kind of internal tests. It could also be that your marketing team has started a huge advertising campaign, and they forgot to tell the product and tech teams.

It’s usually a combination of things that you have to unravel.

But if you’re tracking the right metrics, you can respond quickly to any issues with your application’s performance or questions from management about a dip in sales.


How do you manage alerts and notifications?

Merlin: Yes, but you can’t analyze all metrics around the clock — you also need to be proactively alerted when something is really off. The problem with alerts, however, is that they’re often false alarms. Or there are generally too many. So there’s a risk that people will ignore alerts or become complacent.

How do you decide when an alert is really necessary?


Stephan: It’s a good question. And I see this problem in some of our portfolio companies. For example, one company has alerts sent to a “monitoring and alerting” Slack channel.

And what happened? Every 15 minutes, an alert about a failing cron job was automatically posted in this channel. It was actually related to a minor infrastructure issue that no one bothered to fix.

And this is a problem, both from a cultural perspective and from an alerting perspective.

This kind of noise will lead people to mute notifications from the “monitoring and alerting” channel.

And it’s the wrong way to go. Alerts always have to be taken seriously. But thresholds and notifications must also be tuned so that alerts are rare enough for people to take them seriously.


Would machine learning and anomaly detection help to reduce the noise?

Merlin: Yeah, I wonder if you can automate the process of detecting genuinely unusual behavior. In machine learning, there’s anomaly detection, where you only pay attention to outliers in your data set over a certain time frame.

So, for example, if you get the same type of warning every day, the system will suppress the alerts, and it will only alert you when there’s a new type of warning that it hasn’t seen for a while.


You can easily get overwhelmed by new tools if you don’t devote time to optimizing them

Stephan: Sure, you could do that — I think there are already some advanced tools that do what you’re talking about.

However, I would be concerned about missing the signs that a service is gradually degrading.

For example, suppose that you have an API where the response time is gradually getting slower. Perhaps only by a millisecond each day.

But after 100 days, it’s 100 milliseconds slower… which isn’t great. An anomaly detection algorithm probably wouldn’t catch that.

So you can’t completely rely on technology to do all the observing for you.
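To illustrate Stephan’s point, here’s a toy example: a latency that drifts by one millisecond per day never stands out against the recent past, but a simple long-term comparison still catches it.

```python
# Toy numbers: a p95 latency that starts at 100 ms and degrades by 1 ms per day.
import statistics

latencies = [100 + day for day in range(100)]  # day 0 .. day 99

# "Anomaly" check: is today unusual compared with the previous seven days?
recent, today = latencies[-8:-1], latencies[-1]
zscore = (today - statistics.mean(recent)) / (statistics.stdev(recent) or 1)
print(f"z-score vs. last week: {zscore:.2f}")   # ~1.9, below a typical 3-sigma alert

# Long-term comparison: how much slower than 90 days ago?
print(f"drift over 90 days: {latencies[-1] - latencies[-91]} ms")  # 90 ms
```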

The other thing is that the observability tech landscape is growing fast, and there are so many tools that give you insights that weren’t previously possible.

But all of these new metrics can be overwhelming and distract you from your main focus. I recently had this discussion with some colleagues from one of our portfolio companies.

They were introducing one of these sophisticated application performance monitoring tools, and we were discussing whether it was worth it.

I said, “Hey, look, guys, I know you’re already quite busy. Does it really help you to have this huge new chunk of information on your plate?

Don’t you have enough data already? Do you think you can even spend enough time to understand how this tool works and how it can create value?”

And this is really a critical consideration. You shouldn’t underestimate the time it takes to set up these tools properly; otherwise, you’re going to be overloaded with information and potentially get a bunch of false positives which everyone ignores.

So these tools create a lot of operating expenses with licensing fees and setup time. After all that, if no one uses these tools, you’ve basically thrown a whole bunch of money out the window.

Money that you could have spent on something else.


Don’t you need a dedicated person to manage all this information?

Merlin: Yes, very true. It’s almost like you need someone dedicated to managing the tool. I was thinking about the telemetry data that all these very complicated distributed systems produce. And with regards to observability, you have to be able to see the patterns and spot correlations in your data.

It’s almost like you need a data scientist to analyze the metrics coming from these distributed systems… or do you think that’s going a bit overboard?


Good observability tools automate this stuff for you

Stephan: Well, these tools do it for you. That’s their main USP… their value proposition. If they’re set up properly, they let you trace the journey of a request through your distributed system.

So you can trace a request from the initial customer interaction in the frontend to writing and reading data in the backend and all the way back again.

Given the enormous volume of requests each hour, it would be very time-consuming to correlate all that data manually.

These tools allow you to correlate all of that diagnostic data at scale — that’s why they’re so valuable.
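For context, this is roughly what manual instrumentation looks like with the OpenTelemetry Python API; APM tools typically generate these spans for you automatically, and the service and attribute names here are placeholders:

```python
# Manual tracing with the OpenTelemetry Python API (APM tools usually
# auto-instrument this). Service, span, and attribute names are placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_checkout(cart_id):
    # Every operation started inside this span belongs to the same trace, so a
    # tool can stitch frontend, backend, and database calls into one picture.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("cart.id", cart_id)
        with tracer.start_as_current_span("load_cart_from_db"):
            ...  # database read
        with tracer.start_as_current_span("charge_payment_provider"):
            ...  # external API call
```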

Merlin: OK, now I’m curious — can you name a few of these tools?

Stephan: Yes, sure — so there’s New Relic, there’s Dynatrace, Sentry, and there’s DataDog, which a few of our portfolio companies use.

There are also tools that target specific languages. For example, there’s Tideways for PHP-based applications.

I need to look up the equivalent tools for other languages…

But anyway, those first few are some of the most well-known ones.


What observability tools should small startups try first?

Merlin: OK, thanks for that. But as you say, some of these tools can be quite expensive to implement.

What would be your recommendation for a very early-stage startup with a team of… say, 10, 15, or 20 people?

Are there some cheaper, simpler tools that they could start off with?


Start with something like Sentry and move to a more sophisticated tool later

Stephan: I would start by asking: “Do you already have enough insights into your application? And how critical is the application?”

One of the tools I could suggest from the outset is Sentry.

It doesn’t cover all the areas that the other tools cover, but it can still provide you with some very interesting insights.

And it’s also quite cheap… and if you really want to, you can host it yourself.

But once you get bigger, you’ll probably need to move to one of those three tools that I mentioned — New Relic, DataDog, or Dynatrace.

These are enterprise software solutions, and they have enterprise price tags, but they do offer you a much broader spectrum of functionality.

They have everything from infrastructure and network monitoring to frontend monitoring, automated tests, and application monitoring — basically, everything you need.


What role does observability play when you’re highly dependent on other people’s APIs?

Merlin: Yes, I can imagine the price tag is worth it when you’re running a highly distributed system that depends on a lot of external services.

A typical example would be a trading application that depends on a bunch of data feeds from other financial institutions — what role does observability play in protecting your application if one of these feeds goes down?


You treat external APIs as if they were your own

Stephan: It’s an interesting point because most tech companies nowadays are highly dependent on other parties — we’re all in one big mesh.

If you’re relying on someone else’s API, and that service goes down, you have a big problem.

Let me share two complementary approaches to address this:

  1. One is defensive programming, where you expect the other side to fail at some point and build your system around that certainty.
  2. On the other hand, you could also monitor that external API like it was a part of your system so that you instantly see when it’s failing.

Perhaps you even find out that their API is failing before they do. However, that would mean that they have a very poor setup on their side.

But, anyway, with a tool like DataDog or any of the others that I mentioned, you can create a synthetic test that calls any HTTP address, which could also be an API endpoint.

Then you can create some assertions where you say something like, “I always expect to get a 200 response code, and I always expect to get a response in 100 milliseconds or less.”

And if that’s not the case, then try again or send me a notification.

And as it is super easy to put this on a dashboard, you can keep a close eye on any external APIs that you’re highly dependent on.
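Outside of a managed tool, the same kind of check can be sketched in a few lines of Python; the endpoint, latency budget, and notify helper below are placeholders, not DataDog’s actual configuration:

```python
# A do-it-yourself version of the synthetic test described above: expect an
# HTTP 200 within 100 ms, retry once, otherwise notify. The URL, latency
# budget, and notify() stand-in are placeholders.
import requests

ENDPOINT = "https://api.example-partner.com/v1/prices"
MAX_LATENCY_SECONDS = 0.100

def notify(message):
    print(message)  # stand-in for Slack, PagerDuty, etc.

def check_endpoint(retries=1):
    for _ in range(retries + 1):
        try:
            response = requests.get(ENDPOINT, timeout=2)
            if response.status_code == 200 and response.elapsed.total_seconds() <= MAX_LATENCY_SECONDS:
                return True
        except requests.RequestException:
            pass  # treat network errors like a failed assertion and retry
    notify(f"Synthetic check failed for {ENDPOINT}")
    return False
```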

But generally, you should use a combination of both approaches: defensive programming AND proactive monitoring.


When you see a metric changing, at what point do you decide that further investigation is necessary?

Merlin: Yes, I’ve seen the kinds of reports you can get out of DataDog for production systems, and I’ve seen the alerts that these tests generate.

One thing I noticed is that some alerts only trigger from time to time. So maybe a response took too long once every four hours. An alert like that, by itself, is nothing you’d act on.

So what do you track exactly? The percentage of alerts out of the total volume of requests for each day?


A DevOps culture is crucial for agreeing on thresholds and acting on them

Stephan: Yes, this metric is similar to the SLAs that many of the big tech companies publicize.

They promise an uptime of 99.9%, which is not exactly the same metric that you would apply to a single API endpoint.

But generally, you want to keep the percentage of hourly timeouts below a certain level.

So, for example, you might say that 99% of responses being returned on time is an acceptable hourly target.
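As a back-of-the-envelope illustration of that hourly target (the latency values are made up):

```python
# Given the response times recorded in one hour, what share came back within
# the 100 ms budget? Values below are made up.
def on_time_ratio(latencies_ms, budget_ms=100):
    if not latencies_ms:
        return 1.0
    return sum(1 for latency in latencies_ms if latency <= budget_ms) / len(latencies_ms)

hourly_latencies = [42, 87, 95, 310, 60, 73]
if on_time_ratio(hourly_latencies) < 0.99:
    print("Below the 99% hourly target: investigate this endpoint")
```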

And if the endpoint consistently falls below that threshold, you need to investigate it. However, one problem is that it’s often unclear whose responsibility it is to track this.

There are still a lot of companies where the engineering and operations teams are very siloed.

And where the engineers don’t have direct access to these metrics because the operations team owns the monitoring systems. And the operations teams don’t have any insight into what the application is doing.

But even when engineers DO have access to these metrics, they’re sometimes not interested in them because, as far as they’re concerned, their job is done.

So you can spend a lot of money on an enterprise observability tool, but it won’t help you much if the engineering and operations functions aren’t integrated.

You need to foster a DevOps culture where developers also take responsibility for the operational aspects of running an application — and not just deploying it.

If you don’t enable developers to understand or act on the insights from these observability tools, they’re not going to take observability seriously.


Merlin: Thanks, I think that’s a great closing thought.

We’re running out of time, as usual, so let’s wrap it up there.

If anyone listening or reading has any questions for Stephan about his experiences with these tools, feel free to write them in the comments or on his LinkedIn feed.

Otherwise, thanks again, Stephan, for sharing your time and wisdom.

Stephan: Always a pleasure, any time. And I’m looking forward to seeing what questions people have.