By Merlin Carter
This week, I talked to our CTO Stephan Schulze about a recent article titled The Cost of Cloud, a Trillion Dollar Paradox. It was published on the Andreessen Horowitz (aka a16z) blog on May 27, 2021.
The article outlined how cloud costs can become a burden when you get to the scale of a company like Dropbox. But I wanted to know if there were any takeaways for startups that are small but growing fast. So I asked Stephan if this issue was something CTOs should consider in their long-term planning.
You can listen to our discussion or read the summarized version of it below.
Moving into the cloud, moving out of the cloud
Merlin: Last Sunday, Andreessen Horowitz published an interesting piece of analysis. It looked at how spiraling cloud costs affect the market cap of publicly listed tech companies.
They used Dropbox as an example. Dropbox saved over $75M in the two years before it went public, mostly by “repatriating” workloads from the public cloud (i.e., services such as AWS, Azure, or Google Cloud) to private solutions and shared data centers.
Their key finding was a certain kind of paradox: the cost of cloud “takes over” at some point, locking up hundreds of billions of market cap, meaning: “You’re crazy if you don’t start in the cloud; you’re crazy if you stay on it.”
Their recommendation was that startups have a “repatriation strategy” for when the cloud gets too expensive and that infrastructure spending should be a first-class metric for developers.
What’s your take on this? Do you think early-stage startups need to worry about this, or is it something for larger companies with private equity investments?
Stephan: For me, it’s not a matter of whether you’re a small startup backed by a VC or a larger company backed by private equity. It’s a matter of what you’re doing in the cloud and what benefits the cloud offers you. For example, you might use managed services rather than having your own person who manages your database server, your Docker setup, and so on.
It also depends on the business you’re running: If you’re a startup (even an early stage one) that needs to scale quickly, you should definitely start in the cloud — especially when it’s hard to predict your growth. The costs and time to set this up by yourself will be way too high compared to a cloud solution. Of course, you can also run into trouble in the cloud, especially if you don’t monitor costs closely.
Don’t forget you can also run a hybrid model and use your own data center for operations that would be expensive in the cloud. For example, you can use your own data center to train ML models with powerful GPUs and then use the cloud to run the parts that need to scale easily, like the web server and the web application itself.
Incentivizing developers to save cloud costs
Merlin: OK, I’m also interested in exactly who takes responsibility for managing infrastructure costs. In the article, they mention a prominent industry CTO who used short-term incentives similar to Sales Performance Incentive Funds (SPIFFs), so that any engineer who saved a certain amount of cloud spend received a spot bonus. Do you think this is a good idea? Do you know of any startups that do this already?
Stephan: I think it’s a good idea to have engineers look at these numbers and sensitize them to infrastructure costs.
And for the short-term incentives: It always depends on what you want to achieve.
So sure, if you want to save costs, that could be an idea worth trying. But beware of people trying to game the system. There should definitely also be a strong countermeasure in place that prevents people from running on the smallest instance available with low performance just to save money and get the spot bonus.
But I’ve also seen that gamifying things like this can help a lot. And it definitely helps to make the real costs completely transparent.
Tales from the field: wins and fails when trying to manage cloud costs
Merlin: Yes, I’m not sure I would worry so much about cheating. I know that engineers are already very conscious of those standard performance metrics, like speed and reliability. But you certainly need to have those metrics shown alongside the cost metrics.
But let’s look at some practical examples of great cost savings or costs going out of control. What about our own projects? Do you have any tales from the field where engineers were able to save costs by making great infrastructure optimizations? And on the flip side, do you have any horror stories where people received giant surprise cloud bills because they weren’t paying attention?
Stephan: I have very good examples for both scenarios. They’re both from larger eCommerce projects.
In the first case, the application was running on AWS, and we were able to reduce costs significantly by using an external service. This service replaces the infrastructure used for the Kubernetes cluster with spot instances. Spot instances are spare capacity that AWS can reclaim at short notice (so there is no uptime guarantee), which is why they are significantly cheaper. The service we used had a prediction mechanism in place and replaced the cluster nodes with these spot instances, which resulted in a 60% cost saving.
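To make the arithmetic behind a saving like this concrete, here is a minimal sketch of how a blended cluster cost could be estimated. All figures (node count, hourly price, spot discount, share of nodes on spot) are hypothetical and chosen only for illustration; they are not numbers from the project Stephan describes.

```python
def blended_hourly_cost(nodes, on_demand_price, spot_discount, spot_fraction):
    """Hourly cluster cost when a share of the nodes runs on spot capacity.

    nodes           -- total number of cluster nodes
    on_demand_price -- hourly on-demand price per node (hypothetical)
    spot_discount   -- spot discount relative to on-demand, e.g. 0.70 for 70%
    spot_fraction   -- share of nodes moved to spot instances, 0.0 to 1.0
    """
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


def saving_ratio(nodes, on_demand_price, spot_discount, spot_fraction):
    """Fraction of the all-on-demand bill saved by the blended setup."""
    baseline = nodes * on_demand_price
    blended = blended_hourly_cost(nodes, on_demand_price, spot_discount, spot_fraction)
    return 1 - blended / baseline


# Hypothetical cluster: 20 nodes at $0.20/h, 70% spot discount, 85% of nodes on spot.
print(f"{saving_ratio(20, 0.20, 0.70, 0.85):.1%}")
```

With these assumed inputs, 17 of the 20 nodes run on spot and the overall saving comes out at just under 60%. A few on-demand nodes are kept in the example so the cluster retains some capacity when spot nodes are reclaimed.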
In the second case, the company used AWS CloudFront as a CDN. What happened there was that a cache invalidation script went crazy and invalidated a lot of entries. Because cache invalidation is pretty expensive on CloudFront, the bill that the company received was about €50,000 or more.
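One way to limit the damage from a runaway script like this is to put a simple budget guard in front of every invalidation request. The sketch below is an illustration, not what the company in the story used. The price constants are assumptions based on CloudFront's published per-path pricing and should be checked against the current price list; the function and exception names are hypothetical.

```python
FREE_PATHS_PER_MONTH = 1_000   # assumption: free invalidation paths per month
PRICE_PER_PATH_USD = 0.005     # assumption: price per additional path


class InvalidationBudgetExceeded(RuntimeError):
    """Raised when a request would push the monthly invalidation cost over budget."""


def invalidation_cost(paths_this_month):
    """Estimated monthly invalidation cost in USD for a given number of paths."""
    billable = max(0, paths_this_month - FREE_PATHS_PER_MONTH)
    return billable * PRICE_PER_PATH_USD


def guard_invalidation(paths_so_far, new_paths, max_monthly_usd):
    """Return the projected monthly cost, or raise if it exceeds the budget.

    paths_so_far    -- paths already invalidated this month
    new_paths       -- list of paths the script wants to invalidate now
    max_monthly_usd -- hypothetical budget for invalidations
    """
    projected = invalidation_cost(paths_so_far + len(new_paths))
    if projected > max_monthly_usd:
        raise InvalidationBudgetExceeded(
            f"projected cost ${projected:,.2f} exceeds budget ${max_monthly_usd:,.2f}"
        )
    return projected


# A small request stays within the free allowance.
print(guard_invalidation(0, ["/index.html"], max_monthly_usd=10.0))
```

Under these assumed prices, it would take roughly ten million invalidated paths to reach a bill of about $50,000, which gives a feel for how far a looping script can go before anyone notices. A guard like this only helps if the script calls it before sending the actual request.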
Merlin: Okay, thank you. So the first example is the happy path, and the second example is the sad path. But let’s go to the happy path again. First, I’m quite curious about the name of the service that you mentioned. So you mentioned you use the service to replace the Kubernetes clusters with spot instances. Did you mention what it was called?
Stephan: At the time, it was called spotinst.com. And I think they’re now called spot.io. So this is the service that we used in the past. But there are several others out there that do the same thing.
Merlin: And in the case of the sad path, the unfortunate incident…mistakes can always happen. [It’s OK] as long as you rectify them quickly, but do you think this could have been detected sooner? How do you detect something like this before the cloud costs grow to 50k? How long did it take for that to happen?
Stephan: Good question. I think in this particular case, it probably couldn’t have been detected in advance since it happened quite quickly.
But the majority of the SaaS or cloud services provide you with some kind of cost monitoring features, and they let you configure budget limits and alerts.
There are also APIs to retrieve and process all of this data with external tools.
Many of us don’t take enough notice of these features, but I would highly recommend that people take a closer look at these settings so that they don’t get a nasty surprise at the end of the month.
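As an illustration of what processing this data with an external tool could look like, here is a minimal sketch of a budget check. It assumes the daily cost figures have already been fetched from the provider's billing API and are passed in as a plain list; the function names, alert labels, and thresholds are hypothetical.

```python
from statistics import median


def projected_month_end(daily_costs, days_in_month):
    """Linear projection of the month-end bill from the days seen so far."""
    if not daily_costs:
        return 0.0
    return sum(daily_costs) / len(daily_costs) * days_in_month


def check_budget(daily_costs, days_in_month, monthly_budget, spike_factor=3.0):
    """Return a list of alert labels for the current month.

    daily_costs    -- cost per day so far this month, most recent day last
    days_in_month  -- number of days in the current month
    monthly_budget -- budget limit in the same currency as daily_costs
    spike_factor   -- how many times the usual daily cost counts as a spike
    """
    alerts = []
    if sum(daily_costs) > monthly_budget:
        alerts.append("budget_exceeded")
    elif projected_month_end(daily_costs, days_in_month) > monthly_budget:
        alerts.append("projected_overrun")

    # Compare the latest day with the median of the earlier days, so a single
    # runaway script shows up within a day rather than at the end of the month.
    if len(daily_costs) >= 4:
        baseline = median(daily_costs[:-1])
        if baseline > 0 and daily_costs[-1] > spike_factor * baseline:
            alerts.append("daily_spike")
    return alerts


# Three normal days followed by one very expensive day (hypothetical figures).
print(check_budget([100, 110, 105, 900], days_in_month=30, monthly_budget=5000))
```

A check like this would typically run once a day and send its alerts to a chat channel or on-call tool. It complements the provider's built-in budget alerts rather than replacing them, and like those alerts it would not have caught an incident that unfolds within hours.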
Merlin: Like many of our discussions, this is a huge theme and bigger than what we can fit in 15 minutes.
So what I would say to anyone who’s watching is: go and read the article (here’s the link again).
And also, if you’re watching and you have further questions that you’d like Stephan to answer, or any other subjects that you would like him to cover, feel free to post them in the comments. We’ll get to them as soon as we can.
But without further ado, thank you very much, Stephan, for your insights.
Stephan: Thanks, it was a pleasure to talk to you, Merlin.
Bonus read: If you’re more at the beginner end of the spectrum and want to try out different cloud providers, beware of the so-called “free tier”. At least on AWS, the term “free” has many exceptions. This article by The Register covers the unpleasant surprises that some free-tier users have reported.