Autoscaling Is Hard

Mar 11, 2019 06:44 · 636 words · 3 minute read Kubernetes Tools Webscale

I was discussing autoscaling with a friend when it magically occurred to me that it’s a pretty hard problem unless you know all of the stack inside and out and have incredibly awesome and precise introspection capabilities. I’m pretty sure that even then, autoscaling is tantamount to bailing water out of a boat with a bucket. WHAT HAPPENS NEXT WILL SURPRISE YOU. (Your mileage may vary, terms and conditions apply, “autoscaling” here means out-of-the-box solutions to make your things scale, as provided for example in Kubernetes, these opinions are mine and mine (probably) alone, definitely not my employer’s, I have no easy solution to sell you and I most definitely don’t make money with this blog. See the utter lack of ads for details.)

Once upon a time you had a computer…

…and it was SOOOOO SLOOOOOOW. Without looking at it, no cheating, which part do you need to improve? More RAM? A spinny disk that spinnies faster? A more powerful central processing unit? Or maybe you need to tell your little brother to chill out with the illegal peer-to-peer file transfer, because it’s saturating your router?

In my mind (and in my car, we can’t rewind we’ve gone too far), autoscaling is the exact same deal. “My memory usage/CPU usage/response time is spiking, let’s spin up more instances”, except it doesn’t help, because the reason for the spike is a few levels upstream: a misbehaving service completely unrelated to you is making the database hang because it really needs to do a join on two fields that aren’t indexed, because reasons. Or that other service has an MLG[1] library that transcodes from XML to JSON to YAML, and it’s been handcrafted by a grandmaster hacker dude-man with such passion and fury that any attempt to make it simpler and faster somehow breaks it. If you have a decently-sized infra that’s broken out into a few services, you know what I’m talking about: it’s the dark beast hidden in the legacy unvisited crevices of your code, and only you (and nobody else, apparently) know how it works. At the end of the day, you can scale upstream all you want, but if you can’t pinpoint why that specific operation is slow, “ya ded”.
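To put some (entirely made-up) numbers on that, here is a toy model in Python. It assumes a request spends part of its time queueing for a free worker on your service, which more instances do help with, and the rest waiting on something upstream, which they don't. The function name, the parameters and the 200/800 ms split are all illustrative assumptions, not measurements from any real system.

```python
def response_time_ms(instances, local_queue_ms=200.0, upstream_wait_ms=800.0):
    """Toy latency model (all numbers are made up for illustration).

    local_queue_ms:   time a request spends queueing on a single instance;
                      assumed to shrink proportionally as instances are added.
    upstream_wait_ms: time spent waiting on a slow upstream dependency;
                      assumed to be unaffected by how many instances you run.
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    return local_queue_ms / instances + upstream_wait_ms


# Going from 1 to 8 instances barely moves the needle,
# because the upstream wait dominates.
for n in (1, 2, 4, 8):
    print(f"{n} instances -> {response_time_ms(n):.0f} ms")
```

Under those assumptions, eight times the instances buys you 1000 ms down to 825 ms. The 800 ms floor doesn't care how many boxes you throw at it.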

Okay cool let’s scale the database then

Hahahahaha no. So if your database is non-trivial in size, and also you’re not using the shiny “enterprise” tools, spinning up a new database is an error-prone (which you definitely don’t want) and time-consuming (which defeats the purpose) process. By the time you’re done spinning up a new database, your traffic spike will be long gone, and you’ll spin it back down again.

Spinning a database up “defensively” (as in, when there is a spike) is about as useful as going to buy protective headgear after being pelted in the head by a skyborne tortoise.

Know thyself

As I said in the introduction blurb in small(er) (e-)print, I have nothing to sell you and I don’t think there’s an “easy way out”. I’m not arguing that “spinning more things” is not useful. I’m arguing that it’s inefficient. I’m arguing that most of the time, it won’t help. It’s the infra equivalent of solving the performance problem by using a cache; you’re just buying extra time. Might be enough. Might not be enough. That’s for you to know. Not my circus, not my monkeys, so to speak.

On the other hand, you could learn about your system, figure out where the real bottleneck is, and fix it there. And/or figure out the exact specific cases in which each of your systems should scale up. I don’t know how much money it’ll save you, but it might save you sleep. Those of you with kids (plural) very well know you can’t buy those back.
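For what “the exact specific cases” could look like, here is a minimal sketch, assuming you can measure how much of a request’s time is spent waiting on upstream dependencies. The `Sample` type, the `should_scale_out` function and both thresholds are hypothetical and not taken from Kubernetes or any other autoscaler; the point is only that the decision looks at where the time goes, not just at a CPU graph.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of a service (hypothetical shape, for illustration)."""
    cpu_utilization: float  # 0.0 to 1.0, averaged over this service's instances
    total_ms: float         # end-to-end time to serve a request
    upstream_ms: float      # part of total_ms spent waiting on dependencies


def should_scale_out(sample, cpu_threshold=0.8, max_upstream_share=0.5):
    """Scale out only when the service itself looks like the bottleneck.

    Both thresholds are arbitrary assumptions; tune them against your own system.
    """
    if sample.total_ms <= 0:
        return False
    upstream_share = sample.upstream_ms / sample.total_ms
    busy = sample.cpu_utilization >= cpu_threshold
    mostly_local = upstream_share <= max_upstream_share
    return busy and mostly_local


# Busy CPU and most of the time spent locally: more instances may help.
print(should_scale_out(Sample(cpu_utilization=0.9, total_ms=1000, upstream_ms=100)))
# Busy CPU but 90% of the time is spent waiting upstream: go fix it there.
print(should_scale_out(Sample(cpu_utilization=0.9, total_ms=1000, upstream_ms=900)))
```

The first call prints `True` and the second prints `False`. Getting a trustworthy `upstream_ms` out of your system is, of course, the hard part.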

  1. Multi-Layered Garbage