Database Schema Validation

Mar 19, 2019 07:01 · 474 words · 3 minute read The Tool Data Databases Mysql

So a few years ago I have written a piece of code that reads a database schema and then writes out code to talk to that database. I thought useful back then to also include a function that would checksum the database’s tables and their columns (types, and all). The idea was that you could make sure you were talking to the correct version of the database. Turns out this is also useful to compare schemas between databases.

Why do that?

Sometimes, when you reach a certain scale, you realize that your single database that is replicated to a read-only replica just won’t do. You decide to shard it, for better or worse.

Now you have a shiny new problem: You need to apply schema migrations to two databases.

Even worse. Time passes, and this holds, but your little business finally takes off (yaaaay! Good for you!) and you now need about a bajillion database shards to sustain growth! That’s what you get when you cut corners and/or engineer stuff poorly, I hope it doesn’t eat up too much of your revenue, you capitalist swine! <insert trollface here>

Now you have a shiny new problem: How do you make sure that all the databases have an identical schema? I mean, you could keep track of all your database migrations across the board, and everybody on your rapidly-expanding engineering team knows how that work, and it’s integrated correctly and reliably in your CI/CD process, yes? And it raises catastrophic alarms when you apply a migration and it doesn’t work, yes? SO, now that we’ve established that your process is sane and sound, how do you just go in and make sure that all the schemas are identical across your whole cluster?

The Tool

That’s what my tool is used for. It will output two hashes and the database name, all separated by tabs. First hash is tables and column definition, second hash is indexes. The reason the database name is output and there’s two hashes is that it gives you a way to do nice manipulations on the output.

  • You can awk '{print $1,$2}' | sort | uniq -c the output to know if your databases all have the same schema and index
  • You can grep the output for a specific hash to see which databases are abnormal, schema- or index-wise

You can run the tool after a migration to see whether or not it fixed the issue, too, although I’d have to recommend a thing like Skeema for things like that, as you can use it to diff a known-good and known-bad database to see exactly how they differ.

The tool is good for diagnosis. I like the tool. I’ll keep calling it “the tool” until I have a better name for it.