- Published on
Databases and Servers - it's not glamorous but something has to store the data
- Authors
- Name
- Nick
- lead engineer @ vatsim
Introduction
VATSIM's database goes back to the dawn of time of for VATSIM. How it runs has changed over the years but in 2019 - 2020 it moved to a very different configuration to provide a stable platform for the future.
Backstory
In 2019 VATSIM had a choice to make. Central, the server running most of VATSIM started reaching end of life for both operating system and software. An inplace upgrade from Ubuntu 14.04 -> 18.04 wasn't an option as that can cause problems after upgrading and updated MySQL packages didn't exist for Ubuntu 14.04. We were a bit stuck but we did have an upgrade path.
Upgrading
Upgrading Ubuntu would have been a forklift upgrade. During the upgrade we would have to put some guard rails in place to keep the network running while moving data. We decided this would be best left as the last part of upgrades. Depending who you talk to people usually dislike running databases in containers. This is usually due to misconfigurations which will result in dataloss. We decided to move VATSIM's database to a container with the correct storage configuration to upgrade MySQL and then eventually to MariaDB. Dataloss comes from not configuration storage correctly for a container. Containers are ephemeral, unless storage is configured properly as any data written to them is lost upon shutdown/restart. Once on MariaDB we kept with LTS (long term support) releases from them.
All applications, websites and services on Central had to move as well. Kubernetes took over running most applications. This provides us a pool of compute resources that provide redundancy and scaling abilities for the network. It takes a few minutes not hours to add more capacity. Most applications at the time were PHP (now we have a mix) making containerization easy for the most part.
Forklift Day
Eventually came the day to upgrade. In early 2020 we closed off the network for new signups for 18ish hours while leaving the flight part of the network running. This way existing users could still login and fly or control. While we could have done an upgrade process that didn't require a large block of time, it became the most reliable. During these 18 hours, VATSIM's tech team copied every database to their new home. This upgrade gave us the ability to compress stored database data. This resulted in huge disk space savings which is helpful when some tables in the database go back to 2001. During this upgrade we also pivoted applications to Kubernetes. After upgrades, Central was no longer used. Everything was now running on Kubernetes.
Database Topology
Currently we are using this same setup today however the underlying servers have been changed due to operating system upgrades.
Before
After
After upgrades, we were moved into a fully redundant database setup that we call Stripbin. ProxySQL a battle tested MySQL/MariaDB aware proxy was put in the middle of applications using Stripbin. This is very helpful as ProxySQL is aware of database cluster topology in clusters using Galera replication. Galera is a multi-master replication option for MariaDB. We however don't use all replicas to write data. Using ProxySQL we are able to send all writes to one replica and then reads to other replicas. If the replica that is handling writes goes offline, ProxySQL selects a different replica to write data seamlessly without human interaction. That said humans are still notified that a replica is down. Backups are now stored in an object storage platform and get generated from a replica that temporarily desyncs itself from the cluster to have a stable point in time to take backups. ProxySQL ensures queries are not sent to this replica while it is generating backups.
Obersvability and monitoring of databases has improved as well. Both MariaDB and Galera expose metrics that we collect using Prometheus and mysqld_exporter. This information is used to create dashboards and alerts.
Stripbin has protected the network from outages. We have lost replicas during events without users noticing impact. Recovery from losing a replica is able to be done without impact to users as well. This works so well that in June 2023 we completely replaced all database servers with new servers due to an operation system version upgrade. Before Stripbin this would have resulted in downtime. This was done without impacting the network and on the weekend. We usually don't do maintenance on weekends due to peak usage.
Takeaways
Databases can be a pain but a properly configured system can be reliable for years. Our plans for the future include moving to Postgres and a brand new schema but this will require a lot of changes. Some of these changes are already in the works. For a migration to Postgres the community will be well informed before it happens because as with 2019's upgrade it will require some downtime in the same fashion.