Changing How Users Connect Two Decades Later

Author: Nick, lead engineer @ vatsim

Introduction

Connecting to VATSIM with a simulator hasn't changed much since VATSIM was founded: you either connected directly to an IP address or picked from a static list of FSD servers. In Q1 of 2023, VATSIM set out to change how that connection is established.

Goals

We had a few goals for this project. All were achieved and subsequently verified during the most recent iteration of Cross the Pond.

  • Ensure users connect to the server closest to them. This provides the best experience in the majority of cases.
  • Balance load between servers in locations that have multiple servers.
  • Ensure users no longer see the "Too many clients connected" error.

Discovery

Initial discovery into making this change showed that we technically didn't need to make any modifications to clients we currently support. All of them would accept a hostname, perform a DNS lookup, and then connect to whatever that hostname resolves to.

We evaluated both Cloudflare GeoDNS and a theoretical VATSIM-written geographical DNS server. In both cases, we would have to implement a server that collects the state of the FSD servers and produces a configuration describing which servers should be considered when answering a DNS query. Cloudflare did work as expected and would have been a perfectly fine choice if we had kept to using only DNS (more on this later).

Costs were factored into our choice. Cloudflare would have had a cost per FSD server alongside its base cost. From a pure financial point of view, implementing it ourselves was cheaper.

Implementation

In the end, we decided to implement our own server, named dnshaiku. Leveraging Go's existing libraries, it didn't take long to have a working proof of concept. We were able to wrap our service discovery, placement logic, and metrics collection around the well-proven miekg/dns library.

During implementation, we realized that while DNS works well, geolocating a connection through it can sometimes be off by many miles. In some cases, we would only see the IP address of the DNS resolver someone was using, not their real IP. Some DNS providers were better than others, though this depends heavily on their network. For example, Google DNS and Cloudflare DNS worked the best, while some ISPs flat-out refused to connect to our DNS servers. EDNS Client Subnet (ECS) would have allowed for more accurate placement of a connection; however, several ISP resolvers don't currently support it.

After this real-world testing, client developers were looped in, and we switched to an HTTP-based endpoint provided by the same DNS servers. Because an HTTP request arrives from the client itself rather than from a resolver, the server sees the client's real IP. Now, vPilot, xPilot, and Swift all primarily use HTTP and fall back to DNS otherwise.
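
The "HTTP first, DNS as a fallback" behaviour can be sketched in Go as below. This is only an illustration, not how vPilot, xPilot, or Swift actually implement it: the URL, the hostname, the plain-text response format, and the names lookupFunc, httpLookup, dnsLookup, and resolveServer are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
	"time"
)

// lookupFunc returns the address of a server to connect to.
type lookupFunc func() (string, error)

var errNoLookups = errors.New("no lookups configured")

// httpLookup asks an HTTP endpoint for a server address.
// Assumption: the endpoint answers with a bare IP address as plain text.
func httpLookup(url string, timeout time.Duration) lookupFunc {
	client := &http.Client{Timeout: timeout}
	return func() (string, error) {
		resp, err := client.Get(url)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		body, err := io.ReadAll(io.LimitReader(resp.Body, 256))
		if err != nil {
			return "", err
		}
		addr := strings.TrimSpace(string(body))
		if net.ParseIP(addr) == nil {
			return "", fmt.Errorf("response %q is not an IP address", addr)
		}
		return addr, nil
	}
}

// dnsLookup resolves a hostname and returns the first address it gets back.
func dnsLookup(host string, timeout time.Duration) lookupFunc {
	return func() (string, error) {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		if err != nil {
			return "", err
		}
		if len(addrs) == 0 {
			return "", errors.New("no addresses returned")
		}
		return addrs[0], nil
	}
}

// resolveServer tries each lookup in order and returns the first success.
// If every lookup fails, it returns the error from the last one.
func resolveServer(lookups ...lookupFunc) (string, error) {
	lastErr := errNoLookups
	for _, lookup := range lookups {
		addr, err := lookup()
		if err == nil {
			return addr, nil
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	// Placeholder URL and hostname; substitute real ones to try this out.
	addr, err := resolveServer(
		httpLookup("https://fsd.example.invalid/server", 3*time.Second),
		dnsLookup("fsd.example.invalid", 3*time.Second),
	)
	if err != nil {
		fmt.Println("could not find a server:", err)
		return
	}
	fmt.Println("connect to", addr)
}
```

Keeping each lookup behind the same function type means the fallback order is just the argument order, so a client could add or drop a method without touching the loop.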

All of our FSD servers implement Prometheus metrics for monitoring and time series dashboards. Leveraging these existing endpoints, dnshaiku consumes Prometheus-formatted metrics from FSD to build a map of FSD's state. From this data, every prospective client connection is compared against a set of rules, then dnshaiku returns the IP of the best FSD server for the user.
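
To give a feel for what consuming Prometheus-formatted metrics involves, here is a minimal Go sketch that turns a scrape into a small state struct. The metric names fsd_connections_current and fsd_accepting_connections are hypothetical, as are serverState, parseMetrics, and stateFromMetrics; the real FSD metrics and dnshaiku's internal types may look quite different. The parser only handles samples without labels or timestamps.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// serverState is the subset of a server's state this sketch tracks.
type serverState struct {
	Connections int
	Accepting   bool
}

// parseMetrics reads Prometheus text-format samples and returns name -> value.
// Comment lines (# HELP, # TYPE), labelled samples, and malformed lines are skipped.
func parseMetrics(text string) map[string]float64 {
	samples := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || strings.Contains(fields[0], "{") {
			continue
		}
		value, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		samples[fields[0]] = value
	}
	return samples
}

// stateFromMetrics maps the (hypothetical) metric names onto serverState.
func stateFromMetrics(samples map[string]float64) serverState {
	return serverState{
		Connections: int(samples["fsd_connections_current"]),
		Accepting:   samples["fsd_accepting_connections"] == 1,
	}
}

func main() {
	scrape := `# HELP fsd_connections_current Current client connections.
# TYPE fsd_connections_current gauge
fsd_connections_current 412
# TYPE fsd_accepting_connections gauge
fsd_accepting_connections 1
`
	state := stateFromMetrics(parseMetrics(scrape))
	fmt.Printf("%+v\n", state)
}
```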

Rules
  • Remove all inactive servers from consideration
  • Find the closest server to a user that's accepting connections
  • If the closest server is in a location with more than one server, select the one with the fewest connections
  • If unable to determine, return the default server
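
The rules above can be sketched in Go roughly as follows. This is a simplified illustration under stated assumptions, not dnshaiku's actual placement code: the server struct, the pickServer and haversineKm names, the use of great-circle distance as the measure of "closest", and the server names in the example are all invented for the sketch.

```go
package main

import (
	"fmt"
	"math"
)

// server is a hypothetical view of one FSD server.
type server struct {
	Name        string
	Location    string  // servers in the same location share this value
	Lat, Lon    float64 // degrees
	Active      bool
	Accepting   bool
	Connections int
}

// haversineKm returns the great-circle distance between two points in km.
func haversineKm(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusKm = 6371.0
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(lat2 - lat1)
	dLon := toRad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Min(1, math.Sqrt(a)))
}

// pickServer applies the rules in order. located is false when the user's
// position could not be determined, in which case the fallback is returned.
func pickServer(servers []server, lat, lon float64, located bool, fallback server) server {
	if !located {
		return fallback
	}
	// Rules 1 and 2: skip inactive servers, find the closest one accepting connections.
	var best *server
	bestDist := math.Inf(1)
	for i := range servers {
		s := &servers[i]
		if !s.Active || !s.Accepting {
			continue
		}
		if d := haversineKm(lat, lon, s.Lat, s.Lon); d < bestDist {
			best, bestDist = s, d
		}
	}
	if best == nil {
		return fallback // Rule 4: nothing usable, return the default server.
	}
	// Rule 3: within that location, prefer the server with the fewest connections.
	choice := best
	for i := range servers {
		s := &servers[i]
		if s.Active && s.Accepting && s.Location == best.Location && s.Connections < choice.Connections {
			choice = s
		}
	}
	return *choice
}

func main() {
	servers := []server{
		{Name: "ams-1", Location: "AMS", Lat: 52.37, Lon: 4.90, Active: true, Accepting: true, Connections: 900},
		{Name: "ams-2", Location: "AMS", Lat: 52.37, Lon: 4.90, Active: true, Accepting: true, Connections: 300},
		{Name: "nyc-1", Location: "NYC", Lat: 40.71, Lon: -74.01, Active: true, Accepting: true, Connections: 100},
	}
	fallback := server{Name: "default"}
	// A user near London should land on the less loaded Amsterdam server.
	fmt.Println(pickServer(servers, 51.5, -0.1, true, fallback).Name)
}
```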

dnshaiku also exposes its own Prometheus metrics, which are collected and stored for monitoring and time series dashboards. While these metrics are similar to some that come from FSD, it is critical to know what dnshaiku is thinking.

High-Level Architecture

[Figure: high-level system architecture]

Cross the Pond 2023 Westbound

During CTP 2023 Westbound, dnshaiku ran into zero problems. Gone were the reports of servers being full, as the servers were balanced based on connection location and network capacity.

Below is a graph from our monitoring system taken during CTP 2023 Westbound. Throughout the event, we always had network capacity for connections, and any server that hit its high-water line for connections automatically stopped accepting new ones.

[Figure: CTP 2023 Westbound monitoring graph (ctp2023wb)]

FSD Protocol Improvement

As dnshaiku was written, FSD received an update as well. FSD now supports evacuating servers, forcing clients to reconnect. This complements dnshaiku well, as we are now more easily able to control which servers are taking connections.

Combining these two abilities means we can do network maintenance faster and at times that work better for our tech team.

Takeaways

Change is good, at least in this case. While some community members might prefer to select their own server, the move to this automatic system allows us to provide a better experience both for our users and our network maintenance.

https://github.com/vatsimnetwork/vatdns - it's open source!