
Quote: Poached from Reddit
I wrote this comment in reply to a post asking 'They refuse to add new NA servers, why? Can anyone explain this?'. I felt as though I could explain this, at least in part, and so I did. And here it is.
I only had about ten minutes to spare to write this up before I went back to scaling our own systems, so I'd love to hear some other people who work in the industry chime in with their experiences, examples, and opinions. Maybe we can get rid of this 'MOAR SURVURS DUH' attitude that some people seem to have.
Edit: Oh, and this post is intentionally dumbed-down, not because I doubt my fellow redditors but because I had to rush through writing it. Obviously the issues are significantly more complex than the examples I lay out here.
As someone who works in infrastructure/servers/networking/IT/etc. for a company that does large-scale multiplayer games, I might actually be able to.
First, not everything scales linearly. Within a given 'world', in the servers that handle people online, you may have one 'server' that can handle 5000 people online at once, but adding a second server may not get you to 10000; two servers together may only be able to handle 9000 people; three servers might only get 12000, four might allow for 14000, and five might support 16000.
This can be due to the overhead of managing multiple characters and multiple interactions. If you have two people in one zone, and you're updating their positions once per second, you have to send 2 updates every second per person (updating each person with their own official location and everyone else's location), so 4 updates total. If you have 4 people (twice the number of people) you have to send 16 updates per second. If you have 8 people, you're sending 64 updates per second (each of the 8 people getting 8 updates). This is a really simplified example, but it shows how 2+2 can be a lot more than 4.
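If you want to see that growth spelled out, here's a quick sketch of the same simplified model in Python. It assumes exactly what the example assumes: one tick per second, and every player gets one position update for every player in the zone, themselves included. The function name `updates_per_tick` is made up for illustration; real netcode is nothing like this simple.

```python
def updates_per_tick(players: int) -> int:
    """Position updates sent per tick under the simplified model.

    Assumption: each player receives one update per player in the zone
    (their own official position plus everyone else's).
    """
    return players * players


if __name__ == "__main__":
    # Doubling the player count quadruples the update traffic.
    for n in (2, 4, 8, 16):
        print(f"{n:>2} players -> {updates_per_tick(n):>3} updates per tick")
```

Each time the player count doubles, the update count quadruples, which is the whole point: the load grows with the square of the population, not with the population itself.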
So if you have a case like that, you have the option of spending those 'five servers' on one world to handle 16,000 people, or five worlds to handle 25,000 people. This is why the solution to capacity problems in MMOs is usually to open new worlds, and not just to grow existing ones.
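Here's the same trade-off as a few lines of Python, using only the made-up capacity figures from the example above (5000, 9000, 12000, 14000, 16000). The names `ONE_WORLD_CAPACITY`, `separate_worlds`, and `marginal_gain` are just for illustration; none of these numbers come from any real game.

```python
# Illustrative figures from the example: servers in ONE world -> players supported.
ONE_WORLD_CAPACITY = {1: 5000, 2: 9000, 3: 12000, 4: 14000, 5: 16000}


def separate_worlds(servers: int) -> int:
    """Capacity if every server runs its own independent world."""
    return servers * ONE_WORLD_CAPACITY[1]


def marginal_gain(servers: int) -> int:
    """Extra players gained by adding the Nth server to a single world."""
    previous = ONE_WORLD_CAPACITY.get(servers - 1, 0)
    return ONE_WORLD_CAPACITY[servers] - previous


if __name__ == "__main__":
    for n in sorted(ONE_WORLD_CAPACITY):
        print(
            f"{n} server(s): one world = {ONE_WORLD_CAPACITY[n]:>6}, "
            f"separate worlds = {separate_worlds(n):>6}, "
            f"gain from last server = {marginal_gain(n):>5}"
        )
```

With these figures the fifth server only buys 2000 more players in a single world, while the same five servers split into five worlds cover 25,000.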
Obviously these systems are more complex. A single 'world' is made up of dozens of components (character servers, combat servers, chat servers, instance servers, dispatch servers, login servers, etc.), and each one of these systems could have the same issues, and completely different load profiles and scaling issues. Because 'instance servers' appear to be shared across all worlds, they have to handle the capacity of every single dungeon, instanced fight, etc. for every single NA/EU world concurrently, which means that they scale completely differently than the rest of the worlds.
Another problem those servers had was that because no one could create an instance, a lot of people got backed up at the same points: before Sastasha, before Ifrit, before your level 20 class quest, etc. So now instead of having players spread out across the level curve, you have huge clusters of people catching up to each other like something out of The Amazing Race.
Then they bring the instance servers back up, and everyone rushes to do their instances. All the Ifrit fights, all the lv5 quests, all the Haukke Manor runs. Now suddenly instead of having instances spread out because levels are spread out, you have a huge proportion of players all trying to get into instances at once, and your load spikes, and now no one can get in. Now it's a completely different problem; instead of being unable to handle the common case of instance requirements, you can't handle the case of a large proportion of people online trying to run an instance, all happening at once.
One of the problems with servers is that when they're overloaded, it's easy for your monitoring tools to start failing (the system never gets around to running them because there's too much else going on), and you can have problems even logging in. In those cases, servers can hit their capacity suddenly and in unexpected ways, before you have a chance to spot the problem and figure out what's happening. For example, an intermittent memory leak can take down a server rapidly, and it's extremely difficult to track down because once the server has died you can't log in to debug it.
It wouldn't surprise me if some of their downtime was trying to work around those issues while also adding a lot of debugging information so they could track down what exactly was happening on the server and find the source of the problem (instead of just trying to mitigate it).
So for that instance issue, it doesn't matter how many servers they add to run instances, if they're still going to have them die off too quickly because there's a software bug they need to fix, or because each server adds less and less capacity because of non-linear scaling.
That's just my two cents though. It could be a dozen other reasons.
Edit: holy f-balls, this thread blew up. I've been trying to reply to everyone's questions but I've fallen behind, and my Free Company is having a meetup right now (I'm at the restaurant as I type this), so I'll have to come back to it all tonight and try to catch up.
Keep the questions coming and I'll try to answer them as best I can tonight!