Engineering decisions behind World of Tanks server
March 14th-18th, 2016
ENGINEERING DECISIONS BEHIND WORLD OF TANKS GAME CLUSTER
Maksim Baryshnikov, Wargaming
#GDC2016

ABOUT THE SPEAKER
12+ years of software development: as a developer, team lead, architect, CTO, even as a product manager. Currently: Solutions Architect at Wargaming.
WE DELIVER LEGENDARY ONLINE GAMES. GLOBALLY. WITH PASSION.

WORLD OF TANKS: RU
(Map of data centers. Labels in extraction order: 250+ servers, Moscow, 40 servers, Amsterdam, 80+ servers, Novosibirsk, ~70 servers, Krasnoyarsk, Frankfurt, ~70 servers.)

CLUSTER ANATOMY
HOW A SINGLE CLUSTER WORKS

CLUSTER COMPONENTS
(Diagram of cluster components: Internet, switch fabrics, LoginApps, BaseApps, CellApps, ServiceApp, DBApps, DBAppMgr, BaseAppMgr, CellAppMgr, database.)

CLUSTER COMPONENTS
LoginApp processes: responsible for logging users in. LoginApps have a public IP.
CellApp processes: power the actual tank battles. Load is dynamically balanced among CellApps in real time.
BaseApp processes: proxy between the user and the CellApp. Run all hangar logic. BaseApps have a public IP.
DBApp processes: persist user data to the database.
*Mgr processes: manage instances of the corresponding *App processes.

CLUSTER COMPONENTS
(Diagram: Client, LoginApp, DBApp, BaseAppMgr, CellAppMgr, database, and several BaseApps and CellApps hosting Base and Cell entities, including newly created entities.)

CLUSTER ANATOMY
HOW A BATTLE IS HANDLED WITHIN THE CLUSTER INFRASTRUCTURE

SPACES (BATTLE ARENAS)
Cell load: the amount of time a cell spends calculating the game situation, divided by the length of a game tick.
CellAppMgr changes cell sizes in real time to keep the load of every cell below a configured threshold.
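The cell load definition above is a simple ratio, so a small sketch may help make it concrete. The tick length, the threshold value, and the names `Cell` and `overloaded` are assumptions made for illustration; the slides do not give the actual values or show how CellAppMgr is implemented.

```python
from dataclasses import dataclass

TICK_LENGTH_S = 0.1    # assumed game tick length (hypothetical value)
LOAD_THRESHOLD = 0.8   # assumed configured threshold (hypothetical value)


@dataclass
class Cell:
    name: str
    compute_time_s: float  # time spent calculating the game situation in one tick

    @property
    def load(self) -> float:
        # Cell load as defined on the slide: calculation time / tick length.
        return self.compute_time_s / TICK_LENGTH_S


def overloaded(cells, threshold=LOAD_THRESHOLD):
    """Return the names of cells whose load is not below the threshold."""
    return [c.name for c in cells if c.load >= threshold]


cells = [Cell("Cell 1", 0.03), Cell("Cell 2", 0.09), Cell("Cell 3", 0.05)]
print(overloaded(cells))
```

In this sketch only Cell 2 is reported, since it spends about 90% of the tick on calculation. A manager process in the spirit of CellAppMgr would presumably react to such a cell by shrinking it or handing part of its area to another cell.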
(Diagram: a space divided into Cells 1 to 7, hosted by several CellApps and managed by CellAppMgr.)

MAINTAINING CELL LOAD
(Diagram: the cell layout of a space at successive points in time, with new cells added.)
CellAppMgr can also add cells to a space in order to keep each cell's load below the configured value.

AREA OF INTEREST
(Diagram labels: Cell 1, Cell 2, your tank, enemy tank, area of interest, 500 m.)

“GHOST IN THE CELL”
(Diagram labels: Cell 4, 500 m.)

LEVELS OF DETAIL
Beyond their classical function of rendering optimization, LODs are also used to optimize client-server network traffic: at far LODs, entity updates from the server become sparser, and some property updates are not sent at all.
(Diagram labels: your tank; entities A to E; near LOD, far LOD, hysteresis.)

CLUSTER ANATOMY
FAULT TOLERANCE

FAULT TOLERANCE: SENTINELS
Reviver: a watchdog process used to restart other processes that fail. Reviver processes are typically started on machines reserved for fault tolerance purposes.
*Mgr processes restart failed *App processes.
(Diagram: Reviver, LoginApp, DBAppMgr, BaseAppMgr, CellAppMgr, BaseApp, CellApp, DBApp.)

FAULT TOLERANCE: BACKUPS
(Diagram: CellApps and BaseApps.)
Entities in a CellApp store their backup data in the corresponding BaseApp entity.
A BaseApp backs up its entities to other BaseApps and holds cell entity backup data.
Upon a CellApp crash, cell entities are restored from the latest backup available.
If a BaseApp dies, each of its entities is restored on the BaseApp that was backing it up.
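The backup scheme described above can be illustrated with a minimal in-process sketch. The class names `BaseEntity` and `CellEntity`, the method names, and the state fields are hypothetical and chosen only for this example; the real server runs these entities in separate processes and its API is not shown in the slides.

```python
import copy


class BaseEntity:
    """Base-side counterpart that holds the latest backup of its cell entity."""

    def __init__(self):
        self.cell_backup = None

    def store_backup(self, state):
        # Deep copy so later changes on the cell side do not alter the backup.
        self.cell_backup = copy.deepcopy(state)


class CellEntity:
    def __init__(self, base, state):
        self.base = base
        self.state = state

    def back_up(self):
        # Send the current state to the corresponding base entity.
        self.base.store_backup(self.state)

    @classmethod
    def restore(cls, base):
        # Recreate the cell entity from the latest backup available.
        return cls(base, copy.deepcopy(base.cell_backup))


base = BaseEntity()
tank = CellEntity(base, {"hp": 1000, "pos": [10, 20]})
tank.back_up()
tank.state["hp"] = 850               # change made after the last backup
restored = CellEntity.restore(base)  # simulated CellApp crash and restore
print(restored.state)                # {'hp': 1000, 'pos': [10, 20]}
```

The restored entity carries the state from the last backup, so the change made after that backup is lost. How much state can be lost depends on how often backups are taken, which the slides do not specify.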
CAP THEOREM
A single cluster targets Availability and Partition Tolerance in terms of the CAP theorem.
The AP approach in this case means that, when components fail, battle state is eventually consistent (between the server and all connected clients).
*In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability and Partition Tolerance.

GEOGRAPHICALLY DISTRIBUTED CLUSTER-OF-CLUSTERS

WORLD OF TANKS: RU
(Same data center map as on the earlier WORLD OF TANKS: RU slide.)

CLUSTER-OF-CLUSTERS
(Diagram labels: CENTER, Amsterdam, BaseApp, Account Proxy; Krasnoyarsk, PERIPHERY, BaseApp, Account Base, Account Base, CellApp.)

CAP THEOREM
The multi-cluster targets Consistency and Partition Tolerance in terms of the CAP theorem.
The CP approach in this case means that account state is consistent across infrastructure components. This sacrifices Availability of the game for a particular client if a Periphery cluster fails or the network is unavailable.

EXTERNAL INTEGRATION

EXTERNAL INTEGRATION: EVENT DRIVEN SOA
(Diagram: services attached to a Service Bus through call() and to an Event Bus through pub/sub.)

EXTERNAL INTEGRATION
(Diagram labels: Krasnoyarsk, Amsterdam, BaseApp, Account Proxy, BaseApp, Account Base, ...)
(Diagram labels, continued: eSports Service, Account Base, CellApp, Clans Service, CENTER, Auth Service, PERIPHERY.)

LARGEST MULTI-CLUSTER IN NUMBERS
30+ million players
Peak of 1.1+ million players simultaneously online
200+ logins/sec, spikes to 1000+
100+ battles started every second
3000+ state exports to external services per second
500+ Gb of accounts data

QUESTIONS?
Maxim Baryshnikov
m_baryshnikov@wargaming.net