Engineering decisions behind World of Tanks server

March 14th-18th, 2016
ENGINEERING DECISIONS BEHIND WORLD OF TANKS GAME CLUSTER
Maksim Baryshnikov, Wargaming
#GDC2016
ABOUT THE SPEAKER
12+ years of software development as a developer, team lead, architect, CTO,
and even product manager.
Currently a Solutions Architect at Wargaming
WE DELIVER LEGENDARY ONLINE GAMES. GLOBALLY. WITH PASSION.
WORLD OF TANKS: RU
250+ servers
Moscow
40 servers
Amsterdam
80+ servers
Novosibirsk
~70 servers
Krasnoyarsk
Frankfurt
~70 servers
CLUSTER ANATOMY
HOW SINGLE CLUSTER WORKS
CLUSTER COMPONENTS
[Diagram: cluster components connected through switch fabrics. Labels: Internet, LoginApp (x2), BaseApp (x3), CellApp (x3), DBApp (x2), ServiceApp, BaseAppMgr, CellAppMgr, DBAppMgr, database]
CLUSTER COMPONENTS
LoginApp processes
Responsible for logging users in. LoginApps have public IP addresses.
CellApp processes
Power the actual tank battles. Load is dynamically balanced among CellApps in real time.
BaseApp processes
Act as a proxy between the user and the CellApp. Run all hangar logic. BaseApps have public IP addresses.
DBApp processes
Persist user data to the database.
*Mgr processes
Manage instances of the corresponding *App processes.
CLUSTER COMPONENTS
[Diagram: Client, LoginApp, DBApp, BaseAppMgr, CellAppMgr and the database, with BaseApps hosting Base entities and CellApps hosting Cells that contain Entities; shown in two frames, the second including new entities]
CLUSTER ANATOMY
HOW BATTLE IS HANDLED WITHIN CLUSTER INFRASTRUCTURE
SPACES (BATTLE ARENAS)
Cell load: the time a cell spends calculating the game situation,
divided by the length of a game tick.
CellAppMgr changes cell sizes in real time in order to
keep the load of every cell below a configured threshold.
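The load definition above can be written down directly. The following is a minimal sketch; the tick length and the threshold are assumed values chosen for illustration (the talk gives neither), and `CellSample` is a hypothetical name.

```python
from dataclasses import dataclass

TICK_LENGTH_S = 0.1   # assumed tick length, not stated in the talk
LOAD_THRESHOLD = 0.8  # assumed configured threshold, not stated in the talk


@dataclass
class CellSample:
    """One measurement for one cell (hypothetical structure)."""
    name: str
    compute_time_s: float  # time spent calculating the game situation in one tick


def cell_load(sample: CellSample, tick_length_s: float = TICK_LENGTH_S) -> float:
    """Cell load = calculation time divided by the length of a game tick."""
    return sample.compute_time_s / tick_length_s


def overloaded(samples, threshold: float = LOAD_THRESHOLD):
    """Names of the cells whose load is above the configured threshold."""
    return [s.name for s in samples if cell_load(s) > threshold]
```

A load of 1.0 would mean the cell needs the whole tick for its calculations, which is why the threshold is kept below that.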
[Diagram: a space divided into Cells 1 to 7, hosted on five CellApps and managed by CellAppMgr]
MAINTAINING CELL LOAD
[Diagram: the cell layout of a space at successive points in time; cell boundaries move and a new Cell 2 appears]
CellAppMgr can also add cells to a space in order to keep each cell's load below the configured value.
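The choice between resizing existing cells and adding a new one can be sketched as follows. This is an idealized model, not the actual CellAppMgr logic: it assumes that load can be spread perfectly evenly by moving boundaries, and `rebalance_action` and its return values are hypothetical names.

```python
import math


def cells_needed(cell_loads, threshold):
    """Smallest cell count at which an even split of the total load fits the threshold."""
    return max(len(cell_loads), math.ceil(sum(cell_loads) / threshold))


def rebalance_action(cell_loads, threshold):
    """Decide what a manager could do for one space (idealized assumption)."""
    if max(cell_loads) <= threshold:
        return "ok"        # every cell is already below the threshold
    if cells_needed(cell_loads, threshold) == len(cell_loads):
        return "resize"    # moving boundaries between existing cells is enough
    return "add_cell"      # even a perfect split would leave cells overloaded
```

For example, loads of 0.9 and 0.3 against a threshold of 0.8 can be fixed by resizing, while two cells at 0.9 each need a third cell.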
AREA OF INTEREST
[Diagram: your tank, an enemy tank, and a 500 m area of interest, drawn across Cell 1 and Cell 2]
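An area-of-interest check can be sketched as a distance filter. The 500 m radius is taken from the slide; treating the area as a circle on a flat 2D plane, and the names `Tank` and `visible_entities`, are assumptions made for this example.

```python
import math
from typing import NamedTuple

AOI_RADIUS_M = 500.0  # radius shown on the slide


class Tank(NamedTuple):
    name: str
    x: float  # position in metres (assumed flat 2D coordinates)
    y: float


def in_area_of_interest(observer: Tank, other: Tank, radius: float = AOI_RADIUS_M) -> bool:
    """True when `other` is within `radius` metres of `observer`."""
    return math.hypot(other.x - observer.x, other.y - observer.y) <= radius


def visible_entities(observer: Tank, entities, radius: float = AOI_RADIUS_M):
    """Entities the observer should receive updates about, regardless of their cell."""
    return [e for e in entities
            if e is not observer and in_area_of_interest(observer, e, radius)]
```

The filter deliberately ignores cell boundaries: an entity in a neighbouring cell can still fall inside the observer's area of interest.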
“GHOST IN THE CELL”
[Diagram: 500 m zones drawn around Cell 4]
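The slide shows only the 500 m zones around Cell 4. The sketch below assumes that a "ghost" is a copy of an entity kept on a neighbouring cell while the entity is within 500 m of that cell, so that the neighbour can see it. Rectangular cells and all names here (`Rect`, `cells_needing_ghost`) are hypothetical.

```python
import math
from typing import NamedTuple

GHOST_MARGIN_M = 500.0  # matches the 500 m shown on the slide


class Rect(NamedTuple):
    """Axis-aligned cell bounds in metres (an assumption; real cells may differ)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def distance_to_rect(x: float, y: float, rect: Rect) -> float:
    """Distance from a point to a rectangle; 0.0 when the point is inside it."""
    dx = max(rect.min_x - x, 0.0, x - rect.max_x)
    dy = max(rect.min_y - y, 0.0, y - rect.max_y)
    return math.hypot(dx, dy)


def cells_needing_ghost(x: float, y: float, cells: dict, margin: float = GHOST_MARGIN_M):
    """Cells that do not contain the entity but lie within `margin` of it."""
    return sorted(name for name, rect in cells.items()
                  if 0.0 < distance_to_rect(x, y, rect) <= margin)
```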
LEVEL OF DETAIL
Beyond their classical role in rendering optimization,
LODs are also used to optimize client-server network traffic:
in far LODs, entity updates from the server become sparser,
and some property updates are not sent at all.
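[Diagram: your tank at the center of a near LOD zone and a far LOD zone, with a hysteresis band between them and entities A to E at different distances]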
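How a hysteresis band keeps an entity from flickering between LODs, and how a far LOD thins out updates, can be sketched like this. All distances, the update intervals, and the property name `turret_yaw` are assumptions chosen for illustration; the talk does not give concrete values.

```python
LOD_BOUNDARY_M = 300.0  # assumed near/far boundary
HYSTERESIS_M = 20.0     # assumed width of the hysteresis band
SEND_EVERY_N_TICKS = {"near": 1, "far": 4}  # assumed update intervals
NEAR_ONLY_PROPERTIES = {"turret_yaw"}       # hypothetical property dropped in far LOD


def next_lod(current: str, distance_m: float) -> str:
    """Switch to 'far' only beyond boundary + hysteresis; switch back below the boundary."""
    if current == "near" and distance_m > LOD_BOUNDARY_M + HYSTERESIS_M:
        return "far"
    if current == "far" and distance_m < LOD_BOUNDARY_M:
        return "near"
    return current


def update_for(lod: str, tick: int, properties: dict):
    """Properties to send on this tick, or None when the update is skipped."""
    if tick % SEND_EVERY_N_TICKS[lod] != 0:
        return None
    if lod == "far":
        return {k: v for k, v in properties.items() if k not in NEAR_ONLY_PROPERTIES}
    return dict(properties)
```

An entity hovering between 295 m and 318 m stays in the near LOD the whole time, because it never crosses the outer edge of the hysteresis band.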
CLUSTER ANATOMY
FAULT TOLERANCE
FAULT TOLERANCE: SENTINELS
Reviver — a watchdog process used to
restart other processes that fail.
Reviver processes are typically started on
machines reserved for fault tolerance
purposes.
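A watchdog of this kind can be sketched in a few lines. The talk does not say how Reviver detects a failure; the heartbeat timeout below is an assumption, and the class interface (`heartbeat`, `check`, the injected `restart` callback) is hypothetical.

```python
import time


class Reviver:
    """Minimal watchdog sketch: restarts processes whose heartbeat is overdue."""

    def __init__(self, restart, timeout_s=5.0, clock=time.monotonic):
        self._restart = restart      # callback that restarts a process by name
        self._timeout_s = timeout_s  # assumed failure-detection timeout
        self._clock = clock          # injectable clock, which makes testing easy
        self._last_seen = {}

    def heartbeat(self, name):
        """Record that the named process is alive."""
        self._last_seen[name] = self._clock()

    def check(self):
        """Restart every process that has been silent too long; return their names."""
        now = self._clock()
        revived = []
        for name, seen in self._last_seen.items():
            if now - seen > self._timeout_s:
                self._restart(name)
                self._last_seen[name] = now  # give the restarted process a fresh window
                revived.append(name)
        return revived
```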
*Mgr processes restart failed *App processes.
[Diagram: Reviver, LoginApp, DBAppMgr, BaseAppMgr, CellAppMgr, BaseApp, CellApp, DBApp]
FAULT TOLERANCE: BACKUPS
Entities in a CellApp store their backup data in the
corresponding BaseApp entity.
[Diagram: three CellApps and three BaseApps]
Each BaseApp backs up its entities to other BaseApps and also holds cell
entity backup data.
Upon a CellApp crash, cell entities are restored from the latest
available backup.
If a BaseApp dies, each of its entities is restored on the
BaseApp that was backing it up.
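The two backup paths described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the real backup protocol: the `BaseApp` class, its method names, and the entity record layout are all hypothetical.

```python
import copy


class BaseApp:
    """Sketch of a BaseApp that holds cell backups and backs itself up to a peer."""

    def __init__(self, name):
        self.name = name
        self.entities = {}      # entity_id -> {"base": ..., "cell_backup": ...}
        self.peer_backups = {}  # (origin BaseApp name, entity_id) -> copied record

    def store_cell_backup(self, entity_id, cell_state):
        """A cell entity stores its backup data in the corresponding base entity."""
        self.entities[entity_id]["cell_backup"] = copy.deepcopy(cell_state)

    def restore_cell_entity(self, entity_id):
        """After a CellApp crash: hand back the latest available cell backup."""
        return copy.deepcopy(self.entities[entity_id]["cell_backup"])

    def back_up_to(self, peer):
        """Copy all entities, including their cell backups, to another BaseApp."""
        for entity_id, record in self.entities.items():
            peer.peer_backups[(self.name, entity_id)] = copy.deepcopy(record)

    def adopt_entities_of(self, dead_name):
        """After a BaseApp dies: restore its entities here from the held backups."""
        adopted = []
        for origin, entity_id in list(self.peer_backups):
            if origin == dead_name:
                self.entities[entity_id] = self.peer_backups.pop((origin, entity_id))
                adopted.append(entity_id)
        return adopted
```

Anything that changed after the last backup is lost on restore, which is the trade-off behind "latest backup available".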
CAP THEOREM
In terms of the CAP theorem, a single cluster targets Availability and Partition Tolerance (AP).
Here the AP approach means that, when components fail, the battle state is
eventually consistent between the server and all connected clients.
*In theoretical computer science, the CAP theorem, also known as Brewer's theorem,
states that it is impossible for a distributed computer system to simultaneously provide all
three of the following guarantees: Consistency, Availability and Partition Tolerance
GEOGRAPHICALLY DISTRIBUTED
CLUSTER-OF-CLUSTERS
CLUSTER-OF-CLUSTERS
[Diagram labels: CENTER, PERIPHERY, Amsterdam, Krasnoyarsk, BaseApp, Account Proxy, Account Base, CellApp]
CAP THEOREM
In terms of the CAP theorem, the multi-cluster targets Consistency and Partition Tolerance (CP).
Here the CP approach means that account state is consistent across
infrastructure components. The cost is Availability: the game becomes unavailable to a particular
client if the Periphery cluster fails or the network is unavailable.
EXTERNAL INTEGRATION
EXTERNAL INTEGRATION: EVENT DRIVEN SOA
[Diagram: several services, each connected to a Service Bus via call() and to an Event Bus via pub/sub]
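The two interaction styles in the diagram, request/response over a service bus and publish/subscribe over an event bus, can be sketched in-process as follows. Real buses would be networked; the class interfaces, the service name `clans`, and the topic `battle.finished` are hypothetical.

```python
from collections import defaultdict


class ServiceBus:
    """Request/response: call() routes a request to one registered service."""

    def __init__(self):
        self._services = {}

    def register(self, name, handler):
        self._services[name] = handler

    def call(self, name, **params):
        return self._services[name](**params)


class EventBus:
    """Publish/subscribe: publish() fans an event out to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)  # number of subscribers that received the event
```

With call() the caller waits for exactly one answer; with pub/sub the publisher does not know, or need to know, who is listening.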
EXTERNAL INTEGRATION
[Diagram labels: CENTER, PERIPHERY, Amsterdam, Krasnoyarsk, BaseApp, Account Proxy, Account Base, CellApp, Auth Service, Clans Service, eSports Service, ...]
LARGEST MULTI-CLUSTER IN NUMBERS
30+ million players
Peak of 1.1+ million players simultaneously online
200+ logins/sec, spikes to 1000+
100+ battles started every second
3000+ state exports to external services per second
500+ GB of account data
QUESTIONS?
Maxim Baryshnikov
m_baryshnikov@wargaming.net