Transcription

STUDYING I/O IN THE DATA CENTER:
OBSERVING AND SIMULATING I/O FOR FUN AND PROFIT
Simulation, Observation, and Software:
Supporting exascale storage and I/O
ROB ROSS
Mathematics and Computer Science Division
Argonne National Laboratory
rross@mcs.anl.gov
June 23, 2016
I/O: A STORAGE DEVICE PERSPECTIVE (2011)
[Figure: aggregate read and write I/O throughput on BG/P storage servers at one-minute intervals (GBytes/s, 0-40) over roughly two months (late January through late March), annotated with scheduled maintenance windows, missing data, network maintenance, storage maintenance, control system maintenance, and an EarthScience project usage change.]
§ The I/O system is rarely idle at this granularity.
§ The I/O system is also rarely at more than 33% of peak bandwidth.
§ What’s going on?
–  Are applications not really using the FS?
–  Is it just not possible to saturate the FS?
–  Did we buy too much storage?
P. Carns et al. "Understanding and improving computational science storage access through
continuous characterization." ACM Transactions on Storage. 2011.
OBSERVATION: DARSHAN
WHAT ARE APPLICATIONS DOING, EXACTLY?
Work was inspired by the CHARISMA project (mid 1990s), which had the
goal of capturing the behavior of several production workloads, on
different machines, at the level of individual reads and writes.
§ Studied Intel iPSC/860 and Thinking Machines CM-5
§ Instrumented FS library calls
§ Complete tracing of thousands of jobs in total
–  Jobs: sizes, numbers of concurrent jobs
–  File access patterns: local vs. global, access size, strided access
§ The community has continued to develop new tracing and profiling
techniques and apply them to critical case studies
§ … but complete tracing at a broad, system-wide level is impractical on
modern systems
Nieuwejaar, Nils, et al. "File-access characteristics of parallel scientific workloads." IEEE Transactions on
Parallel and Distributed Systems 7.10 (1996): 1075-1089.
DARSHAN: CONCEPT
Goal: to observe I/O patterns of the majority of
applications running on production HPC platforms,
without perturbing their execution, with enough detail to
gain insight and aid in performance debugging.
§ Majority of applications – integrate with system build
environment
§ Without perturbation – bounded use of resources (memory,
network, storage); no communication or I/O prior to job
termination; compression.
§ Adequate detail:
– basic job statistics
– file access information from multiple APIs
DARSHAN: TECHNICAL DETAILS
§ Runtime library for characterization of application I/O
–  Transparent instrumentation
–  Captures POSIX I/O, MPI-IO, and limited HDF5 and PNetCDF
functions
§ Minimal application impact (“just leave it on”)
–  Bounded memory consumption per process
–  Records strategically chosen counters, timestamps, and histograms
–  Reduces, compresses, and aggregates data at MPI_Finalize() time
–  Generates a single compact summary log per job
§ Compatible with IBM BG, Cray, and Linux environments
–  Deployed system-wide or enabled by individual users
–  No source code modifications or changes to build rules
–  No file system dependencies
§ Enabled by default at NERSC, ALCF, and NCSA
§ Also deployed at OLCF, LLNL, and LANL in limited use
http://www.mcs.anl.gov/research/projects/darshan
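The transparent instrumentation described above interposes on I/O calls rather than modifying application code. The sketch below shows the general technique for dynamically linked executables: an LD_PRELOAD shim that wraps POSIX write(), keeps a small fixed set of per-process counters, and defers all reporting to process exit. The counter names, histogram bins, and output format are illustrative assumptions, not Darshan's actual internals (Darshan also handles static linking, MPI-IO, and a compressed binary log).

/* iotrace.c: a minimal sketch of POSIX write() interposition via LD_PRELOAD.
 * Illustrative only; not Darshan's real counters or log format.
 * Build: gcc -shared -fPIC iotrace.c -o iotrace.so -ldl
 * Run:   LD_PRELOAD=./iotrace.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* Bounded, per-process state: a fixed set of counters, no allocation. */
static long long write_count;    /* number of write() calls          */
static long long bytes_written;  /* total bytes written              */
static long long size_hist[4];   /* <1 KiB, <100 KiB, <1 MiB, >=1 MiB */

ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);
    if (!real_write)   /* resolve the underlying libc symbol once */
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");

    ssize_t ret = real_write(fd, buf, count);   /* forward the call */
    if (ret > 0) {
        write_count++;
        bytes_written += ret;
        size_hist[ret < 1024 ? 0 : ret < 102400 ? 1 :
                  ret < 1048576 ? 2 : 3]++;
    }
    return ret;
}

/* Emit the summary only at process exit, so the application itself is
 * never slowed by extra I/O while it runs. */
__attribute__((destructor))
static void iotrace_report(void)
{
    fprintf(stderr, "writes=%lld bytes=%lld hist=[%lld,%lld,%lld,%lld]\n",
            write_count, bytes_written,
            size_hist[0], size_hist[1], size_hist[2], size_hist[3]);
}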
DARSHAN: I/O CHARACTERIZATION
786,432-process turbulence simulation on Mira
SHARING DATA WITH THE COMMUNITY
This type of data can be invaluable to researchers trying to
understand how applications use HPC systems.
§ Conversion utilities can anonymize and re-compress data
§ Compact data format in conjunction with anonymization makes it
possible to share data with the community in bulk
§ ALCF I/O Data Repository provides access to production logs captured
on Intrepid
§ Logs from 2010 to 2013, when the machine was retired

ALCF I/O Data Repository Statistics:
–  Unique log files: 152,167
–  Core-hours instrumented: 721 million
–  Data read: 25.2 petabytes
–  Data written: 5.7 petabytes
http://www.mcs.anl.gov/research/projects/darshan/data/
DARSHAN: BUILDING HIGHER-LEVEL VIEWS
§ Mining Darshan data
–  Ongoing work led by M. Winslett (UIUC)
–  Mining Darshan logs for interactive visualization and generation of representative workloads
§ Web dashboard for ad-hoc analysis: https://github.com/huongluu/DarshanVis
[Figure: "Mira: Jobs I/O Throughput" – number of jobs with a given I/O throughput and total number of bytes transferred on Mira, on log-scale axes from 1 B to 1 PB transferred and 1 B/s to 1 TB/s of throughput, with job counts binned from 0-10 up to 5k-10k, a "time = 5 mins" reference line, and the system peak (240 GB/s) and 1% of peak marked.]
Luu, Huong, et al. "A multiplatform study of I/O behavior on
petascale supercomputers." Proceedings of HPDC, 2015.
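As a sketch of how such higher-level views can be built from Darshan-derived data, the program below bins per-job totals into a log-scale histogram of I/O throughput versus bytes moved, similar in spirit to the figure above. It assumes a hypothetical CSV of (jobid, bytes, I/O seconds) already extracted from Darshan logs (for example with darshan-parser); the file name and layout are illustrative, not a Darshan format.

/* histjobs.c: binning per-job summaries into a 2-D log-scale histogram of
 * I/O throughput versus bytes moved. Input file and layout are hypothetical.
 * Build: gcc histjobs.c -o histjobs -lm
 */
#include <math.h>
#include <stdio.h>

#define TBINS 16   /* throughput decades, 1 B/s up to ~1 PB/s */
#define BBINS 16   /* bytes-moved decades, 1 B up to ~1 PB    */

int main(void)
{
    static long hist[TBINS][BBINS];
    FILE *f = fopen("jobs.csv", "r");   /* lines: jobid,bytes,io_seconds */
    if (!f) { perror("jobs.csv"); return 1; }

    long jobid;
    double bytes, secs;
    while (fscanf(f, "%ld,%lf,%lf", &jobid, &bytes, &secs) == 3) {
        if (bytes <= 0 || secs <= 0)
            continue;                       /* skip jobs with no I/O    */
        double tput = bytes / secs;         /* aggregate job throughput */
        int ti = (int)floor(log10(tput));   /* decade of throughput     */
        int bi = (int)floor(log10(bytes));  /* decade of bytes moved    */
        if (ti < 0) ti = 0;
        if (ti >= TBINS) ti = TBINS - 1;
        if (bi < 0) bi = 0;
        if (bi >= BBINS) bi = BBINS - 1;
        hist[ti][bi]++;
    }
    fclose(f);

    /* Print non-empty cells: throughput decade, bytes decade, job count. */
    for (int t = 0; t < TBINS; t++)
        for (int b = 0; b < BBINS; b++)
            if (hist[t][b])
                printf("1e%d B/s  1e%d B  %ld jobs\n", t, b, hist[t][b]);
    return 0;
}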
TARGETING ATTENTION:
SMALL WRITES TO SHARED FILES
§  Small writes can contribute to poor performance as a result of poor file
system stripe alignment, but there are many factors to consider:
–  Writes to non-shared files may be cached aggressively
–  Collective writes are normally optimized by MPI-IO
–  Throughput can be influenced by additional factors beyond write size
§  We searched for jobs that wrote less than 1 MiB per operation to shared
files without using any collective operations
§  Candidates for collective I/O or batching/buffering of write operations
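As an illustration of the suggested fix, the sketch below batches a phase's worth of tiny records into a single collective MPI-IO write to the shared file, letting the MPI-IO layer aggregate and align the data. The record count, record size, and file name are hypothetical.

/* collective_writes.c: a sketch of batching many tiny writes to a shared
 * file into one collective write per phase. Sizes are made up.
 * Build: mpicc collective_writes.c -o collective_writes
 */
#include <mpi.h>
#include <string.h>

#define NREC   1000          /* records produced per process per phase */
#define RECLEN 64            /* bytes per record (well under 100 B)    */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Buffer all of this phase's records locally ... */
    char buf[NREC * RECLEN];
    for (int i = 0; i < NREC; i++)
        memset(buf + i * RECLEN, 'a' + rank % 26, RECLEN);

    /* ... then issue ONE collective write instead of NREC tiny writes,
     * so the MPI-IO layer can aggregate and align data across processes. */
    MPI_Offset off = (MPI_Offset)rank * NREC * RECLEN;
    MPI_File_write_at_all(fh, off, buf, NREC * RECLEN, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}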
Summary of matching jobs (top example):
§  Scale: 128 processes
§  Run time: 30 minutes
§  Max. I/O time per process: 12 minutes
§  Metric: issued 5.7 billion write operations, each less than 100 bytes in size, to shared files
§  Averaged just over 1 MiB/s per process during shared write phase
P. Carns et al, “Production I/O Characterization on the Cray XE6,” CUG 2013.
FIRST STEPS IN APPLYING ML TO DARSHAN
[Figure: MBytes moved over time (the entire period is about a week), showing extensive rereading of a file in the home directory.]
Data from Sandeep Madireddy (ANL).
DARSHAN: STATUS
http://www.mcs.anl.gov/research/projects/darshan/
§ Open source (BSD-like license)
§ Git repository: https://xgitlab.cels.anl.gov/groups/darshan
–  Source code access
–  Issue tracking
§ Users mailing list
§ Routinely used on IBM BG/Q, Cray XC30 and XC40, and Linux
§ Enabled by default on Mira, Cori, Edison, and Blue Waters
§ Deep instrumentation:
–  POSIX
–  MPI-IO
–  BG/Q job information
§ Shallow instrumentation:
–  Parallel netCDF
–  HDF5
SIMULATION: CODES
Rensselaer Polytechnic Institute
Chris Carothers
Elsa Gonsiorowski
Justin LaPre
Caitlin Ross
Noah Wolfe
Argonne National Laboratory
Philip Carns
John Jenkins
Misbah Mubarak
Shane Snyder
Rob Ross
CODES
The goal of CODES is to use highly parallel
simulation to explore exascale and
distributed data-intensive storage system
design.
§ Project kickoff in 2010
§ Set of models, tools, and utilities intended to
simplify complex model specification and
development
–  Configuration utilities
–  Commonly-used models (e.g., networking)
–  Facilities to inject application workloads
§ Using these components to understand
systems of interest to DOE ASCR
CODES
[Diagram: the CODES stack – Workflows, Workloads, Services, Protocols, Hardware, Simulation (PDES).]
PARALLEL DISCRETE EVENT SIMULATION
The underpinnings of CODES
§ Set of discrete components that send time-stamped messages to each other.
§ Matches well with systems we want to simulate – workloads contain discrete actions, protocols have discrete, well-defined steps, etc.
§ We use ROSS (Rensselaer Optimistic Simulation System)
–  Builds on MPI for parallelism
–  Utilizes Time Warp/Optimistic or YAWNs/Conservative schedulers
–  Users provide functions for reverse computation – undo the effects of a particular event on the LP state (see the sketch below)
[Figure: event rate (events/second) versus number of Blue Gene/Q racks (2 to 120), comparing actual Sequoia performance with linear scaling from a 2-rack base.]
Barnes et al. "Warp speed: Executing time warp on 1,966,080 cores." Proceedings of the 2013 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (SIGSIM-PADS '13).
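To make the reverse-computation idea concrete, the sketch below pairs a forward event handler with a reverse handler that undoes its effect on the LP state, remembering in the message which branch was taken. This is illustrative C, not the actual ROSS API or a CODES model.

/* reverse_sketch.c: the reverse-computation idea in miniature. Every forward
 * event handler has a matching reverse handler that undoes its effect on the
 * LP state, so an optimistic scheduler can roll the LP back after a
 * causality violation. Not the ROSS API; types and names are illustrative.
 */
#include <stdio.h>

/* State of one logical process (LP): a queue length and a service count. */
struct lp_state { long queue_len; long served; };

/* The event message carries whatever the reverse handler needs to undo it. */
struct msg { int arrivals; int was_served; };

/* Forward handler: apply the event and record which branch was taken. */
static void arrive(struct lp_state *s, struct msg *m)
{
    s->queue_len += m->arrivals;
    m->was_served = 0;
    if (s->queue_len > 0) {          /* serve one item if any are waiting */
        s->queue_len--;
        s->served++;
        m->was_served = 1;           /* remember the branch for rollback  */
    }
}

/* Reverse handler: undo the forward handler exactly, using the saved bit. */
static void arrive_rev(struct lp_state *s, struct msg *m)
{
    if (m->was_served) {
        s->served--;
        s->queue_len++;
    }
    s->queue_len -= m->arrivals;
}

int main(void)
{
    struct lp_state s = {0, 0};
    struct msg m = {3, 0};
    arrive(&s, &m);       /* optimistic (possibly premature) execution */
    arrive_rev(&s, &m);   /* rollback after a causality violation      */
    printf("queue_len=%ld served=%ld (back to the initial state)\n",
           s.queue_len, s.served);
    return 0;
}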
CODES: UNDERSTANDING OBJECT STORES
P. Carns et al. “The Impact of Data Placement
on Resilience in Large-Scale Object Storage
Systems.” MSST. 2016.
§  Investigating replicated object storage system response to a server fault
–  Relevant to declustered RAID discussions in CORAL, etc.
§  Simulating rebuild traffic:
–  CRUSH object placement algorithm
–  Standard Ceph configuration
–  Simulating pipelining of data movement between nodes and storage device
–  Similar study could be performed for parity recalculation
§  Rebuild rate does not scale well with addition of servers
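The sketch below is a toy harness for the kind of question studied here: given some placement of replicated objects across servers, how is rebuild-source work spread over the survivors when one server fails? It uses plain pseudo-random placement rather than CRUSH, and every count and size is made up; swapping in a different placement function is how one would compare schemes.

/* rebuild_sketch.c: a toy model of rebuild load after a server failure in a
 * replicated object store. Placement here is plain pseudo-random, NOT the
 * CRUSH algorithm used in the study; it only shows how to measure how
 * evenly rebuild work lands on the surviving servers.
 */
#include <stdio.h>
#include <stdlib.h>

#define NSERVERS 16
#define NOBJECTS 100000
#define REPLICAS 3

int main(void)
{
    long source_load[NSERVERS] = {0};  /* objects each survivor must re-send */
    srand(42);

    for (int o = 0; o < NOBJECTS; o++) {
        int loc[REPLICAS];
        /* Pick REPLICAS distinct servers for this object. */
        for (int r = 0; r < REPLICAS; r++) {
            int again;
            do {
                again = 0;
                loc[r] = rand() % NSERVERS;
                for (int k = 0; k < r; k++)
                    if (loc[k] == loc[r]) again = 1;
            } while (again);
        }
        /* If server 0 held a replica, one surviving replica holder
         * becomes the rebuild source for this object. */
        for (int r = 0; r < REPLICAS; r++) {
            if (loc[r] == 0) {
                int src = loc[(r + 1) % REPLICAS];
                source_load[src]++;
                break;
            }
        }
    }

    for (int s = 1; s < NSERVERS; s++)
        printf("server %2d re-sends %ld objects\n", s, source_load[s]);
    return 0;
}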
CODES: UNDERSTANDING OBJECT STORES
§  Last server to complete the rebuild
only interacts with a few peers, one
of which is heavily loaded.
§  Placement groups aren’t sufficiently
balancing the load.
CODES: DRAGONFLY NETWORKS
Visualization from K. Li (UC Davis) of a 1,728-process AMG communication pattern with adaptive routing.
CODES: STATUS
http://www.mcs.anl.gov/research/projects/codes/
§ Open source (BSD-like license)
§ Git repository:
https://xgitlab.cels.anl.gov/groups/codes
–  Source code access
–  Issue tracking
§ Users mailing list
§ Routinely used on IBM BG/P, IBM BG/Q, and
Linux
§ Repository models include:
–  Torus (configurable dimensionality, etc.)
–  Dragonfly
–  DUMPI trace driver
–  File I/O driver
§ Research models:
–  Slim Fly
–  Fat Tree
“Summer of CODES” Workshop
July 12-13, Argonne
Users, Progress, Hackathon
http://press3.mcs.anl.gov/summerofcodes2016/
INTEGRATING DATA: TOKIO
Lawrence Berkeley National Lab
Nick Wright (Lead PI)
Suren Byna
Jialin Liu
Glenn Lockwood
Prabhat
William Woo
Argonne National Laboratory
Philip Carns
Rob Ross
Shane Snyder
Slides in this section were created by Glenn Lockwood.
THE TOKIO PROJECT
A holistic view is essential to understanding I/O performance.
Project Goals:
– Development of algorithms and a software framework that
will collect and correlate I/O workload data from
production HPC resources
– Enable a clearer view of system behavior and the causes
of behavior
– Audience: application scientists, facility operators and
computer science researchers in the field.
MULTIPLE DATA PATHS IN HIERARCHICAL
STORAGE
[Diagram: compute nodes, I/O nodes, burst buffer nodes, and storage servers.]
Burst buffer tier:
•  data flows through or around the flash tier
•  one data path between compute and storage is now up to four paths
TOKIO FRAMEWORK: SCALABLE COLLECTION
[Diagram: monitoring sources across the tiers – darshan and IPM on compute, collectd and LMT on the parallel file system, DataWarp accounting on the burst buffer.]
1.  Component-level monitoring data fed into RabbitMQ
– Slurm plugins (kernel counters, Darshan, IPM)
– Native support (procmon, collectd)
TOKIO FRAMEWORK: ANALYTICS FOUNDATION
2.  Component data stored on disk
–  Native log format or HDF5
–  Divorce collection from parsing
3.  Views: index of on-disk component data
–  Align data on time, jobid, topology
–  Expose via RDBMS/doc store
4.  Query interface: expose common analytics modules
–  Web UI, REST API for users
–  Leverage Spark ecosystem
[Diagram: Darshan, LMT, and collectd records flow over AMQP into per-component views stored in MySQL and a document store (ES); the views feed correlated views, statistical analysis, and the query interface.]
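A minimal sketch of the "view" idea, aligning records from two components on job and time: for each job record it averages the server-side samples that fall within the job's runtime window, producing one row of a correlated view. The record layouts stand in for Darshan job summaries and LMT-style file-system samples and are not TOKIO's actual schema.

/* view_join.c: correlating records from two monitoring components by job
 * and time window. The record layouts are hypothetical stand-ins, not a
 * TOKIO schema.
 */
#include <stdio.h>

struct job_rec { long jobid; double start, end; double gib_written; };
struct fs_rec  { double t; double write_mb_s; };

int main(void)
{
    /* Two small, time-sorted example streams (made-up values). */
    struct job_rec jobs[] = {
        { 1001, 100.0, 220.0, 512.0 },
        { 1002, 200.0, 380.0, 128.0 },
    };
    struct fs_rec fs[] = {
        { 110.0, 900.0 }, { 170.0, 450.0 }, { 230.0, 300.0 }, { 350.0, 700.0 },
    };
    int njobs = sizeof jobs / sizeof jobs[0];
    int nfs   = sizeof fs   / sizeof fs[0];

    /* For each job, average the server-side samples that fall inside its
     * runtime window: one row of a correlated "view". */
    for (int j = 0; j < njobs; j++) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < nfs; i++)
            if (fs[i].t >= jobs[j].start && fs[i].t <= jobs[j].end) {
                sum += fs[i].write_mb_s;
                n++;
            }
        printf("job %ld: %.0f GiB written, avg server write rate %.0f MB/s "
               "(%d samples)\n",
               jobs[j].jobid, jobs[j].gib_written, n ? sum / n : 0.0, n);
    }
    return 0;
}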
TOKIO: ALIGNING LUSTRE & BURST BUFFER SERVER DATA
[Figure: GiB transferred (0-100) versus elapsed time (0-400 seconds), showing burst buffer reads and writes and Lustre reads and writes aligned against the Slurm job duration.]
WHAT’S NEXT?
MORE STORAGE LAYERS!
HPC Before 2016:
–  DRAM memory
–  Lustre parallel file system
–  HPSS parallel tape archive

HPC After 2016:
–  Memory: 1-2 PB/sec; residence – hours; overwritten – continuous
–  Burst Buffer: 4-6 TB/sec; residence – hours; overwritten – hours
–  Parallel File System: 1-2 TB/sec; residence – days/weeks; flushed – weeks
–  Campaign Storage: 100-300 GB/sec; residence – months-year; flushed – months-year
–  Archive: 10s GB/sec (parallel tape); residence – forever

§ Why
–  BB: Economics (disk BW/IOPS too expensive)
–  PFS: Maturity and BB capacity too small
–  Campaign: Economics (tape BW too expensive)
–  Archive: Maturity and we really do need a “forever”
Slide from Gary Grider (LANL).
ECOSYSTEM OF DATA SERVICES
[Diagram: an application and its data (executables and libraries, intermediate data products, checkpoints, application data) supported by an ecosystem of services including SPINDLE, DataWarp, SCR, FTI, Kelpie, and MDHIM.]
Should the monitoring system account for all the participants, or can the ecosystem adopt a common monitoring data model?
THIS WORK IS SUPPORTED BY THE DIRECTOR, OFFICE OF
ADVANCED SCIENTIFIC COMPUTING RESEARCH, OFFICE OF
SCIENCE, OF THE U.S. DEPARTMENT OF ENERGY UNDER
CONTRACT NO. DE-AC02-06CH11357.