ZFS UTH Under The Hood

Transcription

ZFS UTH
Under The Hood
Superlite
Jason Banham & Jarod Nash
Systems TSC
Sun Microsystems
1
ZFS Elevator Pitch
“To create a reliable storage system from
inherently unreliable components”
• Data Integrity
> Historically considered “too expensive”
> Turns out, it isn't
> Real-world evidence shows silent corruption is a reality
> The alternative is unacceptable
• Ease of Use
> Combined filesystem and volume management
> Underlying storage managed as pools, which simplify administration
> Two commands: zpool & zfs
> zpool: manage storage pool (aka volume management)
> zfs: manage filesystems
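For illustration, a minimal sketch of those two commands in action (the pool, device, and filesystem names here are invented):

  # Create a mirrored pool called "tank" from two disks
  zpool create tank mirror c1t0d0 c2t0d0
  # Create a filesystem in it; ZFS mounts it at /tank/home automatically
  zfs create tank/home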
2
ZFS Data Integrity
2 Aspects
1. Always-consistent on-disk format
> Everything is copy-on-write (COW)
> Never overwrite live data
> On-disk state always valid – no “windows of vulnerability”
> Provides snapshots “for free”
> Everything is transactional
> Related changes succeed or fail as a whole
– AKA Transaction Group (TXG)
> No need for journaling
2. End-to-end checksums
> Filesystem metadata and file data protected using checksums
> Protects end to end across the interconnect, handling failures between storage and host
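As a sketch (dataset name carried over from the earlier example), the checksum algorithm is an ordinary per-dataset property:

  # Switch from the default checksum to SHA-256 for stronger protection
  zfs set checksum=sha256 tank/home
  # Verify the property took effect
  zfs get checksum tank/home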
3
ZFS COW: Copy On Write
1. Initial block tree
2. COW some blocks
3. COW indirect blocks
4. Rewrite uberblock (atomic)
4
FS/Volume Model vs. ZFS
FS/Volume I/O Stack (FS → Volume → disks)
Block Device Interface (FS to volume)
• “Write this block, then that block, ...”
• Loss of power = loss of on-disk consistency
• Workaround: journaling, which is slow & complex
Block Device Interface (volume to disks)
• Write each block to each disk immediately to keep mirrors in sync
• Loss of power = resync
• Synchronous and slow

ZFS I/O Stack (ZFS → DMU → Storage Pool)
Object-Based Transactions
• “Make these 7 changes to these 3 objects”
• All-or-nothing
Transaction Group Commit
• Again, all-or-nothing
• Always consistent on disk
• No journal – not needed
Transaction Group Batch I/O
• Schedule, aggregate, and issue I/O at will
• No resync if power lost
• Runs at platter speed
5
ZFS End to End Checksums
Disk Block Checksums
• Checksum stored with data block
• Any self-consistent block will pass
• Can't even detect stray writes
• Inherent FS/volume interface limitation
Disk checksum only validates media
✔ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite

ZFS Data Authentication
• Checksum stored in parent block pointer
• Fault isolation between data and checksum
• Entire storage pool is a self-validating Merkle tree
ZFS validates the entire I/O path
✔ Bit rot
✔ Phantom writes
✔ Misdirected reads and writes
✔ DMA parity errors
✔ Driver bugs
✔ Accidental overwrite
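A quick way to see this validation at work (pool name assumed): per-device checksum error counters, and any files with unrecoverable errors, appear in the pool status:

  # Show pool health and READ/WRITE/CKSUM error counters per device,
  # plus (with -v) a list of files affected by permanent errors
  zpool status -v tank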
6
Traditional Mirroring
1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell.
2. Volume manager passes the bad block up to the filesystem. If it's a metadata block, the filesystem panics. If not...
3. Filesystem returns bad data to the application.
[Diagram: Application → FS → xxVM mirror stack, shown at each step]
7
Self-Healing Data in ZFS
1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
[Diagram: Application → ZFS mirror stack, shown at each step]
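Repairs made by self-healing show up in the same error counters; a hedged sketch of checking up on them (pool name assumed):

  # Report only pools that are unhealthy or carry known errors
  zpool status -x
  # After fixing a transient fault, reset the error counters
  zpool clear tank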
8
ZFS Administration
• Pooled storage – no more volumes!
> All storage is shared – no wasted space, no wasted bandwidth
• Hierarchical filesystems with inherited properties
> Filesystems become administrative control points
– Per-dataset policy: snapshots, compression, backups, privileges, etc.
– Who's using all the space? du(1) takes forever, but df(1M) is instant!
> Manage logically related filesystems as a group
> Control compression, checksums, quotas, reservations, and more
> Mount and share filesystems without /etc/vfstab or /etc/dfs/dfstab
> Inheritance makes large-scale administration a snap
• Online everything
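A sketch of these administrative control points (dataset names invented):

  # Set a policy high in the hierarchy...
  zfs set compression=on tank/home
  # ...and it is inherited by every child created beneath it
  zfs create tank/home/fred
  zfs set quota=10g tank/home/fred
  # Share over NFS with no /etc/dfs/dfstab editing
  zfs set sharenfs=on tank/home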
9
FS/Volume Model vs. ZFS
Traditional Volumes
• Abstraction: virtual disk
• Partition/volume for each FS
• Grow/shrink by hand
• Each FS has limited bandwidth
• Storage is fragmented, stranded
[Diagram: one FS per Volume, three FS/Volume pairs]

ZFS Pooled Storage
• Abstraction: malloc/free
• No partitions to manage
• Grow/shrink automatically
• All bandwidth always available
• All storage in the pool is shared
[Diagram: three ZFS filesystems sharing one Storage Pool]
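The malloc/free model in practice (names assumed): filesystems simply draw on, and return space to, the common pool:

  # No partitions: both filesystems share the whole pool's space and bandwidth
  zfs create tank/projects
  zfs create tank/scratch
  # AVAIL is the shared pool free space, the same figure for each filesystem
  zfs list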
10
Dynamic Striping
• Automatically distributes load across all devices
• Writes: striped across all four mirrors
• Reads: wherever the data was written
• Block allocation policy considers:
> Capacity
> Performance (latency, BW)
> Health (degraded mirrors)
[Diagram: ZFS filesystems on a Storage Pool of mirrors 1-4]
Add Mirror 5
• Writes: striped across all five mirrors
• Reads: wherever the data was written
• No need to migrate existing data
> Old data striped across 1-4
> New data striped across 1-5
> COW gently reallocates old data
[Diagram: ZFS filesystems on a Storage Pool of mirrors 1-5]
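Growing the stripe is a single command (device names invented); as noted above, existing data stays where it is:

  # Add a fifth mirror; new writes immediately stripe across all five
  zpool add tank mirror c5t0d0 c5t1d0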
11
Snapshots for Free
• The combination of COW and TXGs means constant-time snapshots fall out for free*
• At end of TXG, don't free COWed blocks
> Actually cheaper to take a snapshot than not!
[Diagram: snapshot root and live root sharing unchanged blocks]
*Nothing is ever free, old COWed blocks of course consume space
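A sketch of the resulting workflow (dataset and snapshot names invented):

  # Constant-time snapshot of the live filesystem
  zfs snapshot tank/home@monday
  # List snapshots, or roll the live filesystem back to one
  zfs list -t snapshot
  zfs rollback tank/home@monday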
12
Disk Scrubbing
• Finds latent errors while they're still correctable
> Like ECC memory scrubbing, but for disks
• Verifies the integrity of all data
> Traverses pool metadata to read every copy of every block
> Verifies each copy against its 256-bit checksum
> Self-healing as it goes
• Provides fast and reliable resilvering
> Traditional resilver: whole-disk copy, no validity check
> ZFS resilver: live-data copy, everything checksummed
> All data-repair code uses the same reliable mechanism
– Mirror resilver, RAIDZ resilver, attach, replace, scrub
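Scrubbing is driven from the pool level (pool name assumed):

  # Walk and verify every copy of every block, in the background
  zpool scrub tank
  # Check scrub progress and any repairs performed
  zpool status tank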
13
ZFS Commands
• zfs(1M) used to administer filesystems, zvols, and dataset properties
• zpool(1M) used to control the storage pool
[Diagram: ZFS filesystems on a Storage Pool]
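A sketch of that division of labour (pool and dataset names assumed):

  # zpool(1M) drives the pool...
  zpool status tank
  zpool list
  # ...zfs(1M) drives the datasets within it
  zfs list
  zfs get all tank/home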
14
ZFS Live Demo
15
ZFS Availability
• OpenSolaris
> Open-source version of the latest Solaris in development (Nevada)
> Available via:
– Solaris Express Developer Edition
– Solaris Express Community Edition
– OpenSolaris Developer Preview 2 (Project Indiana)
– Other distros (Belenix, Nexenta, Schilix, MarTux)
• Solaris 10
> Since Update 2 (latest is Update 4)
• OpenSolaris versions always carry the latest and greatest bits, and are therefore the best way to play with and explore the potential of ZFS
16
ZFS Under The Hood
• Full day of ZFS Presentations and Talks
> Covering:
> Overview – more of this presentation and “manager safe”
> Issues – known issues around current ZFS implementation
> Under The Hood – how ZFS does what it does
• If you are seriously interested in ZFS and want to know more, would like to discuss it, or are simply curious about how it works, then drop us a line:
> Jarod.Nash@sun.com
> Jason.Banham@sun.com
17