Near-Data Computation: It's Not (Just) About Performance
Transcription
Slide 1: Near-Data Computation: It's Not (Just) About Performance
Steven Swanson, Non-Volatile Systems Laboratory, Computer Science and Engineering, University of California, San Diego

Slide 2: Solid State Memories
• NAND flash
– Ubiquitous, cheap
– Sort of slow, idiosyncratic
• Phase change, spin-torque MRAMs, etc.
– On the horizon
– DRAM-like speed
– DRAM- or flash-like density

Slide 3: [Chart: bandwidth relative to disk vs. 1/latency relative to disk, log-log, plotting Hard Drives (2006), PCIe-Flash (2007), PCIe-PCM (2010), PCIe-Flash (2012), PCIe-PCM (2015?), and DDR Fast NVM (2016?). Bandwidth has improved 5917× and 1/latency 7200× relative to disk, both about 2.4×/yr.]

Slide 4: Programmability in High-Speed Peripherals
• As hardware gets more sophisticated, programmability emerges
– GPUs: fixed function → programmable shaders → full-blown programmability
– NICs: fixed function → MAC offloading → TCP offload
• Storage has been left behind
– It hasn't gotten any faster in 40 years

Slide 5: Why Near-Data Processing?
Well studied:
• Internal bandwidth ≫ interconnect bandwidth
• Moving data takes energy; NDP compute can be energy efficient
• Data-dependent accesses
Worth a closer look:
• Storage latencies ≤ interconnect latency
• Storage latencies ≈ CPU latency
• NDP compute can be trusted (trusted computation)
• Semantic flexibility
• Improved real-time capabilities

Slide 6: Diverse SSD Semantics
• File system offload / OS bypass: [Caulfield, ASPLOS 2012], [Zhang, FAST 2012]
• Caching support: [Bhaskaran, INFLOW 2013], [Saxena, EuroSYS 2012]
• Database transaction support: [Coburn, SOSP 2013], [Ouyang, HPCA 2011], [Prabhakaran, OSDI 2008]
• NoSQL offload: Samsung's SmartSSD, [De, FCCM 2013]
• Sparse storage address space: FusionIO DFS
Lessons learned: implementation costs are high, and fit with applications is uncertain. Our goal: make programming an SSD easy and flexible.

Slide 7: The Software-Defined SSD

Slide 8: The Software-Defined SSD
[Diagram: the host machine runs the application against a generic SSDApp interface; a PCIe controller and bridge (4 GB/s) connect the host to the SDSSD, which contains multiple SSD CPUs, each with its own memory controller and memory banks at 3 GB/s.]

Slide 9: SDSSD Programming Model
• The host communicates with SDSSD processors via a general RPC interface.
• Host CPU: sends and receives RPC requests through an RPC ring in host memory, over PCIe.
• SSD CPU: sends and receives RPC requests; manages access to NV storage banks through a data streamer.

Slide 10: The SDSSD Prototype
• FPGA-based implementation
• DDR2 DRAM emulates PCM
– Configurable memory latency
– 48 ns reads, 150 ns writes
– 64 GB across 8 controllers
• PCIe: 2 GB/s, full duplex

Slide 11: SDSSD Case Studies
• Basic IO
• Caching
• Transaction processing
• NoSQL databases

Slide 12: Basic IO

Slide 13: Normal IO Operations: Base-IO
• The Base-IO app provides read/write functions, just like a normal SSD.
• Applications make system calls to the kernel.
• The kernel block driver issues RPCs.
[Diagram: (1) the user makes a syscall into the kernel module; (2) the kernel issues read/write RPCs to the SDSSD CPU, which accesses NVM.]

Slide 14: Faster IO: Direct Access IO
• The OS installs access permissions on behalf of a process.
• Applications issue RPCs directly to hardware, which checks permissions on each access.
[Diagram: (1) the user accesses the kernel module; (2) the kernel installs permissions; (3) the application issues direct IO accesses to the SDSSD CPU over a userspace channel.]

Slide 15: Preliminary Performance Comparison
[Charts: read and write bandwidth (MB/s, axis up to 1800) vs. request size (0.5 KB to 128 KB) for Base-IO, Direct-IO, and Direct-IO-HW.]

Slide 16: Caching

Slide 17: Caching
• NVMs will be caches before they are storage.
• Cache hits should be as fast as Direct-IO.
• The caching app tracks dirty data and usage statistics.
[Diagram: the application's file reads and writes go through a userspace cache library; cache hits go straight to the SDSSD cache, while misses fall through to the kernel's cache manager and device driver, and on to disk.]

Slide 18: 4 KB Cache Hit Latency
[Chart not transcribed.]

Slide 19: Transaction Processing

Slide 20: Atomic Writes
• Write-ahead logging guarantees atomicity.
• New interface for atomic operations: LogWrite(TID, file, file_off, log, log_off)
• Transaction commit occurs at the memory interface → full memory bandwidth.
[Diagram: a logging module inside the SDSSD maintains a transaction table (e.g., TID 15 PENDING, TID 24 FREE, TID 37 COMMITTED), a metadata file, a log file holding the new versions of blocks A-D, and a data file holding the old versions.]

Slide 21: Atomic Writes
• SDSSD atomic writes outperform a pure software approach by nearly 2.0×.
• SDSSD atomic writes achieve performance comparable to a pure hardware implementation.
[Chart: bandwidth (GB/s) vs. request size (0.5 KB to 128 KB) for AtomicWrite, SDSSD-Tx-Opt, SDSSD-Tx, Moneta-Tx, and Soft-atomic.]

Slide 22: Key-Value Store

Slide 23: Key-Value Store
• A key-value store is a fundamental building block for many enterprise applications.
• Supports simple operations:
– Get retrieves the value corresponding to a key.
– Put inserts or updates a key-value pair.
– Delete removes a key-value pair.
• Open-chaining hash table implementation.
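The open-chaining (separate-chaining) design mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the talk; the bucket count and the use of Python's built-in hash are arbitrary choices.

```python
# Minimal sketch of an open-chaining (separate-chaining) hash table
# with the Get/Put/Delete operations of the key-value app.
# Illustrative only; bucket count and hash function are assumptions.

class ChainedKVStore:
    def __init__(self, num_buckets=8):
        # Each bucket holds a chain (list) of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Hash the key to pick a bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        """Insert or update a key-value pair."""
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # update in place
                return
        chain.append((key, value))        # insert at chain tail

    def get(self, key):
        """Retrieve the value for a key, or None if absent."""
        for k, v in self._bucket(key):    # walk the chain, comparing keys
            if k == key:
                return v
        return None

    def delete(self, key):
        """Remove a key-value pair; return True if it was present."""
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

The chain walk in `get` is the part that matters for placement: run it on the host and every chain entry crosses the interconnect; run it on the SSD processor and only the matching value does.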
Slide 24: Direct-IO-Based Key-Value Store
[Diagram: to Get K3, the host computes Hash(K3), then reads index-structure buckets and chain entries (K1/V1 through K9/V9) from the SSD over I/O, comparing keys on the host until it finds the match.]

Slide 25: SDSSD Key-Value Store App
[Diagram: the host sends RPC(Get, K3); the SDSSD SPU computes the hash, walks the chain and compares keys next to the data, then DMAs back only the matching K3/V3 pair.]

Slide 26: MemcacheDB Performance
[Charts: throughput (million ops/s, roughly 0 to 0.6) for Direct-IO, Key-Value, and Key-Value-HW: Get and Put on Workload-A and Workload-B, and Get/Put throughput vs. average chain length from 1 to 32.]

Slide 27: Ease of Use

Slide 28: Conclusion
• New memories argue for near-data processing
– But not just for bandwidth!
– Trust, low latency, and flexibility are critical too.
• Semantic flexibility is valuable and viable
– 15% reduction in performance compared to a fixed-function implementation
– 6-30× reduction in implementation time
– Every application could have a custom SSD

Slide 29: Thanks!

Slide 30: Questions?
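The chain-length sensitivity in the MemcacheDB results comes down to how many bytes must cross the interconnect per Get. A back-of-envelope model (the entry, value, and RPC sizes below are hypothetical, chosen only for illustration): with Direct-IO the host reads every chain entry over PCIe to compare keys, while the offloaded app compares keys next to the data and returns only the match.

```python
# Back-of-envelope model of interconnect traffic for one Get,
# comparing host-side lookup (Direct-IO) with on-device lookup
# (SDSSD RPC). All sizes are hypothetical.

ENTRY_BYTES = 64     # one chain entry (key + next pointer) read by the host
VALUE_BYTES = 1024   # the value returned on a hit
RPC_BYTES = 64       # one RPC request/response header

def host_side_bytes(chain_len):
    # Direct-IO: every chain entry crosses PCIe so the host can
    # compare keys, then the value is read.
    return chain_len * ENTRY_BYTES + VALUE_BYTES

def offloaded_bytes(chain_len):
    # SDSSD: key comparisons happen next to the data; only the RPC
    # and the matching value cross the interconnect, independent of
    # chain length.
    return RPC_BYTES + VALUE_BYTES

for n in (1, 4, 16, 32):
    print(n, host_side_bytes(n), offloaded_bytes(n))
```

Host-side traffic grows linearly with average chain length while offloaded traffic stays flat, which matches the shape of the throughput-vs-chain-length curves.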