HPEC 2010 - Lightwave Research Laboratory
Transcription
HPEC 2010 - Lightwave Research Laboratory
Enabling High Performance Embedded Computing through Memory Access via Photonic Interconnects Gilbert Hendry Eric Robinson Vitaliy Gleyzer Johnnie Chan Luca P. Carloni Nadya Bliss Keren Bergman * This work is sponsored by the Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government. MIT Lincoln Laboratory HPEC 2010 - 1 Hendry, et al. 9/17/2010 Columbia University Silicon Photonics: Advantages and Disadvantages Advantages High bit rates (40 Gb/s demonstrated) Speed of light latency Distance not a factor (power, latency) Bandwidth density through WDM Disadvantages Unsolved challenges (stability, packaging, cost) MIT Lincoln Laboratory HPEC 2010 - 2 Hendry, et al. 9/17/2010 Photonic Interconnects hold potential for on-chip computing. However, the target applications must be considered to determine if photonics will be beneficial for them Columbia University Embedded Computing: ISR Applications Image Registration SAR Image Formation Where is the image in relation to other images already taken? How many pulses can feasibly be combined and what size of an image can we take? Image Sharpening Can image fidelity be improved through using additional information or multiple pictures? MIT Lincoln Laboratory HPEC 2010 - 3 Hendry, et al. 9/17/2010 Columbia University Image Registration Image Registration Involves: • Image Orientation and Scaling • Image Alignment Produces an image that “fits” properly with other registered images to get a global view of the area. MIT Lincoln Laboratory HPEC 2010 - 4 Hendry, et al. 9/17/2010 Columbia University Image Sharpening Image Fusion: Filtering: Fuses two low resolution images to form a high resolution result. Enhances image fidelity by combining filters with the original image (Bicubic, Bilinear, Halfband...) MIT Lincoln Laboratory HPEC 2010 - 5 Hendry, et al. 9/17/2010 Columbia University SAR Image Formation Synthetic Aperture Radar (SAR) is an imaging technique that uses RADAR pulses rather than photography SAR Processing: • Image formation nontrivial, requires combining pulses • The more pulses that can be processed, the higher the image resolution • SAR can operate in conditions where traditional photography fails (low light, cloud cover) MIT Lincoln Laboratory HPEC 2010 - 6 Hendry, et al. 9/17/2010 Columbia University ISR Application Kernels ISR Kernels: Matrix Multiply Fourier Transform • Matrix Multiply, Projective Transform, Fourier Transform • Used in a broad range of ISR applications • Typically a performance bottleneck • Demand high throughput from the memory and network modules Projective Transform MIT Lincoln Laboratory HPEC 2010 - 7 Hendry, et al. 9/17/2010 Columbia University Characteristics of ISR Applications ISR Applications Ideal Candidates for Photonic Interconnects • Large Memory Access Size • Low Power Requirements • High Memory Access to Compute Ratio • High Throughput Requirements MIT Lincoln Laboratory HPEC 2010 - 8 Hendry, et al. 9/17/2010 Columbia University Silicon Photonic Waveguide Technology 1.28 Tb/s Data Transmission Experiment (occupies small slice of available WG BW) 100 ps [Vlasov and McNab, Optics Express 12 (8) 1622 (2004)] before injection into waveguide after 5-cm waveguide and EDFA C28 C23 C46 C51 (1559 nm) (1555 nm) (1541 nm) (1537 nm) [B. G. Lee et al., Photon. Technol. Lett. 20 (10) 767 (2008)] MIT Lincoln Laboratory HPEC 2010 - 9 Hendry, et al. 9/17/2010 Low-loss (1.7 dB/cm), high-bandwidth (> 200 nm) silicon photonic waveguides can be fabricated in commercial CMOS process. Columbia University Silicon Photonic Modulator and Detector Technology (CW) LASER modulator detector Ge-on-Si Detectors: 40-GHz bandwidths 1 A/W responsivities Receivers (detectors w/ CMOS amplifiers): 1.1 pJ/bit demonstrated at 10 Gb/s Scalable to < 50 fJ/bit [M Lipson, Optics Express (2007)] 85 fJ/bit demonstrated at 10 Gb/s Scalable to < 25 fJ/bit MIT Lincoln Laboratory HPEC 2010 - 10 Hendry, et al. 9/17/2010 [M Watts, Group Four Photonics (2008)] Columbia University [S Koester, J. Lightw. Technol. (2007)] Ring Resonators • Modulator/filter Broadband λ MIT Lincoln Laboratory HPEC 2010 - 11 Hendry, et al. 9/17/2010 λ Columbia University Higher Order Switch Designs MIT Lincoln Laboratory HPEC 2010 - 12 Hendry, et al. 9/17/2010 [A. Biberman, IEEE Phot. Tech. Letters (2010)] Columbia University Two Network Designs • • Purely circuit-switched TDM-arbitrated MIT Lincoln Laboratory HPEC 2010 - 13 Hendry, et al. 9/17/2010 Columbia University Circuit-switched P-NoCs Electronic p- Control n-region region 1V Ohmic Heater Thermal Control 0V 0V Transmission 1V Off-resonance profile On-resonance profile MIT Lincoln Laboratory HPEC 2010 - 14 Hendry, et al. 9/17/2010 Injected Wavelengths Columbia University Peripheral Memory Access Processor Core Network Router Memory Access Point MIT Lincoln Laboratory HPEC 2010 - 15 Hendry, et al. 9/17/2010 Columbia University Memory Access Point On Chip Chip Boundary From Memory Module Data plane To/From Network-on-Chip To Memory Module Modulators Off Chip Control plane Memory Control [V. R. Almeida et al. Cornell] MIT Lincoln Laboratory HPEC 2010 - 16 Hendry, et al. 9/17/2010 Columbia University Two Network Designs • • Purely circuit-switched TDM-arbitrated MIT Lincoln Laboratory HPEC 2010 - 17 Hendry, et al. 9/17/2010 Columbia University Photonic TDM Network • Mesh topology • Distributed switch control • Single dimension transmission • Controlled by fixed time slots : MIT Lincoln Laboratory HPEC 2010 - 18 Hendry, et al. 9/17/2010 Columbia University Vertical Memory Access Vertical Coupler [J. Schrauwen et al. U of Ghent.] MIT Lincoln Laboratory HPEC 2010 - 19 Hendry, et al. 9/17/2010 Columbia University SDRAM DIMM Anatomy DRAM_Bank DRAM_Chip data IO Cntrl Banks (usually 8) Row addr/en Col Decoder Row Decoder Col addr/en data Sense Amps DRAM cell arrays Addr/c ntrl Ranks SDRAM device DRAM_DIMM MIT Lincoln Laboratory HPEC 2010 - 20 Hendry, et al. 9/17/2010 Columbia University Optical Circuit Memory (OCM) Anatomy Mux Chip Laser In drivers To Mux Chip DRAM Chip Addr/ cntrl IO Cntrl Bank data From Mux Chip IO Gating Rx Dec. AWG AWG AWG AWG AWG AWG AWG AWG AWG AWG Addr/cntrl Laser Source Waveguide MIT Lincoln Laboratory HPEC 2010 - 21 Hendry, et al. 9/17/2010 Waveguide Coupling VDD, Gnd Columbia University Results: Circuit Switched Application Performance Performance 47.3 Performance (GOPS) 50 40 31.82 27.8 26.51 30 17.76 20 10 1.04 0.78 4.74 1.75 13.48 4.32 3.12 0 Emesh EmeshCS PS-1 PS-2 Network Type Projective Transform Matrix Multiply FFT EmeshCS yields the best performance, but PS-1 and PS-2 are competitive MIT Lincoln Laboratory HPEC 2010 - 22 Hendry, et al. 9/17/2010 Columbia University Results: Circuit Switched Power Network Power 19 20 Power (W) 15.8 15 11.2 11.1 11.4 11.2 10 4.37 5 4.35 4.28 2.21 2.17 2.15 0 Emesh EmeshCS PS-1 PS-2 Network Type Projective Transform Matrix Multiply FFT PS-1 and PS-2 use much less power than electronic alternatives MIT Lincoln Laboratory HPEC 2010 - 23 Hendry, et al. 9/17/2010 Columbia University Results: Circuit Switched Performance/Watt Comparison Performance per Watt Improvement Improvement Factor 100 86.7 89.33 87.64 80 68.6 60 40 26.9 29.01 20 1 1 9.67 6.72 2.82 1 0 Emesh EmeshCS PS-1 PS-2 Network Type Projective Transform Matrix Multiply FFT PS-1 and PS-2 give the best performance per unit of power MIT Lincoln Laboratory HPEC 2010 - 24 Hendry, et al. 9/17/2010 Columbia University Results: Circuit Switched Power Budget Breakdown Projective Transform Electronic components dominate the power of all the systems in question PS-1 and PS-2 both dominated by Electronic Crossbar The Electronic Crossbar requires a significant amount of power. However, in the Electronic Mesh, the Electronic Buffers dominate the energy consumption MIT Lincoln Laboratory HPEC 2010 - 25 Hendry, et al. 9/17/2010 Emesh dominated by Electronic Buffer EmeshCS dominated by Crossbar and Electronic Wire Columbia University Results: TDM Projective Transform 30 15.49 16.02 11.22 40 30 10 20 5 10 0 0 Emesh Pmesh P-TDM P-ETDM Network Type 51.04 50 GOPS Power(W) 20 15 60 23.97 25 Performance per Watt Improvement Performance 20.87 7.55 1.11 Improvement Factor Network Power 25 22x 20 13x 15 10 5x 5 0 Emesh Pmesh P-TDM P-ETDM Network Type Emesh Pmesh P-TDM P-ETDM Network Type TDM Results: • Performed on a smaller image (256x256) • Yields the best performance when packets can be sent in a single time slice • Constant setup cost means smaller packages can be sent with less overhead TDM yields advantages when message sizes are smaller MIT Lincoln Laboratory HPEC 2010 - 26 Hendry, et al. 9/17/2010 Columbia University Conclusions • ISR front-end application performance is of increasing importance in the community • These applications put large demands on the memory and network subsystems • Photonics offers a low-powered approach to meeting these performance demands • Photonics-specific network architecture design important here MIT Lincoln Laboratory HPEC 2010 - 27 Hendry, et al. 9/17/2010 Columbia University Fin • Sponsored by DARPA MTO (PhotoMAN) For the full details on these photonic architectures, see our other publications in the Journal of Parallel and Distributed Computing (JPDC) 2011 and Supercomputing (SC) 2010 MIT Lincoln Laboratory HPEC 2010 - 28 Hendry, et al. 9/17/2010 Columbia University