MVPX: A Media-oriented Vector Processing Mechanism
Transcription
Cyberscience Center, Tohoku University

Background and Purpose

MultiMedia Applications (MMAs), such as music players, games, animations, and recognition, are expected to have the following features in their next generation:
• Plenty of data-level parallelism (DLP)
• Various vector lengths
• Large amounts of data transmission

Issues of Conventional Approaches
• It is difficult to execute MMAs of various vector lengths efficiently.
• Conventional approaches are inefficient for MMAs with short vectors.
• High power is consumed to achieve a high data-transmission ability.

Proposal of this Research

Research targets: next-generation MMAs are required to provide
• higher quality of media processing (use more computation-intensive algorithms and process larger data sets), and
• more varieties of MMAs (execute various MMAs on the same platform).

To meet these targets, this research proposes:
• To improve the computing power by using DLP, focusing on vector architectures: an out-of-order vector processing mechanism (OVPM) that improves the performance of vector architectures on short-vector processing.
• To improve the data-transmission ability, focusing on the memory sub-system: a multi-banked cache memory (MVP-cache) that obtains a high data-transmission capability with lower power consumption.

OVPM: An Out-of-Order (OoO) Vector Processing Mechanism

Issues on Conventional Vector Processors
• Most modern vector architectures obey the in-order instruction issue policy.
• When executing MMAs with long vectors, the stalls caused by the in-order issue policy, the long memory latencies, and the pipeline latencies of the functional units are hidden by using large vector registers.
• When executing MMAs with short vectors, the stalls caused by the in-order issue policy and the memory latencies are exposed.
• Therefore, an OoO issue policy is required for vector architectures in order to execute MMAs with short vectors efficiently.

A simple example (MVL = Maximum Vector Length = Vector Register Length):

    for( i = 0; i < N; i++ ) {
        vload  va0, addr1
        vload  va1, addr2
        vadd   va2, va0, va1
        vstore va2, addr3
    }

[Figure: per-benchmark computational efficiency of the in-order mechanism for MVL = 32, 64, 128, 256, and 512 (sphinx3, faceRec, raytrace, vips, MxM, VxM, avg.); y-axis 0%–50%.]

Vector lengths of the benchmarks:

    Benchmark      sphinx3  faceRec  raytrace  vips  MxM   VxM
    Vector length  4096     173      1080      79    1000  1000

Behavior of In-order Issue and OVPM — simplified time-space diagrams of the simple example (pipeline stages vs. cycles; each vector instruction occupies its pipeline for min(MVL, VL) elements spread over the data-parallel pipelines, after its pipeline latency):
(a) Vector extension of IVPM executing the program with long vectors: the stalls and latencies are hidden.
(b) Vector extension of IVPM executing the program with short vectors: a stall is caused due to the in-order issue policy.
(b) Vector extension of OVPM executing the program with short vectors: the vloads overtake the vadd and the vstore of the previous iteration, respectively, due to OoO processing, and the execution cycles are reduced.
(c) Vector extension of OVPM executing the program with long vectors: the execution cycles are reduced as well.

IVPM: In-order Vector Processing Mechanism; OVPM: OoO Vector Processing Mechanism.

Microarchitecture of OVPM
• A general-purpose processor (fetcher, decoder, instruction fetch queue, general-purpose registers, FUs, LSU, I-cache, D-cache) is enhanced with a vector extension: vector registers, data-parallel pipelined vector function units (VFUs), a vector load/store unit (VLSU) with a vector load & store queue, a vector ready instruction buffer, and an MVP-cache between the VLSU and the main memory.
• Two new instruction buffers are added to realize the OoO processing (a minimal sketch of the resulting issue behavior follows below):
  - Vector Memory Instruction Buffer (VMIB): OoO processing for memory instructions
  - Vector Arithmetic Instruction Buffer (VAIB): OoO processing for arithmetic instructions
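To make the effect of the two buffers concrete, here is a minimal sketch, not the poster's simulator: a toy C model that issues the example's vload/vload/vadd/vstore sequence under an in-order and an OoO policy and counts the resulting cycles. The latencies, the vector length, and the two-pipeline (VLSU/VFU) model are illustrative assumptions.

    /*
     * issue_policy_sketch.c -- toy model, not the poster's simulator.
     * Compares in-order and out-of-order (OoO) issue of the example loop
     *     vload va0 ; vload va1 ; vadd va2, va0, va1 ; vstore va2
     * executed for several iterations on one VLSU pipeline (loads/stores)
     * and one VFU pipeline (adds).  All parameter values are assumptions.
     */
    #include <stdio.h>

    #define ITERS   4                 /* modeled loop iterations (assumption) */
    #define N_INSTR (ITERS * 4)
    #define MEM_LAT 20                /* vload/vstore start-up latency        */
    #define ALU_LAT 10                /* vadd start-up latency                */
    #define VL      16                /* a short vector length (assumption)   */

    typedef enum { VLOAD, VADD, VSTORE } Op;

    typedef struct {
        Op   op;
        int  src[2];                  /* producer instruction indices, -1 = none      */
        long issue, done;             /* issue cycle / completion cycle, -1 = not yet */
    } Instr;

    static void build(Instr *p)
    {
        for (int i = 0; i < ITERS; i++) {
            int b = 4 * i;
            p[b + 0] = (Instr){ VLOAD,  { -1,    -1    }, -1, -1 };
            p[b + 1] = (Instr){ VLOAD,  { -1,    -1    }, -1, -1 };
            p[b + 2] = (Instr){ VADD,   { b + 0, b + 1 }, -1, -1 };
            p[b + 3] = (Instr){ VSTORE, { b + 2, -1    }, -1, -1 };
        }
    }

    /* Simulate one issue policy; return the completion cycle of the last instruction. */
    static long run(int out_of_order)
    {
        Instr ins[N_INSTR];
        long  vlsu_free = 0, vfu_free = 0;   /* cycle at which each pipeline frees up */
        int   issued = 0;

        build(ins);
        for (long cycle = 0; issued < N_INSTR; cycle++) {
            for (int i = 0; i < N_INSTR; i++) {
                if (ins[i].issue >= 0)
                    continue;                          /* already issued           */
                int ready = 1;                         /* are both operands ready? */
                for (int s = 0; s < 2; s++) {
                    int p = ins[i].src[s];
                    if (p >= 0 && (ins[p].done < 0 || ins[p].done > cycle))
                        ready = 0;
                }
                long *pipe = (ins[i].op == VADD) ? &vfu_free : &vlsu_free;
                if (ready && *pipe <= cycle) {
                    int lat = (ins[i].op == VADD) ? ALU_LAT : MEM_LAT;
                    ins[i].issue = cycle;
                    ins[i].done  = cycle + lat + VL;   /* start-up + 1 element/cycle  */
                    *pipe        = cycle + VL;         /* pipeline busy for VL cycles */
                    issued++;
                }
                if (!out_of_order)
                    break;   /* in-order: only the oldest unissued instruction may try */
            }
        }
        long last = 0;
        for (int i = 0; i < N_INSTR; i++)
            if (ins[i].done > last)
                last = ins[i].done;
        return last;
    }

    int main(void)
    {
        printf("in-order issue : %ld cycles\n", run(0));
        printf("OoO issue      : %ld cycles\n", run(1));
        return 0;
    }

Compiling and running the sketch (e.g., cc issue_policy_sketch.c && ./a.out) shows the OoO policy finishing in noticeably fewer cycles than the in-order policy for the short vector length chosen, which is the gap the VMIB and VAIB are intended to close in hardware.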
Experimental Setup
• Simulator development: a simulator based on the SimpleScalar toolset, enhanced with the vector extension described above.
• Benchmarks: taken from the PARSEC and ALPBench benchmark suites.

Simulation parameters:

    Vector ALU latency                  10 cycles
    Vector multiplier latency           15 cycles
    Vector division latency             50 cycles
    Number of parallel pipelined VFUs   8
    Main memory latency                 100 cycles
    Entries per vector register         128 entries
    Frequency                           3 GHz

Impacts of OVPM

[Figure: computational efficiency of IVPM and OVPM on sphinx, face, ray, vips, MxM, and VxM.]

• The average computational efficiency of IVPM is 17%, while that of OVPM reaches 55.2%.
• The computational efficiencies improve especially for the MMAs with short vectors.
• MMAs with both short vectors and long vectors achieve a high utilization of the hardware.

MVP-Cache: A High Bandwidth Cache System

Costs of Increasing Memory Bandwidth

[Figure: area of the memory ports (mm^2) and their peak dynamic power (W) versus the number of memory ports (1, 2, 4, 8, 16).]

• Increasing the memory bandwidth by adding memory ports incurs large area and power overheads.
• It is therefore necessary to propose a high-bandwidth cache system, MVP-cache, that increases the effective memory bandwidth instead.

Structure of MVP-cache:
• OVPM is connected through an interconnection to n groups of cache banks: banks 0 … m-1 share a bus to memory channel 0, …, and banks (n-1)·m … n·m-1 share a bus to memory channel n-1 of the main memory.
• MVP-cache achieves high bandwidth by accessing multiple independent banks concurrently.
• It hides the access latencies by using the interleaved memory-access method (a minimal sketch of one possible interleaving appears at the end of this transcription).

Performance Evaluation of MVP-cache

    Speedup = Exe. time of OVPM w/o MVP-cache / Exe. time of OVPM w/ MVP-cache

[Figure: speedup of OVPM with MVP-cache over OVPM without MVP-cache for sphinx, face, ray, vips, MxM, VxM, and avg.; per-benchmark values range from about 0.99 to 1.76, 1.33 on average.]

• With a cache bandwidth twice as high as the memory bandwidth, a 1.33x performance improvement is obtained (2x improvement in theory).
• MVP-cache bridges the gap between the main memory and OVPM.

Impacts of Cache Bandwidth

[Figure: relative performance with 2, 4, 8, 16, and 32 banks (16, 32, 64, 128, and 256 bytes/cycle) on sphinx, face, ray, vips, MxM, VxM, and avg., normalized to the 1-bank cache (8 bytes/cycle).]

• Most of the MMAs are sensitive to the cache bandwidth.
• The higher the cache hit rate is, the more the performance improves.

SC13, Denver, Colorado
(URL) http://www.sc.isc.tohoku.ac.jp/
(E-mail) gaoye@sc.isc.tohoku.ac.jp
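As a concrete illustration of the interleaved memory-access method referenced above, the following is a minimal sketch under assumed parameters: it maps consecutive cache lines round-robin across the n memory channels and the m banks per channel. The line size, the channel and bank counts, and the interleaving policy are illustrative assumptions, not the configuration evaluated on the poster.

    /*
     * bank_interleave_sketch.c -- illustrative only.
     * Shows one possible interleaved mapping of cache lines onto the
     * n-channel, m-banks-per-channel organization of an MVP-cache-like
     * multi-banked cache.  All sizes and the policy are assumptions.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE  64u            /* bytes per cache line (assumption) */
    #define N_CHANNELS 4u             /* n memory channels    (assumption) */
    #define M_BANKS    8u             /* m banks per channel  (assumption) */

    /* Map an address to (channel, bank within channel, global bank index).
     * Consecutive lines rotate across the channels first and then across
     * the banks of each channel, so a unit-stride vector access spreads
     * over many independent banks that can be accessed concurrently. */
    static void map_line(uint64_t addr, unsigned *channel, unsigned *bank,
                         unsigned *global_bank)
    {
        uint64_t line = addr / LINE_SIZE;
        *channel     = (unsigned)(line % N_CHANNELS);
        *bank        = (unsigned)((line / N_CHANNELS) % M_BANKS);
        /* Poster numbering: banks c*m .. c*m + m - 1 sit on memory channel c. */
        *global_bank = *channel * M_BANKS + *bank;
    }

    int main(void)
    {
        uint64_t base = 0x100000;     /* arbitrary base address */
        /* Walk a unit-stride access of 8 consecutive cache lines. */
        for (unsigned i = 0; i < 8; i++) {
            unsigned ch, bk, gb;
            map_line(base + (uint64_t)i * LINE_SIZE, &ch, &bk, &gb);
            printf("line %u -> channel %u, bank %u (global bank %2u)\n",
                   i, ch, bk, gb);
        }
        return 0;
    }

With such a mapping, a unit-stride vector access touches a different bank on every line and cycles through all memory channels, so independent banks serve the access concurrently and the latency of each channel overlaps with accesses on the others, which is the effect MVP-cache relies on for its effective bandwidth.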