Latest Intel® VTune™ Amplifier XE 2016 Improvements
Transcription
Latest Intel® VTune™ Amplifier XE 2016 Improvements
Intel® VTune™ Amplifier XE Performance Profiler Latest 2016 Beta Improvements David Anderson, May 2015 Agenda • Intel® VTune™ Amplifier XE Performance Profiler • Intel® VTune™ Amplifier XE 2016 Beta: • Improvements in Intel® OpenMP Analysis • Better Support of MPI+OpenMP Hybrid Applications • Enhanced Intel® HD Graphics Profiling • Latest Ease-of-use Improvements • New IDE and OS Support • Additional Material • Demo Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 2 Tune Applications for Scalable Multicore Performance Intel® VTune™ Amplifier XE Performance Profiler Is your application slow? Does its speed scale with more cores? Tuning without data is just guessing Accurate CPU, GPU1 & threading data Powerful analysis & filtering of results Easy set-up, no special compiles “Last week, Intel® VTune™ Amplifier XE helped us find almost 3X performance improvement. This week it helped us improve the performance another 3X.” 1 Windows* only. Claire Cates Principal Developer SAS Institute Inc. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. http://intel.ly/vtune-amplifier-xe Optimization Notice 3 Two Great Ways to Collect Data Intel® VTune™ Amplifier XE Software Collector Hardware Collector Uses OS interrupts Uses the on chip Performance Monitoring Unit (PMU) Collects from a single process tree Collect system wide or from a single process tree. ~10ms default resolution ~1ms default resolution (finer granularity - finds small functions) Either an Intel® or a compatible processor Requires a genuine Intel® processor for collection Call stacks show calling sequence Optionally collect call stacks Works in virtual environments Works in a VM only when supported by the VM (e.g., vSphere* 5.1) No driver required Requires a driver - Easy to install on Windows - Linux requires root (or use default perf driver without stacks) No special recompiles - C, C++, C#, Fortran, Java, Assembly Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 4 A Rich Set of Performance Data Intel® VTune™ Amplifier XE Software Collector Basic Hotspots Hardware Collector Advanced Hotspots Which functions use the most time? Which functions use the most time? Where to inline? – Statistical call counts Concurrency General Exploration Tune parallelism. Colors show number of cores used. Locks and Waits Tune the #1 cause of slow threaded performance: – waiting with idle cores. Any IA86 processor, any VM, no driver Where is the biggest opportunity? Cache misses? Branch mispredictions? Advanced Analysis Dig deep to tune access contention, etc. Higher res., lower overhead, system wide No special recompiles - C, C++, C#, Fortran, Java, Assembly Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 5 Improvements in Intel® OpenMP Analysis Realizing Gain with Less Pain Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 6 Identify Where to Invest Effort in OpenMP Applications • Intel OpenMP analysis enhancements • Serial time vs. parallel time • OpenMP parallel region potential improvements Is serial time of my application significant to prevent scaling? How efficient is my parallelization towards ideal parallel execution? How much theoretical gain I can get if invest in tuning? Which regions have the highest potential ROI? Links to grid view for more details on inefficiency Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 7 What is Hindering Parallel Performance? VTune Amplifier Identifies Parallel Region Inefficiencies • Details of efficiency in Bottom-up view • Precise, trace-based imbalance calculation especially useful for small region instances Imbalance Likely culprit: Dynamic scheduling overhead Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 8 Jump to Parallel Region Source Code • View data specific to the region at the source code level • Tip: use ‘-parallel-source-info=2’ compiler option to embed source file name in region name (enables drill down to source file) Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 9 Understanding Large Parallel Regions New Barrier-to-Barrier Analysis • Essential for regions with more than 1 barrier (implicit or explicit) • • Critical for model with 1 parallel region and work-sharing constructs inside with pragma single to do sequential work Intel OpenMP runtime library traces sub-region segments from region fork or previous barrier points Parallel Region Sub-region 4 Sub-region 1 Sub-region 3 Sub-region 2 User Barrier omp single Single barrier omp for Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. omp for barrier Region Join Optimization Notice 10 Details on Barrier-to-Barrier Instances • Use the “/Barrier-to-Barrier Segment” grouping • Lexical loop constructs with different schedule types or chunk sizes displayed separately Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 11 Better Support of MPI+OpenMP Hybrid Applications VTune Amplifier + Intel® MPI = Success! Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 12 MPI+OpenMP Hybrid Application Analysis Optimize Where it Will be Most Effective • MPI communication spinning metrics for Intel® MPI • Showing OpenMP metrics per process sorting by processes laying on critical path of MPI execution • Don’t optimize OpenMP, if MPI communication is the bottleneck! Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 13 Deeper and Wider Analysis Get the Details and the Big Picture • Hyperlinks in Summary jump to Bottom-up view with ‘/Process /OpenMP Region/ …’ groupings to get details of the OpenMP metrics aggregated perprocess • MPI communication spin time can highlighted on the timeline Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 14 Insight Provided by Multiple Tools! • Intel® MPI provides ‘-gtool’ option in version 5.0.2 and higher to simplify selective rank profiling • Intel® Trace Analyzer 9.0.2 and higher can identify CPU-bound processes and generate VTune Amplifier command line • Modifies ITA command for use with VTune Amplifier Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 15 Enhanced Intel® HD graphics Profiling Visual GPU Execution Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 16 Review of GPU Profiling Support Hardware and OpenCL* Metrics • Collect Intel® Integrated Graphics hardware metrics • Details regarding OpenCL™ activity on the GPU • Presented correlated with CPU processes and threads • Available on Windows* * See this Getting Started article Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 17 Intel® HD Graphics Profiling on Linux* • Analyze OpenCL* kernels on Linux systems • GPU hardware metrics not yet available • Intel® Media Server Studio support • Extended counter set for 5th generation Intel® Core™ processors (code name: Broadwell) Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 18 Enhanced Intel® HD Graphics Profiling Visualize Hardware Interaction • GPU Architecture Diagram annotated with metrics • Available for OpenCL* Compute tasks Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 19 Latest Ease-of-Use Improvements Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 20 Analyze with Confidence! Low Sample Counts Reduce Metrics Validity • Summary, Bottom-up, and Source views gray out metrics when amount of collected data puts their validity in question • Currently in hardware-based sampling General Exploration data • Hovering over cell provides explanation Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 21 Platform Tab Replaces Tasks and Frames Can now Display Additional Information • Will include: • GPU Usage/Queue • Bandwidth • and CPU Freq ratio • depending on collection options Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 22 More “Big Picture” Support • “Super Tiny” bird’s-eye view to help recognize application phases and behavioral patterns • Use context menu to select Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 23 Better Linux* Support • Linux ‘perf’ supported for users without root access • Linux* operating systems based on kernel 2.6.32 or higher • That export CPU PMU programming details over /sys/bus/event_source/devices/cpu/format file system • VTune Amplifier hardware-based sampling driver provides additional features, such as: • Stacks† • Uncore events • Multiple, precise events • New events for the latest processors, even on older operating systems † Newer Linux releases include support for stacks-collection with PMU events Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 24 New IDE and OS Support • Microsoft* Visual Studio* 2015 integration support • Microsoft Windows* 10 “Threshold” tech preview build #9926 tested with hardware-based sampling • Software-based sampling types not supported (basic hotspots, concurrency, and locks and waits) Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 25 Summary • Intel® VTune™ Amplifier XE 2016 Beta: • Improvements in Intel® OpenMP Analysis • Better Support of MPI+OpenMP Hybrid Applications • Enhanced Intel® HD Graphics Profiling • Latest Ease-of-use Improvements • Let us know what you think, participate in the Beta Program today! • Register at bit.ly/psxe2016beta Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 26 Additional Material Intel® VTune™ Amplifier – Performance Profiler: Product page – overview, features, FAQs… Training materials – movies, tech briefs, documentation… Evaluation guides – step by step walk through Reviews Support – forums, secure support… Additional Analysis Tools: Intel® Inspector XE - memory and thread checker / debugger Intel® Advisor XE – thread prototyping tool for architects Additional Development Products: Intel® Software Development Products Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 27 Demo Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 28 Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice 29