
GCC Guide for Ampere Processors - SitePoint

This article was originally published by Ampere Computing.

This paper describes how to effectively use GNU Compiler Collection (GCC) options to help optimize application performance on Ampere Processors.

When attempting to optimize an application, it’s essential to measure whether a potential optimization improves performance. This includes compiler options. Using advanced compiler options may result in better runtime performance, potentially at the cost of increased compile time, more difficult debugging, and often increased binary size. Why compiler options affect performance is beyond the scope of this paper, although the short answer is that code generation, modern processor architectures and how they interact are very complicated! Another important point is that different processors may benefit from different compiler options because of differences in computer architecture and the specific microarchitecture. Repeated experimentation with optimizations is key to performance success.

How to measure an application’s performance to determine the limiting factors, as well as optimization strategies, has already been covered in previously published articles. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere Altra Family Processors explains how to optimize effectively and efficiently using a data-driven approach.

This paper first summarizes the most common GCC options with a description of how these options affect applications. The discussion then turns to case studies using GCC options to improve the performance of VP9 video encoding software and the MySQL database on Ampere Processors. Similar strategies have been effectively used to optimize additional software running on Ampere Processors.

GCC Recommendations

The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the gcc -mcpu option.

To use the gcc -mcpu option, either set the CPU model explicitly or tell GCC to use the CPU model of the machine that GCC is running on via -mcpu=native. Note that on legacy x86 based systems, gcc -mcpu is a deprecated synonym for -mtune, while gcc -mcpu is fully supported on Arm based systems. See Arm’s guide to Compiler flags across architectures: -march, -mtune, and -mcpu for details.

In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm. Below is a case study highlighting performance gains from setting the gcc -mcpu option with VP9 video encoding software.

Setting the -mcpu option:

  • -mcpu=ampere1: Generate code that will run on AmpereOne Processors. AmpereOne is the next generation of Cloud Native Processors from Ampere, extending the family of high-performance processors to new industry leading core counts. Note, this will generate code that will not run on Ampere Altra and Altra Max Processors. This option was initially available in GCC version 12.1 and later, then backported to GCC 10.5 and GCC 11.3.

  • -mcpu=neoverse-n1: Generate code that will run on Ampere Altra, Ampere Altra Max as well as Ampere AmpereOne. While using this option for code that will run on Ampere AmpereOne is supported, it will potentially not take advantage of all the new performance features available. Note, GCC version 9.1 or newer is required to enable CPU specific tunings for Ampere Altra and Ampere Altra Max processors.

  • -mcpu=native: Generate code setting the CPU model based on the CPU GCC is running on. Note, GCC version 9.1 or newer is required to enable CPU specific tunings for Ampere Altra and Ampere Altra Max processors.

Using -mcpu=native is potentially easier, although it has a possible drawback if the executable, shared library, or object file is used on a different system. If the build was done on an Ampere AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions supported on Ampere AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max processor, GCC will not take advantage of the latest performance improvements available on Ampere AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.

The following table lists the GCC versions that support the Ampere Processor -mcpu values.

Processor          -mcpu Value   GCC 9    GCC 10    GCC 11    GCC 12    GCC 13
Ampere Altra       neoverse-n1   ≥ 9.1    ALL       ALL       ALL       ALL
Ampere Altra Max   neoverse-n1   ≥ 9.1    ALL       ALL       ALL       ALL
AmpereOne          ampere1       N/A      ≥ 10.5    ≥ 11.3    ≥ 12.1    ALL
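
The table can be encoded as a small shell check, for example to stop a build script early when the installed compiler is too old for the chosen -mcpu value. This is only a sketch: the function name mcpu_supported is hypothetical, it expects a MAJOR.MINOR[.PATCH] version string, and it knows only the two values and the version ranges listed in the table.

```shell
# Hypothetical helper (not part of GCC): prints "yes" if the given GCC
# version accepts the given -mcpu value according to the table above,
# otherwise prints "no".
# Usage: mcpu_supported VALUE MAJOR.MINOR[.PATCH]
mcpu_supported() {
    value=$1
    major=${2%%.*}      # text before the first dot
    rest=${2#*.}        # text after the first dot
    minor=${rest%%.*}   # minor version only
    case "$value:$major" in
        neoverse-n1:9)  [ "$minor" -ge 1 ] ;;   # GCC 9: 9.1 or newer
        ampere1:10)     [ "$minor" -ge 5 ] ;;   # GCC 10: 10.5 or newer
        ampere1:11)     [ "$minor" -ge 3 ] ;;   # GCC 11: 11.3 or newer
        ampere1:12)     [ "$minor" -ge 1 ] ;;   # GCC 12: 12.1 or newer
        neoverse-n1:*)  [ "$major" -ge 10 ] ;;  # every release from GCC 10 on
        ampere1:*)      [ "$major" -ge 13 ] ;;  # every release from GCC 13 on
        *)              false ;;                # value not covered by the table
    esac && echo yes || echo no
}

mcpu_supported ampere1 11.3.0
mcpu_supported ampere1 9.5
```

In a build script the version string could come from the compiler itself, for instance `mcpu_supported ampere1 "$(gcc -dumpfullversion)"`.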

Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=ampere1, -mcpu=neoverse-n1 or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options and measure whether each one improves performance compared to the baseline.

Summary of common GCC options:

  • -mcpu Recommended when building on Ampere Processors to enable processor specific tuning and optimizations. (See the “Setting the -mcpu option” section above for details.)

  • -Os Optimize to reduce code size, potentially useful if your application is limited by fetching instructions.

  • -O2 Considered the standard GCC optimization option and good to use as a baseline to compare with other GCC options.

  • -O3 Adds additional optimizations to generate more efficient code for loops, useful to try if your application performance is dominated by time spent in loops.

  • Profile Guided Optimization (PGO): -fprofile-generate & -fprofile-use. Generate profile data that the compiler will use to potentially make better decisions on optimizations such as inlining, loop optimizations and default branches. This is considered an advanced optimization as it requires changes to the build system, see below.

  • Link-Time Optimization (LTO): -flto. Enable link-time optimizations, allowing the compiler to optimize across individual source files. This allows functions to be inlined across source files, among other compiler optimizations. This is also considered an advanced optimization and potentially requires changes to the build system. This option increases overall build time, which can be dramatic for large applications. It’s possible to use LTO just on performance critical source files to potentially decrease build times.

VP9 Video Encoding Case Study with gcc -mcpu

VP9 is a video coding format developed by Google. libvpx is the open-source reference software implementation for the VP8 and VP9 video codecs from Google and the Alliance for Open Media (AOMedia). libvpx provides significant improvement in video compression over x264 at the expense of additional computation time. Additional information on VP9 and libvpx is available on Wikipedia.

In this case study, the VP9 build is configured to use the gcc -mcpu=native option to improve performance. As mentioned above, use the -mcpu option when compiling on Ampere Processors to enable CPU specific tuning and optimizations. Initially libvpx was built using the default configuration and then rebuilt using -mcpu=native. To evaluate VP9 performance, a 1080P input video file, original_videos_Sports_1080P_Sports_1080P-0063.mkv from YouTube’s User Generated Content Dataset, was used. See Ampere’s ffmpeg tuning and build guide for details on how to build ffmpeg and various codecs including VP9 for Ampere Processors.

Default libvpx Build:

$ git clone https://chromium.googlesource.com/webm/libvpx
$ cd libvpx/
$ export CFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure
$ make verbose=1
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

How to Optimize the libvpx Build with -mcpu=native

$ # rebuild with -mcpu=native
$ make clean
$ export CFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure
$ make verbose=1
$ # verify the build uses the sdot dot product instruction:
$ objdump -d vpxenc | grep sdot | wc -l
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

An investigation using Linux perf to measure CPU cycles showed that the functions taking the most time include vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon. The libvpx git repository shows these functions were optimized by Arm to use the Armv8.6-A USDOT (mixed-sign dot-product) instruction which is supported by Ampere Processors.

The CPU cycles spent in vpx_convolve8_horiz_neon were reduced from 6.07E+11 to 2.52E+11 by using gcc -mcpu=native to enable the dot product optimization on an Ampere Altra processor, a reduction by a factor of 2.4.

For vpx_convolve8_vert_neon, the CPU cycles were reduced from 2.46E+11 to 2.07E+11, a 16% reduction.

Overall, using -mcpu=native to enable the dot product instruction sped up transcoding the file original_videos_Sports_1080P_Sports_1080P-0063.mkv by 7% on an Ampere Altra processor by improving the application throughput. The following table shows data collected using the perf record and perf report utilities to measure CPU cycles and instructions retired.

Build Config    Symbol                     Cycles (%)   Cycles     Instructions (%)   Instructions
Default Build   vpx_convolve8_horiz_neon   8.72         6.07E+11   7.52               1.13E+12
Default Build   vpx_convolve8_vert_neon    3.53         2.46E+11   2.51               3.78E+11
Default Build   Overall Application        100          6.97E+10   100                1.48E+11
-mcpu=native    vpx_convolve8_horiz_neon   3.89         2.52E+11   3.87               5.71E+11
-mcpu=native    vpx_convolve8_vert_neon    3.19         2.07E+11   3.29               4.86E+11
-mcpu=native    Overall Application        100          6.48E+10   100                1.48E+11
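
The reductions quoted above follow directly from the cycle counts. As a minimal sketch, assuming awk is available, the arithmetic can be reproduced in the shell; the helper names factor and pct_reduction are made up for illustration and are not part of perf or libvpx.

```shell
# Hypothetical helpers for comparing two perf cycle counts.

# factor BEFORE AFTER: how many times smaller AFTER is than BEFORE.
factor() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

# pct_reduction BEFORE AFTER: percentage by which AFTER is lower than BEFORE.
pct_reduction() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.0f\n", (1 - after / before) * 100 }'
}

factor 6.07E+11 2.52E+11          # vpx_convolve8_horiz_neon cycles: prints 2.4
pct_reduction 2.46E+11 2.07E+11   # vpx_convolve8_vert_neon cycles: prints 16
pct_reduction 6.97E+10 6.48E+10   # overall application cycles: prints 7
```

The same helpers can be pointed at the baseline and candidate numbers of any other experiment, which keeps before/after comparisons consistent between runs.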

GCC Profile Guided Optimization

This section provides an overview of GCC’s Profile Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile Guided Optimizations enable GCC to make better optimization decisions, including optimizing branches, code block reordering, inlining functions and loop optimizations via loop unrolling, loop peeling and vectorization. Using PGO requires modifying the build environment to do a 3-part build.

  1. Build the application with profile instrumentation, gcc -fprofile-generate.
  2. Run the application on representative workloads to generate the profile data.
  3. Rebuild the application using the profile data, gcc -fprofile-use.
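
The three steps above can be sketched on a toy program. This is only an illustration under stated assumptions: the file names (demo.c, demo.o, demo) and the temporary directory are made up, gcc is assumed to be on the PATH (the block does nothing beyond writing the source file otherwise), and a real application would run its own representative workload in step 2.

```shell
# Toy walk-through of the 3-part PGO build. All file names are hypothetical.
workdir=$(mktemp -d)

cat > "$workdir/demo.c" <<'EOF'
#include <stdio.h>

int main(void) {
    long hits = 0;
    for (long i = 0; i < 1000000; i++)
        if (i % 100 != 0)   /* heavily biased branch for the profile to record */
            hits++;
    printf("%ld\n", hits);
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1; then
    (
        cd "$workdir" &&
        # Step 1: build with instrumentation.
        gcc -O2 -fprofile-generate -c demo.c -o demo.o &&
        gcc -fprofile-generate demo.o -o demo &&
        # Step 2: run a representative workload; this writes demo.gcda
        # next to the object file.
        ./demo > /dev/null &&
        # Step 3: rebuild, letting GCC read the collected profile.
        gcc -O2 -fprofile-use -c demo.c -o demo.o &&
        gcc demo.o -o demo &&
        ./demo
    ) || echo "PGO sketch did not complete" >&2
fi
```

Compiling and linking in separate steps keeps the profile file name predictable (demo.gcda beside demo.o), which is why the sketch does not build the executable in a single gcc invocation.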

A challenge of using PGO is the extremely high performance overhead in step 2 above. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. For more details, see the GCC manual’s Program Instrumentation Options section on building applications with run-time instrumentation, and the section Options That Control Optimization on rebuilding using the generated profile information.

As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications, and can improve performance by collecting improved profile data.

When to Use PGO?

With PGO, GCC can better optimize applications because it is given additional information, such as measurements of branches taken vs. not taken and loop trip counts. PGO is a useful optimization to try to see if it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured by the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate may also benefit from PGO, possibly related to mispredicted branches. In summary, a large percentage of branch mispredictions, CPU front-end stalls and L2 instruction cache misses are performance signatures where PGO can improve performance.
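
As a rough sketch of how the branch misprediction signature might be checked, the counters can be collected with perf stat and turned into a rate. The lower-case event names in the comment below are an assumption about how perf exposes these PMU counters on the target system (perf list shows what is actually available), ./app is a placeholder, and the helper mispredict_pct and its sample counts are made up for illustration.

```shell
# Collecting the counters on the target system (left as a comment because
# event availability depends on the kernel and CPU; ./app is a placeholder):
#   perf stat -e br_retired,br_mis_pred_retired,stall_frontend -- ./app

# Hypothetical helper: percentage of retired branches that were mispredicted.
# Usage: mispredict_pct MISPREDICTED_BRANCHES RETIRED_BRANCHES
mispredict_pct() {
    awk -v miss="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * miss / total }'
}

# Illustrative counts only, not measured data.
mispredict_pct 1500000 100000000
```

Comparing this rate for the default build and the PGO build, on the same workload, gives a direct check of whether the profile data helped branch prediction.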

MySQL Database GCC PGO Case Study

MySQL is the world’s most popular open-source database and, due to the huge MySQL binary size, is an ideal candidate for GCC PGO. Without PGO information, it’s impossible for GCC to correctly predict the many different code paths executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate and CPU front-end stalls on the Ampere Altra Max Processor.

Summarizing how MySQL was optimized using GCC PGO:

  1. sysbench was used to evaluate MySQL performance
  2. GCC PGO was trained using the MySQL MTR (mysql-test-run) test suite
  3. Sysbench’s oltp_point_select and oltp_read_only tests were used to measure performance of the PGO build compared to the default build
  4. The number of threads used was then varied from 1 to 1024, giving an average speedup of 29% for the oltp_point_select test and 20% for the oltp_read_only test on an Ampere Altra Max M128-30 processor
  5. With 64 threads, PGO improved performance by 32% by improving MySQL’s throughput

Additional details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.


Optimizing applications requires experimenting with different strategies to determine what works best. This paper provides recommendations for different GCC compiler optimizations to generate high performing applications running on Ampere Processors. It highlights using the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. Two case studies, for the MySQL database and the VP9 video encoder, show the use of GCC options to optimize these applications where performance is critical.

Built for sustainable cloud computing, Ampere’s first Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry. We invite you to learn more about our developer efforts and find best practices at developer.amperecomputing.com and join the conversation at community.amperecomputing.com.

