Copyright © 2012 Intel Corporation. All Rights Reserved.
Set the optimization level to -O2
A basic block is generated for each C statement. No scheduling is done between statements. No global optimizations are performed.
Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is, it contains many short statements containing IF statements and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2), although this case rarely occurs. The PGI compilers perform many different types of local optimizations, including but not limited to: algebraic identity removal, constant folding, common subexpression elimination, local register optimization, peephole optimizations, redundant load and store elimination, and strength reductions. Note that this is the default optimization level when no optimization flags are specified on the compilation command line.
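To make a few of the listed local optimizations concrete, the following C sketch pairs a straightforward function with a hand-transformed equivalent. The function names are hypothetical, and the second version only illustrates the kind of rewrite (constant folding, common subexpression elimination, strength reduction) that a compiler may apply within a basic block; it is not output from the PGI compilers.

```c
/* Straightforward version: a constant expression, a repeated
   subexpression (w + pad), and a multiplication by a power of two. */
unsigned area_naive(unsigned w, unsigned h)
{
    unsigned pad = 2 * 4;
    return (w + pad) * (h + pad) + (w + pad) * 8;
}

/* Hand-transformed equivalent of what local optimization may produce. */
unsigned area_optimized(unsigned w, unsigned h)
{
    unsigned t = w + 8;             /* 2 * 4 folded to 8; w + pad computed once */
    return t * (h + 8) + (t << 3);  /* t * 8 reduced to a shift */
}
```

Both functions return the same value for every input; unsigned arithmetic is used so that the shift is well defined for all values.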
Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch will vary from release to release depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular. The PGI compilers perform many different types of global optimizations, including but not limited to: branch to branch elimination, constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion, and induction variable elimination.
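Two of the global optimizations named above, invariant code motion and induction variable elimination, can be sketched by hand in C. The function names are hypothetical, and the second function is an illustration of the transformation, not actual compiler output.

```c
#include <stddef.h>

/* Straightforward version: scale * offset is recomputed on every
   iteration, and each index is derived from the product i * stride. */
void fill_naive(double *out, size_t n, size_t stride,
                double scale, double offset)
{
    for (size_t i = 0; i < n; i++)
        out[i * stride] = scale * offset + (double)i;
}

/* Hand-transformed equivalent. */
void fill_hoisted(double *out, size_t n, size_t stride,
                  double scale, double offset)
{
    double base = scale * offset;   /* invariant expression moved out of the loop */
    size_t idx = 0;                 /* running index replaces i * stride */
    for (size_t i = 0; i < n; i++) {
        out[idx] = base + (double)i;
        idx += stride;              /* an addition replaces the multiplication */
    }
}
```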
All level 1 and 2 optimizations are performed. In addition, this level enables more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable.
Performs all level 1, 2, and 3 optimizations and enables hoisting of guarded invariant floating point expressions.
Specify the type(s) of the target processor(s). The PGI compilers produce code specifically targeted to the type of processor on which the compilation is performed. In particular, the default is to use all supported instructions wherever possible when compiling on a given system. The default target processor is auto-selected depending on the processor on which the compilation is performed. You can specify a target processor to compile for a different processor type, such as to select a more generic processor, allowing the code to run on more system types. Specifying two or more target processors enables unified binary code generation, where two or more versions of each function may be generated, each version optimized for the specific instruction set available in each target processor. Executables created on a given system without the -tp flag may not be usable on previous generation systems. For example, executables created on an Intel Skylake processor may use instructions that are not available on earlier Intel Sandy Bridge systems. The following list contains the possible suboptions for -tp and the processors that each suboption is intended to target.
px - generate code that is usable on any x86-64 processor-based system.
bulldozer - generate code for AMD Bulldozer and compatible processors.
piledriver - generate code that is usable on any AMD Piledriver processor-based system.
zen - generate code that is usable on any AMD Zen processor-based system (Epyc, Ryzen).
sandybridge - generate code for Intel Sandy Bridge and compatible processors.
haswell - generate code that is usable on any Intel Haswell processor-based system.
knl - generate code that is usable on any Intel Knights Landing processor-based system.
skylake - generate code that is usable on an Intel Skylake Xeon processor-based system.
Use the -mp option to instruct the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and to generate an executable file which utilizes multiple processors in a shared-memory parallel system. The suboptions are one or more of the following:
align - Forces loop iterations to be allocated to OpenMP processes using an algorithm that maximizes alignment of vector sub-sections in loops that are both parallelized and vectorized for SSE. This allocation can improve performance in program units that include many such loops. It can also result in load-balancing problems that significantly decrease performance in program units with relatively short loops that contain a large amount of work in each iteration. The numa suboption uses libnuma on systems where it is available.
allcores - Instructs the compiler to target all available cores. You specify this suboption at link time.
bind - Instructs the compiler to bind threads to cores. You specify this suboption at link time.
[no]numa - Uses [does not use] libnuma on systems where it is available.
For a detailed description of this programming model and the associated directives, refer to Section 9, Using OpenMP, of the PGI Compiler User's Guide. To set this option in PVF, use the Fortran | Language | Enable OpenMP Directives property, described in Enable OpenMP Directives.
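As an illustration of a user-inserted OpenMP directive of the kind this option interprets, the sketch below marks a dot-product loop as a parallel work-sharing loop with a reduction. The function name dot is hypothetical. When OpenMP is not enabled, a C compiler ignores the pragma and the loop runs serially with the same result.

```c
#include <stddef.h>

/* Dot product of two vectors. The directive asks the compiler to split
   the iterations across threads and combine the per-thread sums. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```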
Description You can use this option to explicitly compile for and link to the static version of the PGI runtime libraries. Note: On Linux, -Bstatic_pgi results in code that runs on most Linux systems without requiring a Portability package. For more information on using static libraries on Windows, refer to 'Creating and Using Static Libraries on Windows' in the 'Creating and Using Libraries' section of the PGI Compiler User's Guide.
Description Use this option to specify the 64-bit compiler as the default processor type.
Controls whether Fortran 95 or Fortran 2003 semantics are used in allocatable array assignments. The default behavior is to use Fortran 95 semantics; the 03 option instructs the compiler to use Fortran 2003 semantics.
Instructs the compiler to treat "*" as a synonym for standard input for reading and standard output for writing.
Instructs the compiler to perform certain optimizations and to disallow stride 0 array references.
Instructs the compiler to convert all identifiers to lower case. This selection affects the linking process. If you compile and link the same source code using -Mupcase on one occasion and -Mnoupcase on another, you may get two different executables, depending on whether the source contains uppercase letters. The standard libraries are compiled using -Mnoupcase.
Instructs the compiler to allow the asm keyword in C source files. The syntax of the asm statement is as follows: asm("statement"); where statement is a legal assembly-language statement. The quote marks are required. Note: The current default is to support gcc's extended asm, where the syntax of extended asm includes asm strings. The -M[no]asmkeyword switch is useful only if the target device is a Pentium 3 or older CPU type (-tp piii|p6|k7|athlon|athlonxp|px).
Instructs the compiler to convert float parameters to double parameters in non-prototyped functions.
Enables vectorization with SIMD instructions, cache alignment, and flushz for 64-bit targets.
Description
When you use this option, a generally optimal set of options is chosen for targets that support SIMD capability. In addition, the appropriate -tp option is automatically included to enable generation of code optimized for the type of system on which compilation is performed. This option enables vectorization with SIMD instructions, cache alignment, and flushz.
Note: Auto-selection of the appropriate -tp option means that programs built using the -fastsse option on a given system are not necessarily backward-compatible with older systems.
Note: C/C++ compilers enable -Mautoinline with -fast.
Interprocedural Analysis.
Instructs the compiler not to generate critical section calls around Fortran I/O statements.
(pgf77, pgf95, and pgfortran only) The compiler does not promote the intrinsics CMPLX and REAL to DCMPLX and DBLE, respectively.
Instructs the compiler not to assume that all local variables are subject to the SAVE statement.
Enables partial redundancy elimination.
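Partial redundancy elimination targets an expression that is recomputed on some, but not all, paths through the code. The C sketch below shows the idea by hand, with hypothetical function names; the second function illustrates the transformation and is not compiler output.

```c
/* a + b is computed on the flag path and again in the return statement,
   so the second computation is redundant on only one of the two paths. */
int pre_naive(int a, int b, int flag)
{
    int x = 0;
    if (flag)
        x = a + b;
    return x + (a + b);
}

/* Hand-transformed equivalent: a + b is inserted on the path that lacked
   it, which makes the later use fully redundant and lets one temporary
   serve both paths. */
int pre_transformed(int a, int b, int flag)
{
    int t;
    int x = 0;
    if (flag) {
        t = a + b;
        x = t;
    } else {
        t = a + b;  /* inserted computation */
    }
    return x + t;
}
```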
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
AMD LibM is a software library containing a collection of basic math functions optimized for x86-64 processor-based machines. It provides many routines from the list of standard C99 math functions. Applications can link to the AMD LibM library and invoke its math functions instead of the compiler's math functions for better accuracy and performance.
FFTW is a comprehensive collection of fast C routines for computing the Discrete Fourier Transform (DFT) and various special cases thereof. It is an open-source implementation of the Fast Fourier transform algorithm. It can compute transforms of real and complex-valued arrays of arbitrary size and dimension. An AMD optimized FFTW that includes selective kernels and routines optimized for the AMD EPYC™ processor family is available. Source code is available on GitHub: https://github.com/amd/amd-fftw
Enables loop-carried redundancy elimination, an optimization that can reduce the number of arithmetic operations and memory references in loops.
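A loop-carried redundancy arises when a value loaded or computed in one iteration is needed again in the next. The C sketch below shows the idea by hand with hypothetical function names; the second function carries the value in a local variable instead of reloading it, which is the kind of saving this optimization aims for.

```c
#include <stddef.h>

/* Each iteration loads in[i] and in[i + 1]; the element read as
   in[i + 1] is read again as in[i] on the following iteration. */
void pair_sum_naive(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        out[i] = in[i] + in[i + 1];
}

/* Hand-transformed equivalent: one load per iteration, with the previous
   element carried across iterations in a local variable. */
void pair_sum_carried(double *out, const double *in, size_t n)
{
    if (n < 2)
        return;
    double prev = in[0];
    for (size_t i = 0; i + 1 < n; i++) {
        double next = in[i + 1];
        out[i] = prev + next;
        prev = next;
    }
}
```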
(pgf77, pgf95, and pgfortran only) The compiler does not promote the intrinsics CMPLX and REAL to DCMPLX and DBLE, respectively.
Specifies signed char characters. The compiler treats "plain" char declarations as signed char.
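In C, whether plain char is signed or unsigned is implementation-defined, which is why an option like this exists. The sketch below uses the explicitly signed and unsigned types to show the difference in behavior that the choice makes when a byte value is widened to int. The function names are hypothetical, and the result of the signed conversion assumes the usual two's-complement behavior.

```c
/* Widen a byte through signed char: values above 127 become negative
   on two's-complement implementations. */
int widen_signed(unsigned char byte)
{
    return (signed char)byte;
}

/* Widen a byte through unsigned char: the value is preserved. */
int widen_unsigned(unsigned char byte)
{
    return (unsigned char)byte;
}
```

With plain char treated as signed char, code that stores 0xFF in a char and compares it against 255 would see -1 instead, as widen_signed does.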
Instructs the inliner to perform 1 level of inlining.
BLIS is a portable open-source software framework for instantiating high-performance Basic Linear Algebra Subprograms (BLAS)-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. Select kernels have been optimized for the AMD EPYC™ processor family by AMD and others. Source code is available on GitHub: https://github.com/amd/blis.
Instructs the C/C++ compiler to override data dependencies between pointers of a given storage class.
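The data dependencies in question come from the possibility that two pointers refer to overlapping storage. The C sketch below shows the same loop twice, with hypothetical function names. The second version uses the standard C99 restrict qualifier, which is a different mechanism from this option, to state a no-overlap promise in the source for two specific pointers; it is included only to show the kind of assumption involved.

```c
#include <stddef.h>

/* Without further information the compiler must assume dst and src may
   overlap, which limits reordering of the loads and stores. */
void scale_may_alias(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* restrict promises that dst and src do not overlap, so the iterations
   can be treated as independent. Calling this with overlapping arrays
   is undefined behavior. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```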
Enable auto-concurrentization of loops. Multiple processors or cores will be used to execute parallelizable loops.
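Whether a loop is parallelizable depends on whether its iterations are independent. The C sketch below contrasts the two cases with hypothetical function names; it makes no claim about which loops a particular compiler release actually parallelizes.

```c
#include <stddef.h>

/* Every iteration writes a different element and reads only inputs, so
   the iterations are independent and can run on different cores. */
void add_arrays(double *restrict c, const double *restrict a,
                const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Each iteration reads the result of the previous one. This loop-carried
   dependence prevents running the iterations concurrently as written. */
void prefix_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```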
Instrument the generated code and link in libraries for dynamic collection of profile and data information at runtime.
Enable profile feedback information. The nopfo option is valid only immediately following the inline suboption. -Mipa=inline,nopfo tells IPA to ignore PFO information when deciding what functions to inline, if PFO information is available.
Instructs the compiler to assume input source files are in Fortran 90/95 freeform format.
pointer
For purposes of optimization, it is assumed that pointer-based variables do not overlay the storage of any other variable.
(For use only on 64-bit Linux targets) Generates code for the medium memory model in the linux86-64 execution environment. Implies -Mlarge_arrays.
Default: The compiler generates code for the small memory model.
Usage
The following command line requests position independent code be generated, and the option -mcmodel=medium be passed to the assembler and linker:
$ pgfortran -mcmodel=medium myprog.f
Description
The default small memory model of the linux86-64 environment limits the combined area for a user's object or executable to 1GB,
with the Linux kernel managing usage of the second 1GB of address for system routines, shared libraries, stacks, and so on.
Programs are started at a fixed address, and the program can use a single instruction to make most memory references.
The medium memory model allows for larger than 2GB data areas, or .bss sections. Program units compiled using either -mcmodel=medium
or -fpic require additional instructions to reference memory.
The effect on performance is a function of the data-use of the application. The -mcmodel=medium switch must be used at both compile
time and link time to create 64-bit executables. Program units compiled for the default small memory model can be linked into medium
memory model executables as long as they are compiled with the option -fpic, or position-independent.
The linux86-64 environment provides static libxxx.a archive libraries, that are built both with and without -fpic, and dynamic
libxxx.so shared object libraries that are compiled with -fpic. Using the link switch -mcmodel=medium implies the -fpic switch
and utilizes the shared libraries by default. The directory $PGI/linux86-64/
memory model codes; and the directory $PGI/linux86-64/
-mcmodel=medium executables.
Note:
The combination -mcmodel=medium -fpic cannot be used to create shared libraries. However, you can create static archive libraries (.a) that are -fpic.
Invokes the PGI C compiler.
Invokes the PGI Fortran compiler.
Invokes the PGI C++ compiler.
KMP_AFFINITY
The KMP_AFFINITY environment variable uses the following general syntax:
Syntax |
---|
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>] |
For example, to list a machine topology map, specify KMP_AFFINITY=verbose,none to use a modifier of verbose and a type of none.
The following table describes the supported specific arguments.
Argument | Default | Description |
---|---|---|
modifier | noverbose respect granularity=core | Optional. String consisting of keyword and specifier. |
type | none | Required string. Indicates the thread affinity to use. The logical and physical types are deprecated but supported for backward compatibility. |
permute | 0 | Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. |
offset | 0 | Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. |
Type is the only required argument.
Does not bind OpenMP threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology. Specify KMP_AFFINITY=verbose,none to list a machine topology map.
Specifying compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n>th OpenMP thread was placed. For example, in a topology map, the nearer a node is to the root, the more significance the node has when sorting the threads.
Specifying disabled completely disables the thread affinity interfaces. This forces the OpenMP run-time library to behave as if the affinity interface was not supported by the operating system. This includes the low-level API interfaces such as kmp_set_affinity and kmp_get_affinity, which have no effect and will return a nonzero error code.
Specifying explicit assigns OpenMP threads to a list of OS proc IDs that have been explicitly specified by using the proclist= modifier, which is required for this affinity type.
Specifying scatter distributes the threads as evenly as possible across the entire system. scatter is the opposite of compact; so the leaves of the node are most significant when sorting through the machine topology map.
Types logical and physical are deprecated and may become unsupported in a future release. Both are supported for backward compatibility.
For logical and physical affinity types, a single trailing integer is interpreted as an offset specifier instead of a permute specifier. In contrast, with compact and scatter types, a single trailing integer is interpreted as a permute specifier.
Specifying logical assigns OpenMP threads to consecutive logical processors, which are also called hardware thread contexts. The type is equivalent to compact, except that the permute specifier is not allowed. Thus, KMP_AFFINITY=logical,n is equivalent to KMP_AFFINITY=compact,0,n (this equivalence is true regardless of whether a granularity=fine modifier is present).
For both compact and scatter, permute and offset are allowed; however, if you specify only one integer, the compiler interprets the value as a permute specifier. Both permute and offset default to 0.
The permute specifier controls which levels are most significant when sorting the machine topology map. A value for permute forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.
The offset specifier indicates the starting position for thread assignment.
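The rule for a single trailing integer can be written down as a small C function. This is a sketch of the rule as stated in the text, not the OpenMP run-time library's parser; the struct and function names are hypothetical.

```c
#include <string.h>

struct affinity_ints {
    int permute;
    int offset;
};

/* Decide what a single trailing integer means for a given affinity type.
   Returns 0 on success, or -1 for types where this sketch applies no rule
   (for example explicit, none, or disabled). */
int assign_single_integer(const char *type, int value,
                          struct affinity_ints *out)
{
    out->permute = 0;   /* both specifiers default to 0 */
    out->offset = 0;
    if (strcmp(type, "compact") == 0 || strcmp(type, "scatter") == 0) {
        out->permute = value;   /* compact, scatter: permute specifier */
        return 0;
    }
    if (strcmp(type, "logical") == 0 || strcmp(type, "physical") == 0) {
        out->offset = value;    /* logical, physical: offset specifier */
        return 0;
    }
    return -1;
}
```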
Modifiers are optional arguments that precede type. If you do not specify a modifier, the noverbose, respect, and granularity=core modifiers are used automatically.
Modifiers are interpreted in order from left to right, and can negate each other. For example, specifying KMP_AFFINITY=verbose,noverbose,scatter is equivalent to setting KMP_AFFINITY=noverbose,scatter, or just KMP_AFFINITY=scatter.
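The left-to-right rule can be sketched in C for the verbose and noverbose pair. The function name is hypothetical and the code is not the run-time library's parser: it splits the value on commas, ignores every token other than these two modifiers, and lets the last occurrence win.

```c
#include <string.h>

/* Returns 1 if verbose is in effect after reading the comma-separated
   tokens from left to right, and 0 otherwise. The default is noverbose. */
int verbose_in_effect(const char *value)
{
    int verbose = 0;
    const char *p = value;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current token */
        if (len == 7 && strncmp(p, "verbose", 7) == 0)
            verbose = 1;
        else if (len == 9 && strncmp(p, "noverbose", 9) == 0)
            verbose = 0;
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return verbose;
}
```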
Does not print verbose messages.
Prints messages concerning the supported affinity. The messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts.
Information about binding OpenMP threads to physical thread contexts is indirectly shown in the form of the mappings between hardware thread contexts and the operating system (OS) processor (proc) IDs. The affinity mask for each OpenMP thread is printed as a set of OS processor IDs.
KMP_LIBRARY
KMP_LIBRARY = { throughput | turnaround | serial }. Selects the OpenMP run-time library execution mode. The options for the variable value are throughput, turnaround, and serial.
The compiler with OpenMP enables you to run an application under different execution modes that can be specified at run time. The libraries support the serial, turnaround, and throughput modes.
The serial mode forces parallel applications to run on a single processor.
In a dedicated (batch or single user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively utilize all of the processors all of the time. The turnaround mode is designed to keep active all of the processors involved in the parallel computation in order to minimize the execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads.
Avoid over-allocating system resources. This occurs if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs.
In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads while waiting for more parallel work.
The throughput mode is designed to make the program aware of its environment (that is, the system load) and to adjust its resource usage to produce efficient execution in a dynamic environment. This mode is the default.
KMP_BLOCKTIME
KMP_BLOCKTIME = value. Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping. Use the optional character suffixes: s (seconds), m (minutes), h (hours), or d (days) to specify the units. Specify infinite for an unlimited wait time.
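The unit rules can be illustrated with a small C function that converts a KMP_BLOCKTIME-style value to milliseconds. This is a sketch of the rules as described here, not the run-time library's parser; the function name and the negative return codes are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Converts a value such as "200", "2s", or "1m" to milliseconds.
   Returns -1 for "infinite" and -2 for input this sketch cannot parse. */
long long blocktime_ms(const char *value)
{
    if (strcmp(value, "infinite") == 0)
        return -1;

    char *end;
    long long n = strtoll(value, &end, 10);
    if (end == value || n < 0)
        return -2;                      /* no digits, or a negative number */

    switch (*end) {
    case '\0': return n;                /* no suffix: milliseconds */
    case 's':  n *= 1000LL;         break;
    case 'm':  n *= 60LL * 1000;    break;
    case 'h':  n *= 3600LL * 1000;  break;
    case 'd':  n *= 86400LL * 1000; break;
    default:   return -2;               /* unknown suffix */
    }
    return end[1] == '\0' ? n : -2;     /* reject trailing characters */
}
```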
KMP_STACKSIZE
KMP_STACKSIZE = value. Sets the number of bytes to allocate for each OpenMP* thread to use as the private stack for the thread. Recommended size is 16m. Use the optional suffixes: b (bytes), k (kilobytes), m (megabytes), g (gigabytes), or t (terabytes) to specify the units. This variable does not affect the native operating system threads created by the user program nor the thread executing the sequential part of an OpenMP* program or parallel programs created using -parallel.
OMP_NUM_THREADS
Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel. Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
OMP_DYNAMIC
OMP_DYNAMIC={ 1 | 0 } Enables (1, true) or disables (0, false) the dynamic adjustment of the number of threads.
OMP_SCHEDULE
OMP_SCHEDULE={ type[,chunk size] } Controls the scheduling of the for-loop work-sharing construct. type can be one of static, dynamic, guided, or runtime. chunk size should be a positive integer.
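In OpenMP, a work-sharing loop picks up the OMP_SCHEDULE setting when it is written with the schedule(runtime) clause. The C sketch below shows such a loop; the function name is hypothetical. When OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```c
/* Sum of the squares 1..n. With OpenMP enabled, the way iterations are
   divided among threads is taken from OMP_SCHEDULE at run time. */
long sum_squares(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += (long)i * i;
    return total;
}
```

For example, running the program after export OMP_SCHEDULE="dynamic,4" requests dynamic scheduling in chunks of 4 iterations.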
OMP_NESTED
OMP_NESTED={ 1 | 0 } Enables creation of new teams in case of nested parallel regions (1, true) or serializes (0, false) all nested parallel regions. Default is 0.