Don't include Fortran main program object module.
Chooses generally optimal flags for the target platform.
Allocate host data directly in CPU physical (pinned) memory in place of using pinned memory buffers. Allocation cost may be higher, but data transfer using pinned memory is often faster. Useful with programs having few allocations but many data transfers between the host and device.
Generate GPU device code targeting NVIDIA devices with compute capability 8.0.
Generate GPU device code using the CUDA 11.0 toolchain and libraries.
Limit the device register count per thread to a specified value.
Disable C++ exception handling support.
Disable C++ run time type information support.
Generate zero-overhead C++ exception handlers.
Disable loop autoparallelization within an acc parallel region.
Enable OpenACC directives targeting NVIDIA GPUs.
Enable OpenACC directives targeting multicore CPUs.
Enable OpenACC directives.
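As a rough illustration of the kind of source these options act on, the sketch below annotates a SAXPY loop with an OpenACC `parallel loop` directive. The function name `saxpy` and the data clauses are illustrative assumptions, not taken from this document. When OpenACC support is not enabled, the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>

/* Hypothetical example: y = a*x + y over n elements.
   With OpenACC enabled the loop may be offloaded or parallelized;
   without it the pragma is ignored and the loop runs serially. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```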
Inline functions declared with the inline keyword.
Disable inlining of functions declared with the inline keyword. <Default>
Align "unconstrained" data objects of size greater than or equal to 16 bytes on cache-line boundaries. An "unconstrained" object is a variable or array that is not a member of an aggregate structure or common block, is not allocatable, and is not an automatic array. On by default on 64-bit Linux systems.
Align doubles on double alignment boundaries.
Do not align doubles on double alignment boundaries. <Default>
Set SSE to flush-to-zero mode; if a floating-point underflow occurs, the value is set to zero.
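To make the effect of flush-to-zero mode concrete, the sketch below checks whether an underflowing result is kept as a subnormal value (gradual underflow, the usual default) or replaced by zero. The function name is hypothetical, and the code assumes IEEE 754 doubles.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if halving the smallest normal double yields a subnormal
   (gradual underflow), 0 if the result was flushed to zero or otherwise
   not classified as subnormal. */
int underflow_is_gradual(void)
{
    volatile double tiny = DBL_MIN;      /* smallest positive normal double */
    volatile double half = tiny / 2.0;   /* underflows below the normal range */
    return fpclassify(half) == FP_SUBNORMAL;
}
```

In a build without flush-to-zero the function is expected to return 1. With flush-to-zero active for the relevant instructions, `half` would instead be zero.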
Generate code to set up a stack frame. <Default>
Eliminates operations that set up a true stack frame pointer for every function. With this option enabled, you cannot perform a traceback on the generated code and you cannot access local variables.
Instructs the compiler to use relaxed precision in the calculation of floating-point reciprocal square root (1/sqrt). Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to use relaxed precision in the calculation of floating-point square root. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to use relaxed precision in the calculation of floating-point division. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to use relaxed precision for intrinsics. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to allow floating-point expression reordering, including factoring. Can result in improved performance at the expense of numerical accuracy.
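The accuracy cost of reordering comes from the fact that floating-point addition is not associative. The sketch below, with hypothetical helper names and assuming IEEE 754 doubles, evaluates the same three-term sum under two groupings. The `volatile` temporaries are there only to pin the grouping as written.

```c
/* (a + b) + c, evaluated strictly left to right. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* volatile keeps this grouping intact */
    return t + c;
}

/* a + (b + c), the regrouped form a reordering compiler might choose. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With `a = 1e17`, `b = -1e17`, `c = 1.0`, `sum_left` returns 1.0 while `sum_right` returns 0.0, because `-1e17 + 1.0` rounds back to `-1e17`.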
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to extend the sign bit that is set as a result of an object's conversion from one data type to an object of a larger signed data type.
The numerical method used when computing the residual iterations of a vectorized (SIMD) loop may differ from the one used in the vectorized loop itself. Using this option may lead to faster but less numerically consistent results.
Allow expression re-association; specifying this sub-option can increase opportunities for loop-carried redundancy elimination.
Disable expression re-association.
Enables loop-carried redundancy elimination, an optimization that can reduce the number of arithmetic operations and memory references in loops.
Disable loop-carried redundancy elimination. <Default>
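One way to picture loop-carried redundancy elimination is to apply it by hand. In the sketch below (function names are hypothetical), the naive loop squares `a[i + 1]` in one iteration and the same element again as `a[i]` in the next. The second version carries that value across iterations instead. This only illustrates the idea and is not a description of what the compiler actually emits.

```c
#include <stddef.h>

/* Naive form: every iteration loads two elements and squares both. */
void pair_sq_naive(size_t n, const double *a, double *out)
{
    for (size_t i = 0; i + 1 < n; ++i)
        out[i] = a[i] * a[i] + a[i + 1] * a[i + 1];
}

/* Hand-transformed form: the square of a[i + 1] computed in one iteration
   is reused as the square of a[i] in the next, saving a load and a multiply. */
void pair_sq_carried(size_t n, const double *a, double *out)
{
    if (n < 2)
        return;
    double cur = a[0] * a[0];
    for (size_t i = 0; i + 1 < n; ++i) {
        double next = a[i + 1] * a[i + 1];
        out[i] = cur + next;
        cur = next;   /* carried into the next iteration */
    }
}
```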
Print compiler feedback messages for acceleration.
Print compiler feedback messages for OpenMP.
Print compiler feedback messages.
Instructs the inliner to inline the functions within the library filename.ext.
Instructs the inliner to inline all eligible functions except foo, a function in the source text. Multiple functions can be listed, comma-separated.
Instructs the inliner to inline function func.
Allows inlining in Fortran even when array shapes do not match.
Instructs the inliner to inline functions with N or fewer statements where N is a supplied constant value.
Limit inlining to total size of n.
Inline only functions smaller than n.
Always inline functions smaller than n.
Instructs the inliner to perform N levels of inlining where N is a supplied constant value. If no value is supplied, then the default value of 2 is used.
Instructs the inliner to perform 1 level of inlining.
Place automatic arrays on the stack.
Assume all pointers and arrays are independent and safe for aggressive optimizations, and in particular that no pointers or arrays overlap or conflict with each other.
Instructs the compiler that arrays and pointers are treated with the same copyin and copyout semantics as Fortran dummy arguments.
Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.
Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.
Instructs the compiler that static pointers and arrays do not overlap or conflict with each other and are independent.
Instructs the compiler that global or external pointers and arrays do not overlap or conflict with each other and are independent.
Instructs the C/C++ compiler to override data dependencies between pointers of a given storage class.
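The pointer-independence assertions above are only safe when they are true of the program. The sketch below uses a hypothetical function to show a loop whose result depends on whether its two pointers overlap, which is why a compiler must be conservative unless told otherwise.

```c
#include <stddef.h>

/* dst[i] = src[i] + 1, executed strictly in index order.
   If dst and src overlap, later iterations read values written by
   earlier ones, so the iterations are not independent. */
void add_one(size_t n, int *dst, const int *src)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1;
}
```

For a zero-initialized `int buf[5]`, the overlapping call `add_one(4, buf + 1, buf)` produces `{0, 1, 2, 3, 4}` under in-order semantics. Code generated under the assumption that `dst` and `src` never overlap could legitimately compute something else for this call, so such assertions should be made only for code that never passes overlapping arguments.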
Instructs the compiler to completely unroll loops with a constant loop count of less than or equal to N where N is a supplied constant value. If no constant value is given, then a default of 4 is used. A value of 1 inhibits the complete unrolling of loops with constant loop counts.
"-Munroll=n:n" instructs the compiler to unroll loops N times where N is a supplied constant value. If no constant value is given, then a default of 4 is used.
"-Munroll=m:n" instructs the compiler to unroll loops with multiple blocks N times where N is a supplied constant value. If no constant value is given, then a default of 4 is used.
Instructs the compiler to unroll loops with multiple blocks 4 times, the default value.
Invokes the loop unroller.
Disable loop unrolling. <Default>
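For readers unfamiliar with the transformation, the sketch below unrolls a summation loop four times by hand, with a remainder loop for trip counts that are not a multiple of 4. The function name is hypothetical, and the code the compiler's unroller generates may be structured differently.

```c
#include <stddef.h>

/* Sum of n ints, with the loop body replicated 4 times. */
long sum_unrolled4(size_t n, const int *a)
{
    long s = 0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one loop test per four adds. */
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop: the 0 to 3 elements left over. */
    for (; i < n; ++i)
        s += a[i];

    return s;
}
```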
Don't check dependence relations for vector or parallel code.
Treat plain char as signed.
Treat plain char as unsigned.
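The signedness of plain `char` changes observable behavior when a `char` holding a byte with the high bit set is widened to `int`. The hypothetical helpers below show the difference. The first assumes an 8-bit `char` and the wrap-around conversion used by mainstream implementations.

```c
#include <limits.h>

/* Value of a plain char holding the bit pattern 0xFF after widening to int:
   -1 where plain char is signed, 255 where it is unsigned. */
int plain_char_ff(void)
{
    char c = (char)0xFF;
    return (int)c;
}

/* 1 if plain char is signed in this build, 0 if unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```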
Align large objects on quad-word boundaries.
Disable an optional post-pass instruction scheduling. <Default>
Disable automatic vector pipelining. <Default>
Instructs the vectorizer to generate alternate code for vectorized loops when appropriate. For each vectorized loop the compiler decides whether to generate altcode and what type or types to generate, which may be any or all of:
The compiler also determines suitable loop count and array alignment conditions for executing the altcode.
Disables alternate code generation for vectorized loops.
Instructs the vectorizer to enable certain associativity conversions that can change the result of a computation due to round-off error. A typical optimization replaces an arithmetic operation with one that is mathematically equivalent but can produce a different computed result because of round-off error.
Instructs the vectorizer to disable associativity conversions.
Instructs the vectorizer, when performing cache tiling optimizations, to assume a cache size of N. The default size is processor dependent.
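Cache tiling restructures a loop nest so that it works on blocks small enough to stay in cache. The sketch below tiles a matrix transpose by hand. The tile edge `TILE` and the function name are hypothetical choices for illustration and say nothing about the sizes the vectorizer derives from the assumed cache size.

```c
#include <stddef.h>

enum { TILE = 32 };   /* hypothetical tile edge, in elements */

/* Transpose an n x n row-major matrix, visiting it in TILE x TILE blocks
   so that the rows of `in` and the columns of `out` touched by one block
   can stay cache-resident while the block is processed. */
void transpose_tiled(size_t n, const double *in, double *out)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t imax = (ii + TILE < n) ? ii + TILE : n;
            size_t jmax = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < imax; ++i)
                for (size_t j = jj; j < jmax; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```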
Instructs the vectorizer to enable loop fusion.
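Loop fusion merges adjacent loops over the same iteration space into one. The hand-written before-and-after sketch below uses hypothetical function names and assumes the three arrays do not overlap, which is what makes the two forms equivalent.

```c
#include <stddef.h>

/* Two passes: a is written in full, then read back to produce b. */
void scale_then_offset(size_t n, const double *x, double *a, double *b)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; ++i)
        b[i] = a[i] + 1.0;
}

/* Fused: one pass, and a[i] is consumed right after it is produced. */
void scale_and_offset_fused(size_t n, const double *x, double *a, double *b)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] = 2.0 * x[i];
        b[i] = a[i] + 1.0;
    }
}
```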
Instructs the vectorizer to disable vectorization of indirect array references.
Instructs the vectorizer to enable idiom recognition.
Instructs the vectorizer to disable idiom recognition.
Generate vector loops for all loops where possible regardless of the number of statements in the loop. This overrides a heuristic in the vectorizer that ordinarily prevents vectorization of loops with a number of statements that exceed a certain threshold.
Instructs the vectorizer to generate partial vectorization.
Instructs the vectorizer to generate prefetch instructions.
Enables generation of packed SSE instructions for short vector operations that arise from scalar code outside of loops or within the body of a loop iteration.
Instructs the vectorizer to search for vectorizable loops and, where possible, make use of SSE, SSE2, and prefetch instructions.
Instructs the driver to disable the -Mvect=sse option which is part of the "-fast" option.
Enable automatic vector pipelining.
Disables -Ktrap=fp.
-Ktrap is only processed by the compilers when compiling main functions or programs. The options inv, denorm, divz, ovf, unf, and inexact correspond to the processor's exception mask bits invalid operation, denormalized operand, divide-by-zero, overflow, underflow, and precision, respectively. Normally, the processor's exception mask bits are on (floating-point exceptions are masked; the processor recovers from the exceptions and continues). If a floating-point exception occurs and its corresponding mask bit is off (or unmasked), execution terminates with an arithmetic exception (C's SIGFPE signal). -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.
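The masked (default) behavior can be observed directly. In the sketch below, with hypothetical function names and assuming IEEE 754 arithmetic, a divide-by-zero and an invalid operation both produce special values and execution simply continues. With the corresponding traps unmasked, the same operations would be expected to raise SIGFPE instead.

```c
#include <math.h>

/* Masked divide-by-zero: yields an infinity and the program keeps running. */
int div_by_zero_yields_inf(void)
{
    volatile double zero = 0.0;      /* volatile prevents constant folding */
    volatile double q = 1.0 / zero;
    return isinf(q) != 0;
}

/* Masked invalid operation: 0/0 yields a NaN and the program keeps running. */
int invalid_yields_nan(void)
{
    volatile double zero = 0.0;
    volatile double q = zero / zero;
    return isnan(q) != 0;
}
```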
Use the -mp option to instruct the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and generate an executable file which will utilize multiple processors in a shared-memory parallel system. When used strictly as a linker flag, the NVHPC OpenMP runtime will be linked and users can use the environment variables MP_BIND and MP_BLIST to bind a serial program to a CPU.
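A small example of a user-inserted OpenMP directive follows. The function name `dot` is hypothetical. When OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result, up to floating-point summation order in the general case.

```c
#include <stddef.h>

/* Dot product with an OpenMP reduction: each thread accumulates a private
   partial sum, and the partial sums are combined when the loop ends. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```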
(For use only on 64-bit Linux targets) Generate code for the medium memory model in the linux86-64 execution environment. The default small memory model of the linux86-64 environment limits the combined area for a user's object or executable to 1GB, with the Linux kernel managing usage of the second 1GB of address for system routines, shared libraries, stacks, etc. Programs are started at a fixed address, and the program can use a single instruction to make most memory references. The medium memory model allows for larger than 2GB data areas, or .bss sections. Program units compiled using either -mcmodel=medium or -fpic require additional instructions to reference memory. The effect on performance is a function of the data-use of the application. The -mcmodel=medium switch must be used at both compile time and link time to create 64-bit executables. Program units compiled for the default small memory model can be linked into medium memory model executables as long as they are compiled -fpic, or position-independent.
Enable support for 64-bit indexing and single static data objects larger than 2GB in size. This option is default in the presence of -mcmodel=medium. It can also be used on its own with the default small memory model for certain 64-bit applications that manage their own memory space.
Enable optimizations using ANSI C type-based pointer disambiguation.
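Type-based disambiguation lets the compiler assume that pointers to unrelated types, such as `float *` and `uint32_t *`, do not refer to the same object, so code that reinterprets memory through a cast pointer can break under it. A sketch of a pattern that stays well defined is shown below. The function name is hypothetical, and the example assumes a 32-bit IEEE 754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of a float. Copying the bytes with memcpy avoids
   reading a float object through a uint32_t lvalue, which the aliasing
   rules do not permit. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```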
A basic block is generated for each C statement. No scheduling is done between statements. No global optimizations are performed.
Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is, it contains many short statements containing IF statements and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2), although this case rarely occurs.
The NVHPC compilers perform many different types of local optimizations, including but not limited to:
Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch will vary from release to release depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.
The NVHPC compilers perform many different types of global optimizations, including but not limited to:
Statically link with the NVIDIA runtime libraries. System libraries may still be dynamically linked.
Link with static libraries.
Link with dynamic libraries.
Print the compiler version.
Use C++ 17 language features.
Use C++ 14 language features.
Swap byte-order for unformatted input/output.
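Byte-order swapping reverses the bytes of each multi-byte value, converting between big-endian and little-endian representations. The sketch below shows the operation for a single 32-bit word. The function name is hypothetical, and the Fortran runtime's actual implementation is not described in this document.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

For example, `swap_bytes32(0x11223344u)` returns `0x44332211u`, and applying the function twice returns the original value.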
The OpenMPI C++ driver configured for use with the NVIDIA HPC C++ compiler (nvc++).
The OpenMPI C driver configured for use with the NVIDIA HPC C compiler (nvc).
The OpenMPI Fortran driver configured for use with the NVIDIA HPC Fortran compiler (nvfortran).
Disable warning messages.
Include the NVIDIA Fortran runtime libraries when linking.