The PGI C compiler for Linux.
The PGI Fortran compiler for Linux.
The PGI C compiler for Linux.
The PGI Fortran compiler for Linux.
By default, 351.palm uses the Temperton Algorithm to compute FFTs. By defining SPEC_HOST_FFTW3, the benchmark will instead use a user suppiled FFTW3 library. The arrays passed to this library will be the host copy.
Users must specify both -DSPEC_HOST_FFTW as well as the include path to the FFTW3 interface file, fftw3.f03. They must also add the FFTW3 libary to the libraries. For example:
315.palm:
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Don't include Fortran main program object module.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Set maximum number of registers to use on the GPU.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Disable loop autoparallelization within a acc parallel region.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Use the fast math library on the device.
Link using FFTW 3.3.3 library for Linux. Description from FFTW:
FFTW is a C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Use pinned host data for memory transfers.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivlent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".
Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"
Enable OpenACC directives.
Target NVIDA GPUs with compute capability 3.5
Use NVIDIA CUDA 5.5 to compile device code.
Set maximum number of registers to use on the GPU.
Don't include Fortran main program object module.
Specifies a directory to search for include files. Use -I to add directories to the search path for include files.
Specifies a directory to search for libraries. Use -L to add directories to the search path for library files. Multiple -L options are valid. However, the position of multiple -L options is important relative to -l options supplied.
This section contains descriptions of flags that were included implicitly by other flags, but which do not have a permanent home at SPEC.
Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch will vary from release to release depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.
The PGI compilers perform many different types of global optimizations, including but not limited to:
Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is it contains many short statements containing IF statements and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2) although this case rarely occurs.
The PGI compilers perform many different types of local optimizations, including but not limited to:
Instructs the compiler to completely unroll loops with a constant loop count of less than or equal to 1 where 1 is a supplied constant value. If no constant value is given, then a default of 4 is used. A value of 1 inhibits the complete unrolling of loops with constant loop counts.
Invokes the loop unroller.
Inline functions declared with the inline keyword.
Enable an optional post-pass instruction scheduling.
Enables loop-carried redundancy elimination, an optimization that can reduce the number of arithmetic operations and memory references in loops.
Eliminates operations that set up a true stack frame pointer for every function. With this option enabled, you cannot perform a traceback on the generated code and you cannot access local variables.
Instructs the vectorizer to search for vectorizable loops and, where possible, make use of SSE, SSE2, and prefetch instructions.
Enable automatic vector pipelining.
Instructs the vectorizer to enable certain associativity conversions that can change the results of a computations due to roundoff error. A typical optimization is to change an arithmetic operation to an arithmetic opteration that is mathmatically correct, but can be computationally different, due to round-off error.
Instructs the vectorizer to generate alternate code for vectorized loops when appropriate. For each vectorized loop the compiler decides whether to generate altcode and what type or types to generate, which may be any or all of:
The compiler also determines suitable loop count and array alignment conditions for executing the altcode.
Align "unconstrained" data objects of size greater than or equal to 16 bytes on cache-line boundaries. An "unconstrained" object is a variable or array that is not a member of an aggregate structure or common block, is not allocatable, and is not an automatic array. On by default on 64-bit Linux systems.
Set SSE to flush-to-zero mode; if a floating-point underflow occurs, the value is set to zero.
Treat denormalized numbers as zero. Included with "-fast" on Intel based systems. For AMD based systems, "-Mdaz" is not included by default with "-fast".
Use SSE/SSE2 instructions to perform scalar floating-point arithmetic on targets where these instructions are supported. This option is default on SSE2 enabled target architectures.
Instructs the compiler to use relaxed precision in the calculation of floating-point reciprocal square root (1/sqrt). Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to use relaxed precision in the calculation of floating-point square root. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to use relaxed precision in the calculation of floating-point division. Can result in improved performance at the expense of numerical accuracy.
Instructs the compiler to allow floating-point expression reordering, including factoring. Can result in improved performance at the expense of numerical accuracy.
Flag description origin markings:
For questions about the meanings of these flags, please contact the tester.
For other inquiries, please contact webmaster@spec.org
Copyright 2014-2015 Standard Performance Evaluation Corporation
Tested with SPEC ACCEL v39.
Report generated on Tue Mar 3 14:20:56 2015 by SPEC ACCEL flags formatter v1156.