The following text (courtesy of Intel Corp.) describes the compiler flags
used for Unisys CPU2000 and SPECompm results generated with the Intel 7.1
compilers on Linux-based systems.

Description of compiler flags for Intel C/C++ Compiler 7.1
----------------------------------------------------------

-O1
    Optimize for maximum speed, but disable some optimizations which
    increase code size for a small speed benefit. Turns off software
    pipelining to reduce code size. Enables the same optimizations as -O2
    except for loop unrolling. Optimizations include global register
    allocation, instruction scheduling, register variable detection,
    common subexpression elimination, dead code elimination, variable
    renaming, copy propagation, constant propagation, strength
    reduction/induction variable simplification, and tail recursion
    elimination.

-O2
    Enable optimizations (DEFAULT) done with -O1 + software pipelining,
    loop unrolling, and inlining of intrinsics.

-O3
    Enable -O2 plus more aggressive optimizations including loop
    transformation, OpenMP, and prefetching. High-level optimizations use
    the properties of source code constructs such as loops and arrays in
    applications written in high-level programming languages. Optimizes
    for maximum speed, but may not improve performance for some programs.
    Enable -O2 optimizations + prefetching, scalar replacement, and loop
    transformations.

-O0
    Disable optimizations -O1, -O2 and -O3.

-O
    Same as -O2.

-ansi_alias
    Directs the compiler to assume the following:
    - Arrays are not accessed out of bounds.
    - Pointers are not cast to non-pointer types, and vice versa.
    - References to objects of two different scalar types cannot alias.
      For example, an object of type int cannot alias with an object of
      type float, and an object of type float cannot alias with an object
      of type double.
    If your program satisfies the above conditions, setting the
    -ansi_alias flag will help the compiler better optimize the program.

-ip
    Enable single-file IP optimizations.

-ipo
    Enable multi-file IP optimizations, which include:
    - inline function expansion
    - interprocedural constant propagation
    - dead code elimination
    - propagation of function characteristics
    - passing arguments in registers
    - loop-invariant code motion

-openmp
    Enables the parallelizer to generate multi-threaded code based on the
    OpenMP directives. The -openmp option only works at an optimization
    level of -O2 (the default) or higher.

-prof_gen
    Instrument the program for profiling. This is the first phase of
    two-phase profile-guided optimization.

-prof_use
    Instructs the compiler to produce a profile-optimized executable and
    merges available dynamic information (.dyn) files into a pgopti.dpi
    file. If you perform multiple executions of the instrumented program,
    -prof_use merges the dynamic information files again and overwrites
    the previous pgopti.dpi file. Without any other options, the current
    directory is searched for .dyn files.

-static
    Prevents linking with shared libraries.

Description of compiler flags for Intel FORTRAN Compiler 7.1
------------------------------------------------------------

-O1
    Optimize for maximum speed, but disable some optimizations which
    increase code size for a small speed benefit. Turns off software
    pipelining to reduce code size. Enables the same optimizations as -O2
    except for loop unrolling. Optimizations include global register
    allocation, instruction scheduling, register variable detection,
    common subexpression elimination, dead code elimination, variable
    renaming, copy propagation, constant propagation, strength
    reduction/induction variable simplification, and tail recursion
    elimination.

-O2
    Enable optimizations (DEFAULT) done with -O1 + software pipelining,
    loop unrolling, and inlining of intrinsics.

-O3
    Enable -O2 plus more aggressive optimizations including loop
    transformation, OpenMP, and prefetching.
    High-level optimizations use the properties of source code constructs
    such as loops and arrays in applications written in high-level
    programming languages. Optimizes for maximum speed, but may not
    improve performance for some programs. Enable -O2 optimizations +
    prefetching, scalar replacement, and loop transformations.

-72,-80,-132
    Specifies 72, 80 or 132 column lines for fixed form source only.

-FI
    Specifies that the source code is in fixed format (default for .for,
    .f, or .ftn).

-ftz
    Enable flush-to-zero mode, where gradual underflows are flushed to
    zero.

-ip
    Enable single-file IP optimizations.

-ipo
    Enable multi-file IP optimizations, which include:
    - inline function expansion
    - interprocedural constant propagation
    - dead code elimination
    - propagation of function characteristics
    - passing arguments in registers
    - loop-invariant code motion

-openmp
    Enables the parallelizer to generate multi-threaded code based on the
    OpenMP directives. The -openmp option only works at an optimization
    level of -O2 (the default) or higher.

-prof_gen
    Instrument the program for profiling. This is the first phase of
    two-phase profile-guided optimization.

-prof_use
    Instructs the compiler to produce a profile-optimized executable and
    merges available dynamic information (.dyn) files into a pgopti.dpi
    file. If you perform multiple executions of the instrumented program,
    -prof_use merges the dynamic information files again and overwrites
    the previous pgopti.dpi file. Without any other options, the current
    directory is searched for .dyn files.

-stack_temps
    Tells the compiler to allocate temporaries on the stack whenever
    possible. OpenMP programs which repeatedly allocate heap memory
    sometimes show poor performance as the number of threads increases.
    Allocating arrays on the stack using -stack_temps will sometimes
    eliminate such performance problems.

-static
    Prevents linking with shared libraries.

ALTERNATE PEAK SOURCE:

SPEC OMPL2001 (srcalt=ompl)
    64-bit source modified for OMPM2001
    Used for 320.wquake_m, 324.apsi_m, 328.fma3d_m

USER ENVIRONMENT:

Variable          Description

OMP_SCHEDULE      Sets the run-time schedule type and chunk size.
                  Default is static, no chunk size specified.

OMP_NUM_THREADS   Sets the number of threads to use during execution.
                  Default is the number of processors.

OMP_DYNAMIC       Enables (true) or disables (false) the dynamic
                  adjustment of the number of threads. Default is false.

OMP_NESTED        Enables (true) or disables (false) nested parallelism.
                  Default is false.

KMP_STACKSIZE     Sets the number of bytes to allocate for each parallel
                  thread to use as its private stack. Use the optional
                  suffix b, k, m, g, or t to specify bytes, kilobytes,
                  megabytes, gigabytes, or terabytes. Default for IA-32
                  is 2m. Default for Itanium is 4m.

KMP_LIBRARY       Selects the execution mode of the OpenMP runtime
                  library. The value can be serial, turnaround, or
                  throughput. If this variable is not specified, the
                  default value of throughput is used.

KMP_LIBRARY Execution Modes

The compiler with OpenMP enables you to run an application under different
execution modes that can be specified at run time. The libraries support
the serial, turnaround, and throughput modes. These modes are selected by
using the KMP_LIBRARY environment variable at run time.

Serial
    The serial mode forces parallel applications to run on a single
    processor.

Turnaround
    In a dedicated (batch or single user) parallel environment where all
    processors are exclusively allocated to the program for its entire
    run, it is most important to effectively utilize all of the processors
    all of the time. The turnaround mode is designed to keep active all of
    the processors involved in the parallel computation in order to
    minimize the execution time of a single job.
    In this mode, the worker threads actively wait for more parallel work,
    without yielding to other threads.

    Note: Avoid over-allocating system resources. Over-allocation occurs
    when too many threads have been specified or too few processors are
    available at run time. If system resources are over-allocated, this
    mode will cause poor performance. The throughput mode should be used
    instead if this occurs.

Throughput
    In a multi-user environment where the load on the parallel machine is
    not constant or where the job stream is not predictable, it may be
    better to design and tune for throughput. This minimizes the total
    time to run multiple jobs simultaneously. In this mode, the worker
    threads will yield to other threads while waiting for more parallel
    work. The throughput mode is designed to make the program aware of its
    environment (that is, the system load) and to adjust its resource
    usage to produce efficient execution in a dynamic environment. This
    mode is the default.

SYSTEM TUNABLES:

ulimit -s
    The ulimit command provides control over the resources available to
    the shell and to processes started by it. The -s option sets the
    maximum stack size (kbytes).
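
The user environment variables and the stack-size tunable described above
are normally set in the shell that launches the benchmark. The following is
a minimal sketch of such a setup, not the configuration used for any
published result: the thread count, the choice of turnaround mode, the 64m
per-thread stack, and the benchmark_binary name are all assumptions chosen
for illustration.

```shell
# Sketch: prepare the shell environment before launching an OpenMP run.
# Every value here is an illustrative assumption.

# Standard OpenMP variables.
export OMP_NUM_THREADS=4        # assumed thread count
export OMP_SCHEDULE="static"    # documented default schedule type
export OMP_DYNAMIC=false        # documented default
export OMP_NESTED=false         # documented default

# Intel runtime variables.
export KMP_LIBRARY=turnaround   # dedicated machine: workers wait actively
export KMP_STACKSIZE=64m        # assumed per-thread stack; m = megabytes

# Report the current soft stack limit (kbytes, or "unlimited").
echo "stack limit before: $(ulimit -s)"

# Try to raise the soft stack limit for this shell and its children.
# This fails if the hard limit is lower, so the failure is reported
# instead of being treated as fatal.
if ulimit -s unlimited 2>/dev/null; then
    echo "stack limit now: $(ulimit -s)"
else
    echo "could not raise stack limit; still $(ulimit -s)"
fi

# ./benchmark_binary   # hypothetical executable; it would be started here
```

Turnaround mode is only a sensible choice here on the assumption that the
machine is dedicated to the run and that OMP_NUM_THREADS does not exceed
the number of available processors. On a shared system, leaving KMP_LIBRARY
unset keeps the default throughput mode.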