The following are the compiler switches/options used by SGI for the recent SPEC OMP2001 submissions.

Portability Flags:

-fixedform
    Treats all input source files, regardless of suffix, as if they were written in fixed source form. By default, only input files suffixed with .f or .F are assumed to be written in fixed source form.

-extend_source
    Specifies a 132-character line length for fixed-format source lines. By default, fixed-format lines are 72 characters wide. For more information on controlling line length, see the -coln option.

Compiler Flags:

-64, -n32
    Specifies the Application Binary Interface (ABI).

    Option  Action
    -64     Generates 64-bit objects.
    -n32    Generates 32-bit objects. When in effect, the total memory allocation for a program and individual arrays cannot exceed 2 Gbytes.

-bigp_off
    Disables the use of large pages within your program. This is the default for all optimization levels except -Ofast.

-bigp_on
    Enables the use of large pages within the program. After a program is compiled with this option, set the PAGESIZE_DATA, PAGESIZE_STACK, and PAGESIZE_TEXT environment variables to one of the following values: 16, 256, 1024, 4096, or 16384. These values represent the size, in kilobytes, of the pages you wish to use. If these environment variables are not set, your program uses 16KB pages by default. This option is enabled when -Ofast is also specified. For more information, see the -bigp_off option.

-extend_source, -fixedform
    See the descriptions under Portability Flags above.

-CG:ld_latency=n
    Specifies the assumed latency of load instructions, in processor cycles, used to determine optimal instruction scheduling.
    The default setting is 5.

-INLINE:aggressive=on
    Tells the compiler to be more aggressive about inlining. The default is -INLINE:aggressive=off.

-IPA[:...]
    IPA option group: controls the inter-procedural analyses and transformations performed. Note that giving just the group name without any options, i.e. -IPA, will invoke IPA with the default settings. -IPA is off by default unless -Ofast is specified.

-IPA:aggr_cprop[=(on/off)]
    Enable/disable aggressive interprocedural constant propagation. Attempts to avoid passing constant parameters by replacing the corresponding formal parameters with the constant values. This optimization is off by default.

-IPA:callee_limit=(n)
    Functions whose size exceeds this limit will never be automatically inlined by the compiler. The default is 2000.

-IPA:clone=on
    Allows IPA to clone procedures while inlining. This is off by default.

-IPA:common_pad_size=n
    Specifies the amount by which to pad common block array dimensions. By default, the compiler automatically chooses the amount of padding to improve cache behavior for common block array accesses.

-IPA:inline[=(on/off)]
    Controls whether the compiler performs inter-file subprogram inlining during main IPA processing. This defaults to on if -IPA or -Ofast was specified, otherwise it is off.

-IPA:linear=on
    Sets linearization of array references. The setting can be ON or OFF. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameter. It may not always be able to map it. When it cannot map the parameter, it linearizes the array reference. By default, it will not inline such call sites because they may cause performance problems. The default is OFF.

-IPA:maxdepth=n
    Directs IPA not to attempt to inline functions at a depth of more than n in the callgraph, where functions which make no calls are at depth 0, those which call only depth 0 functions are at depth 1, and so on.
    Inlining remains subject to overriding limits on code expansion. See also forcedepth, space, and plimit.

-IPA:min_hot=(n)
    When feedback information is available, a call site to a procedure must be invoked with a frequency that exceeds the threshold specified by n before the procedure will be inlined at that call site.

-IPA:multi_clone=n
    Specifies the maximum number of clones that can be created from a single procedure. By default, this value is 0. Cloning may create more opportunities for interprocedural optimization, but it also may significantly increase the code size.

-IPA:node_bloat=n
    When used in conjunction with -IPA:multi_clone, this specifies the maximum percentage growth of the total number of procedures relative to the original program.

-IPA:pad=off
    Disables the automatic padding of common block arrays. Padding is on by default when -IPA is specified, otherwise it is off.

-IPA:plimit=(n)
    Inline calls to a procedure until the procedure has grown to a size of n.

-IPA:small_pu=(n)
    A procedure with a size smaller than n is not subject to the plimit restriction.

-IPA:space=(n)
    Inline until a program expansion of n% is reached. This defaults to 100.

-IPA:use_intrinsic[=(ON|OFF)]
    Enable/disable loading the intrinsic version of standard library functions. The default is off.

-LANG:exceptions=(on/off)
    Enables or disables exception handling constructs in the language. Generally, code with and without exception handling cannot be mixed. Specifically, the scopes crossed between throwing and catching an exception must all have been compiled with exceptions=ON. The default is ON.

-lfastm
    Causes the executable to be linked using libfastm.so, which provides faster, lower-precision versions of various routines from libm.so.

-lmalloc
    Causes the executable to be linked using libmalloc.so, which has a high performance version of malloc().

-mp
    Generates multiprocessing code for the files being compiled.
    This option causes the compiler to recognize all multiprocessing directives, including the OpenMP Fortran API multiprocessing directives and the Silicon Graphics directives that are extensions to OpenMP.

-lscs
    Causes the executable to be linked using libscs.so. libscs is the SGI/Cray Scientific Library (SCSL), which contains the following high performance routines: BLAS1, BLAS2, BLAS3, LAPACK, and FFTs.

-LNO:ap=(0/1/2)
    Controls automatic parallelization: 0 - no parallelization, 1 - normal parallelization, 2 - parallelize loops regardless of trip count. The default is 1.

-LNO:auto_dist=true
    On Origin systems, use a heuristic to distribute local and global arrays that are accessed in parallel. The heuristic is based on access patterns of the named arrays; access patterns of arrays used as dummy arguments are ignored. The default is off.

-LNO:blocking[=(on/off)]
    Enable/disable the cache blocking transformation. The default is on at -O3 or higher.

-LNO:cs2=(n)
    Specify the size of the second level cache (e.g. 4m equals 4 megabytes). The default is 4m.

-LNO:fission=(n/on/off)
    Perform loop fission, n: 0 - off, 1 - conservative, 2 - aggressive. The default is 1.

-LNO:fusion=(n/on/off)
    Perform loop fusion, n: 0 - off, 1 - conservative, 2 - aggressive. The default is 1.

-LNO:interchange=(on/off)
    Perform loop interchange. This is on when -O3 or higher is specified, otherwise it is off.

-LNO:local_pad_size=n
    Specifies the amount by which to pad local array dimensions. By default, the compiler automatically chooses the amount of padding to improve cache behavior for local array accesses.

-LNO:opt=n
    Controls the LNO optimization level. n can be one of the following:

    0   Disables nearly all loop nest optimization.
    1   Performs full loop nest transformations. This is the default.

-LNO:ou=n
    Indicates that all outer loops for which unrolling is legal should be unrolled by n, where n is a positive integer. The compiler unrolls loops by this amount or not at all.
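As a rough illustration of what outer loop unrolling (-LNO:ou=n) means, the following Python sketch unrolls the outer loop of a two-level nest by 2 and merges the two copies of the inner loop. The function names, the choice of Python, and the row-sum workload are assumptions made for illustration only; the compiler applies this transformation to its own internal representation, and this is not its implementation.

```python
def row_sums(a):
    """Reference loop nest: one outer iteration per row."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        for j in range(len(a[i])):
            out[i] += a[i][j]
    return out


def row_sums_outer_unrolled_by_2(a):
    """Same computation with the outer loop unrolled by 2 (hypothetical sketch).

    Assumes every row has the same length, as in a rectangular array.
    """
    n = len(a)
    m = len(a[0]) if n else 0
    out = [0.0] * n
    i = 0
    # Main loop: two outer iterations share a single pass of the inner loop.
    while i + 1 < n:
        for j in range(m):
            out[i] += a[i][j]
            out[i + 1] += a[i + 1][j]
        i += 2
    # Remainder loop: handles the last row when n is odd.
    while i < n:
        for j in range(m):
            out[i] += a[i][j]
        i += 1
    return out
```

Each row is still summed in its original order, so both versions return identical results; only the loop structure changes.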
-LNO:ou_prod_max=n
    Indicates that the product of unrolling of the various outer loops in a given loop nest is not to exceed n, where n is a positive integer. The default is 16.

-LNO:outer_unroll_max,ou_max=(n)
    Indicates that the compiler may unroll outer loops in a loop nest by as many as n per loop, but no more. The default is 4.

-LNO:pf2=(on/off)
    Enable/disable prefetch for the second level cache. The default is on if -O3 or higher is specified, otherwise the default is off.

-LNO:prefetch[=(0|1|2)]
    Specify the level of prefetching.

    0   Prefetch disabled.
    1   Prefetch enabled but conservative. This is the default.
    2   Prefetch enabled and aggressive.

-LNO:prefetch_ahead=[n]
    Prefetch n cache line(s) ahead. The default is 2.

-LNO:pwr2[=(on/off)]
    If enabled, when the leading dimension of an array is a power of two, the compiler makes an extra effort to make the inner loop stride one. The default is on.

-mips4
    Generate code using the full MIPS IV instruction set, which is supported on R10000, R5000 and R8000 systems, and search for mips4 libraries/objects at link time.

-O or -O2
    Turn on extensive optimization. The optimizations at this level are generally conservative, in the sense that they (1) are virtually always beneficial, (2) provide improvements commensurate to the compile time spent to achieve them, and (3) avoid changes which affect such things as floating point accuracy.

-O3
    Turn on aggressive optimization. The optimizations at this level are distinguished from -O2 by their aggressiveness, generally seeking highest-quality generated code even if it requires extensive compile time. They may include optimizations which are generally beneficial but occasionally hurt performance. This includes but is not limited to turning on the Loop Nest Optimizer, -LNO:opt=1, and setting -OPT:ro=2:IEEE_arith=3:Olimit=3000:reorg_common=on and -TENV:X=3.

-Ofast[=ipxx]
    Use optimizations selected to maximize performance for the given SGI target platform IPxx.
    The optimizations may differ for the various platforms, and will always enable the full instruction set of the target platform (e.g. -mips4 for an R10000). Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations. This effectively turns on the following optimizations: -O3 -IPA -OPT:ro=3:Olimit=0:div_split=on:alias=typed -TARG:platform=ipxx -bigp_on.

-Ofast=ip27
    This flag is equivalent to the following optimizations: -O3 -IPA -OPT:ro=3:Olimit=0:div_split=on:alias=typed:unroll_times_max=8 -TARG:platform=ip27 -bigp_on.

-Ofast=ip35
    This flag is equivalent to the following optimizations: -O3 -IPA -OPT:ro=3:Olimit=0:div_split=on:alias=typed:unroll_times_max=8 -TARG:platform=ip35 -bigp_on.

-OPT:alias=name
    Specifies the pointer aliasing model to be used. By specifying one or more of the following for name, the compiler is able to make assumptions throughout the compilation:

    typed     Assume that the code adheres to the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. This is on by default when -Ofast is specified.
    restrict  Specifies that distinct pointers are assumed to point to distinct, non-overlapping objects. This is off by default.
    disjoint  Specifies that any two pointer expressions are assumed to point to distinct, non-overlapping objects. This is off by default.

-OPT:div_split=(on/off)
    Enable/disable changing x/y into x*(recip(y)). This is on when -Ofast is specified, otherwise it is off by default.

-OPT:fast_bit_intrinsics[=(on/off)]
    Disable/enable the check for the bit count being within range for Fortran bit intrinsics (e.g., BTEST, ISHFT). This is off by default.

-OPT:Olimit=(n)
    Disable optimization when the size of a program unit is greater than n. When n is 0, program unit size is ignored and optimization will not be disabled due to the compile time limit. The default is 0 when -Ofast is specified, otherwise the default is 2000.
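The note above that -OPT:div_split rewrites x/y as x*(recip(y)) can be made concrete with a small Python sketch using IEEE 754 double precision. The function names are hypothetical, and Python stands in for the compiled code; the point is only that one rounded operation and two rounded operations need not agree in the last bit.

```python
def divide(x, y):
    """Plain division: a single correctly rounded operation."""
    return x / y


def divide_split(x, y):
    """Division rewritten as a multiply by the reciprocal.

    The reciprocal is rounded, then the product is rounded again,
    so the result can differ from x / y in the last bit.
    """
    return x * (1.0 / y)


plain = divide(3.0, 10.0)        # 0.3
split = divide_split(3.0, 10.0)  # 0.30000000000000004
```

The difference is on the order of one unit in the last place, which is why this transformation is grouped with the options that may affect floating point accuracy.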
-OPT:goto=(off/on)
    Disable/enable the conversion of GOTOs into higher level structures like FOR loops. The default is on for -O2 or higher.

-OPT:IEEE_arith=(n)
    Specifies the level of conformance to IEEE 754 floating-point roundoff/overflow behavior. At level 3, all mathematically valid transformations are allowed. The default is 1.

-OPT:reorg_common[=(on/off)]
    reorg_common=ON reorganizes common blocks to improve the cache behavior of accesses to members of the common block. The reorganization is done only if the compiler detects that it is safe to do so.

-OPT:ro=(n)
    Specifies the level of acceptable departure from source language floating-point, round-off, and overflow semantics. n can be one of the following:

    0   Inhibits optimizations that might affect the floating-point behavior. This is the default when optimization levels -O0, -O1, and -O2 are in effect.
    1   Allows simple transformations that might cause limited round-off or overflow differences. Compounding such transformations could have more extensive effects.
    2   Allows more extensive transformations, such as the reordering of reduction loops. This is the default level when -O3 is in effect.
    3   Enables any mathematically valid transformation.

-OPT:unroll_times_max=(n)
    Unroll inner loops by a maximum of n. The default is 4.

-OPT:unroll_size=(n)
    Sets the ceiling on the maximum number of instructions for an unrolled inner loop. If n = 0, the ceiling is disregarded.

-OPT:unroll_analysis=[on/off]
    Enable/disable unrolling of inner loops by analysing resource and processor specifics. The default is on.

-pfa
    Turns on the MIPSpro Auto Parallelizing Option, which enables the compiler to automatically discover parallelism in the source code. This is off by default.

-TARG:platform[=ipxx]
    Identify the target SGI platform for compilation, choosing various internal parameters (such as cache sizes) appropriately. The default is ip25.
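The -OPT:ro description above mentions the reordering of reduction loops at level 2. The following Python sketch, with hypothetical function names and hand-picked sample data, shows why such a reordering is treated as a departure from source-language round-off semantics: floating-point addition is not associative, so splitting a sum into two partial sums can change the result. It illustrates the effect only and does not reproduce the compiler's actual transformation.

```python
def sum_in_order(values):
    """Reduction in source order, one accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_two_accumulators(values):
    """Reordered reduction: even and odd elements are summed separately,
    then the two partial sums are combined."""
    acc_even = 0.0
    acc_odd = 0.0
    for i, v in enumerate(values):
        if i % 2 == 0:
            acc_even += v
        else:
            acc_odd += v
    return acc_even + acc_odd


# Sample data chosen so that 1.0 is lost when added directly to 1e16.
SAMPLE = [1e16, 1.0, -1e16, 1.0]
in_order = sum_in_order(SAMPLE)            # 1.0
reordered = sum_two_accumulators(SAMPLE)   # 2.0
```

Here the reordered sum happens to be the exact answer, but in general either ordering can be the more accurate one; the two simply round differently.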
-TARG:platform=ip27
    Turns on the following: -TARG:madd=ON:isa=mips4:processor=r10000

-TARG:madd=flag
    Enable or disable transformations to use multiply/add instructions. flag can be either ON or OFF. These instructions perform a multiply and an add with a single round-off. They are, therefore, more accurate than the usual discrete operations, and may cause results not to match baselines from other targets. Use this option to determine whether observed differences are due to madds. The default is -TARG:madd=ON for a MIPS IV target; it is ignored for others.

-TARG:isa=value
    Identifies the target instruction set architecture for compilation, that is, the set of instructions that are generated. value can be mips3 or mips4. Specify -TARG:isa=mips3 for code that must run on R4000 processors. This option is equivalent to specifying -mips3 or -mips4 (see those options for defaults).

-TARG:processor=type
    Select the processor for which to schedule code. type can be r4000, r5000, r8000, or r10000. The chosen processor must support the ISA specified (or implied by the ABI).

-TENV:X=(0..5)
    Specify the level of enabled exceptions that will be assumed for purposes of performing speculative code motion (default level 1 at -O0..-O2, 2 at -O3). In general, an instruction will not be speculated (i.e. moved above a branch by the optimizer) unless any exceptions it might cause are disabled by this option.

    0   No speculative code motion may be performed.
    1   Safe speculative code motion may be performed, with IEEE-754 underflow and inexact exceptions disabled.
    2   All IEEE-754 exceptions are disabled except divide by zero.
    3   All IEEE-754 exceptions are disabled, including divide by zero.
    4   Memory exceptions may be disabled or ignored.

-Wl,-x
    Passes the -x option to the linker. With this flag set, the linker will not preserve local (non-global) symbols in the output symbol table. The linker enters external and static symbols only.
    This option conserves space in the output file. This is off by default.

The following are descriptions of the system tunable parameters used to enable large pages.

systune -i
    systune is a tool that enables a user to examine and configure tunable kernel parameters. -i puts systune in interactive mode.

percent_totalmem_<size>_pages=n
    A system tunable parameter that can be set from within systune. It tells IRIX the maximum percentage of memory that can be allocated to <size> pages. <size> is one of: 64k, 256k, 1m, 4m, 16m, indicating 64 Kilobyte, 256 Kilobyte, 1 Megabyte, 4 Megabyte and 16 Megabyte pages, respectively.

nlpages_4m=n
    A system tunable parameter that can be set from within systune. It tells IRIX to statically allocate n 4MB pages at system boot time.

r12k_bdiag=n
    A system tunable parameter that can be set from within systune. It specifies the number of bits of the global history register to be used for branch prediction. Bits 26:23 of the Diag register (CP0 register 22) correspond to the global history register. Turning bit 26 on, n=0x4000000, allows branch prediction to use all 8 bits in the global history register. If bit 26 is not set (the default), bits 25:23 specify a count of the number of bits of the global history register to be used.

The following is a description of the shell intrinsic limit, used to set UNIX process limits.

limit [ resource [ max-use ] ]
    Limit the consumption by the current process or any process it spawns, each not to exceed max-use on the specified resource. If max-use is omitted, print the current limit; if resource is omitted, display all limits. resource may include:

    stacksize   Maximum stack size for the process. Note: If this is set too high, sproc(2) may fail.

The following is a description of the environment variables used and interpreted by the multiprocessor (mp) library.

OMP_NUM_THREADS
    Sets the number of threads to use during execution, unless that number is explicitly changed by calling the OMP_SET_NUM_THREADS subroutine.
OMP_DYNAMIC=[TRUE/FALSE]
    Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions.

MPC_GANG=[ON/OFF]
    Controls the use of gang scheduling, which is enabled by default. To disable gang scheduling, set this environment variable to OFF. By default, this environment variable is not set.

_DSM_MUSTRUN
    Locks each thread to the corresponding CPU. This environment variable is not set by default.

_DSM_WAIT=[SPIN/YIELD]
    Controls how a thread waits for a synchronization event, such as a lock or a barrier. This environment variable accepts one of the following values:

    Value   Action
    SPIN    Specifies that a thread waits in a loop until the synchronization event succeeds.
    YIELD   Specifies that a waiting thread spins for a while and then invokes sginap(2), surrendering the CPU to another waiting process (if any). This is the default.

_DSM_PLACEMENT=[FIRST_TOUCH/ROUND_ROBIN]
    Specifies the memory allocation policy for all stack, data, and text segments. This environment variable accepts the following values:

    Value         Action
    FIRST_TOUCH   Specifies first-touch data placement. This is the default.
    ROUND_ROBIN   Specifies round-robin data allocation.
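As a minimal sketch of how the mp-library environment variables above might be assembled before launching a benchmark, the following Python helper builds an environment dictionary and rejects values outside the documented sets. The helper name build_mp_env and its parameters are hypothetical and not part of any SGI tool; _DSM_MUSTRUN is left out because the text above does not state which value it expects.

```python
import os

# Accepted values as listed in the descriptions above.
VALID_DSM_WAIT = ("SPIN", "YIELD")
VALID_DSM_PLACEMENT = ("FIRST_TOUCH", "ROUND_ROBIN")


def build_mp_env(num_threads, dynamic=False, gang=True, wait="YIELD",
                 placement="FIRST_TOUCH", base_env=None):
    """Return a copy of base_env with the mp-library variables set.

    Hypothetical helper for illustration; base_env defaults to os.environ.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    if wait not in VALID_DSM_WAIT:
        raise ValueError("wait must be one of %s" % (VALID_DSM_WAIT,))
    if placement not in VALID_DSM_PLACEMENT:
        raise ValueError("placement must be one of %s" % (VALID_DSM_PLACEMENT,))

    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OMP_DYNAMIC"] = "TRUE" if dynamic else "FALSE"
    env["MPC_GANG"] = "ON" if gang else "OFF"
    env["_DSM_WAIT"] = wait
    env["_DSM_PLACEMENT"] = placement
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the compiled program; the original environment is left unmodified.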