================================================================================
Fujitsu PRIMEPOWER flags/tunables description (September 15, 2004)
================================================================================

Fujitsu Parallelnavi 2.4 compiler flag description

Parallelnavi consists of the following three main function groups:
(a) Program development and execution environment functions;
(b) High-speed application execution functions; and
(c) Job operation management functions.

The first group consists of a program development environment that supports
Fortran, C, and C++ in accordance with international standards, and allows the
use of parallel technology. Mathematical libraries, including the parallel
version, and program development support tools are included.

The second group contains a harmonized scheduling function, which ensures a
uniform level of execution performance for a parallelized program, and a large
page function, which improves the execution performance of applications that
handle large volumes of data.

The third group provides for the execution control and management of jobs,
including parallel jobs, using the Network Queuing System (NQS). NQS allows one
to submit batch jobs to queues on local or remote machines and have the log
file returned to the originating machine or another machine. It is possible to
disallow the sharing of CPUs by multiple jobs. The system resources can be
applied effectively using the NQS scheduling functions, and flexible scheduling
for each user and group is provided by the Network Queuing System Job Manager
(NQS-JM).

The following compiler options have been used for both Fortran 90 and C
compilers, except where specified otherwise.

Compiler options        Remark
--------------------------------------------------------------------------------
-KOMP                   Recognizes OpenMP directives in the source code and
                        generates multiprocessing code.

-Kfast_GP2[={1|2|3}]    This performs optimization for the SPARC64 V series.
                        1: This performs optimization suitable for SPARC64 V
                           including the global instruction scheduling.
                        2: This performs reordering of expression evaluation
                           in addition to -Kfast_GP2=1. It also puts the
                           options -KSPARC64_GP2 and -Kprefetch=3 in effect.
                        3: This option specifies the optimization of moving
                           the evaluation of invariant expressions beyond the
                           branch in addition to -Kfast_GP2=2.

-KV8PLUS                Indicates that SPARC V8+ instructions are generated.

-KV9                    Indicates that SPARC V9 instructions are generated.

-KV9FMADD               Indicates that SPARC V9 instructions and multiply
                        add/subtract instructions are generated.

-Klargepage[={1|2}]     Specifies the generation of an executable program
                        which utilizes the Parallelnavi largepage facility.
                        1: The largepage facility applies to the data and
                           heap areas.
                        2: The largepage facility applies to the data, heap,
                           and stack areas.

-Khardbarrier           Specifies the generation of an executable program
                        which utilizes the Parallelnavi thread hardware
                        barrier facility. The synchronization performance is
                        improved.

-x-                     Inline expansion of user-defined procedures having 30
                        or fewer executable statements is performed.

-xstm_no                Inline expansion of user-defined procedures having
                        stm_no or fewer executable statements is performed.
                        The default is 30.

-Kprefetch_line=N       When a prefetch instruction is generated, the target
                        of the prefetch is the cache line N lines ahead. It
                        is valid only if one of the -Kprefetch={2|3|4|5}
                        options is in effect.

-Kprefetch[={1|2|3|4|5}]
                        Generates prefetch instructions corresponding to each
                        prefetch level. The -Kprefetch option is valid when
                        the -KV8PLUS or the -KV9 option is in effect.
                        1: Basic-level prefetch for array elements in the
                           innermost loop only. In nested loops, only the
                           array data in the innermost loop is targeted.
                        2: In addition to -Kprefetch=1, generates prefetch
                           instructions within the loop pre-header for the
                           array elements accessed in the first iteration of
                           the loop.
                        3: In addition to -Kprefetch=2, when the stride of
                           access for array elements is larger than the cache
                           line size, the compiler generates a prefetch
                           instruction for each cache-line-size access.
                        4: In addition to -Kprefetch=3, prefetch with address
                           calculation is executed.
                        5: In addition to -Kprefetch=4, prefetching is applied
                           to array data which are accessed indirectly.

-Kpreex                 This option specifies the optimization of moving the
                        evaluation of invariant expressions beyond the branch.

-Ka{1|2|4|8}            Specifies, in bytes, the minimum alignment boundary
                        for external and static variables. For example,
                        1-byte to 4-byte boundary data is aligned on 4-byte
                        boundaries when -Ka4 is specified. By default, -Ka1
                        is assumed.

-Kprefetch_cache_level=N   (F90 only)
                        This option specifies the cache level into which data
                        is prefetched. It is valid only if the -KSPARC64_GP2
                        option and one of the -Kprefetch={2|3|4|5} options
                        are in effect. N can be specified as follows:
                        1: Data is prefetched into the first-level cache. The
                           prefetch instruction is used normally.
                        2: Data is prefetched into the second-level cache
                           only.
                        3: The functions of levels 1 and 2 are both in effect.
                           Two kinds of prefetch instructions are used, so
                           that the prefetch becomes high level.

-K{prefetch_infer|prefetch_noinfer}
                        This option specifies whether the prefetch distance is
                        guessed from existing information, in which case the
                        address calculation for the prefetch can be omitted.
                        It is valid if the -Kprefetch option is in effect.
                        The default is -Kprefetch_noinfer.

-Kprefetch_infer        With this option you specify that the address
                        calculation for prefetch is omitted when the prefetch
                        distance is guessed from existing information.

-Kprefetch_noinfer      With this option you specify that the address
                        calculation for prefetch is performed when the
                        prefetch distance is guessed from existing
                        information.

-KSPARC64_GP2           The -KSPARC64_GP2 option creates an object program
                        that uses SPARC64 V instructions. Programs built with
                        -KSPARC64_GP2 are not supported on systems equipped
                        with CPUs other than SPARC64 V.

-O [opt_lvl]            opt_lvl:{0|1|2|3|4|5}   (F90 version)
                        The -O option specifies the optimization level used
                        by the compiler. The system provides six optimization
                        levels: 0, 1, 2, 3, 4, and 5. If the argument is
                        omitted from the -O option, level 3 is used. If the
                        -O option is not specified, level 2 is used. The
                        following description holds for F90 only.
                        0  The -O0 option creates an object program without
                           applying optimization. A program compiled with
                           optimization level 0 requires the least compile
                           time and memory. Specify this argument to debug
                           compilation errors in Fortran source programs.
                        1  The -O1 option creates an object program by
                           applying basic optimization. The run time of the
                           resulting executable program will be shorter. The
                           size of an object program created using level 1
                           is smaller than the size of an object program
                           created using level 2.
                        2  The -O2 option creates an object program by
                           applying the basic optimization of level 1 plus
                           loop unrolling. Compared with level 1, the run
                           time of an executable program that includes many
                           DO loops can be reduced by using level 2.
                        3  The -O3 option creates an object program by
                           applying the optimizations of level 2 plus
                           modification of the structure of nested loops,
                           loop tiling, and software pipelining. The
                           optimizations of level 1 are also applied
                           repeatedly. Compared with level 2, the execution
                           performance of an object program is improved by
                           using level 3.
                        4  The -O4 option creates an object program by
                           applying further loop restructuring optimizations
                           in addition to those of the -O3 option. The
                           optimizations of this level include full
                           unrolling of nested loops, splitting for
                           promoting loop exchange, and optimization of
                           arrays with constant subscripts. Although compile
                           time is longer than with -O3, this option can be
                           used to obtain better execution performance.
                        5  The -O5 option creates an object program by
                           applying further register allocation optimizations
                           in addition to those of the -O4 option.
                           Although compile time is longer still than with
                           -O4, this option can be used to obtain better
                           execution performance.

-O [opt_lvl]            opt_lvl:{0|1|2|3|4|5}   (C version)
                        With opt_lvl, specify the level of optimization as
                        0, 1, 2, 3, 4, or 5. When the -O option is specified
                        without a level, -O3 is assumed. The higher the level
                        of optimization, the shorter the execution time and
                        the longer the compile time. The higher levels of
                        optimization functionally include the lower levels of
                        optimization. This option is not valid for an .s
                        file.
                        0  No optimization is performed. This is equivalent
                           to not specifying -O.
                        1  Optimization is performed through detailed
                           analysis of program control flow.
                        2  In addition to the optimization of level 1, the
                           following optimization is performed:
                           - Loop unrolling
                           This may involve an increase in object size.
                        3  In addition to the optimization of level 2, the
                           following optimizations are performed:
                           - Loop unrolling (expanded)
                           - Software pipelining
                           - Repeated application of optimization functions
                           Repeated application of optimization functions
                           means that the optimization functions performed
                           in level 1 are repeated until there is no room
                           for further optimization. Compile time will
                           increase.
                        4  In addition to the optimization of level 3, the
                           following optimizations are performed:
                           - Splitting for promotion of loop exchange
                           - The -KGREG_APPLI option is assumed
                        5  In addition to the optimization of level 4, the
                           following optimization is performed:
                           - Register allocation (expanded)

-K{GREG_APPLI | NOGREG_APPLI}
                        This specifies whether or not the global registers g2
                        through g4 (when -KV9 is available, g2 and g3 are
                        used) are subject to register allocation in the
                        compile stage. These registers are used for
                        application programs. If -KGREG_APPLI is specified,
                        the registers are subject to register allocation. If
                        -KNOGREG_APPLI is specified, the registers are not
                        subject to register allocation.
                        By default, -KNOGREG_APPLI is assumed.

-Nautoobjstack          Allocates an automatic data object on the stack.
                        (F90 only)

-Am                     Required if a source file contains modules which will
                        be referenced by USE statements in other source
                        files, or if a source file contains USE statements
                        that reference modules in another source file.
                        (F90 only)

-Fixed                  Specifies that Fortran source programs are written in
                        fixed source form. (F90 only)

-w                      In fixed source form, the length of all source lines
                        is 255 characters. (F90 only)

-Kfast_GP[={0|1|2}]     This performs optimization for the SPARC64 GP series.
                        0: This performs optimization suitable for SPARC64 GP
                           including the global instruction scheduling.
                        1: This generates multiply and add instructions in
                           addition to -Kfast_GP=0. (default)
                        2: This performs reordering of expression evaluation
                           in addition to -Kfast_GP=1.

-D name[=tokens]        Associates name with the specified tokens in the same
                        way as a #define preprocessing directive. If =tokens
                        is not specified, the token 1 is used.

-KOMP_fast_reduction    Uses a reduction facility which is valid only if the
                        parallel regions are not nested.

-Kgs                    Performs global instruction scheduling. This is
                        activated if -Kfast_GP2 is specified.

-Kparallel              Specifies automatic parallelization.

-Knogs                  Does not perform global instruction scheduling.

-Kpreex                 This option specifies the optimization of moving the
                        evaluation of invariant expressions beyond the branch.

-Kloop                  Restructures DO loops by performing interchange and
                        blocking for better cache use.

-Kilfunc                This option replaces the single and double precision
                        mathematical functions sin, cos, log10, log, and exp
                        with compiler built-in functions.

-Kcfunc                 This uses high-speed mathematical functions and
                        library functions (malloc, calloc, realloc, free)
                        prepared by this compilation system. (C only)

-Kmfunc                 This uses high-speed mathematical functions (C
                        version) prepared by this compilation system.
                        The mathematical functions are the trigonometric
                        functions, logarithmic functions, and gamma functions.

-Kmfunc[={1|2|3}]       This uses high-speed mathematical functions (F90
                        version) prepared by this compilation system. The
                        mathematical functions are the trigonometric
                        functions, logarithmic functions, and gamma functions.
                        -Kmfunc=1 is the same as -Kmfunc. It combines two or
                        four function references into one function call.
                        -Kmfunc=2 combines two, four, or eight function
                        references into one function call. It requires more
                        stack area than -Kmfunc=1 does.
                        -Kmfunc=3 combines two, four, or eight function
                        references into one function call even if the loop
                        contains IF constructs, etc. It also requires a lot
                        of stack area.

-K{ unroll[=N] | nounroll }   2 <= N <= 100
                        With this flag, loop unrolling is performed. N
                        designates the upper limit of the unrolling expansion
                        number, and its value should be between 2 and 100.
                        When N is omitted, the compiler automatically
                        determines a suitable value. The default is
                        -Knounroll if -O0 or -O1 is specified, and -Kunroll
                        if -O2 or higher is specified.

-lmtmalloc              Uses the Solaris mtmalloc library. malloc() and free()
                        provide a simple general-purpose memory allocation
                        package that is suitable for use in high performance
                        multithreaded applications.

-SSL2                   Links with the SSL II library. This library contains
                        the optimized BLAS functions.

-Kcommonpad[=N]         4 <= N <= 4096
                        The -Kcommonpad option inserts padding into a common
                        block for efficient use of cache. If the source file
                        that includes the common block is compiled with the
                        -Kcommonpad option, other source files that include
                        the same common block must also be compiled with
                        -Kcommonpad. N is given in bytes.

-Kbcopy                 This option directs the compiler to use the
                        memcpy(3C) and memset(3C) functions for copying
                        between arrays and for setting arrays to zero (0) in
                        a loop. This option is only meaningful with option
                        -KSPARC64_GP2 or with option -O3 or higher.
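
The defaulting rules stated above for -O and -Kunroll can be sketched in a few
lines of Python. This is only an illustration of the rules as described in
this file and is not part of the Parallelnavi tool chain. The function names,
the "f90" and "c" labels, the attached spelling of the level (-O3 rather than
a separate argument), and the last-flag-wins handling are all assumptions made
for the sketch.

```python
from typing import Optional, Sequence


def effective_opt_level(language: str, flags: Sequence[str]) -> int:
    """Return the optimization level implied by the -O rules in this file.

    language is "f90" or "c"; these labels are hypothetical and exist only
    for this sketch.
    """
    level: Optional[int] = None
    for flag in flags:
        if flag == "-O":
            # -O without a level selects level 3 for both compilers.
            level = 3
        elif flag.startswith("-O") and flag[2:].isdigit():
            level = int(flag[2:])
            if not 0 <= level <= 5:
                raise ValueError(f"unsupported optimization level: {flag}")
    if level is None:
        # No -O at all: level 2 for F90, no optimization (level 0) for C.
        level = 2 if language == "f90" else 0
    return level


def unroll_enabled(language: str, flags: Sequence[str]) -> bool:
    """Return True if loop unrolling is in effect.

    The default is -Kunroll at -O2 or higher and -Knounroll at -O0 or -O1.
    Assumption: an explicit -Kunroll or -Knounroll overrides the default and
    the last one on the command line wins.
    """
    enabled = effective_opt_level(language, flags) >= 2
    for flag in flags:
        if flag == "-Knounroll":
            enabled = False
        elif flag == "-Kunroll" or flag.startswith("-Kunroll="):
            enabled = True
    return enabled
```

Under these assumptions, an F90 compilation with no -O option resolves to
level 2 with unrolling enabled, while a C compilation with no -O option
resolves to level 0 with unrolling disabled.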
================================================================================

================================================================================
Solaris system configuration information file description

All flags are specified in '/etc/system'.

System Tunables         Remark
--------------------------------------------------------------------------------
shmsys:shminfo_shmmax   Maximum size of a System V shared memory segment that
                        can be created.

shmsys:shminfo_shmmni   System-wide limit on the number of shared memory
                        segments that can be created.

shmsys:shminfo_shmseg   Limit on the number of shared memory segments that
                        any one process can create.

autoup                  Period of execution of the fsflush daemon in units of
                        seconds. This daemon periodically scans memory to
                        find unwritten data and meta-data and writes them to
                        disk.

memscrub_period_sec     Period of execution of the memory patrol daemon in
                        units of seconds. This daemon periodically scans
                        memory to confirm that there are no ECC errors.
================================================================================

================================================================================
Fujitsu Parallelnavi 2.3 Large page management information file description

All flags are specified in '/etc/opt/FJSVpnrm/lpg.conf'.

Tunables                Remark
--------------------------------------------------------------------------------
JOB=size[unit]          Size of the total memory to be used for large page
                        segments. At system start, this amount of memory is
                        reserved and initialized for NQS jobs. "unit" can be
                        M for megabytes or G for gigabytes.

SHMSEGSIZE=size[unit]   Size of a large page segment. "unit" can be M for
                        megabytes or G for gigabytes.
================================================================================

================================================================================
Fujitsu Parallelnavi 2.3 CPU resource information file description

All flags are specified in '/etc/opt/FJSVpnrm/cpursc.conf'.

Tunables                Remark
--------------------------------------------------------------------------------
CPU_USE=tss-cpuid:io-cpuid
                        Parallelnavi allows the grouping of CPUs into CPUs
                        that are used for job execution and CPUs that are
                        used for other types of processing (interactive,
                        daemon). It is also possible to specify which CPUs
                        should, and which should not, receive I/O interrupts.
                        The CPUs not used for job execution are specified by
                        tss-cpuid and, of these CPUs, the CPUs interruptible
                        by I/O devices are specified by io-cpuid. tss-cpuid
                        and io-cpuid can be ranges of two numbers separated
                        by a dash or comma. For instance, CPU_USE=0,1:0,1
                        means that CPUs 0 and 1 are reserved for system
                        tasks, and both of these CPUs are enabled to receive
                        I/O interrupts. All other CPUs are reserved for job
                        execution.
================================================================================

================================================================================
Fujitsu Parallelnavi 2.3 NQS parameters description

The Network Queuing System (NQS) parameters are managed by the qmgr command.
Selecting the execution mode 'Simplex' disallows the sharing of CPUs by
multiple jobs.

Tunables                                Remark
--------------------------------------------------------------------------------
Per-process data size limit             The limit of data segment size.
Per-process permanent file size limit   The limit of permanent file size.
Per-process memory size limit           The limit of memory size.
Per-request memory size limit           The limit of largepage memory size.
Per-process number of cpus limit        The limit of the number of CPUs.
Per-process stack size limit            The limit of stack segment size.
Per-process CPU time limit              The CPU time limit for each process.
Execution mode                          Selection of execution mode. Simplex
                                        means sharing CPUs by multiple jobs is
                                        not permitted.
Jobclass                                Assigning sub group. On a
                                        non-clustered system, this value is
                                        zero.
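
The CPU_USE grouping described in the CPU resource section can be sketched as
a small Python parser. This is an illustration only, not a Parallelnavi tool.
The function names are hypothetical. The sketch assumes that a dash denotes an
inclusive range, that comma lists and ranges may be mixed, and that CPU ids
run from 0 to total_cpus - 1; none of these details are confirmed by this
file.

```python
from typing import Set, Tuple


def parse_cpu_ids(spec: str) -> Set[int]:
    """Expand a spec such as '0,1' or '0-3' into a set of CPU ids.

    Assumption: 'a-b' is an inclusive range and ',' separates items.
    """
    ids: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"empty CPU range: {part}")
            ids.update(range(low, high + 1))
        else:
            ids.add(int(part))
    return ids


def parse_cpu_use(line: str, total_cpus: int) -> Tuple[Set[int], Set[int], Set[int]]:
    """Split a CPU_USE line into (tss, io, job) CPU sets.

    tss: CPUs not used for job execution (tss-cpuid).
    io:  the subset of tss CPUs that may receive I/O interrupts (io-cpuid).
    job: all remaining CPUs, which are reserved for job execution.
    """
    key, _, value = line.strip().partition("=")
    if key != "CPU_USE":
        raise ValueError("not a CPU_USE line")
    tss_spec, sep, io_spec = value.partition(":")
    if not sep:
        raise ValueError("expected CPU_USE=tss-cpuid:io-cpuid")
    tss = parse_cpu_ids(tss_spec)
    io = parse_cpu_ids(io_spec)
    if not io <= tss:
        # io-cpuid is described as a selection among the tss-cpuid CPUs.
        raise ValueError("io-cpuid must be a subset of tss-cpuid")
    job = set(range(total_cpus)) - tss
    return tss, io, job
```

With the example from the text, parse_cpu_use("CPU_USE=0,1:0,1", 8) on a
hypothetical 8-CPU system yields CPUs 0 and 1 for system tasks and I/O
interrupts, and CPUs 2 through 7 for job execution.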
================================================================================
History

June 2001        Matthijs van Waveren   Initial version
June 2003        Matthijs van Waveren   Update to Parallelnavi 2.2
November 2003    Matthijs van Waveren   Update to Parallelnavi 2.3
                                        Added -Kprefetch_cache_level and -O5
                                        Extended section on CPU_USE and -Kmfunc
August 2004      Matthijs van Waveren   Update of some flags
                                        Update to Parallelnavi 2.4
September 2004   Matthijs van Waveren   Updates due to SPEC HPG review comments
================================================================================