IBM SPEC HPC2002 Flag Descriptions for Opteron 

Portland Group Compiler Technology's Fortran compiler  pgf90 5.0-1

Last updated:  17-October-2003


Portability  Options

-i8	Treat INTEGER variables as eight bytes.  Operations use 64 bits.
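
	As a rough illustration of why integer width matters for portability, the
	sketch below checks whether a value is representable in a signed integer
	of a given size.  It assumes that INTEGER is 4 bytes when -i8 is not
	given, which this document does not state; fits_in_integer is a
	hypothetical helper, not part of any compiler or library.

```python
def fits_in_integer(value: int, nbytes: int) -> bool:
    """Return True if value is representable as a signed two's-complement
    integer that is nbytes bytes wide."""
    bits = 8 * nbytes
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


# 3,000,000,000 exceeds the 4-byte range but fits easily in 8 bytes.
print(fits_in_integer(3_000_000_000, 4))  # False
print(fits_in_integer(3_000_000_000, 8))  # True
```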

-Mfixed
	Indicates that source code is in fixed (72 column) format. 

-DSPEC_HPG_MPI_INT4  (ChemS and ChemM)
	Indicates that calls to the MPI library expect 4-byte integers rather
	than 8-byte integers.
	
-DBITS64  (ChemS and ChemM)
	Indicates the size of standard integers.  BITS64 indicates 64-bit integers.


Flags and Compiler options for the Portland Group's Fortran and C/C++ compilers 
  

Fortran pgf90 and C/C++ pgcc 5.0-1


The optimization levels and their meanings are as follows:	

	-O0	A basic block is generated for each Fortran statement.  No scheduling is done
		between statements.  No global optimizations are performed.

	-O1	Scheduling within extended basic blocks is performed.
		Some register allocation is performed.  No global optimizations
		are performed.

	-O2	All level 1 optimizations are performed.  In addition,  scalar
		optimizations such as induction recognition and loop invariant motion are
		performed by the global optimizer. 
                
	-O3	This level performs all level 1 and level 2 optimizations and enables more
		aggressive hoisting and scalar replacement optimizations.

	-fast	Chooses generally optimal flags for the target platform.  Equivalent to
		"-O2 -Munroll -Mnoframe".

	-fastsse
		Chooses generally optimal flags for machines that support SSE
		instructions.  Equivalent to "-fast -Mscalarsse -Mvect=sse -Mcache_align -Mflushz".
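
	Because -fastsse is defined in terms of -fast, it can help to see the
	fully expanded flag list.  The sketch below models only the textual
	expansion given in this document; the ALIASES table and expand function
	are hypothetical names, and the code does not model how the compiler
	driver resolves flags that are given more than once.

```python
# Expansions copied from the flag descriptions in this document.
ALIASES = {
    "-fast": ["-O2", "-Munroll", "-Mnoframe"],
    "-fastsse": ["-fast", "-Mscalarsse", "-Mvect=sse", "-Mcache_align", "-Mflushz"],
}


def expand(flags):
    """Recursively replace aggregate flags with their documented equivalents."""
    out = []
    for flag in flags:
        if flag in ALIASES:
            out.extend(expand(ALIASES[flag]))
        else:
            out.append(flag)
    return out


print(" ".join(expand(["-fastsse"])))
# -O2 -Munroll -Mnoframe -Mscalarsse -Mvect=sse -Mcache_align -Mflushz
```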

	-Mcache_align
		Align unconstrained objects of length greater than or equal to 16 bytes on
		cache-line boundaries. An unconstrained object is a data object that is not
		a member of an aggregate structure or common block. This option does
		not affect the alignment of allocatable or automatic arrays.

		Note: To effect cache-line alignment of stack-based local variables, the
		main program or function must be compiled with -Mcache_align.
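
	The arithmetic behind cache-line alignment can be sketched as follows.
	The 64-byte line size is an assumption (this document does not give
	one), and align_up and is_candidate are hypothetical helpers that only
	mirror the rule described above, not the compiler's actual logic.

```python
CACHE_LINE = 64  # assumed cache-line size in bytes; not stated in this document


def align_up(address: int, alignment: int = CACHE_LINE) -> int:
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


def is_candidate(size_bytes: int) -> bool:
    """Objects of at least 16 bytes are aligned, per the description above."""
    return size_bytes >= 16


print(align_up(100))     # 128
print(align_up(128))     # 128 (already aligned)
print(is_candidate(8))   # False
print(is_candidate(16))  # True
```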

	-Mflushz
		Set SSE MXCSR register to flush-to-zero mode.
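
	The effect of flush-to-zero mode can be modelled in software.  The
	sketch below is only an emulation for IEEE double precision: Python
	does not set the MXCSR register, and flush_to_zero is a hypothetical
	helper that replaces a denormal value with a zero of the same sign.

```python
import math
import sys

SMALLEST_NORMAL = sys.float_info.min  # smallest positive normal double


def flush_to_zero(x: float) -> float:
    """Emulate flush-to-zero: a denormal value becomes a signed zero."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return math.copysign(0.0, x)
    return x


tiny = SMALLEST_NORMAL / 4       # denormal under gradual underflow
print(tiny > 0.0)                # True: the value is kept as a denormal
print(flush_to_zero(tiny))       # 0.0
print(flush_to_zero(1.5))        # 1.5 (normal values are unchanged)
```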
	
	-Mnoframe 
		Eliminate operations that set up a true stack frame pointer for functions.

	-Mnosmart 
		Do not run the Smart assembly rewrite tool, which performs post-compilation
		linear assembly scheduling and optimization.

	-Mscalarsse 
		Use SSE (Streaming SIMD Extensions; SIMD stands for Single Instruction,
		Multiple Data) and SSE2 instructions to perform scalar floating-point
		operations.  This assumes the user has an assembler capable of interpreting
		SSE/SSE2 instructions, as in later versions of Linux.  This implies -Mflushz.

	-Munroll 
		Invokes the loop unroller.  This also sets the optimization level to 2 if the 
		level is set to less than 2.
	
		c:m	Instructs the compiler to completely unroll loops with a
			constant loop count less than or equal to m, a supplied constant.
			If m is not supplied, it defaults to 4.

		n:u	Instructs the compiler to unroll u times any loop that is
			not completely unrolled or that has a non-constant loop count.
			If u is not supplied, the unroller computes the number of times a
			candidate loop is unrolled.
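
	The transformation that the n:u suboption requests can be shown by
	hand.  The sketch below unrolls a summation loop 4 times (u = 4) and
	adds a remainder loop for leftover iterations.  It is an illustration
	at source level only; the compiler performs the equivalent rewrite on
	its internal representation, and the function names are hypothetical.

```python
def sum_rolled(a):
    """Original loop: one addition per iteration."""
    total = 0.0
    for x in a:
        total += x
    return total


def sum_unrolled_by_4(a):
    """Same loop unrolled 4 times, plus a remainder loop."""
    total = 0.0
    n = len(a)
    main = n - n % 4              # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):   # unrolled body: 4 iterations per trip
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
    for i in range(main, n):      # leftover 0 to 3 iterations
        total += a[i]
    return total


data = [float(i) for i in range(10)]
print(sum_rolled(data), sum_unrolled_by_4(data))  # 45.0 45.0
```

	The additions happen in the same order in both versions, so the
	results are identical; only the loop overhead per element changes.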


Concur	Automatic Parallelization

	-Mconcur
		Instructs the compiler to enable auto-concurrentization of loops.  If -Mconcur
		is specified, multiple processors will be used to execute loops that the
		compiler determines to be parallelizable. 

	-Mconcur=altcode:n
		Instructs the parallelizer to generate alternate scalar code for
		parallelized loops.  If altcode is specified without arguments, the
		parallelizer determines an appropriate cutoff length and generates
		scalar code to be executed whenever the loop count is less than 
		or equal to that length.  If altcode:n is specified, the scalar 
		altcode is executed whenever the loop count is less than or equal to n.
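
	The altcode:n rule amounts to a run-time choice between two versions
	of the same loop.  The sketch below imitates that choice with a
	thread pool standing in for the parallelized loop; run_loop,
	scalar_sum and parallel_sum are hypothetical names, and the code the
	compiler actually generates is not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor


def scalar_sum(data):
    """Scalar alternate code: a plain sequential loop."""
    return sum(data)


def parallel_sum(data, workers=2):
    """Stand-in for the parallelized loop: sum chunks concurrently."""
    chunk = (len(data) + workers - 1) // workers
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, parts))


def run_loop(data, cutoff):
    """Use the scalar version when the loop count is <= cutoff."""
    if len(data) <= cutoff:
        return "scalar", scalar_sum(data)
    return "parallel", parallel_sum(data)


print(run_loop(list(range(10)), cutoff=100))    # ('scalar', 45)
print(run_loop(list(range(1000)), cutoff=100))  # ('parallel', 499500)
```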

	-Mconcur=noaltcode
		The parallelized version of the loop is always executed.

	-Mconcur=dist:block
		Contiguous blocks of iterations of a parallelizable loop are
		assigned to the available processors.

	-Mconcur=dist:cyclic
		The outermost parallelizable loop in any loop nest is parallelized.
		If a parallelized loop is innermost, its iterations are allocated to 
		processors cyclically.
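
	The difference between block and cyclic distribution is easiest to
	see by listing which iterations each processor receives.  The sketch
	below is illustrative only: the function names are hypothetical, and
	the way leftover iterations are spread in the block case is an
	assumption, not something this document specifies.

```python
def block_distribution(n_iters, n_procs):
    """Each processor gets one contiguous run of iterations."""
    base, extra = divmod(n_iters, n_procs)
    out, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # assumed handling of leftovers
        out.append(list(range(start, start + size)))
        start += size
    return out


def cyclic_distribution(n_iters, n_procs):
    """Iterations are dealt out to processors in round-robin order."""
    return [list(range(p, n_iters, n_procs)) for p in range(n_procs)]


print(block_distribution(8, 2))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(cyclic_distribution(8, 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```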

	-Mconcur=cncall
		Calls in parallel loops are safe to parallelize.  Loops containing
		calls are candidates for parallelization.

	-Mconcur=noassoc
		Disables parallelization of loops with reductions.


Vect	Vectorizer

	-Mvect	Without any suboptions -Mvect is equivalent to "-Mvect=assoc,cachesize:262144".

	-Mvect=altcode
		Instructs the vectorizer to generate alternate scalar code for
		vectorized loops.

	-Mvect=assoc
		Instructs the vectorizer to enable certain associativity conversions that
		can change the results of a computation due to roundoff error.
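
	Floating-point addition is not associative, which is why such
	conversions can change results.  The short example below, in IEEE
	double precision, adds the same three values in two different
	orders; it demonstrates the roundoff effect in general and is not
	taken from any benchmark.

```python
a, b, c = 1e16, -1e16, 1.0

left_to_right = (a + b) + c   # a + b is exactly 0.0, then 0.0 + 1.0
reassociated = a + (b + c)    # b + c rounds back to -1e16, so the 1.0 is lost

print(left_to_right, reassociated)  # 1.0 0.0
```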

	-Mvect=prefetch
		Instructs the vectorizer to search for vectorizable loops and,
		where possible, make use of prefetch instructions.

	-Mvect=cachesize:n
		Instructs the vectorizer, when performing cache tiling optimizations, to 
		assume a cache size of n.  The default is 262144 bytes.

	-Mvect=sse 
		Instructs the vectorizer to search for vectorizable loops and, where
		possible, use SSE or SSE2 and prefetch instructions
		(depending on which processor is targeted).


IPA	InterProcedural Analyzer 

	-Mipa=align
		Instructs the IPA to recognize when pointer targets are all cache-line 
		aligned, allowing better SSE code generation.

	-Mipa=arg 
		Instructs the IPA to remove arguments replaced by -Mipa=ptr,const.

	-Mipa=const 
		Enable propagation of constants across procedure calls.

	-Mipa=fast 
		Equivalent to: -Mipa=const,globals,localarg,ptr,vestigial 

	-Mipa=globals 
		Instructs the IPA to optimize references to globals when not used in procedure calls.		

	-Mipa=localarg 
		Externalizes local variables for use with -Mipa=arg.

	-Mipa=ptr 
		Instructs the IPA to perform pointer disambiguation across procedure calls.

	-Mipa=vestigial 
		Instructs the IPA to eliminate functions that are not called.


rm -f *.da *.life analyz_prbrob.out 
Remove any profile feedback information from previous runs. 

BIOS Setting Definitions

DRAM Interleave
	Defines whether data will be interleaved among the four data 
	banks within individual DRAMs.

Node Interleave 
	Defines whether memory addresses alternate between the two
	processors in 4KB blocks.
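
	With node interleaving enabled, the processor that owns a given
	address follows from simple arithmetic.  The sketch below assumes two
	processors, 4KB blocks, and that the first block maps to processor 0;
	the last point is an assumption, and node_for_address is a
	hypothetical helper.

```python
BLOCK_SIZE = 4096  # 4KB interleave granularity
NODES = 2          # two processors, as described above


def node_for_address(addr: int) -> int:
    """Return the processor that owns addr when blocks alternate between nodes."""
    return (addr // BLOCK_SIZE) % NODES


for addr in (0, 4095, 4096, 8192):
    print(addr, node_for_address(addr))
```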

ACPI SRAT
	Defines whether the Static Resource Affinity Table is exported by
	the BIOS to a location where the operating system can see it.  The SRAT may
	only be exported when Node Interleave is disabled.


Special Options for ch_gm device:

	-s  Close stdin - can run in background without tty input problems.

	-r  Clean up remote shared memory files.  These should be removed
	    automatically, but this option forces the cleanup.

	-pg <file>      Specifies the procgroup file.

	-wd <path>      Specifies the working directory.

	-ddt            Specifies DDT debugging session.

	--gm-no-shmem   Disable the shared memory support (enabled by default).

	--gm-wait <n>   Wait <n> seconds between each spawning step.

	--gm-kill <n>   Kill all processes <n> seconds after the first exits.

	--gm-eager <n>  Specifies the Eager/Rendez-vous protocol threshold size.

	--gm-recv <m>   Specifies the receive mode: <polling>, <blocking> or
			<hybrid>.  <polling> is the default.