IBM XL Compiler Flags, Common Operating System Commands and Environment Settings

Compilers: IBM XL C/C++ Version 16.1.0 for Linux

Compilers: IBM XL Fortran Version 16.1.0 for Linux

Libraries: IBM Advance Toolchain version 11.0-0, available for download at: https://ibm.biz/AdvanceToolchain

IBM Post Link Optimizer: IBM Feedback Directed Program Restructuring (FDPR) for Linux on Power 5.6.4-0, available for download at: https://developer.ibm.com/linuxonpower/sdk-packages/

Operating systems: SLES 12 SP3

Sections

This document contains the following sections:

Optimization Flags
Portability Flags
Compiler Flags
Other Flags

Commands and Options Used for Feedback-Directed Optimization

fdprpro is a Feedback Directed Program Restructuring optimization tool that is available for the IBM POWER platform. It can be used optionally during FDO.

An example command to invoke fdprpro for the optimization pass is:


Additional details regarding usage and flags are provided below.

Usage:
fdprpro -a/--action [action] [options] program 
where `program' specifies the input program in the form of an executable or a shared object
[action] can be one of the following:
	anl	 analyze program
	instr	 generate instrumented program for profile gathering
	opt	 generate optimized program
	check_sign	 check FDPR signature in the input program
	sample	 generate script file for collecting sampled profile
[options] can be one of the following:


 Analysis Options:
           
  -aawc, --analyze-assembly-written-csects
          Analyze objects written in Assembly.
  -acf <analysis configuration file>, --analysis-configuration-file <analysis configuration file>
          Provide a configuration file of analysis information (advanced option)
  -asd/-noasd, --analyze-static-data/--noanalyze-static-data
          
  -ifl <file>, --ignored-function-list <file>
          Set the ignored function list. The file contains names of functions
          that are considered unsafe and are therefore not modified


 Instrumentation Options:
           
  -fd <Fdesc>, --file-descriptor <Fdesc>
          Set the file descriptor number to be used when opening the profile
          file. The default of <Fdesc> is set to the maximum-allowed number of
          open files
  -icvp, --instr-call-value-profiling
          Instrument the values of parameters passed in function calls
  -imullX, --mullX-instrumentation
          Perform value profiling of RA and RB operands in mullX instructions
  -iderat, --derat-instrumentation
          Perform value profiling of RA and RB operands in load/store indexed
          instructions
  -issu, --instrumentation-safe-stack-usage
          Ensure that additional stack space is properly allocated for the
          instrumented run. Use this option if your application uses the stack
          extensively (e.g., when the program uses alloca()). Note that this
          option adds extra overhead on instrumentation code
  -iso <offset>, --instrumentation-stack-offset <offset>
          Set the offset from the stack, a negative number, where the
          instrumentation's area for saving registers is kept at runtime. Use
          with care
  -M <addr>, --profile-map <addr>
          Set the shared memory segment address for profiling. Alternative shared
          memory addresses are needed when the instrumented program application
          creates a conflict with the shared-memory addresses preserved for the
          profiling. Typical alternative values are 0x40000000, 0x50000000, ...
          up to 0xC0000000. The default is set to 0x3000000
  -ptm, --profile-to-memory
          Use shared memory key instead of file mapping to obtain a shared memory
          area for the profile data
  -ri/-nori, --register-instrumentation/--noregister-instrumentation
          Instrument/Do not instrument the input program file to collect profile
          information about indirect branches via registers. The default is set
          to collect the profile information
  -sfp/-nosfp, --save-floating-point-registers/--nosave-floating-point-registers
          Save/Do not save floating point registers in instrumented code. The
          default is set to save floating point registers
  -shmkey <key number>, --shared-memory-key <key number>
          Specify a shared memory key to use when creating a shared memory area
          for the profile. The default key is created by hashing the profile
          file name (with ftok).


 Profile Files Options:
           
  -af <prof_file>, --ascii-profile-file <prof_file>
          Set the name of a text format profile file containing profile
          information.
  -aop, --accept-old-profile
          Accept the old profile file collected on previous versions of the input
          program file (requires the -f flag)
  -f <prof_file>, --profile-file <prof_file>
          Set the profile file name. The profile file is created during the
          instrumentation phase and read during the optimization phase. The
          profile file is updated each time you run the instrumented program
  -fdir <prof_file_dir>, --profile-file-directory <prof_file_dir>
          Set the run-time location of the profile file. The profile is
          searched for at this location during the profiling phase. The default
          location is the path given in the profile file name (-f option).
          Applicable only in the instrumentation phase


 Optimization Options:
           
  -A <alignment>, --align-code <alignment>
          Specify code alignment strategy. 1: Use grouping rules of target
          machine (default), 2: Same as 1 but consider also hotness of branch
          targets. See -m for the selected machine model.
  -abb <factor>, --align-basic-blocks <factor>
          Align basic blocks that are hotter than the average by a given (float)
          <factor>. This is a lower-level machine-specific alignment compared to
          --align-code. Value of -1 (the default) disables this option
  -bf, --branch-folding
          Eliminate branch to branch instructions
  -ccc <threshold>, --cold-code-connector <threshold>
          Preserve the original order of code that is executed less frequently
          than the given threshold
  -bp, --branch-prediction
          Set branch prediction bit for conditional branches according to the
          collected profile
  -cbpth, --cold-branch-prediction-threshold
          Set the Cold Branch Prediction Threshold for branch prediction
          optimization. Branches whose execution count relative to the average
          is below this value will be statically predicted. Allowed values are
          between (0,1). Default is -1 - optimization is not applied.
          (Applicable only with the -bp flag)
  -pbp, --preserve-branch-predication
          Preserve branch predication pattern (bc+8) and avoid code reordering
          and branch prediction
  -cbsi, --chain-based-selective-inline
          Perform selective inlining of functions that produce long hot chains of
          code
  -dce, --dead-code-elimination
          Eliminate instructions related to unused local variables within
          frequently executed functions. This is useful mainly after applying
          function inlining optimization
  -dp, --data-prefetch
          Insert data-cache prefetch instructions to improve data-cache
          performance
  -ece, --epilog-code-eliminate
          Reduce code size by grouping common instructions in function epilogs,
          into a single unified code
  -fatc <num_of_bytes>, --fat-const <num_of_bytes>
          Inflate constant areas in code section by adding <num_of_bytes> (entire
          set to 255) to each constant area
  -fatd <num_of_bytes>, --fat-data <num_of_bytes>
          Inflate data section by adding <num_of_bytes> (entire set to 255) to
          each data basic unit
  -fatn <num_of_nops>, --fat-nop <num_of_nops>
          Inflate code section by adding <num_of_nops> NOPs to each code basic
          block
  -bined <binary_editor>, --binary-editor <binary_editor>
          Edit existing binary code (advanced option)
  -hr, --hco-reschedule
          Relocate instructions from frequently executed code to rarely executed
          code areas, when possible
  -hrf <factor>, --hco-resched-factor <factor>
          Set the aggressiveness of the -hr optimization option according to a
          factor value between (0,1), where 0 is the least aggressive factor
          (applicable only with the -hr option)
  -tasr, --toc-anchor-store-reschedule
          Relocate TOC store instructions from frequently executed code to rarely
          executed code areas, when possible
  -i, --inline
          Same as --selective-inline with --inline-small-funcs 12
  -ihf <pct>, --inline-hot-functions <pct>
          Inline all function call sites to functions that have a frequency count
          greater than the given <pct> frequency percentage
  -isf <size>, --inline-small-funcs <size>
          Inline all functions that are smaller than or equal to the given <size>
          in bytes
  -kr, --killed-registers
          Eliminate stores and restores of registers that are killed
          (overwritten) after frequently executed function calls
  -lap, --load-address-propagation
          Eliminate load instructions of variable addresses by re-using
          pre-loaded addresses of adjacent variables
  -las, --load-after-store
          Add NOP instructions to place each load instruction further apart
          following a store instruction that references the same memory address
  -plas, --pattern-based-load-after-store
          Optimizes inefficient memory access patterns in order to avoid
          load-after-store events. 
  -ebplas, --event-based-pattern-based-load-after-store
          Optimizes inefficient memory access patterns in order to avoid
          load-after-store events. The optimization is possible if
          PM_MRK_LSU_REJECT_LHS profile is available
  -rcl, --remove-constant-load
          Reduces the number of load instructions used to bring constant values
          into registers. The parameter is used to control which version of
          optimization is applied, versions from 0 to 3 are available.
  -pvgc <mode>, --print-visual-graph-csect <mode>
          Print a .dot file with CFG information for each csect. Mode 0 is for a
          graph containing full instructions list for each node, 1 is for a
          graph with short nodes description.
  -pvgf <mode>, --print-visual-graph-func <mode>
          Print a .dot file with CFG information for each function. Mode 0 is for
          a graph containing full instructions list for each node, 1 is for a
          graph with short nodes description.
  -lro, --link-register-optimization
          Eliminate saves and restores of the link register in
          frequently-executed functions
  -lu <aggressiveness_factor>, --loop-unroll <aggressiveness_factor>
          Unroll short loops containing one to several basic blocks according to
          an aggressiveness factor between (1,9), where 1 is the least
          aggressive unrolling option for very hot and short loops
  -lun <unrolling_number>, --loop-unrolling-number <unrolling_number>
          Set the number of unrolled iterations in each unrolled loop. The
          allowed range is between (2,50). Default is set to 2. (Applicable only
          with the -lu flag)
  -lux <unrolling_factor>, --loop-unroll-extended <unrolling_factor>
          Unroll hot loops using the given unrolling factor. The allowed values
          are integers that are powers of 2. Value -1 disables the
          optimization, value 1 calculates the unrolling factor automatically,
          given a machine model
  -nop, --nop-removal
          Remove NOP instructions from reordered code
  -sls, --store-load-on-stack-opt
          Optimize store load on stack pattern
  -fmrx, --fmr-to-xxlor
          Replace FMR instructions in reordered code with XXLOR instructions
  -xscpx, --xscpsgndp-to-xxlor
          Replace xscpsgndp instructions in reordered code with XXLOR
          instructions
  -divopt, --divide-optimization
          Replace fdiv/fdivs instructions with fre + fmul/fmuls instructions
  -tslopt, --toc-store-in-loop-optimization
          Move TOC store instructions out of the loop and place them before
          the loop
  -ifopt, --instruction_fusion_optimization
          Put together two instructions suitable for fusion
  -liopt, --loop_invariant_optimization
          Move loop-invariant instructions out of the loop
  -sfopt, --simple_functions_opt
          Inline simple functions (isascii, isdigit)
  -dir, --dependant-instr-resched
          Insert NOPs between dependent instructions
  -O      Switch on basic optimizations only. Same as -RC -nop -bp -bf
  -O2     Switch on less aggressive optimization flags. Same as -O -hr -pto -isf
          8 -tlo -kr -see 0 
  -O3     Switch on aggressive optimization flags. Same as -O2 -RD -isf 12 -si
          -lro -las -vro -btcar (for XCOFF files) -lu 9 -rt 0 -so -see 1 -oderat
          -tslopt
  -O4     Switch on aggressive optimization flags together with aggressive
          function inlining. Same as -O3 -sidf 50 -ihf 20 -sdp 9 -shci 90 and
          -bldcg (for XCOFF files)
  -ocvp, --opt-call-value-profiling
          Specialize function calls according to the values of their passed
          parameters
  -omullX, --mullX-optimization
          Optimize mullX instructions by adding a run-time check on RA and RB and
          performing equivalent operations with lower penalty. The optimization
          requires the use of -imullX in the instrumentation phase
  -oderat, --derat-optimization
          Optimize load/store indexed instructions by adding a run-time check on
          RA and RB and performing equivalent operations with lower penalty. The
          optimization requires the use of -iderat in the instrumentation phase
  -pbsi, --path-based-selective-inline
          Perform selective inlining of dominant hot function calls based on the
          control flow paths leading to hot functions
  -pca, --propagate-constant-area
          Relocate the constant variables area to the top of the code section
          when possible
  -pr/-nopr, --ptrgl-r11/--noptrgl-r11
          Perform/Do not perform removal of R11 load instruction in _ptrgl csect
          (the default is to perform the optimization)
  -pto, --ptrgl-optimization
          Perform optimization of indirect call instructions via registers by
          replacing them with conditional direct jumps
  -ptoht <heatness_threshold>, --ptrgl-optimization-heatness-threshold <heatness_threshold>
          Set the frequency threshold for indirect calls that are to be optimized
          by -pto optimization. Allowed range between 0 and 1. Default is set to
          0.8. (Applicable only with -pto flag)
  -ptosl <limit_size>, --ptrgl-optimization-size-limit <limit_size>
          Set the limit of the number of conditional statements generated by -pto
          optimization. Allowed values are between 1 and 100. Default value is
          set to 3. (Applicable only with the -pto flag)
  -RC, --reorder-code
          Perform code reordering
  -rcaf <aggressiveness_factor>, --reorder-code-aggressivenes-factor <aggressiveness_factor>
          Set the aggressiveness of code reordering optimization. Allowed values
          are [0 | 1 | 2], where 0 preserves the original code order and 2 is
          the most aggressive. Default is set to 1. (Applicable only with the
          -RC flag)
  -rccrf <reversal_factor>, --reorder-code-condition-reversal-factor <reversal_factor>
          Set the threshold fraction that determines when to enable condition
          reversal for each conditional branch during code reordering. Allowed
          input range is between 0.0 and 1.0 where 0.0 tries to preserve
          original condition direction and 1.0 ignores it. Default is set to 0.8
          (Applicable only with the -RC flag)
  -rcctf <termination_factor>, --reorder-code-chain-termination-factor <termination_factor>
          Set the threshold fraction that determines when to terminate each chain
          of basic blocks during code reordering. Allowed input range is between
          0.0 and 1.0 where 0.0 generates long chains and 1.0 creates single
          basic block chains. Default is set to 0.05. (Applicable only with the
          -RC flag)
  -RD, --reorder-data
          Perform static data reordering
  -ippcf, --instrument-for-path-profiling
          Perform cross function path profiling instrumentation
  -ppcf, --optimize-with-path-profiling
          Perform cross function path profiling optimization
  -rmte, --remove-multiple-toc-entries
          Remove multiple TOC entries pointing to the same location in the input
          program file
  -rt <removal_factor>, --reduce-toc <removal_factor>
          Perform removal of TOC entries according to a removal factor between
          (0,1), where 0 removes non-accessed TOC entries only and 1 removes all
          possible TOC entries
  -rtb, --remove-traceback-tables
          Remove traceback tables in reordered code
  -sdp <aggressiveness_factor>, --stride-data-prefetch <aggressiveness_factor>
          Perform data prefetching within frequently executed loops based on
          stride analysis, according to an aggressiveness factor between (1,9),
          where 1 is the least aggressive
  -sdpila <instructions_number>, --stride-data-prefetch-instruction-look-ahead <instructions_number>
          Set the number of instructions for which data is prefetched into the
          cache ahead of time. Default value is platform dependent. (Applicable
          only with the -sdp flag)
  -sdpms <stride_min_size>, --stride-data-prefetch-min-size <stride_min_size>
          Set the minimal stride size in bytes, for which data will be considered
          a candidate for prefetching. Default value is set to 128 bytes.
          (Applicable only with the -sdp flag)
  -ebp <evt_based_prefetch>, --event-based-prefetch <evt_based_prefetch>
          Perform data prefetching based on the events file
  -ebpla <instructions_number>, --event-based-prefetch-look-ahead <instructions_number>
          Set the number of instructions for which event based prefetch is
          performed. Default value is platform dependent. (Applicable only with
          the -ebp flag)
  -vecopt
          Use vector optimizations (remove double xxswapd, remove redundant load,
          remove xxlnand and replace data with complemented data, replace lxvd2x
          from rodata and xxswapd by lvx)
  -see <level>
          Use simplified prolog/epilog for functions that perform conditional
          early-exit. Use basic optimization with <level>=0 and maximal with
          <level>=1
  -shci <pct>, --selective-hot-code-inline <pct>
          Perform selective inlining of functions in order to decrease the total
          number of execution counts, so that only functions with hotness above
          the given percentage are inlined
  -si, --selective-inline
          Perform selective inlining of dominant hot function calls
  -chca, --convert_hole_to_constareas
          Convert Holes In SafeCSects To ConstAreas
  -sidf <percentage_factor>, --selective-inline-dominant-factor <percentage_factor>
          Set a dominant factor percentage for selective inline optimization. The
          allowed range is between 0 and 100. Default is set to 80. (Applicable
          only with the -si and -pbsi flags)
  -siht <frequency_factor>, --selective-inline-hotness-threshold <frequency_factor>
          Set a hotness threshold factor percentage for selective inline
          optimization to inline all dominant function calls that have a
          frequency count greater than the given frequency percentage. Default
          is set to 100. (Applicable only with the -si and -pbsi flags)
  -slbp, --spinlock-branch-prediction
          Perform branch prediction bit setting for conditional branches in
          spinlock code containing l*arx and st*cx instructions. (Applicable
          after -bp flag)
  -sldp, --spinlock-data-prefetch
          Perform data prefetching for memory access instructions preceding
          spinlock code containing l*arx and st*cx instructions
  -sll <Lib1:Prof1,...,LibN:ProfN>, --static-link-libraries <Lib1:Prof1,...,LibN:ProfN>
          Statically link hot code from specified dynamically linked libraries to
          the input program. The parameter consists of a comma-separated list of
          libraries and their profiles. IMPORTANT: Licensing rights of specified
          libraries should be observed when applying this copying optimization
  -sllht <hotness_threshold>, --static-link-libraries-hotness-threshold <hotness_threshold>
          Set hotness threshold for the --static-link-libraries optimization. The
          allowed input range is between 0 (least aggressive) and 1, or -1,
          which does not require a profile and selects all code that might be
          called by the input program from the given libraries. Default is set
          at 0.5
  -so, --stack-optimization
          Reduce the stack frame size of functions that are called with a small
          number of arguments
  -spc, --shortcut-plt-calls
          Shortcut PLT calls in shared libraries to local functions if they
          exist. Note: Resolving to external symbols is disabled for such calls
  -tb, --preserve-traceback-tables
          Force the restructuring of traceback tables in reordered code. If -tb
          option is omitted, traceback tables are automatically included only
          for C++ applications that use the Try & Catch mechanism
  -tlo, --tocload-optimization
          Replace each load instruction that references the TOC with a
          corresponding add-immediate instruction via the TOC anchor register,
          where possible
  -vro, --volatile-registers-optimization
          Eliminate stores and restores of non-volatile registers in frequently
          executed functions by using available volatile registers
  -vrox, --volatile-registers-extended-optimization
          Eliminate stores and restores of non-volatile registers in frequently
          executed functions by using available volatile registers, the extended
          version supports FP registers and transparency
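
The -O, -O2, -O3 and -O4 entries above are each defined as "Same as" a list of other flags. As a rough illustration only, the Python sketch below expands a level into its constituent flags by following those lists. The table and the expand() helper are hypothetical and are not part of fdprpro. The listing does not say how fdprpro resolves a flag that appears twice with different values (for example -isf 8 followed by -isf 12), so the sketch simply concatenates the lists in the order given.

```python
# Hypothetical table transcribed from the "Same as ..." text of the -O entries.
O_LEVELS = {
    "-O": ["-RC", "-nop", "-bp", "-bf"],
    "-O2": ["-O", "-hr", "-pto", "-isf", "8", "-tlo", "-kr", "-see", "0"],
    "-O3": ["-O2", "-RD", "-isf", "12", "-si", "-lro", "-las", "-vro", "-btcar",
            "-lu", "9", "-rt", "0", "-so", "-see", "1", "-oderat", "-tslopt"],
    "-O4": ["-O3", "-sidf", "50", "-ihf", "20", "-sdp", "9", "-shci", "90",
            "-bldcg"],
}

# Flags that the listing marks "(for XCOFF files)".
XCOFF_ONLY = {"-btcar", "-bldcg"}


def expand(level, xcoff=False):
    """Return the flat flag list that a given -O level stands for."""
    flags = []
    for token in O_LEVELS[level]:
        if token in O_LEVELS:
            # A nested level such as "-O2" inside "-O3": expand it in place.
            flags.extend(expand(token, xcoff))
        elif token in XCOFF_ONLY and not xcoff:
            # Skip XCOFF-specific flags unless the caller asks for them.
            continue
        else:
            flags.append(token)
    return flags


if __name__ == "__main__":
    print(" ".join(expand("-O3")))
```

Keeping the table as data makes it easy to compare against the help text if the "Same as" lists differ in another FDPR release.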


 Output Options:
           
  -cep, --complement-edge-profile
          Complements partial profile information given for the basic blocks'
          frequencies by adding missing basic block-to-basic block edge counts
  -d, --disassemble-text
          Print the disassembled text section of the output program into
          <output_file>.dis_text file
  -dap, --dump-ascii-profile
          Dump profile information in ASCII format into <program>.aprof (requires
          the -f flag).
  -db, --disassemble-bss
          Print the disassembled bss section of the output program into
          <output_file>.dis_bss file
  -dd, --disassemble-data
          Print the disassembled data section of the output program into
          <output_file>.dis_data file
  -diap, --dump-initial-ascii-profile
          Dump the given profile information in ASCII format into
          <program>.aprof.init (requires the -f flag)
  -dim, --dump-instruction-mix
          Dump instruction mix statistics based on gathered profile information
  -dm, --dump-mapper
          Print a map of basic blocks and static variables with their respective
          new -> old addresses into a <program>.mapper file
  -o <output_file>, --output-file <output_file>
          Set the name of the output file. The default instrumented file is
          <program>.instr. The default optimized file is <program>.fdpr
  -scl, --show-constant-load
          Add annotations to the fdpr disassembly for load instructions used to
          bring constant values into registers (requires the -d flag)
  -ppcf, --print-prof-counts-file
          Print a text format of the profiling counters into a <program>.counts
          file (requires the -f flag).
  -sf, --strip-file
          Strip the output file
  -simo, --single-input-multiple-outputs
          Optimize in parallel into multiple outputs as specified by option sets
          read from stdin
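
Several of the output options above write to files whose names are derived from <program> or <output_file>. The Python sketch below collects those naming rules in one place, reading the descriptions literally. The helper names are hypothetical, and the assumption that <output_file> falls back to the default <program>.instr or <program>.fdpr name when -o is omitted is an inference from the -o entry, not something the listing states for the disassembly files.

```python
# Default output suffix per action, from the -o entry.
DEFAULT_SUFFIX = {"instr": ".instr", "opt": ".fdpr"}


def default_output(program, action):
    """Default output file name when -o is not given."""
    return program + DEFAULT_SUFFIX[action]


def side_files(program, action, flags, output_file=None):
    """List the extra files that the given output flags are documented to write."""
    out = output_file or default_output(program, action)  # assumed fallback
    names = {
        "-d": out + ".dis_text",            # disassembled text section
        "-db": out + ".dis_bss",            # disassembled bss section
        "-dd": out + ".dis_data",           # disassembled data section
        "-dap": program + ".aprof",         # ASCII profile dump
        "-diap": program + ".aprof.init",   # initial ASCII profile dump
        "-dm": program + ".mapper",         # new -> old address map
        "-ppcf": program + ".counts",       # profiling counters
    }
    return [names[flag] for flag in flags if flag in names]


if __name__ == "__main__":
    print(side_files("app", "opt", ["-d", "-dm"]))
```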


 General Options:
           
  -h, --help
          Print the online help
  -j <jour_file>, --journal <jour_file>
          Output optimization journal information to <jour_file>
  -smt, --smt_mode
          Set SMT mode (1: ST, 2: (SMT2-shared, SMT2-split), 4: SMT4, 8: SMT8)
  -m <machine-model>, --machine <machine-model>
          Generate code for the specified machine model. Target machine can be
          one of the following models: power2, power3, ppc405, ppc440, power4,
          ppc970, power5, power6, power7, ppe, spe, spe_edp, z10, z9. Default is
          power7
  -q, --quiet
          Set the output mode to quiet, suppressing informational messages
  -st <stat_file>, --statistics <stat_file>
          Output statistics information to <stat_file>. If <stat_file> is '-',
          the output goes to the standard output. See --verbose for the default
  -v <level>, --verbose <level>
          Set verbose output mode level. When set, various statistics about the
          output program are printed into the file <program>.stat. Allowed level
          range is between 0 and 3. Default is set to 0
  -V, --version
          Print the version number
  -w <level>, --warning-level <level>
          Set the warning level so that only messages of this level and below
          are printed. The levels are: 1: errors, 2: warnings, 3: debug
          warnings, 4: debug information. Default is 2
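
The General Options above state a fixed set of machine models and numeric ranges for -v, -w and -smt. A wrapper script could check these values before invoking the tool. The Python sketch below is one hypothetical way to do that; check_general() is not an fdprpro feature, and the accepted values are copied from this listing only, so a given FDPR release may accept more.

```python
# Values copied from the -m, -smt, -v and -w entries in this listing.
MACHINE_MODELS = {
    "power2", "power3", "ppc405", "ppc440", "power4", "ppc970", "power5",
    "power6", "power7", "ppe", "spe", "spe_edp", "z10", "z9",
}
SMT_MODES = {1, 2, 4, 8}


def check_general(machine="power7", verbose=0, warning=2, smt=None):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if machine not in MACHINE_MODELS:
        problems.append("-m: unknown machine model %r" % machine)
    if not 0 <= verbose <= 3:
        problems.append("-v: level must be between 0 and 3")
    if not 1 <= warning <= 4:
        problems.append("-w: level must be between 1 and 4")
    if smt is not None and smt not in SMT_MODES:
        problems.append("-smt: mode must be one of 1, 2, 4, 8")
    return problems


if __name__ == "__main__":
    print(check_general(machine="not-a-model", verbose=5))
```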
- Analysis options should be specified identically in the instrumentation and optimization phases
- Some options are relevant only to specific platforms
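
Putting the pieces together, the usage line and the -f description imply a cycle of instrumenting the program, running it to collect a profile, and then optimizing. The Python sketch below builds the two fdprpro command lines for such a cycle, using only flags documented above (-a, -f, -o, -O3) and passing the same analysis options to both phases, as the first note requires. The function name, the program name "app" and the profile name "app.prof" are hypothetical. The sketch only prints the commands and does not run fdprpro.

```python
import shlex


def fdpr_commands(program, profile, analysis_opts=(), opt_flags=("-O3",)):
    """Return (instrument_cmd, optimize_cmd) as argument lists."""
    shared = list(analysis_opts)  # identical analysis options in both phases
    instr_cmd = ["fdprpro", "-a", "instr", *shared,
                 "-f", profile, "-o", program + ".instr", program]
    opt_cmd = ["fdprpro", "-a", "opt", *shared,
               "-f", profile, *opt_flags, "-o", program + ".fdpr", program]
    return instr_cmd, opt_cmd


if __name__ == "__main__":
    instr_cmd, opt_cmd = fdpr_commands("app", "app.prof", analysis_opts=["-aawc"])
    print(shlex.join(instr_cmd))
    print("# run app.instr on a training workload here to fill app.prof")
    print(shlex.join(opt_cmd))
```

Building the commands as lists rather than strings avoids shell quoting problems if a program or profile path contains spaces.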