Copyright © 2012 Intel Corporation. All Rights Reserved.
Invoke the Intel C compiler for IA32 applications.
You need binutils 2.16.91.0.7 or later with this compiler to support new instructions on Intel Core 2 processors
Invoke the Intel C++ compiler for IA32 and Intel 64 applications.
You need binutils 2.16.91.0.7 or later with this compiler to support new instructions on Intel Core 2 processors
Invoke the Intel Fortran compiler for IA32 and Intel 64 applications.
You need binutils 2.16.91.0.7 or later with this compiler to support new instructions on Intel Core 2 processors
specify source files are in free format. Same as -FR. -nofree indicates fixed format
-mcmodel=<size>
use a specific memory model to generate code and store data
small - Restricts code and data to the first 2GB of address space (DEFAULT)
medium - Restricts code to the first 2GB; it places no memory restriction on data
large - Places no memory restriction on code or data
-mcmodel=<size>
use a specific memory model to generate code and store data
small - Restricts code and data to the first 2GB of address space (DEFAULT)
medium - Restricts code to the first 2GB; it places no memory restriction on data
large - Places no memory restriction on code or data
enable language support for
c99 enable C99 support for C programs
c++11 enable C++11 experimental support for C++ programs
c++0x same as c++11
Enables O2 optimizations plus more aggressive optimizations,
such as prefetching, scalar replacement, and loop and memory
access transformations. Enables optimizations for maximum speed,
such as:
- Loop unrolling, including instruction scheduling
- Code replication to eliminate branches
- Padding the size of certain power-of-two arrays to allow
more efficient cache use.
On IA-32 and Intel EM64T processors, when O3 is used with options
-ax or -x (Linux) or with options /Qax or /Qx (Windows), the compiler
performs more aggressive data dependency analysis than for O2, which
may result in longer compilation times.
The O3 optimizations may not cause higher performance unless loop and
memory access transformations take place. The optimizations may slow
down code in some cases compared to O2 optimizations.
The O3 option is recommended for applications that have loops that heavily
use floating-point calculations and process large data sets. On IA-32
Windows platforms, -O3 sets the following:
/GF (/Qvc7 and above), /Gf (/Qvc6 and below), and /Ob2
Enable the compiler to generate multi-threaded code based on the OpenMP* directives
-ipo[n]
Multi-file ip optimizations that includes:
- inline function expansion
- interprocedural constant propogation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
(n - number of multi-file objects)
Code is optimized for Intel(R) processors with support for CORE-AVX512 instructions. The resulting code may contain unconditional use of features that are not supported on other processors. This option also enables new optimizations in addition to Intel processor-specific optimizations including advanced data layout and code restructuring optimizations to improve memory accesses for Intel processors.
Do not use this option if you are executing a program on a processor that is not an Intel processor. If you use this option on a non-compatible processor to compile the main program (in Fortran) or the function main() in C/C++, the program will display a fatal run-time error if they are executed on unsupported processors.
Enable/disable(DEFAULT) use of ANSI aliasing rules in optimizations; user asserts that the program adheres to these rules.
Enables O2 optimizations plus more aggressive optimizations,
such as prefetching, scalar replacement, and loop and memory
access transformations. Enables optimizations for maximum speed,
such as:
- Loop unrolling, including instruction scheduling
- Code replication to eliminate branches
- Padding the size of certain power-of-two arrays to allow
more efficient cache use.
On IA-32 and Intel EM64T processors, when O3 is used with options
-ax or -x (Linux) or with options /Qax or /Qx (Windows), the compiler
performs more aggressive data dependency analysis than for O2, which
may result in longer compilation times.
The O3 optimizations may not cause higher performance unless loop and
memory access transformations take place. The optimizations may slow
down code in some cases compared to O2 optimizations.
The O3 option is recommended for applications that have loops that heavily
use floating-point calculations and process large data sets. On IA-32
Windows platforms, -O3 sets the following:
/GF (/Qvc7 and above), /Gf (/Qvc6 and below), and /Ob2
Enable the compiler to generate multi-threaded code based on the OpenMP* directives
-ipo[n]
Multi-file ip optimizations that includes:
- inline function expansion
- interprocedural constant propogation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
(n - number of multi-file objects)
Code is optimized for Intel(R) processors with support for CORE-AVX512 instructions. The resulting code may contain unconditional use of features that are not supported on other processors. This option also enables new optimizations in addition to Intel processor-specific optimizations including advanced data layout and code restructuring optimizations to improve memory accesses for Intel processors.
Do not use this option if you are executing a program on a processor that is not an Intel processor. If you use this option on a non-compatible processor to compile the main program (in Fortran) or the function main() in C/C++, the program will display a fatal run-time error if they are executed on unsupported processors.
Enable/disable(DEFAULT) use of ANSI aliasing rules in optimizations; user asserts that the program adheres to these rules.
Enables O2 optimizations plus more aggressive optimizations,
such as prefetching, scalar replacement, and loop and memory
access transformations. Enables optimizations for maximum speed,
such as:
- Loop unrolling, including instruction scheduling
- Code replication to eliminate branches
- Padding the size of certain power-of-two arrays to allow
more efficient cache use.
On IA-32 and Intel EM64T processors, when O3 is used with options
-ax or -x (Linux) or with options /Qax or /Qx (Windows), the compiler
performs more aggressive data dependency analysis than for O2, which
may result in longer compilation times.
The O3 optimizations may not cause higher performance unless loop and
memory access transformations take place. The optimizations may slow
down code in some cases compared to O2 optimizations.
The O3 option is recommended for applications that have loops that heavily
use floating-point calculations and process large data sets. On IA-32
Windows platforms, -O3 sets the following:
/GF (/Qvc7 and above), /Gf (/Qvc6 and below), and /Ob2
Enable the compiler to generate multi-threaded code based on the OpenMP* directives
-ipo[n]
Multi-file ip optimizations that includes:
- inline function expansion
- interprocedural constant propogation
- dead code elimination
- propagation of function characteristics
- passing arguments in registers
- loop-invariant code motion
(n - number of multi-file objects)
Code is optimized for Intel(R) processors with support for CORE-AVX512 instructions. The resulting code may contain unconditional use of features that are not supported on other processors. This option also enables new optimizations in addition to Intel processor-specific optimizations including advanced data layout and code restructuring optimizations to improve memory accesses for Intel processors.
Do not use this option if you are executing a program on a processor that is not an Intel processor. If you use this option on a non-compatible processor to compile the main program (in Fortran) or the function main() in C/C++, the program will display a fatal run-time error if they are executed on unsupported processors.
specify how data items are aligned
keywords: all (same as -align), none (same as -noalign),
[no]commons, [no]dcommons,
[no]qcommons, [no]zcommons,
rec1byte, rec2byte, rec4byte,
rec8byte, rec16byte, rec32byte,
array8byte, array16byte, array32byte,
array64byte, array128byte, array256byte,
[no]records, [no]sequence
This section contains descriptions of flags that were included implicitly by other flags, but which do not have a permanent home at SPEC.
This option enables read only string-pooling optimization.
This option enables read/write string-pooling optimization.
Specifies the level of inline function expansion.
Ob0 - Disables inlining of user-defined functions. Note that statement functions are always inlined.
Ob1 - Enables inlining when an inline keyword or an inline attribute is specified. Also enables inlining according to the C++ language.
Ob2 - Enables inlining of any function at the compiler's discretion.
Enables optimizations for speed. This is the generally recommended
optimization level. This option also enables:
- Inlining of intrinsics
- Intra-file interprocedural optimizations, which include:
- inlining
- constant propagation
- forward substitution
- routine attribute propagation
- variable address-taken analysis
- dead static function elimination
- removal of unreferenced variables
- The following capabilities for performance gain:
- constant propagation
- copy propagation
- dead-code elimination
- global register allocation
- global instruction scheduling and control speculation
- loop unrolling
- optimized code selection
- partial redundancy elimination
- strength reduction/induction variable simplification
- variable renaming
- exception handling optimizations
- tail recursions
- peephole optimizations
- structure assignment lowering and optimizations
- dead store elimination
On IA-32 Windows platforms, -O2 sets the following:
/Og, /Oi-, /Os, /Oy, /Ob2, /GF (/Qvc7 and above), /Gf (/Qvc6 and below), /Gs, and /Gy.
Disables inline expansion of all intrinsic functions.
This option disables stack-checking for routines with 4096 bytes of local variables and compiler temporaries.
Allows use of EBP as a general-purpose register in optimizations.
This option tells the compiler to separate functions into COMDATs for the linker.
This option enables most speed optimizations, but disables some that increase code size for a small speed benefit.
This option enables global optimizations.
Enables optimizations for speed and disables some optimizations that
increase code size and affect speed.
To limit code size, this option:
- Enables global optimization; this includes data-flow analysis,
code motion, strength reduction and test replacement, split-lifetime
analysis, and instruction scheduling.
- Disables intrinsic recognition and intrinsics inlining.
The O1 option may improve performance for applications with very large
code size, many branches, and execution time not dominated by code within loops.
On IA-32 Windows platforms, -O1 sets the following:
/Qunroll0, /Oi-, /Op-, /Oy, /Gy, /Os, /GF (/Qvc7 and above), /Gf (/Qvc6 and below), /Ob2, and /Og
Tells the compiler the maximum number of times to unroll loops.
Disables conformance to the ANSI C and IEEE 754 standards for floating-point arithmetic.
KMP_AFFINITY
The KMP_AFFINITY environment variable uses the following general syntax:
Syntax |
---|
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>] |
For example, to list a machine topology map, specify KMP_AFFINITY=verbose,none to use a modifier of verbose and a type of none.
The following table describes the supported specific arguments.
Argument |
Default |
Description |
---|---|---|
noverbose respect granularity=core |
Optional. String consisting of keyword and specifier.
|
|
none |
Required string. Indicates the thread affinity to use.
The logical and physical types are deprecated but supported for backward compatibility. |
|
0 |
Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. | |
0 |
Optional. Positive integer value. Not valid with type values of explicit, none, or disabled. |
Type is the only required argument.
Does not bind OpenMP threads to particular thread contexts; however, if the operating system supports affinity, the compiler still uses the OpenMP thread affinity interface to determine machine topology. Specify KMP_AFFINITY=verbose,none to list a machine topology map.
Specifying compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n> OpenMP thread was placed. For example, in a topology map, the nearer a node is to the root, the more significance the node has when sorting the threads.
Specifying disabled completely disables the thread affinity interfaces. This forces the OpenMP run-time library to behave as if the affinity interface was not supported by the operating system. This includes the low-level API interfaces such as kmp_set_affinity and kmp_get_affinity, which have no effect and will return a nonzero error code.
Specifying explicit assigns OpenMP threads to a list of OS proc IDs that have been explicitly specified by using the proclist= modifier, which is required for this affinity type.
Specifying scatter distributes the threads as evenly as possible across the entire system. scatter is the opposite of compact; so the leaves of the node are most significant when sorting through the machine topology map.
Types logical and physical are deprecated and may become unsupported in a future release. Both are supported for backward compatibility.
For logical and physical affinity types, a single trailing integer is interpreted as an offset specifier instead of a permute specifier. In contrast, with compact and scatter types, a single trailing integer is interpreted as a permute specifier.
Specifying logical assigns OpenMP threads to consecutive logical processors, which are also called hardware thread contexts. The type is equivalent to compact, except that the permute specifier is not allowed. Thus, KMP_AFFINITY=logical,n is equivalent to KMP_AFFINITY=compact,0,n (this equivalence is true regardless of the whether or not a granularity=fine modifier is present).
For both compact and scatter, permute and offset are allowed; however, if you specify only one integer, the compiler interprets the value as a permute specifier. Both permute and offset default to 0.
The permute specifier controls which levels are most significant when sorting the machine topology map. A value for permute forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations.
The offset specifier indicates the starting position for thread assignment.
Modifiers are optional arguments that precede type. If you do not specify a modifier, the noverbose, respect, and granularity=core modifiers are used automatically.
Modifiers are interpreted in order from left to right, and can negate each other. For example, specifying KMP_AFFINITY=verbose,noverbose,scatter is therefore equivalent to setting KMP_AFFINITY=noverbose,scatter, or just KMP_AFFINITY=scatter.
Does not print verbose messages.
Prints messages concerning the supported affinity. The messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts.
Information about binding OpenMP threads to physical thread contexts is indirectly shown in the form of the mappings between hardware thread contexts and the operating system (OS) processor (proc) IDs. The affinity mask for each OpenMP thread is printed as a set of OS processor IDs.
KMP_LIBRARY
KMP_LIBRARY = { throughput | turnaround | serial }, Selects the OpenMP run-time library execution mode. The options for the variable value are throughput, turnaround, and serial.
The compiler with OpenMP enables you to run an application under different execution modes that can be specified at run time. The libraries support the serial, turnaround, and throughput modes.
The serial mode forces parallel applications to run on a single processor.
In a dedicated (batch or single user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively utilize all of the processors all of the time. The turnaround mode is designed to keep active all of the processors involved in the parallel computation in order to minimize the execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads.
Avoid over-allocating system resources. This occurs if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs.
In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads while waiting for more parallel work.
The throughput mode is designed to make the program aware of its environment (that is, the system load) and to adjust its resource usage to produce efficient execution in a dynamic environment. This mode is the default.
KMP_BLOCKTIME
KMP_BLOCKTIME = value. Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.Use the optional character suffixes: s (seconds), m (minutes), h (hours), or d (days) to specify the units.Specify infinite for an unlimited wait time.
KMP_STACKSIZE
KMP_STACKSIZE = value. Sets the number of bytes to allocate for each OpenMP* thread to use as the private stack for the thread. Recommended size is 16m. Use the optional suffixes: b (bytes), k (kilobytes), m (megabytes), g (gigabytes), or t (terabytes) to specify the units. This variable does not affect the native operating system threads created by the user program nor the thread executing the sequential part of an OpenMP* program or parallel programs created using -parallel.
OMP_NUM_THREADS
Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel. Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
OMP_DYNAMIC
OMP_DYNAMIC={ 1 | 0 } Enables (1, true) or disables (0,false) the dynamic adjustment of the number of threads.
OMP_SCHEDULE
OMP_SCHEDULE={ type,[chunk size]} Controls the scheduling of the for-loop work-sharing construct. type can be either of static,dynamic,guided,runtime chunk size should be positive integer
OMP_NESTED
OMP_NESTED={ 1 | 0 } Enables creation of new teams in case of nested parallel regions (1,true) or serializes (0,false) all nested parallel regions. Default is 0.
Sets the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.
Launching a process with numactl --interleave=all sets the memory interleave policy so that memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target fall back to other nodes.
The command "echo 1> /proc/sys/vm/drop_caches" is used to free up the filesystem page cache.
For multi-copy runs or single copy runs on systems with multiple sockets, it is advantageous to bind a process to a particular core. Otherwise, the OS may arbitrarily move your process from one core to another. This can affect performance. To help, SPEC allows the use of a "submit" command where users can specify a utility to use to bind processes. We have found the utility 'numactl' to be the best choice.
numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for a command and inherited by all of its children. The numactl flag "--physcpubind" specifies which core(s) to bind the process. "-l" instructs numactl to keep a process memory on the local node while "-m" specifies which node(s) to place a process memory. For full details on using numactl, please refer to your Linux documentation, 'man numactl'
In order to take advantage of large pages, your system must be configured to use large pages. To configure your system for huge pages perform the following steps:
Create a mount point for the huge pages: "mkdir /mnt/hugepages" The huge page file system needs to be mounted when the systems reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages" Set vm/nr_hugepages=N in your /etc/sysctl.conf file where N is the maximum number of pages the system may allocate. Reboot to have the changes take effect. (Not necessary on some operating systems like RedHat Enterprise Linux 5.5).
Note that further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt
Transparent Huge Pages
On RedHat EL 6 and later, Transparent Hugepages increases the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provides significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented which prevents hugepages being allocated, the kernel will assign smaller 4k pages instead. Hugepages are used by default if /sys/kernel/mm/redhat_transparent_hugepage/enabled is set to always.
Set this environment variable to "yes" to enable applications to use large pages.
Specify stack size to be allocated for each thread.
KMP_AFFINITY = < physical | logical >, starting-core-id specifies the static mapping of user threads to physical cores. For example, if you have a system configured with 8 cores, OMP_NUM_THREADS=8 and KMP_AFFINITY=physical,0 then thread 0 will mapped to core 0, thread 1 will be mapped to core 1, and so on in a round-robin fashion. KMP_AFFINITY = granularity=fine,scatter The value for the environment variable KMP_AFFINITY affects how the threads from an auto-parallelized program are scheduled across processors. Specifying granularity=fine selects the finest granularity level, causes each OpenMP thread to be bound to a single thread context. This ensures that there is only one thread per core on cores supporting HyperThreading Technology Specifying scatter distributes the threads as evenly as possible across the entire system. Hence a combination of these two options, will spread the threads evenly across sockets, with one thread per physical core.
Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel (Linux and Mac OS X) or /Qopenmp and /Qparallel (Windows). Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
Enabling this option allows the processor cores to automatically increase their frequency if they are running below power and temperature, thereby increasing performance. By default, this option is enabled.
Enabling this option allows processor resources to be used more efficiently, enabling multiple threads to run on each core and increasing processor throughput, improving overall performance on threaded software.
Enabling this option allows the system to dynamically adjust processor voltage and core frequency. This technology can result in decreased average power consumption and decreased average heat production.
This option specifies the number of logical processor cores that can run on the server. This option sets he state of logical processor cores in a package. If you disable this setting, Hyper Threading is also disabled.
This option allows the user whether the processor uses Intel Virtualization Technology, which allows a platform to run multiple operating systems and applications in independent partitions. This can be one of the following: Disabled - The processor does not permit virtualization. enabled — The processor allows multiple operating systems in independent partitions. Platform Default — The BIOS option uses the value for this attribute contained in the BIOS defaults for the server type and vendor. By default this BIOS option is enabled.
Enabling this option allows processors to increase I/O performance by placing data from I/O devices directly into the processor cache. This setting helps to reduce cache misses.
This BIOS option enables the user to configure the CPU power management settings such as Enhanced Intel SpeedStep, Intel Turbo Boost Technology and Processor Power State C6. Settings in Custom will allow the user to change individual settings for the BIOS parameters in the preceding list. You must select this option if you want to change any of these BIOS parameters. Settings in Energy Efficient will determines the best settings for the BIOS parameters in the preceding list and ignores the individual settings for these parameters. Settings in Disabled state do not perform any CPU power management and any settings for the BIOS parameters in the preceding list are ignored
Enabling this option allows the processor to transition to its minimum frequency upon entering C1. This setting does not take effect until after you have rebooted the server. In disabled state, the CPU continues to run at its maximum frequency in C1 state. Users should disable this option for performing application benchmarking. In enabled state, the CPU transitions to its minimum frequency. This option saves the maximum amount of power in the C1 state.
Enabling this option allows the processor to send the C6 report to the operating system. Users should disable this option for performing application benchmarking.
This BIOS option allows you to determine whether system Performance or energy efficiency is more important on server. This can be one of the following: Balanced Energy, Balanced Performance, Energy Efficient and Performance. Balanced Performance optimized to maximum power savings with minimal impact on performance and it is enabled by default. Performance disables all power management options with any impact on performance. Balanced Energy is optimized for power efficiency and "Energy Efficient" for power savings. The BIOS option is only selectable if “Power Technology" is set to "Custom".
This BIOS option allows the enabling/disabling of a processor mechanism in 3 modes: Enterprise, High-Throughput and HPC. Setting this BIOS option in Enterprise and High-Throughput modes, will enable all the prefetchers and disable Data Reuse Technology. Setting this BIOS option in HPC mode will enable all the prefetchers and enable Data Reuse Technology.
This BIOS option configures the processor last level cache (LLC) prefetch feature as a result of the non-inclusive cache architecture. The LLC prefetcher exists on top of other prefetchers that can prefetch data into the core data cache unit (DCU) and mid-level cache (MLC). In some cases, setting this option to disabled can improve performance. Values for this BIOS option can be: Disabled: Disables the LLC prefetcher. The other core prefetchers are unaffected. Enabled: Gives the core prefetcher the ability to prefetch data directly to the LLC. By default, LLC prefetch option is disabled.
This BIOS option determines how aggressively the CPU will be power managed and placed into turbo. With “BIOS Controls”, the system controls the setting. Selecting "OS Controls” allows the operating system to control it.
This BIOS option controls the DIMM power savings mode policy. Setting this BIOS option in Disabled, DIMMs do not enter power saving mode. Setting this BIOS option in Slow, DIMMs can enter power saving mode, but the requirements are higher. Therefore, DIMMs enter power saving mode less frequently. Setting this BIOS option in Fast, DIMMs enter power saving mode as often as possible. Setting this BIOS option in Auto, BIOS controls when a DIMM enters power saving mode based on the DIMM configuration.
This BIOS option controls the prioritization of memory operations. Setting this BIOS option in Power-saving-mode will prioritize low voltage memory operations over high frequency memory operations. This mode may lower memory frequency in order to keep the voltage low. Setting this BIOS option in Performance-mode will prioritize high frequency operations over low voltage operations.
This BIOS option allows the user to enable/disable temperature-based memory throttling. By default this BIOS option is enabled. By enabling this BIOS option, the system BIOS will initiate memory throttling to manage memory performance by limiting bandwidth to the DIMMs, therefore capping the power consumption and preventing the DIMMs from overheating.
This BIOS option allows the user to configure memory reliability, availability and serviceability (RAS). Setting this BIOS option in Maximum Performance, system performance is optimized Setting this BIOS option in Mirroring, system reliability is optimized by using half the system memory as backup. Setting this BIOS option in Lockstep, if the DIMM pairs in the server have an identical type, size, and organization and are populated across the SMI channels, you can enable lockstep mode to minimize memory access latency and provide better performance. Setting this BIOS option in Sparing, system reliability is enhanced with a degree of memory redundancy while making more memory available to the operating system than mirroring.
This option controls the refresh interval rate for internal memory. By default, the refresh interval rate set as Auto, which is 2X DRAM refresh for every 32ns. Setting this BIOS option in 1X, DRAM cells are refreshed every 64ns.
This BIOS option is memory RAS feature which runs a background memory scrub against all DIMMs and it can negatively impact performance. By default, this option is enabled. Disabling this option, improves performance.
There are 4 snoop mode options for how to maintain cache coherency across the Intel QPI fabric, each with varying memory latency and bandwidth characteristics depending on how the snoop traffic is generated.
Cluster on Die (COD) mode logically splits a socket into 2 NUMA domains that are exposed to the OS with half the amount of cores and LLC assigned to each NUMA domain in a socket. This mode utilizes an on-die directory cache and in memory directory bits to determine whether a snoop needs to be sent. Use this mode for highly NUMA optimized workloads to get the lowest local memory latency and highest local memory bandwidth for NUMA workloads.
Home Directory Snoop with OSB is the Opportunistic Snoop Broadcast (OSB) directory mode, the HA could choose to do speculative home snoop broadcast under very lightly loaded conditions even before the directory information has been collected and checked.
In Home Snoop and Early Snoop modes, snoops are always sent , they just originate from different places: the caching agent (earlier) in Early Snoop mode and the home agent (later) in Home Snoop mode.
This BIOS option provides similar localization benefits as cluster-on-die (COD), without some of COD’s downsides. SNC breaks up the LLC into two disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. SNC improves average latency to the LLC (last level cache) and memory. SNC is a replacement for the COD feature found in previous processor families. For a multi-socketed system, all SNC clusters are mapped to unique NUMA domains. IMC Interleaving must be set to the correct value to correspond with SNC enable/disable. Values for this BIOS option can be: Disabled: The LLC is treated as one cluster when this option is disabled Enabled: Utilizes LLC capacity more efficiently and reduces latency due to core/IMC proximity. This may provide performance improvement on NUMA-aware operating systems By default this BIOS option set to Disabled.
This BIOS option controls the interleaving between the Integrated Memory Controllers (IMCs). There are two IMCs per socket in Skylake Server. If IMC Interleaving is set to 2-way, addresses will be interleaved between the two IMCs. If IMC Interleaving is set to 1-way, there will be no interleaving. If SNC is disabled, IMC Interleaving should be set to 2-way. If SNC is enabled, IMC Interleaving should be set to 1-way.
Enabling this option allows the chipset to defer memory transactions and process them out of order for optimal performance.
When running multiple copies of benchmarks, the SPEC config file feature submit is sometimes used to cause individual jobs to be bound to specific processors. This specific submit command is used for Linux. The description of the elements of the command are:
/usr/bin/taskset [options] [mask] [pid | command [arg] ... ] :Flag description origin markings:
For questions about the meanings of these flags, please contact the tester.
For other inquiries, please contact webmaster@spec.org
Copyright 2012-2019 Standard Performance Evaluation Corporation
Tested with SPEC OMP2012 v1.1.
Report generated on Wed May 1 12:07:46 2019 by SPEC OMP2012 flags formatter v538.