SPEC CPU2017 Platform Settings for Nettrix Systems

Operating System Tuning Parameters

ulimit

Used to set user limits of system-wide resources. Provides control over resources available to the shell and processes started by it. Some common ulimit commands may include:

ulimit -s [n | unlimited] Set the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.
ulimit -l (number) Set the maximum size that can be locked into memory.

Disabling Linux services

Certain Linux services may be disabled to minimize tasks that may consume CPU cycles.

irqbalance

Disabled through "service irqbalance stop". Depending on the workload involved, the irqbalance service reassigns various IRQ's to system CPUs. Though this service might help in some situations, disabling it can also help environments which need to minimize or eliminate latency to more quickly respond to events.

cpupower tool

Use the cpupower tool to read your supported CPU frequencies, and to set them. For rhel linux, to install the tool run the command "yum install cpupowerutils". You can set the cpu frequency in three modes with the command "cpupower frequency-set -g [options]", where "[options]" could be:

userspace: allows the frequency to be set manually.
ondemand: allows the CPU to run at different speed depending on the workloads.
performance: sets the CPU frequency to the maximum allowed.

The default governor mode is "performance".

Tuning Kernel parameters

The following Linux Kernel parameters were tuned to better optimize performance of some areas of the system:

swappiness: The swappiness value can range from 1 to 100. A value of 100 will cause the kernel to swap out inactive processes frequently in favor of file system performance, resulting in large disk cache sizes. A value of 1 tells the kernel to only swap processes to disk if absolutely necessary. This can be set through a command like "echo 1 > /proc/sys/vm/swappiness". The default value is 10.
ksm/sleep_millisecs: Set through "echo 200 > /sys/kernel/mm/ksm/sleep_millisecs". This setting controls how many milliseconds the ksmd (KSM daeomn) should sleep before the next scan.
khugepaged/scan_sleep_millisecs: Set through "echo 50000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs". This setting controls how many milliseconds to wait in khugepaged is there is a hugepage allocation failure to throttle the next allocation attempt.
Zone Reclaim Mode: Zone reclaim allows the reclaiming of pages from a zone if the number of free pages falls below a watermark even if other zones still have enough pages available. Reclaiming a page can be more beneficial than taking the performance penalties that are associated with allocating a page on a remote zone, especially for NUMA machines. To tell the kernel to free local node memory rather than grabbing free memory from remote nodes, use a command like "echo 1 > /proc/sys/vm/zone_reclaim_mode"
max_map_count-n: The maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.
sched_cfs_bandwidth_slice_us: This OS setting controls the amount of run-time(bandwidth) transferred to a run queue from the task's control group bandwidth pool. Small values allow the global bandwidth to be shared in a fine-grained manner among tasks, larger values reduce transfer overhead. The default value is 5000 (ns).
sched_latency_ns: This OS setting configures targeted preemption latency for CPU bound tasks. The default value is 24000000 (ns).
sched_rt_runtime_us: A global limit on how much time realtime scheduling may use.The default value is 950000 (us).
sched_migration_cost_ns: Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated to another CPU, so increasing this variable reduces task migrations. The default value is 500000 (ns).
sched_min_granularity_ns: This OS setting controls the minimal preemption granularity for CPU bound tasks. As the number of runnable tasks increases, CFS(Complete Fair Scheduler), the scheduler of the Linux kernel, decreases the timeslices of tasks. If the number of runnable tasks exceeds sched_latency_ns/sched_min_granularity_ns, the timeslice becomes number_of_running_tasks * sched_min_granularity_ns. The default value is 10000000 (ns).
sched_wakeup_granularity_ns: This OS setting controls the wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is 15000000 (ns).
numa_balancing: This OS setting controls automatic NUMA balancing on memory mapping and process placement. Setting 0 disables this feature. It is enabled by default (1).
dirty_ratio:This OS setting controls the absolute maximum amount of system memory (here expressed as a percentage) that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point, all new I/O operations are blocked until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory. The default value is 40.
dirty_background_ratio: This OS setting controls the percentage of system memory that can be filled with "dirty" pages before the pdflush/flush/kdmflush background processes kick in to write it to disk. "Dirty" pages are memory pages that still need to be written to disk. As an example, if you set this value to 10 (it means 10%), and your server has 256 GB of memory, then 25.6 GB of data could be sitting in RAM before something is done. The default value is 10.
dirty_writeback_centisecs: The kernel flusher threads will periodically wake up and write old data out to disk. This OS setting controls the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether. The default value is 500.
dirty_expire_centisecs: This OS setting is used to define when dirty data is old enough to be eligible for writeout by the kernel flusher threads. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up. The default value is 3000.

Transparent Huge Pages (THP)

THP is an abstraction layer that automates most aspects of creating, managing, and using huge pages. THP is designed to hide much of the complexity in using huge pages from system administrators and developers, as normal huge pages must be assigned at boot time, can be difficult to manage manually, and often require significant changes to code in order to be used effectively. Transparent Hugepages increase the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provide significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented which prevents hugepages being allocated, the kernel will assign smaller 4k pages instead. Most recent Linux OS releases have THP enabled by default.

Linux Huge Page settings

If you need finer control and manually set the Huge Pages you can follow the below steps:

Create a mount point for the huge pages: "mkdir /mnt/hugepages"
The huge page file system needs to be mounted when the systems reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"
Set vm/nr_hugepages=N in your /etc/sysctl.conf file where N is the maximum number of pages the system may allocate.
Reboot to have the changes take effect.

For further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt

Firmware / BIOS / Microcode Settings

Application Performance Profile:

Application Performance Profile is designed for customers who need an easy way to optimize BIOS settings according to their application scenarios. The option could be set as "Disabled", "Computing Throughput Mode", "Computing Latency Mode", "Memory Bandwidth Mode", "Energy Efficient Mode", "Java Application Mode", and "High Reliability Mode". Default = "High Reliability Mode".

"Disabled" switch off this option. When set as "Disabled", this feature is not available for customers.
"Computing Throughput Mode" makes the BIOS tuned for throughput-sensitive application scenarios automatically.
"Computing Latency Mode"makes the BIOS tuned for latency-sensitive application scenarios automatically.
"Memory Bandwidth Mode" makes the BIOS tuned for memory bandwidth sensitive application scenarios automatically.
"Energy Efficient Mode"makes the BIOS tuned for Power Efficiency application scenarios automatically.
"Java Application Mode"makes the BIOS tuned for latency-sensitive application scenarios automatically.
"High Reliability Mode"makes the BIOS tuned for conservative and high reliability usage.

C-States:

C-states reduce CPU idle power. There are three options in this mode: "Legacy","Autonomous", and "Disable". Default is "Disable".

Legacy: When "Legacy" is selected, the operating system initiates the C-state transitions. For E5/E7 CPUs, ACPI C1/C2/C3 map to Intel C1/C3/C6. For 6500/7500 CPUs, ACPI C1/C3 map to Intel C1/C3 (ACPI C2 is not available). Some OS SW may defeat the ACPI mapping (e.g. intel_idle driver).
Autonomous: When "Autonomous" is selected, HALT and C1 request get converted to C6 requests in hardware.
Disable: When "Disable" is selected, only C0 and C1 are used by the OS. C1 gets enabled automatically when an OS autohalts.

C1 Enhanced Mode:

Enabling C1E (C1 enhanced) state can save power by halting CPU cores that are idle. Default is "Enable".

Turbo Mode:

Enabling turbo mode can boost the overall CPU performance when all CPU cores are not being fully utilized. A CPU core can run above its rated frequency for a short perios of time when it is in turbo mode. Default is "Enable".

Hyper-Threading:

Enabling Hyper-Threading let operating system addresses two virtual or logical cores for a physical presented core. Workloads can be shared between virtual or logical cores when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline for using the processor resources more efficiently. Default is "Enabled".

Hardware P-states:

This setting allows the user to select between OS and hardware-controlled P-states. Selecting Native Mode allows the OS to choose a P-state. Selecting Out of Band Mode allows the hardware to autonomously choose a P-state without OS guidance. Selecting Native Mode with No Legacy Support functions as Native Mode with no support for older hardware. Default is "Disable".

Per Core P-state:

When per-core P-states are enabled, each physical CPU core can operate at separate frequencies. If disabled, all cores in a package will operate at the highest resolved frequency of all active threads. Default is Enable.

Sub-NUMA Cluster (SNC):

SNC breaks up the last level cache (LLC) into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. SNC improves average latency to the LLC and memory. SNC is a replacement for the cluster on die (COD) feature found in previous processor families. For a multi-socketed system, all SNC clusters are mapped to unique NUMA domains. (See also IMC interleaving.) Values for this BIOS option can be:

Disabled: The LLC is treated as one cluster when this option is disabled
Enable SNC2(2-clusters): Utilizes LLC capacity more efficiently and reduces latency due to core/IMC proximity. This may provide performance improvement on NUMA-aware operating systems
When "Enable SNC2(2-clusters)", the interleaving between the Integrated Memory Controllers (IMCs) is set to 1-way interleave autonomously.

Default is "Disabled".

XPT Prefetch

XPT prefetch is a mechanism that enables a read request that is being sent to the last level cache to speculatively issue a copy of that read to the memory controller prefetching. This can be one of the following:

Disabled: The CPU does not use the XPT Prefetch option.
Enabled: The CPU enables the XPT prefetcher option.

KTI Prefetch

KTI prefetch is a mechanism to get the memory read started early on a DDR bus. This can be one of the following:

Disabled: The processor does not preload any cache data.
Enabled: The KTI prefetcher preloads the L1 cache with the data it determines to be the most relevant.

The default setting is "Disabled".

UPI Prefetcher

UPI prefetch is a mechanism to get the memroy read started early on DDR bus. The UPI receive path will spawn a memory read to the memory controller prefetcher. Default is Enabled.

Patrol Scrub:

Patrol Scrub is a memory RAS feature which runs a background memory scrub against all DIMMs. Can negatively impact performance. This option allows for correction of soft memory errors. Over the length of system runtime, the risk of producing multi-bit and uncorrected errors is reduced with this option. Values for this BIOS setting can be:

Enabled: Correction of soft memory errors can occur during runtime.
Disabled: Soft memory error correction is turned off during runtime.

Default is Enabled.

DCU Streamer Prefetcher:

DCU (Level 1 Data Cache) streamer prefetcher is an L1 data cache prefetcher. Lightly threaded applications and some benchmarks can benefit from having the DCU streamer prefetcher enabled. Default setting is Enable.

Hardware Prefethcer:

When this option is enable, a dedicated hardware mechanism in the processor is supported to watch the stream of instructions or data being requested by the executing program, recognize the next few elements that the program might need based on this stream and prefetch into the processor's cache. The program with good instruction and data locality will benefit from this feature when this option is enable. Default is enable.

Trusted Execution Technology:

Enable Intel Trusted Execution Technology (Intel TXT). Default is disable.

Page Policy:

Adaptive Open Page Policy can improve performance for applications with a highly localized memory access pattern; Closed Page Policy can benifit applications that access memory more randomly. The default is "Auto".

LLC Dead Line Allocation:

If the Dead Line LLC Alloc feature is disabled, dead lines will always be dropped and will never fill into the LLC. This can help save space in the LLC and prevent the LLC from evicting useful data. However, if the Dead Line LLC Alloc feature is enabled, the LLC can opportunistically fill dead lines into the LLC if there is free space available. The default is enable.

Stale AtoS

Stale AtoS is the transition of a directory line state. The inmemory directory has three states: I, A, and S. I (invalid) state means the data is clean and does not exist in any other socket's cache. The A (snoopAll) state means the data may exist in another socket in exclusive or modified state. S (Shared) state means the data is clean and may be shared across one or more socket's caches. When doing a read to memory, if the directory line is in the A state, we must snoop all the other sockets because another socket may have the line in modified state. If this is the case,the snoop will return the modified data. However, it may be the case that a line is read in A state and all the snoops come back a miss. This can happen if another socket read the line earlier and then silently dropped it from its cache without modifying it. Values for this BIOS option can be:

Auto: Recommended setting
Disabled: Disabling this option allows the feature to process memory directory states as described above.
Enabled: In the situation where a line in A state returns only snoop misses, the line will transition to S state. That way, subsequent reads to the line will encounter it in S state and not have to snoop, saving latency and snoop bandwidth.

Default is Auto.

Cooling Policy

The feature to configure "Cooling Policy" option is provided on BMC webpage. This option provides 4 choices: "Balance Mode", "Performance Mode", "Silent Mode" and "Manual Mode" and default is "Balance Mode".

"Balance Mode" makes fan speed self-adjust actively according to the changes of temperature monitored by on-board temperature sensors.
"Performance Mode" makes fan speed self-adjust more actively according to the changes of temperature monitored by on-board temperature sensors.
"Silent Mode" makes fan speed self-adjust passively according to the changes of temperature monitored by on-board temperature sensors.
"Manual Mode" allows customers setting a value as the duty percentage of fan speed, this value is called "Fan Duty". The value should be an integer in the range from 30 to 100. When set as 100, all of the fans are working at full speed. It is not recommanded to set duty percentage at a low level when there exists high workload on system.