SPEC Run and Reporting Rules for CPU95 Suites

Date: September 11, 1995
From: SPEC Open Systems Steering Committee

ABSTRACT

This document provides the guidelines required to build, run, and report on the benchmarks in the SPEC CPU95 suites.

Table of Contents:

   Purpose
1. General Philosophy
2. Building SPEC95
   2.1 General Rules for Selecting Optimization Flags
   2.2 Baseline Optimization Rules
3. Running SPEC95
   3.1 System Configuration
   3.2 Additional Rules for Running SPECrate
   3.3 Continuous Run Requirement
   3.4 Run Rule Exceptions
4. Results Disclosure
   4.1 Configuration Disclosure
   4.2 Test Results Disclosure
   4.3 Metric Selection
5. SPEC95 Benchmarks
   5.1 CINT95
   5.2 CFP95

Subject: SPEC Run and Reporting Rules for CPU95 Suites
Revised: Sept 11, 1995
From:    SPEC Open Systems Steering Committee

Purpose

This document specifies how the benchmarks in the CPU95 suites are to be run for measuring and publicly reporting performance results. This ensures that results generated with this suite are meaningful, comparable to other generated results, and repeatable (with documentation covering factors pertinent to duplicating the results).

Per the SPEC license agreement, all results publicly disclosed must adhere to the SPEC Run and Reporting Rules. The following basics are expected:

- Adherence to the SPEC general run rule philosophy, including:
  - general availability of all components within 6 months of publication.
  - providing a suitable environment for C/FORTRAN programs.

- Use of the SPEC tools for all published results, including:
  - compilation of the benchmarks with the SPEC tools.
  - requiring the median of at least three runs of each benchmark to help promote stability and reproducibility.
  - requiring that a publishable result be generated with one invocation of the SPEC tools.
  - validating the benchmark output with the SPEC-provided validation output to ensure that the benchmark ran to completion and generated correct results.

- Adherence to the criteria for flag selection, including:
  - proper use of feedback directed optimization for both base and non-base/"aggressive compilation" measurements.
  - the use of only four optimization options (including preprocessor, compiler and linker flags) for baseline measurements.

- Availability of a full disclosure report.

Each of these points is discussed in further detail below. Suggestions for improving this run methodology should be made to SPEC for consideration in future releases.

1. General Philosophy

SPEC believes the user community will benefit from an objective series of tests which can serve as a common reference and be considered as part of an evaluation process. SPEC is aware of the importance of optimizations in producing the best system performance. SPEC is also aware that it is sometimes hard to draw an exact line between legitimate optimizations that happen to benefit SPEC benchmarks and optimizations that specifically target the SPEC benchmarks. However, with the list below, SPEC wants to increase the awareness of implementors and end users of issues of unwanted benchmark-specific optimizations that would be incompatible with SPEC's goal of fair benchmarking.

To ensure that results are relevant to end users, SPEC expects that the hardware and software implementations used for running the SPEC benchmarks adhere to the following conventions:

- Hardware and software used to run the CINT95/CFP95 benchmarks must provide a suitable environment for running typical C or FORTRAN programs.
- Optimizations must generate correct code for a class of programs, where the class of programs must be larger than a single SPEC benchmark or SPEC benchmark suite. This also applies to "assertion flags" that may be used for non-base/"aggressive compilation" measurements (see section 2.2.4).

- Optimizations must improve performance for a class of programs, where the class of programs must be larger than a single SPEC benchmark or SPEC benchmark suite.

- The vendor encourages the implementation for general use.

- The implementation is generally available, documented, and supported by the providing vendor.

In the case where it appears that the above guidelines have not been followed, SPEC may investigate such a claim and request that the offending optimization (e.g. SPEC-benchmark-specific pattern matching) be backed off and the results resubmitted. Or, SPEC may request that the vendor correct the deficiency (e.g. make the optimization more general purpose or correct problems with code generation) before submitting results based on the optimization.

The SPEC Open Systems Group reserves the right to adapt the CINT95/CFP95 suites as it deems necessary to preserve its goal of fair benchmarking (e.g. remove a benchmark, modify benchmark code or workload, etc.). If a change is made to a suite, SPEC will notify the appropriate parties (i.e. members and licensees) and will redesignate the metrics (e.g. changing the metric from SPECfp95 to SPECfp95a). In the case that a benchmark is removed, SPEC reserves the right to republish in summary form "adapted" results for previously published systems, converted to the new metric. In the case of other changes, such a republication may necessitate retesting and may require support from the original test sponsor.

2. Building SPEC95

SPEC has adopted a set of rules defining how the SPEC95 benchmark suites must be built and run to produce "baseline" and "non-baseline/aggressive compilation" metrics.

"Non-baseline" metrics are produced by building each benchmark in the suite with a set of optimizations individually tailored for that benchmark. The optimizations selected must adhere to the set of general benchmark optimization rules described in section 2.1 below. This may also be referred to as "aggressive compilation".

"Baseline" metrics are produced by building all the benchmarks in the suite with a common set of optimizations. In addition to the general benchmark optimization rules (section 2.1), "baseline" optimizations must also adhere to a stricter set of rules described in section 2.2. These additional rules serve to form a "baseline" of recommended performance optimizations for a given vendor.

With the release of the SPEC95 suites, a set of tools based on GNU Make (gmake) and Perl5 is supplied to build and run the benchmarks. To produce "publication-quality" results, these SPEC tools must be used. This helps ensure reproducibility of results by requiring that all individual benchmarks in the suite are run in the same environment and that a configuration file that defines the optimizations used is available.

2.1 General Rules for Selecting Optimization Flags

The following rules apply to compiler flag selection for SPEC95 Peak and Baseline Metrics. Additional rules for Baseline Metrics follow in section 2.2.

2.1.1 No source file, variable, or subroutine name may be used within an optimization flag or compile option.
2.1.2 Flags which substitute pre-computed (e.g. library-based) routines for routines defined in the benchmark on the basis of the routine's name are not allowed. Exceptions are:
  a) the function "alloca" in benchmark 126.gcc, and
  b) the level 1 BLAS functions in the CFP95 benchmarks.

2.1.3 Feedback directed optimization is allowed. Only the "training" input (i.e. runspec -i train ...) may be used for the run that generates the feedback data. The runspec.txt file covers how to set this up with the SPEC tools.

2.1.4 Flags that change a data type size to a size different from the default size of the compilation system are not allowed. Exceptions are:
  a) C long can be 32 bits or greater, and
  b) pointer sizes can be set different from the default size.

2.2 Baseline Optimization Rules

In addition to the rules listed in section 2.1 above, the selection of optimizations to be used to produce SPEC95 Baseline Metrics must adhere to the following:

2.2.1 The optimization options used are expected to be safe, and it is expected that system or compiler vendors would endorse the general use of these options by customers who seek to achieve good application performance.

2.2.2 The same compiler and the same set of optimization flags/options must be used for all benchmarks of a given language within a benchmark suite, except for portability flags (see 2.2.6, item (3) below). All flags must be applied in the same order for all benchmarks. SPEC reserves the right to certify "portability" switches which directly affect the source code for a benchmark. The runspec.txt file covers how to set this up with the SPEC tools.

2.2.3 Feedback directed optimization is allowed. A one or two pass process may be used. The first pass, or feedback run, must use the "train" input data set to generate the feedback file. When a two pass process is used, the same set of optimization flags must be used for both passes, excluding those related to the generation or use of the feedback file.

2.2.4 Assertion flags may NOT be used. An assertion flag is one that supplies semantic information that the compilation system did not derive from the source statements of the benchmark. With an assertion flag, the programmer asserts to the compiler that the program has certain "nice" properties that allow the compiler to apply more aggressive optimization techniques (for example, that there is no "aliasing" via C pointers). The problem is that there can be legal programs (possibly strange, but still standard-conforming programs) where such a property does not hold. These programs could crash or give incorrect results if an assertion flag is used. This is the reason why such flags are sometimes also called "unsafe flags". Assertion flags should never be applied to a production program without previous careful checks; therefore they are disallowed for baseline.

2.2.5 Baseline results may use flags which affect the numerical accuracy or sensitivity by reordering floating-point operations based on algebraic identities.

2.2.6 Baseline optimization is further restricted by limiting to four (4) the maximum number of optimization switches that can be applied to create a baseline result. An example of this would be a cc invocation where users might use one flag for a general optimization, one to specify the architecture, one to specify an optimal library, plus one other optimization flag.

The following rules must be followed for selecting and counting optimization flags:

(1) A flag is defined as a unit of definition to the compiler.
For example, each of the following is defined as a single flag:

    -O2
    -inline
    -qarch=ppc
    -tp p5
    -Xunroll0

(2) Some compilers allow delimited lists (usually comma or space delimited) behind an initial flag; for purposes of baseline, each optimization item in the list counts as an optimization toward the limit of four. For example:

    -K inline,unroll,strip_mine

counts as three optimization flags.

(3) Portability flags are not counted in the count of four. A flag may be considered a portability flag if and only if that flag is necessary for the successful compilation and correct execution of the benchmark regardless of any or all other compilation flags used. That is, if it is possible to build and run the benchmark without this flag, then this flag is not considered a portability flag. For a given portability problem, the same flag(s) must be applied to all affected benchmarks. If a library is specified as a portability flag, SPEC may request that the table of contents of the library be included in the disclosure.

(4) A flag specifying ANSI compliance is NOT counted in the count of four switches.

(5) Switches for feedback directed optimization follow the same rules (one unit of definition) and count as one of the four optimization flags. Since two passes are currently allowed for baseline, all invocations activating feedback together count as one flag. For example:

    Pass 1: cc -prof_gather -O9 -go_fast -arch=404
    Pass 2: cc -prof_use -O9 -go_fast -arch=404

This breaks down into an FDO invocation, an optimization level, an extra optimization, and an architecture flag, and counts as an acceptable four flags.

(6) "Pointer" or "location" flags (flags that indicate where to find data) are not included in the four flag definition. For example:

    -L/usr/ucblib
    -prof_dir `pwd`

(7) Flags that only suppress warnings (typically -w), flags that only create object files (typically -c), and flags that only name the output file (typically -o) are not counted as optimization flags.

(8) The four flag limit counts all options for all parts of the compilation system, i.e. the entire transformation from SPEC-supplied source code to completed executable. The list below is a partial set of the types of flags that would be included in the flag count:

    - Linker options are counted as part of the four optimization flags.
    - Preprocessor directives are counted as part of the four optimization flags.
    - Libraries, when not needed for portability, are counted as part of the four optimization flags.

3. Running SPEC95

3.1 System Configuration

3.1.1 File Systems

SPEC requires the use of a single file system to contain the directory tree for the SPEC95 suite being run. SPEC allows any type of file system (disk-based, memory-based, NFS, DFS, etc.) to be used. The type of file system must be disclosed in reported results.

3.1.2 System State

The system state (multi-user, single-user, init level N) may be selected by the benchmarker. This state, along with any changes to the default configuration of daemon processes or system tuning parameters, must be documented in the notes section of the results disclosure.

3.2 Additional Rules for Running SPECrate

3.2.1 For SPEC_rate95 (non-base), the benchmarker is free to choose the number of concurrent copies for each individual benchmark independently of the other benchmarks.

3.2.2 For SPEC_base_rate95, the benchmarker must select a single value to use as the number of concurrent copies to be applied to all benchmarks in the suite.
3.2.3 The multiple concurrent copies of the benchmark must be executed from different directories within the same file system. Each copy of the test must have its own working directory, which is to contain all the files needed for the actual execution of the benchmark, including input files, the benchmark binary, and all output files when created. The output of each copy of the benchmark must be validated as correct output.

3.3 Continuous Run Requirement

All benchmark executions, including the validation steps, contributing to a particular result page must occur continuously, i.e. in one execution of the "runspec" script. Exceptions must be approved by OSSC vote. For example, if a result page will contain both "non-base" and "base" CFP95 results, a single "runspec" invocation must have been used to run both the "non-base" and "base" executables for each benchmark and their validations.

3.4 Run Rule Exceptions

If, for some reason, the test sponsor cannot run the benchmarks as specified in these rules, the test sponsor can seek SPEC OSSG approval for performance-neutral alternatives. No publication may be done without such approval.

4. Results Disclosure

SPEC requires a full disclosure of results and configuration details sufficient to reproduce the results. SPEC also requires that baseline results be submitted to the SPEC Newsletter whenever non-base/aggressive compilation results are submitted. If non-base results are published outside of the newsletter, in a publicly available medium, the benchmarker must supply baseline results on request. Results published under non-disclosure, for company internal use, or as company confidential are not considered "publicly" available.

A full disclosure of results will include:

- The components of the Newsletter page, as prompted for by the SPEC tools.
- The benchmarker's configuration file and any supplemental makefiles needed to build the executables used to generate the results.
- A "flags" definition disclosure.

All results disclosures to be submitted to the SPEC Newsletter should be sent to SPEC via email or by mailing a diskette. Licensees are encouraged to register a results disclosure with SPEC/NCGA at any time, even if the results are not submitted to the SPEC Newsletter. In this way, SPEC can give correct answers to persons inquiring about specific results. All public SPEC results MUST be submitted to SPEC upon request no later than ten working days after the date of public release.

4.1 Configuration Disclosure

The following sections describe the various elements that make up the disclosure for the system and test configuration used to produce a given test result. The SPEC tools used for the benchmark allow setting this information in the configuration file.

4.1.1 System Identification

o System Manufacturer
o System Model Name
o SPEC license number
o Test Sponsor (Name, Location)
o Test Date (Month, Year)
o Hardware Availability Date
o Software Availability Date

Note: All components of the tested system must be available within 6 months of the publication of results.
4.1.2 Hardware Configuration

o CPU (Processor Name, Speed (MHz))
o FPU
o Number of CPUs in System
o Level 1 Cache (Size and Organization)
o Level 2 Cache (Size and Organization)
o Other Cache (Size and Organization)
o Memory (Size in MB)
o Disk (Size (MB/GB), Type (SCSI, Fast SCSI))
o Other Hardware (additional equipment added to improve performance, e.g. special disk controller, NVRAM file system accelerator)

4.1.3 Software Configuration

o Operating System (Name and Version)
o System State (Single User, Multi-user, Init 3)
o File System Type
o Compilers:
  - C Compiler (Name and Version) for CINT95
  - FORTRAN Compiler (Name and Version) for CFP95
  - Pre-processors (Name and Version), if used
o Other Software (additional software added to improve performance)

4.1.4 Tuning Information

o Description of System Tuning (includes any special OS parameters set, changes to standard daemons)
o Baseline flags list
o Portability flags used for any benchmark
o Peak flags list for each benchmark
o Any additional notes, such as a list of any OSSC-approved alternate sources or SPEC tool changes used.

4.2 Test Results Disclosure

The actual test results consist of the elapsed times and ratios for the individual benchmarks and the overall SPEC metric produced by running the benchmarks via the SPEC tools. The required use of the SPEC tools ensures that the results generated are based on benchmarks built, run, and validated according to the SPEC run rules. Below is a list of the measurement components for each SPEC95 suite and metric:

4.2.1 Speed Metrics

o CINT95 Speed Metrics:
  SPECint_base95 (Required Baseline result)
  SPECint95 (Optional Peak result)
o CFP95 Speed Metrics:
  SPECfp_base95 (Required Baseline result)
  SPECfp95 (Optional Peak result)

The elapsed time in seconds for each of the benchmarks in the CINT95 or CFP95 suite is given, and the ratio to the reference machine (SPARCstation 10 Model 40) is calculated. The SPEC_base95 metric is calculated as a geometric mean of the individual ratios, where each ratio is based on the median execution time from a minimum of three runs. All runs of a specific benchmark when using the SPEC tools are required to have validated correctly. The benchmark executables must have been built according to the "baseline" rules described in sections 2.1 and 2.2 above. The optional "non-base/aggressively compiled" metrics, SPECint95 and SPECfp95, are similarly based on the results of the individual benchmarks built using the "non-base" rules (see section 2.1).

* Note: The "median" rule applies only when the number of runs is odd (3, 5, 11, ...). If an even number of runs is made (4, 6, 12, ...), the lower (worse) of the two middle results is selected.

4.2.2 Throughput Metrics

o CINT95 Throughput Metrics:
  SPECint_base_rate95 (Required Baseline result)
  SPECint_rate95 (Optional Peak result)
o CFP95 Throughput Metrics:
  SPECfp_base_rate95 (Required Baseline result)
  SPECfp_rate95 (Optional Peak result)

The throughput metrics are calculated based on the execution of the same "baseline" and/or "non-base/aggressive compilation" benchmark executables as for the speed metrics described above. However, the test sponsor may select the number of concurrent copies of each benchmark to be run. The same number of copies must be used for all benchmarks in a "baseline" test. This is not true for the "peak" results, where the tester is free to select any combination of copies. The number of copies selected is usually a function of the number of CPUs in the system.
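As an informal illustration of the speed-metric arithmetic described in section 4.2.1 above (this is not part of the SPEC tools, which perform the official calculation), the following C sketch takes the median of three elapsed times per benchmark, forms a ratio of a reference time to that median so that larger means faster, and combines the per-benchmark ratios with a geometric mean. The reference and measured times shown are invented placeholders, not actual SPEC95 values.

    /*
     * Informal sketch of the SPEC95 speed-metric arithmetic (section 4.2.1).
     * The times below are made-up placeholders, NOT actual SPEC95 reference
     * or measured values; the SPEC tools perform the official calculation.
     */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Median elapsed time from an odd number of runs (three is the minimum). */
    static double median_time(double *runs, int n)
    {
        qsort(runs, n, sizeof(double), cmp_double);
        return runs[n / 2];
    }

    int main(void)
    {
        /* Hypothetical reference times (seconds on the reference machine)
           and three measured elapsed times for each of two benchmarks. */
        double ref[2]     = { 3700.0, 1300.0 };
        double runs[2][3] = { { 610.0, 598.0, 605.0 },
                              { 205.0, 199.0, 202.0 } };
        double log_sum = 0.0;
        int i;

        for (i = 0; i < 2; i++) {
            double ratio = ref[i] / median_time(runs[i], 3); /* SPECratio */
            printf("benchmark %d: ratio %.2f\n", i, ratio);
            log_sum += log(ratio);
        }

        /* Overall metric: geometric mean of the individual ratios. */
        printf("geometric mean: %.2f\n", exp(log_sum / 2));
        return 0;
    }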
The "rate" calculated for each benchmark is:

    rate = (number of copies run) * (reference factor for the benchmark)
           * (number of seconds in a day) / (elapsed time in seconds)

which yields a rate in jobs/day (an illustrative sketch of this calculation appears at the end of this document). The SPEC_base_rate95 or SPEC_rate95 metrics are calculated as a geometric mean of the individual SPECrates, using the median result from a minimum of three runs. As with the speed metrics, all copies of the benchmark during each run are expected to have validated correctly.

4.3 Metric Selection

Submission of "non-base" results is considered optional by SPEC, so the benchmarker may choose to submit only "baseline" results. Since by definition "baseline" results adhere to all the rules that apply to "non-base" results, the benchmarker may choose to refer to these results by either the "base" or "non-base" metric names (i.e. SPECint_base95 or SPECint95).

5. SPEC95 Benchmarks

A description of the benchmarks can be found in the README1.txt file provided on the SPEC media.
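To make the throughput ("rate") arithmetic described in section 4.2.2 concrete, the following C sketch applies the jobs/day formula to a single benchmark. The copy count, reference factor, and elapsed time are invented placeholders, not actual SPEC95 reference values; the official numbers are always produced by the SPEC tools.

    /*
     * Informal sketch of the SPEC95 throughput ("rate") arithmetic
     * (section 4.2.2). All values are made-up placeholders, NOT actual
     * SPEC95 reference values; the SPEC tools perform the official
     * calculation.
     */
    #include <stdio.h>

    int main(void)
    {
        double copies      = 4.0;      /* concurrent copies run               */
        double ref_factor  = 10.0;     /* hypothetical reference factor       */
        double day_seconds = 86400.0;  /* number of seconds in a day          */
        double elapsed     = 2500.0;   /* elapsed time of the run in seconds  */

        /* rate = copies * reference factor * seconds-per-day / elapsed time */
        double rate = copies * ref_factor * day_seconds / elapsed;

        printf("SPECrate for this benchmark: %.1f jobs/day\n", rate);
        return 0;
    }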