Standard Performance Evaluation Corporation

Workshop Introduction

Rudolf Eigenmann
Purdue University
School of Electrical and Computer Engineering
email: eigenman@purdue.edu

Welcome to the Workshop on Performance Evaluation with Realistic Applications!

The Standard Performance Evaluation Corporation (SPEC) is well known for its benchmark development. It has a broad industrial member base. Although SPEC benchmarks are used quite widely in academic research as well, there has not been an active dialog between the SPEC membership and the research community in developing and using benchmarks and performance evaluation methodologies. This workshop intends to bridge this gap.

The program includes a series of talks by researchers who have used SPEC suites and other benchmarks, have developed new tools and methodologies for performance evaluation and benchmarking, have identified needs for new benchmarks, and have developed benchmarks themselves. The program includes an overview of SPEC and gives several samples of the use of its benchmarks in research and industry. A panel discussion will conclude the workshop.

The speakers include SPEC supporters and critics alike. They will show what's good and what can be improved in the state of the art of performance evaluation and benchmarking. Are current benchmarks realistic enough? Or do we need to find new workloads that are larger, longer-running, more memory-intensive, more disk-consuming, or more algorithmically complex than what we have today? Do we really measure what we think we measure when using today's suites? This workshop will help review these and other issues. I hope that, together with the more than one hundred attendees from industry and academia, it will contribute to and stimulate the search for new and even better solutions for the next millennium.

KEYNOTE SPEECH

Real Applications and Data:
We Need Them for More Than Performance Work


David Kuck
Kuck & Associates

Performance Evaluation with Realistic Applications

Frederica Darema
Senior Science and Technology Advisor
National Science Foundation

This presentation will discuss a systematic methodology for performance evaluation of production-class applications and will also address the role of such a methodology and the ensuing technology not only in the analysis but also in the prediction of the performance of such applications. The talk will also address the use of such technology in the management, operation, and control cycles of these applications for delivering quality of service, that is, development and execution efficiency, fault tolerance, and high confidence.

Applications of today are more complex: they consist of multiple interoperating modules, developed by multiple developers, with each module potentially written in a different programming language. These characteristics will be even more pronounced in the future. In addition, these applications will potentially have multiple input and output data sets, and the scenarios will include situations where the data reside in distributed repositories and locations and need to be dynamically accessible at execution time. Moreover, there will be an increasing need for these applications to execute on multiple platforms, and/or to employ assemblies of multiple such platforms at a given time. As these applications, as well as the platforms on which they execute, become ever more complex and expensive, it will become increasingly imperative to attain high efficiency in developing and executing these applications. It will also become increasingly imperative to know a priori and with accuracy the expected performance of such applications.

In the past, performance evaluation was limited to analyzing the behavior of specific kernels in production-class applications. This approach, while useful to some extent, is quite limited in its ability to predict the behavior of the entire application. Several factors cause these deficiencies. The data sizes and the data organization change when the kernels are used within the full application, and there is a lack of methods and tools for carrying the analysis of simple kernels over to larger and more complex configurations. Also, the performance of such applications depends on the configuration of the underlying complex platforms. The isolated and limited methods and tools that exist today to describe these platforms are not sufficient to make accurate analyses of production-time conditions or to predict performance. In summary, the current technology allows individual component description and analysis, but there is no ability for ``system-level engineering'' as would be required for the performance analysis of the kinds of applications and execution environments considered here.

The methodology presented here aims to overcome these shortcomings and to enable enhanced performance analysis and prediction capabilities. Key elements of this methodology include the development of multi-level and multi-modal methods and tools for describing the application software, the system software, and the system hardware. Such models and tools encompass modeling and simulation at multiple levels of detail and abstraction, as well as the incorporation of performance measurements. An additional key capability in the methodology discussed here is the ability to couple these multi-level/multi-modal methods and tools into ``performance frameworks'', in a ``plug-and-play'' fashion, as needed for understanding, analysis, and prediction of performance. The talk will also discuss government agency programs and sponsored projects aimed at developing this methodology and the ensuing technology.

SPEC OSG benchmarks:
History, Use and Current Issues


Reinhold P. Weicker
Siemens AG, Information and Communication Products
Paderborn, Germany
email: reinhold.weicker@pdb.siemens.de

SPEC started out as an organization composed mainly of RISC workstation manufacturers, and its first product was a benchmark suite now known as the ``CPU89 benchmarks''. The successor suites of CPU benchmarks (CPU92, CPU95) are still the best known SPEC products, and it is a common misunderstanding to say ``SPEC benchmarks'' when one really means the SPEC CPU benchmarks. SPEC has added more groups under the SPEC umbrella:

  • High Performance Computing Group (HPG), continuing the tradition of the former ``Perfect Club'': High-performance, numerically intensive computing benchmarks
  • Graphics Performance Council (GPC), continuing the tradition of the GPC which cooperated with NCGA (National Computer Graphics Association): Graphics benchmarks
  • Open Systems Group (OSG), the renamed original SPEC organization: General UNIX (and NT) benchmarks

The Open Systems Group now has several benchmark suites:

  • CPU benchmarks: Currently CPU95, predecessors CPU89, CPU92, successor CPU99
  • UNIX System commands benchmarks: SDM (1991)
  • File server benchmarks: Currently SFS 2.0 (1998), predecessors SFS 1.0 and 1.1
  • Web server benchmarks: Currently SPECweb96, successor SPECweb99
  • Java Virtual Machine benchmarks: Currently SPECjvm98

Usage of these benchmarks is discussed: Number of results, quotations in company publications, perceptions of the public. A particularly interesting question is why some benchmarks (example: the SDM benchmarks) are heavily used internally (and therefore can be considered good tests) but have not been successful as benchmarks.

Use in computer science research is discussed: The CPU benchmarks influence not only development in the manufacturers' labs but also academic research. However, they seem to be used often without any questions about their appropriate use: ``If some new design speeds up the SPEC CPU benchmarks, it must be good''. The question ``What is a good benchmark?'' is rarely discussed in academic research, and it may have different answers from marketing and from development perspectives.

Finally, some current topics within OSG are discussed:

  • Baseline vs. Peak values for CPU benchmarks
  • Uniformity across all SPEC OSG benchmarks vs. individual rules for different suites
  • Measurement of shipping systems vs. measurements of prototypes vs. estimates
  • Benchmarks for fast moving targets: New NFS, HTTP, ... protocols, new Java versions, etc.
  • The old question revisited: Synthetic benchmarks vs. real applications

Feedback from the audience is sought on these questions and, ideally, a longer-term involvement of a wider variety of experts, not just from the computer system manufacturers.

Re-establishing the Relevance of the SPEC Benchmarks

Andrew Allison
Computer Industry Consultant,
and Publisher of Inside the New Computer Industry Newsletter
http://aallison.com

Present at the Creation

I come before you as an industry consultant who was present at the inaugural meeting of SPEC ten years ago, a fervent supporter in its early days, a constructive critic who blew the whistle on the rampant optimization of SPECfp89 and worked hard to cause SPECbase to appear, and who two years ago dismissed all but the Viewperf benchmark suite as irrelevant. I appreciate the opportunity to offer my thoughts on what went wrong, what to do about it, and how to prevent it from happening again.

What Went Wrong

SPEC's benchmarks have consistently fallen victim to the mistaken belief of most of its members that they could cheat better than their competitors. The ball was kicked off by Kuck & Associates' recognition of the opportunity represented by the early success of the SPEC89 benchmark and HP's willingness to employ the Kuck optimizations. Everybody followed suit, and the SPECmark became irrelevant. As I reported in November 1993, history repeated itself with the SPEC92 benchmarks, leading to SPECbase (which vendors never reported). It has continued to do so.

What to Do About It

First, what not to do. Please don't go into competition with the Transaction Processing Performance Council. They're doing a pretty good job with OLTP and DSS. There is, however, clearly a need for a new set of processor/cache/memory benchmarks (the original SPEC target) to compare the leading-edge processors on the market today. There are also the high-performance and graphics-intensive client markets (the latter redundantly addressed by two distinct SPEC benchmark activities), and non-transactional servers, e.g., network servers.

The simple answer to the question of how to preserve the integrity of benchmarks is to establish and maintain at all costs their objectivity. The devil, of course, is in the details. First, the organization must be de-politicized. By this I mean, for example, that you don't change the benchmarks if a new, commercially available, processor architecture, e.g. Alpha, happens to deliver outstanding performance.

Next, the publication of baseline, i.e. minimally optimized, results must be required in conjunction with the publication of optimized results. Further, multi-processor results should be clearly identified as such, and the prices of the configurations tested included as part of any result. The publication of single results from a component suite, e.g. CDRS without the rest of the Viewperf results or a SPECweb result without a response time, should be prohibited. And, finally, SPEC needs to be prepared to publicly chastise companies who bend the rules.

Thank you for your kind attention.

NFS Sensitivity to High Performance Networks

Richard P. Martin and David E. Culler
Computer Science Division
University of California at Berkeley

Local and system area networks have made rapid performance advances in recent years, including increases in per-port bandwidth (e.g., from 10 to 100 to 1000 Mb/s), huge increases in aggregate bandwidth, plus reductions in network latency and software overhead. With switching, aggregate bandwidth scales with the number of nodes in the network. Routing ASICs can forward small packets at the line rate. Cut-through routing reduces switch latency into the microsecond range. Such performance in Local Area Networks (LANs) was unheard of even five years ago; these networks represent a fundamental advance over their predecessors, 10 Mb Ethernet and FDDI.

The recent performance gains of these networks motivate the question: ``How much do improvements in network performance translate into improvements in application performance?'' Although the application space is enormous, previous work shows that 60%-70% of LAN traffic is filesystem related. We thus focus our study on the effects these recent networks have on the Network File System (NFS) protocol. In particular, we want to understand to which aspects of the communication subsystem (e.g., bandwidth, latency, overhead) NFS is most sensitive. Instead of attempting to answer these questions in a piecemeal fashion for specific point technologies, we take a systematic, parameterized approach so we can draw general conclusions.

We view the NFS system under investigation as a ``grey box''. From this perspective, the clients and server comprise a system for which we can measure how certain outputs, e.g., the time to read a file, change in response to various parametric inputs. One class of inputs is the traditional characteristics of NFS workloads, e.g., the mix of reads/writes/lookups and the data set size. These inputs are governed by the SPECsfs mix and scaling rules. The second class of inputs is novel to this study: we vary each of the performance characteristics of the communication subsystem, including the bandwidth, latency, and software overhead. The output of the system is a curve of response time vs. throughput for a fixed set of communication characteristics. We can then understand the effect of the network parameters on the response time vs. throughput curve.

Our approach is two-phased. First, we build a simple model using standard queuing theory techniques. Second, we perform experiments on a live system. Using the model, we can make predictions about what will happen in a real system. Our goal in this work, however, is not to develop highly accurate models. Rather, the purpose of the model is to gain insight as to how a system should behave as we change the networking parameters. The model's value lies in its ability to identify why the system responds as it does. Points where the measured data deviates from model predictions expose weaknesses in our understanding of the system.
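As a rough illustration of the kind of queuing-theory model described above (the abstract does not give its exact form), the sketch below uses a single-server M/M/1 approximation in which the per-operation service time is assembled from the network parameters being varied. All names and numeric values are illustrative assumptions, not measurements from the study.

    # Rough single-server (M/M/1) sketch of a response time vs. throughput
    # curve; the parameter values are illustrative assumptions only.

    def mm1_response_time(throughput_ops, service_time_s):
        """M/M/1 response time R = S / (1 - U), with utilization U = X * S."""
        utilization = throughput_ops * service_time_s
        if utilization >= 1.0:
            return float("inf")              # server saturated
        return service_time_s / (1.0 - utilization)

    def nfs_service_time(overhead_s, latency_s, msg_bytes, bandwidth_bps):
        """Hypothetical per-operation service time built from the network
        parameters varied in the study: software overhead, wire latency,
        and transfer time (message size / bandwidth)."""
        return overhead_s + latency_s + msg_bytes * 8.0 / bandwidth_bps

    base = nfs_service_time(overhead_s=100e-6, latency_s=10e-6,
                            msg_bytes=8192, bandwidth_bps=1e9)   # ~1 Gb/s
    for x in (1000, 3000, 5000, 7000):       # offered NFS operations per second
        r = mm1_response_time(x, base)
        print(f"{x:5d} ops/s -> response time {r * 1e6:8.1f} us")

Sweeping the offered load while holding the network parameters fixed yields one response-time vs. throughput curve; repeating the sweep with a different overhead, latency, or bandwidth exposes the sensitivity to that parameter.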

The second phase of our approach is experimental: we construct a live testbed on which we can change input parameters and measure the system's response. We begin with a network which is higher-performance than what is generally available, and then scale back the performance in a controlled manner. As we scale back the performance, we observe the ``slowdown'' of the NFS system as a function of different network parameters. From the slowdown, we derive the sensitivity of the system to each parameter. A high sensitivity indicates where future work will yield the largest improvements. Our choice of workload, SPECsfs, allows us to compare our results to industry published data. More importantly, using the techniques in this work we can infer the structure of the NFS servers from the data published by SPEC.

We find that simple queuing models are quite effective in analyzing the behavior of an NFS server. We were able to model changes in response time and throughput for a variety of network parameters. However, more investigation is needed to better describe the effect of latency on response time.

Our results show that NFS is most sensitive to processor overhead. Significant reductions in overheads will be critical for future servers to utilize emerging multi-gigabit networks. Overhead reductions will have to come from two places: improvements to networking stacks and the local filesystem. However, a significant component of the overhead in our testbeds came from general kernel code with no easily identifiable source sub-system.

We found that NFS is quite insensitive to network bandwidth. However, our findings must be placed in light of the SPECsfs benchmark. The regularity of the SPECsfs load combined with the small size of most NFS operations accounts for the low sensitivity to bandwidth. Given that real traffic is quite bursty, our sensitivity result for bandwidth is lower than what would be seen in a production environment. Even assuming the worst-case behavior, however, gigabit LANs will not bandwidth-limit NFS.

NFS is quite insensitive to the latencies of a typical high-performance LAN. Newer LANs will have even lower latencies as switch degrees increase. However, when crossing into the current IP routing regime of milliseconds, latency does have an impact. It will be interesting to see whether NFS is used over a wider area as IP switching becomes more common.

Our results show that NFS version 3 exhibits better tolerance to high latencies than version 2. In addition, at very high latencies our model breaks down; the system exhibits a hyper-sensitive response at WAN-class latencies on the order of tens of milliseconds.

HPC benchmarking activities in Japan

Satoshi Sekiguchi
Electrotechnical Laboratory, MITI
Tsukuba, Japan


Mitsuhisa Sato
Real World Computing Partnership
Japan

Even as we have just begun to evolve computer system design from an art into a science, the movement to distributed and heterogeneous systems adds further complexity to an already difficult task. In a very short time period, computer users have expanded their interests from traditional workstations to ensembles of workstations and ``GRID''-like network computing systems. The challenge of good HPC platform design is to successfully develop high performance components and then integrate them into a system which delivers maximal performance, as defined by the application end-users. Therefore, it is by no means simple to determine which computing system delivers higher performance.

In Japan there are several big HPC vendors still delivering exciting machines with new technologies, and the government is putting more research budget into the HPC market to encourage HPC activities. However, the benchmarking activities in Japan have not been very visible from the outside so far. The authors would like to summarize those activities from an inside point of view. In addition, an overview of the US-Japan science and technology agreement between the two governments regarding performance evaluation technologies for supercomputers is given.

Then, possible contributions to the SPEC HPG activity will be discussed. One of the authors has obtained performance numbers with the SPEC HPC benchmark running on newer parallel machines. Furthermore, the computer center of the research center of the Agency of Industrial Science and Technology would provide the opportunity to measure performance on machines such as a Hitachi SR8000 with 64 nodes and an IBM SP-2 with 256 processors, as well as their hand-made cluster of workstations with up to 256 nodes. The benchmark suite used in the procurement at the computer center will be described briefly.

Support for and Experience with Systematic Performance Tuning

Edward S. Davidson
Professor of EECS
University of Michigan
Ann Arbor, MI 48109-2122
email: davidson@eecs.umich.edu

In our work toward a Parallel Software Development Environment for Application Performance Enhancement and Comparative Assessment of Scalable Architectures, we have developed a suite of performance analysis tools. The documentation and source code of an initial tool suite have now been posted on the web (under ``tools'' at http://www.eecs.umich.edu/PPP/ ) for others to download and use.

This suite, in addition to the SMAIT trace generation package, includes

  • DSM-MB, a microbenchmarking suite for accurately characterizing the performance of important primitive operations, primarily memory and communication operations, and more complex operations based on these primitives,
  • CIAT, a configuration-independent analysis tool to identify inherent properties of an application, including an assessment of working set size (% of references captured in an ideal cache vs. cache capacity), frequencies of various communication patterns and their sharing degrees, and the slack (between definition and use) that is available for latency tolerance or code restructuring, and
  • TDAT, a time distribution analysis tool that indicates how critical events are distributed over the run time.

An overview of these tools as well as

  • CCAT, a configuration-dependent analysis tool to assess actual application performance on a specifiable system configuration, focusing primarily on the system interconnect, protocols and conflicts, and
  • CXbound, a hierarchical performance bound analysis tool that detects specific performance problems to guide the use of performance tuning techniques,

will be presented together with some illustrations of their use on applications in vehicle crash simulation and computational electromagnetics.

Performance Coupling: Case Studies for Measuring the Interactions of Kernels in Scientific Applications

Jonathan Geisler [1] and Valerie E. Taylor [2]
ECE Department
Northwestern University
email: {geisler,taylor}@ece.nwu.edu

Traditional performance optimization techniques have focused on optimizing the most time consuming kernel in an application. An example of such an optimization entails restructuring an algorithm to increase data reuse (i.e., blocking), thereby reducing cache misses. It is well known that the performance increase that is achieved when optimizing a given kernel in isolation generally does not reflect the performance increase that occurs when the new kernel is included in the larger application. This disparity in performance increase between kernel and full application is due in part to a poor understanding of how the interaction, or coupling, between kernels affects the performance of the application.

In this talk, we present a methodology for measuring this coupling and describe how the measurements can be used to obtain an efficient application. In particular, we measure the performance of a kernel executed in isolation and of pairs of kernels executed together. The pairs are identified as the adjacent kernels in the application. These measurements lead to a coupling parameter that is a ratio between the performance of the kernel pair and the sum of the isolated performances. This parameter allows the algorithm designer to know how much a kernel interacts with the other kernels in the application. With this ratio and a full understanding of the algorithm, the designer can improve the application performance by changing the kernel so that it can benefit from the work done by previous kernels and/or doing work from which future kernels can benefit.
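As a rough sketch of how such a coupling parameter might be computed (the talk's instrumentation details are not reproduced here), the snippet below times two kernels in isolation and back to back; the kernels and the cache-flushing setup hook are hypothetical stand-ins.

    import array
    import time

    def coupling(kernel_a, kernel_b, setup=lambda: None, repeats=5):
        """Coupling parameter for two adjacent kernels: the time of the pair
        run back to back divided by the sum of the isolated times. A value
        near 1.0 means little interaction; below 1.0 the second kernel
        benefits from state (e.g. cached data) left by the first; above 1.0
        the kernels interfere destructively."""
        def timed(fn):
            best = float("inf")
            for _ in range(repeats):
                setup()                       # e.g. flush or repopulate data
                t0 = time.perf_counter()
                fn()
                best = min(best, time.perf_counter() - t0)
            return best

        t_a = timed(kernel_a)
        t_b = timed(kernel_b)
        t_pair = timed(lambda: (kernel_a(), kernel_b()))
        return t_pair / (t_a + t_b)

    # Hypothetical kernels sharing one array; the second re-reads data the
    # first has just streamed through, so it may hit in cache.
    data = array.array("d", range(1_000_000))
    k1 = lambda: sum(data)
    k2 = lambda: max(data)
    print(f"coupling(k1, k2) = {coupling(k1, k2):.3f}")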

This work focuses on single processor performance. The methodology generalizes to a parallel processing framework by adding another level of interaction to represent the interconnection network. Future work will examine parallel applications more fully, but we begin with serial applications because of the importance of obtaining optimized serial codes. Within the single processor, we are looking at interaction that occurs within the memory hierarchy, in particular the first level cache.

We will present results on generating the coupling parameter for the NAS Parallel Benchmarks and one of the SPEChpc benchmarks, seis. We will show how the coupling parameter can provide insight into the development of hybrid data structures that can lead to good performance.

A Method for Benchmarks Analysis

Alex Predtechenski
Advanced Micro Devices

Presented is a methodology for analyzing the behavior of a benchmark without prior knowledge of the benchmark source code or detailed execution traces. This method is based on the old idea of profiling a computer with a particular benchmark by independently varying the CPU core and system bus frequencies. Unlike the traditional representation with the benchmark score as a dependent variable and the core clock rate as an independent variable, the proposed method seeks functional dependencies (``profiles'') in the time domain: the benchmark runtime vs. the CPU core clock period. In the time domain, the corresponding dependencies appear to be strictly linear for any statistically representative benchmark (like SPEC or Winstone). This simple functional form allows for reliable extrapolation of the benchmark scores into extended frequency ranges that are currently unattainable. Extrapolation of the runtime trendlines down to zero core clock period (an infinitely fast CPU) gives a basis for useful interpretation of system behavior.

The method makes it possible to identify three components of the run-time ``budget'': 1) ``raw CPU time'', 2) system ``bus time'' (time to access main memory when the CPU stalls due to cache misses), and 3) ``peripheral time'' spent waiting for independent peripheral devices to complete I/O operations. This is done by constructing two kinds of profiles, R and B. To build a constant-ratio profile (R-profile), the system-to-core frequency ratio is kept constant while the system frequency is varied. In this case, both the ``raw CPU time'' and the ``bus time'' tend to vanish in the limit of infinitely high frequency. However, if the benchmark has a heavy I/O component, the extrapolated profile will show some residual time. It is apparent that this residual time is only a function of the peripheral disk and/or video equipment, which operates at its own pace and does not depend on the CPU type or its core frequency.

For the B-profile, the system (bus) frequency is kept constant while the CPU core frequency is varied by changing the system-to-core ratio; the benchmark run-time is still a linear function of the core clock period. However, the B-profile points to a somewhat higher offset when extrapolated to a hypothetical infinitely fast CPU. This is because the bus state machines still operate at a constant clock rate, so the time to access external memory does not vanish in the limit and adds to the ``peripheral time''. It is shown that the system ``bus time'' is also independent of the core CPU frequency. Hence, the ``raw CPU time'' at any particular core frequency can be found by subtracting the ``bus time'' and ``peripheral time'' from the total run-time. In addition, the slope of the B-profile is inversely proportional to the average IPC for the benchmark. This interpretation makes it possible to compare the architectural efficiencies of different CPU brands. Finally, by varying other system parameters such as the size and speed of caches, or memory bandwidth, it is possible to explore other system characteristics and identify their contributions to overall system performance.
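The sketch below illustrates the profile arithmetic just described with made-up measurement points (none of the numbers come from the AMD-K6, Pentium-II, or Alpha data mentioned below): run time is fitted as a straight line in the core clock period, and the R- and B-profile intercepts separate the run-time budget into its three components.

    import numpy as np

    def fit_line(core_periods_s, runtimes_s):
        """Least-squares straight line: returns (slope, intercept)."""
        slope, intercept = np.polyfit(core_periods_s, runtimes_s, 1)
        return slope, intercept

    # Hypothetical R-profile points: bus-to-core ratio held constant while the
    # frequencies are varied together (200, 250, 300 MHz core clocks).
    r_periods = np.array([5.00e-9, 4.00e-9, 3.33e-9])
    r_times   = np.array([62.0,    50.4,    42.6])      # run time in seconds
    _, peripheral_time = fit_line(r_periods, r_times)   # intercept = I/O wait

    # Hypothetical B-profile points: bus frequency fixed, core frequency varied.
    b_periods = np.array([5.00e-9, 4.00e-9, 3.33e-9])
    b_times   = np.array([68.0,    59.5,    53.8])
    cycles_slope, b_intercept = fit_line(b_periods, b_times)  # slope ~ core cycles per run
    bus_time = b_intercept - peripheral_time             # constant memory-access time

    raw_cpu_time = b_times[0] - bus_time - peripheral_time    # at the 5 ns point
    print(f"peripheral {peripheral_time:.1f} s, bus {bus_time:.1f} s, "
          f"raw CPU {raw_cpu_time:.1f} s at 200 MHz")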

The key elements of the methodology are illustrated using the AMD-K6(tm) processor running Winstone 98 under Windows 95. Then the SPEC95 results for Pentium-II and Alpha processors are analyzed and projections are presented. The ``peripheral time'' for SPEC95 benchmarks is found to be negligibly small.

The Need for Workload Benchmarks

Larry Rudolph
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambrige, MA
email: rudolph@lcs.mit.edu

Traditional benchmark suites do not meet the needs of high performance computing, especially for general purpose distributed and parallel systems. On these computer systems, multiple application programs, each consisting of multiple independent tasks, may be executing in parallel. The constructive and destructive interactions of the tasks have a profound effect on the ultimate system performance. A benchmark suite consisting of a set of independent, self-contained, representative application programs does not exercise many of the important interactions.

A workload benchmark, on the other hand, can be used to develop, evaluate, and compare many features of modern high performance hardware and support software. A workload benchmark specifies the submission of jobs into the system and the distributions of parallelism and runtime, as found by analyzing accounting traces, and characterizes the types of jobs, including a description of the internal structure of the jobs themselves. We argue that the focus should be on on-line open systems, and propose that a standard workload should be used as a benchmark.
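As a rough, executable sketch of what such a workload specification could look like, the snippet below draws job arrival times, parallelism, and runtimes from simple distributions; the distribution families and parameters are illustrative assumptions, not the ones proposed in the talk.

    import math
    import random

    def generate_workload(n_jobs, mean_interarrival_s=60.0,
                          max_log2_width=6, mean_runtime_s=600.0, seed=42):
        """Synthetic job stream: Poisson arrivals, power-of-two parallelism,
        log-normally distributed runtimes (all illustrative choices, as one
        might fit from accounting traces)."""
        rng = random.Random(seed)
        mu = math.log(mean_runtime_s) - 0.5    # gives E[runtime] = mean for sigma = 1
        t = 0.0
        jobs = []
        for job_id in range(n_jobs):
            t += rng.expovariate(1.0 / mean_interarrival_s)
            jobs.append({
                "id": job_id,
                "submit_s": round(t, 1),
                "tasks": 2 ** rng.randint(0, max_log2_width),   # 1..64 tasks
                "runtime_s": round(rng.lognormvariate(mu, 1.0), 1),
            })
        return jobs

    for job in generate_workload(5):
        print(job)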

Consider an interactive job scheduler for a parallel computer. Fine-grained parallel jobs require gang scheduling to ensure that interacting tasks will be executing at the same time. Coarse-grained and I/O-intensive jobs, on the other hand, may perform well when their individual tasks are independently scheduled to execute whenever there is an idle processor and the physical memory requirements are met. A scheduler sensitive to the granularity and mass-storage needs of its workload may perform better than one that ignores them. A traditional benchmark suite will not highlight these differences. It is only with workload benchmarks that one can evaluate the impact of various high-performance system features.

Need for Database Trace Benchmarks

Reinhard Riedl
Department of Computer Science
Faculty of Economics
University of Zurich
CH-8057 Zurich, Switzerland
email: riedl@ifi.unizh.ch
http://www.ifi.unizh.ch/~riedl

Workload benchmarks are necessary tools for the comparison of different implementations of distributed transaction processing systems. They have to represent the typical workload properties of real-world applications, which may be derived only from real-world workload measurements. Unfortunately, `typical' traces recorded in realistic application scenarios may not themselves play the role of benchmarks. Such recordings significantly degrade the performance of the system and are therefore usually performed only for short periods of time. Thus, the available traces are rather short for a realistic evaluation of critical phenomena, such as page replacements in the page buffer.

We present an analysis of customer traces of high performance transaction processing in database/data communication systems. The data access patterns revealed by our analysis suggest that meta-benchmarks are needed for the evaluation of realistic applications. Such meta-benchmarks define procedures for the generation of synthetic workloads of arbitrary length from available workload traces. These synthetic traces have to exhibit the same characteristic workload properties as the recorded real traces, in particular with respect to footprint growth and the distribution of data access likelihoods. They may then serve as quasi-benchmarks for a particular application class. It should be especially mentioned that the simplest way of creating synthetic traces, namely the repetition of recorded traces, is insufficient, since the resulting workloads do not exhibit the desired footprint growth property (see the sketch below). The most elegant solution, namely the definition of a parameterized family of stochastic processes, does not work either, due to the high complexity of trace patterns. Moreover, industrial benchmarks like the TPC-B benchmark exhibit properties that differ strongly from the properties observed in real-world customer traces.
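A toy illustration of the footprint-growth point (the page identifiers are stand-ins for the database pages of a real trace): simply repeating a recorded trace stops touching new pages after one pass, so its footprint curve flattens in a way real workloads do not.

    def footprint_growth(trace):
        """Number of distinct pages touched in each prefix of the trace."""
        seen, growth = set(), []
        for page in trace:
            seen.add(page)
            growth.append(len(seen))
        return growth

    recorded = [1, 2, 3, 1, 4, 2, 5, 6, 3, 7]    # toy recorded access trace
    repeated = recorded * 3                       # naive "synthetic" trace

    # Footprint after 10, 20, and 30 references: 7, 7, 7 -- no further growth.
    print(footprint_growth(repeated)[9::10])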

Our conclusions with respect to the need for meta-benchmarks are supported by simulation experiments for distributed transaction processing, which we have carried out in order to compare different load distribution strategies for realistic application scenarios. The experiments demonstrate that the performance of a particular load distribution mechanism essentially depends on the workload processed, i.e., it depends on characteristic indicators of the locality of the workload and of its categorizability. This strongly suggests that there is no such thing as a universal benchmark for distributed transaction processing. What we need instead is a generic technique that allows us to create the right benchmark for each application scenario.

Designing SPEC Benchmarks as Universal Benchmark Suites

Jozo J. Dujmovic
San Francisco State University

We present metrics for computing differences between individual SPEC benchmarks. The minimum difference is 0 (for identical benchmarks) and the maximum difference is 1 (or 100%, for the most different benchmarks). Individual benchmarks can be geometrically interpreted as points in a ``program space''. Redundant benchmarks are visualized as clusters of close points.

Using quantitative differences between individual benchmarks and cluster analysis techniques it is possible to develop efficient and practical quantitative methods for evaluation and design of benchmark suites. The main design goals are to eliminate excessive redundancy and to achieve a desired distribution of benchmarks in the program space.

A particularly interesting type of benchmark suite is a suite which has (1) uniform and low redundancy between component workloads, (2) maximum size, and (3) uniform distribution of workloads within the program space. Such a benchmark suite is called a universal benchmark suite.

Specific complex workloads can be characterized by a set of weights that are used when computing SPECmark indicators from a universal benchmark suite. As opposed to the traditional method of designing a new benchmark suite for each type of workload, SPEC can design a new generation of benchmarks as universal benchmark suites. In addition, SPEC can develop several sets of weights that reflect a spectrum of selected typical complex workloads, and provide a Web-based performance calculator for computing weighted SPECmark indicators for all selected typical complex workloads.
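The abstract does not spell out the aggregation formula, but since SPEC composite indicators are geometric means of per-benchmark ratios, one natural reading of a workload-sensitive indicator is a weighted geometric mean over the universal suite. The benchmark names, ratios, and weight sets below are hypothetical.

    import math

    def weighted_specmark(ratios, weights):
        """Weighted geometric mean of per-benchmark speed ratios
        (weights are assumed to sum to 1)."""
        return math.exp(sum(w * math.log(ratios[name])
                            for name, w in weights.items()))

    ratios = {"int_a": 12.0, "fp_b": 20.0, "mem_c": 6.0}   # machine vs. reference
    workload_weights = {                                    # hypothetical workload mixes
        "engineering": {"int_a": 0.2, "fp_b": 0.6, "mem_c": 0.2},
        "database":    {"int_a": 0.5, "fp_b": 0.1, "mem_c": 0.4},
    }
    for workload, w in workload_weights.items():
        print(f"{workload}: {weighted_specmark(ratios, w):.1f}")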

The proposed approach to standard performance evaluation has the following main advantages:

  1. For each benchmark suite it is possible to provide a scientific proof of the validity of selecting component benchmarks.
  2. The number of component benchmarks can be minimized.
  3. Standard performance indicators can be customized and made workload-sensitive.
  4. The process of updating benchmark suites can be strictly controlled and made less frequent, faster, and less expensive.

These features can increase the credibility and versatility of standard industrial benchmarks, and significantly reduce the cost of benchmarking.

MPI, HPF or OpenMP - A Study with the NAS Benchmarks [3]

H. Jin, M. Frumkin, M. Hribar, A. Waheed, and J. Yan
NAS System Division, NASA Ames Research Center
Moffett Field, CA 94035-1000
email: {hjin,frumkin,hribar,waheed,yan}@nas.nasa.gov

Figure 1: Performance of the HPF and OpenMP versions of six NAS Parallel Benchmarks compared with the MPI implementation (Class-A problem size, SGI Origin2000).

Porting applications to high performance parallel and distributed platforms is a challenging task. Writing parallel code by hand is time consuming and costly; the task can be simplified by high-level languages and, better still, automated by parallelizing tools and compilers. The definition of the HPF (High Performance Fortran, based on the data parallel model) and OpenMP (based on the shared memory parallel model) standards has offered great opportunities in this respect. Both provide simple and clear interfaces to languages like Fortran and simplify many tedious tasks encountered in writing message passing (e.g., MPI) programs. HPF is independent of the memory hierarchy and easy to use; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain.

In our study, we implemented parallel versions of the NAS Benchmarks with HPF and OpenMP directives. The NAS Parallel Benchmarks (NPB), developed by researchers at NASA, consist of five kernels and three simulated applications which mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. Our implementation starts with the serial version included in the most recent release, NPB2.3. Optimization of memory and cache usage has been applied to several benchmarks, notably BT and SP, resulting in better sequential performance. The performance of the HPF and OpenMP versions of six benchmarks is compared with the MPI implementation for the Class-A problem size in Figure 1. These results were obtained on an SGI Origin2000 (64 CPUs at 195 MHz) with the MIPSpro f77 compiler 7.2.1 and the PGI pghpf 2.4.3 compiler with MPI interface.

In general, OpenMP performs remarkably well, fairly close to MPI in many cases and even better in FT. This is because no data transposition is needed for FT with OpenMP and, thus, there is less memory traffic. The 1-D pipeline implementation for LU performs better than the hyper-plane algorithm, but is still worse than the MPI code, which contains a 2-D pipeline. The memory-optimized version of BT is about 50% better than the original MPI implementation in both serial and parallel performance. Although HPF still performs the worst in many cases, it is getting closer to MPI in FT and CG. The large difference in MG reflects the fact that HPF lacks expressiveness for handling irregular computation.

Several computer-aided tools were used to help parallelize the benchmarks. The techniques used can certainly be applied to more realistic applications.

Conventional Benchmarks as a Sample of the Performance Spectrum

Dr. John Gustafson and Mr. Rajat Todi
Ames Laboratory-US DOE
Ames, IA
email: {gus,todi}@ameslab.gov

Most benchmarks are smaller than actual application programs. One reason for this is to improve the universality of the benchmark by demanding resources every computer in question is likely to have. Another is that users dynamically increase the size of application programs to match the power available, whereas most benchmarks are static and of a size appropriate for computers available when the benchmark was created. The inevitable consequence of the size mismatch is that the benchmark overstates computer performance, since smaller programs spend more of their time in cache. The HINT benchmark is scalable in that it does not fix memory, problem size, or data precision. It can thus examine the full spectrum of performance through various memory regimes, and potentially express a superset of the information given by any particular fixed-size benchmark. We present the results of using HINT to predict the ranking of computers on other benchmarks, SPEC in particular, by looking at only subsets of the performance curve generated by HINT.

A National Trace Collection and Distribution Resource

J. Kelly Flanagan and Frank Sorenson
Performance Evaluation Laboratory
Computer Science
Brigham Young University
Provo, Utah 84602
email: {Kelly,sorenson}@cs.byu.edu

Computer systems are modeled before construction to minimize errors and performance bottlenecks. A common modeling approach is to build software models of computer system components, and use realistic trace data as input. This methodology is commonly referred to as trace-driven simulation. Trace-driven simulation can be very accurate if both the system model and input trace data represent the system under test. The accuracy of the model is typically under the control of the researcher, but little or no trace data is available that accurately represents current or future workloads. This type of trace data is needed by academic and industrial researchers, and instructors and students investigating computer architecture alternatives. These parties require trace data, but are not necessarily interested in collecting the data themselves.
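For readers unfamiliar with the methodology, the sketch below shows trace-driven simulation in its simplest form: a direct-mapped cache model fed with an address trace. The cache parameters and the toy trace are illustrative only; the traces described here contain billions of references.

    def direct_mapped_hit_rate(trace, cache_bytes=16 * 1024, line_bytes=32):
        """Feed an address trace through a direct-mapped cache model and
        return the hit rate."""
        n_lines = cache_bytes // line_bytes
        tags = [None] * n_lines
        hits = 0
        for addr in trace:
            line = addr // line_bytes
            index = line % n_lines
            if tags[index] == line:
                hits += 1
            else:
                tags[index] = line           # miss: install the new line
        return hits / len(trace)

    # Toy trace: a sequential 4-byte-stride stream interleaved with a stream
    # that touches a new cache line on every reference.
    toy_trace = [a for i in range(1000)
                 for a in (0x1000 + 4 * i, 0x80000 + 64 * i)]
    print(f"hit rate: {direct_mapped_hit_rate(toy_trace):.2%}")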

The objective of this work is to collect accurate and representative trace data from a variety of computer systems running various operating systems and application workloads and make the collected trace data available to all interested parties via the Internet and other forms of media. The main focus will be to collect and distribute address traces containing references generated by both operating systems and user processes, but instruction traces and disk I/O traces will also be made available.

Address trace data will be collected using a previously developed hardware monitoring technique called BACH. This technique has been successfully used on a variety of machines to collect traces that are billions of references long and contain all references generated by the system under test. Once collected, the data will be managed and stored using large disk arrays and automated tape robot technology. The trace data will be made available to interested parties in several ways. Manageably sized traces will be made available via the WWW through a graphical user interface that allows the user to specify the machine type, configuration, operating system, and application that was used to collect the trace. Larger traces will be available on several tape formats or CD-ROM collections.

An important part of this work is to maintain a set of traces that are representative of current and future computing trends. A representative set of traces will be maintained by regularly updating the systems, operating systems, and workloads being traced. This requires input from researchers and other users of trace data. We need input in several areas:

  1. What hardware platforms are of interest to others? What CPU types, multiprocessor configurations, memory capacities, etc.?

  2. Which workloads are of interest? SPEC benchmarks, 3D graphics, interactive games, TPC benchmarks, etc.

  3. What trace formats are most useful to others? What should the traces include?

  4. What analysis tools should we make available to others? Cache simulators, processor simulators, locality surface generators, etc.

This work constitutes a significant contribution because it promises to supply researchers with accurate data to perform computer system simulations. Through the use of the resulting trace data, many researchers may make significant contributions to the computer architecture and performance evaluation communities.

POSITION STATEMENTS BY PANELISTS

PANEL:

Are SPEC Benchmarks Reaching Their Objective?


Moderator: Kaivalya Dixit, SPEC and IBM
Panelists:
Jozo Dujmovic, San Francisco State University
Larry Gray, SPEC and Hewlett Packard
Neil Gunther, Performance Dynamics
Margo Seltzer, Harvard University

EVALUATION AND COMPARISON OF SPEC 89, 92, AND 95

Jozo J. Dujmovic
San Francisco State University

Using a quantitative model of the difference between component benchmarks and cluster analysis techniques, it is possible to evaluate and compare individual SPEC benchmark suites. The results can be visualized as dendrograms and/or covergrams. In addition, benchmark suites can be compared using quantitative indicators of size, granularity, and density.

Our analysis shows that the evolution of SPEC benchmarks is a process that is moving in the right direction: new generations of SPEC benchmarks are provably better than previous generations. The size and completeness of SPEC89/92/95 suites increase, and the redundancy of workloads decreases. However, the existing shortcomings of SPEC benchmarks are measurable and could have been detected during the development of SPEC benchmark suites and not after their release.

The SPEC95 benchmarks can be substantially improved: a group of redundant workloads should be removed, and new workloads should be created to fill the empty regions of program space. The granularity and density of SPEC95 benchmarks can also be improved.

The distribution of SPEC92 workloads was non-uniform and concentrated in a few highly populated regions. In spite of twice as many workloads (and a doubling of the cost of benchmarking), the global size of SPEC92 was less than the size of SPEC89. This was a serious failure and a warning that a quantitative workload selection method is indispensable.

References

[1] Dujmovic, J.J. and Dujmovic, I.J. Evolution and Evaluation of SPEC Benchmarks. Performance Evaluation Review, Vol. 26, No. 3, pp. 2-9, December 1998.
[2] Dujmovic, J.J. Evaluation and Design of Benchmark Suites. Chapter 12 in Bagchi, K. et al. (Eds.), Advanced Computer Performance Modeling and Simulation, Gordon and Breach, pp. 278-323, 1998.

SPECial Relativity - On the Benchmarking of Moving Targets

Dr. Neil J. Gunther
Performance Dynamics Consulting(tm)
Castro Valley, California
email: ngunther@ricochet.net

SPEC was born out of workstation-vendor angst over a common and open reference for competitive CPU performance rankings. Despite some methodological errors [1] in the choice of codes, SPEC has been largely successful in that effort, having replaced ill-defined metrics like VUPS and MIPS with SPECmarks and, more recently, CINT95 and CFP95. But let's not kid ourselves: the motivation for all this has been marketing, not science. So, although these SPEC metrics do offer benefits, many undesirable side-effects remain:

  • A continued heavy emphasis on a single number performance index
  • A continued heavy emphasis on an absolute performance index
  • Poor choice of workload characterization
  • Moving the reference baseline arbitrarily
  • Adopting a different metric scale between SPEC92 and SPEC95
  • Promoting the impotent SPECrate metric for multiprocessor performance [1]
  • A lack of continuity with the SDM, SFS and other system benchmarks [2]

For me, as a practicing performance analyst, these issues have reduced the value of the SPEC benchmarks for doing system procurement and capacity planning (which is when users look at the SPEC numbers). What's missing is relative system performance indicators (plural). SPEC needs a methodology similar to that provided by IBM with its LSPR mainframe methodology [3]. I don't want a non-linearized composite index. At the least, I want to be able to say: X is 3.4 times faster than Y on workload Wi, for all i, where X and Y are either CPUs or systems.

Big system vendors want this; customers and my clients want this. I attempt to do it with the available SPEC data [4], but SPEC could vastly improve the characterization of all their benchmark suites with a common relativistic approach where the Wi are clearly characterized. Examples will be given in the presentation.

References

[1] Gunther, N.J. Musings on Multiprocessor Benchmarks: Watershed or Waterloo? (Invited article.) SPEC Newsletter, 1, 12, Fall 1989.
[2] Gunther, N.J. Flight Recorders, Timing Chains, and Directions for SPEC System Benchmarks. SPEC Newsletter, Vol. 7, Issue 1, p. 7, March 1995.
[3] Fitch, J.L. LSPR Processor Benchmarks: IBM's Approach to Large Processor Capacity Evaluation. IBM Corp., Doc. GG66-0232, 1992.
[4] Gunther, N.J. The Practical Performance Analyst, Chap. 8, McGraw-Hill, 1998.

The Case for Application-Specific Performance Evaluation

Margo I. Seltzer
Harvard University

Most performance analysis today uses either microbenchmark results or ``standard'' macrobenchmark results (e.g., SPEC, SPEC-SFS); however, these results do little to indicate how well a particular system will handle a particular application. Such results can be misleading if they reflect workloads that are different from those that will be run on a target system. As researchers, our goals for benchmarking and performance analysis are twofold. We strive to understand why systems perform as they do, and we seek to compare competing ideas, designs, and implementations. In this second goal, our needs correspond to those of consumers faced with purchasing decisions. In order for this second goal to be met, it is essential that benchmarks enable meaningful comparisons; raw benchmark numbers are useless to other researchers and customers unless those raw numbers map into some useful work.

The VINO group at Harvard University is developing a suite of application-directed benchmarks that evaluate the performance of a system in the context of one or more specific applications. We use three different approaches to application-specific measurement, one using vectors that characterize both the underlying system and an application, one using trace-driven techniques, and a hybrid approach.

In the vector-based approach, a system's performance is characterized, not by a single number, but by a vector of microbenchmark results that represent the behavior of the system on its underlying primitive operations. Similarly, an application is characterized by its use of the system's primitives. By themselves, the vectors provide insight into the system's performance and the application's demands. When the two vectors are multiplied together, the result is an application-specific indicator of performance. Such results can be used effectively to compare different systems, different applications, and different hardware platforms.
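A minimal numeric sketch of this multiplication, with hypothetical primitives, costs, and counts: the system vector holds measured per-primitive costs, the application vector holds the application's use of each primitive, and their inner product is the application-specific indicator.

    # Hypothetical system vector: measured cost per primitive operation (microseconds).
    system_vector = {
        "syscall":     2.1,
        "page_fault": 35.0,
        "small_read":  8.5,
        "small_write": 11.0,
    }
    # Hypothetical application vector: how often the application uses each primitive.
    app_vector = {
        "syscall":    120_000,
        "page_fault":     900,
        "small_read":  45_000,
        "small_write": 15_000,
    }

    predicted_us = sum(system_vector[op] * count for op, count in app_vector.items())
    print(f"application-specific indicator: {predicted_us / 1e6:.2f} s of primitive time")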

In some cases, performance is not only application-specific, but specific to a particular offered load. In this case, the vector-based approach is insufficient and we resort to a trace-based approach. Rather than characterizing the underlying system and application, we characterize the trace and use that characterization to generate a stochastic workload that can be executed as a benchmark. Characterizing the trace is a more powerful mechanism than simply replaying the trace, because it allows us to scale workloads and alter their composition yielding benchmarks that can help predict performance of future systems. This approach is useful for evaluating the performance of servers such as Web Servers, Proxy Caches and Database servers.

Our hybrid methodology, unsurprisingly, combines the vector-based and trace-based approaches. This methodology arose from our attempts to apply vector-based analysis to file systems. At first glance, the analysis of file system performance would seem to demand a trace-based approach. However, a purely trace-based approach fails to provide insight into the underlying behavior of the system. While we can characterize the behavior of the file system via a microbenchmark suite, the interaction of successive requests in the various file system caches makes a simple characterization of an application impossible. Instead, we use an application trace to generate a trace-specific application vector.

These three different approaches allow us to evaluate the performance of complex systems in the context of specific applications. As machines become smaller, cheaper, and faster, we expect that it will be common to configure systems for specific applications. In these cases, it is imperative that the performance of such machines be characterized appropriately for their target applications. The techniques described here facilitate exactly this characterization.


Footnotes:

[1] Supported by a grant from NASA Ames.

[2] Supported by grants from NASA Ames and NSF, grant no. CCR-9215482.

[3] Work supported by NASA Contract No. NAS2-14303 with MRJ Technology Solutions.

