SPECsip_Infrastructure2011 models a VoIP deployment. We anticipate that future releases of the benchmark (or separate benchmarks released by the SPEC SIP SubCommittee) will support Instant Messaging and Presence.
SIPStone is more like SPEC CPU in that it specifies 10 call-flows (what we call "scenarios") and then reports a weighted average of the 10. In that sense it is more micro-benchmark oriented. SPECsip_Infrastructure2011 is more of a macro or full-system benchmark that is meant to capture user behavior and be useful for capacity planning. Thus it uses "Simultaneous Number of Supported Subscribers" as its primary performance metric. SIPStone was developed by Columbia University and is available via license from SIPQuest.com. SPECsip_Infrastructure2011 was developed by consensus via the SPEC standardization process.
The ETSI IMS benchmark was developed by ETSI to provide a benchmark for wireless/3GPP providers using IMS. SPECsip_Infrastructure2011 is meant to be SIP specific and not make any IMS assumptions. The ETSI IMS benchmark is a specification, not a code release. SPECsip_Infrastructure2011 is both a specification and a released body of code that can be run and submitted for publication using SPEC's acceptance process. SPECsip_Infrastructure2011 focuses on a single node SIP server system under test, rather than a complete network architecture such as IMS.
Currently, only RFC 3261.
Currently, only UDP is supported. TCP, TLS, and SCTP may be supported in future releases.
For simplicity. People running the benchmark will have to be careful to make sure that the UAS is not a bottleneck and does not significantly contribute to this latency. The values for response time need to include a sufficient "fudge" factor to account for it.
There are several ways to implement voicemail in SIP. We considered using the 302 response since it follows RFC 4458, appears to be a common method for handling voicemail, and places the least amount of requirements on the SUT. However, using the 302 provokes different behavior on the SUT depending on what parts of the SIP RFC are implemented. Since these parts are optional (MAY and SHOULD, not MUST), we cannot rely on an arbitrary SIP server implementing them. Thus, a voicemail call is a standard call, albeit with different ring durations and call durations (hold times).
Yes, for all methods that are appropriate (i.e., INVITE, BYE, and REGISTER, but not ACK or CANCEL).
For security reasons, to prevent hijacking of calls (INVITE, BYE) or assuming someone else's identity (REGISTER).
It is possible in some cases to cache the nonce used in the challenge, so that the Authorization header can be re-used in later SIP requests without repeating the authorization challenge. However, for security reasons, the nonce is valid for only a limited period of time. SPECsip_Infrastructure2011 assumes that any nonce would have expired and thus the authorization challenge is necessary for each transaction.
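To see why an expired nonce forces a new challenge: SIP reuses the HTTP Digest scheme, in which the client hashes its credentials together with the server's nonce, so a fresh nonce in each challenge requires a fresh computation and an extra round trip. A minimal sketch of the no-qop digest computation (RFC 2617, as referenced by RFC 3261), with hypothetical credential values:

```python
import hashlib

def _md5hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    """RFC 2617 digest response (no-qop form) as used in SIP challenges."""
    ha1 = _md5hex(f"{username}:{realm}:{password}")   # credentials hash
    ha2 = _md5hex(f"{method}:{uri}")                  # request hash
    return _md5hex(f"{ha1}:{nonce}:{ha2}")            # ties both to the nonce

# Hypothetical values: a new nonce yields a new response, so a cached
# Authorization header is useless once the nonce has expired.
digest_response("alice", "spec.org", "secret", "REGISTER", "sip:spec.org", "abc123")
```

Because the nonce is an input to the final hash, a cached Authorization header can only be replayed while that nonce remains valid.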
The IP address is a single point-of-presence (POP) for a SUT, which may be a single SIP server or even a cluster of servers sharing a single virtual IP address. The benchmark is intended to measure a single SIP configuration. Supporting multiple IP addresses would make scaling the experiment trivial, essentially running N instances of the benchmark in parallel. Instead, those wishing to scale the benchmark using multiple machines must use a load-balancing proxy or switch, which exposes a single IP address.
The benchmark does make use of user name resolution (mapping URIs to IP addresses). All users are registered to the same domain, spec.org, and the SUT must map a user URI to the appropriate IP address. The benchmark does not perform any DNS resolution. While DNS resolution can have a significant impact on performance (e.g., if the server blocks while waiting for a resolve), modeling DNS resolution is complex and would require making estimates of DNS cache miss ratios and costs to retrieve DNS records over the network. Thus DNS resolution is outside the scope of the benchmark, at least for the first release.
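The mapping described above can be pictured as a location service keyed by address-of-record, populated by REGISTER and consulted when routing requests. A hypothetical in-memory sketch (names and addresses are illustrative, not benchmark code):

```python
# Hypothetical location service: maps an address-of-record (AOR) in the
# spec.org domain to the contact address learned from REGISTER.
bindings: dict[str, tuple[str, int]] = {}

def register(aor: str, contact_ip: str, contact_port: int) -> None:
    bindings[aor] = (contact_ip, contact_port)

def resolve(aor: str):
    # Pure table lookup -- no DNS query is involved, matching the
    # benchmark's scope.
    return bindings.get(aor)

register("sip:alice@spec.org", "192.0.2.10", 5060)
resolve("sip:alice@spec.org")  # → ("192.0.2.10", 5060)
```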
All timers are assumed to use their defaults as specified in RFC 3261.
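For reference, the RFC 3261 defaults include T1 = 500 ms; an unacknowledged INVITE sent over UDP is retransmitted on Timer A, which starts at T1 and doubles each time, until Timer B (64 × T1) ends the transaction. A small sketch of the resulting schedule:

```python
def invite_retransmit_schedule(t1: float = 0.5):
    """Retransmit times (seconds) for an unacknowledged INVITE over UDP,
    per RFC 3261: Timer A starts at T1 and doubles; Timer B = 64 * T1."""
    timer_b = 64 * t1
    times, interval, elapsed = [], t1, 0.0
    while elapsed + interval < timer_b:
        elapsed += interval
        times.append(elapsed)
        interval *= 2          # exponential backoff
    return times, timer_b

invite_retransmit_schedule()  # → ([0.5, 1.5, 3.5, 7.5, 15.5, 31.5], 32.0)
```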
While DB-like interactions are an important component of any SIP deployment, the SPEC SIP SubCommittee decided to omit this from the benchmark for two reasons: simplicity and lack of standardization. Simplicity because omitting the DB made the benchmark simpler and more a measure of the native SIP stack rather than of a DB server. Lack of standardization because there does not yet seem to be a standard way of communicating with a DB-like server that is actually widely used across industry. At the moment, many different approaches appear to be used: JDBC, LDAP, RADIUS, Diameter, and even proprietary protocols. Choosing one of these protocols would be expressing favoritism and biasing against products that did not support that protocol; choosing several of them would make apples-to-apples comparisons difficult. After a great deal of discussion, the SPEC SIP SubCommittee decided not to include this in the benchmark, with the understanding that future releases of the benchmark might change this decision. For example, if Diameter becomes widely deployed in practice, it is possible to imagine that a later release of SPECsip_Infrastructure would require the SUT to use Diameter.
The values were based on workload studies done by Communigate Systems and IBM. Workload characterizations of SIP server deployments are difficult to acquire, and thus the values may reflect individual deployments rather than more representative scenarios. This is one reason why the benchmark is so parameterizable. The SPEC SIP SubCommittee encourages more thorough workload characterization of SIP servers and is open to improving the benchmark in terms of how representative it is.
The SPECsip_Infrastructure2011 benchmark is a stochastic benchmark that uses random number generation for load creation. A large number of subscribers is necessary to ensure that the workload traffic generated for a given number of subscribers is consistent and reproducible, and that results obtained for that number of users are comparable. Lower loads exhibit larger variability, which makes ensuring consistency more difficult.