The FSL-er's Guide to File System and Storage Benchmarking

File systems and Storage Laboratory (FSL)

Version 2 (May 4, 2007)

This page provides a set of guidelines to consider when evaluating the performance of a file system. This information was collected by Avishay Traeger, Nikolai Joukov, Charles P. Wright, and Erez Zadok. Our motivation is to improve the quality of performance evaluations presented in papers. If you are interested in more information, see the File Systems Benchmarking Project home page. If you have any comments on this page, please send them to Erez Zadok.

The two underlying themes are:

Explain what you did in as much detail as possible:
For example, if you decided to create your own benchmark, please detail what you have done. If you are replaying traces, describe where they came from, how they were captured, and how you are replaying them (what tool? what speed?). This can help others understand and validate your results.
In addition to saying what you did, say why you did it that way:
For example, while it is important to note that you are using ext2 as a baseline for your analysis, it is just as important (or perhaps even more important) to discuss why it is a fair comparison. Similarly, it is useful for the reader to know why you ran that random-read benchmark so that they know what conclusions to draw from the results.

1. Choosing The Benchmark Configurations

1.1 Pose questions that will reveal the performance characteristics of the system

Some examples are "How does my system compare to current similar systems?", "How does my system behave under its expected workload?", and "What are the causes of my performance improvements or overheads?"

1.2 Decide on what baseline systems, system configurations, and benchmarks should be used to best answer the questions posed

This will produce a set of <system, configuration, benchmark> tuples that will need to be run. It is desirable for the researcher to have a rough idea what the expected results might be for each configuration at this point; if the actual results differ from these expectations, then the causes of the deviations should be investigated.
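As a minimal sketch of how such a set of tuples might be enumerated, the snippet below builds the full run matrix and drops pairings that make no sense. The system, configuration, and benchmark names are placeholders chosen for illustration, not a recommended selection.

```python
from itertools import product

# Placeholder names, chosen only for illustration.
systems = ["ext2", "cryptfs"]
configurations = ["default", "null-cipher"]
benchmarks = ["postmark", "random-read"]

def build_run_matrix(systems, configurations, benchmarks, skip=()):
    """Return every (system, configuration, benchmark) tuple, minus skipped pairings."""
    skip = set(skip)
    return [
        (s, c, b)
        for s, c, b in product(systems, configurations, benchmarks)
        if (s, c) not in skip
    ]

# A baseline file system has no cipher to disable, so that pairing is skipped.
run_matrix = build_run_matrix(
    systems, configurations, benchmarks, skip=[("ext2", "null-cipher")]
)

for system, config, bench in run_matrix:
    print(f"{system:8s} {config:12s} {bench}")
```

Writing the matrix down before any runs start also gives a natural place to record the expected result for each tuple.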

Since a system's performance is generally more meaningful when compared to the performance of existing technology, one should find existing systems that provide fair and interesting comparisons. For example, when benchmarking an encrypted storage device, it would be useful to compare its performance to that of other encrypted storage devices, a traditional device, and perhaps some alternate implementations (user-space, file system, etc.).

The system under test may have several configurations that will need to be evaluated in turn. In addition, one may create artificial configurations where a component of the system is removed to determine its overhead. For example, in an encryption file or storage system, you can use a null cipher (copy data only) rather than encrypt, to isolate the overhead of encryption. Determining the cause of overheads may also be done using profiling techniques. Showing this incremental breakdown of performance numbers helps the reader to better understand a system's behavior.
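The incremental breakdown described above can be illustrated with a small calculation. The elapsed times below are invented for the example and do not come from any real system; the configuration names are likewise assumptions.

```python
def overhead_pct(baseline, measured):
    """Overhead of `measured` relative to `baseline`, in percent."""
    return (measured - baseline) * 100.0 / baseline

# Invented elapsed times in seconds; not measurements from any real system.
elapsed = {
    "baseline": 100.0,     # unmodified file system
    "null_cipher": 108.0,  # encryption layer present, but data is only copied
    "full_cipher": 130.0,  # encryption layer with a real cipher
}

total = overhead_pct(elapsed["baseline"], elapsed["full_cipher"])
layering = overhead_pct(elapsed["baseline"], elapsed["null_cipher"])
encryption = total - layering  # share attributable to the cipher itself

print(f"total overhead:     {total:.1f}%")
print(f"  layering/copying: {layering:.1f}%")
print(f"  encryption:       {encryption:.1f}%")
```

With these made-up numbers, most of the overhead would be attributed to the cipher rather than to the extra layer, which is the kind of conclusion a null-cipher configuration makes possible.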

1.3 Choose the benchmarks

There are three main types of benchmarks:

Macro-benchmarks: These exercise multiple file system operations, and are usually good for an overall view of the system's performance, though the workload may not be realistic.
Traces: Replaying traces can also provide an overall view of the system's performance. Traces are usually meant to exercise the system with a representative real-world workload, which can help to better understand how a system would behave under normal use. However, one should ensure that the trace is in fact representative of that workload (for example, the trace should capture a large enough sample), and that the method used to replay the trace preserves the characteristics of the workload.
Micro-benchmarks: These exercise few (usually one or two) operations. These are useful if you are measuring a very small change, to better understand the results of a macro-benchmark, to isolate the effects of specific parts of the system, or to show worst-case behavior. In general, these benchmarks are more meaningful when presented together with other benchmarks.
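One characteristic that a replay method may need to preserve is the spacing between requests. The following is a minimal sketch of a timing-preserving replayer, assuming a hypothetical trace format of (seconds since trace start, operation) pairs; the `issue` callback, the `speed` parameter, and the sample trace are all illustrative and do not correspond to any particular replay tool.

```python
import time

def replay_trace(trace, issue, speed=1.0, clock=time.monotonic, sleep=time.sleep):
    """Replay (timestamp, operation) records, preserving inter-arrival times.

    `trace` is a list of (seconds_since_trace_start, operation) pairs sorted by
    timestamp. `issue` is called with each operation. A `speed` above 1
    compresses the gaps between requests; a `speed` of 1 replays at the
    captured rate.
    """
    start = clock()
    for timestamp, operation in trace:
        due = start + timestamp / speed  # when this request should be issued
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        issue(operation)

# Hypothetical three-request trace, replayed at twice the captured speed.
trace = [(0.00, "open /a"), (0.02, "read /a"), (0.05, "close /a")]
issued = []
replay_trace(trace, issued.append, speed=2.0)
print(issued)
```

Scheduling each request against the start time, rather than sleeping for each gap in turn, keeps delays in one request from accumulating over the rest of the replay. Whatever speed is used should be reported along with the results.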

Useful file system benchmarks should highlight the high-level as well as the low-level performance. Therefore, we recommend using at least one macro-benchmark or trace to show a high-level view of performance, along with several micro-benchmarks to highlight more focused views. In addition, there are several workload properties that might be considered:

CPU-boundedness: File and storage system benchmarks should generally be I/O-bound, but a CPU-bound benchmark may also be run for systems that exercise the CPU.
Accurate timing: If the benchmark records its own timings, it should use accurate measurements.
Workload scalability: The benchmark should exercise each machine the same amount, independent of hardware or software speed.
Multi-threaded workloads: These may provide more realistic scenarios, and may help saturate the system with requests.
Well-understood workloads: Although the code of synthetic benchmarks can be read, and traces can be analyzed, it is more difficult to understand some application workloads. For example, compile benchmarks can behave rather differently depending on the testbed's architecture, installed software, and the version of the software being compiled. The source code for ad-hoc benchmarks should be publicly released, as it is the only truly complete description of your benchmark that would allow others to reproduce it (including any bugs or unexpected behavior).
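To illustrate the timing and scalability properties above, here is a small sketch in Python. It times a workload with a monotonic, high-resolution clock and uses a stand-in workload that performs a fixed number of operations rather than running for a fixed amount of time. The function names are illustrative, and the loop is only a placeholder for real file system operations.

```python
import time

def time_workload(workload, *args):
    """Return (result, elapsed_seconds) using a monotonic high-resolution clock."""
    start = time.perf_counter_ns()
    result = workload(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns / 1e9

def fixed_work(n_ops):
    """Stand-in workload: a fixed number of operations, regardless of machine speed."""
    total = 0
    for i in range(n_ops):
        total += i
    return total

info = time.get_clock_info("perf_counter")
result, seconds = time_workload(fixed_work, 100_000)
print(f"clock resolution: {info.resolution} s, monotonic: {info.monotonic}")
print(f"{seconds:.6f} s for a fixed 100000 operations")
```

A wall-clock source that can be adjusted while the benchmark runs is a poor choice for measuring intervals, which is why the sketch checks that the clock is monotonic and reports its resolution.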

2. Choosing The Benchmarking Environment

The state of the system during the benchmark's runs can have a significant effect on results. After determining an appropriate state, it should be created accurately and reported along with the results.

The state of the system's caches can affect the code paths that are tested and thus affect benchmark results. It is not always clear whether benchmarks should be run with "warm" or "cold" caches. On one hand, real systems do not generally run with completely cold caches. On the other hand, a benchmark that accesses too much cached data may be unrealistic as well: because its requests are mainly serviced from memory, the file or storage system will not be adequately exercised. Further, not bringing the cache back to a consistent state between runs can cause timing inconsistencies. If cold-cache results are desired, caches should be cleared before each run. This can be done by allocating and freeing large amounts of memory, remounting the file system, reloading the storage driver, or rebooting. We have found that rebooting is more effective than the other methods [4]. When working in an environment with multiple machines, the caches on all necessary machines must be cleared. This helps create identical runs, thus ensuring more stable results. If, however, warm-cache results are desired, this can be achieved by running the experiment N+1 times and discarding the first run's result.
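Both procedures can be sketched in a few lines. In the sketch below, `measure` stands for one benchmark run that returns its elapsed time, and `clear_caches` is a hypothetical hook for whichever clearing method is chosen (remounting, reloading the driver, and so on); neither is a real API.

```python
def run_warm(measure, n_runs):
    """Run `measure` n_runs + 1 times and discard the first (cache-warming) result."""
    results = [measure() for _ in range(n_runs + 1)]
    return results[1:]

def run_cold(measure, clear_caches, n_runs):
    """Call `clear_caches` before every run so each run starts from the same state."""
    results = []
    for _ in range(n_runs):
        clear_caches()
        results.append(measure())
    return results

# Stand-in measurement: the first call is slow (cold), later calls are fast.
timings = iter([9.0, 2.1, 2.0, 2.2])
warm = run_warm(lambda: next(timings), n_runs=3)
print(warm)
```

Rebooting cannot be expressed as an in-process hook, so a harness that relies on it has to persist its progress across reboots.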

Most modern disks use Zoned Constant Angular Velocity (ZCAV) to store data. In this design, the cylinders are divided into zones, where the number of sectors in a cylinder increases with the distance from the center of the disk. Because of this, the transfer rate varies from zone to zone [2]. It has been recommended to minimize ZCAV effects by creating a partition of the smallest possible size on the outside of the disk [1]. However, this makes results less realistic, and may not be appropriate for all benchmarks (for example, long seeks may be necessary to show the effectiveness of the system). We recommend simply specifying the location of the test partition in the paper, to help reproducibility.

Most file system and storage benchmarks are run on an empty system, which could make the results differ from those obtained in a real-world setting. A system may be aged by running a workload based on system snapshots [3]. Other ways to age a system before running a benchmark are to run a long-term workload, to copy an existing raw image, or to replay a trace. It should be noted that for some systems and benchmarks, aging is not a concern. For example, aging will not have any effect when replaying a block-level trace on a traditional storage device, since the benchmark will behave identically regardless of the disk's contents.

To ensure the reproducibility of the results, all non-essential services and processes should be stopped before running the benchmark. These processes can cause anomalous results (outliers) or higher than normal standard deviations for a set of runs. However, processes such as cron will coexist with the system when used in the real world, and so it must be understood that these results are measured in a sterile environment. Ideally, we would be able to demonstrate performance with the interactions of other processes present. However, this is difficult because the set of processes is specific to a machine's configuration. Instead, we recommend using multi-threaded workloads because they more accurately depict a real system that normally has several active processes. In addition, we recommend ensuring that no users log in to the test machines during a benchmark run, and that no other traffic is consuming network bandwidth while running benchmarks that involve the network.

3. Running The Benchmarks

We recommend four important guidelines for running benchmarks properly. First, one should ensure that every benchmark run is identical. Second, each test should be run several times to ensure accuracy, and standard deviations or confidence intervals should be computed to determine the appropriate number of runs. Third, tests should be run for a long enough period of time that the system is in steady state for the majority of the run. Fourth, the benchmarking process should preferably be automated using scripts or available tools such as Auto-pilot [4], to minimize the mistakes associated with manual, repetitive tasks.
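A rough sketch of a script that follows the first, second, and fourth guidelines is shown below. It is not how Auto-pilot works. The `setup` and `measure` callbacks, the stopping rule based on the relative standard error of the mean, and all thresholds are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def run_until_stable(setup, measure, min_runs=5, max_runs=30, max_rel_stderr=0.02):
    """Repeat setup() + measure() until the standard error of the mean is small.

    Stops once at least `min_runs` results exist and stderr / mean is at most
    `max_rel_stderr`, or after `max_runs` runs. All thresholds are illustrative.
    """
    results = []
    while len(results) < max_runs:
        setup()  # restore the same initial state before every run
        results.append(measure())
        if len(results) >= min_runs:
            rel_stderr = stdev(results) / sqrt(len(results)) / mean(results)
            if rel_stderr <= max_rel_stderr:
                break
    return results

# Stand-in measurements (seconds) in place of a real benchmark.
samples = iter([10.0, 10.4, 9.8, 10.1, 9.9, 10.2, 10.0])
setups = []
results = run_until_stable(lambda: setups.append("reset"), lambda: next(samples))
print(len(results), "runs:", results)
```

If the loop reaches `max_runs` without stabilizing, that is a signal to investigate the source of the variation rather than to keep adding runs.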

4. Presenting The Results

Once results are obtained, they should be presented appropriately so that accurate conclusions may be derived from them. Aside from the data that is presented, the benchmark configurations and environment should be accurately described. Proper graphs should be displayed, with error bars, where applicable.

We recommend using confidence intervals, rather than standard deviation, to present results. The standard deviation is a measure of how much variation there is between the runs. The half-width of the confidence interval describes how far the true value may be from the captured mean with a given degree of confidence (e.g., 95%). This provides a better sense of the true mean. In addition, as more benchmark runs are performed, the standard deviation may not decrease, but the width of the confidence interval will generally decrease.
For experiments with fewer than 30 runs, one should be careful not to use the normal distribution for calculating confidence intervals. With so few runs, the sample standard deviation is an imprecise estimate of the true one, and the normal distribution understates the resulting uncertainty. Instead, one should use Student's t-distribution. This distribution may also be used for experiments with at least 30 runs, since in that case it is similar to the normal distribution.
Large confidence-interval widths or non-normal distributions may indicate a software bug or benchmarking error. As a rule of thumb, the half-widths of the confidence intervals should be less than 5% of the mean. If the results are not stable, then either there is a bug in the code that should be fixed, or the instability should be explained. Anomalous results (e.g., outliers) should never be discarded. If they are due to programming or benchmarking errors, the problem should be fixed and the benchmarks rerun to gather new and more stable results.
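The calculation can be sketched as follows, using Student's t-distribution for the 95% confidence interval of the mean. The critical values are the standard two-sided table entries; the run times are invented for the example, and the fallback to the normal critical value beyond 30 degrees of freedom is an approximation.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t-distribution for 1..30 degrees
# of freedom (standard table values).
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]

def confidence_interval_95(samples):
    """Return (mean, half_width) of the 95% confidence interval for the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs")
    df = n - 1
    # Beyond the table, fall back to the normal critical value, which the
    # t-distribution approaches as the number of runs grows (an approximation).
    t = T_95[df - 1] if df <= len(T_95) else 1.960
    return mean(samples), t * stdev(samples) / sqrt(n)

# Invented elapsed times (seconds) from ten runs.
runs = [41.2, 40.8, 41.5, 40.9, 41.1, 41.3, 40.7, 41.0, 41.4, 41.1]
m, hw = confidence_interval_95(runs)
print(f"mean = {m:.2f} s, 95% CI half-width = {hw:.2f} s ({100 * hw / m:.1f}% of mean)")
if hw / m > 0.05:
    print("half-width exceeds 5% of the mean: investigate before reporting")
```

Because the half-width is divided by the square root of the number of runs, adding runs narrows the interval even when the standard deviation stays the same.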

5. Validating Results

Other researchers may wish to benchmark your software for two main reasons:

to reproduce your results or confirm them, or
to compare their system to yours.

First, it is considered good scientific practice to provide enough information for others to validate your results. This includes detailed hardware and software specifications about the testbeds. Although it is usually not practical to include such large amounts of information in a conference paper, it can be published in an online appendix. While it can be difficult for a researcher to validate another's results accurately without the exact testbed, it is still possible to check whether the results generally correlate.

Second, there may be a case where a researcher creates a system that has similar properties to yours (e.g., they are both encryption file systems). It would be logical for the researcher to compare the two systems. However, if your paper showed an X% overhead over ext2, and the new file system has a Y% overhead over ext2, no claim can be made about which of the two file systems is better because the benchmarking environment is different. The researcher should benchmark both research file systems using a setup that is as similar as possible to that of the original benchmark. This way both file systems are tested under the same conditions. Moreover, since they are running the benchmark in the same way that you did, no claim can be made that they chose a specific case in which their file system might perform better.

To help solve these two issues:

Enough information should be made available about your testbed (both hardware and any relevant software) so that an outside researcher can validate your results.
If possible, make your software available to other researchers so that they can compare their system to yours. Releasing the source is preferred, but a binary release can also be helpful if there are legal issues preventing the release of source code. In addition, any benchmarks that were written and any traces that were collected should be made available to others.
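Part of the testbed description can be captured automatically at benchmark time. The sketch below records what the Python standard library can discover and leaves the rest to the experimenter; the field names and the `test_partition` entry are illustrative, not a standard format.

```python
import json
import os
import platform
import sys

def describe_testbed(extra=None):
    """Collect basic hardware and software details of the current machine."""
    uname = platform.uname()
    info = {
        "os": uname.system,
        "kernel_release": uname.release,
        "kernel_version": uname.version,
        "architecture": uname.machine,
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }
    # Details the standard library cannot discover (disk model, partition
    # location, file system options, ...) are supplied by the experimenter.
    info.update(extra or {})
    return info

testbed = describe_testbed({"test_partition": "hypothetical: first 10 GB of disk"})
print(json.dumps(testbed, indent=2, sort_keys=True))
```

Storing this record next to the raw results makes it easier to publish both together, for example in an online appendix.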

References

[1] Ellard, D. and Seltzer, M. 2003. NFS Tricks and Benchmarking Traps. In Proceedings of the Annual USENIX Technical Conference. USENIX Association, San Antonio, TX, 101–114.

[2] Van Meter, R. 1997. Observing the Effects of Multi-Zone Disks. In Proceedings of the Annual USENIX Technical Conference. USENIX Association, Anaheim, CA, 19–30.

[3] Smith, K. A. and Seltzer, M. I. 1997. File System Aging — Increasing the Relevance of File System Benchmarks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM SIGOPS, Seattle, WA, 203–213.

[4] Wright, C.P., Joukov, N., Kulkarni, D., Miretskiy, Y., and Zadok, E. 2005. Auto-pilot: A Platform for System Software Benchmarking. In Proceedings of the Annual USENIX Technical Conference, FREENIX Track. USENIX Association, Anaheim, CA, 175–187.