The FSL-er's Guide to File System and Storage Benchmarking
Version 2 (May 4, 2007)
This page provides a set of guidelines to consider when evaluating
the performance of a file system. This information was collected by
Avishay Traeger, Nikolai Joukov, Charles P. Wright, and Erez Zadok.
Our motivation is to improve the quality of performance evaluations
presented in papers. If you are interested in more information, see
the File Systems Benchmarking Project home page. If you have any
comments on this page, please send them to Erez Zadok.
The two underlying themes are:
- Explain what you did in as much detail as possible:
For example, if you decided to create your own benchmark, please detail what you
have done. If you are replaying traces, describe where they came from, how they
were captured, and how you are replaying them (what tool? what speed?). This
can help others understand and validate your results.
- In addition to saying what you did, say why you did it that way:
For example, while it is important to note that you are using ext2 as a baseline
for your analysis, it is just as important (or perhaps even more important) to
discuss why it is a fair comparison. Similarly, it is useful for the reader to
know why you ran that random-read benchmark so that they know what conclusions
to draw from the results.
1. Choosing The Benchmark Configurations
1.1 Pose questions that will reveal the performance characteristics of the
system
Some examples are "How does my system compare to current similar systems?",
"How does my system behave under its expected workload?", and "What are the
causes of my performance improvements or overheads?"
1.2 Decide on what baseline systems, system configurations, and benchmarks
should be used to best answer the questions posed
This will produce a set of <system, configuration,
benchmark> tuples that will need to be run. It is desirable for the
researcher to have a rough idea what the expected results might be for each
configuration at this point; if the actual results differ from these
expectations, then the causes of the deviations should be investigated.
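As a minimal sketch of how such a set of tuples might be enumerated, the
following Python fragment builds the run matrix and filters out combinations
that make no sense. All system, configuration, and benchmark names are
hypothetical placeholders, as is the `applicable` rule.

```python
from itertools import product

# Hypothetical names, used only to illustrate the shape of the run matrix.
systems = ["ext2", "my-encrypting-fs"]
configurations = ["default", "null-cipher"]
benchmarks = ["macro", "random-read"]


def build_run_matrix(systems, configurations, benchmarks, applicable):
    """Return every <system, configuration, benchmark> tuple that applies."""
    return [
        run
        for run in product(systems, configurations, benchmarks)
        if applicable(*run)
    ]


def applicable(system, configuration, benchmark):
    # Assumption for this example: a null cipher is meaningless on the baseline.
    return not (system == "ext2" and configuration == "null-cipher")


matrix = build_run_matrix(systems, configurations, benchmarks, applicable)
for run in matrix:
    print(run)
```

Writing the matrix down before running anything also gives a natural place to
record the expected result for each tuple.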
Since a system's performance is generally more meaningful when compared to the
performance of existing technology, one should find existing systems that
provide fair and interesting comparisons. For example, for benchmarking an
encryption storage device, it would be useful to compare the performance to
other encrypted storage devices, a traditional device, and perhaps some
alternate implementations (user-space, file system, etc.).
The system under test may have several configurations that will need to be
evaluated in turn. In addition, one may create artificial configurations where
a component of the system is removed to determine its overhead. For example, in
an encryption file or storage system, you can use a null cipher (copy data only)
rather than encrypt, to isolate the overhead of encryption. Determining the
cause of overheads may also be done using profiling techniques. Showing this
incremental breakdown of performance numbers helps the reader to better
understand a system's behavior.
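The incremental breakdown described above can be computed with a few lines of
arithmetic. The sketch below is illustrative only: the function name is our
own, and the timings are made-up numbers, not measurements.

```python
def overhead_breakdown(baseline, stages):
    """Break overhead down stage by stage.

    baseline: elapsed seconds for the baseline system.
    stages:   list of (name, elapsed_seconds), ordered from the fewest to the
              most components enabled (e.g., null cipher, then real cipher).
    Returns a list of (name, elapsed, total_pct, step_pct), where both
    percentages are relative to the baseline.
    """
    rows = []
    previous = baseline
    for name, elapsed in stages:
        total_pct = 100.0 * (elapsed - baseline) / baseline
        step_pct = 100.0 * (elapsed - previous) / baseline
        rows.append((name, elapsed, total_pct, step_pct))
        previous = elapsed
    return rows


# Made-up timings, purely for illustration.
rows = overhead_breakdown(100.0, [("null cipher", 110.0), ("real cipher", 125.0)])
for name, elapsed, total_pct, step_pct in rows:
    print(f"{name}: {elapsed:.1f}s, +{total_pct:.1f}% total, "
          f"+{step_pct:.1f}% added by this step")
```

With these example numbers, the data copy accounts for 10 percentage points of
overhead and the encryption itself for a further 15.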
1.3 Choose the benchmarks
There are three main types of benchmarks:
- Macro-benchmarks:
These exercise multiple file system operations, and are usually good for an
overall view of the system's performance, though the workload may not be
realistic.
- Traces:
Replaying traces can also provide an overall view of the system's performance.
Traces are usually meant to exercise the system with a representative real-world
workload, which can help to better understand how a system would behave under
normal use. However, one should ensure that the trace is in fact representative
of that workload (for example, the trace should capture a large enough sample),
and that the method used to replay the trace preserves the characteristics of
the workload.
- Micro-benchmarks:
These exercise few (usually one or two) operations. These are useful if you are
measuring a very small change, to better understand the results of a
macro-benchmark, to isolate the effects of specific parts of the system, or to
show worst-case behavior. In general, these benchmarks are more meaningful when
presented together with other benchmarks.
Useful file system benchmarks should highlight the high-level as well as the
low-level performance. Therefore, we recommend using at least one
macro-benchmark or trace to show a high-level view of performance, along with
several micro-benchmarks to highlight more focused views. In addition, there
are several workload properties that might be considered:
- CPU-boundedness: File and storage system benchmarks should generally
be I/O-bound, but a CPU-bound benchmark may also be run for systems that
exercise the CPU.
- Accurate timing: If the benchmark records its own timings, it
should use accurate measurements.
- Workload scalability: The benchmark should exercise each machine
the same amount, independent of hardware or software speed.
- Multi-threaded workloads: These may provide more realistic
scenarios, and may help saturate the system with requests.
- Well-understood workloads: Although the code of synthetic benchmarks
can be read, and traces can be analyzed, it is more difficult to understand some
application workloads. For example, compile benchmarks can behave rather
differently depending on the testbed's architecture, installed software, and the
version of the software being compiled. The source code for ad-hoc benchmarks
should be publicly released, as it is the only truly complete description of
your benchmark that would allow others to reproduce it (including any bugs or
unexpected behavior).
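For the accurate-timing property, a benchmark that records its own timings
might use a monotonic, high-resolution clock rather than wall-clock time, which
can jump when the system clock is adjusted. The following is a minimal sketch
using Python's standard library; the helper names are our own.

```python
import time


def timed(operation, *args, **kwargs):
    """Run `operation` and return (result, elapsed_seconds)."""
    start = time.perf_counter_ns()
    result = operation(*args, **kwargs)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns / 1e9


def clock_resolution():
    """Report the resolution, in seconds, that the platform claims for the clock."""
    return time.get_clock_info("perf_counter").resolution


result, seconds = timed(sum, [1, 2, 3])
print(f"result={result}, elapsed={seconds:.9f}s, "
      f"clock resolution={clock_resolution():.1e}s")
```

If the measured operation takes only a few multiples of the clock resolution,
the timing is dominated by measurement error and the operation should be
repeated many times per measurement.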
2. Choosing The Benchmarking Environment
The state of the system during the benchmark's runs can have a significant
effect on results. After determining an appropriate state, it should be created
accurately and reported along with the results.
The state of the system's caches can affect the code-paths that are tested
and thus affect benchmark results. It is not always clear if benchmarks should
be run with "warm" or "cold" caches. On one hand, real systems do not
generally run with completely cold caches. On the other hand, a benchmark that
accesses too much cached data may be unrealistic as well. Because requests are
mainly serviced from memory, the file or storage system will not be adequately
exercised. Further, not bringing the cache back to a consistent state between
runs can cause timing inconsistencies. If cold-cache results are desired,
caches should be cleared before each run. This can be done by
allocating and freeing large amounts of memory, remounting
the file system, reloading the storage driver, or rebooting. We have found that
rebooting is more effective than the other methods [4]. When
working in an environment with multiple machines, the caches on all necessary
machines must be cleared. This helps create identical runs, thus ensuring more
stable results. If, however, warm-cache results are desired, this can be
achieved by running the experiment N+1 times, and discarding the first run's
result.
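The warm-cache procedure (run N+1 times, discard the first result) can be
sketched as follows. This is an illustration under the assumption that the
benchmark is exposed as a Python callable; the function name is hypothetical.

```python
import time


def warm_cache_runs(benchmark, n):
    """Run `benchmark` n+1 times and return the timings of the last n runs.

    The first run only serves to warm the caches, so its timing is discarded.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    timings = []
    for _ in range(n + 1):
        start = time.perf_counter()
        benchmark()
        timings.append(time.perf_counter() - start)
    return timings[1:]


# Example with a trivial stand-in workload.
calls = []
timings = warm_cache_runs(lambda: calls.append(1), 3)
print(f"{len(calls)} runs executed, {len(timings)} timings kept")
```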
Most modern disks use Zoned Constant Angular Velocity (ZCAV) to store data.
In this design, the cylinders are divided into zones, where the number of
sectors in a cylinder increases with the distance from the center of the disk.
Because of this, the transfer rate varies from zone to zone [2]. It has been
recommended to minimize ZCAV effects by creating a partition of the smallest
possible size on the outside of the disk [1]. However, this makes results less
realistic, and may
not be appropriate for all benchmarks (for example, long seeks may be necessary
to show the effectiveness of the system). We recommend simply specifying the
location of the test partition in the paper, to help reproducibility.
Most file system and storage benchmarks are run on an empty system, which
could make the results differ from those seen in a real-world setting. A
system may be
aged by running a workload based on system snapshots [3].
Some other ways to age a system before running a benchmark are to run a
long-term workload, copy an existing raw image, or to replay a trace before
running the benchmark. It should be noted that for some systems and benchmarks,
aging is not a concern. For example, aging will not have any effect when
replaying a block-level trace on a traditional storage device, since the
benchmark will behave identically regardless of the disk's contents.
To ensure the reproducibility of the results, all non-essential services and
processes should be stopped before running the benchmark. These processes can
cause anomalous results (outliers) or higher than normal standard deviations for
a set of runs. However, processes such as cron will coexist with the
system when used in the real world, and so it must be understood that these
results are measured in a sterile environment. Ideally, we would be able to
demonstrate performance with the interactions of other processes present.
However, this is difficult because the set of processes is specific to a
machine's configuration. Instead, we recommend using multi-threaded workloads
because they more accurately depict a real system that normally has several
active processes. In addition, we recommend ensuring that no users log into
the test machines during a benchmark run, and that no other traffic consumes
network bandwidth while running benchmarks that involve the network.
3. Running The Benchmarks
We recommend four important guidelines for running benchmarks properly. First,
one should ensure that every benchmark run is identical. Second, each test
should be run several times to ensure accuracy, and standard deviations or
confidence intervals should be computed to determine the appropriate number of
runs. Third, tests should be run for a long enough period of time, so that the
system reaches steady state for the majority of the run. Fourth, the
benchmarking process should preferably be automated using scripts or available
tools such as Auto-pilot [4] to minimize mistakes associated with manual
repetitive tasks.
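As a rough sketch of what a small automation script might look like (this is
not how Auto-pilot works; it is a stand-alone illustration with hypothetical
function names), the harness below restores the same state before every run,
times each run, and summarizes the results.

```python
import statistics
import time


def run_benchmark(setup, benchmark, runs):
    """Execute `setup` then `benchmark` for each run; return elapsed seconds."""
    timings = []
    for _ in range(runs):
        setup()  # untimed: bring the system back to the same state every run
        start = time.perf_counter()
        benchmark()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation); needs at least two runs."""
    return statistics.mean(timings), statistics.stdev(timings)


# Example with trivial stand-ins for the real setup and workload.
events = []
timings = run_benchmark(lambda: events.append("setup"),
                        lambda: events.append("run"), runs=5)
mean, stdev = summarize(timings)
print(f"runs={len(timings)} mean={mean:.9f}s stdev={stdev:.9f}s")
```

In a real experiment, `setup` would perform steps such as recreating the file
system or clearing caches, and `benchmark` would launch the actual workload.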
4. Presenting The Results
Once results are obtained, they should be presented appropriately so that
accurate conclusions may be derived from them. Aside from the data that is
presented, the benchmark configurations and environment should be accurately
described. Proper graphs should be displayed, with error bars, where
applicable.
- We recommend using confidence intervals, rather than standard deviation, to
present results. The standard deviation is a measure of how much variation
there is between the runs. The half-width of the confidence interval describes
how far the true value may be from the captured mean with a given degree of
confidence (e.g., 95%). This provides a better sense of the true mean. In
addition, as more benchmark runs are performed, the standard deviation may not
decrease, but the width of the confidence interval will generally decrease.
- For experiments with fewer than 30 runs, one should be careful not to use
the normal distribution for calculating confidence intervals. With so few
samples, the standard deviation is itself only an estimate, and the normal
approximation understates the resulting uncertainty. Instead, one should use
the Student's t-distribution. This distribution may also be used for
experiments with at least 30 runs, since in this case it is similar to the
normal distribution.
- Large confidence-interval widths or non-normal distributions may indicate a
software bug or benchmarking error. As a guideline, we recommend that the
half-widths of the confidence intervals be less than 5% of the mean. If the
results are not stable, then either there is a bug in the code, or the
instability should be explained. Anomalous results (e.g., outliers) should
never be discarded. If they are due to programming or benchmarking errors, the
problem should be fixed and the benchmarks rerun to gather new and more stable
results.
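The confidence-interval guidelines above can be sketched in Python as follows.
The function names are our own. The table holds the standard two-sided 95%
critical values of the Student's t-distribution for 1 to 30 degrees of
freedom; beyond that, this sketch falls back to the normal value 1.96, which
is a simplifying assumption and slightly narrows the interval.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for df = 1..30.
T_95 = [
    12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
    2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
    2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
]


def confidence_interval_95(samples):
    """Return (mean, half_width) of the 95% confidence interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two runs are required")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    df = n - 1
    # Assumption: use the normal approximation once df exceeds the table.
    t = T_95[df - 1] if df <= len(T_95) else 1.96
    return mean, t * stdev / math.sqrt(n)


def is_stable(samples, limit=0.05):
    """True if the half-width is below `limit` (default 5%) of the mean."""
    mean, half_width = confidence_interval_95(samples)
    return half_width < limit * abs(mean)


# Made-up elapsed times in seconds, purely for illustration.
runs = [10.0, 10.2, 9.8, 10.1, 9.9]
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f}s +/- {half_width:.2f}s (95% CI), stable={is_stable(runs)}")
```

A harness can call `is_stable` after each additional run and stop once the
criterion is met, or flag the configuration for investigation if it never is.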
5. Validating Results
Other researchers may wish to benchmark your software for two main reasons:
- to reproduce your results or confirm them, or
- to compare their system to yours.
First, it is considered good scientific practice to provide enough information
for others to validate your results. This includes detailed hardware and
software specifications about the testbeds. Although it is usually not
practical to include such large amounts of information in a conference paper, it
can be published in an online appendix. While it can be difficult for a
researcher to accurately validate another's results without the exact testbed,
it is still possible to see if the results generally correlate.
Second, there may be a case where a researcher creates a system that has similar
properties to yours (e.g., they are both encryption file systems). It would be
logical for the researcher to compare the two systems. However, if your paper
showed an X% overhead over ext2, and the new file system has a Y% overhead
over ext2, no claim can be made about which of the two file systems is better
because the benchmarking environment is different. The researcher should
benchmark both research file systems using a setup that is as similar as
possible to that of the original benchmark. This way both file systems are
tested under the same conditions. Moreover, since they are running the
benchmark in the same way that you did, no claim can be made that they chose a
specific case in which their file system might perform better.
To help solve these two issues:
- Enough information should be made available about your testbed (both
hardware and any relevant software) so that an outside researcher can validate
your results.
- If possible, make your software available to other researchers so that they
can compare their system to yours. Releasing the source is preferred, but a
binary release can also be helpful if there are legal issues preventing the
release of source code. In addition, any benchmarks that were written and any
traces that were collected should be made available to others.
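Part of the testbed description can be captured automatically alongside the
results. The sketch below records only what Python's standard library exposes;
the field names are our own choice, and details such as disk model, partition
location, file system type, and mount options would still have to be added by
other means.

```python
import json
import os
import platform
import sys


def describe_testbed():
    """Collect basic hardware and software facts about the current machine."""
    return {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "python": sys.version.split()[0],
    }


info = describe_testbed()
print(json.dumps(info, indent=2, sort_keys=True))
```

Storing this output next to each set of results makes it easier to publish an
accurate online appendix later.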
References
[1]
Ellard, D. and Seltzer, M. 2003. NFS Tricks and Benchmarking Traps. In
Proceedings of the Annual USENIX Technical Conference. USENIX
Association, San Antonio, TX, 101–114.
[2]
Van Meter, R. 1997. Observing the Effects of Multi-Zone
Disks. In Proceedings of the Annual USENIX Technical
Conference. USENIX Association, Anaheim, CA, 19–30.
[3]
Smith, K. A. and Seltzer, M. I. 1997. File System Aging — Increasing the
Relevance of File System Benchmarks. In Proceedings of the 1997 ACM
SIGMETRICS International Conference on Measurement and Modeling of Computer
Systems. ACM SIGOPS, Seattle, WA, 203–213.
[4]
Wright, C.P., Joukov, N., Kulkarni, D., Miretskiy, Y., and Zadok,
E. 2005. Auto-pilot: A Platform for System Software
Benchmarking. In Proceedings of the Annual USENIX Technical Conference,
FREENIX Track. USENIX Association, Anaheim, CA, 175–187.