File System Reliability Test Suite

There are no good tools for exhaustively testing a file system for reliability, freedom from bugs, scalability, etc. File system developers often rely on home-brewed scripts that run various operations such as creating and deleting files. Others compile large source trees as a way of checking that the file system works well. Some use tools such as PostMark, PGMeter, or AIM7, which were designed to measure performance, not conformance to specifications. The best-known tool for testing file systems is FSX, which thoroughly tests only a small subset of file system operations: a mix of read/write, memory-mapped, and file truncation operations, all inside a single file. None of the existing solutions covers all file system operations.

In this project we are interested in producing a suite of tools that verifies that every file system operation works correctly; that every combination of file system operations works as expected; that there are no race conditions, deadlocks, or obvious crashes in the file system code; and that there is no way to corrupt the file system's data or crash the kernel through any combination of file system operations, however obscure. Of particular interest to us is exploring boundary conditions that are often missed. For example, when disks begin to fail, the file system may not gracefully handle the receipt of many EIO errors; when memory runs out, an ENOMEM error can confuse file system code; creating a single file larger than 2GB that fills an entire file system may not be handled correctly; and directory data (struct dirent) may somehow become corrupt.

To give you an idea of the techniques we are developing, consider suspect code that does not properly lock the directory inode before modifying the directory. Perhaps the lock is misplaced and a minor race could be triggered under the right conditions. Our suite will pick a set of machines with different characteristics, especially SMP machines; on these machines, our tool will run concurrent threads such that one thread tries to create a named file at a given rate (often as quickly as possible), and another thread tries to delete the same named file at a given rate. We will try this on SMP and UMP machines, with different disks, different amounts of memory, different CPU speeds, etc. The reason is that races are often won or lost due to specific timing conditions. The creation thread should always get either a success code or EEXIST (the file already exists); the deletion thread should always get either a success code or ENOENT (the file does not exist). If either thread gets a different error, we have very likely discovered a bug that needs to be investigated further.
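The create/delete race described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual tool: the function name run_race, the duration parameter, and the file name "victim" are hypothetical, a POSIX system is assumed, and the file is created with O_CREAT | O_EXCL so that EEXIST is a possible outcome. The sketch runs both threads as quickly as possible rather than at a configurable rate.

```python
import errno
import os
import tempfile
import threading
import time


def run_race(path, duration=1.0):
    """Race a creator thread against a deleter thread on `path`.

    Returns (unexpected, counts): `unexpected` lists (operation, errno)
    pairs other than the permitted EEXIST/ENOENT; `counts` holds the
    number of successful creates and unlinks.
    (Hypothetical helper, written for illustration only.)
    """
    unexpected = []
    counts = {"create": 0, "unlink": 0}
    deadline = time.monotonic() + duration

    def creator():
        while time.monotonic() < deadline:
            try:
                # O_EXCL makes creation fail with EEXIST if the file exists.
                fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
                os.close(fd)
                counts["create"] += 1
            except OSError as e:
                if e.errno != errno.EEXIST:
                    unexpected.append(("create", e.errno))

    def deleter():
        while time.monotonic() < deadline:
            try:
                os.unlink(path)
                counts["unlink"] += 1
            except OSError as e:
                if e.errno != errno.ENOENT:
                    unexpected.append(("unlink", e.errno))

    threads = [threading.Thread(target=creator),
               threading.Thread(target=deleter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return unexpected, counts


if __name__ == "__main__":
    # Run the race in a scratch directory on the file system under test.
    with tempfile.TemporaryDirectory() as d:
        bad, counts = run_race(os.path.join(d, "victim"))
        print("unexpected errors:", bad, "successes:", counts)
```

An empty list of unexpected errors is the passing result; any other errno is a candidate bug to investigate. Because races depend on timing, a single clean run proves little, so a run like this would be repeated across machines with different CPU counts, disks, and memory sizes, as described above.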

Conference and Workshop Papers:

# | Title | Formats | Published In | Date
1 | Auto-pilot: A Platform for System Software Benchmarking | PS, PDF, BibTeX | Usenix Technical Conference, FREENIX Track | Apr 2005
2 | High-Confidence Operating Systems | PS, PDF, BibTeX | Tenth ACM SIGOPS European Workshop | Sep 2002

Past Students:

# | Name | Program | Period | Current Location
1 | Charles P. Wright | PhD | May 2003 - May 2006 | Partner, Senior Software Architect, Illumon (New York, NY)
2 | Naveen Gupta | MS | Sep 2004 - Dec 2005 | Member of the Technical Staff, Systems Software group, Google (Mountain View, CA)
3 | Kiran-Kumar Muniswamy-Reddy | MS | Jan 2002 - May 2004 | Consulting Member of Technical Staff, Oracle Corp (Seattle, WA)
4 | Sunil Satnur | MS | Sep 2004 - Dec 2005 | Staff Engineer, Storage and Availability Group, VMware Inc. (Palo Alto, CA)

Sponsors:

# | Sponsor | Amount | Period | Type | Title
1 | NSF HECURA | $760,253 | 2006-2009 | Lead-PI | File System Tracing, Replaying, Profiling, and Analysis on HEC Systems