

9. Stacking Implementation Details

This appendix describes some of the more difficult parts of the Wrapfs implementation and the unexpected problems we encountered. Further details are beyond the scope of this paper and are available elsewhere [32].

9.1 Reading and Writing

By design, we read and write whole blocks whose size matches the native page size. Whenever a read of a range of bytes is requested, we extend the range outward to page boundaries and apply the operation to the lower file system using the extended range. Upon successful completion, exactly the number of bytes requested is returned to the caller of the vnode operation.
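The read path described above can be modeled at user level as follows. This is only an illustrative sketch, not the Wrapfs kernel code: the 8192-byte page size, the XOR-based `decode_page`, and the names `extended_range` and `stacked_read` are all assumptions introduced for the example, and the lower file is modeled as a plain `bytes` object.

```python
PAGE_SIZE = 8192  # assumed page size for this sketch


def extended_range(offset, length, page_size=PAGE_SIZE):
    """Return (start, end) of the page-aligned range covering
    the bytes [offset, offset + length)."""
    start = (offset // page_size) * page_size
    end = -(-(offset + length) // page_size) * page_size  # round up
    return start, end


def decode_page(encoded):
    """Hypothetical stand-in for the real decoding function."""
    return bytes(b ^ 0x5A for b in encoded)


def stacked_read(lower, offset, length, page_size=PAGE_SIZE):
    """Read `length` decoded bytes at `offset` from the encoded
    contents `lower`, operating on whole pages only."""
    start, end = extended_range(offset, length, page_size)
    encoded = lower[start:end]  # whole pages from the lower level
    decoded = b"".join(
        decode_page(encoded[i:i + page_size])
        for i in range(0, len(encoded), page_size)
    )
    skip = offset - start  # bytes preceding the request in the first page
    return decoded[skip:skip + length]  # only what the caller asked for
```

For instance, `extended_range(9000, 16001)` yields `(8192, 32768)`: the request is widened to three whole pages, yet `stacked_read` still hands back only the 16001 bytes requested.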

Figure 2: Writing Bytes in Wrapfs

Writing a range of bytes is more complicated than reading. Within one page, bytes may depend on previous bytes (e.g., under encryption), so we must read and decode parts of pages before writing other parts of them. The example depicted in Figure 2 shows what happens when a process asks to write bytes 9000 through 25000 of an existing file. Assume that the file contains a total of 4 pages (32768 bytes, i.e., 8192 bytes per page), numbered from 0. First, we compute the extended range of bytes, which covers pages 1-3, read those three pages in, and decode them. Then we write the new bytes passed from the user into these three in-memory pages at the proper offsets. Next, we encode the three pages. Finally, we write out to the lower-level file system only those bytes that could have changed.
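The four steps of this example can be traced in a small user-level model. The sketch below is an illustration under stated assumptions, not the Wrapfs implementation: the chained XOR cipher, the function names, and the treatment of the byte range as inclusive (16001 bytes) are all hypothetical. The toy cipher makes each encoded byte depend on the previous one within a page, so in this model the bytes that could have changed run from the write offset to the end of the extended range.

```python
PAGE_SIZE = 8192  # 4 pages * 8192 = 32768 bytes, as in the example


def encode_page(plain):
    """Toy chained cipher (assumption): each encoded byte depends on
    the previous encoded byte within the same page."""
    out, prev = bytearray(), 0
    for b in plain:
        prev = b ^ prev ^ 0x5A
        out.append(prev)
    return bytes(out)


def decode_page(encoded):
    """Inverse of encode_page."""
    out, prev = bytearray(), 0
    for c in encoded:
        out.append(c ^ prev ^ 0x5A)
        prev = c
    return bytes(out)


def stacked_write(lower, offset, data):
    """Write `data` at `offset` into `lower`, a bytearray holding the
    encoded contents of an existing file. Returns the byte range
    (start, end) written to the lower level."""
    start = (offset // PAGE_SIZE) * PAGE_SIZE
    end = -(-(offset + len(data)) // PAGE_SIZE) * PAGE_SIZE
    assert end <= len(lower), "sketch only handles writes inside the file"

    # 1. Read the extended range and decode it page by page.
    pages = bytearray()
    for i in range(start, end, PAGE_SIZE):
        pages += decode_page(bytes(lower[i:i + PAGE_SIZE]))

    # 2. Copy the caller's bytes into the in-memory pages.
    rel = offset - start
    pages[rel:rel + len(data)] = data

    # 3. Re-encode the pages.
    encoded = bytearray()
    for i in range(0, len(pages), PAGE_SIZE):
        encoded += encode_page(bytes(pages[i:i + PAGE_SIZE]))

    # 4. Write back only the bytes that could have changed.
    lower[offset:end] = encoded[rel:]
    return offset, end
```

Calling `stacked_write(lower, 9000, data)` with 16001 bytes of data on a 32768-byte file touches pages 1-3 and leaves the encoded bytes before offset 9000 untouched.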

9.2 File Names and Directory Reading

The readdir vnode operation is implemented in the kernel as a restartable function. A user process calls the readdir C library function, which is translated into repeated calls to the getdents(2) system call, each passing a buffer of a given size. The kernel fills the buffer with as many bytes representing directory entries as will fit, but no more. If the kernel has more bytes to offer the process (i.e., the directory has not been completely read), it sets a special EOF flag to false. As long as the C library function sees that the flag is false, it must call getdents(2) again. Each call reads more bytes, starting at the offset in the open directory where the previous read left off.

The important issue with directory reading is how to continue reading the directory from exactly the offset where the previous read left off. This is accomplished by recording the last position and ensuring that it is returned to us upon the next invocation. We implemented readdir by reading a number of bytes from the lower-level directory, breaking these bytes into individual records, each representing one directory entry (struct dirent), calling decode_filename on each name, and then composing a new block of dirent structures containing the decoded names. This new block is returned to the caller. If there is more data to read, the EOF flag is set to false before the function returns, and the last read offset is recorded.
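The restartable structure can be illustrated with a simplified user-level model. Everything here is an assumption made for the sketch: the lower directory is a Python list of encoded names, the offset counts entries rather than bytes, the buffer limit is expressed as a number of entries, and `decode_filename` simply reverses the name. Only the control flow (read a chunk, decode each name, return the new block with the offset and EOF flag) follows the description above.

```python
def decode_filename(name):
    """Hypothetical decoder: in this sketch, names are stored reversed."""
    return name[::-1]


def lower_readdir(entries, offset, max_entries):
    """Stand-in for the lower file system's readdir: return up to
    max_entries records starting at offset, the new offset, and
    whether the end of the directory was reached."""
    chunk = entries[offset:offset + max_entries]
    new_offset = offset + len(chunk)
    return chunk, new_offset, new_offset >= len(entries)


def wrapfs_readdir(entries, offset, max_entries):
    """One restartable step: read from the lower level, decode each
    name, and return the decoded block plus the recorded offset."""
    chunk, new_offset, eof = lower_readdir(entries, offset, max_entries)
    decoded = [decode_filename(name) for name in chunk]
    return decoded, new_offset, eof


def list_directory(entries, max_entries=2):
    """Model of the C library loop: keep calling until EOF is true,
    passing back the offset recorded by the previous call."""
    assert max_entries > 0, "a zero-sized buffer would never make progress"
    names, offset, eof = [], 0, False
    while not eof:
        block, offset, eof = wrapfs_readdir(entries, offset, max_entries)
        names.extend(block)
    return names
```

With a limit of two entries per call, a five-entry directory takes three calls to read, and each call resumes exactly where the previous one stopped.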

9.3 Memory Mapping

To support mmap operations and to execute binaries, we had to implement the memory-mapping vnode functions. As discussed in Section 2.2.1, Wrapfs maintains its own cached (decoded) pages, while the lower file system keeps cached encoded pages.

When a page fault occurs, the kernel calls the vnode operation getpage. This function retrieves one or more pages from a file. For simplicity, we implemented it as repeated calls to getapage, a function that retrieves a single page. The implementation of getapage appeared simple. We first look for the page in the cache and return it if found. Otherwise, we allocate a new page, call the getpage routine of the lower-level file system, and decode the bytes of the page just read into the new page. The new page now contains decoded bytes. It is added to the page cache, and Wrapfs returns it to the caller.
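The control flow of getapage can be sketched as a user-level model. This is not kernel code: the classes `LowerFS` and `WrapfsPages`, the dictionary used as a page cache, and the XOR `decode_page` are hypothetical stand-ins, and the page locking and MMU bit handling that a real implementation needs are omitted entirely.

```python
def decode_page(encoded):
    """Hypothetical stand-in for the real decoding function."""
    return bytes(b ^ 0x5A for b in encoded)


class LowerFS:
    """Stand-in for the lower file system, which holds encoded pages."""

    def __init__(self, encoded_pages):
        self.pages = encoded_pages
        self.calls = 0  # counts lower-level getpage calls, for illustration

    def getpage(self, index):
        self.calls += 1
        return self.pages[index]


class WrapfsPages:
    """Keeps a cache of decoded pages on top of a LowerFS."""

    def __init__(self, lower):
        self.lower = lower
        self.cache = {}  # page index -> decoded page

    def getapage(self, index):
        """Retrieve a single decoded page."""
        page = self.cache.get(index)
        if page is not None:
            return page  # cache hit: no call to the lower level
        encoded = self.lower.getpage(index)
        page = bytearray(decode_page(encoded))  # new page, decoded bytes
        self.cache[index] = page
        return page

    def getpage(self, first, count):
        """Retrieve several pages by repeatedly calling getapage."""
        return [self.getapage(i) for i in range(first, first + count)]
```

In this model, a second request for a page that is already cached is served from the decoded-page cache and never reaches the lower file system.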

The implementation of putpage was similar to that of getpage. In practice, we also had to handle two additional details carefully to avoid deadlocks and data corruption. First, pages carry several types of locks, and these locks must be acquired and released in the right order and at the right time. Second, the MMU keeps mode bits indicating the status of pages in hardware, especially the referenced and modified bits. We have to update and synchronize the hardware version of these bits with the software version kept in the pages' flags. Requiring a file system to know and handle all of these low-level details blurs the distinction between the file system and the VM system, and further complicates porting.

Erez Zadok