

3. Implementation

This section details the more difficult parts of the implementation and unexpected problems we encountered. Our first implementation concentrated on the Solaris 2.5.1 operating system because Solaris has a standard vnode interface and we had access to kernel sources. Our next two implementations were for the Linux 2.0 and the FreeBSD 3.0 operating systems. We chose these two because they are popular, are sufficiently different, and they also come with kernel sources. In addition, all three platforms support loadable kernel modules, which made debugging easier. Together, the platforms we chose cover a large portion of the Unix market.

The discussion in the rest of this section concentrates mostly on Solaris, unless otherwise noted. In Section 3.5 we discuss the differences in implementation between Linux and Solaris. Section 3.6 discusses the differences for the FreeBSD port.

3.1 Stacking

Wrapfs was initially similar to the Solaris loopback file system (lofs) [19]. Lofs passes all Vnode/VFS operations to the lower layer, but it stacks only on directory vnodes. Wrapfs stacks on every vnode and makes identical copies of data blocks, pages, and file names in its own layer, so they can be changed independently of the lower-level file system. Wrapfs does not explicitly manipulate objects in other layers. It appears to the upper VFS as a lower-level file system; concurrently, Wrapfs appears to lower-level file systems as an upper layer. This allows us to stack multiple instances of Wrapfs on top of each other.

The key point that enables stacking is that each of the major data structures used in the file system (struct vnode and struct vfs) contains a field into which we can store file-system-specific data. Wrapfs uses that private field to store several pieces of information, most importantly a pointer to the corresponding lower-level file system's vnode and VFS. When a vnode operation in Wrapfs is called, it finds the lower level's vnode from the current vnode and repeats the same operation on the lower-level vnode.
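This pattern can be illustrated with a minimal userspace sketch. The structures below are simplified stand-ins for the kernel's (the real Solaris struct vnode keeps private data in its v_data field), and lower_getattr is a hypothetical lower-level operation, not an actual VOP:

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct vnode. */
struct vnode {
    void *v_data;               /* file-system-specific private data */
};

/* Wrapfs's private per-vnode data: a pointer to the wrapped vnode. */
struct wrapfs_node {
    struct vnode *lower_vnode;  /* V': the lower file system's vnode */
};

/* Follow the private field from a Wrapfs vnode V to the lower vnode V'. */
static struct vnode *wrapfs_lower(struct vnode *vp)
{
    return ((struct wrapfs_node *)vp->v_data)->lower_vnode;
}

/* Mock lower-level operation, standing in for a real VOP. */
static int lower_getattr_calls;
static int lower_getattr(struct vnode *vp)
{
    (void)vp;
    lower_getattr_calls++;
    return 0;
}

/* Template for a Wrapfs vnode operation: find V' and repeat the call. */
static int wrapfs_getattr(struct vnode *vp)
{
    return lower_getattr(wrapfs_lower(vp));
}
```

Because Wrapfs looks like an ordinary file system from both above and below, another Wrapfs instance's wrapfs_node could point at this one's vnode, giving the multi-layer stacking described above.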

3.2 Paged Reading and Writing

We perform reading and writing on whole blocks whose size matches the native page size. Whenever a read for a range of bytes is requested, we compute the extended range of bytes, aligned outward to page boundaries, and apply the operation to the lower file system using the extended range. Upon successful completion, only the exact bytes requested are returned to the caller of the vnode operation.
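The range extension can be sketched as follows, assuming the 8192-byte page size used in the example below (the function name is ours, not from Wrapfs):

```c
#include <stdint.h>

#define WRAPFS_PAGE_SIZE 8192  /* native page size assumed in this sketch */

/* Round a byte range [offset, offset+len) outward to page boundaries.
 * The lower-level operation is issued for the extended range; only the
 * originally requested bytes are returned to the caller. */
static void wrapfs_extend_range(uint64_t offset, uint64_t len,
                                uint64_t *ext_start, uint64_t *ext_end)
{
    /* Round the start down to the enclosing page boundary. */
    *ext_start = offset & ~(uint64_t)(WRAPFS_PAGE_SIZE - 1);
    /* Round the end up to the next page boundary (inclusive last byte). */
    *ext_end = ((offset + len + WRAPFS_PAGE_SIZE - 1)
                & ~(uint64_t)(WRAPFS_PAGE_SIZE - 1)) - 1;
}
```

For example, a request covering bytes 9000-25000 extends to the page-aligned range 8192-32767.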

Writing a range of bytes is more complicated than reading. Within one page, bytes may depend on previous bytes (e.g., encryption), so we have to read and decode parts of pages before writing other parts of them.

Throughout the rest of this section we will refer to the upper (wrapping) vnode as V, and to the lower (wrapped) vnode as V'; P and P' refer to memory-mapped pages at these two levels, respectively. The example depicted in Figure 3 shows what happens when a process asks to write bytes of an existing file from byte 9000 until byte 25000. Let us assume that the file in question has a total of 4 pages (32768 bytes) worth of data in it.

Figure 3: Writing Bytes in Wrapfs

1. Compute the extended page boundary for the write range as 8192-32767 and allocate three empty pages. (Page 0 of V is untouched.)

2. Read bytes 8192-8999 (page 1) from V', decode them, and place them in the first allocated page. We do not need to read or decode the bytes from 9000 onwards in page 1, because they will be overwritten by the data we wish to write anyway.

3. Skip intermediate data pages that will be completely overwritten by the overall write operation (page 2).

4. Read bytes 24576-32767 (page 3) from V', decode them, and place them in the third allocated page. This time we read and decode the whole page because we need the last 32767-25000=7767 bytes, and these bytes depend on the first 8192-7767=425 bytes of that page.

5. Copy the bytes that were passed to us into the appropriate offsets in the three allocated pages.

6. Finally, encode the three data pages and call the write operation on V' for the same starting offset (9000). This time we write the bytes all the way to the last byte of the last page processed (byte 32767), to ensure the validity of the data past file offset 25000.
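The per-page decisions in these steps can be sketched as a small classification function. The names and the 8192-byte page size are our assumptions for illustration, not the actual Wrapfs code:

```c
#include <stdint.h>

#define PGSIZE 8192

enum page_action {
    PG_UNTOUCHED,   /* entirely outside the write range */
    PG_READ_HEAD,   /* read+decode: leading bytes precede the write start */
    PG_SKIP,        /* fully overwritten: no need to read it */
    PG_READ_WHOLE   /* read+decode whole page: preserved tail depends on head */
};

/* Decide how to prepare one page for a write covering the inclusive
 * byte range [wstart, wend].  Mirrors the steps above: the partial
 * first page needs its leading bytes, fully covered pages are skipped,
 * and a partial last page must be read and decoded in full because
 * its preserved tail bytes depend on the bytes before them. */
static enum page_action classify_page(uint64_t pgno,
                                      uint64_t wstart, uint64_t wend)
{
    uint64_t pstart = pgno * PGSIZE;
    uint64_t pend = pstart + PGSIZE - 1;

    if (pend < wstart || pstart > wend)
        return PG_UNTOUCHED;
    if (pstart >= wstart && pend <= wend)
        return PG_SKIP;            /* fully overwritten */
    if (pstart < wstart && pend > wend)
        return PG_READ_WHOLE;      /* write contained within one page */
    if (pstart < wstart)
        return PG_READ_HEAD;       /* partial first page */
    return PG_READ_WHOLE;          /* partial last page */
}
```

Applied to the Figure 3 example (write of bytes 9000-25000 in a four-page file), page 0 is untouched, page 1 needs only its head read and decoded, page 2 is skipped, and page 3 is read and decoded whole.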

3.2.1 Appending to Files

When a file is opened for appending only, the VFS does not provide the vnode write function with the file's actual size or the offset at which writing begins. If the size of the file before an append is not an exact multiple of the page size, data corruption may occur, because we will not begin a new encoding sequence on a page boundary.

We solve this problem by detecting when a file is opened with the append flag on, turning that flag off before the open operation is passed on to V', and replacing it with flags that indicate to V' that the file was opened for normal reading and writing. We save the initial flags of the opened file so that other operations on V can tell that the file was originally opened for appending. Whenever we write bytes to a file that was opened in append-only mode, we first find its size and add that to the file offsets of the write request. In essence, we convert append requests into regular write requests starting at the end of the file.
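A minimal sketch of this conversion follows; WFS_APPEND is a made-up constant standing in for the kernel's append flag, and the function names are ours:

```c
#include <stdint.h>

#define WFS_APPEND 0x08  /* stand-in for the kernel's append flag */

/* At open time: remember the original flags in Wrapfs's private data
 * and strip the append flag before passing the open down to V'. */
static int wrapfs_open_flags(int flags, int *saved_flags)
{
    *saved_flags = flags;        /* kept so later operations on V can check */
    return flags & ~WFS_APPEND;  /* V' sees a normal read/write open */
}

/* At write time: if the file was originally opened for appending,
 * convert the request into a regular write starting at end-of-file. */
static uint64_t wrapfs_write_offset(int saved_flags, uint64_t req_offset,
                                    uint64_t file_size)
{
    if (saved_flags & WFS_APPEND)
        return file_size;        /* append becomes a write at current EOF */
    return req_offset;
}
```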

3.3 File Names and Directory Reading

Readdir is implemented in the kernel as a restartable function. A user process calls the readdir C library call, which is translated into repeated calls to the getdents(2) system call, passing it a buffer of a given size. The kernel fills the buffer with as many directory entries as will fit in the caller's buffer. If the directory was not read completely, the kernel sets a special EOF flag to false. As long as the flag is false, the C library function calls getdents(2) again.

The important issue with respect to directory reading is how to continue reading the directory from the offset where the previous read finished. This is accomplished by recording the last position and ensuring that it is returned to us upon the next invocation. We implemented readdir as follows:

1. A readdir vnode operation is called on V for N bytes worth of directory data.

2. Call the same vnode operation on V' and read back N bytes.

3. Create a new temporary buffer as large as N bytes.

4. Loop over the bytes read from V', breaking them into individual records, each representing one directory entry (struct dirent). For each such entry, we call decode_filename to recover the original file name, construct a new directory entry record containing the decoded file name, and add the new record to the temporary buffer.

5. Record the offset from which to read on the next call to readdir; this is the position past the last file name we just read and decoded. This offset is stored in one of the fields of the struct uio (representing data movement between user and kernel space) that is returned to the caller. The same structure is passed to us upon the next invocation of readdir with the offset field untouched, which is how we are able to restart the call from the correct offset.

6. Return the temporary buffer to the caller of the vnode operation. If there is more data to read from V', set the EOF flag to false before returning.

The caller of readdir asks to read at most N bytes. When we decode or encode file names, the result can be a longer or shorter file name. We ensure that we fill in the user buffer with no more struct dirent entries than could fit (but fewer is acceptable). Regardless of how many directory entries were read and processed, we set the file offset of the directory being read such that the next invocation of the readdir vnode operation will resume reading file names from exactly where it left off the last time.
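The decode-and-fill loop can be sketched in userspace as follows. Here struct sdirent is a simplified stand-in for the real struct dirent (which carries the name out-of-band), and the prefix-stripping decode_filename is a hypothetical decoder, not Wrapfs's actual transformation:

```c
#include <string.h>
#include <stddef.h>

/* Simplified directory entry for this sketch. */
struct sdirent {
    long d_off;          /* directory offset of the next entry */
    char d_name[32];
};

/* Hypothetical decoder: strips a one-character prefix, standing in
 * for Wrapfs's decode_filename. */
static void decode_filename(const char *enc, char *dec, size_t n)
{
    strncpy(dec, enc + 1, n - 1);
    dec[n - 1] = '\0';
}

/* Copy as many decoded entries as fit into a buffer of max_bytes.
 * Returns the number copied and stores in *resume_off the directory
 * offset from which the next readdir call must continue. */
static int wrapfs_filldir(const struct sdirent *in, int nin,
                          struct sdirent *out, size_t max_bytes,
                          long *resume_off)
{
    int nout = 0;
    size_t used = 0;
    for (int i = 0; i < nin; i++) {
        if (used + sizeof(struct sdirent) > max_bytes)
            break;                 /* no more room: stop, never truncate */
        out[nout].d_off = in[i].d_off;
        decode_filename(in[i].d_name, out[nout].d_name,
                        sizeof(out[nout].d_name));
        *resume_off = in[i].d_off; /* restart point after this entry */
        used += sizeof(struct sdirent);
        nout++;
    }
    return nout;
}
```

The recorded resume offset plays the role of the uio offset field described above: the next invocation continues from exactly the entry after the last one returned.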

3.4 Memory Mapping

To support MMAP operations and execute binaries we implemented memory-mapping vnode functions. As per Section 2.3, Wrapfs maintains its own cached decoded pages, while the lower file system keeps cached encoded pages.

When a page fault occurs, the kernel calls the vnode operation getpage. This function retrieves one or more pages from a file. For simplicity, we implemented it as repeatedly calling a function that retrieves a single page--getapage. We implemented getapage as follows:

1. Check if the page is cached; if so, return it.

2. If the page is not cached, create a new page P.

3. Find V' from V and call the getpage operation on V', making sure it returns only one page, P'.

4. Copy the (encoded) data from P' to P.

5. Map P into kernel virtual memory and decode the bytes by calling wrapfs_decode.

6. Unmap P from kernel VM, insert it into V's cache, and return it.
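The steps above can be mocked in userspace. The XOR-based wrapfs_decode and the flat page arrays below are stand-ins for the real decoder and for the two levels' page caches; the kernel mapping/unmapping steps have no userspace analogue and are omitted:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define PGSZ 8        /* tiny page size, just for this sketch */
#define NPAGES 4

/* Hypothetical decoder standing in for wrapfs_decode (e.g., decryption). */
static void wrapfs_decode(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] ^= 0x5a;
}

/* V's cache of decoded pages, and V''s encoded pages. */
static uint8_t upper_cache[NPAGES][PGSZ];
static int upper_valid[NPAGES];
static uint8_t lower_pages[NPAGES][PGSZ];
static int lower_getpage_calls;

/* getapage: return the cached page if present; otherwise fetch the
 * encoded page P' from V', copy it into a new page P, decode it, and
 * insert it into V's cache. */
static const uint8_t *wrapfs_getapage(int pgno)
{
    if (upper_valid[pgno])
        return upper_cache[pgno];              /* cache hit */
    lower_getpage_calls++;                     /* "getpage on V'" */
    memcpy(upper_cache[pgno], lower_pages[pgno], PGSZ);
    wrapfs_decode(upper_cache[pgno], PGSZ);
    upper_valid[pgno] = 1;                     /* insert into V's cache */
    return upper_cache[pgno];
}
```

Note how the two caches hold different representations of the same page, as described in Section 3.4: encoded below, decoded above.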

The implementation of putpage was similar to getpage. In practice, we also had to handle two additional details carefully to avoid deadlocks and data corruption. First, pages contain several types of locks, and these locks must be held and released in the right order and at the right time. Second, the MMU keeps mode bits indicating the status of pages in hardware, especially the referenced and modified bits; we had to update and synchronize the hardware version of these bits with the software version kept in the pages' flags. That a file system must know and handle all of these low-level details blurs the distinction between the file system and the VM system.

3.5 Linux

When we began the Solaris work, we referred to the implementations of other file systems such as lofs. Linux 2.0 did not include one in its standard distribution, but we were able to locate and use a prototype. The Linux Vnode/VFS interface also contains a different set of functions and data structures than Solaris's, but it operates similarly.

In Linux, much of the common file system code was extracted and moved to a generic (higher) level. Many generic file system functions exist that are used by default if the file system does not define its own version. This leaves the file system developer to deal with only the core issues of the file system. For example, Solaris User I/O (uio) structures contain various fields that must be updated carefully and consistently. Linux simplifies data movement by passing I/O-related vnode functions a simple allocated (char *) buffer and an integer describing how many bytes to process in that buffer.

Memory-mapped operations are also easier in Linux. The vnode interface in Solaris includes functions that must be able to manipulate one or more pages. In Linux, a file system handles one page at a time, leaving page clustering and multiple-page operations to the higher VFS.

Directory reading was simpler in Linux. In Solaris, we read a number of raw bytes from the lower-level file system and parse them into chunks of sizeof(struct dirent), setting the proper fields in each structure and appending the file name bytes to the end of the structure (out of band). In Linux, we provide the kernel with a callback function for iterating over directory entries. This function is called by higher-level code and asks us simply to process one file name at a time.
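The callback style can be illustrated with a small sketch. Here filldir_t is a simplified stand-in for the kernel's callback type, and decode_filename is the same hypothetical prefix-stripping decoder used earlier, not the real transformation:

```c
#include <string.h>
#include <stddef.h>

/* Stand-in for the Linux-style per-entry callback: higher-level code
 * supplies it and the file system invokes it once per name. */
typedef int (*filldir_t)(void *ctx, const char *name, size_t len);

/* Hypothetical decoder: strips a one-character prefix. */
static void decode_filename(const char *enc, char *dec, size_t n)
{
    strncpy(dec, enc + 1, n - 1);
    dec[n - 1] = '\0';
}

/* Wrapfs's Linux-style readdir: decode each lower (encoded) name and
 * hand it to the caller's callback one at a time; no raw-byte parsing
 * or uio bookkeeping is needed. */
static int wrapfs_readdir(const char **lower_names, int n,
                          filldir_t fill, void *ctx)
{
    char dec[32];
    for (int i = 0; i < n; i++) {
        decode_filename(lower_names[i], dec, sizeof(dec));
        int err = fill(ctx, dec, strlen(dec));
        if (err)
            return err;       /* caller's buffer is full: stop here */
    }
    return 0;
}

/* Example callback that collects names into a flat buffer. */
struct collect { char buf[128]; size_t used; };
static int collect_cb(void *ctx, const char *name, size_t len)
{
    struct collect *c = ctx;
    memcpy(c->buf + c->used, name, len);
    c->buf[c->used + len] = ' ';
    c->used += len + 1;
    return 0;
}
```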

There were only two caveats to the portability of the Linux code. First, Linux keeps a list of exported kernel symbols (in kernel/ksyms.c) available to loadable modules. To make Wrapfs a loadable module, we had to export additional symbols to the rest of the kernel, for functions mostly related to memory mapping. Second, most of the structures used in the file system (inode, super_block, and file) include a private field into which stacking-specific data can be placed. We had to add a private field to only one structure that was missing it, the vm_area_struct, which represents custom per-process virtual memory manager page-fault handlers. Since Wrapfs is the first fully stackable file system for Linux, we feel that these changes are small and acceptable, given that more stackable file systems are likely to be developed.

3.6 FreeBSD

FreeBSD 3.0 is based on BSD-4.4Lite. We chose it as the third port because it represents another major branch of Unix operating systems. FreeBSD's vnode interface is similar to Solaris's, and the port was straightforward. FreeBSD's version of the loopback file system, nullfs [12], is a template for writing stackable file systems. Unfortunately, ever since the merging of the VM system and the buffer cache in FreeBSD 3.0, stackable file systems have not worked, because the VFS cannot correctly map the data pages of stackable file systems to their on-disk locations. We worked around two deficiencies in nullfs. First, writing large files resulted in some data pages being zero-filled on disk; this forced us to perform all writes synchronously. Second, memory mapping through nullfs panicked the kernel, so we implemented the MMAP functions ourselves. We implemented getpages and putpages using read and write, respectively, because calling the lower level's page functions resulted in a UFS pager error.
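The getpages-via-read workaround can be sketched as follows. The names and the page size are ours, and lower_read stands in for the lower layer's ordinary read operation:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define BPAGE 4096

/* Stand-in for the lower file system's read operation (a VOP_READ
 * equivalent); `backing` holds the file's bytes for this sketch. */
static const uint8_t *backing;
static size_t backing_len;
static size_t lower_read(uint64_t off, uint8_t *buf, size_t n)
{
    if (off >= backing_len)
        return 0;
    if (n > backing_len - off)
        n = backing_len - off;
    memcpy(buf, backing + off, n);
    return n;
}

/* The workaround: satisfy a getpages request with an ordinary read
 * instead of calling the lower layer's page functions, zero-filling
 * the tail of a page that is only partially backed by the file. */
static void wrapfs_getpages(uint64_t pgno, uint8_t page[BPAGE])
{
    size_t got = lower_read(pgno * BPAGE, page, BPAGE);
    memset(page + got, 0, BPAGE - got);
}
```

putpages can be handled symmetrically by encoding the page and issuing a (synchronous) write for its file range.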

Erez Zadok