Next: 4 File Operations Up: Fast Indexing: Support for Previous: 2 Background

3 The Index File

In a stacking environment that supports SCAs, data offsets may change arbitrarily. An efficient mapping is needed that can tell where the starting offset of the encoded data is for a given offset in the original file. We call this mapping the index table.

The index table is stored in a separate file called the index file, as shown in Figure 2. This file serves as the fast meta-data index into an encoded file. For a given data file $F$ , we create an index file called $F$ .idx. Our system encodes and decodes whole pages or their multiples--which maps well to file system operations. The index table assumes page-based operations and stores offsets of encoded pages as they appear in the encoded file.

**Figure 2:** Overall structure of size-changing stackable file systems. Each original data file is encoded into a lower data file. Additional meta-data index information is stored in an index file. Both the index file and the encoded data files reside in the lower level file system.
$\begin{figure}\begin{centering} \epsfig{file=figures/file-struct.eps, width=2.75in} \end{centering}\end{figure}$

Consider an example of a file in a compression file system. Figure 3 shows the mapping of offsets between the upper (original) file and the lower (encoded) data file. To find out the bytes in page 2 of the original file, we read the data bytes 3000-7200 in the encoded data file, decode them, and return to the VFS that data in page 2.

**Figure 3:** An example of a 32-bit file system that shrinks data size (compression). Each upper page is represented by an encoded lower ``chunk.'' The mapping of offsets is shown in Table 1.
$\begin{figure}\begin{centering} \epsfig{file=figures/gzipfs-reg.eps, width=4.00in} \end{centering}\end{figure}$

To find out which encoded bytes we need to read from the lower file, we consult the index file, shown in Table 1. The index file tells us that the original file has 6 pages, that its original size is 21500 bytes, and then it lists the ending offsets of the encoded data for an upper page. Finding the lower offsets for the upper page 2 is a simple linear dereferencing of the data in the index file; we do not have to search the index file linearly. Note that our design of the index file supports both 32-bit and 64-bit file systems, but the examples we provide here are for 32-bit file systems.

Table 1: Format of the index file for Figures 3 and 4. Fast Tails are described in Section 4.2. The first word encodes both flags and the number of pages in the index file. The ``ls'' (large size) flag is the first bit in the index file and indicates if the index file encodes a 32-bit (0) or 64-bit (1) file system.

Word	Representing	Regular	With Fast
(32/64 bits)		IDX File	Tail (ft)
1 (12 bits)	flags	ls=0,ft=0,...	ls=0,ft=1,...
1 (20/52 bits)	# pages	6	5
2	orig. file size	21500	21500
3	page 0	1100	1100
4	page 1	3000	3000
5	page 2	7200	7200
6	page 3	7700	7700
7	page 4	10000	10000
8	page 5	10120

The index file starts with a word that encodes flags and the number of pages in the corresponding original data file. We reserve the lower 12 bits for special flags such as whether the index file encodes a file in a 32-bit or a 64-bit file system, whether fast tails were encoded in this file (see Section 4.2), etc. The very first bit of these flags, and therefore the first bit in the index file, determines if the file encoded is part of a 32-bit or a 64-bit file system. This way, just by reading the first bit we can determine how to interpret the rest of the index file: 4 bytes to encode page offsets on 32-bit file systems or 8 bytes to encode page offsets on 64-bit file systems.

We use the remaining 20 bits (on a 32-bit file system) for the number of pages because $2^{20}$ 4KB pages (the typical page size on i386 and SPARCv8 systems) would give us the exact maximum file size we can encode in 4 bytes on a 32-bit file system, as explained next; similarly $2^{52}$ 4KB pages is the exact maximum file size on a 64-bit file system.

The index file also contains the original file's size in the second word. We store this information in the index file so that commands like ls -l and others using stat(2) would work correctly; a process looking at the size of the file through the upper level file system would get the original number of bytes and blocks. The original file's size can be computed from the starting offset of the last data chunk in the encoded file, but it would require decoding the last (possibly incomplete) chunk (bytes 10000-10120 in the encoded file in Figure 3) which can be an expensive operation depending on the SCA. Storing the original file size in the index file is a speed optimization that only consumes one more word--in a physical data block that most likely was already allocated.

The index file is small. We store one word (4 bytes on a 32-bit file system) for each data page (usually 4096 bytes). On average, the index table size is 1024 times smaller than the original data file. For example, an index file that is exactly 4096 bytes long (one disk block on an Ext2 file system formatted with 4KB blocks) can describe an original file size of 1022 pages, or 4,186,112 bytes (almost 4MB).

Since the index file is relatively small, we read it into kernel memory as soon as the main file is open and manipulate it there. That way we have fast access to the index data in memory. The index information for each page is stored linearly and each index entry typically takes 4 bytes. That lets us compute the needed index information simply and find it from the index table using a single dereference into an array of 4-byte words (integers). To improve performance further, we write the final modified index table only after the original file is closed and all of its data flushed to stable media.

The size of the index file is less important for SCAs which increase the data size, such as unicoding, uuencoding, and some forms of encryption. The more the SCA increases the data size, the less significant the size of the index file becomes. Even in the case of SCAs that decrease data size (e.g., compression) the size of the index file may not be as important given the savings already gained from compression.

Storing the index information in a separate file has four advantages: meta-data separation, improved performance, recoverability, and simplified code management.

Many file systems separate data and meta data. This is a time-honored file system design tradition and is done for efficiency and reliability. Meta-data is considered more important and so it gets cached, stored, and updated differently than regular data. For example, in most Unix file systems [BertoniBertoni1998,McKusick, Joy, Leffler, and FabryMcKusick et al.1984,McKusick and GangerMcKusick and Ganger1999], file data is separated from meta-data in its placement on disk (file name, inode number, owner, timestamps, access modes, etc.). The index file is separate from the encoded file data for the same reasons.
Having a separate and small meta-data file allows us to look up offset information in that file quickly, then apply that offset to the main data file in one operation. An alternative would have been to merge the index information and data together. We considered three variations to that alternative: to append, prepend, or spread the index information in the data file. If we append the index information to the data file, performance would be hurt for sequential file reads from the beginning of the file, because we would have to seek to the end of the file constantly to look up and update index information. For the same reason, prepending the index information to the data file would hurt data access at the end of the file, such as when data is appended to files. The third variant was to include a small amount of index information at different locations in the file, perhaps at fixed locations or right before each encoded data chunk. With this technique, however, random reads and writes would be hurt; to find the right place to access the data with an index spread throughout the file, potentially many index chunks would have to be located and read from the beginning of the file.
For reliability reasons, we designed the index file so it could be recovered from the data file in case the index file is lost or damaged (see Section 5). This offers certain improvements over typical Unix file systems: if their meta-data (inodes, inode blocks and indirect blocks, directory data blocks, etc.) is lost, it rarely can be recovered. Note that the index file is not needed for our system to function: it represents a performance enhancing tool. Without the index file, size-changing file systems would perform poorly. Therefore, if it does not exist (or is lost), our system automatically regenerates the index file.
In addition, if we had not separated the index information from the data file, our kernel code would have been significantly more complicated. Particularly cumbersome would have been the need to shift index and data information around to make space for growing either a data portion or an index chunk.

Separating the index information and data file also has two disadvantages. First, since the index information is stored in a separate file, it uses up one more inode. We measured the effect that the consumption of an additional inode would have on typical file systems in our environment. We found that disk data block usage is often 6-8 times greater than inode utilization on disk-based file systems, leaving plenty of free inodes to use. Nevertheless, to save resources even further, we efficiently support zero-length files: a zero-length original data file is represented by a zero-length index file (consuming an inode but no data blocks).

Second, with a separate index file, potential naming conflicts could occur with the names of files and the names of their respective index files. For example, with our chosen design, if a file system has a file $F$ , its index file would be named $F$ .idx. This prevents a user from creating a data file whose name is $F$ .idx. Although this file's index file could exist and would be named $F$ .idx.idx, the actual data file would conflict with the index file of $F$ . To alleviate this problem in our chosen design, the prefix .idx can be dynamically assigned at mount time as a mount option.

Overall, however, we believe that the benefits of the separation of index and data files outweigh the disadvantages.

Next: 4 File Operations Up: Fast Indexing: Support for Previous: 2 Background

Erez Zadok 2001-10-06