Updated: 2013-01-30
Created: 2005-10-31
Older references are not quite accurate, because things in kernel 2.6 are much better than in kernel 2.4, and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. So the references below are ordered most recent first.
ext3 FAQ 2004-10-14.

| Feature | ext3 | JFS | XFS |
|---|---|---|---|
| Block sizes | 1024-4096 | 4096 | 512-4096 |
| Max fs size | 8TiB (2^43B) | 32PiB (2^55B) | 8EiB (2^63B), 16TiB (2^44B) on 32b system |
| Max file size | 1TiB (2^40B) | 4PiB (2^52B) | 8EiB (2^63B), 16TiB (2^44B) on 32b system |
| Max files/fs | 2^32 | 2^32 | 2^32 |
| Max files/dir | 2^32 | 2^31 | 2^32 |
| Max subdirs/dir | 2^15 | 2^16 | 2^32 |
| Number of inodes | fixed | dynamic | dynamic |
| Indexed dirs | option | auto | auto |
| Small data in inodes | no | auto (xattrs, dirs) | auto (xattrs, extent maps) |
| fsck speed | slow | fast | fast |
| fsck space | ? | 32B per inode | 2GiB RAM per 1TiB + 200B per inode (half on 32b CPU) |
| Redundant metadata | yes | yes | no |
| Bad block handling | yes | mkfs only | no |
| Tunable commit interval | yes | no | metadata |
| Supports VFS lock | yes | yes | yes |
| Has own lock/snapshot | no | no | yes |
| Names | 8 bit | UTF-16 or 8 bit | 8 bit |
| noatime | yes | yes | yes |
| O_DIRECT | yes | yes | yes |
| barrier | yes | no | yes (and checks) |
| commit interval | yes | no | no |
| EA/ACLs | both | both | both |
| Quotas | both | both | both |
| DMAPI | no | patch | option |
| Case insensitive | no | mkfs only | mkfs only (since 2.6.28) |
| Supported by GRUB | yes | yes | mostly |
| Can grow | online | online only | online only |
| Can shrink | offline | no | no |
| Journals data | option | no | no |
| Journals what | blocks | operations | operations |
| Journal disabling | yes | yes | no |
| Journal size | fixed | fixed | grow/shrink |
| Resize journal | offline | maybe | offline |
| Journal on another partition | yes | yes | yes |
| Special features or misfeatures | In place convert from ext2. MS Windows drivers. | Case insensitive option. Low CPU usage. DCE DFS compatible. OS2 compatible. | Real time (streaming) section. IRIX compatible. Very large write behind. Project (subtree) quotas. Superblock on sector 0. |
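
The size limits in the table pair a binary-prefix figure with a power of two. A minimal sketch to check that the two forms agree, in Python; the function name `pow2_size` is my own and purely illustrative:

```python
# Binary prefixes, each step being a factor of 2^10.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]

def pow2_size(exponent):
    """Render 2^exponent bytes using binary prefixes, e.g. 43 -> '8TiB'."""
    index, rest = divmod(exponent, 10)
    if index >= len(UNITS):
        raise ValueError("exponent too large for the units listed")
    return f"{2 ** rest}{UNITS[index]}"

# The exponents that appear in the table rows for maximum sizes.
for exponent in (40, 43, 44, 52, 55, 63):
    print(f"2^{exponent}B = {pow2_size(exponent)}")
```

This only checks the arithmetic of the table entries, not the limits themselves.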
This section is about known hints and issues with various aspects of common filesystems. These range from mere inconveniences or limitations to severe performance problems.
inode size of 128 bytes is used or kept. If so, multiple updates per second are not recorded. This can impact make processing and fsync.
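
Whether a given filesystem records sub-second timestamps can be probed from user space. The following is a rough sketch in Python, assuming that a filesystem with one-second resolution truncates a timestamp set via `utime`; the function name and the probe value are hypothetical, and the result only describes the filesystem holding the given directory:

```python
import os
import tempfile

def records_subsecond_mtime(directory):
    """Return True if a sub-second mtime set on a scratch file is kept."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        # Arbitrary probe value: 1 second plus a non-zero nanosecond part.
        wanted_ns = 1_000_000_000 + 123_456_789
        os.utime(path, ns=(wanted_ns, wanted_ns))
        # If the fractional part survives, resolution is finer than 1s.
        return os.stat(path).st_mtime_ns % 1_000_000_000 != 0
    finally:
        os.unlink(path)

print(records_subsecond_mtime(tempfile.gettempdir()))
```

A `False` result would suggest that tools comparing modification times, like make, cannot distinguish updates made within the same second.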
Kernel version independent hints:
rotors directories across AGs, and then attempts to allocate space for new files in the AG containing the directory, which is quite different from the alternative because
if you create a bunch of files in the same directory, without inode64 XFS will scatter the extents all over the disk rather than trying to allocate them next to each other.
allocation group, if all allocation groups are in use to grow extents, writing can stop for all other files, or similarly if the files are in the same allocation group. Having more allocation groups typically improves multithreaded performance.
will disable XFS' write barrier support.
Kernel version dependent hints:
The default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.
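
To see which elevator a block device is currently using, Linux exposes `/sys/block/<device>/queue/scheduler`, where the active one is shown in square brackets. A small sketch in Python; the function names are my own, and the device name `sda` in the usage comment is only an example:

```python
def active_scheduler(text):
    """Pick the bracketed entry out of a line like 'noop deadline [cfq]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None  # no bracketed entry found

def read_scheduler(device):
    """Read the active scheduler for a block device name such as 'sda'."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return active_scheduler(handle.read())

# Example with a literal line, so it runs without any particular device:
print(active_scheduler("noop deadline [cfq]"))
# On a real system: read_scheduler("sda")
```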
prepare_krb5_rfc_cfx_buffer: not implemented
Summary of conditions for a working NFSv4 with Kerberos GSSAPI authentication and/or encryption:
realm name used for both.
enctype des-cbc-crc for kernel versions older than 2.6.35.
Some useful pages for using NFSv4 with Kerberos:
Version independent:
volume is actually a subtree of AFS directories and files, and a
partition that holds volumes is actually a subtree of some native operating system filesystem, whether the partition is on a fileserver or is the cache on a client.
partition anything that is mounted under directories whose name begins with
vicep in the system's root directory.
partition holding OpenAFS volumes, as long as it is mounted via a loop device.
partition for AFS volumes on an OpenAFS fileserver does not need to be in its own dedicated block device, and neither does the AFS cache filetree on an OpenAFS client, but out of space conditions caused by space in the filetree being less than that declared for the OpenAFS may be handled badly. The /vicepAB
partitions which are not mount points will however be ignored unless they contain a file called AlwaysAttach.
cell names are case insensitive but they are stored internally in uppercase and printed in lower case. By convention they should always be specified in lower case, as there are default mappings to case sensitive Kerberos realm names in all upper case and to case insensitive DNS domain names in all lower case.
dynroot because it relies on libafscp which does not handle synthetic roots.
window style flow control algorithm similar to TCP, but the maximum window size is much smaller, which limits performance on links with a large BDP. The protocol allows up to 256 outstanding packets, but versions of OpenAFS limit that to 32 packets, with the exception of the YFS version which allows for the full 256 packets.
So, setting a UDP buffer of 8Mbytes from user space is _just_ enough to handle 4096 incoming RX packets on a standard ethernet. However, it doesn't give you enough overhead to handle pings and other management packets. 16Mbytes should be plenty providing that you don't
a) Dramatically increase the number of threads on your fileserver
b) Increase the RX window size
c) Increase the ethernet frame size of your network (what impact this has depends on the internals of your network card implementation)
d) Have a large number of 1.6.0 clients on your network

To summarise, and to stress Dan's original point - if you're running with the fileserver default buffer size (64k, 16 packets), or with the standard Linux maximum buffer size (128k, 32 packets), you almost certainly don't have enough buffer space for a loaded fileserver.
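
The figures in the quote, and the RX window limits mentioned before it, lend themselves to some back-of-the-envelope arithmetic. The sketch below is in Python and rests on assumptions that are mine, not the quote's: roughly 4KiB of kernel buffer accounting per RX packet (implied by "64k, 16 packets" and "128k, 32 packets"), the Linux behaviour documented in socket(7) of doubling an `SO_RCVBUF` value requested from user space, and a hypothetical 1400 byte payload per RX packet on standard ethernet.

```python
PER_PACKET_BYTES = 4096   # assumption: implied by 64k/16 and 128k/32
PAYLOAD_BYTES = 1400      # assumption: hypothetical RX payload size

def packets_that_fit(kernel_buffer_bytes, per_packet_bytes=PER_PACKET_BYTES):
    """How many queued packets a kernel receive buffer can account for."""
    return kernel_buffer_bytes // per_packet_bytes

def window_limited_throughput(window_packets, rtt_ms,
                              payload_bytes=PAYLOAD_BYTES):
    """Upper bound in bytes/second with window_packets in flight per RTT."""
    return window_packets * payload_bytes * 1000 // rtt_ms

KIB = 1024
MIB = 1024 * KIB

print(packets_that_fit(64 * KIB))      # fileserver default buffer
print(packets_that_fit(128 * KIB))     # standard Linux maximum
print(packets_that_fit(2 * 8 * MIB))   # 8Mbytes requested, doubled by kernel

# Window limit on a link with a 100ms round trip time:
print(window_limited_throughput(32, 100))    # 32 packet window
print(window_limited_throughput(256, 100))   # 256 packet window
```

Under these assumptions the 8Mbytes request works out to the 4096 packets of the quote, and the 32 packet window caps a single connection at well under 1MB/s on a 100ms link, whatever the link bandwidth.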
partition as the read-write one, as that is essentially free as it does not require file copying.
release (updating read-only volumes to have the same content as a read-write volume) the read-only replica in the same partitions gets updated very quickly, and then other read-only replicas get updated from it, reducing the latency of the
release operation.
partitions on the same server is a design error, and there are checks against that, but some corner cases can be missed by the checks.
quorum related reasons the number of AFS db servers should be odd (1, 2).
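
The preference for an odd number of db servers can be illustrated with plain majority arithmetic. This is a simplified sketch in Python that assumes a strict majority quorum and ignores any tie-breaking rule the actual protocol may apply; the function names are my own:

```python
def majority(servers):
    """Smallest number of servers that forms a strict majority."""
    return servers // 2 + 1

def tolerated_failures(servers):
    """How many servers can be lost while a majority still remains."""
    return servers - majority(servers)

for servers in range(1, 6):
    print(servers, majority(servers), tolerated_failures(servers))
```

Under that assumption going from 3 to 4 servers still tolerates only one failure, so the extra even-numbered server adds load without adding resilience.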
Version dependent:
partitions up to and including OpenAFS version 1.4 must be on an ext2 or ext3 filesystem.
These are pointers to some of the entries in my technical blog where filesystems are discussed:
fsck times
ext2 for all my MS Windows filesystems except the boot one.
ext3 with and without extended attributes and ext3's new hash directory indices.
fsck.
davtools package to visualize ext3 fragmentation.
fsck takes more than one month, and some filesystems being VLDBs.
ext2 for MS Windows.
noatime.
ext3 into something else.
works means for filesystems.
works for file systems.
root filesystem.
This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.
ABNR which describes an extent containing zero bytes only.
btree and the leaf extents are called xtrees (and contain an array of entries called xads) if they are for an allocation map, and dtrees if they are for a directory map.
jfs_fsck.
bmap, is a file (not a B+-tree, despite being called map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
jfs_fsck if any.
dinomap, and after that a number of extents called inode allocation groups.
dinomap contains:
tied to it, until all such extents are freed.
dtree entries.