The UNIX filesystem is a classification device; files (inodes) are identified by a set of keywords separated by slashes. Usually these keywords involve, among others:

  * the hierarchy ('/', '/usr', '/usr/local', '/opt', ...)
  * the role of the file ('bin', 'lib', 'man', 'include', ...)
  * the name of the file ('bash.1', 'bc', 'gnus.el', 'libX11.so', ...)
  * the (opt.) framework to which the file belongs ('TeX', 'X11R6', 'kde', ...)
  * the (opt.) name of the package ('bash', 'dvips', 'xemacs', 'gcc', ...)

NOTE: a framework, like 'X11R6' or 'kde', is not a package; it is a collection of packages, grouped because they need/support each other. Typically it is a collection of packages that are all based on the same libraries and/or file format.

If the set of keywords could be given in any order to identify the same inode, everything would be fine. Unfortunately order matters, so one has to arrange inodes in a hierarchical fashion, which means that some keywords have to be more significant than others; the set of keywords becomes a pathname.

Now there are many plausible ordering conventions for the components of a pathname. The DOS ordering that you admire is basically:

    PACKAGE/FILE

and this has been carried over to a large extent into Windows. This is what I call a ``separate'' package filesystem layout: in the filesystem a separate subtree is used for each package.

The UNIX ordering instead is the classic:

    HIERARCHY/[FRAMEWORK/]ROLE/[PACKAGE/]FILE

This is what I call a ``merged'' package filesystem layout: packages are merged into the same subtree(s).
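To make the two orderings concrete, here is a minimal Python sketch that builds both pathnames from the same set of keywords. The function names 'separate_path' and 'merged_path' and their parameters are purely illustrative; they come from no tool or standard, they just spell out the two templates above.

```python
from typing import Optional


def separate_path(package: str, name: str) -> str:
    """DOS-style ``separate'' ordering: PACKAGE/FILE."""
    return f"{package}/{name}"


def merged_path(hierarchy: str, role: str, name: str,
                framework: Optional[str] = None,
                package: Optional[str] = None) -> str:
    """UNIX-style ``merged'' ordering:
    HIERARCHY/[FRAMEWORK/]ROLE/[PACKAGE/]FILE."""
    # Strip the trailing slash so that the root hierarchy '/' does not
    # produce a doubled slash when the components are joined.
    parts = [hierarchy.rstrip("/")]
    if framework:
        parts.append(framework)
    parts.append(role)
    if package:
        parts.append(package)
    parts.append(name)
    return "/".join(parts)


if __name__ == "__main__":
    # The same file, classified both ways.
    print(separate_path("jpeg-6b", "cjpeg"))
    print(merged_path("/usr", "bin", "cjpeg"))
    # Optional framework and package keywords.
    print(merged_path("/usr", "bin", "xterm", framework="X11R6"))
    print(merged_path("/usr", "share/doc/packages", "README", package="jpeg"))
```

The point of the sketch is only that both layouts carry the same keywords; what differs is which keyword ends up most significant.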
Let's look at the different layouts, for example for JPEG:

DOS ``separate'':

    jpeg-6b/README
    jpeg-6b/cjpeg
    jpeg-6b/cjpeg.1.gz
    jpeg-6b/coderules.doc
    jpeg-6b/djpeg
    jpeg-6b/djpeg.1.gz
    jpeg-6b/filelist.doc
    jpeg-6b/install.doc
    jpeg-6b/jconfig.doc
    jpeg-6b/jconfig.h
    jpeg-6b/jerror.h
    jpeg-6b/jmorecfg.h
    jpeg-6b/jpeglib.h
    jpeg-6b/jpegtran
    jpeg-6b/jpegtran.1.gz
    jpeg-6b/libjpeg.a
    jpeg-6b/libjpeg.la
    jpeg-6b/libjpeg.so
    jpeg-6b/libjpeg.so.62
    jpeg-6b/libjpeg.so.62.0.0
    jpeg-6b/rdjpgcom
    jpeg-6b/rdjpgcom.1.gz
    jpeg-6b/structure.doc
    jpeg-6b/usage.doc
    jpeg-6b/wizard.doc
    jpeg-6b/wrjpgcom
    jpeg-6b/wrjpgcom.1.gz

UNIX ``merged'':

    /usr/bin/cjpeg
    /usr/bin/djpeg
    /usr/bin/jpegtran
    /usr/bin/rdjpgcom
    /usr/bin/wrjpgcom
    /usr/include/jconfig.h
    /usr/include/jerror.h
    /usr/include/jmorecfg.h
    /usr/include/jpeglib.h
    /usr/lib/libjpeg.a
    /usr/lib/libjpeg.la
    /usr/lib/libjpeg.so
    /usr/lib/libjpeg.so.62
    /usr/lib/libjpeg.so.62.0.0
    /usr/share/doc/packages/jpeg/README
    /usr/share/doc/packages/jpeg/coderules.doc
    /usr/share/doc/packages/jpeg/filelist.doc
    /usr/share/doc/packages/jpeg/install.doc
    /usr/share/doc/packages/jpeg/jconfig.doc
    /usr/share/doc/packages/jpeg/structure.doc
    /usr/share/doc/packages/jpeg/usage.doc
    /usr/share/doc/packages/jpeg/wizard.doc
    /usr/share/man/man1/cjpeg.1.gz
    /usr/share/man/man1/djpeg.1.gz
    /usr/share/man/man1/jpegtran.1.gz
    /usr/share/man/man1/rdjpgcom.1.gz
    /usr/share/man/man1/wrjpgcom.1.gz

Now, what are the implications of these two organizations?

For the DOS ``separate'' style:

Advantages:

  * Removing a package is very easy: just delete the one directory.
  * There are no namespace conflicts between packages.

Disadvantages:

  * You have 2000 directories if you have 2000 packages.
  * Every possible path in the system has got 2000 elements: PATH, MANPATH, INFOPATH, and you need 2000 '-I' options to 'cc', 2000 '-L' options to 'ld', and so on.
  * Each of these directories may contain from a couple to a few thousand files, with extremely wide variations.
  * When searching for a command via the PATH, the full package directory must be linearly scanned (for conventional directory implementations), even if there are only a few executables among thousands of other files, and so on for all the other paths.

The disadvantages are enormous, and the advantages are much smaller and can be obtained in a better way (see below); so it is such a stupid joke that it cannot be taken seriously.

With the UNIX ``merged'' style:

Advantages:

  * Paths can be pretty short; they contain the names of the major hierarchies, and of the frameworks inside each hierarchy; for example:

        PATH='/bin:/usr/bin:/usr/X11/bin:/usr/TeX/bin'

  * Directories to be searched are neither too small nor too large: given that only the 'bin' directory of a hierarchy is scanned when looking for a command, all the icon, library, ... file names need not be skipped over.
  * Given that files are grouped by hierarchy and framework, people who don't want to use TeX just omit it from their paths; people who don't want to use unofficial packages just omit '/usr/local' from their paths.
  * It is very easy to see what kind of commands/libraries/icons/... one has on a system: files are all conveniently grouped by role in a few locations.

Disadvantages:

  * It is hard to remove a package, as its files are scattered across a number of directories that they share with other packages.
  * There can be namespace conflicts between packages, inasmuch as they belong to the same hierarchy/framework.

The critical point here is paths: UNIX does use paths extensively for very good reasons, and is based on very many relatively small, modular packages. The DOS layout works for DOS because under DOS one has only a few monolithic packages that are in effect their own specialized shells.
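The path-search cost described above can be illustrated with a toy model. The following Python sketch is not a measurement of any real system: it assumes a synthetic set of packages ('pkg0', 'pkg1', ...), made-up directory names, and a plain linear scan of each directory on the path, as a conventional directory implementation would do. 'build_layouts' and 'lookup' are hypothetical helpers invented for this illustration.

```python
def build_layouts(n_packages, exes_per_pkg, others_per_pkg):
    """Build a synthetic system in both layouts.

    Returns ((separate_dirs, separate_path), (merged_dirs, merged_path)),
    where each *_dirs maps a directory name to its list of entries and
    each *_path is the list of directories a command search must visit.
    """
    separate_dirs = {}
    # '/usr/lib' stands in for every non-'bin' role directory.
    merged_dirs = {"/usr/bin": [], "/usr/lib": []}
    for p in range(n_packages):
        pkg = f"pkg{p}"
        exes = [f"{pkg}-cmd{i}" for i in range(exes_per_pkg)]
        others = [f"{pkg}-file{i}" for i in range(others_per_pkg)]
        # Separate layout: everything of a package in one directory.
        # Executables are placed last, so this is the pessimistic case.
        separate_dirs[f"/{pkg}"] = others + exes
        # Merged layout: files are split by role.
        merged_dirs["/usr/bin"].extend(exes)
        merged_dirs["/usr/lib"].extend(others)
    separate_path = list(separate_dirs)   # one path element per package
    merged_path = ["/usr/bin"]            # one path element per hierarchy
    return (separate_dirs, separate_path), (merged_dirs, merged_path)


def lookup(command, path, dirs):
    """Search 'path' in order, scanning each directory linearly.

    Returns (directory_or_None, number_of_entries_examined).
    """
    examined = 0
    for directory in path:
        for entry in dirs.get(directory, []):
            examined += 1
            if entry == command:
                return directory, examined
    return None, examined


if __name__ == "__main__":
    (sep_dirs, sep_path), (mer_dirs, mer_path) = build_layouts(2000, 3, 50)
    target = "pkg1999-cmd2"
    print("separate:", len(sep_path), "path elements,",
          lookup(target, sep_path, sep_dirs)[1], "entries examined")
    print("merged:  ", len(mer_path), "path elements,",
          lookup(target, mer_path, mer_dirs)[1], "entries examined")
```

In this model the separate layout pays for every non-executable file in every package directory that precedes the hit, while the merged layout only ever scans executables. Real filesystems with hashed or tree-structured directories would change the numbers, but not the length of the path.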
The same applies to a large extent to Windows: if you look at the typical 'Add/Remove Programs' list of installed packages it has only 2-3 dozen entries; and invocation of programs is not via a shell that searches paths, but via a set of cascading menus with dozens of entries pointing to GUI frontends that are mini shells themselves.

And as to these mini GUI shells, Windows cheats: the registry in effect contains immense numbers (sometimes tens of thousands) of hardcoded paths to each little COM component invoked by these mini GUI shells, which have to be searched for painfully and slowly by the system, thus leading to the registry bloat and increasing slowness of all Windows systems.

While a UNIX style ``merged'' package layout is very good for use (both convenience and speed), it is not as good for administration. The reason is that, if the keywords must be ordered, use is easier if files are grouped by listing the hierarchy/framework/role first, while administration is easier if the package name is listed first.

The reason for the latter is that for administration one wants to be able to get easy answers to two package/file membership questions:

  * Given a package name, what are the pathnames of the files that belong to it?
  * Given a pathname, which package does the file belong to?

The DOS filesystem layout is just a crude way of being able to answer these two questions by brutally encoding the package name as the most significant component of the pathname, and usability be damned. The UNIX layout does not allow these questions to be answered easily, even if it results in a rather more usable layout.

There are two common solutions to the problem under UNIX:

  * The 'depot'-style (``link farm'') one, which is to create _two_ filesystem structures and link each file into both: once with the package name as a prefix (hard link), and once with the hierarchy/framework as a prefix (usually this is a symbolic link).
  * The RPM/DPKG style in which files are given just one pathname of the UNIX sort, but there is also a separate database that records the mapping between packages and pathnames.

For example under the 'depot' system, the 'jpeg' package would be laid out as follows:

  PACKAGE LAYOUT                                            HIERARCHY LAYOUT
  /usr/depot/jpeg-6b/bin/cjpeg                              /usr/bin/cjpeg
  /usr/depot/jpeg-6b/bin/djpeg                              /usr/bin/djpeg
  ....
  /usr/depot/jpeg-6b/include/jconfig.h                      /usr/include/jconfig.h
  /usr/depot/jpeg-6b/include/jerror.h                       /usr/include/jerror.h
  ....
  /usr/depot/jpeg-6b/lib/libjpeg.a                          /usr/lib/libjpeg.a
  /usr/depot/jpeg-6b/lib/libjpeg.la                         /usr/lib/libjpeg.la
  ....
  /usr/depot/jpeg-6b/share/doc/packages/jpeg/README         /usr/share/doc/packages/jpeg/README
  /usr/depot/jpeg-6b/share/doc/packages/jpeg/coderules.doc  /usr/share/doc/packages/jpeg/coderules.doc
  ....
  /usr/depot/jpeg-6b/share/man/man1/cjpeg.1.gz              /usr/share/man/man1/cjpeg.1.gz
  /usr/depot/jpeg-6b/share/man/man1/djpeg.1.gz              /usr/share/man/man1/djpeg.1.gz
  ....

Overall RPM/DPKG style package management is perhaps justifiably more popular, because most link farms use a symbolic link for the hierarchy/framework style pathname, e.g.:

  /usr/bin/cjpeg -> /usr/depot/jpeg-6b/bin/cjpeg
  ....
  /usr/include/jconfig.h -> /usr/depot/jpeg-6b/include/jconfig.h
  ....
  /usr/lib/libjpeg.a -> /usr/depot/jpeg-6b/lib/libjpeg.a
  ....
  /usr/share/doc/packages/jpeg -> /usr/depot/jpeg-6b/share/doc/packages/jpeg
  ....
  /usr/share/man/man1/cjpeg.1.gz -> /usr/depot/jpeg-6b/share/man/man1/cjpeg.1.gz

and that is ugly and somewhat inefficient (hard links would be better).

In practice using an external database is OK, because package files are used much more often than they are installed/removed, so the layout that is better for usage (UNIX style) is used for the filesystem, and a separate, out-of-band, structure is used to track package/file relationships.
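The out-of-band database idea can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how RPM or DPKG actually store their data; the class 'PackageDB' and its method names are hypothetical. It keeps two indexes so that both membership questions are answered directly (the real-world analogues being queries such as 'rpm -ql'/'rpm -qf' and 'dpkg -L'/'dpkg -S'), and it also refuses namespace conflicts at install time.

```python
class PackageDB:
    """Toy package/file ownership database (illustrative only)."""

    def __init__(self):
        self._files_by_package = {}   # package name -> set of pathnames
        self._package_by_file = {}    # pathname -> package name

    def install(self, package, pathnames):
        """Record that 'package' owns 'pathnames'; reject conflicts."""
        pathnames = list(pathnames)
        # Check everything first, so a conflict leaves the database unchanged.
        for path in pathnames:
            owner = self._package_by_file.get(path)
            if owner is not None and owner != package:
                raise ValueError(f"{path} already belongs to {owner}")
        for path in pathnames:
            self._package_by_file[path] = package
            self._files_by_package.setdefault(package, set()).add(path)

    def files_of(self, package):
        """Given a package name, which pathnames belong to it?"""
        return sorted(self._files_by_package.get(package, ()))

    def owner_of(self, pathname):
        """Given a pathname, which package does it belong to (or None)?"""
        return self._package_by_file.get(pathname)

    def remove(self, package):
        """Forget a package; return the pathnames that should be deleted."""
        paths = sorted(self._files_by_package.pop(package, ()))
        for path in paths:
            del self._package_by_file[path]
        return paths


if __name__ == "__main__":
    db = PackageDB()
    db.install("jpeg", ["/usr/bin/cjpeg", "/usr/bin/djpeg",
                        "/usr/include/jpeglib.h", "/usr/lib/libjpeg.a"])
    print(db.files_of("jpeg"))
    print(db.owner_of("/usr/lib/libjpeg.a"))
    print(db.remove("jpeg"))
```

With such a structure, removing a package is again one operation even though its files are scattered across the merged layout, which recovers the main advantage of the ``separate'' style without paying for it in path length.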
This is an excellent compromise, and a quick, simple, good database for tracking package/file ownership is a good deal better than perverting the filesystem into a stupid database by using the key (the package name) as the first component of every pathname, never mind putting all files in one directory.

All this said, I share the dismay at '/usr/bin' having 2000 entries in many distributions. This is due to some very regrettable factors:

  * Most grunts doing scutwork like building packages at distribution vendors seem to be hopeless, ingrained newbies. The few people who had some idea of how things should be done have long promoted themselves to management and mostly spend their time talking ``strategy'' at Linux conventions. Just the same as any other company really.

    The historically correct UNIXy approach to the problem of excessively large 'bin', 'include', 'lib' directories is to recognize that some important, large groups of packages cluster naturally into ``frameworks'', like TeX or X, and that '/usr' is just the ``file/text based'' framework, and to give frameworks involving large numbers of files their own [sub]hierarchy, thus reducing the number of files in the ``main'' hierarchies. This lengthens the path a bit, but as long as it stays well below 2000 elements, it's fine.

  * Most distribution vendors try desperately to minimize support calls, and don't have the intellectual fortitude (what Bill Gates calls ``bandwidth'') to think things through.

    If UNIX/Linux packages are properly clustered into frameworks each with its own [sub]hierarchy, then with that comes the power and responsibility to decide which frameworks to put in the various paths; for example a StarOffice user will maybe not want to pollute his visible namespace and paths with the TeX framework.
    Unfortunately this requires educating unaware users as to these important concepts, and/or making the default paths cover the set of available frameworks rather than a fixed default set, which typically is something like '/bin:/usr/bin'. For the bandwidth challenged it is much easier to do neither, and to merge absolutely everything into '/usr'.

The same factors demand that '/dev' have 8000 entries, one for every possible installed device; if your users are too unaware to create those they need themselves, and you want to minimize support calls, just create a single directory with all possible devices.