Notes about UNIX/Linux coding pragmatics

Updated: 2006-04-20
Created: 2004-05-09

Pragmatics and UNIX/Linux programming style

Pragmatics is, along with syntax and semantics, one of the central aspects of a program. While syntax is about language and semantics about effects, pragmatics is about quality, that is, usefulness.

Under UNIX/Linux there are some established programming conventions that amount to good pragmatics, and are inspired by some important aspects of the UNIX/Linux architecture:

Programs can be connected using pipes.
The major coarse-level abstraction mechanism of UNIX/Linux is the pipe, by which the output of one program immediately becomes the input of another.
Programs can be invoked within scripts.
Not only can commands be invoked on the command line, but they can also be invoked by scripts, as scripts are a good way to combine programs to provide new functionality.
Programs have a very long useful life, and get modified a lot because of source availability.
Many UNIX/Linux programs have been around for thirty years, and have been ported to many platforms and have spawned many derivatives.
Files/memory are plain flat byte streams/arrays.
Both the ephemeral (memory) and the persistent (files) storage abstractions are untyped, boundaryless byte streams, including most devices, and the API for accessing all types of files is (mostly) the same.
Programs can have multiple output channels, and these can be independently (re)directed to different output media.
This is supported by the OS with file descriptors, the stdio library with FILE pointers and by the shell with file descriptor redirection.
Programs can be decomposed into libraries, and use libraries.
UNIX has two program decomposition techniques: in the large, pipes and scripts; in the small, libraries. Both are very frequently used, as C in particular is a language suitable (with some important limitations) for writing standalone runtime libraries.
There are powerful search/replace and sorting tools and libraries.
This means that reprocessing large amounts of data output by a program is easy, and it is useful to do so in a surprisingly large number of cases.
Many popular tools generate source code as output or process source code as input.
This means that source code is not always, or even often, authored by humans; for example the C compiler almost never processes source code authored by humans, but almost always the output of the C preprocessor. Also, humans must always eventually use a program, such as an editor (or even cat), to actually record a program text, and it is a good idea to make it easy for that program to process, and in the case of an editor to reprocess, that text.

These aspects are radically different from those that pertain to many other popular operating systems.

There is also a general principle of programming: program texts should speak for themselves, as their purpose is to communicate a program precisely and clearly (both to humans and to the other programs or hardware that read them).

The ostensible purpose of a program is to achieve an intended effect, but programmers (as well as compilers, CPUs, etc.) cannot write or read programs, only program texts.

The quality of a program is then a consequence of the quality of the program text, as the program text is much more important to the lifetime of a software project.

Many of the conventions listed below apply to any platform, and they will be marked as such.

Pragmatics details

All debug, error and progress messages should go to stderr.
To make it possible to redirect or pipe just the actual output of the program, for further processing.
Every program message should contain the name of the program as the first thing. (any platform)
If several programs are used in a script, message texts should make it easy to figure out which program emitted them.
Error messages should contain a direct report of the operation that failed and its operands, not a periphrase. (any platform)
A periphrase does not identify what is needed to fix the problem. For example, a message like Configuration unavailable does not help anywhere near as much as Cannot use configuration '/etc/prog.conf', which is in turn rather inferior to Cannot open for reading file '/etc/prog.conf', if that is what actually was attempted.
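The three rules above on messages can be sketched together in C. This is only an illustration: the helper names format_open_error and open_for_reading, and the program name passed to them, are assumptions made for the example, not part of any library.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a message that starts with the
   program name, then names the failed operation and its quoted
   operand, then gives the system's reason for the failure. */
static int format_open_error(char *buffer, size_t size,
    const char *program, const char *path, int error_number)
{
    return snprintf(buffer, size,
        "%s: cannot open for reading file '%s': %s",
        program, path, strerror(error_number));
}

/* Tries to open 'path' for reading. On failure the report goes to
   stderr, so that stdout carries only the program's actual output.
   Very long paths are truncated in the message by the fixed buffer. */
static FILE *open_for_reading(const char *program, const char *path)
{
    FILE *const file = fopen(path, "r");

    if (file == NULL) {
        const int saved_errno = errno; /* later calls may change it */
        char message[1024];

        format_open_error(message, sizeof message,
            program, path, saved_errno);
        fprintf(stderr, "%s\n", message);
    }
    return file;
}
```

Because the operand is quoted, an empty file name still shows up visibly as '' in the message.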
By default a program should only print very minimal output, ideally nothing, and there should be an option to make it verbose.
Programs that have verbose output by default don't fit well within scripts or pipes, especially as conditions in if or while.
Both program source and program output should fit in less than 80 columns, ideally less than 72 columns.
This principle comes from the days when punched cards had 80 columns, of which often the last 8 contained a card number.
However it embodies a profound wisdom: that the human eye tracks pages vertically more easily than horizontally, beyond a certain narrow horizontal limit, which seems to be around 65 characters.
Around 70 characters is also, not by mere coincidence, what fits on a line on a letter/A4 page when printed in a decent point size.
Also, it is much easier to indicate grouping and structure vertically, for example by using blocks and empty lines between blocks.
Finally, a definite, traditional and low limit on line length means that there is a chance that nobody will have to scroll horizontally to read the source or output text of a program, and scrolling horizontally usually works a lot less smoothly and cleanly than vertical scrolling.
Embedded in each program's source and binaries there should be a string that identifies the program and its version, and there should be an option that prints it. (any platform)
When asking for support or reporting bugs there must be a way to identify exactly the version being used. Software often lives for a long time, and many variants of any successful software then coexist.
It is important that the exit code of a program be properly set to zero for success or to non zero for failure.
To make it possible to do error checking and recovery in scripts.
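A minimal sketch of the idea, assuming a hypothetical program whose only job is to check that a file can be read; the function name check_readable is invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical program body: returns the status that main() should
   return, EXIT_SUCCESS (zero) or EXIT_FAILURE (non zero). */
static int check_readable(const char *path)
{
    FILE *const file = fopen(path, "r");

    if (file == NULL)
        return EXIT_FAILURE;
    fclose(file);
    return EXIT_SUCCESS;
}
```

A main() that ends with return check_readable(argv[1]); can then be used directly as the condition of an if or while in a shell script.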
Programs should be written keeping in mind that they can be killed at any time. (any platform)
Because, at the very least, killing a script or a process group may have the side effect of killing that program. Fortunately many if not most programs are idempotent.
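One common way to stay safe when killed halfway through an update, offered here as an assumption rather than as something these notes prescribe, is to write the new contents to a temporary file and then rename() it over the old one. The function name replace_file and the ".tmp" suffix are invented for this sketch.

```c
#include <stdio.h>

/* Writes 'text' to a temporary file next to 'path', then renames it
   over 'path'. If the program is killed before the rename, the old
   file is untouched; after it, the new file is complete. */
static int replace_file(const char *path, const char *text)
{
    char temporary[4096];
    const int length =
        snprintf(temporary, sizeof temporary, "%s.tmp", path);

    if (length < 0 || (size_t) length >= sizeof temporary)
        return -1;

    FILE *const file = fopen(temporary, "w");
    if (file == NULL)
        return -1;

    const int written = fputs(text, file) != EOF;
    const int closed = fclose(file) == 0;

    if (!written || !closed || rename(temporary, path) != 0) {
        remove(temporary); /* leave no partial file behind */
        return -1;
    }
    return 0;
}
```

Running the same update twice gives the same result, which is the idempotence mentioned above.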
All parameters and other local names should be declared as much const as possible. (any platform)
  • Accidental errors are prevented.
  • The compiler can often perform much better optimization.
  • On reading the source, one can be sure that certain entities will not change, without having to check for the rest of the scope in which they are defined, making it much quicker to understand a bit of code.
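As a small illustration in C, with a function name invented for the example:

```c
#include <stddef.h>

/* Counts how many times 'wanted' occurs in 'text'. Neither the
   parameters nor the text they refer to can be modified here. */
static size_t count_char(const char *const text, const char wanted)
{
    size_t count = 0; /* one of only two names meant to change */

    for (const char *cursor = text; *cursor != '\0'; cursor++)
        if (*cursor == wanted)
            count++;
    return count;
}
```

A reader can see at a glance that only count and cursor vary; everything else is fixed for the whole scope.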
Function parameters should be in left to right order most specific to most generic. (any platform)
  • Allows a natural sorting order of functions that is both nicer for reading and easier for error checking.
  • Makes it much clearer in which order partial application should happen.
  • Helps with understanding the goals of the function.
File sections should usually be most specific to most generic top to bottom. (any platform)
This is based on the idea that, when looking at a file, one should see the most specific information first. However in some cases (usually configuration files), if there are multiple otherwise somewhat equivalent data items, the first one is taken, in others the last one; then the order of file sections should respect the particular override logic of the applications reading the file.
Programs should use directory paths.
Where applicable, when a program attempts to open a file whose name is not absolute, it should attempt to open it by using as prefix the elements of a list of directories, just as executables are searched for in the list contained in the PATH environment variable, especially if the file is a configuration file. As a rule the program should have a default directory list, and this should be overridable with an environment variable.
Default directory paths should include the current and home directory.
When a directory path is used for searching for files with a relative name, the default should include the current and home directory. The default directory path should have directories in most specific to most generic order, and prefixed with the current directory (as such), the home directory, /usr/local, the root directory and /usr in this order. For example configuration files should be searched for in a directory path like:
.:$HOME/etc:/usr/local/etc:/etc:/usr/etc
just like executables should be searched for in a directory path like:
.:$HOME/bin:/usr/local/bin:/bin:/usr/bin
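A sketch of such a search in C follows. The function names are invented for the example, and so is whatever environment variable name is passed to search_list. Expanding $HOME inside the default list is left out; a real program would have to build that element from getenv("HOME").

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the directory list held in the environment variable
   'variable', or 'fallback' if that variable is not set. */
static const char *search_list(const char *variable,
    const char *fallback)
{
    const char *const override = getenv(variable);

    return override != NULL ? override : fallback;
}

/* Opens 'name' for reading. An absolute name is opened as it is; a
   relative one is tried under each element of the colon separated
   'directories' list, in order. An empty element means the current
   directory, as in PATH. Returns NULL if nothing can be opened. */
static FILE *open_on_path(const char *directories, const char *name)
{
    if (name[0] == '/')
        return fopen(name, "r");

    for (const char *element = directories; ; ) {
        const size_t length = strcspn(element, ":");
        char candidate[4096];
        const int written = length == 0
            ? snprintf(candidate, sizeof candidate, "./%s", name)
            : snprintf(candidate, sizeof candidate, "%.*s/%s",
                (int) length, element, name);

        /* Candidates that do not fit in the buffer are skipped. */
        if (written > 0 && (size_t) written < sizeof candidate) {
            FILE *const file = fopen(candidate, "r");

            if (file != NULL)
                return file;
        }
        if (element[length] == '\0')
            return NULL;
        element += length + 1;
    }
}
```

Since the first match wins, the most specific directories must come first in the list.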
Program input should not have arbitrary and small size limits. (any platform)
With pipes, a program, rather than a human user, can be the author of the input, and programs may have far fewer limitations than humans as to the size of the things they can output.
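For instance, instead of reading lines into a fixed array, a program can grow its buffer as needed. The following is a sketch in portable C; the name read_line is invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads one line of any length from 'input', without the newline.
   Returns a malloc()ed string that the caller must free(), or NULL
   at end of input or if memory runs out. */
static char *read_line(FILE *input)
{
    size_t capacity = 64; /* starting size only, not a limit */
    size_t length = 0;
    char *line = malloc(capacity);
    int c;

    if (line == NULL)
        return NULL;
    while ((c = fgetc(input)) != EOF && c != '\n') {
        if (length + 1 >= capacity) {
            char *const bigger = realloc(line, capacity * 2);

            if (bigger == NULL) {
                free(line);
                return NULL;
            }
            line = bigger;
            capacity *= 2;
        }
        line[length++] = (char) c;
    }
    if (c == EOF && length == 0) {
        free(line);
        return NULL;
    }
    line[length] = '\0';
    return line;
}
```

On POSIX systems the getline() library function offers the same service.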
Output and input files should be in text format in almost all cases.
For easy piping into text-based processing tools like sort or perl, and for easy reading and writing by humans.
Input expected by programs should be terse.
To make it easier for programs and humans to generate it.
White space should be allowed in input. (any platform)
Where possible, arbitrary white space should be allowed in textual program input (as a separator usually); usually “newline” should be considered as white space too.
The default output or input column separator should be a sequence of spaces and tabs, or else a colon or other punctuation character if the data can contain spaces or tabs.
This matches tradition, and makes splitting each line into fields easy, for the benefit of columnar oriented scripting languages like AWK or Perl, or utilities like sort.
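The same convention is also easy to honour from C. The sketch below splits a line on runs of white space, treating newline as white space too; the name split_fields is invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Splits 'line' in place into fields separated by runs of spaces,
   tabs or newlines. Stores up to 'limit' field pointers in 'fields'
   and returns how many fields were stored. */
static size_t split_fields(char *line, char *fields[], size_t limit)
{
    static const char separators[] = " \t\n";
    size_t count = 0;
    char *cursor = line + strspn(line, separators);

    while (*cursor != '\0' && count < limit) {
        fields[count++] = cursor;
        cursor += strcspn(cursor, separators);
        if (*cursor != '\0') {
            *cursor++ = '\0'; /* terminate this field */
            cursor += strspn(cursor, separators);
        }
    }
    return count;
}
```

Leading, trailing and repeated separators are all accepted, so both hand-typed and generated input work.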
GUI based programs should have an equivalently featured command line mode. (any platform)
For use in scripts and pipes.
All variable parts of a program message should be enclosed in some kind of delimiter so that it be obvious when they are the empty string. (any platform)
To avoid idiocies like Cannot open file %s., which becomes Cannot open file . if the argument is the empty string.
Identifiers should be built with most generic to most specific subparts in left to right order. (any platform)
This gives the natural sorting order for sorting in languages with left to right writing order. Too bad that email addresses, domains, numbers and many date formats don't respect this principle.
In each source file, whether it is a header or code does not matter, there should be a list of all and only the header files that contain definitions of entities used by the program. (any platform)
This is the only way to ensure correct dependencies among headers and among sources and headers.
Header file includes should be listed in most generic to most specific top to bottom, first thing in a file. (any platform)
This is to prevent more specific definitions overriding more generic ones. Such an override is particularly awful if a definition in a system header is overridden.
All file scope identifiers should have storage class static. (any platform)
Both to prevent interfile namespace pollution and mysterious problems because of accidental name coincidence. It is particularly awful if a definition in one file has the same name as a definition in a system library used by many other files.
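A sketch of a source file that follows this rule, with invented names. Only the one function that other files are meant to call is left without static.

```c
#include <stddef.h>

/* Private to this file: the name is not visible to the linker, so
   it cannot collide with a definition in another file or library. */
static int is_blank(const char *line)
{
    while (*line == ' ' || *line == '\t' || *line == '\n')
        line++;
    return *line == '\0';
}

/* The one deliberately exported entry point of this file. */
size_t count_blank_lines(const char *const lines[], size_t count)
{
    size_t blank = 0;

    for (size_t i = 0; i < count; i++)
        if (is_blank(lines[i]))
            blank++;
    return blank;
}
```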
Header files should always be protected in their entirety by a multiple inclusion guard. (any platform)
This is the only clean way to prevent multiple redefinitions with multiple inclusions, and as a rule it also speeds up compilation. The only possible exception is a header file that intentionally defines an entity differently for each inclusion, but these should almost never be written, or even imagined.
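The usual shape of such a guard is shown below. The file name buffer.h, the macro BUFFER_H and the declarations inside are all invented for the example.

```c
/* buffer.h: a hypothetical header, guarded in its entirety. */
#ifndef BUFFER_H
#define BUFFER_H

#include <stddef.h>

struct buffer {
    char *bytes;
    size_t length;
};

size_t buffer_length(const struct buffer *buffer);

#endif /* BUFFER_H */
```

Including this header twice in the same source file defines struct buffer only once, because the second inclusion finds BUFFER_H already defined.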
As a rule program options should be processed with the getopt() library function, and preferably with the getopt_long() GNU variant.
There should be an option that prints a brief help message with the command invocation syntax. (any platform)
This means that the program is to a limited but useful extent self documenting at runtime, and that documentation is easy to keep up to date as it is small and within the program text itself.
Programs and libraries should have suitable man pages.
Documentation is usually either reference or task oriented. Man pages are the summary of reference documentation, and they are very useful for avoiding too much help material inside the program itself; programs should not be documentation processors, man does that.
Reference documentation should be terse. Task oriented documentation, like HOWTOs and user guides, doesn't need to be terse. (any platform)
Reference documentation's most important property is that it should be accurate, and the second most important is ease of finding the relevant bit. Verbosity interferes with both. Reference documentation is not meant to explain what/how to do things, but it may contain examples to clarify meanings.
Declarations and definitions should be shared in header files in a single copy, not repeated in several places. (any platform)
This sounds obvious, but some people don't do it.
Programs should be written as collections of libraries glued together by fairly small data navigation code. (any platform)
So that libraries be reusable and behaviour and even representation be sharable, which helps in scripts and similar. For example, if many programs use the same hash database library, one can embed that library in a scripting language like Perl, and then scripts in that language can be used to access and manipulate the data maintained by many other applications.
The principle to keep most data files in text form is in effect a special case of this principle.
There should not be magic numbers in the code, but almost all constants, even if used only once, should be given descriptive names. (any platform)
This is because usually such numbers embody assumptions, the assumptions are pragmatic, and such pragmatics ought to be made explicit; many of these pragmatics for example involve units of measure.
A number does not speak for itself; a named constant does, and also allows easy consistency checking between its value and its name. For example fragments like if (length > MIN_WEIGHT) or #define MAX_VOLUME -10 tend to suggest something is amiss.
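A small sketch, with invented names and values, of constants whose names carry their unit:

```c
/* Each name states its unit, so a mismatch is visible on reading. */
#define SECONDS_PER_MINUTE 60
#define RETRY_INTERVAL_MINUTES 5 /* hypothetical setting */

static unsigned retry_interval_seconds(void)
{
    return RETRY_INTERVAL_MINUTES * SECONDS_PER_MINUTE;
}
```

Written as a bare 300, the same value would say nothing about being a time, let alone in which unit.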
There should be common use of assert() to document expected invariants. (any platform)
assert() can be a debugging aid, but it is mostly a code reading aid, as it documents what the author expects at that point.
Defensive coding in libraries is not appropriate. (any platform)
It is appropriate for checking input. Internal and library functions should instead use assert() to document assumptions about their parameters.
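The difference between the two can be sketched as follows, with invented function names: text coming from outside is checked and rejected, while an internal function merely states what it expects.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* External input: checked defensively, failure goes to the caller.
   Returns 0 and stores the value, or -1 if 'text' is not a whole
   number between 0 and 100. */
static int parse_percentage(const char *text, int *percentage)
{
    char *end;

    errno = 0;
    const long value = strtol(text, &end, 10);

    if (errno != 0 || end == text || *end != '\0'
        || value < 0 || value > 100)
        return -1;
    *percentage = (int) value;
    return 0;
}

/* Internal function: the assumption is documented, not worked
   around. */
static long scale(long amount, int percentage)
{
    assert(percentage >= 0 && percentage <= 100);
    return amount * percentage / 100;
}
```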
In particular large complex programs should contain copious conditional debugging traces. (any platform)
These debugging traces are usually far more useful than a debugger session because they make the program text speak for itself, not only dynamically, but statically too, as they document assumptions of the author as to what is relevant and/or hairy.
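One possible shape for such traces, assuming C99 variadic macros; the names TRACE and trace_enabled are invented for the example.

```c
#include <stdio.h>

/* Set at run time, for example from a hypothetical -d option. */
static int trace_enabled = 0;

/* Prints a trace line to stderr, prefixed with its source position.
   Needs a format string and at least the arguments it names. */
#define TRACE(...) \
    do { \
        if (trace_enabled) { \
            fprintf(stderr, "%s:%d: ", __FILE__, __LINE__); \
            fprintf(stderr, __VA_ARGS__); \
            fputc('\n', stderr); \
        } \
    } while (0)

static unsigned greatest_common_divisor(unsigned a, unsigned b)
{
    while (b != 0) {
        const unsigned remainder = a % b;

        TRACE("a=%u b=%u remainder=%u", a, b, remainder);
        a = b;
        b = remainder;
    }
    return a;
}
```

Even with tracing off, the TRACE line tells the reader which values the author thought worth watching.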
Comments should be used to elucidate non trivial assumptions and design decisions that are not obvious from the code. (any platform)
Other than that the program text should speak for itself. But the program text usually cannot express well its intent or the possible alternatives that have not been written and whose consideration might elucidate it.
Some careful consideration should be given to naming. (any platform)
The most important aim of program text writing is not that the program it describes works, but that it communicates clearly what it does. Working correctly is a consequence of that. Naming has a large impact on the reading of the program text by a human.
The traditional UNIX naming convention for functions is object then verb, not verb then object.
As in fopen for file open. This respects the principle that names should be built with parts in most generic to most specific order.
Output files should be in strictly simple tabular/columnar format, possibly without headers, unless verbose output is requested.
For easy postprocessing by simple columnar tools like sort, which in particular is extremely important. Stupid things like /proc/meminfo are hard to easily split and process.
Variables should be defined in the narrowest scope for which they are used; in particular, global variables should be avoided. (any platform)
This aids program comprehension and debugging considerably; if a variable is only used in a small range of code, that range should become a block and the variable defined in it, so that it be clear that it cannot be used or modified anywhere else, which makes understanding its role significantly quicker and easier, as one needs only to comprehend a small scope.
Identifiers should be longer the wider their scope. (any platform)
In part to reduce the chance of name ambiguities, in part to communicate implicitly by that length the width of the scope in which an identifier is defined, in part because identifiers with a wide scope usually are mentioned less frequently than identifiers with a narrow scope.
Common subexpressions or paragraphs of code should not be repeated but given names. (any platform)
If the same subexpression or paragraph of code occurs identically or similarly in a section of program text usually it expresses a particular concept relevant to that section; writing it down once and naming it explicitly with a suggestive name helps the program text speak for itself. It also helps ensuring that the various uses of the same concept are indeed the same.
Comments should be in traditional parenthetical form, with no boxing, and preferably without the left asterisk margin either. (any platform)
For ease of justification and other processing by source code tools, including editors.
Code should be disabled with #if 0 or if (0), not with commenting. (any platform)
Code is not text, and any tools that process source files, for example beautifiers, will handle code differently from comments.
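For example, in this invented function the averaging step is switched off without being turned into a comment:

```c
#include <stddef.h>

static long sum(const int *values, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++)
        total += values[i];
#if 0
    /* Disabled, not deleted: tools still see this as code. */
    total /= (long) count;
#endif
    return total;
}
```

Unlike a comment, an #if 0 region can itself contain comments without any nesting problem.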
Code should be written and indented in a regular way with systematic layout and naming conventions. (any platform)
This helps the program text speak for itself, and the regularity as a rule helps make structure evident, as a change in the shape of the text then only reflects a change in the shape of the structure of the program, not shifts in the layout or naming conventions. It also helps catch mistakes, which often assume the textual form of irregularities.
Program output should be in the same syntax as program input.
In order to make it easy to pipe back to a program its own output, it should be in the same syntax as its input or command line arguments. For example, the contents of the lines of /etc/fstab correspond to the syntax for the arguments to mount.
Processes and system modules should publish extensive state as real or virtual files.
When a process or a system module keeps state, this should be available as a summary in a file, either a plain or a device file; such a file can be real or virtual, like those under /proc in some versions of the system.

References