With all of the excitement lately about various software firms planning Linux ports of their products, it’s easy to lose sight of the great power and versatility of the unsung small utilities which are a part of every Linux distribution. These tools, mostly GNU versions of small programs such as awk, grep and sed, date back to the early pioneer days of Unix and have been in wide use ever since. They typically have specialized capabilities and become especially useful when they are chained together and data is piped from one to another. Often a shell script serves as the matrix in which they do their work.
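As a small illustration of this kind of chaining (a sketch only, not taken from either package discussed below), here is a pipeline that counts how often each login shell appears in a passwd-style file. The function name shell_usage is made up for the example, and the file is assumed to use the usual colon-separated layout with the shell in the seventh field.

```shell
# shell_usage FILE: tally the login shells named in a passwd-style file.
# Hypothetical helper; each stage is a standard utility doing one small job.
shell_usage() {
    grep -v '^#' "$1" |          # drop comment lines
        awk -F: '{ print $7 }' | # keep the seventh field (the login shell)
        sed 's|.*/||' |          # strip the directory part, leaving e.g. "bash"
        sort |                   # group identical names together
        uniq -c |                # count each group
        sort -rn                 # most common shell first
}
```

Calling shell_usage /etc/passwd prints one line per shell, each preceded by its count.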
Sometimes a piece of software native to another operating system is ported to Linux as an independent unit, without taking advantage of pre-existing tools which might have reduced both its size and its memory usage. It’s always a pleasure to happen upon software written with an awareness of the power of Linux and its native utilities. Bu is a backup program and NoSQL is an ASCII-table relational database system; what they have in common is their use of simple but effective Linux tools to accomplish their respective tasks.
Making a backup of the myriad files on a Linux system isn’t necessary for most stand-alone single-user machines. Backing up configuration and personal files to floppies or other removable media is normally all that is necessary, assuming that a recent Linux distribution CD and a CD-ROM drive are available. The situation becomes more complex with multi-user servers or with machines used in a business setting, where the sheer number of irreplaceable files makes this simple method impractical and time-consuming; in these cases the traditional method in the Unix world has been to use cpio or another archiving utility to copy files to a tape drive. Though the price of hard disks has plummeted in recent years while their capacity has ballooned, reliable tape drives capable of storing the vast amounts of data a modern hard disk can hold are still quite expensive, sometimes rivalling the cost of the computer they protect from loss of data.
Vincent Stemen has developed a small backup utility called bu which is shell-based and makes good use of standard Linux utilities such as cp and sed. Rather than being intended for backups to tape or other streaming device, bu is designed to mirror files on another file-system, preferably located on a separate hard drive.
Bu is just a twelve-kilobyte shell script along with a few configuration files. It’s remarkably capable; compare this list of features with those of other backup utilities:
- Checks timestamps and only copies new or changed files
- Deals with symbolic links intelligently
- Writes a log-file upon completion
- Will ignore directories which are mounted filesystems
- Easy specification of files and directories to include or exclude
Earlier versions of bu used cpio extensively, but due to a problem with new directory permissions cp is now the main engine of the utility. Used by itself, cp -a can bulk-copy entire filesystems to a new location, but the symbolic links would have to be dealt with manually, which is time-consuming. Also missing would be the ability to automatically include and exclude specific files and directories; bu refers to two configuration files, /usr/local/backups/Exclude and /usr/local/backups/Include, for this information.
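The timestamp check at the heart of this approach can be sketched in a few lines of shell. This is not bu's actual code, just a minimal illustration under stated assumptions: the function name mirror_changed is hypothetical, only regular files are handled (no symbolic links, no Include or Exclude lists), and file names are assumed not to contain newlines.

```shell
# mirror_changed SRC DST: copy regular files from SRC into DST, but only
# those that are missing from DST or newer than the copy already there.
# Hypothetical sketch; bu itself does considerably more than this.
mirror_changed() {
    src=$1
    dst=$2
    # -xdev keeps find on a single filesystem, so anything mounted
    # below SRC is skipped.
    ( cd "$src" && find . -xdev -type f ) | while IFS= read -r f; do
        target="$dst/$f"
        if [ ! -e "$target" ] || [ "$src/$f" -nt "$target" ]; then
            mkdir -p "$(dirname "$target")"
            cp -p "$src/$f" "$target"   # -p preserves mode and timestamps
            echo "copied $f"
        fi
    done
}
```

Because cp -p carries the modification time over to the copy, a second run against an unchanged source tree copies nothing.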
This small and handy utility isn’t intended to completely supplant traditional tape-drive backup systems, but its author has been using bu as the basis of a backup strategy involving several development machines and several gigabytes of data. Bu can be obtained from this web-page; be sure to read the white paper included in the distribution which details the rationale behind the utility.
Carlo Strozzi (a member of the Italian Linux society) has developed a relational database management system (RDBMS) which uses tab-delimited ASCII tables as its data format. NoSQL is a descendant of an RDBMS developed by Walter W. Hobbs (of the RAND Corporation) called RDB. The commercial product /rdb sold by Revolutionary Software is similar, but uses more compiled C code for greater speed.
Carlo Strozzi had this to say about his motivation for developing NoSQL (excerpted from the documentation):
Several times I have found myself writing applications that needed to rely upon simple database management tasks. Most commercial database products are often too costly and too feature-packed to encourage casual use. There are also plenty of good freeware databases around, but they too tend to provide far more that I need most of the times, and they too lack the shell-level approach of NoSQL. Admittedly, having been written with interpretive languages (Shell, Perl, AWK), NoSQL is not the fastest DBMS of all, at least not always (a lot depends on the application).
The philosophy behind these database systems is well-expressed in an article titled A 4GL Language, which was written by Evan Schaffer and Mike Wolf, founders of Revolutionary Software. The paper originally appeared in the March 1991 issue of the Unix Review; a Postscript version is included with the NoSQL documentation. Here is the abstract:
There are many database systems available for UNIX. But almost all are software prisons that you must get into and leave the power of UNIX behind. Most were developed on operating systems other than UNIX. Consequently their developers had very few software features to build upon, and wrote the functionality they needed directly, without regard for the features provided by the operating system. The resulting database systems are large, complex programs which degrade total system performance, especially when they are run in a multi-user environment. UNIX provides hundreds of programs that can be piped together to easily perform almost any function imaginable. Nothing comes close to providing the functions that come standard with UNIX. Programs and philosophies carried over from other systems put walls between the user and UNIX, and the power of UNIX is thrown away. The shell, extended with a few relational operators, is the fourth generation language most appropriate to the UNIX environment.
The complete article is well worth reading for anyone who has ever wondered just why Linux software is different from that used with mainstream operating systems, and why GUI software has only recently begun to become common.
NoSQL incorporates the ideas presented above. A major difference between Walter W. Hobbs’ RDB database and NoSQL is that NoSQL uses awk extensively to perform tasks handled by perl in RDB. Awk is a more specialized tool with a much smaller memory footprint, and since the data-pipelining which is the essence of these relational database management systems requires repeated invocation of their respective interpreters, NoSQL exerts less of a strain on a system’s resources, especially important in a multi-user environment.
After installing the package (no compilation is involved) a new subdirectory under /usr/local/lib called nosql will be created and populated; it will have these subdirectories:
- awk: several awk scripts which are responsible for most of the table-manipulation jobs
- doc: both Postscript and HTML versions of the readable and complete NoSQL documentation, as well as a Postscript version of the Schaffer and Wolf article from the Unix Review
- mylib: an empty directory for new scripts and programs
- perl: perl scripts which perform other NoSQL functions
- sh: shell scripts which act as wrappers for the awk and perl scripts
The entire subdirectory occupies just under 600 KB, most of which is documentation.
After installing the files, the only other step needed before trying out the database is setting three environment variables. Here are three lines from my .zshenv file (bash users should have these lines in the .bash_profile file):
export NSQLIB=/usr/local/lib/nosql
export NSQSH=/bin/ash
export NSQAWK=/usr/bin/mawk
Carlo Strozzi recommends using ash rather than one of the larger and more powerful shells such as bash or zsh; ash uses less memory, and since the shell is repeatedly invoked while using NoSQL the upshot will be a noticeable increase in speed and a reduction in memory requirements.
Since there is no compiled code in the package, NoSQL should run on any machine which has awk and perl available; in other words the database isn’t Linux-centric. The ASCII format of the data tables is also very portable, and can be manipulated by text editors and common filesystem tools. Data can be extracted from tables by means of various “operators” via input-output redirection (e.g., pipes, STDIN and STDOUT). The only limits on the amount of data which can be handled are set by the machine running NoSQL; the installed memory and processor speed are the limiting factors.
As the name implies this is not an SQL database, which should make NoSQL more accessible to users lacking SQL expertise. I don’t know SQL at all and I found the basic commands of NoSQL easy to learn. All commands are executed as parameters of the nosql shell script. Here’s an example NoSQL table:
Name	Freq	Height	Season
----	----	------	------
laccaria	27	6	Fall
lepiota	5	8	Summer
amanita	42	7	Summer
lentinus	85	5	Spring-Fall
morchella	45	6	Spring
boletus	65	5	Summer
russula	75	4	Summer
Single tabs must separate the fields from each other; even the gaps between the groups of dashes on the dashed separator line must be single tabs. An alternate format for the tabular data is the list; the above table can be converted to this format with the command
nosql tabletolist < [filename]
The results look like this:
Name	laccaria
Freq	27
Height	6
Season	Fall

Name	lepiota
Freq	5
Height	8
Season	Summer

Name	amanita
Freq	42
Height	7
Season	Summer

Name	lentinus
Freq	85
Height	5
Season	Spring-Fall

Name	morchella
Freq	45
Height	6
Season	Spring

Name	boletus
Freq	65
Height	5
Season	Summer

Name	russula
Freq	75
Height	4
Season	Summer
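To show how little machinery such a conversion needs, here is a rough awk equivalent. It is only a sketch, not the script NoSQL ships: the function name table_to_list is hypothetical, and the output layout (one tab-separated name and value per line, with a blank line between records) is an assumption about the list format.

```shell
# table_to_list [FILE]: turn a tab-delimited table with a header row and a
# dashed separator row into one "name<TAB>value" line per field.
# Hypothetical sketch; reads standard input when no file is given.
table_to_list() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }  # header row
        NR == 2 { next }                                          # dashed row
        {
            if (NR > 3) print ""          # blank line between records
            for (i = 1; i <= NF; i++) print name[i] "\t" $i
        }
    ' "$@"
}
```

It is used the same way as the real command, for example table_to_list < pilze.rdb.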
If the above table were named pilze.rdb, either the command
nosql istable < pilze.rdb
or
nosql islist < pilze.rdb
would ask nosql to check the validity of the file, depending on whether it is in table or list format. Another command,
nosql edit < pilze.rdb
will open the file in the editor defined by the EDITOR environment variable (often set to vi by default). A file in table format is automatically converted into the vertical list format for easier editing, then changed back into a table when exiting the editor. When the file is saved or closed NoSQL will automatically check the validity of the format and give the line numbers where any errors occur. This seemingly obsessive concern with correct format isn’t mere pedantry; the various NoSQL operators which manipulate and extract data need to be able to quickly distinguish headers from data and data-fields from each other, and single tabs are the criterion.
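A format check of this kind is easy to approximate with awk. The following is a sketch under my own assumptions about what counts as valid (every line has as many tab-separated fields as the header, and the second line consists only of dashes); the function name check_table is hypothetical and the real NoSQL checks may differ.

```shell
# check_table [FILE]: report lines whose field count differs from the header
# and verify that line 2 is a row of dashes. Exits non-zero on any error.
# Hypothetical sketch, not the validation code shipped with NoSQL.
check_table() {
    awk -F'\t' '
        NR == 1 { cols = NF }             # the header fixes the column count
        NR == 2 {                         # separator fields must be dashes
            for (i = 1; i <= NF; i++)
                if ($i !~ /^-+$/) {
                    print "line 2: separator must be dashes"
                    bad = 1
                    break
                }
        }
        NF != cols {
            print "line " NR ": expected " cols " fields, found " NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$@"
}
```

A line in which spaces were typed instead of a tab shows up as having too few fields, which is exactly the kind of mistake that would otherwise confuse the operators further down a pipeline.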
There are over forty operator functions available, some of which extract or rearrange fields while others are used to generate reports. Their names are more-or-less mnemonic, such as inscol and addcol, which are used to insert a column into a table, respectively on the left- or right-hand side. Other operators index and search tables. Examples of typical usage (i.e., connecting NoSQL commands with pipes) are included in the documentation.
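The general shape of such a pipeline can be imitated with two small awk filters. These are not NoSQL operators and do not use NoSQL's syntax; select_eq and project are hypothetical names for a sketch that keeps rows where a named column equals a value, then keeps only the named columns. Both read a table on standard input and write a table on standard output, which is what lets them be chained.

```shell
# select_eq COLUMN VALUE: keep the two header lines plus every row whose
# COLUMN field equals VALUE. Hypothetical filter, not a NoSQL operator.
select_eq() {
    awk -F'\t' -v col="$1" -v val="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i }
        NR <= 2 || (c && $c == val)
    '
}

# project COLUMN...: keep only the named columns, in the order given.
# Assumes column names contain no spaces. Hypothetical filter.
project() {
    awk -F'\t' -v OFS='\t' -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        NR == 1 {
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (k = 1; k <= n; k++)
                if (!(names[k] in pos)) {
                    print "no such column: " names[k] > "/dev/stderr"
                    exit 2
                }
        }
        {
            line = ""
            for (k = 1; k <= n; k++)
                line = line (k > 1 ? OFS : "") $(pos[names[k]])
            print line
        }
    '
}
```

With the example table stored in pilze.rdb, the summer mushrooms and their frequencies could be listed with select_eq Season Summer < pilze.rdb | project Name Freq.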
As with any Open-Source software, it’s hard to tell how many people or organizations are using it. In an e-mail, I asked Carlo Strozzi for examples of real-world usage of NoSQL; he replied that he has been using it quite a bit for database-backed CGI scripts for the WWW. He also stated that several companies in Italy are using it internally. Carlo Strozzi works for IBM in Italy, and he has developed several web applications backed by NoSQL; three of the publicly accessible pages are:
- Fortune companies and people profiles
- Classifieds (in Italian)
- Car classifieds (in Italian)
The latest version of NoSQL can be obtained from this FTP site.
Last modified: Thu 29 Oct 1998