"Linux Gazette...making Linux just a little more fun!"


How Not To Re-Invent The Wheel

By Larry Ayers


Introduction

With all of the excitement lately about various software firms planning Linux ports of their products, it's easy to lose sight of the great power and versatility of the unsung small utilities which are a part of every Linux distribution. These tools, mostly GNU versions of small programs such as awk, grep and sed, date back to the early pioneer days of Unix and have been in wide use ever since. They typically have specialized capabilities and become especially useful when they are chained together and data is piped from one to another. Often a shell script serves as the matrix in which they do their work.

Sometimes a piece of software native to another operating system is ported to Linux as an independent unit without taking advantage of pre-existing tools which might have reduced the size of the program and reduced memory usage. It's always a pleasure to happen upon software written with an awareness of the power of Linux and its native utilities. Bu is a backup program and NoSQL is an ASCII-table relational database system; what they have in common is their usage of simple but effective Linux tools to accomplish their respective tasks.


Shell-script Backups With bu

Making a backup of the myriad files on a Linux system isn't necessary for most stand-alone single-user machines. Backing up configuration and personal files to floppies or other removable media is normally all that is necessary, assuming that a recent Linux distribution CD and a CD-ROM drive are available. The situation becomes more complex with multi-user servers or with machines used in a business setting, where the sheer number of irreplaceable files makes this simple method impractical and time-consuming; in these cases the traditional method in the Unix world has been to use cpio or another archiving utility to copy files to a tape drive. Though the price of hard disks has plummeted in recent years while their capacity has ballooned, reliable tape drives capable of storing the vast amounts of data a modern hard disk can hold are still quite expensive, sometimes rivalling the cost of the computer they protect from loss of data.

Vincent Stemen has developed a small backup utility called bu which is shell-based and makes good use of standard Linux utilities such as cp and sed. Rather than being intended for backups to tape or other streaming device, bu is designed to mirror files on another file-system, preferably located on a separate hard drive.

Bu is just a twelve kilobyte shell script along with a few configuration files. It's remarkably capable; compare this list of features with those of other backup utilities:

Bu in its earlier versions used cpio extensively, but due to a problem with new directory permissions cp is now the main engine of the utility. Used by itself, cp -a can bulk-copy entire filesystems to a new location, but the symbolic links would have to be dealt with manually, which is time-consuming. Also missing would be the ability to automatically include and exclude specific files and directories; bu reads this information from two configuration files, /usr/local/backups/Exclude and /usr/local/backups/Include.
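The include/exclude idea can be sketched in a few lines of shell. This is not bu's code: the function name mirror_tree, the one-path-per-line list format, and the exact-match exclusion rule are all assumptions made for illustration, and the real configuration files may use a different syntax. The sketch also makes no attempt at the symbolic-link handling described above, and it can only exclude a path that appears as a whole line in the include list, not something nested inside an included directory.

```shell
#!/bin/sh
# Hypothetical sketch, not bu's actual code: mirror the paths listed in an
# include file into a destination directory, skipping any path that also
# appears (as an exact line) in an exclude file.
# Assumed list format: one absolute path per line.
mirror_tree() {
    include_list=$1
    exclude_list=$2
    dest=$3

    while IFS= read -r entry; do
        # Ignore blank lines in the include list.
        [ -n "$entry" ] || continue
        # Skip paths named exactly in the exclude list.
        if grep -qxF -- "$entry" "$exclude_list"; then
            continue
        fi
        # Recreate the parent directory under the destination, then copy
        # with cp -a so that permissions and timestamps are preserved.
        parent=$(dirname "$entry")
        mkdir -p "$dest$parent"
        cp -a "$entry" "$dest$parent/"
    done < "$include_list"
}
```

With a second disk mounted at, say, /mnt/mirror (a made-up mount point), a call would look like: mirror_tree /usr/local/backups/Include /usr/local/backups/Exclude /mnt/mirror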

This small and handy utility isn't intended to completely supplant traditional tape-drive backup systems, but its author has been using bu as the basis of a backup strategy involving several development machines and several gigabytes of data. Bu can be obtained from this web-page; be sure to read the white paper included in the distribution which details the rationale behind the utility.


The NoSQL Relational Database

Carlo Strozzi (a member of the Italian Linux society) has developed a relational database management system (RDBMS) which uses tab-delimited ASCII tables as its data format. NoSQL is a descendant of an RDBMS developed by Walter W. Hobbs (of the RAND Corporation) called RDB. The commercial product /rdb sold by Revolutionary Software is similar, but uses more compiled C code for greater speed.

Carlo Strozzi had this to say about his motivation for developing NoSQL (excerpted from the documentation):

Several times I have found myself writing applications that
needed to rely upon simple database management tasks. Most
commercial database products are often too costly and too
feature-packed to encourage casual use. There are also plenty of
good freeware databases around, but they too tend to provide far
more that I need most of the times, and they too lack the
shell-level approach of NoSQL. Admittedly, having been written
with interpretive languages (Shell, Perl, AWK), NoSQL is not the
fastest DBMS of all, at least not always (a lot depends on the
application).

The philosophy behind these database systems is well-expressed in an article titled A 4GL Language, which was written by Evan Schaffer and Mike Wolf, founders of Revolutionary Software. The paper originally appeared in the March 1991 issue of the Unix Review; a Postscript version is included with the NoSQL documentation. Here is the abstract:

There are many database systems available for UNIX. But almost
all are software prisons that you must get into and leave the
power of UNIX behind. Most were developed on operating systems
other than UNIX. Consequently their developers had very few
software features to build upon, and wrote the functionality they
needed directly, without regard for the features provided by the
operating system. The resulting database systems are large,
complex programs which degrade total system performance,
especially when they are run in a multi-user environment.

UNIX provides hundreds of programs that can be piped together to
easily perform almost any function imaginable. Nothing comes
close to providing the functions that come standard with
UNIX. Programs and philosophies carried over from other systems
put walls between the user and UNIX, and the power of UNIX is
thrown away.

The shell, extended with a few relational operators, is the
fourth generation language most appropriate to the UNIX
environment.

The complete article is well worth reading for anyone who has ever wondered just why Linux software is different from that used with mainstream operating systems, and why GUI software has only recently begun to become common.

NoSQL incorporates the ideas presented above. A major difference between Walter W. Hobbs' RDB database and NoSQL is that NoSQL uses awk extensively to perform tasks handled by perl in RDB. Awk is a more specialized tool with a much smaller memory footprint, and since the data-pipelining which is the essence of these relational database management systems requires repeated invocation of their respective interpreters, NoSQL exerts less of a strain on a system's resources, which is especially important in a multi-user environment.

After installing the package (no compilation is involved) a new subdirectory under /usr/local/lib called nosql will be created and populated; it will have these subdirectories:

awk
contains several awk scripts which are responsible for most of the table-manipulation jobs

doc
contains both Postscript and HTML versions of the readable and complete NoSQL documentation, as well as a Postscript version of the Schaffer and Wolf article from the Unix Review

mylib
an empty directory for new scripts and programs

perl
perl scripts which perform other NoSQL functions

sh
shell scripts which act as wrappers for the awk and perl scripts.

The entire subdirectory occupies just under 600 KB, most of which is documentation.

After installing the files, the only other step needed before trying out the database is setting three environment variables. Here are three lines from my .zshenv file (bash users should have these lines in the .bash_profile file):


export NSQLIB=/usr/local/lib/nosql
export NSQSH=/bin/ash
export NSQAWK=/usr/bin/mawk

Carlo Strozzi recommends using ash rather than one of the larger and more powerful shells such as bash or zsh; ash uses less memory, and since the shell is repeatedly invoked while using NoSQL the upshot will be a noticeable increase in speed and a reduction in memory requirements.
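Since a mistyped path in any of these three variables would leave NoSQL unable to find its scripts or interpreters, a small sanity check can be worthwhile. The function below is my own sketch, not part of the NoSQL package; the name check_nosql_env is made up, and it only tests that the directory exists and that the two interpreters are executable.

```shell
#!/bin/sh
# Hypothetical helper, not part of NoSQL: verify that the three variables
# point at something usable. Returns 0 if all three look right, 1 otherwise,
# and reports each problem on standard error.
check_nosql_env() {
    rc=0
    if [ ! -d "${NSQLIB:-}" ]; then
        echo "NSQLIB is not a directory: ${NSQLIB:-unset}" >&2
        rc=1
    fi
    if [ ! -x "${NSQSH:-}" ]; then
        echo "NSQSH is not executable: ${NSQSH:-unset}" >&2
        rc=1
    fi
    if [ ! -x "${NSQAWK:-}" ]; then
        echo "NSQAWK is not executable: ${NSQAWK:-unset}" >&2
        rc=1
    fi
    return $rc
}
```

Calling check_nosql_env right after the three export lines gives an immediate warning if, for instance, mawk is installed somewhere other than /usr/bin.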

Since there is no compiled code in the package, NoSQL should run on any machines which have awk and perl available; in other words the database isn't Linux-centric. The ASCII format of the data tables is also very portable, and can be manipulated by text editors and common filesystem tools. Data can be extracted from tables by means of various "operators" via input-output redirection (e.g., pipes, STDIN and STDOUT). The only limits on the amount of data which can be handled are in the machine running NoSQL; the installed memory and processor speed are the limiting factors.

As the name implies this is not an SQL database, which should make NoSQL more accessible to users lacking SQL expertise. I don't know SQL at all and I found the basic commands of NoSQL easy to learn. All commands are executed as parameters of the nosql shell script. Here's an example NoSQL table:

Name	   Freq	       Height	   Season
----	   ----	       ------	   ------
laccaria   27	       6	   Fall
lepiota	   5	       8	   Summer
amanita	   42	       7	   Summer
lentinus   85	       5	   Spring-Fall
morchella  45	       6	   Spring
boletus	   65	       5	   Summer
russula	   75	       4	   Summer

Single tabs must separate the fields from each other; even the gaps between the groups of dashes on the dashed separator line must be single tabs. An alternate format for the tabular data is the list; the above table can be converted to this format with the command

nosql tabletolist < [filename]

The results look like this:


Name	laccaria
Freq	27
Height	6
Season	Fall

Name	lepiota
Freq	5
Height	8
Season	Summer

Name	amanita
Freq	42
Height	7
Season	Summer

Name	lentinus
Freq	85
Height	5
Season	Spring-Fall

Name	morchella
Freq	45
Height	6
Season	Spring

Name	boletus
Freq	65
Height	5
Season	Summer

Name	russula
Freq	75
Height	4
Season	Summer
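The conversion is simple enough to sketch in awk, which gives some idea of why so much of the package can be written in it. This is an illustration only, not NoSQL's own tabletolist script; the function name table_to_list is made up, and the sketch assumes a header on the first line, the dashed line second, and single tabs between fields.

```shell
#!/bin/sh
# Illustrative sketch only, not NoSQL's own tabletolist script.
# Reads a tab-delimited table on stdin (header line, dashed line, data rows)
# and prints each row as a block of "column<TAB>value" lines.
table_to_list() {
    awk -F '\t' '
        NR == 1 { ncols = split($0, name, "\t"); next }  # remember column names
        NR == 2 { next }                                 # skip the dashed line
        {
            print ""                                     # blank line before each record
            for (i = 1; i <= ncols; i++)
                print name[i] "\t" $i
        }
    '
}
```

It is used the same way as the real command, with the table on standard input: table_to_list < [filename]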

If the above table were named pilze.rdb, either the command

nosql istable < pilze.rdb

or

nosql islist < pilze.rdb

would ask nosql to check the validity of the table or list format, respectively. Another command,

nosql edit < pilze.rdb

will open the file in the editor defined by the EDITOR environment variable (often set to vi by default). A file in table format is automatically converted into the vertical list format for easier editing, then changed back into a table when exiting the editor. When the file is saved or closed NoSQL will automatically check the validity of the format and give the line numbers where any errors occur. This seemingly obsessive concern with correct format isn't mere pedantry; the various NoSQL operators which manipulate and extract data need to be able to quickly distinguish headers from data and data-fields from each other, and single tabs are the criterion.
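To see why the single-tab rule makes checking cheap, here is a rough sketch of a table-format test. It is not the code behind nosql istable; the function name looks_like_table and the three rules it applies (at least two lines, a second line made only of dashes, and the same number of tab-separated fields on every line) are my own assumptions about what a minimal check might involve.

```shell
#!/bin/sh
# Illustrative sketch only, not NoSQL's own istable code.
# Succeeds if stdin looks like a table: a header line, a second line made
# only of dashes, and the same number of tab-separated fields on every line.
# Otherwise prints the number of the first offending line to standard error.
looks_like_table() {
    awk -F '\t' '
        NR == 1 { ncols = NF }                   # field count set by the header
        NF != ncols { bad = NR; exit }           # wrong number of fields
        NR == 2 {
            for (i = 1; i <= NF; i++)
                if ($i !~ /^-+$/) { bad = NR; exit }
        }
        END {
            if (NR < 2 && !bad) bad = NR + 1     # header or dashed line missing
            if (bad) {
                print "format error at line " bad > "/dev/stderr"
                exit 1
            }
        }
    '
}
```

A row in which two fields are separated by spaces instead of a tab has one field too few, so the check fails on that line number.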

There are over forty operator functions available, some of which extract or rearrange fields while others are used to generate reports. Their names are more-or-less mnemonic, such as inscol and addcol, which are used to insert a column into a table, respectively on the left- or right-hand side. Other operators index and search tables. Examples of typical usage (i.e., connecting NoSQL commands with pipes) are included in the documentation.
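The pipe-connected style can be illustrated with standard tools alone. In the sketch below awk, sort and cut stand in for NoSQL operators, whose exact syntax is covered in the package documentation rather than here; the function names rows_in_season and sort_by_freq are made up for this example. Each stage reads a table on standard input and writes a table on standard output, which is what allows the stages to be chained.

```shell
#!/bin/sh
# Illustrative sketch using only awk, sort and cut; these functions are
# stand-ins for the idea of chained operators, not NoSQL commands.

# Keep the two header lines, plus data rows whose Season column equals $1.
rows_in_season() {
    awk -F '\t' -v season="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Season") col = i }
        NR <= 2 || $col == season
    '
}

# Pass the two header lines through untouched, then sort the data rows
# numerically on the second column (Freq), largest first.
sort_by_freq() {
    IFS= read -r header
    IFS= read -r dashes
    printf '%s\n%s\n' "$header" "$dashes"
    sort -t "$(printf '\t')" -k 2,2nr
}

# Usage, assuming pilze.rdb holds the mushroom table shown earlier:
#   rows_in_season Summer < pilze.rdb | sort_by_freq | cut -f 1,2
```

Run against the mushroom table, the pipeline in the final comment should print the Name and Freq columns for russula, boletus, amanita and lepiota, in that order.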

As with any Open-Source software, it's hard to tell how many people or organizations are using it. In an e-mail, I asked Carlo Strozzi for examples of real-world usage of NoSQL; he replied that he has been using it quite a bit for database-backed CGI scripts for the WWW. He also stated that several companies in Italy are using it internally. Carlo Strozzi works for IBM in Italy, and he has developed several web applications backed by NoSQL; three of the publicly accessible pages are:

Fortune companies and people profiles

Classifieds - this is in Italian

Car classifieds, in Italian

The latest version of NoSQL can be obtained from this FTP site.


Last modified: Thu 29 Oct 1998


Copyright © 1998, Larry Ayers
Published in Issue 34 of Linux Gazette, November 1998

