The TONTO input file: technical specification

The TONTO input file serves a dual purpose: it is both a database-style specification and a command-language interface.

The input file itself is a sequence of characters written in a certain way, according to certain grammar rules. In order to explain these grammar rules, a Backus-Naur notation is first introduced. The specification is given in this notation.

Programmers who write new modules should stick to the input format described here.

The ideas for the input file described here are similar to the STAR (Self-defining Text Archive and Retrieval) format already used and standardised in crystallographic databases.

It is extremely important to have a well-defined but flexible way to write input files, in order to facilitate data deposition and database construction. Mining databases for critical information will be a significant part of how future scientific research is conducted.

Rules to explain the rules for writing input files

An input file is made up of a sequence of grammar elements. Grammar elements are written in uppercase, possibly followed by a comma-separated list of other grammar elements, all surrounded by round brackets: LIKE-THIS, or THIS(2,3), or EVEN(LIKE,THIS). A grammar element does not stand for itself, literally, but for a specific sequence of characters. The exact sequence of characters is given after an arrow symbol. Thus:

ZERO     -> 0
GREETING -> hi 
MY-NAME  -> dylan
means that the grammar element ZERO stands for the digit 0, while the grammar elements GREETING and MY-NAME stand for the characters hi and dylan, respectively. Here, 0 and hi and dylan stand literally for themselves, and not for any other group of characters. Except for the special characters discussed below, literal text will always be represented in lowercase.

Although it is possible to use uppercase characters in an input file, we shall not do so here, to avoid confusion with the uppercase grammar element symbols. Uppercase characters in the input file are equivalent to lowercase characters, unless they are surrounded by double-quote characters.
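
The case rule can be illustrated with a minimal Python sketch. It is not taken from the TONTO source: the function name fold_case is hypothetical, and the sketch assumes that double quotes are neither nested nor escaped.

```python
def fold_case(text: str) -> str:
    """Lowercase everything except text between double quotes.

    Hypothetical helper for illustration only. Assumes quotes are
    never nested or escaped.
    """
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # A double quote toggles the quoted state and is kept as-is.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes:
            # Inside quotes, case is preserved.
            out.append(ch)
        else:
            # Outside quotes, uppercase is equivalent to lowercase.
            out.append(ch.lower())
    return "".join(out)


if __name__ == "__main__":
    print(fold_case('NAME= "Dylan" END'))  # name= "Dylan" end
```

Under these assumptions, NAME and name outside quotes would be read identically, while "Dylan" and "dylan" would remain distinct.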

Grammar elements can be composed of a number of alternatives. The different possibilities are separated by a | symbol. Thus:

POSITIVE-DIGITS -> 1|2|3|4|5|6|7|8|9
represents any one of the digits 1 to 9, inclusive.

Sometimes, to save typing, we will use the ellipsis, ..., to indicate an obvious range of characters. For example, the previous definition could be written as:

POSITIVE-DIGITS -> 1| ... |9

Grammar elements can be composed of repeated concatenations of characters. The characters to be repeated are enclosed by curly brackets, { and }, and are followed immediately by a descriptor that gives the number of repetitions. Thus:

MANY-X -> {x}*
stands for any number of the letter x concatenated together, including none at all. For example, MANY-X can represent xxxxx. Similarly,
AT-LEAST-ONE-X -> {x}+
can also stand for xxxxx, but it cannot stand for zero x characters. Finally,
TRIPLE-X -> {x}3
stands for three x characters in succession, xxx. Note that a curly bracket which is not matched, or which is not followed by a descriptor, just stands for itself. (Sorry about the confusion; we should really be using a different font for these syntax elements.)
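
As an illustration only, the three descriptors correspond closely to the quantifiers of regular expressions. The Python sketch below is not part of TONTO. The names MANY_X, AT_LEAST_ONE_X, TRIPLE_X and matches are hypothetical, chosen to mirror the grammar elements above.

```python
import re

# {x}*  -> zero or more x characters
MANY_X = re.compile(r"x*")
# {x}+  -> one or more x characters
AT_LEAST_ONE_X = re.compile(r"x+")
# {x}3  -> exactly three x characters
TRIPLE_X = re.compile(r"x{3}")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of text is described by pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(matches(MANY_X, ""))               # True: none at all is allowed
    print(matches(AT_LEAST_ONE_X, ""))       # False: at least one x is needed
    print(matches(AT_LEAST_ONE_X, "xxxxx"))  # True
    print(matches(TRIPLE_X, "xxx"))          # True
    print(matches(TRIPLE_X, "xx"))           # False
```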

Grammar elements can be composed of optional strings of characters. The optional characters are enclosed in square brackets, [ and ]. Thus:

TO-BE -> to be [or not to be]
says that TO-BE stands either for to be, or for to be or not to be.
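
The optional part behaves like the ? quantifier of regular expressions. The following Python sketch is an illustration under stated assumptions, not TONTO code: the name TO_BE is hypothetical, and \s+ is used as a rough stand-in for the blanks between words, so comments are not handled here.

```python
import re

# to be [or not to be]
# The group followed by ? is optional; \s+ approximates the blanks
# between words (an assumption that ignores comment characters).
TO_BE = re.compile(r"to\s+be(?:\s+or\s+not\s+to\s+be)?")

if __name__ == "__main__":
    for text in ["to be", "to be or not to be", "to be or not"]:
        # fullmatch succeeds only if the entire string fits the rule.
        print(repr(text), TO_BE.fullmatch(text) is not None)
```

The first two strings fit the rule. The third does not, because the optional part must appear either in full or not at all.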

Grammar elements can be composed of other grammar elements, like this:

SELF-GREETING -> GREETING MY-NAME
Note that blank spaces are always significant. The blank spaces before GREETING, between GREETING and MY-NAME, and after MY-NAME do not stand for themselves, literally, but instead stand for WHITESPACE. WHITESPACE is any combination of: blank spaces; end-of-line characters; or a comment character (! or #) together with all the characters that follow it, up to and including the end of the line. Thus, SELF-GREETING represents
   hi        dylan
and also
hi 
dylan
and even
hi      ! this is a greeting
dylan   ! this is my name
In the above, the characters following the exclamation mark are treated as WHITESPACE, and hence ignored, because the exclamation mark is a comment character. Since WHITESPACE is quite complicated, but effectively just means a blank character or its equivalents, we represent it just as a blank character, for simplicity. The proper definition of it is:
WHITESPACE      -> {WHITESPACE-CHAR}+
WHITESPACE-CHAR -> BLANK-CHAR | END-OF-LINE-CHAR | COMMENT
COMMENT         -> COMMENT-CHAR {^END-OF-LINE-CHAR}* END-OF-LINE-CHAR
COMMENT-CHAR    -> !|#
In the above, the symbol ^END-OF-LINE-CHAR represents any character which is not the END-OF-LINE-CHAR character. BLANK-CHAR is, of course, the blank character, which we have to represent by BLANK-CHAR, since we have agreed that a literal blank character means WHITESPACE. The default COMMENT-CHAR characters shown above are defined in the macros file in a variable COMMENT-CHARS, and they may be changed when the program is compiled.
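
The WHITESPACE definition can be sketched as a regular expression in Python. This is an illustration, not the TONTO implementation: the names COMMENT_CHARS, COMMENT, WHITESPACE and split_tokens are hypothetical, END-OF-LINE-CHAR is assumed to be the newline character, and BLANK-CHAR is taken to be the space character only.

```python
import re

# Assumed default comment characters, mirroring COMMENT-CHAR -> !|#
COMMENT_CHARS = "!#"

# COMMENT -> COMMENT-CHAR {^END-OF-LINE-CHAR}* END-OF-LINE-CHAR
COMMENT = rf"[{re.escape(COMMENT_CHARS)}][^\n]*\n"

# WHITESPACE -> {BLANK-CHAR | END-OF-LINE-CHAR | COMMENT}+
WHITESPACE = re.compile(rf"(?: |\n|{COMMENT})+")


def split_tokens(text: str) -> list[str]:
    """Split text at WHITESPACE and drop the empty pieces at the ends."""
    return [piece for piece in WHITESPACE.split(text) if piece]


if __name__ == "__main__":
    sample = "hi      ! this is a greeting\ndylan   ! this is my name\n"
    print(split_tokens(sample))                # ['hi', 'dylan']
    print(split_tokens("   hi        dylan"))  # ['hi', 'dylan']
```

As in the grammar rule, a comment in this sketch must end with an end-of-line character, so a comment on the final line of a file with no trailing newline would not be recognised as WHITESPACE.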