The TONTO input file serves a dual purpose: it is both a database-style
specification and a command language interface.
The input file itself is a sequence of characters written according to
certain grammar rules. To explain these grammar rules, a Backus-Naur notation
is first introduced. The specification is then given in this notation.
The ideas for the input file described here are similar to the STAR (Self
defining Text Archive) format already used and standardised in crystallographic
databases.
It is extremely important to have a well defined but flexible way to
write input files, in order to facilitate data deposition and database
construction. Mining of databases for critical information will be a
significant way in which future scientific research will be conducted.
Rules to explain the rules for writing input files
An input file is made up of a sequence of grammar
elements. Grammar elements are represented in uppercase type-font,
possibly followed by a comma separated list of other grammar elements all
surrounded by round brackets, LIKE-THIS, or
THIS(2,3), or EVEN(LIKE,THIS). A grammar
element does not stand for itself, literally, but for a specific sequence of
characters. The exact sequence of characters is given after an arrow symbol.
Thus:
ZERO -> 0
GREETING -> hi
MY-NAME -> dylan
means that the grammar element ZERO stands for the digit 0, while the grammar
elements GREETING and MY-NAME stand for the characters hi and dylan,
respectively. Here, 0, hi and dylan stand literally for themselves, and not
for any other group of characters. Except for the special characters discussed
below, literal text will always be represented in lowercase.
Although it is possible to use uppercase characters in an input file, we
shall not do so here, to avoid confusion with the uppercase grammar element
symbols. Uppercase characters in the input file are equivalent to lowercase
characters, unless they are surrounded by double quote characters.
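The case rule above can be sketched in a few lines of Python. This is only an illustration, not TONTO's own reader: the function name fold_case is hypothetical, and the sketch assumes that double quotes always come in matched pairs with no escape mechanism.

```python
def fold_case(text):
    """Lowercase every character outside double quotes; keep quoted text as-is.

    Assumption: quotes are balanced and cannot be escaped.
    """
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every double quote
            out.append(ch)
        elif in_quotes:
            out.append(ch)             # quoted text is case sensitive
        else:
            out.append(ch.lower())     # unquoted text is case insensitive
    return "".join(out)
```

Under these assumptions, fold_case('GREETING "Dylan"') keeps the quoted Dylan intact while the unquoted word becomes greeting.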
Grammar elements can be composed of a number of alternatives. The
different possibilities are separated by a | symbol. Thus:
POSITIVE-DIGITS -> 1|2|3|4|5|6|7|8|9
represents all the symbols 1 to 9, inclusive.
Sometimes, to save typing, we will use the ellipsis, ..., to indicate an
obvious range of characters. For example, the previous rule might be written
POSITIVE-DIGITS -> 1| ... |9
Grammar elements can be composed of concatenations of characters. The
characters to be concatenated are enclosed by curly brackets,
{ and } and are followed
immediately by a descriptor. Thus:
MANY-X -> {x}*
stands for any number of the letter x concatenated together, including none
at all. For example, MANY-X can represent xxxxx. Similarly
AT-LEAST-ONE-X -> {x}+
can stand for xxxxx, but it does not stand for zero x characters. Finally,
TRIPLE-X -> {x}3
stands for three x characters in succession, xxx. Note that a curly bracket
which is not matched, or which is not followed by a descriptor, just stands
for itself. (Sorry about the confusion; we should really be using a different
font for these syntax elements.)
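For readers who know regular expressions, the repetition descriptors correspond closely to regex quantifiers. The following is a minimal sketch using Python's standard re module. The Python names are illustrative only, and the sketch assumes that the * descriptor (as used in the COMMENT rule) means "zero or more".

```python
import re

# Hypothetical regex equivalents of the three descriptors.
MANY_X = re.compile(r"x*")          # {x}*  zero or more x characters
AT_LEAST_ONE_X = re.compile(r"x+")  # {x}+  one or more x characters
TRIPLE_X = re.compile(r"x{3}")      # {x}3  exactly three x characters

def matches(pattern, text):
    """True if the whole of text is an instance of the grammar element."""
    return pattern.fullmatch(text) is not None
```

With these definitions, matches(MANY_X, "") is true, whereas matches(AT_LEAST_ONE_X, "") is false, and TRIPLE_X accepts only xxx.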
Grammar elements can be composed of optional strings of characters. The
optional characters are enclosed in square brackets, [ and ]. Thus
TO-BE -> to be [or not to be]
says that TO-BE stands either for
to be,
or
to be or not to be.
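One way to read an optional part is that a rule with one pair of square brackets is shorthand for two alternatives. The sketch below lists both readings. The function expand_optional is hypothetical, handles only a single, non-nested pair of brackets, and treats runs of blanks as a single blank for simplicity.

```python
import re

def expand_optional(rule_body):
    """Return the two readings of a rule body with one [optional] part.

    Assumption: at most one pair of square brackets, and no nesting.
    """
    m = re.search(r"\[(.*?)\]", rule_body)
    if m is None:
        return [" ".join(rule_body.split())]
    before = rule_body[:m.start()]
    inner = m.group(1)
    after = rule_body[m.end():]
    without = " ".join((before + after).split())         # optional part omitted
    included = " ".join((before + inner + after).split())  # optional part kept
    return [without, included]
```

Applied to the body of the TO-BE rule, expand_optional("to be [or not to be]") yields the two strings listed above.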
Grammar elements can be composed of other grammar elements, like this:
SELF-GREETING -> GREETING MY-NAME
Note that blank spaces are always significant. The blank spaces before
GREETING, between GREETING and MY-NAME, and after MY-NAME do not stand for
themselves, literally, but instead stand for WHITESPACE. WHITESPACE is any
combination of: blank spaces; end-of-line characters; or comment characters
(! and #) together with all the characters up to and including the end of
line. Thus,
SELF-GREETING represents
and also
and even
hi ! this is a greeting
dylan ! this is my name
In the above, the characters following the exclamation mark are treated as
WHITESPACE, and hence ignored, because the exclamation mark is a comment
character. Since WHITESPACE is quite complicated, but effectively just means
a blank character or its equivalents, we represent it just as a blank
character, for simplicity. The proper definition of it is:
WHITESPACE -> {WHITESPACE-CHAR}+
WHITESPACE-CHAR -> BLANK-CHAR | END-OF-LINE-CHAR | COMMENT
COMMENT -> COMMENT-CHAR {^END-OF-LINE-CHAR}* END-OF-LINE-CHAR
COMMENT-CHAR -> !|#
In the above, the symbol ^END-OF-LINE-CHAR represents any character which is
not the END-OF-LINE-CHAR character. BLANK-CHAR is, of course, the blank
character, which we have to represent by BLANK-CHAR, since we have agreed
that a literal blank character means WHITESPACE. The default COMMENT-CHAR
characters shown above are defined in the macros file in a variable
COMMENT-CHARS, and they may be changed when the program is compiled.
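The WHITESPACE definition can be written as a single regular expression, which is then enough to split an input into its non-blank items. This is a sketch only and not the TONTO reader itself. The names COMMENT_CHARS, WHITESPACE and split_items are illustrative, the end-of-line character is assumed to be a newline, and quoted strings are ignored.

```python
import re

# Assumed default comment characters; TONTO sets these at compile time.
COMMENT_CHARS = "!#"

# WHITESPACE -> {BLANK-CHAR | END-OF-LINE-CHAR | COMMENT}+
# where COMMENT runs from a comment character to the end of the line.
WHITESPACE = re.compile(
    r"(?:[ ]|\n|[%s][^\n]*\n)+" % re.escape(COMMENT_CHARS)
)

def split_items(text):
    """Split text into the items separated by WHITESPACE."""
    if not text.endswith("\n"):
        text += "\n"  # COMMENT requires a terminating end-of-line character
    return [item for item in WHITESPACE.split(text) if item]
```

Under these assumptions, the commented two-line SELF-GREETING example above splits into the same two items, hi and dylan, as the plain text hi dylan.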