The following text from the Kelly/Stone book, beginning on page
16, provides general background about the marker origins and their
use. Further detailed information is supplied in the book. Note that
this marker system was developed to operate under the computer-memory
constraints at the time, limiting searches to a narrow text window
(the sentence) for word meaning. Some of our semantic markers might
be better assigned depending upon the characteristics of a wider
window (such as the whole document), which is now quite feasible.
Indeed, such a wider window disambiguation strategy is used by CL
Research.
Disambiguation rules were created from a KWIC database ("Key word
in context") showing instances of each word surrounded by four or
five words on each side as they occurred in a text corpus. This
corpus was drawn from the texts of various early General Inquirer
projects. The accuracy of the rules (both type one and type two
errors) was assessed by testing the rules on a second corpus drawn
from the same source, and is reported as a "Kappa" score for each word
in the large appendix to the Kelly/Stone book.
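The KWIC construction described above can be sketched in a few lines. The corpus, the window of four words, and all names below are illustrative assumptions, not the original implementation:

```python
# Sketch of a KWIC ("Key Word In Context") index: each occurrence of a
# word is stored with a window of four words of context on either side.
# The corpus and all names here are invented for illustration.

def build_kwic(corpus, window=4):
    """Map each word to a list of (left context, right context) pairs."""
    index = {}
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            index.setdefault(word, []).append((left, right))
    return index

corpus = [
    "the bank raised its interest rate",
    "they sat on the bank of the river",
]
kwic = build_kwic(corpus)
for left, right in kwic["bank"]:
    print(" ".join(left), "[bank]", " ".join(right))
```

A rule writer would scan such listings for each entry, which is why the window size mattered so much under the memory constraints mentioned above.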
The basic structure was set up at the beginning of the
project, essentially by shrewd guesswork. Although there have been
numerous minor revisions, the system as a whole has fortunately
proved rather serviceable, and just manageably complex for our
purpose. We found it to be intractable in the sense that the only
changes we could conveniently make were subdivisions of existing
categories and whole-cloth additions. Rearrangements are hideously
complex, from a purely clerical standpoint, the more so as the
collection of entries grows. Our experience makes it abundantly
clear that sustained growth of this system or any system like it
(e.g., that contemplated by Katz and Fodor) will require careful
attention to mechanization of many of these clerical functions. At
the very least there should be a marker directory showing which
entries have/use which markers. This is one of several points at
which our dictionary threatens to defy control through sheer bulk
-- yet the current system can only be regarded as impoverished
relative to anything resembling realistic dimensions for a general
and powerful language-processing system...
As mentioned earlier, the bulk of our sense distinctions follow
part-of-speech breaks. Such distinctions naturally correspond to
divergent syntactic environments. The set of syntactic markers,
crude though it is, gives us enough power to mark these
differences quite sharply. They implicitly supply rudimentary
constituent analysis, telling us simply where clause boundaries
lie in relation to the keyword. For example, a PRON is a one word
noun phrase; a DET is likely to be the leftmost element of a noun
phrase; and an immediately preceding MOD, TO or NEG probably
implies verb. Such clues are of course not perfectly reliable, but
they are usually reliable enough for our purposes, and where not,
can always be supplemented by further (conditional) analysis.
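The clause-boundary clues above can be illustrated with a toy classifier. The marker names follow the text (PRON, DET, MOD, TO, NEG), but the function and its decision rules are a simplified sketch, not the system's actual logic:

```python
# Toy illustration of the syntactic-marker clues described in the text:
# an immediately preceding MOD, TO, or NEG suggests the keyword is a
# verb; a preceding DET suggests a noun phrase (DET is likely its
# leftmost element). A hypothetical sketch, not the real rule set.

def guess_pos(preceding_marker):
    """Guess the keyword's part of speech from the marker just before it."""
    if preceding_marker in {"MOD", "TO", "NEG"}:
        return "VERB"       # e.g. "can run", "to run", "not run"
    if preceding_marker == "DET":
        return "NOUN"       # e.g. "the run"
    return "UNKNOWN"        # clue unreliable; needs further analysis

print(guess_pos("TO"))   # "to run"
print(guess_pos("DET"))  # "the run"
```

As the text notes, such clues fail occasionally, which is why the real rules could fall back on further conditional tests.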
Within-part-of-speech distinctions of course fall within
syntactically comparable environments, and therefore depend
heavily on the use of semantic markers for disambiguation. This
part of the system is much less satisfactory, reflecting the
general chaos in semantic theory. Many plausible candidate areas
are not represented at all in the system, and some of the marker
categories we do include are broad and ill-defined (particularly
ABS). Nevertheless, these categories have proved useful, when
liberally supplemented with tests for specific words. If we were
to redesign the system, however, this is the part that would merit
the most effort.
This then is the basic weaponry of our system. There are no
hard-and-fast principles governing its deployment in the
construction of rule sets; the process is severely underdetermined
by the data. In a general way we tried to give efficiency its due
-- for example, high frequency rules tend to be ordered first,
other things being equal, and idioms are normally identified from
the leftmost content word -- but the logic of the rules is our
principal concern. There is an interesting parallel with the
problem of curve-fitting. Just as any set of data points can be
fit arbitrarily closely by constructing a polynomial of
appropriate degree, so any set of KWIC tokens can be handled
perfectly by allowing sufficiently cumbersome rule sets, for
example consisting vacuously of one rule for each distinct
environment of the entry-word. Such a solution, however, would
possess little generality; the craft in writing rules is learning
to pitch them at a level which will optimize transfer to new text.
Among our disambiguators there was considerable and stable
individual variation in this "feel" for regularities in word
usage. There are other stylistic differences as well -- for
example, some regularly produced rule sets with great logical
depth, whereas others used branches relatively rarely; a few
people delighted in tests for the absence of items, whereas most
of us bowed to the cognitive psychologists in shunning negative
information; and so on. In general there seemed to be almost as
many ways of attacking an entry as there were disambiguators; and
there is no dependable way of gauging the quality of a
construction in advance of a test based on new data.
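The rule sets described above can be pictured as an ordered list of condition-to-sense mappings, tried in order with the first match winning, and with higher-frequency senses ordered first as the text suggests. The entry "bank", its senses, and its rules below are invented for illustration, not drawn from the actual dictionary:

```python
# Sketch of an ordered rule set for one entry. Rules are tried in order
# and the first matching test assigns the sense; a catch-all default
# (the most frequent sense) comes last. All content here is invented.

RULES_BANK = [
    # (test on the context-window words, sense number)
    (lambda ctx: "river" in ctx or "shore" in ctx, 2),    # bank#2: riverbank
    (lambda ctx: "money" in ctx or "interest" in ctx, 1), # bank#1: financial
    (lambda ctx: True, 1),                                # default sense
]

def assign_sense(context_words, rules):
    """Return the sense number of the first rule whose test matches."""
    for test, sense in rules:
        if test(context_words):
            return sense

print(assign_sense({"of", "the", "river"}, RULES_BANK))  # 2
```

The overfitting danger the text compares to curve-fitting corresponds to making this list one rule per distinct KWIC environment: perfect on the training listings, with little transfer to new text.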
In practice, we worked as follows: A disambiguator (there have
been some 25 in all, of whom about 8 did the bulk of the useful
work) would pick an entry from the master list and check it off as
"in progress." Then, working directly on the corresponding section
of the KWIC, in consultation with a dictionary as described above,
he would write in next to each token of the entry the appropriate
sense number. This allowed the accumulation of sense totals and
set the stage for rule writing. As rules were successively
devised, the corresponding totals were accumulated and the cases
thereby handled stricken from the listings with colored markers,
using different colors for different rules to facilitate possible
recount. The output of this process was a recording sheet in
standard format containing all the basic information about the
entry. These sheets were then reviewed for glaring problems with
senses and/or rules, plus clerical errors. Otherwise we
necessarily relied on the competence of the disambiguator.
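The bookkeeping in this workflow -- sense totals and per-rule coverage counts (the colored-marker strikeouts) -- amounts to a simple tally. A sketch with invented data:

```python
# Sketch of the recording-sheet bookkeeping: each annotated token carries
# a sense number and, once a rule handles it, the rule that covered it
# (the colored-marker strikeout). All data here are invented.

from collections import Counter

# (sense assigned by the disambiguator, rule that handled the token or None)
annotated_tokens = [
    (1, "rule1"), (1, "rule1"), (2, "rule2"), (1, "rule1"), (2, None),
]

sense_totals = Counter(sense for sense, _ in annotated_tokens)
rule_coverage = Counter(rule for _, rule in annotated_tokens if rule)

print(sense_totals)    # sense 1 occurs 3 times, sense 2 twice
print(rule_coverage)   # rule1 covers 3 tokens, rule2 covers 1; one unhandled
```

The different colors per rule served exactly this purpose: making such a recount possible after the fact.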