Course Home Page

Some of the corpora described so far already use SGML, the Standard Generalized Markup Language. Some of their tools are also meant to be SGML-based; for example, the tool supplied for use with the large British National Corpus is called "SARA", which stands for SGML-Aware Retrieval Application. But most of these tools, although SGML-based, still rely on the particular kinds of tags and mark-up that are adopted for particular corpora.
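To make this dependence on particular tags concrete, here is a minimal sketch in Python. It is not how SARA or any other real tool works; the element name `w`, the attribute `pos`, and both sample strings are invented for illustration only. It shows a small retrieval function that finds words by their part-of-speech tag, but only as long as the corpus happens to use the mark-up the function expects.

```python
import re

# Hypothetical word-level mark-up: <w pos="TAG">token</w>.
# The element name "w" and the attribute "pos" are assumptions made for
# this example, not the tag scheme of any particular corpus.
WORD_PATTERN = re.compile(r'<w\s+pos="([^"]+)">([^<]+)</w>')


def extract_tagged_words(text):
    """Return a list of (token, tag) pairs found in the marked-up text."""
    return [(token.strip(), tag) for tag, token in WORD_PATTERN.findall(text)]


def words_with_tag(text, target_tag):
    """Return all tokens that carry the given part-of-speech tag."""
    return [token for token, tag in extract_tagged_words(text) if tag == target_tag]


# A sample sentence in the mark-up the tool expects (invented data).
sample = '<s><w pos="DET">The</w> <w pos="NOUN">corpus</w> <w pos="VERB">grows</w></s>'

# The same kind of information, but encoded with different (also invented) tags.
other_scheme = '<s><tok cat="NOUN">corpus</tok></s>'

nouns_found = words_with_tag(sample, "NOUN")
nouns_missed = words_with_tag(other_scheme, "NOUN")
```

With the first sample the function finds the noun; with the second it finds nothing, although the noun is there, because the tag names differ. This is the sense in which a tool can be SGML-based and still be tied to one corpus.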

Some online examples of SGML-marked-up corpora and tools are:

the IMS (Institut für Maschinelle Sprachverarbeitung, Stuttgart) Corpus Workbench that provides access to the Penn Treebank
the restricted concordancer for the Bank of English: CobuildDirect Corpus Sampler
the restricted Java interface to the Bank of English: Javademo
the restricted telnet interface to the Bank of English (only words containing 'j'!): BoE telnet instructions

The re-usability of tools across a variety of corpora and purposes depends on the emergence of standards. A very long and complicated general coding scheme for all kinds of texts (novels, poems, plays, conversation, etc.) is currently much discussed and increasingly accepted. This coding scheme is the result of the Text Encoding Initiative (TEI). All tags and schemes that are compatible with the Text Encoding Initiative recommendations are called TEI-conformant or TEI-conforming. A quite detailed description, with examples, of a simple form of the TEI guidelines can be looked at here. (Note: this does not make very light reading! It is probably best just to look at the examples at first.)

Some of the better-known 'tag sets' include:

Geoffrey Sampson's book English for the Computer (Oxford University Press) also gives an extensive account of working out a tagset for English, in fact the one used in the SUSANNE corpus.

Computer Tools and Applications - Sommersemester 2000 - Anglistik

Working with generalized mark-up and SGML-aware tools