Computer Tools and Applications - Summer Semester 2000 - English Studies

Step-by-step instructions

Susanne Corpus: downloading and inspection

The Susanne corpus is a marked-up corpus of written English. The complete corpus can be downloaded here. The contents of these files are described by Geoff Sampson, the corpus's designer, as follows:

"The SUSANNE Corpus consists of 64 files (apart from this documentation file), each containing an annotated version of one 2000+ word text from the Brown Corpus. Files average about 83 kilobytes in size, thus the entire Corpus totals about 5.3 megabytes. The file names are those of the respective Brown texts, e.g. A01, N18. Sixteen texts are drawn from each of the following Brown genre categories:

  • A: press reportage
  • G: belles lettres, biography, memoirs
  • J: learned (mainly scientific and technical) writing
  • N: adventure and Western fiction"
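
The naming scheme in the quoted description can be sketched in a few lines of Python. This is only an illustration: the function names `genre_of` and `count_by_genre` and the short sample list are assumptions made for this example, not part of the corpus distribution. The genre table is copied from the list above, and the last lines redo the quoted size estimate (64 files at about 83 kilobytes each).

```python
from collections import Counter

# Genre codes as listed in the quoted documentation.
GENRES = {
    "A": "press reportage",
    "G": "belles lettres, biography, memoirs",
    "J": "learned (mainly scientific and technical) writing",
    "N": "adventure and Western fiction",
}


def genre_of(file_name):
    """Return the genre description for a Brown-style file name such as 'A01'."""
    return GENRES[file_name[0].upper()]


def count_by_genre(file_names):
    """Count how many file names fall under each genre letter."""
    return Counter(name[0].upper() for name in file_names)


# Hypothetical sample of file names; the full corpus has 16 per genre.
sample = ["A01", "A02", "G01", "J03", "N18"]
counts = count_by_genre(sample)
for code in sorted(counts):
    print(code, GENRES[code], counts[code])

# Size estimate from the quotation: 64 files at about 83 kilobytes each.
total_kb = 64 * 83
print(total_kb / 1000, "megabytes (taking 1 MB = 1000 KB)")
```

Run against the real list of 64 file names, `count_by_genre` should report 16 files for each of the four letters if the description above holds.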

An example of one of these files can be viewed here. It shows quite a different kind of 'mark-up': the contents of the corpus are set out in tabular form rather than as markup embedded in the running text. Again, suitable tools are needed to access this kind of information. A more detailed description of the corpus can be downloaded here.
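
As a rough idea of what such a tool might look like, the following Python sketch reads tabular lines and pulls out one column. It rests on assumptions that should be checked against the corpus description: that each line holds six tab-separated fields, and that they can be named as in `Row` below. The two sample lines are invented for illustration and are not taken from the corpus.

```python
from collections import namedtuple

# ASSUMPTION: one word per line, six tab-separated fields.
# The field names are illustrative; check them against the corpus description.
Row = namedtuple("Row", "reference status wordtag word lemma parse")


def parse_line(line):
    """Split one tabular line into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Row._fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(Row._fields), len(fields))
        )
    return Row(*fields)


def words(lines):
    """Return the word column of every non-empty line."""
    return [parse_line(line).word for line in lines if line.strip()]


# Invented sample lines (hypothetical values, not real corpus data).
sample = [
    "X01:0010a\t-\tAT\tThe\tthe\t[S[Ns.",
    "X01:0010b\t-\tNN1\tjury\tjury\t.Ns]",
]

print(words(sample))
```

To try this on a downloaded file, pass the open file object to `words`; lines that do not have six fields raise a `ValueError`, which makes format surprises visible early.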

A description of the tagging used in this corpus and of its development can be found in the library in: Garside/Leech/Sampson (1987), The Computational Analysis of English: a corpus-based approach. Shelf mark: a ang 115r/865.