This mini-corpus is drawn from Geoffrey Sampson's SUSANNE Corpus.
We have four files, each converted from the SUSANNE format into XML. The text extracts are taken from group "A": press reportage. The files in our collection are:
You will need to download all of these files into a single folder to continue with this exercise.
In order to use them as a corpus, we also define an index file (of course in XML!). Here is what one might look like.
We can then start using this collection with queries such as the following: get-index.xsl
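As a rough illustration of the idea of an XML index, the sketch below parses a small index document and lists the files it points to. It is written in Python rather than XSLT purely for illustration. The element and attribute names (`corpus`, `document`, `href`) and the file names (`file1.xml` and so on) are placeholders assumed for this sketch, not the ones used in the exercise files.

```python
import xml.etree.ElementTree as ET

# Hypothetical index document. The element names, the attribute name and
# the file names are assumptions made for this sketch only.
INDEX_XML = """<corpus name="mini-susanne">
  <document href="file1.xml"/>
  <document href="file2.xml"/>
  <document href="file3.xml"/>
  <document href="file4.xml"/>
</corpus>"""


def list_documents(index_xml):
    """Return the href of every <document> element in the index, in order."""
    root = ET.fromstring(index_xml)
    return [doc.get("href") for doc in root.iter("document")]


if __name__ == "__main__":
    for href in list_documents(INDEX_XML):
        print(href)
```

Keeping the list of files in one index document means a query only needs to know about the index, not about each corpus file individually.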
Geoffrey Sampson describes the tags thus:
"The SUSANNE wordtag set is based on the "Lancaster" tagset listed in Garside
et al. (1987: Appendix B); additional grammatical distinctions have been
drawn in this set, and these are indicated by suffixing lower-case letters
to the Lancaster tags. For instance, "revealing" is tagged "VVG" (present
participle of verb) in the Lancaster scheme, but as "VVGt" (present
participle of transitive verb) in the SUSANNE scheme. Apart from the
lower-case extensions, the wordtags are normally identical to the Lancaster
tags: punctuation marks are assigned alphabetical tags beginning Y...
(e.g. YC for comma), and the dollar sign which appears in some Lancaster
tags for genitive words is replaced by G (e.g. GG for the apostrophe-s
suffix), so that the modified Lancaster tags always consist wholly of
alphanumeric characters, beginning with two capital letters. (In a few
cases, tags from the Lancaster set have been merged or eliminated from
the SUSANNE scheme in the light of experience.)

The tag YG appears in the wordtag field to represent a "ghost" -- the logical
position of a constituent which has been shifted elsewhere, or deleted,
in the surface grammatical structure.

The SUSANNE tagset comprises 353 distinct wordtags, not counting tags for
elements of "grammatical idioms"."
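The relationship between the two tagsets described in the quotation can be sketched in a few lines of Python. The pattern below is inferred only from that description (capitals and digits for the Lancaster-style part, beginning with two capital letters, followed by optional lower-case extension letters). It is an assumption that this covers ordinary wordtags, and it deliberately ignores the tags for "grammatical idioms". The function name `split_wordtag` is our own.

```python
import re

# Assumed shape of a SUSANNE wordtag, based on the description quoted above:
# a base starting with two capital letters (capitals and digits only),
# then zero or more lower-case extension letters.
TAG_PATTERN = re.compile(r"^([A-Z]{2}[A-Z0-9]*)([a-z]*)$")


def split_wordtag(tag):
    """Split a wordtag into (base, lower-case extension).

    Returns None when the tag does not fit the assumed shape.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for tag in ["VVGt", "YC", "GG", "YG"]:
        print(tag, split_wordtag(tag))
```

Stripping the extension in this way gives a coarser tag, which can be handy when a query should match, say, every present participle regardless of transitivity.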