Corpus links and tools: annotated corpora

Course Home Page

An increasing number of ready-prepared corpora are available, carrying more or less information in their mark-up in order to make searching and investigation easier. Some of these are very extensive and accordingly quite expensive. They also often use their own systems of mark-up, since they took many years to produce and standards for mark-up are still being developed. Some of the largest corpora available (with pointers to their home pages) are:

the British National Corpus (100 million words)
the International Corpus of English (1 million words in several varieties of English)
the Bank of English (350 million words)
the Susanne Corpus (written text: 130,000 words) and Christine Corpus (spoken)

We will take a closer look at the Susanne Corpus and at the International Corpus of English for Great Britain; for the latter we will again work with a freely available sample version of the corpus and the tool that is provided for it. There are also some things that you can do directly across the web, for example with the restricted concordancer for the Bank of English, the CobuildDirect Corpus Sampler. Here you can already see some of the value of having mark-up, since part-of-speech tags can be used directly in the query; a full list of tags and information about query syntax is given here.
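To make the point about part-of-speech tags in queries concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the word/TAG format, the tag names, the sample sentences and the functions parse and concordance are hypothetical, and it does not reproduce the query syntax of the CobuildDirect Corpus Sampler or the mark-up of any of the corpora listed above. It only shows why tags help: a search for the bare word finds every use, while a search restricted by tag finds just the noun.

```python
# Toy illustration only: the word/TAG format and the tag names below are
# assumptions made for this sketch, not the mark-up of any real corpus.
TAGGED_TEXT = (
    "They/PRON record/VERB the/DET interview/NOUN ./PUNCT "
    "The/DET record/NOUN was/VERB broken/VERB ./PUNCT"
)


def parse(tagged_text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged_text.split()]


def concordance(tokens, word, tag=None, window=2):
    """Return keyword-in-context lines for `word`.

    If `tag` is given, only occurrences carrying that tag are kept.
    """
    lines = []
    for i, (w, t) in enumerate(tokens):
        if w.lower() != word.lower():
            continue
        if tag is not None and t != tag:
            continue
        left = " ".join(x for x, _ in tokens[max(0, i - window):i])
        right = " ".join(x for x, _ in tokens[i + 1:i + 1 + window])
        lines.append(f"{left} [{w}] {right}".strip())
    return lines


tokens = parse(TAGGED_TEXT)
all_hits = concordance(tokens, "record")               # verb and noun uses
noun_hits = concordance(tokens, "record", tag="NOUN")  # noun use only

for line in noun_hits:
    print(line)
```

Without the tags, the verb and the noun "record" could not be told apart by a plain string search; with them, the restriction is a single extra condition in the query.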

More information relevant here may be found on the Linguist List general information page:
Definitive Resource of Organizations, Programs and Centers in Linguistics

Happy hunting!

Step-by-step instructions for a lab session for the Susanne Corpus

Step-by-step instructions for a lab session for the International Corpus of English

Computer Tools and Applications - Sommersemester 2000 - Anglistik

Working with 'corpora' with specialized markup
