The Guardian

The GeM corpus contains an extract of one online edition of the Guardian newspaper, that of 11 April 2000. The online editions of the Guardian are extensive, with substantial cross-linking within the site as well as links leading outside it. These external links lead both to sites of related interest, which offer further background to the contents of articles, and to advertising.

For the corpus we selected one strand running through the edition and extracted the set of interconnected articles related to it. Several sibling articles (siblings in the hypertext structure, not in content) were also left intact. The extracted corpus fragment consists of 198 interlinked files. The 'story' followed is one about Zimbabwe, and it appears mainly in the International and archive sections of the website. The following screendumps show a series of webpages leading from the start page into the set of Zimbabwe-related articles in the International section.
The unanalysed corpus data consists of our extracted set of interlinked webpages. Pages are being analysed from this set. Following the link below takes you into that extracted set of pages. In most cases, following a link that leads outside the corpus will take you to an explicit dead-end page informing you that you have gone beyond the corpus; in some cases, it will still produce a Page Not Found error message. Following links within the newspaper that are obviously to do with Zimbabwe will generally keep you within the corpus. All non-Zimbabwe menu-driven links and all searches will take you outside of the corpus.

The Corpus: Zimbabwe linked web pages

The following link takes you into the original pages of the Guardian edition of 11 April 2000. These pages have not been changed in any way and so may or may not contain links to information that is still being served. Advertisement links, where still functional, will lead to up-to-date advertisements, not to those that were in the original edition. Out-of-site links to images also may or may not work, and relative links in the extracted pages that point to pages outside of the collection will naturally fail. This set of pages is therefore not strictly the set of pages used in the corpus, but it may be used occasionally for checking that links have not gone missing when working with the corpus pages proper.

Original Guardian Webpages as Served for 11/4/2000
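
The link behaviour described above (links that stay within the extracted set versus links that lead beyond it) can be checked mechanically. The following is only a minimal sketch, not part of the GeM tooling: the function name `classify_links`, the placeholder host `corpus.invalid`, and the example file paths are all assumptions made for illustration. It resolves each link on a page against that page's own location and reports whether the target is one of the extracted corpus files.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(html, page_path, corpus_paths):
    """Split a page's links into those inside and outside the corpus.

    page_path and corpus_paths are site-relative paths such as
    "international/story1.html" (hypothetical names, not real corpus files).
    """
    collector = LinkCollector()
    collector.feed(html)
    # Placeholder host (an assumption) so relative links can be resolved.
    base = "http://corpus.invalid/" + page_path
    inside, outside = [], []
    for href in collector.hrefs:
        target = urlparse(urljoin(base, href))
        path = target.path.lstrip("/")
        if target.netloc == "corpus.invalid" and path in corpus_paths:
            inside.append(href)
        else:
            # External sites, advertising, and pages not extracted.
            outside.append(href)
    return inside, outside
```

Run over every page in the extracted set, a check of this kind would list the links expected to end at the "beyond the corpus" dead-end page or at a Page Not Found error.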