Making an XML file...
In this tutorial we will start by creating
a very simple XML file and checking that it is correct. The same steps
apply to all XML files, so this serves as an easy introduction.
XML is best edited with a dedicated XML editor. You will see why very
quickly, because we will start without one: once we move to an XML
editor, things get much simpler!
The task
The exercise that we will carry out is a simple one. Take the
following short text extract (taken from text N01 of the Susanne corpus):
Dan Morgan told himself he would forget Ann Turner.
He was well rid of her. He certainly didn't want a wife who was as fickle
as Ann. If he had married her, he'd have been asking for trouble. But
all of this was rationalization. Sometimes he woke up in the middle
of the night thinking of Ann, and then could not get back to sleep.
His plans and dreams had revolved around her so much and for so long
that now he felt as if he had nothing. The easiest thing would be to
sell out to Al Budd and leave the country, but there was a stubborn
streak in him that wouldn't allow it. The best antidote for the bitterness
and disappointment that poisoned him was hard work. He found that if
he was tired enough at night, he went to sleep simply because he was
too exhausted to stay awake. Each day he found himself thinking less
often of Ann; each day the hurt was a little duller, a little less
poignant. He had plenty of work to do. Because the summer was unusually
dry and hot, the spring produced a smaller stream than in ordinary years.
The grass in the meadows came fast, now that the warm weather was here.
You can save this text to a plain text file here
by right-clicking and selecting a place to keep the file.
We now want to annotate this text so that its basic sentence
structure is marked up. This means that we will be able to find some of the
syntactic structure in the text easily. Later on we will consider how
to annotate particular kinds of semantic and pragmatic information, but
we will start with syntactic structure, as this is fairly simple but also
structured enough to raise some problems.
The task will have been successfully completed when you have annotated the
plain text by turning it into an XML file that conforms to a particular
structure defined for this tutorial.
The required structure
For the purposes of this exercise, we will want a structure as follows.
- The entire text is to be tagged as a text, within <text> and
</text> tags.
- Each sentence is to be tagged as a sentence, within <s> and
</s> tags.
- Each clause is to be tagged as a clause, within <cl> and </cl>
tags.
- Clauses may have other clauses inside them.
- Any clause may have a verbal group inside it, that is, a part of the
clause that is concerned with the activity. The verbal groups are to
be tagged inside <vb> and </vb> tags.
Here is the first sentence of the text as an example:
<s><cl>Dan Morgan <vb>told</vb> himself </cl><cl>he
<vb>would forget</vb> Ann Turner.</cl></s>
There may be places in the text where you are not sure
how to analyse the structure. That matters less for the present
tutorial: just make a sensible decision, and code that.
- Note that verbal groups can only be inside clauses.
- And note that clauses can only be inside sentences or inside other clauses.
- And, finally, that sentences can only be inside a text.
Violating any of these rules means that the analysis of
the text does not follow the structure set out. One of the benefits of
using XML is that we can check for this kind of mistake automatically.
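To give a feel for what such an automatic check involves, here is a minimal
sketch in Python using the standard library's xml.etree.ElementTree. It is
not the checking method used in this tutorial, only an illustration. The
names ALLOWED_PARENTS and find_violations are invented for this sketch, and
the table of allowed parents is one reading of the rules above (a clause
may sit inside a sentence or inside another clause).

```python
import xml.etree.ElementTree as ET

# Allowed parent tags for each element, read off the rules above.
# None stands for "no parent", i.e. the element is the document root.
# This table is an assumption made for the sketch, not an official schema.
ALLOWED_PARENTS = {
    "text": {None},
    "s": {"text"},
    "cl": {"s", "cl"},
    "vb": {"cl"},
}


def find_violations(xml_string):
    """Return one message per element that breaks a nesting rule."""
    # fromstring raises ParseError if the input is not well-formed XML.
    root = ET.fromstring(xml_string)
    problems = []

    def visit(element, parent_tag):
        allowed = ALLOWED_PARENTS.get(element.tag)
        if allowed is None:
            problems.append(f"unknown tag <{element.tag}>")
        elif parent_tag not in allowed:
            where = f"<{parent_tag}>" if parent_tag else "the document root"
            problems.append(f"<{element.tag}> is not allowed inside {where}")
        for child in element:
            visit(child, element.tag)

    visit(root, None)
    return problems


good = ("<text><s><cl>Dan Morgan <vb>told</vb> himself </cl>"
        "<cl>he <vb>would forget</vb> Ann Turner.</cl></s></text>")
# Here the verbal group sits directly inside a sentence, with no clause.
bad = "<text><s>He <vb>was</vb> well rid of her.</s></text>"

print(find_violations(good))  # []
print(find_violations(bad))   # ['<vb> is not allowed inside <s>']
```

The first string follows the rules and produces no messages. The second
places a verbal group directly inside a sentence, which the check reports.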
Method
Take your saved version of the text above and then:
- Make a copy of it in another file.
- Choose your favourite text editor and open the copied file for editing.
- Add the necessary annotations to the text by typing in directly what
is required, just as in the example annotation of the first sentence
above.
- Add the following standard XML declaration as the first line of the
file. It tells XML editors which version of XML and which character
encoding they are dealing with:
<?xml version="1.0" encoding="UTF-8"?>
- Save the edited file, which should now look like an XML file, as plain
text with the extension ".xml".
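The exercise itself is done by hand in a text editor, but it can help to
see the shape of the finished file. The following Python sketch writes the
declaration and the annotated first sentence to a file and then reads the
file back as a quick well-formedness check. The file name "extract.xml" and
the helper save_as_xml are assumptions made for this illustration only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The XML declaration must be the very first thing in the file.
DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'

# First sentence of the extract, annotated as in the example above and
# wrapped in the <text> element that the required structure asks for.
ANNOTATED = (
    "<text>\n"
    "<s><cl>Dan Morgan <vb>told</vb> himself </cl>"
    "<cl>he <vb>would forget</vb> Ann Turner.</cl></s>\n"
    "</text>\n"
)


def save_as_xml(path, body):
    """Write the declaration followed by the annotated text, as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(DECLARATION + body)


# "extract.xml" is a hypothetical file name; a temporary directory is
# used so that the sketch does not overwrite any of your own files.
path = os.path.join(tempfile.mkdtemp(), "extract.xml")
save_as_xml(path, ANNOTATED)

# Parsing the file back is a quick well-formedness check:
# ET.parse raises xml.etree.ElementTree.ParseError on malformed XML.
root = ET.parse(path).getroot()
print(root.tag)                   # text
print(len(root.findall("s")))     # 1
print(len(root.findall("s/cl")))  # 2
```

A file that parses in this way is well-formed, but that alone does not show
that it follows the nesting rules of the required structure.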
Hint
If you really have no idea what your edited file should look like, you
can look at a simple example here.
Next step
Checking what you have created...