A PILOT PROJECT FOR
ICE-MAURITIUS
Dolly Koo Tee Fong
BSc Computing and Management
2004-2005
Koo Tee Fong, Dolly
A PILOT PROJECT FOR ICE-MAURITIUS
SUMMARY
The overall objective of this project was to develop a prototype of the Mauritius component of
the International Corpus of English (ICE) to demonstrate feasibility and potential problems
for a larger-scale follow-up project.
In doing so, a proposal was also drafted in accordance with the EPSRC requirements, with the
possibility of being submitted for funding.
The following was achieved in the project:
• Tools and techniques available for corpus development and processing were
investigated and discussed, along with the main ones used by ICE.
• The Mauritius component of ICE, named ICE-Mauritius, was collected and
compiled up to 5% of the full size of an ICE corpus.
• A full work plan was written for a follow-up project to develop a full-scale ICE-lite
corpus, consisting not only of English from Mauritius but also from 39 other English-speaking countries.
• Finally, the prototype and the work plan were evaluated by three people who are
experienced and involved in corpus collection and funding applications.
ACKNOWLEDGEMENT
I would like to thank my project supervisor and personal tutor, Eric Atwell, for his help and
support throughout this project and also through my whole third year at Leeds University.
I would also like to thank Gerald Nelson and Serge Sharoff for kindly agreeing to take part in
evaluating the project and for their advice.
Finally, I would like to thank my boyfriend, family and flatmates for their input, support and
encouragement throughout the course of the project.
CONTENTS
1. Introduction
1.1 Aim
1.2 Objectives
1.3 Minimum Requirements
1.4 Deliverables
1.5 Initial Project Schedule
2. Survey of computer technologies for corpus development
2.1 Background to the problem
2.1.1 Introduction
2.1.2 What is a Corpus?
2.1.3 Overview of The International Corpus of English (ICE)
2.1.4 Other Corpora
2.1.5 Reasons for Encoding a Corpus
2.1.6 ICE Corpus Design
2.2 Corpus Collection and Encoding
2.2.1 Collecting Data
2.2.2 Computerising Data
2.2.3 ICE Markup System
2.2.4 Corpus Tagging
2.2.5 Syntactic Parsing
2.3 Annotation Tools
2.3.1 The ICE Markup Assistant
2.3.2 The Different Taggers Available
2.3.3 The ICE Tag Selection System
2.3.4 The ICE Syntactic Marking System
2.3.5 Different Varieties of Syntactic Annotation
2.3.6 The ICE Syntactic Tree Annotator
3. Methodology
3.1 Corpus Design
3.1.1 Methods to be Used
3.1.2 Copyright Issues
3.1.3 Corpus Layout
3.2 Capturing Text in Electronic Format
3.2.1 Computerising Speech
3.2.2 Computerising Written Texts
3.3 Corpus Annotation
3.3.1 Structural Mark-up
3.3.2 Procedure for Annotating the Corpus
4. The Pilot Project
4.1 Collection of Texts
4.1.1 Search Methods
4.1.2 Text Collection
4.1.3 Written Text Classification
4.1.4 Permission Letters
4.1.5 Layout of the Pilot Project
4.2 Corpus Annotation
4.2.1 TEI-Header
4.2.2 Texts Encoding
5. The Proposal
5.1 Funding opportunities
5.1.1 Research at University of Leeds, School of Computing
5.1.2 Introduction to the EPSRC
5.1.3 Eligibility of Investigators
5.1.4 Research Opportunities
5.1.5 How to Apply
5.2 Writing up the proposal
5.2.1 Original Idea
5.2.2 Expansion of Corpus Design
5.2.3 Writing Up Proposal
6. Evaluation
6.1 Product
6.2 Minimum Requirements
6.3 Project Stages
6.4 Planning and Schedule
7. Conclusion
References
APPENDIX A: Personal Experience
APPENDIX B: Markup Symbols
APPENDIX C: Corpus Design Layout
APPENDIX D: List of Texts Collected
APPENDIX E: Sample of the Letters of Copyright
APPENDIX F: Template for the Header
APPENDIX G: Example of Raw Text
APPENDIX H: Examples of Encoded Text
APPENDIX I: First Draft of the Case for Support for ICE-lite
APPENDIX J: EPSRC Application Form
APPENDIX K: Revised Case for Support for the ICE-lite Proposal
1. Introduction
1.1 Aim
To develop a prototype of the Mauritius component of the International Corpus of English, to
demonstrate feasibility and potential problems for a larger-scale follow-up project.
1.2 Objectives
The objectives of the project are to:
• Compare and evaluate the different computer technologies available to extend the
International Corpus of English to Mauritius English.
• Investigate data-sources and instigate data-collection for a Mauritius ICE sub-corpus.
• Research on infrastructure and data collection methods.
• Investigate the requirements and feasibility of a larger-scale follow-on project to develop
a full-scale ICE-Mauritius Corpus.
1.3 Minimum Requirements
The minimum requirements are:
• Develop a small-scale prototype of the Mauritian Corpus of English.
• Survey of computer technologies for corpus development and processing.
The possible extensions:
• Work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.
1.4 Deliverables
The project deliverables are:
• The project report
• The prototype of the Mauritian Corpus of English
1.5 Initial Project Schedule
The initial project schedule, Schedule 1 below, does not reflect the actual work done: after
the assessor’s feedback was received in January, it became clear that the project had to take a new
direction, with some changes to the aims and requirements. The new aims and requirements are
given above. This also resulted in a new work plan for the second semester, as shown in Schedule
2 below.
Schedule 1: Schedule before Christmas Break
Dates | Milestones | Tasks
11/10/04 - 22/10/04 | Section on Background | Identify and specify aim and minimum requirements
22/10/04 - 08/11/04 | Research | Research requirements of EPSRC
08/11/04 - 22/11/04 | Research | Research the International Corpus of English
08/11/04 - 22/11/04 | Research | Research on Mauritius English
15/11/04 - 24/01/05 | Appendix I, J & K | Draft Proposal
29/11/04 - 10/12/04 | Mid-project report | Collate mid-project report
10/12/04 - 22/12/04 | n/a | Research and evaluate other EPSRC training courses available
10/12/04 - 22/12/04 | n/a | Research and evaluate different training/education techniques
10/12/04 - 24/01/05 | n/a | Christmas break, revision and end of semester 1 exams
24/01/05 - 31/01/05 | n/a | Decide on delivery mechanism
31/01/05 - 07/02/05 | n/a | Research and decide on what to include in training course
07/02/05 - 21/02/05 | n/a | Write up tutorial
21/02/05 - 07/03/05 | n/a | Work on improving aspects of tutorial and the draft proposal
08/03/05 | n/a | Give training course to research staff and students
08/02/05 - 18/03/05 | n/a | Collect feedback on training course
18/03/05 - 01/04/05 | n/a | Analyse feedback and evaluate training course
01/04/05 - 18/04/05 | Final Report | Complete final report. Most chapters should be already partially written up, but may need reworking.
Schedule 2: New schedule for second semester
Dates | Milestones | Tasks
24/01/05 - 31/01/05 | Section 1 | Decide on new aims & objectives and design new plan
01/02/05 - 07/02/05 | Research (sections 1 & 2) | Research on methods available to extend the ICE corpus to Mauritius
07/02/05 - 09/02/05 | Appendix C and Section 3 | Design layout and text categories of ICE Mauritius
10/02/05 - 17/02/05 | Section 4 and Appendices D & E | Collect sample texts from the Internet & send request for copyright permission
18/02/05 - 25/02/05 | Section 4 and Appendices F, G & H | Annotate corpus
26/02/05 - 28/02/05 | Section 4 & 5 and Appendices I, J & K | Investigate feasibility of ICE-Mauritius
01/03/05 - 18/03/05 | Appendices I, J & K | Draft a proposal for ICE-Mauritius
18/03/05 - 01/04/05 | Section 6 | Evaluate corpus & proposal
01/04/05 - 18/04/05 | Final Report |
2. Survey of computer technologies for corpus development
2.1 Background to the problem
2.1.1 Introduction
‘Leeds has a track record for research on computer analysis of English language texts, also known
as English Corpus Linguistics. For example, the University has developed a Part-of-Speech
analysis system, used on other research projects such as the International Corpus of English (ICE),
which includes research teams in fifteen countries where English is the first language or
second official language. In many of these English-speaking countries, the national ICE
sub-corpus is a recognised resource used in research and teaching’ (Atwell, 2004).
Mauritius is one of the many English-speaking African countries, but there is no Mauritian
sub-corpus in ICE. However, the government has started a Cyber City project, the first of a new
generation of IT parks in this part of the world. ‘The construction of Ebène
CyberCity is a historical milestone towards achieving the Government’s objective of transforming
Mauritius into a diversified, high-tech, high income services and knowledge economy’ (BPML,
2004). With this growing IT infrastructure, it may therefore be feasible to collect at least some
samples of Mauritian English remotely, via the World Wide Web.
Mauritius has been chosen because of the special characteristics of the English used there.
English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810,
and is the official language of the country (Republic of Mauritius, 2004). However, at that time,
slaves were imported from Africa and Madagascar, a large number of labourers from India were
brought to work in the sugar cane fields and a small number of Chinese came to trade, and the
influence of the French, who were the rulers before the British, was still very strong. The different
languages brought by the different settlers have therefore significantly influenced the official
English language of the country, and the mixture of those languages has also resulted in a new
language, Creole, which is nowadays spoken by everyone. Although Creole is the most widely spoken
language in Mauritius, all official communications and teaching in schools are done in English.
However, the English used has particular characteristics. For instance, in many of the official
communications or press reports, some French and Creole words are used to emphasise a specific
theme. In school textbooks, other dialect words are bound to appear, such as names of individuals
or companies written in Chinese or Hindi. Another characteristic worth noting is that people in
Mauritius tend to think in their native language and then translate what they want to say into
English, keeping the same structure and grammar as in their native language. This results in
great variation in English among the cultures in the same country and also in young people
learning English differently. Developing a corpus for Mauritius will therefore allow an interesting
and useful analysis and understanding of the different types of English used and will hopefully
help the government, teaching organisations and the people in general to understand and build up
a common standard of English.
The pilot project has hence investigated what can be done to instigate and start data collection for
a Mauritian ICE sub-corpus. A proposal for the infrastructure and data collection methods has
also been made in accordance with the EPSRC requirements.
2.1.2 What is a Corpus?
The term ‘corpus’ was traditionally used to designate a body of naturally-occurring or authentic
language data, usually consisting of written or spoken texts, or samples of spoken and
written language, in a particular language or language variety. The corpus could be used as a basis
for linguistic research. In the last thirty-five years, the term ‘corpus’ has come to describe
the electronic form of a set of language material, which may be processed by computer for
various purposes such as language research and engineering. This includes the study of all aspects
of language such as syntax, semantics, pragmatics, speech and, more recently, lexicography.
Due to the small amount of data available in the past, many earlier theories and interpretations
explaining linguistic phenomena, although accurate, were too narrow to be applied to the whole
set of languages. In addition, the focus was more on language structure than on the use of the
language (Leech, 1997a; Al-Sulaiti, 2004).
Due to the recent explosion in technology, corpora have increased dramatically in size, variety
and ease of access. The combined use of corpora and computers to study languages has changed
the way linguistic phenomena are analysed. Nowadays, corpus linguists analyse the use of
structures and investigate factors that affect our choice of a particular structure. For instance,
the factors may be related to the nature of the writing or speech, such as science or literature, or
the aim may be to discover typical linguistic patterns in some defined contexts. With computers
able to store huge amounts of data, this new view of language analysis becomes more
accessible and hence, the computer corpus is fast becoming a universal resource for language
research.
2.1.3 Overview of The International Corpus of English (ICE)
ICE was initiated in 1988 by the late Sydney Greenbaum, the then Director of the Survey of
English Usage, UCL. From 1996 to 2001, it was coordinated by Charles Meyer, University of
Massachusetts-Boston and it is now coordinated by Gerald Nelson, who recently returned to UCL
from the University of Hong Kong (Nelson et al., 2002). The primary aim of ICE is to collect
material for comparative studies of English worldwide, and its long-term aim is to produce up to
twenty one-million-word corpora. Around the world there are fifteen research teams, shown in
Table 1 below, who are preparing electronic corpora of their own national or regional variety of
English (UCL, 2002). Five other ICE projects for Cameroon, Fiji, Ghana, Nigeria and Sierra
Leone have been considered but no text has been collected yet.
Table 1: Components of the ICE project
Australia | Canada | East Africa
Great Britain | Hong Kong | India
Ireland | Jamaica | Malaysia
New Zealand | Philippines | Singapore
South Africa | Sri Lanka | USA
Each ICE corpus consists of one million written and spoken words of English. Each team
follows the same corpus design closely, as well as the same scheme for grammatical
annotation, to ensure compatibility between the individual corpora in ICE (UCL, 2002). The
texts in the corpus date from 1990 or later. The authors and speakers of the texts are aged 18 or
over and were either born in the country in whose corpus they are included or moved there at an
early age, and were educated through the medium of English in that country (UCL, 2002). The
corpora in ICE are being annotated at various levels to enhance their value in linguistic research.
These levels are Textual Mark-up, Word class Tagging and Syntactic Parsing.
Despite attempts to achieve conformity, complete identity between corpora is not possible. There
are inevitable differences in the samples taken for speech and writing and in some countries
certain categories are difficult to obtain. Information about each author or speaker may also be
unavailable and the projects have different start dates which result in discrepancies in the date of
the texts. However, for global comparisons, the corpora are similar enough to justify any analysis
carried out. (Greenbaum, 1996)
2.1.4 Other corpora
The Brown Corpus of Standard American English (Kučera and Francis, 1967) is the first modern
electronically readable corpus to be developed. The corpus consists of one million words of
American English texts printed in 1961 and the texts are sampled in different proportions from 15
different text categories, some of which are press, skills and hobbies, religion and fiction.
Compared with the various corpora available today, the Brown Corpus of Standard American
English is considered to be small. However, it is still used in teaching and as a model for the
development of other corpora.
The British National Corpus (BNC) is a much larger corpus, completed in 1994 (Leech,
1997a). The corpus consists of 100 million words and it contains both written and spoken
material. In addition to the British and American English corpora, there are other varieties of
English corpora such as the Australian Corpus of English (ACE), the Finland Corpus of Early
English Correspondence Sampler and others (Breyer, 2005). Many other corpora have also been
developed for different languages, such as the Czech National Corpus, CORIS (an Italian Corpus)
and the French Corpus. These corpora are for general use in linguistic research. There exist
other corpora which are more specialised, such as the Air Traffic Control (ATC) corpus and the
Trains Spoken Dialogue Corpus (Al-Sulaiti, 2004).
Modern-day corpora are of various types and the difference in their composition depends on their
use. Balanced corpora like the Brown Corpus, which includes different types of written English,
are more valued by individuals who are interested in linguistic description and analysis. In other
corpora, size may be more important than balance. One such example is the Penn Treebank. In
this case, linguists are more interested in the computational aspects of the corpus, involving
research in natural language processing. For instance, these types of corpora have been used in
the development of taggers and parsers. (Meyer, 2002)
It is useful to note that part of the British component of ICE and the BNC were funded by
the Engineering and Physical Sciences Research Council (EPSRC) in the UK, and the Brown
Corpus was funded by the equivalent of the EPSRC in America.
2.1.5 Reasons for encoding a Corpus
Two types of corpus can be identified, whether written or spoken: the raw corpus and the annotated
(or marked-up) corpus. The former is mainly the natural text itself with no additional
information; in the latter the text is “enriched with a variety of information” (Al-Sulaiti, 2004).
Although raw corpora can be used with the help of tools to carry out linguistic analysis,
annotated corpora allow richer analysis.
Leech (1997a) has identified the following advantages of annotated corpora:
• Extracting information: a piece of language can have various meanings and uses in its
orthographic form, for instance the word ‘left’ can be a noun, an adjective or a verb.
Therefore, extracting information becomes easier and more efficient if the corpus is
grammatically tagged since each occurrence of ‘left’ will be accompanied by a label
indicating its type.
• Re-usability: once the corpus has been annotated, it can be handed on to other users; this is
a valuable advantage since corpus annotation is usually an expensive and time-consuming
process.
• Multi-functionality: annotation adds overt linguistic information to a corpus and this makes it
useful for a multitude of purposes.
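The first advantage can be illustrated with a minimal sketch. It assumes a simple word/TAG token format and made-up tag names (VERB, ADJ, NOUN and so on); neither is the actual ICE tagset or file format, and the function name `occurrences` is hypothetical.

```python
# Hypothetical tagged sentence: each token is written as word/TAG.
# The tag names are illustrative only, not the ICE tagset.
tagged = "he/PRON left/VERB by/PREP the/DET left/ADJ door/NOUN on/PREP my/DET left/NOUN"


def occurrences(tagged_text, word, tag=None):
    """Return (position, word, tag) for each token matching the word and, optionally, the tag."""
    hits = []
    for position, token in enumerate(tagged_text.split()):
        form, _, label = token.rpartition("/")
        if form.lower() == word and (tag is None or label == tag):
            hits.append((position, form, label))
    return hits


all_left = occurrences(tagged, "left")               # every use of 'left'
verbs_only = occurrences(tagged, "left", tag="VERB")  # only the verbal uses
print(len(all_left), len(verbs_only))
```

In a raw corpus only the first kind of search is possible; the tag label is what allows the verbal uses of ‘left’ to be separated from the adjectival and nominal ones.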
2.1.6 ICE Corpus Design
Length of Corpus
Meyer (2002) has stated that the first questions to ask when designing a corpus are:
1. “What will be the overall length of the corpus?” The longer the corpus, the better, but it
has to be feasible.
2. “How long the corpus needs to be to permit the kinds of studies one envisions for it?”
The standard requirement of ICE is that the core corpus should contain a total of one million
written and spoken words of English. However, some regions might want to collect more material
in certain text categories or to include additional categories, depending on their needs
(Greenbaum, 1991b).
Type of genres
Again, Meyer (2002) has raised an important question: “Why these genres and not others?” To
answer this question, we have to consider the different types of corpora that have been created and
the purpose of each one. As mentioned above, some corpora are multi-purpose, namely the BNC
and the ICE Corpus, which means that they are intended to be used for a variety of different
purposes and therefore these corpora need to contain a broad range of genres. However, the
multi-purpose corpora do not always cover a full representation of all genres. Therefore, special
genres need to be collected for special-purpose corpora such as the Michigan Corpus of Academic
Spoken English (MICASE), which is used to study the type of speech used by individuals
conversing in an academic setting (Meyer, 2002).
The ICE corpus is usually divided in a 60:40 ratio between spoken and written English.
Within both parts a distinction is made between private (conversation or letter) and
public (news report or lecture). Both the private and public sections can be further divided into
monologue and dialogue for speech, and scripted, non-printed and printed for written texts
(Greenbaum, 1991b). Below are the typical ICE Text Categories, taken from the ICE website.
Table 2: ICE Text Categories
Numbers in brackets indicate the number of 2,000-word texts in each category.
Spoken (300)
  Dialogues (180)
    Private (100): Conversations (90); Phone calls (10)
    Public (80): Class Lessons (20); Broadcast Discussions (20); Broadcast Interviews (10); Parliamentary Debates (10); Cross-examinations (10); Business Transactions (10)
  Monologues (120)
    Unscripted (70): Commentaries (20); Unscripted Speeches (30); Demonstrations (10); Legal Presentations (10)
    Scripted (50): Broadcast News (20); Broadcast Talks (20); Non-broadcast Talks (10)
Written (200)
  Non-printed (50)
    Student Writing (20): Student Essays (10); Exam Scripts (10)
    Letters (30): Social Letters (15); Business Letters (15)
  Printed (150)
    Academic (40): Humanities (10); Social Sciences (10); Natural Sciences (10); Technology (10)
    Popular (40): Humanities (10); Social Sciences (10); Natural Sciences (10); Technology (10)
    Reportage (20): Press reports (20)
    Instructional (20): Administrative Writing (10); Skills/hobbies (10)
    Persuasive (10): Editorials (10)
    Creative (20): Novels (20)
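The figures in Table 2 can be cross-checked mechanically. The sketch below encodes the category counts as a nested dictionary and sums them; the grouping labels and the names `ICE_DESIGN` and `count_texts` are choices made for this illustration, and the 2,000-word figure is taken from the note under the table.

```python
WORDS_PER_TEXT = 2000  # from the note under Table 2

# Number of texts per category, copied from Table 2.
ICE_DESIGN = {
    "spoken": {
        "private dialogues": {"Conversations": 90, "Phone calls": 10},
        "public dialogues": {
            "Class Lessons": 20, "Broadcast Discussions": 20,
            "Broadcast Interviews": 10, "Parliamentary Debates": 10,
            "Cross-examinations": 10, "Business Transactions": 10,
        },
        "unscripted monologues": {
            "Commentaries": 20, "Unscripted Speeches": 30,
            "Demonstrations": 10, "Legal Presentations": 10,
        },
        "scripted monologues": {
            "Broadcast News": 20, "Broadcast Talks": 20, "Non-broadcast Talks": 10,
        },
    },
    "written": {
        "student writing": {"Student Essays": 10, "Exam Scripts": 10},
        "letters": {"Social Letters": 15, "Business Letters": 15},
        "academic": {"Humanities": 10, "Social Sciences": 10,
                     "Natural Sciences": 10, "Technology": 10},
        "popular": {"Humanities": 10, "Social Sciences": 10,
                    "Natural Sciences": 10, "Technology": 10},
        "reportage": {"Press reports": 20},
        "instructional": {"Administrative Writing": 10, "Skills/hobbies": 10},
        "persuasive": {"Editorials": 10},
        "creative": {"Novels": 20},
    },
}


def count_texts(node):
    """Sum the text counts in a (possibly nested) category dictionary."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


spoken_texts = count_texts(ICE_DESIGN["spoken"])
written_texts = count_texts(ICE_DESIGN["written"])
total_texts = spoken_texts + written_texts
total_words = total_texts * WORDS_PER_TEXT
spoken_share = spoken_texts / total_texts
print(spoken_texts, written_texts, total_texts, total_words, spoken_share)
```

The sums reproduce the 300 spoken and 200 written texts of the table, which together give the one-million-word target and the 60:40 split described above.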
Length of individual text samples
Each text in the corpus contains about 2,000 words (UCL, 2002), following the sample-size norms
of the pioneering Brown and LOB corpora. Therefore, there are 500 texts in each regional corpus,
with 10 texts (20,000 words) as the minimum for each text category. Since most corpora contain
relatively short samples of text, text fragments instead of complete texts tend to be stored.
Ideally, it would be better to include complete texts in the corpora, but the length of the texts is
one of the main reasons why this is not possible. For instance, a whole book is too long and could
take up much of the corpus if used in full. If only part of a text is used, the 2,000-word sample
can be chosen from any part of the text. In existing ICE corpora, many samples also consist of
composite texts, that is, a series of complete short texts that total 2,000 words in length (UCL,
2002; Meyer, 2002). These often include personal letters, which are usually shorter than 2,000 words.
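The two ways of arriving at a sample, an extract from a long text and a composite of complete short texts, can be sketched as follows. The function names are hypothetical, words are counted by simple whitespace splitting, and the composite builder keeps whole texts and may therefore overshoot the target slightly; none of this is taken from the ICE manuals.

```python
def extract_sample(text, size=2000, start=0):
    """Take `size` consecutive words from a long text, beginning at word `start`."""
    words = text.split()
    return words[start:start + size]


def build_composite(short_texts, size=2000):
    """Add complete short texts until the running total reaches `size` words."""
    chosen, total = [], 0
    for text in short_texts:
        if total >= size:
            break
        chosen.append(text)
        total += len(text.split())
    return chosen, total


# Placeholder data: a 5,000-word "book" and five 600-word "letters".
book = "word " * 5000
letters = ["w " * 600 for _ in range(5)]

sample = extract_sample(book, start=1500)    # an extract from the middle of the book
chosen, total = build_composite(letters)     # a composite of whole letters
print(len(sample), len(chosen), total)
```

Whether an over-long composite should be trimmed back to 2,000 words or left slightly long is a design decision that this sketch leaves open.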
Range of speakers and writers
Meyer (2002) has pointed out that it is “not simply whether one obtains texts from native or
non-native speakers but rather that the texts selected for inclusion are obtained from individuals who
accurately reflect actual users of the particular language variety that will make up the corpus”.
Greenbaum (1991b) has also emphasised that it is the people, not the language, that have to be
selected, and that their language should not be excluded on subjective criteria of correctness, adequacy
or appropriateness. Therefore, since the ICE project is restricted to educated English, the only
criterion for selecting the population should be “adults of eighteen or over who have received
formal education through the medium of English to at least the completion of secondary school”
(UCL, 2002).
The selection of texts should not be random; population differences and textual
differences should be taken into account. Some relevant variables to consider are age, gender,
level of education, dialect variation (e.g. urban or rural locations), ethnic group, region,
occupation and status in occupation, social contexts and social relationships.
2.2 Corpus Collection and Encoding
2.2.1 Collecting Data
Spoken Texts
Collecting spoken texts, especially spontaneous speech, is the most difficult and frustrating task in
the development of the corpus (Sharoff, 2005; Nelson, 1996a). The cooperation of the speakers is
required, but often there is the problem of the “observer’s paradox”: people tend to
behave differently when being observed (or recorded) and therefore the way they speak may
change. According to Meyer (2002), one way around this problem is to record a longer stretch of
speech and then choose the most natural part. However, speech collection is already time-consuming,
and recording a longer stretch just to obtain a natural part may be too costly. To record the speech,
either an analogue or a digital recorder can be used, since both yield satisfactory results. Meyer
(2002) has nevertheless recommended using a digital recorder, since it is easier to transfer
recordings to the computer for manipulation and longer stretches of speech can be recorded.
To improve the quality of recordings, the type of microphone used is also an important
consideration. For the other spoken categories, such as broadcast speech, it is best to record
directly from radio or television.
Written Texts
According to Nelson (1996a), written texts are easier to obtain than spoken texts,
but in Sharoff’s (2005) experience, extra effort is needed to obtain private
texts, such as personal letters. With the Internet, a wide range of texts is easily and freely
available today. Although using electronic texts saves us the time and effort of computerising
printed texts, some important questions raised by Meyer (2002) also need to be considered: “Are
electronic texts essentially the same as traditionally published written texts?”, “Is an article from a
personal webpage any different from one that has gone through the editorial process?”
Copyright Issues
Collecting the texts is one complex task but without copyright permission the texts cannot be used
in the corpus, especially if the corpus is going to be made accessible to anyone and is going to be
used internationally for research purposes. Based on the experience of the other ICE teams,
Nelson (1996a) has found that owners of texts are usually willing to help. The only frustration is
getting the permission within a short time period. Having other priorities, owners usually take a
long time to reply, and some may not reply at all. Nelson (1996a) has also discovered that
due to major confidentiality issues, it is more difficult to obtain permission for texts in the
commercial sector.
2.2.2 Computerising Data
Computerising Written Data
Nowadays, most texts are readily available in electronic form. Texts downloaded from the
Internet, however, contain a significant amount of HTML code. Meyer (2002) has suggested using
software such as “HTMASC” (http://www.bitenbyte.com) to automatically strip the HTML
coding from the texts to produce an ASCII text file with no coding. If it is not possible to obtain
the texts in electronic form, a printed copy of the text can be converted with an optical scanner.
Scanners exist in two types: form-feed and flatbed. Meyer (2002) has encouraged the use
of flatbed scanners, since experience with ICE-USA has shown that they are slightly more
accurate.
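The kind of HTML stripping described above can be illustrated with a short sketch. It uses Python's standard `html.parser` module as a stand-in and is not the HTMASC tool itself; the class name `TextOnly`, the function `strip_html` and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML page with whitespace normalised."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join with spaces so that text from neighbouring block elements does not run together.
    return " ".join(" ".join(parser.parts).split())


# Made-up page used only to exercise the function.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Budget Speech</h1>"
    "<p>The <b>Minister</b> spoke &amp; left.</p></body></html>"
)
plain = strip_html(page)
print(plain)
```

Joining the fragments with spaces is a simplification: a tag in the middle of a word would introduce a spurious space, so real pages would need checking by hand after stripping.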
Transcription of Spoken Texts
After the spoken texts have been collected in digital form and copyright permissions obtained, the
texts need to be written down. This process is known as transcription and, more precisely, as
defined by Edwards (1995), “it involves capturing who said what, in what manner, to whom and
under what circumstances”. Software programs such as “Voice Walker 2.0”, used within the Santa
Barbara Corpus of Spoken American English, and “Sound-Scriber”, used in the Michigan Corpus
of Academic Spoken English (MICASE), have been designed specifically to help in the
transcription of digitised speech. These programs (Meyer, 2002) can be downloaded freely from
http://www.linguistics.ucsb.edu/resources/computing/download/download.htm and
http://www.lsa.umich.edu/eli/micase/soundscriber.html respectively.
2.2.3 ICE Mark-up System
Mark-up is the first stage in the annotation process of ICE corpora. Nelson (1996b) has described
two distinct types of mark-up: textual mark-up, which is added to the texts themselves, and
bibliographical and biographical mark-up, which is stored externally in the form of a file header
for each text. There are two manuals, one for spoken and one for written texts (Nelson, 1991a,
1991b), which describe the textual mark-up system, and a third is available for encoding
bibliographical and biographical information (Nelson, 1991c). In written texts, mark-up symbols
are used to encode typographic features such as boldface, italics and underlining, and structural
features such as sentence boundaries, paragraph boundaries and headings. In spoken texts,
mark-up is needed to indicate sentence boundaries, speaker turns, overlapping strings and pauses
(Nelson et al., 2002).
More recently, with the increasing use of electronic documents, a standard for the markup of these
types of documents has been developed. This standard, known as Standard Generalized Markup
Language (SGML), offers the advantage of computer independence, that is, the corpus can be
transferred from computer to computer while keeping its original description. However, although
it is a flexible language, problems such as the lack of general style sheets do arise when transferring
the text over the Web. Because of these problems, interest has shifted to a newly emerging
mark-up system, the Extensible Markup Language (XML), which has been designed mainly for
use in web documents (Meyer, 2002).
In the ICE components, all mark-up symbols are characterised by angled brackets, appearing with
an opening symbol <symbol> and a closing symbol </symbol>. Some examples from Nelson’s
(1991a, 1991b) manuals are given below.
Written Text:
Boldface <bold> </bold>
Example: Readers must return all books to the library
Markup: Readers <bold>must</bold> return all books to the library
Italics <it> </it>
Example: You must attend every day during term
Markup: You must attend <it>every</it> day during term
Typeface <typeface> </typeface>
Example: Warhol is alive and well
Markup: Warhol is <typeface: courier>alive</typeface: courier> and well
Spoken Text:
Overlapping speech <[> </[> and <{> </{>
Example: $A's utterance "Nothing stands out" overlaps completely
with $B's "Yeah I suppose".
Markup: <$A> <#><{><[>Nothing stands out</[>
<$B> <#><[>Yeah I suppose</[></{>
Anthropophonics (non-verbal sounds)
Examples: <O>cough</O> <O>sneeze</O> <O>laugh</O>
Mark-up can be done manually but to speed up the process, Nelson (1996b) has proposed to
partially automate it with the use of the Mark-up Assistant program, a set of WordPerfect macros
that assigns whole mark-up symbols to single keys. The minimum set of ICE mark-up symbols
which has been used is given in Appendix B.
Bibliographical and biographical data
The description of each text is represented as bibliographical and biographical information and the
mark-up is stored separately in a header file. The description includes ‘category’, ‘date’ and
‘publisher’, among others, and the data is enclosed within opening and closing symbols as in the
textual mark-up, for example, <date> 1996 </date>. A common standard used in many corpora is
the Text Encoding Initiative (TEI) (Al-Sulaiti, 2004), which has been working to incorporate
XML within its standard and whose header comprises four main components:
•
File Description <fileDesc>: includes bibliographic information about the text. Below is an
example from the TEI website:
<fileDesc>
<titleStmt>
<title> Thomas Paine: Common sense, a machine-readable
transcript </title>
<respStmt><resp> compiled by </resp>
<name> Jon K Adams </name>
</respStmt>
</titleStmt>
<publicationStmt>
<distributor> Oxford Text Archive </distributor>
</publicationStmt>
<sourceDesc>
<bibl> The complete writings of Thomas Paine, collected and edited
by Phillip S. Foner (New York, Citadel Press, 1945)</bibl>
</sourceDesc>
</fileDesc>
•
Encoding Description <encodingDesc>: states the relationship between the text and its
source. The simplest example from Baker et al. (2003) is shown below:
<encodingDesc>
<projectDesc>Text collected for use in EMILLE project</projectDesc>
<sampleDesc>simple written text only has been transcribed. Diagrams,
pictures and tables have been omitted and their place marked
with a gap element </sampleDesc>
</encodingDesc>
•
Profile Description <profileDesc>: supplies non-bibliographic information about the text
and the participants. The profile description can be divided into two parts: the text description
and the person description. Again an example from the TEI website is given below:
<profileDesc>
<textDesc n='novel'>
<channel mode=w>print; part issues</channel>
<constitution type=single>
<derivation type=original>
<domain type=art>
<factuality type=fiction>
<interaction type=none>
<preparedness type=prepared>
<purpose type=entertain degree=high>
<purpose type=inform degree=medium>
</textDesc>
<person id=P1 sex=F age='mid'>
<birth date='1950-01-12'>
<date>12 Jan 1950</date>
<name type=place>Shropshire, UK</name>
</birth>
<firstLang>English</firstLang>
<langKnown>French</langKnown>
<residence>Long term resident of Hull</residence>
<education>University postgraduate</education>
<occupation>Unknown</occupation>
<socecstatus source=PEP code=B2>
</person>
</profileDesc>
•
Revision Description <revisionDesc>: gives a summary of the history of the text and
provides a detailed change log in which each change made to a text may be recorded. Some
examples of changes made to a text, adapted from the TEI website, are shown below:
<revisionDesc>
<change><date>1996-01-22 <name>CM SMcQ<what>finished proofreading</change>
<change><date>1995-10-30 <name>L.B. <what>finished proofreading</change>
<change><date>1995-07-20 <name>R.G. <what>finished proofreading</change>
<change><date>1995-07-04 <name>R.G. <what>finished data entry</change>
<change><date>1995-01-15 <name>R.G. <what>began data entry</change>
</revisionDesc>
For the ICE corpora, Nelson (1996b) has re-classified the above information in the header file
into four different levels, but with very similar attributes:
•
Text Description: specifies the text category and subcategories so that it can be located in the
hierarchy of the corpus
•
Text Source: records bibliographical data about the sources of texts in the corpus, such as
source title, publisher, date and place of publication. Copyright statements are also included
in this level.
•
Text Internals: contains information about the specific extract used in the corpus, for example,
title of article, page numbers, relationship between speakers.
•
Biographical Information: includes details, such as sex, age, nationality, of each author and
speaker in the corpus.
2.2.4
Corpus Tagging
During this stage, each lexical item is usually assigned a part-of-speech label or tag, for example
‘N’ for noun. In addition, most tags contain additional information, which appears in brackets.
Together they form the tagset of each item (Nelson et al., 2002). Leech’s (1997b) principles for
creating tagsets are adopted for the ICE components, that is, the tagsets should satisfy the three
criteria mentioned below:
•
Conciseness: labels should be brief
•
Perspicuity: labels should be user-friendly and easy to read and remember
•
Analysability: labels should be decomposable into their logical parts, for instance, ‘noun’
can occur above more specific tags such as ‘singular’ or ‘present tense’
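The analysability criterion can be illustrated with a minimal sketch. It assumes tags written in the ICE style, that is, a word-class label followed by optional features in brackets, such as N(com,sing). The function name parse_tag is hypothetical and is not part of any ICE tool.

```python
import re

# Hypothetical helper: an ICE-style tag is a word-class label in capitals,
# optionally followed by a comma-separated feature list in round brackets.
TAG_PATTERN = re.compile(r"^([A-Z]+)(?:\(([^)]*)\))?$")


def parse_tag(tag):
    """Split a tag such as 'N(com,sing)' into ('N', ['com', 'sing'])."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"not an ICE-style tag: {tag!r}")
    word_class, features = match.groups()
    # An absent or empty bracket yields an empty feature list.
    feature_list = [f for f in (features or "").split(",") if f]
    return word_class, feature_list


if __name__ == "__main__":
    for tag in ["N(com,sing)", "V(cop,pres)", "ART(def)"]:
        print(tag, "->", parse_tag(tag))
```

Because the label decomposes into a general class and specific features, a search for all nouns only needs to compare the first element of the result.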
Over the years, a number of tagging programs have been developed to insert a variety of
different tagsets, and most taggers are highly accurate, with success rates of more than 95%. The
different tagging programs available are discussed later in the report. In the ICE corpora, the texts
are automatically tagged using the TOSCA Tagger, developed by the TOSCA Research Group at
the University of Nijmegen (UCL, 2004). An example of a grammatically tagged sentence from
the ICE website is shown below:
Each            PRON(univ,sing)
of              PREP(ge)
these           PRON(dem,plu)
is              V(cop,pres)
the             ART(def)
responsibility  N(com,sing)
of              PREP(ge)
one             NUM(card,sing)
person          N(com,sing)
2.2.5
Syntactic Parsing
For the ICE components, the tagged corpus from the previous stage forms the input to the parsing
stage. However, before the tagged corpus is automatically parsed, it first needs to be pre-edited.
The pre-editing stage, also known as syntactic marking, involves manually marking several
high-frequency constructions in order to reduce the ambiguity of the input, and hence the
number of decisions that the automatic parser will have to make (Nelson et al., 2002). Following
syntactic marking, the corpus is submitted to the automatic parser, developed by the TOSCA
Research Group at the University of Nijmegen, for syntactic analysis. Every sentence in the
corpus is analysed at phrase, clause and sentence level and the analysis is shown in the form of a
parse tree as shown in Figure 1 (UCL, 2004).
Figure 1: example of ICE Syntactic Parsing
The parse tree is then analysed with the ICETree 2, which is a “dedicated syntactic tree editor”
especially designed for the ICE corpora by the Survey of English Usage. The ICETree can also
be used with other corpora, but some modifications to the data files will be required first.
Unlike grammatical tagging, “syntactic annotation tends to lack a sense of standard practice” and
parsing software has much lower accuracy rates (70-80 percent at best) and requires human
intervention at varying levels (Leech and Eyes, 1997). Syntactic parsing is seen as the most
difficult and time-consuming stage in the development of a corpus (Nelson et al., 2002).
2.3
Annotation Tools
2.3.1
The ICE Markup Assistant
The ICE Mark-up Assistant reduces the time taken for the insertion of markup symbols by
automating and simplifying key presses. Generally, it can save up to tens of minutes per text.
The program consists of a set of WordPerfect macros, which allow the text unit
markup to be inserted automatically at probable sentence boundaries, for example, wherever a full stop
is followed by a space. Most markup types require an opening and a closing symbol, and the ICE
Markup Assistant also helps to ensure that all markup symbols are closed. For instance, if the
user tries to open the same symbol again before closing it, the program will remind the user to
close it first. (Quinn and Porter, 1996)
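The two behaviours described above can be sketched in a few lines. This is only an illustration and not the WordPerfect macros themselves. It assumes that the text unit marker is <#>, as in the spoken example in section 2.2.3, and that paired symbols have alphabetic names such as <bold> or <it>. Both function names are hypothetical.

```python
import re

TEXT_UNIT = "<#>"  # assumed text unit marker, as in the spoken-text example


def mark_text_units(text):
    """Insert a text unit marker at the start of the text and after each
    probable sentence boundary (a full stop followed by white space)."""
    marked = re.sub(r"\.\s+", ". " + TEXT_UNIT, text.strip())
    return TEXT_UNIT + marked


def unclosed_symbols(text):
    """Return the names of paired mark-up symbols that are opened but never closed."""
    open_symbols = []
    for slash, name in re.findall(r"<(/?)([A-Za-z]+)>", text):
        if not slash:
            open_symbols.append(name)
        elif name in open_symbols:
            # Close the most recently opened symbol with that name.
            for i in range(len(open_symbols) - 1, -1, -1):
                if open_symbols[i] == name:
                    del open_symbols[i]
                    break
    return open_symbols


if __name__ == "__main__":
    print(mark_text_units("Readers must return all books. You must attend every day."))
    print(unclosed_symbols("Readers <bold>must</bold> return <it>all books"))
```

A full stop is only a probable boundary: abbreviations such as "Nov. 29" would be split wrongly, which is why the inserted markers still need to be checked by hand.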
Using a reduced system of annotation is another way of minimizing the amount of time taken to
annotate texts. For those ICE teams which lack resources to insert all the ICE markup that has
been developed, the ICE project reduces the amount of structural markup that is required to the
most “essential” markup. (Meyer, 2002)
2.3.2
The Different Taggers available
Automatic text tagging is an important first step in discovering the linguistic structure of text
corpora. For a tagger to function as a practical component in a language processing system,
Cutting et al. (2005) believe that a tagger must be:
•
Robust: A tagger should be able to deal with ungrammatical constructions, isolated phrases
such as titles, and non-linguistic data such as tables and special words (which might be
unknown to the tagger).
•
Efficient: Because of the large number of words that need to be analysed and tagged in every
corpus, a tagger must be time-efficient, and any training required should also be fast to allow
rapid turnaround with new corpora and new text genres.
•
Accurate: A tagger should assign the correct part-of-speech tag to every word it encounters to
reduce human intervention.
•
Tunable: A tagger should be able to take different hints to correct systematic errors and to be
adapted to different corpora.
•
Reusable: A tagger should require a minimal amount of effort to be retargeted to new corpora,
new tagsets and new languages.
There are two different types of taggers: rule-based and probabilistic (Garside and Smith, 1997;
Meyer, 2002). In a rule-based tagger, grammar rules are written into the tagger and tags are
inserted on the basis of these rules. The TAGGIT program was among the first rule-based taggers to
be developed, followed by the Brill tagger. However, rule-based taggers are being superseded by
probabilistic ones. The latter work by assigning to each word the tag that is most likely
in the context of the word and its immediate neighbours. Garside and Smith
(1997) have given the example of a sentence beginning “the run”: the word “run” has a high
probability of being a noun rather than a verb because it is preceded by “the”.
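The “the run” example can be made concrete with a toy sketch. The probabilities below are invented purely for illustration, whereas a real tagger estimates them from a tagged corpus. The greedy left-to-right choice is a simplification and is not the algorithm of CLAWS or of any other tagger named in this report.

```python
# Made-up probabilities for illustration only.
LEXICAL = {  # how likely each tag is for a word, ignoring context
    "the": {"ART": 1.0},
    "run": {"N": 0.3, "V": 0.7},
}
TRANSITION = {  # how likely each tag is after the previous tag
    "START": {"ART": 0.5, "N": 0.3, "V": 0.2},
    "ART": {"N": 0.9, "V": 0.1},
    "N": {"ART": 0.2, "N": 0.2, "V": 0.6},
    "V": {"ART": 0.5, "N": 0.3, "V": 0.2},
}


def tag_greedily(words):
    """Give each word the tag that maximises the product of its lexical
    probability and the transition probability from the previous tag."""
    tags = []
    previous = "START"
    for word in words:
        candidates = LEXICAL[word.lower()]
        scores = {
            tag: prob * TRANSITION[previous].get(tag, 0.0)
            for tag, prob in candidates.items()
        }
        best = max(scores, key=scores.get)
        tags.append(best)
        previous = best
    return tags


if __name__ == "__main__":
    print(tag_greedily(["the", "run"]))  # context favours the noun reading
    print(tag_greedily(["run"]))         # without "the", the verb reading wins
```

With these figures, “run” on its own is tagged as a verb, but after “the” the high article-to-noun transition probability outweighs the word's own preference for the verb tag.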
The most common taggers with which corpus linguists typically work are:
•
The TAGGIT program by Greene and Rubin was one of the earliest taggers, developed
around 1971 as an aid in the tagging of the Brown Corpus. It tagged 77% of the corpus
automatically and the rest was done manually over a period of several years. The tags
assigned were from a set of some 77 tags (the Brown tags). (Garside and Smith, 1997)
•
CLAWS (the Constituent Likelihood Automatic Word-tagging System), another one of the
first tagging programs, was designed in the early 1980s at the University of Lancaster (Atwell,
1983). CLAWS has consistently achieved 96-97% accuracy and since then, various versions
of the CLAWS program have been developed and have been used to tag the LOB Corpus (the
British counterpart of the Brown Corpus) and the British National Corpus. (Leech, 1997a)
•
The TOSCA (Tools for Syntactic Corpus Analysis) tagger has been designed by the TOSCA
team at the University of Nijmegen to insert two tagsets, namely the TOSCA tagset,
which is used to tag the Nijmegen Corpus, and the ICE tagset, composed of 262 tags. (UCL,
2004; Meyer, 2002)
•
Another tagger that can be used to insert the ICE tagset is the AUTASYS tagger (Fang, 1996).
The tagger has been developed by Fang and Xiaoli at the Guangzhou Institute of Foreign
Languages, China and it assigns not only ICE tags, but also LOB tags and SKELETON tags.
AUTASYS has an accuracy rate of 96% and it has a fast rate of processing words.
•
The Brill tagger, a multi-purpose tagger, can be trained to insert any tagset the user is working
with. It can also be applied to any language (Garside and Smith, 1997, Atwell et al., 2000).
•
EngCG-2 (the Helsinki English Constraint Grammar) is a tagger that has been designed to
overcome the problems in the TAGGIT program and other rule-based taggers. It has a wider
application and is able to “refer up to sentence boundaries rather than the local context alone”
(Meyer, 2002). One main advantage of EngCG-2 is its 99.5% accuracy rate.
2.3.3
The ICE Tag Selection System
TAGSELECT helps users to select among the alternative word-class tags generated by the
TOSCA tagger or AUTASYS. The most likely alternative tag for each word is displayed first,
so human intervention is only needed if the first tag is not the correct one. Where no correct
alternative is provided, a new tag can be chosen from the list of possible tags. TAGSELECT is
user-friendly since it runs under Microsoft Windows and therefore all functions are available
using menus, buttons and scroll bars. (Quinn and Porter, 1996)
2.3.4
The ICE Syntactic Marking System
For the ICE projects, syntactic markers are added to the tagged texts prior to parsing by the
TOSCA parser or any other parser that requires such pre-editing. This is done with the
ICEMARK system. Syntactic markers make the input to the parser simpler and therefore restrict
the number of alternative syntax trees generated. Like the ICE Tag Selection System, ICEMARK
also runs under Microsoft Windows, making it user-friendly. (Quinn and Porter, 1996)
2.3.5
Different Varieties of Syntactic Annotation
Tagging and parsing are closely related, and so many parsers have taggers built into them. For
instance, the EngFDG (Functional Dependency Grammar of English) parser and the TOSCA
parser can assign both syntactic functions and part-of-speech tags to words (Meyer, 2002). Like
taggers, parsers can be either probabilistic or rule-based. Both types have been widely used, but
major emphasis has been put on the development of probabilistic ones in recent years since they
are thought to be more robust in the sense that they are “able to parse rare or aberrant kinds of
language, as well as more regular, run-of-the-mill types of sentence structures” (Leech and Eyes,
1997).
The Lancaster/IBM Treebank
A skeleton parsing scheme based on a shallow PS (phrase-structure) model has been used to parse
about 3 million words of text. The PS model simply involves analysing every sentence in the
corpus and adding labelled brackets to it. A sample of skeleton parsing from the Lancaster/IBM
Spoken English Corpus is shown in Figure 2. It can be noted that the tree is incomplete and that
the number of bracket labels used is quite small. This is done intentionally to speed up the
process and to limit the complexity of the parsing (Leech and Eyes, 1997).
SJ06 298v
[S But_CCB ,_, [[N the_AT thing_NN1 N][V was_VBDZ V]] ,_, [N you_PPY
N] often_RR [V found_VVD [Fn that_CST [Fa although_CS [N you_PPY N][V
had_VHD [N a_AT1 reserved_JJ seat_NN1 N]V]Fa] ,_, that_CST there_EX
just_RR [V would_VM n’t_XX be_VB0 [N room_NN1 N][P on_II [N the_AT
train_NN1 N]P]V]Fn]V] ._. S]
Figure 2: Sample from the Lancaster/IBM Spoken English Corpus (Leech and Eyes, 1997)
The Penn Treebank: Phase 1
The Penn Treebank is the largest and best-known treebanking operation available today. It has
been developed at the University of Pennsylvania by Mitchell Marcus and his team and is
closely modelled on the Lancaster/IBM Treebank. A PS model of parsing is used and
incompletely parsed trees are accepted into the Treebank. The differences are that the Penn tree is
displayed vertically, as shown in Figure 3 below, and that it is generally available throughout the world.
(Marcus et al., 1993)
Another more ambitious version (Phase 2) of the Penn Treebank is being developed. In the Phase
2 Treebank, a wider range of additional information, such as functional labels or types of
adverbial, will be added.
( (S (NP (NP Pierre Vinken)
         ,
         (ADJP (NP 61 years)
               old
               ,))
     will
     (VP join
         (NP the board)
         (PP as
             (NP a nonexecutive director))
         (NP Nov. 29)))
  .)
Figure 3: A sentence from the Penn Treebank (Phase 1) (Leech and Eyes, 1997:42)
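Labelled bracketing of this kind is straightforward to process by program. The sketch below is a minimal illustration and not a tool from the Penn Treebank project. It reads a bracketed string into nested lists and collects the phrase labels, and both function names are hypothetical.

```python
def parse_brackets(text):
    """Read Penn-style labelled bracketing into nested lists of strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read():
        nonlocal position
        items = []
        while position < len(tokens):
            token = tokens[position]
            position += 1
            if token == "(":
                items.append(read())   # start of a nested constituent
            elif token == ")":
                return items           # end of the current constituent
            else:
                items.append(token)    # a label or a word
        return items

    return read()


def phrase_labels(tree):
    """Collect constituent labels (the first string inside each bracket),
    in the order in which the brackets open."""
    labels = []
    for node in tree:
        if isinstance(node, list):
            if node and isinstance(node[0], str):
                labels.append(node[0])
            labels.extend(phrase_labels(node))
    return labels


if __name__ == "__main__":
    sample = "(NP (NP Pierre Vinken) , (ADJP (NP 61 years) old))"
    tree = parse_brackets(sample)
    print(tree)
    print(phrase_labels(tree))
```

The small number of labels returned reflects the deliberately shallow analysis of skeleton parsing described above.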
Nijmegen Treebanks
Developed before the Penn Treebank, the TOSCA parsing system was set up in the early 1980s at
the Catholic University of Nijmegen, Holland. It uses a grammatical model, known as Affix
Grammar and the TOSCA Treebank is integrated with the Linguistic DataBase (LDB), which
allows the Treebank to be searched for varied features. One of its main features is that it allows
users to correct or change the parse where necessary. Figure 4 gives an example of a sentence
from the TOSCA Treebank. (Leech and Eyes, 1997)
-:TXTU()
  UTT:S(act,indic,inter,montr,pres,unm)
    INTOP:AUX(do,indic,pres){Does}          Does
    SU:NP()
      NPHD:PN(pers,sing){he}                he
    V:VP(act,do,indic,montr)
      MVB:LV(indic,nfin,montr){realize}     realize
    OD:CL(act,indic,intens,pres,unm,zsub)
      SU:NP()
        NPHD:PN(pers,sing){he}              he
      V:VP(act,indic,intens,pres)
        MVB:LV(indic,intens,pres){is}       is
      CS:AJP(prd)
        AJHD:ADJ(prd){wrong}                wrong
    PUNC:PM(qm){?}                          ?
Figure 4: Sentence from the TOSCA Treebank (Leech and Eyes, 1997:44)
The SUSANNE Corpus
Geoffrey Sampson’s SUSANNE Corpus is a Treebank which provides a lot of parsing
information for each sentence. It is a result of manual analysis and “contains much detail within a
small compass.” An example from the SUSANNE Corpus is given in Figure 5 below. Moreover,
it is freely available to any research community. The only downside is that the texts are old
(1961) compared to what people would usually analyse. (Sampson, 1995)
N03:0460f  YB      <minbrk>  -        [Oh.Oh]
N03:0460g  PPHS1m  He        he       [O[S[Nas:s.Nas:s]
N03:0460h  VVDt    handed    hand     [Vd.Vd]
N03:0460i  AT      the       the      [Ns:o.
N03:0460j  NN1c    bayonet   bayonet  .Ns:o]
N03:0460k  II      to        to       [P:u.
N03:0460m  NP1m    Dean      Dean     [Nns.Nns]p:u]
N03:0460n  CC      and       and      [S+
N03:0460p  VVDv    kept      keep     [Vd:Vd]
N03:0460q  AT      the       the      [Ns:o.
N03:0460a  NN1c    pistol    pistol   [.Ns:o]S+]S]
N03:0460b  YF      +.        -        .
Figure 5: A sample from the SUSANNE Corpus (Sampson, 1995:32)
The Helsinki Constraint Grammar
The Helsinki Constraint Grammar parser adopts a dependency grammar model instead of the PS
grammar model as in the other parsers mentioned above. The parser also provides a breakdown
of the attributes of individual words such as sub-categorisation information for verbs and in
addition, functional labels such as ‘subject’ or ‘object’ are added. A sample of Helsinki parser
output is shown in Figure 6 below. (Leech and Eyes, 1997)
(“<*royal>”
(“royal” A ABS (@AN>)))
(“<*dutch>”
(“dutch” <Nominal> A ABS (@AN> @<Nom)))
(“<*shell>”
(“shell” N NOM SG (@SUBJ)))
(“<$,>”)
(“<*worth>”
(“worth” PREP (@ADVL)))
(“<*just>”
(“just” ADV (@AD-A>)))
(“<*$500m>”
(“$500m” NUM CARD (@<P)))
(“<*less=than>”
(“less=than” <CompPP> PREP (@ADVL))
(“less=than” <**CLB> CS (@CS))
(“less=than” ADV (@ADVL)))
(“<*exxon>”
(“exxon” <Proper> N NOM SG (@<P)))
(“<$,>”)
(“<is>”
(“be” <SV><SVC/N><SVC/A> V PRES SG3 VFIN (@+FMAINV)))
(“<*third>”
(“third” NUM ORD (@PCOMPL-S)))
(“<$.>”)
Figure 6: Output from the Helsinki ENGCG parser (Leech and Eyes, 1997:48)
2.3.6
The ICE Syntactic Tree Annotator
Within the ICE community, two sets of programs are used for the annotation process. First, a
parser is applied to produce a partial or a complete analysis, and then an editor is used to
correct or complete the analyses. The parse analysis is represented in ‘tree’ form and
ICETREE is the editor that allows such ‘tree’ form analyses to be manipulated. ICETREE
also allows parse trees to be built from scratch and it can be used as a viewer for complete
analyses. (Quinn and Porter, 1996)
3. Methodology
There are many approaches to software development and one of the main approaches is the
“Waterfall Model”. It defines a project as a set of stages: from problem definition to requirements
analysis, design, implementation, testing and finally maintenance. However, each individual
stage in the project must be completed before moving on to the next (Laudon and Laudon, 2002).
The “Feedback Model” uses the same development stages as the “Waterfall Model” but it allows
for re-evaluation of earlier stages if problems arise in the later stages. Therefore, the “Feedback
Model” is the methodology that has been adopted for this project.
The first stages, namely, problem definition and requirements analysis were already carried out
during the background research and were described in sections 1 and 2 above. This section will
therefore describe the design of the project.
3.1
Corpus Design
3.1.1
Methods to be used
Firstly, it was decided that the pilot project would be fully Internet-based, that is, all of the texts
would be taken from the Internet only. The texts would be collected from the numerous
Mauritian websites already available.
On the basis of this approach, it was decided that ICE-Mauritius would be composed of the
following genres, adapted from the ICE standard text categories.
Table 3: Text Categories to be adapted for ICE-Mauritius
Spoken
    Dialogues
        Public:           Broadcast Discussions, Broadcast Interviews, Parliamentary
    Monologues
        Unscripted:       Commentaries, Unscripted Speeches, Legal Presentations
        Scripted:         Broadcast News, Broadcast Talks
Written
    Nonprinted
        Student Writing:  Student Essays, Exam Scripts
        Letters:          Social Letters, Business Letters
    Printed
        Academic:         Humanities, Social Sciences, Natural Sciences, Technology
        Popular:          Humanities, Social Sciences, Natural Sciences, Technology
        Reportage:        Press reports
        Instructional:    Administrative Writing, Skills/hobbies
        Persuasive:       Editorials
        Creative:         Novels
Since the texts would be collected from the Internet only, some categories were removed
because they would not be available on the Internet. The number of texts to collect for each category
was not stated, since it was difficult to know beforehand how many of those texts would be available
online.
The main method would be to collect as many texts as possible for any category, including
categories not listed above, and then to classify the texts accordingly, creating or removing
categories where necessary. For the pilot project, one to two percent of the corpus would be
collected. However, the samples collected would have to follow the standard of 2,000 words per
text so that the full corpus could eventually reach the required one million words.
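The sampling arithmetic behind these figures can be checked with a short sketch. The function name is illustrative only, and the two constants are the 2,000-word sample size and the one-million-word target stated above.

```python
WORDS_PER_TEXT = 2_000          # standard ICE sample size
FULL_CORPUS_WORDS = 1_000_000   # target size of a full ICE component


def texts_needed(percent):
    """Number of 2,000-word texts needed to cover a percentage of the full corpus."""
    target_words = FULL_CORPUS_WORDS * percent // 100
    # Round up so that a partly filled final text still counts as a text.
    return (target_words + WORDS_PER_TEXT - 1) // WORDS_PER_TEXT


if __name__ == "__main__":
    for percent in (1, 2, 100):
        print(f"{percent}% of the corpus: {texts_needed(percent)} texts")
```

On these assumptions, the pilot target of one to two percent corresponds to between five and ten full-length texts, against 500 for a complete component.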
While collecting the material, the three main problems associated with the use of the Internet,
as identified by Sharoff (2005), would have to be kept in mind:
1. It cannot be claimed that the material is representative and that there is a balance of text types
2. Search engines address the needs of information retrieval, rather than linguistic search
3. Search engines present search results in a way that also does not correspond to the needs of a
linguist
3.1.2
Copyright Issues
One of the important issues to consider in the collection of texts was copyright. The
corpus could not be used publicly unless permission was granted from owners of sites to use the
material. “Experience with the creation of the other ICE corpus has shown that, in general, it is
quite difficult to obtain permission to use copyrighted material” (Meyer, 2002). This is because
some people will take months before replying while others will not even bother sending a reply.
Therefore, extra time would have to be allocated for the request of permission to use the
copyrighted materials and also, extra texts would have to be collected in case permission was not
granted for some of them.
The first stage in compiling ICE-Mauritius would be to identify some suitable websites and to
obtain email addresses as well as postal addresses, telephone numbers and fax numbers. Two
letters would be prepared: one explaining the purpose of the corpus, for the owners and
authors to keep, and the other with a return slip for them to sign if they agreed for their websites
to be used.
3.1.3
Corpus Layout
“Organising corpus into a series of directories and subdirectories makes working with the corpus
much easier and allows the corpus compiler to keep track of the progress being made on corpus as
it is being created” (Meyer, 2002). Therefore, the corpus would be organised into directories and
subdirectories according to the different text categories. For the proposed diagrammatic layout of
the corpus, see Appendix C.
Each text would be assigned a number that designated a specific category in the corpus in which
the sample might be included. For instance, a text number LETT01 would be the first sample
collected for inclusion in the category “Letters” while B-N01 would be the first sample collected
for inclusion in the category “Broadcast News”. This numbering system would allow the corpus
compiler to keep easy records of where a text belonged in the corpus and how many samples had
been collected for that part.
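The numbering scheme can be sketched as a small lookup. Only the two prefixes given above (LETT and B-N) come from the design. The root directory name and the function name are hypothetical, and the real directory layout is the one proposed in Appendix C.

```python
import re
from pathlib import PurePosixPath

# Prefixes taken from the two examples in the text; a full table would
# need one entry per category in the corpus.
CATEGORY_BY_PREFIX = {
    "LETT": "Letters",
    "B-N": "Broadcast News",
}


def locate_text(text_id, corpus_root="ICE-Mauritius"):
    """Return the assumed file path and the sample number for a text ID."""
    match = re.fullmatch(r"([A-Z-]+?)(\d+)", text_id)
    if match is None:
        raise ValueError(f"unrecognised text ID: {text_id!r}")
    prefix, number = match.groups()
    category = CATEGORY_BY_PREFIX[prefix]  # KeyError for an unknown category
    path = PurePosixPath(corpus_root) / category / f"{text_id}.txt"
    return path, int(number)


if __name__ == "__main__":
    print(locate_text("LETT01"))
    print(locate_text("B-N01"))
```

Counting the files under one category directory then gives the number of samples collected so far for that part of the corpus.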
3.2
Capturing Text in Electronic Format
3.2.1
Computerising Speech
It was assumed that the spoken texts collected would be in digitised form since they would be
taken from the Internet. The software “Voice Walker 2.0” or “Sound-Scriber” mentioned above
would be downloaded freely from the Internet to play back the samples of digitised speech. Since no
alternative was available, speech would be transcribed manually. This process would take
the longest time in the compilation of the corpus and therefore extra time should be allowed.
3.2.2
Computerising written texts
Texts downloaded from the Internet were expected to contain as much HTML coding as text.
Since to manually delete this coding would take a considerable amount of time and effort, the
software “HTMASC” mentioned above would be used to automatically strip the HTML coding
from text. An ASCII text file with no coding was expected to be produced.
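To illustrate what stripping the HTML coding involves, the sketch below uses the HTML parser from the Python standard library as a stand-in. It is not HTMASC and makes no claim about how HTMASC works. The class and function names are hypothetical, and the output is Unicode text rather than strict ASCII.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring tags, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when not inside a <script> or <style> element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML page with white space normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<html><body><h1>News</h1><p>Port Louis &amp; beyond</p></body></html>"
    print(strip_html(page))
```

Paragraph boundaries are lost in this simple version, so a corpus compiler would probably want to keep a line break wherever a block-level tag closes.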
3.3
Corpus Annotation
3.3.1
Structural mark-up
The mark-up of the texts would be carried out by writing minimal encoding and pasting a header
using a word processor. The following components, adapted from TEI-Header from the
Humanities Text Initiative (HTI) website to the ICE standards, would be added to each text:
File Description <fileDesc>
<fileDesc>
<titleStmt>
<title> </title>
<author> </author>
<respStmt><resp>compiled by</resp>
<name>Dolly Koo</name></respStmt>
</titleStmt>
<publicationStmt>
<publisher> </publisher>
<pubPlace> </pubPlace>
<date></date>
</publicationStmt>
<sourceDesc>
created in machine-readable form in http://mauritiustimes.com/040205mr.htm
</sourceDesc>
</fileDesc>
Encoding Description <encodingDesc>
<encodingDesc>
<projectDesc>
Texts collected for use in the pilot project for ICE-Mauritius, February, 2005
</projectDesc>
<samplingDecl>
Whole text of 862 words copied from the site
</samplingDecl>
</encodingDesc>
A Profile Description would also be added and it would be similar to the one described in section
2.2.3 above.
3.3.2
Procedure for annotating the corpus
As mentioned earlier, the encoding would be done manually since no program was developed to
encode the corpus automatically. For each text, the following steps would be performed:
1. Text would be copied from the Internet and pasted into Microsoft Word. It would be saved as
encoded text, choosing Unicode UTF-8 as recommended by Al-Sulaiti (2004), since some of
the texts might contain French quotations with special characters.
2. The text would then be encoded with paragraph markers using the option FIND/REPLACE in
edit: Find ^p Replace ^p in the case of a normal.
3. After the paragraphing was marked, the adapted TEI-header would be added and the missing
information would be filled in.
4. When the text was complete, it would be saved with its ID number as its name. For instance,
the text with the ID number LETT01 from the “Letters” category would be saved as
LETT01.txt in the “Letters” directory.
5. The text would then be renamed by changing the file extension from .txt to .xml.
6. Finally, to verify the XML file, the text could be opened in Internet Explorer.
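Although the procedure above was designed to be carried out by hand, its steps can be sketched in code for comparison. Several details below are assumptions rather than part of the design: the wrapper elements <doc> and <body>, the use of <p> as the paragraph marker, and all function names. The well-formedness check uses an XML parser in place of opening the file in Internet Explorer.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.sax.saxutils import escape


def encode_text(raw_text, header_xml):
    """Wrap each non-empty line in <p> tags and prepend the header (steps 2-3)."""
    paragraphs = [line.strip() for line in raw_text.split("\n") if line.strip()]
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    # <doc> and <body> are hypothetical wrapper elements.
    return f"<doc>\n{header_xml}\n<body>\n{body}\n</body>\n</doc>\n"


def is_well_formed(xml_string):
    """Check that the document parses as XML (a stand-in for step 6)."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


def save_as_xml(document, directory, text_id):
    """Save the document under its ID with an .xml extension (steps 4-5)."""
    path = Path(directory) / f"{text_id}.xml"
    path.write_text(document, encoding="utf-8")
    return path


if __name__ == "__main__":
    header = "<fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>"
    document = encode_text("First paragraph.\n\nSecond paragraph.", header)
    print(document)
    print(is_well_formed(document))
```

Escaping characters such as "&" matters here because a single unescaped ampersand in a web text is enough to make the whole file fail the check.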
4. Corpus Encoding
With the design laid out in section 3 in place, implementation of the ICE-Mauritius was started.
This section covers the encoding of the pilot project.
4.1
Collection of Texts
4.1.1
Search methods
Bearing in mind the three main problems of collecting texts from the Internet mentioned in
section 3.1.1, the following search engines and key words were used:
No.  Search Engine  Key words
1    Google         Mauritius, Mauritius articles/ books/ novels/ business letters/ press/
                    websites/ educational/ reports/ newspapers/ schools/ stories/ texts,
                    Higher School Certificate/Mauritius exam papers
2    Yahoo          Mauritius, Mauritius articles/ news/ books/ novels
3    MSN            Mauritius, Mauritius articles/ books/ novels/ business letters/ press/
                    educational reports/ newspapers/ schools/ stories/ texts, Mauritius
                    Higher School Certificate /exam papers
Table 4: Search engines & key words used to collect texts
Some of these searches proved to be very useful, for instance when searching for “Mauritius” in
Google, some of the main Mauritian websites came up, such as the government pages and other
interesting websites containing the texts needed were found. However, with over 20 million
results of “Mauritius”, it was difficult to look through all of them. The search had to be refined
and new key words such as “Mauritius newspaper” or “Mauritius schools” were typed in. The
‘Advanced Search’ option and the ‘Preference’ option in Google were also used, but they did not
prove very useful. Key words like “Mauritius business letters” matched over 200,000 sites but
none were related to the corpus or were written by Mauritian people.
The same process was carried out with the search engines Yahoo and MSN. After a few searches
with Yahoo, it was found that it did not yield many results and that all the sites it referred to
had already been visited through Google. With MSN, more results were obtained
and some new materials were collected but, as with Yahoo, many of the sites had
already been displayed in Google.
Two of the most useful Mauritian websites are listed in Table 5 below:
Website:      http://www.servihoo.com/
Description:  Website owned by Telecom Plus, the only telephone provider in Mauritius.
              It has links to other websites such as local newspapers, radio and
              television, and it contains articles ranging from culture to business to sports.
Website:      http://www.mauritiustopsites.com/topsiteshtml/index157.shtml
Description:  Website owned by Internet Communication Services Mauritius. It has a list
              of the 946 most popular websites from the country.
Table 5: Most popular Mauritian websites
4.1.2 Text Collection
Numerous websites related to Mauritius were searched, but careful attention had to be paid to
the author and the publisher, since many of the articles found were not written by Mauritian
people. The first few texts took a considerable amount of time to obtain, but once the useful
sites were known the texts were collected more quickly. Owing to lack of time, only four days
were spent searching thoroughly with the above-mentioned engines; in that time, written texts
from fifty websites were collected, totalling 51,960 words, or 5 percent of the actual size of
the corpus (exceeding the target of 1-2% for the pilot project). The author, publisher, place of
publication, date and, where available, contact details of the author were also noted for each
text.
Some texts, such as press reports, were easily obtained from the various newspaper websites.
However, letters and student writing proved very difficult to find: no student essays or exam
scripts were available online. Shorter texts were also easier to find than longer texts of 2,000
words each. Appendix D shows the full list of texts and the details that were collected.
Not surprisingly, spoken text proved almost impossible to obtain from the Internet. Only two
websites had spoken material, namely the Mauritius Broadcasting Corporation
(http://mbc.intnet.mu/) and TopFM (http://www.topfmradio.com/index.php). The Mauritius
Broadcasting Corporation provided live TV news transmission, but only the French version was
available online. Both sites provided live radio transmission, but most of the talk was in
French too, and saving the spoken material proved infeasible given the short amount of time
available to compile the pilot project. Hence, no spoken texts were collected. The solution
proposed by Sharoff (2005), that is, to increase the amount of ephemera (leaflets, junk mail and
typed material) and correspondence, could be attempted in the follow-up project to compensate
for the lack of spoken texts and to make the project more balanced.
Alongside collecting the texts, a database file was created in Microsoft Excel (Appendix D) which
stored the type, ID number, source, title, author, publisher, place and year of publication and the
number of words of each text. This database file was important to have for the organisation of the
texts in the corpus and for counting the words automatically.
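The record-keeping described above can be illustrated with a short sketch. It is not the Excel file used in the pilot project (see Appendix D); it only assumes that each text is stored as one row holding the fields listed above and that words are counted as whitespace-separated tokens. The column names and the functions count_words, make_row, write_catalogue and total_words are hypothetical.

```python
import csv
import io

# Column names are assumptions modelled on the fields listed in the text.
FIELDS = ["type", "id", "source", "title", "author",
          "publisher", "place", "year", "words"]


def count_words(text):
    """Count whitespace-separated tokens (a simple stand-in for a word count)."""
    return len(text.split())


def make_row(text_type, text_id, source, title, author,
             publisher, place, year, body):
    """Build one catalogue row; the word count is derived from the text body."""
    return {
        "type": text_type, "id": text_id, "source": source, "title": title,
        "author": author, "publisher": publisher, "place": place,
        "year": year, "words": count_words(body),
    }


def write_catalogue(rows):
    """Serialise the rows as CSV text, which a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def total_words(rows):
    """Sum the word counts of all catalogued texts."""
    return sum(row["words"] for row in rows)
```

Keeping the word count in the same row as the bibliographic details means that the running total for the corpus, or for one category, can be recomputed whenever a text is added.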
4.1.3 Written Text Classification
It proved difficult to decide to which category a text belonged. For example, it was unclear
whether some of the popular printed texts came from press reports or from other magazines, so
those texts were classified as popular printed texts on best judgement alone. Sinclair (1996)
had examined the problems of text classification in detail and had reported that corpus design
made use of some internal and external factors to decide on the text category. He pointed out
that many text classifications were based on topic, as it was represented in newspapers and
magazines.
The classification of the written texts of ICE-Mauritius was based on the ICE standard
classifications, but with some amendments because there were not enough texts to cover all the
categories. The texts collected from the 50 websites were grouped according to their length.
Those exceeding 1,500 words were treated as whole texts, while those below 1,500 words were
grouped together as one text, up to a total of around 2,000 words. However, texts that were
grouped together had to belong to the same initial category. This resulted in 30 final texts
ready to be included in the corpus. The spoken text category was removed completely for the
pilot project, even though this resulted in an unbalanced corpus. Much more time would have to
be allocated to collecting and encoding spoken texts for the actual ICE-Mauritius. Table 6 below
shows the text categories which were derived from the sources, the number of texts and the total
number of words in each category.
Table 6: Number of texts and number of words in each category
Text Categories (Written: Non-printed and Printed)
Category          Description               No. of Texts   No. of Words
Student Writing   Summary of project        1              762
Letters           School/Business/Social    2              3352
Speeches          Formal                    3              5421
Academic          Various Topics            2              3354
Popular           Various Topics            3              5528
Reportage         Press reports             9              16885
Instructional     Administrative/hobbies    5              7897
Persuasive        Editorials                2              3226
Creative          Novels                    3              5735
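The grouping rule described in this section (texts over 1,500 words kept whole, shorter texts of the same category combined up to around 2,000 words) can be sketched as follows. The grouping in the pilot project was done by hand; the greedy cut-off used here, which starts a new group as soon as the next text would push the total over 2,000 words, is an assumption, as are the names group_texts, WHOLE_TEXT_MIN and GROUP_TARGET.

```python
from collections import defaultdict

WHOLE_TEXT_MIN = 1500  # texts longer than this are kept as whole texts
GROUP_TARGET = 2000    # approximate upper size of a grouped text (assumed cut-off)


def group_texts(texts):
    """Group short texts of the same category into composite texts.

    texts: list of (category, name, word_count) tuples.
    Returns a list of (category, [names], total_words) tuples.
    """
    final = []
    pending = defaultdict(list)
    for category, name, words in texts:
        if words > WHOLE_TEXT_MIN:
            final.append((category, [name], words))
        else:
            pending[category].append((name, words))

    for category, items in pending.items():
        names, total = [], 0
        for name, words in items:
            # Close the current group if adding this text would exceed the target.
            if names and total + words > GROUP_TARGET:
                final.append((category, names, total))
                names, total = [], 0
            names.append(name)
            total += words
        if names:
            final.append((category, names, total))
    return final
```

Because short texts are only ever combined within one category, the rule that grouped texts must share the same initial category holds by construction.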
4.1.4 Permission Letters
As mentioned above, copyright was one of the most important aspects to consider when collecting
texts. The authors’ details were recorded alongside the texts when the Internet was searched.
However, it was noticed that many texts did not contain any details of their owner. When
compiling the actual ICE-Mauritius, such texts should be rejected from the beginning since they
could not be used without permission. For the pilot project, however, all of the texts collected
were used even where no permission had been obtained, since they would be kept only temporarily
and would not be made available to the public.
Two letters prepared by Al-Sulaiti (2004) to request permission for the use of texts available
online were used and sent to the authors of the texts that had been collected. One explained the
purpose of the corpus and was for the owners and authors to keep; the other had a return slip
for them to sign if they agreed to their websites being used. Samples of the two letters can be
found in Appendix E.
It took one full day to send out twenty-three of these letters by email. Owing to lack of time,
they were sent only by email, and replies were expected mostly by email, since it was estimated
that a letter would take two weeks to reach Mauritius and a reply another two weeks to come
back, even if the author returned it by post straight away. Of the twenty-three letters, four
could not be delivered because the address available on the Internet was wrong. To date, three
authors have given their permission and were happy to help; one of them even asked for comments
on his novel. However, one did not agree and asked for formal support from the University of
Leeds and a complete CV. These outcomes showed that much more time and effort would be needed
to obtain permissions for the follow-up project.
Table 7 shows the list of addresses of resources for which permission of copyright had been
received.
Source                                                      Contacts
http://mauritiustimes.com/040205mr.htm                      Madhukar Ramlallah, mtimes@intnet.mu
http://pages.intnet.mu/rajbalkeehomepage/hd-complete.htm    Raj Balkee, rajbalkee@intnet.mu
http://ilemaurice.tripod.com/rougpoisal.htm                 Madeleine Philippe, madeleine@cjp.net
Table 7: Sources with copyright permission
4.1.5 Layout of the Pilot Corpus
The design of the corpus had evolved slightly from the original plan, since it was found useful
to have one folder for the marked-up corpus and one for a raw corpus. The latter contained Word
documents of the actual text together with a table of details such as title, author, publisher,
place of publication, date, source, email address of the author and the number of words, which
had been collected and were used in the header. The raw corpus folder also contained the HTML
files of the texts taken from the source, with some of the extra coding stripped off manually.
Even though building up this separate raw corpus took some time, it made the annotation process
much quicker and easier and hence would not affect the overall length of the project. Each small
text was marked up individually before being grouped with others, and was stored within a
sub-folder in the corresponding category in the main marked-up corpus folder.
Both the raw corpus and the marked-up corpus folders were divided into the following sub-folders
for the different categories: Academic, Editorial, Instructional, Letter, Novel, Popular,
Reportage, Speech and Student Writing. The texts in the different folders were cross-referenced
by name.
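The folder layout described above could be created with a few lines of Python, as sketched below. The nine category names are taken from the text; the top-level folder names raw_corpus and marked_up_corpus and the functions layout_paths and create_layout are hypothetical, since the report does not state how the folders were named or created.

```python
from pathlib import Path

# Category sub-folders as listed in the text.
CATEGORIES = ["Academic", "Editorial", "Instructional", "Letter", "Novel",
              "Popular", "Reportage", "Speech", "Student Writing"]

# Hypothetical names for the two parallel top-level folders.
TOP_LEVEL = ["raw_corpus", "marked_up_corpus"]


def layout_paths(root):
    """Return every category sub-folder for both the raw and marked-up corpus."""
    root = Path(root)
    return [root / top / category for top in TOP_LEVEL for category in CATEGORIES]


def create_layout(root):
    """Create the folders on disk (existing folders are left untouched)."""
    paths = layout_paths(root)
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Giving both trees identical sub-folder names is what allows a text in the raw corpus to be matched to its marked-up counterpart by name alone.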
4.2 Corpus Annotation
4.2.1 TEI-Header
Some amendments had to be made to the adapted TEI-header since some of the information required
was not available from the Internet and it was impossible to obtain it in such a short time.
The information in the Profile Description concerning the author’s characteristics, such as age,
education, occupation and first language, had to be removed. The new Profile Description that
was used for the pilot project is shown below; for a full template of the header, please refer
to Appendix F.
Profile Description <profileDesc>
<profileDesc>
  <creation>
    <date value="2005-02">Feb 2005 </date>
    <rs type="city">Pointe Aux Sables, Mauritius </rs>
  </creation>
  <langUsage>English</langUsage>
  <textClass>
    <textDesc n=" ">
      <channel mode="w">print; written</channel>
    </textDesc>
    <particDesc>
      <person id="P1" sex=" ">
      </person>
    </particDesc>
  </textClass>
</profileDesc>
However, even after careful consideration on which fields to include in the header, some of the
information was still missing when the texts were encoded. For many texts, the author, the
publisher or the date published were not available online. Therefore, those fields were filled in
with “unknown”. In fact, among the twenty-nine texts collected, only six of them were complete.
Obtaining this information would be another task that would require extra effort and time in the
follow-up project.
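The handling of missing header information can be sketched in Python as below. This is an illustration only: the field names in HEADER_FIELDS are assumptions based on the details mentioned in section 4.1.5, and fill_header and is_complete are hypothetical helpers, not part of the pilot project.

```python
# Hypothetical field names, modelled on the details recorded for each text.
HEADER_FIELDS = ("title", "author", "publisher", "place", "date", "source")


def fill_header(metadata):
    """Return a header record in which every missing or empty field is 'unknown'."""
    header = {}
    for field in HEADER_FIELDS:
        value = str(metadata.get(field) or "").strip()
        header[field] = value if value else "unknown"
    return header


def is_complete(header):
    """A header is complete when no field had to be filled in with 'unknown'."""
    return all(value != "unknown" for value in header.values())
```

Counting the records for which is_complete returns False would give a quick measure of how much bibliographic information remains to be traced in a follow-up project.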
4.2.2 Texts Encoding
During the text encoding stage, the time taken for processing was measured. Using the
procedures described previously, going through the six steps took approximately 20 minutes,
depending on the information available to fill in the header but regardless of the length of
the text, since the paragraphing was done automatically. If further files from the same site
were collected, the header could be reused with some minor adjustments to fit the new text.
This took less time than the first file, between 5 and 10 minutes. A sample of a raw text can
be found in Appendix G, while a sample of the encoded text can be found in Appendix H.
When encoding the texts, some problems did surface with the viewing of the XML files.
Some of the common error messages that were displayed when the XML files were opened
with Internet Explorer are shown in Table 8.
Table 8: Errors during encoding of texts
XML File     Figure     Error Message
LETT02.xml   Figure 7   ‘whitespace not allowed’
REP05.xml    Figure 8   ‘A semi colon character was expected’
NOV01.xml    Figure 9   ‘End tag “P” does not match the start tag “h”’
Figure 7: Error as shown when “LETT02.xml” was opened in Internet Explorer
Figure 8: Error as shown when “REP05.xml” was opened in Internet Explorer
Figure 9: Error as shown when “NOV01.xml” was opened in Internet Explorer
When these problems occurred, the file was opened in a program called UniRed, which is freeware
available at http://sourceforge.net/projects/unired. UniRed is a Unicode plain text editor for
Windows which supports many character sets, including UTF-8, and mark-up languages such as XML
and HTML. If an error exists in the XML file, the program identifies it by highlighting it in
red. A green highlight means that the code is correct, that is, it has an opening tag and a
corresponding closing tag.
Figure 10: Screenshot of “LETT02.xml” in UniRed editor
A screenshot of the error from “LETT02.xml” in the UniRed editor is shown in Figure 10 above.
The character that appeared in red in the middle of the screenshot (the shaded ‘&’ character)
meant that the code was invalid. The error was caused by a character which is reserved in XML
and has to be modified to be accepted. In this example, the ‘&’ sign needed to be written as
‘and’ or as ‘&amp;’.
Figure 11: Screenshot of “REP05.xml” in UniRed editor
A different error message was displayed for “REP05.xml”. However, after the file was opened
in UniRed (Figure 11), it was noticed that the error was related to the same reserved
character, the ‘&’ sign. The only difference was that the error occurred in the URL of the source
(http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30), where it could only be
changed to ‘&amp;’, since replacing it with ‘and’ would alter the address.
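The ‘&’ errors described for “LETT02.xml” and “REP05.xml” can be avoided by escaping reserved characters before the text is placed inside XML tags. The following is a minimal sketch in Python, not the procedure used in the pilot project, where the files were corrected by hand in UniRed. The function name wrap_paragraph is hypothetical; escape and ElementTree are part of the Python standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_paragraph(raw_text):
    """Wrap raw text in a <p> element, escaping the characters &, < and >."""
    return "<p>" + escape(raw_text) + "</p>"


if __name__ == "__main__":
    url = "http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30"
    element = wrap_paragraph(url)
    print(element)                      # the '&' is written as '&amp;'
    print(ET.fromstring(element).text)  # parsing restores the original URL
```

Escaping should be applied once, to the raw text only: running it over text that already contains ‘&amp;’ would escape the ampersand a second time.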
In “NOV01.xml”, however, a completely different error was found. The file could not be parsed
because of a missing closing tag. In UniRed (Figure 12), it was found that the opening tag <h>
in line 5 did not have a matching closing tag </h>. This error was shown by highlighting in red
the next opening tag (the shaded “<” character).
Figure 12: Screenshot of “NOV01.xml” in UniRed editor
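Errors of the kinds shown in Figures 7 to 9 can also be detected without opening each file in a browser or editor. The sketch below uses the XML parser in the Python standard library to check whether a piece of XML is well-formed; it is offered as an alternative illustration and was not part of the pilot project. The function name check_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return line, column, str(error)
    return None


if __name__ == "__main__":
    samples = {
        "correct": "<text><h>Title</h><p>Body</p></text>",
        "missing </h>": "<text>\n<h>Title\n<p>Body</p>\n</text>",
        "unescaped &": "<text><p>Fish & chips</p></text>",
    }
    for label, xml_text in samples.items():
        print(label, "->", check_well_formed(xml_text))
```

As with the UniRed highlighting described above, the reported position is where the parser notices the problem, which may be later in the file than the tag that was actually left unclosed.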
Using UniRed was not only faster but also ensured that the files were correctly saved and could
be viewed in the browser without problems. This method was therefore tested and compared with
the former method, namely creating the text in Microsoft Word and then converting it to XML.
The UniRed method proved more efficient, with a processing time of only around 5 minutes.
After the errors had been corrected, the three files mentioned above, “LETT02.xml”, “REP05.xml”
and “NOV01.xml”, should look as shown in Figures 13, 14 and 15 respectively when opened again in
Internet Explorer. For a full display of how a file should look, refer to Appendix H for
another example.
Figure 13: Expected output for “LETT02.xml”
Figure 14: Expected output for “REP05.xml”
Figure 15: Expected output for “NOV01.xml”
5. The Proposal
This section is also related to the implementation stage but instead of a software implementation,
it describes how the possible extension laid out in section 1.3 was achieved. After the compilation
of the pilot ICE-Mauritius project, it was possible to write a work-plan and a proposal for a
follow-up project to develop a full-scale ICE-Mauritius corpus and extend the methodology to a
much more ambitious multinational ICE Corpus.
5.1 Funding Opportunities
5.1.1 Research at University of Leeds, School of Computing
The School of Computing web site (University of Leeds, 2004) states that the School has been
‘awarded a Grade 5 in the 2001 Research Assessment Exercise (RAE), confirming the School's
status as a leading research institute for computing’. The research activity within the School is
grouped into five categories, namely, Computer Vision and Language, Knowledge Representation
and Reasoning, Scientific Computing and Visualization, Theoretical Computer Science and
Informatics.
The School may offer scholarships, but most research staff and students who need grants for
their research will have to apply to Research Councils, namely, to the Engineering and Physical
Sciences Research Council (EPSRC). Therefore, to develop the full-scale ICE-Mauritius, an
application to the EPSRC will be made. In order to fill in the application form, further
research on the requirements of the EPSRC was carried out and is briefly described below.
5.1.2 Introduction to EPSRC - The Engineering and Physical Sciences Research Council
The Engineering and Physical Sciences Research Council (EPSRC, 2004) is ‘the UK
Government's leading funding agency for research and training in engineering and the physical
sciences’. The EPSRC operates, mostly, by funding research projects in universities and other
research organisations. The funds are intended to meet the direct costs of the research project,
together with a contribution towards the indirect costs involved (EPSRC Funding Guide, 2004).
The majority of funding from the EPSRC is provided through the Responsive Mode, but other
funding routes, such as Fellowships, are also available. ‘Calls for Proposal’ are also
available, where strategic opportunities are announced and researchers can choose from the list
provided. There is no minimum or maximum funding, and no minimum or maximum period
(EPSRC Funding Guide, 2004).
The EPSRC funds ‘a dynamic and evolving research portfolio, extending from fundamental
research in mathematics, chemistry, computer science and physics to more applied topics in
engineering and technology’ (EPSRC, 2004). Many of the EPSRC research activities are co-funded
between programmes to encourage multidisciplinary collaborations, since major breakthroughs
often arise when researchers from other related disciplines work together.
5.1.3 Eligibility of Investigators
Principal investigators should be permanent employees of an eligible research organisation (all
UK universities and similar research organisations are eligible organisations). Fixed-term
employees may be eligible provided that the organisation will give them all the support normal
for a permanent member of staff and that there is no conflict of interest between the
investigator’s obligations to the EPSRC and the other organisation (EPSRC Funding Guide, 2004).
‘Research Assistant can be identified as Co-Investigators if they have made a substantial
contribution to the development of the application and will be closely involved with the project, if
funded. Then the application can seek funds for the assistant’s salary for the duration of the
project’ (EPSRC Funding Guide, 2004). A research assistant cannot be the principal investigator.
Moreover, research proposals will not be considered from an applicant who was the principal
investigator of another grant and who has not yet finished producing the Final Report.
5.1.4 Research Opportunities
The majority of funding from the EPSRC is supported through the Responsive Mode, where the
research idea is determined by the applicant and where the proposals can be submitted at any
time. The main criterion against which the proposal is assessed is its ‘intrinsic engineering or
scientific excellence’ (EPSRC Funding Guide, 2004) as determined by peer review. The EPSRC
especially encourages research proposals that are adventurous, with new concepts and techniques.
First Grant Scheme
First Grant Scheme is used to assist individuals at the beginning of their academic careers by
offering them a research grant. To be eligible for the First Grant Scheme, candidates must
have been appointed to their first academic lecturing appointment in a UK university within
the previous 24 months and should be within ten years of completing their PhD. Candidates
wholly employed as research fellows are not eligible to apply (EPSRC Funding Guide, 2004).
The scheme provides up to £120,000 of support. Proposals that have received two or more strong
references will be considered by a peer review panel along with the other First Grant
applications. First Grant proposals will not be considered against other types of proposals at
the same time.
5.1.5 How to Apply
Since 31 March 2005, applications for research grants can only be made via an electronic form
through the Je-S (Joint Electronic Submission) system, and each application should be
accompanied by a self-contained ‘Case for Support’.
The ‘Case for Support’ comprises the following (EPSRC Beginners’ Guide, 2004):
• Previous track records (2 sides A4)
• Description of the proposed research and context (purpose, background, project,
  resources, applications, collaboration) (6 sides A4)
• Diagrammatic work plan (1 side A4)
• Annexes (CVs, references, letters of support, equipment quotes, illustrations and named
  research assistants)
A good application contains a ‘Case for Support’ which is clear, concise and uncluttered with
technical jargon. The main criterion determining the grade assigned to any grant proposal will
be its scientific quality, but ‘viability and planning, cost-effectiveness and dissemination
plans can be taken into account’ (EPSRC Mock Panel Guidance Notes, 2004). In addition, for First
Grant proposals, the applicant’s own plans for developing their research career and the
commitment of the university to career development may be considered.
5.2 Writing up the Proposal
5.2.1 Original Idea
The development of the pilot project, up to 5 percent of the size of the actual corpus, showed
that a full-scale ICE-Mauritius was feasible using only the Internet to collect the texts.
Therefore, a proposal was drafted in accordance with the EPSRC requirements, with a view to
compiling a high-standard application which could actually be sent for funding.
First, the Je-SRP1 (EPSRC) application form was downloaded from the EPSRC website and filled
in. However, many sections of the form could not be completed until a full detailed plan of how
the pilot project had been developed was written. For instance, sections N (Travel and
Subsistence) and O (Consumables) were difficult to fill in without knowing the actual tools and
stages needed for the project. Moreover, sections such as J (Objectives) and K (Summary) were
not of the best quality when written before more consideration had been given to the outline
and plan of the project.
Therefore, it was decided to write the Case for Support first. Writing the Case for Support was
not an easy task since it had to be clear, concise and attractive. Many details had to be stated
in it: how the corpus would be collected and annotated, its standards, the tools and staff
needed, and the length of the project. To be able to provide these details, and in order to
extrapolate how much time and effort would be needed to collect the full corpus, further
research and calculations were made on the process and development of the pilot project. The
word counts and the times taken for collecting, editing and marking up the texts were used to
estimate lower and upper bounds of the time and person-months needed for the full corpus. From
these estimates, the initial research work plan to collect a one-million-word corpus of
Mauritian English was then organised into seven activity streams, which would take up to three
years to be completed by one post-doctoral research fellow.
As mentioned previously, evidence from the pilot project showed that with this internet collection
technique, the corpus would contain less than one million words due to the limited set of text
categories available on the World Wide Web and this would also result in an unbalanced corpus.
One way around this problem was to collect more texts that were available to compensate for the
missing ones. Another solution was to expand the corpus to a different dimension and this is
explained in the next section.
5.2.2 Expansion of Corpus Design
To compensate for the small amount of text and for the unbalanced text categories, it was
decided instead to expand the corpus to include other types of English from other
English-speaking countries. This would also result in a more ambitious and adventurous project,
which are the characteristics that the EPSRC looks for. With this new objective for the proposal, more
research was needed to find other countries where English is either the official language or
one of the main spoken languages. Twenty countries were chosen to form part of the corpus and
they are as follows: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus,
Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles,
Uganda, Zambia and Zimbabwe. As in Mauritius, English in most of these countries had in one way
or another been highly influenced by other languages, either brought by ancestors or derived
from the local culture. The corpus would hence allow an interesting and useful analysis of the
variation in English across the nations.
Since collecting a fully balanced one-million-word corpus for each country would be impossible,
it was decided that the proposal would target only half a million words for each country,
resulting in an “ICE-lite”. The term “lite” was borrowed from other simplified projects such as
“TEI-lite”, a simpler version of TEI, a standard XML mark-up convention for text corpora (TEI,
2005). To provide compatibility and an enhanced comparison with the other existing projects, a
“lite” version of the 20 teams already in ICE (mentioned in section 2.1.3) would also be
included in the corpus. For each country, numerous websites were easily accessible via the World
Wide Web, and different text categories were available. Therefore, the corpus would aim to
contain approximately 20 million words taken only from the Internet. This meant that more staff
would be required and a new work plan was needed. New, more ambitious estimates were then
calculated. This was done by taking the time needed to collect (20 minutes on average per text)
and annotate (20 to 30 minutes per text) the thirty texts obtained (figure given in section
4.1.3 above), multiplying by 250 texts to obtain the estimates for one country, and then
multiplying the result by 40 for the whole ICE-lite corpus. The overall expected completion time
of the project was kept to three years but, instead of only one research fellow, two more would
be needed. The new research work plan for the ICE-lite was then organised into eight activities
as listed below:
WP1: Collection of Spoken and Written Text of English
WP2: Transcription
WP3: Textual Mark-up
WP4: Word-class tagging
WP5: Syntactic parsing
WP6: Evaluation
WP7: Comparison across dialects
WP8: Dissemination for Exploitation
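The collection and annotation estimate described before the list of work packages can be reproduced with a short calculation. The sketch below uses only the figures quoted in the text; the 2,000-word text length is the standard text size referred to in section 4.1.2, and the number of working hours per person-month is left as a parameter because the report does not state the value that was used. All function and constant names are hypothetical.

```python
# Figures quoted in the text (pilot-project averages).
COLLECT_MINUTES_PER_TEXT = 20         # average time to collect one text
ANNOTATE_MINUTES_PER_TEXT = (20, 30)  # lower and upper time to annotate one text
TEXTS_PER_COUNTRY = 250               # 250 texts of about 2,000 words each
COUNTRIES = 40                        # 20 new countries plus 20 existing ICE teams
WORDS_PER_TEXT = 2000                 # text length of 2,000 words


def corpus_size_words():
    """Target size of the whole ICE-lite corpus in words."""
    return WORDS_PER_TEXT * TEXTS_PER_COUNTRY * COUNTRIES


def effort_bounds_minutes():
    """Lower and upper bounds, in minutes, for collecting and annotating all texts."""
    texts = TEXTS_PER_COUNTRY * COUNTRIES
    low = (COLLECT_MINUTES_PER_TEXT + ANNOTATE_MINUTES_PER_TEXT[0]) * texts
    high = (COLLECT_MINUTES_PER_TEXT + ANNOTATE_MINUTES_PER_TEXT[1]) * texts
    return low, high


def to_person_months(minutes, hours_per_month):
    """Convert minutes to person-months; hours_per_month must be supplied by the caller."""
    return minutes / 60 / hours_per_month
```

With these figures the target is 20,000,000 words, and collection plus annotation come to between 400,000 and 500,000 minutes, that is, roughly 6,667 to 8,333 hours. This covers only collecting and marking up the texts; the other work packages are not included in the calculation.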
The expanded project would thus be beneficial to the government and the educational system in
each of the twenty countries mentioned above and to the existing ICE teams. A comprehensive
description of the different types of English could be obtained from the corpus, and each
country would therefore be able to develop its own reference guides to usage, dictionaries and
other teaching materials. This could help both schools and universities to adapt their methods
of teaching, and especially the structure in which English was taught and spoken, to a better
standard. The comparison across the dialects of English to find any striking similarities or
differences would be useful for further research and teaching methods in each country. It would
also benefit people who wanted to travel to or trade with other English-speaking countries,
since the comparison would provide a useful insight into how they would have to adapt their
language. Once released, the corpus would also be beneficial to other research or academic
institutions across the world. It could be used for comparison or for further research by the
existing corpora or other potential corpora.
Longer-term impacts of the work to be done included:
• Promoting cooperation between English-speaking countries for the purpose of developing basic
  components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures of the 40 countries across the world.
5.2.3 Writing Up Proposal
Once the estimates had been calculated and the work plan designed, the Case for Support was
easier to write, and the application form also became much easier to fill in since the figures
were readily available. The only difficulty was dividing the work among the three research
fellows so that it could be completed within three years. This was done by using only the lower
limits of the estimates, which resulted in quite a tight schedule. Other estimates were
calculated for the costs of travelling, consumables, etc. Staff costs should be calculated
through the COSTA system of the University at http://www.leeds.ac.uk/rsu/COSTA.htm but, because
student access to it is restricted, the estimated costs were taken from another proposal by
Atwell and Al-Sulaiti (2005). It is important to note that one paper application (in Word
format) allows details of only two researchers to be filled in. Therefore, to make the proposal
complete, a second application was needed to add the details of the third researcher. However,
due to the space limit of this report, the second application form
could not be added in the appendices and sections 2 and 3, which requested personal information
about the referees and the investigators, were also omitted. Copies of the first draft of the Case
for Support (which was sent for evaluation) and the first application form are shown in
Appendices I and J respectively.
6. Evaluation
To measure the success of a project, it needs to be evaluated against a number of relevant
criteria. For this particular project, the criteria that were set up are:
• Product: Evaluates the design and compilation of the final product.
• Minimum Requirements: Evaluates which minimum requirements were met.
• Project Stages: Evaluates the methodology used to produce the final product.
• Planning and Schedule: Evaluates the planning of the project from start to finish.
6.1 Product
The product was evaluated by three subject experts, namely Eric Atwell, Gerald Nelson and Serge
Sharoff. Eric Atwell was the supervisor of the project. His evaluation is not discussed in this
report since he provided feedback throughout the course of the whole project. Gerald Nelson,
from UCL, is the coordinator of the International Corpus of English and has been directly
involved in the development of ICE-GB, the British component of ICE. Serge Sharoff, from the
Centre for Translation Studies of Leeds University, has been involved in the development of
several corpora, such as a Russian corpus and a Chinese corpus, which he collected solely
through the Internet.
Evaluation from Gerald Nelson:
Both the proposal and part of the pilot project were sent to Gerald Nelson and his first
explicit comment was “May I say, first of all, that I am very impressed by this proposal. It
shows an amazing knowledge of corpus linguistics, and of issues in world Englishes.” It can
therefore be said that both the proposal and the pilot project met the requirements and were of
a good standard. In his feedback, Gerald Nelson also implicitly suggested some improvements that
could be made before the proposal is sent to the EPSRC and some issues that should be addressed
concerning the ICE-lite if the funding is obtained.
Issues about the ICE-lite are:
• The text files should be named according to the ICE coding scheme, not as LETT01, etc. as
  described above.
• The TEI headers should be stored externally as separate files.
• The details in the headers should follow the ICE scheme.
• Permission letters could cause problems to other ICE teams since they are strictly
  non-commercial, whereas the permission letters sent stated “We may also want to use the
  text(s) for developing electronic products such as translators and dictionaries.”
• The distribution method of the ICE-lite.
In his emails, Gerald Nelson agreed to send full details of the ICE filename and header
conventions but, because of his busy schedule, he was not able to do so before the report was
due. Consequently, no improvement could be made to the pilot project. Also, for the purpose of
this project, it was decided to set aside the issues of the non-commercial status of the corpus
and of its distribution until funding was obtained.
Improvements to the proposal include:
•
Gerald Nelson suggested that the parsing should be dropped altogether since syntactic
parsing of the whole corpus is quite unrealistic, given the timescale involved. For ICE-GB, it
took about 3 years to parse one million words, and there were six or seven part-timers
working on it. He also suggested that the aim should be to produce a fully-checked POS-tagged
corpus and to consider the parsing as another follow-up project.
•
Changes to the wording of the proposal, such as:
o
Page 1, paragraph 1: "where English is the main language" to be changed to "where
English is the first language or second official language".
o
Page 2, line 1: Delete "Australia" as it is not yet available.
o
Page 2, line 5: "and other freely available sources": more details should be given.
o
Page 4, line 3: "a software" should be changed to "a program".
o
Page 5, Staff: It is unlikely to get post-doctoral researchers working on this project.
Therefore “post-doctoral” should be changed to "post-graduate".
Despite the small changes needed, Gerald Nelson’s closing comment in his feedback, “As I said,
this is a very impressive proposal, and you can count on my full support (and the Survey's) if it
gets funded”, indicates that the pilot project and the proposal were successful and show great
potential for a follow-up project.
Evaluation from Serge Sharoff:
After reading the proposal, Serge Sharoff implicitly expressed his approval by saying “I read the
proposal with interest”. By commenting on possible extensions and on how he could contribute to
the project, he also showed that he wanted to participate and that he considered the proposal
feasible and worth following up.
He proposed to contribute in two ways, as described below:
•
In WP6 (Evaluation), he proposed to add a lexical comparison of the new ICE-lite against the
British National Corpus (similar to what he had done in one of his papers on Internet corpora).
•
In WP8 (Dissemination), he proposed to disseminate data through his web interface, which he
referred to as the Leeds CQP interface. There is no publication on it yet, but he is more than
willing to write a paper on it if the project goes ahead.
Another useful suggestion from Serge Sharoff was to use Google to estimate the size of the
source texts available for each country. He had tried finding English texts from Mauritius by
entering the query “allintext: that OR in OR for site:.mu” in Google, which returned 125,000
English pages, corresponding to about 250 million words (assuming an average Internet page of
about 2000 words). This method could therefore be used to estimate the size of the texts
available online for each country in the ICE-lite project.
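The estimate above is simple arithmetic and can be sketched in a few lines of Python. This is
only an illustration: the function name estimate_words is hypothetical, the 2000 words per page
is the rough average assumed above, and the one-million-word target is the size quoted for
ICE-GB, used here as an assumed size for a full component.

```python
def estimate_words(page_count: int, words_per_page: int = 2000) -> int:
    """Rough size of the online text pool: page hits times assumed words per page."""
    return page_count * words_per_page


# Figures reported for the "site:.mu" query.
mauritius_pages = 125_000
mauritius_words = estimate_words(mauritius_pages)

# Assumed target size: one million words, as quoted for ICE-GB.
ICE_COMPONENT_WORDS = 1_000_000
coverage = mauritius_words / ICE_COMPONENT_WORDS

print(f"Estimated words online: {mauritius_words:,}")
print(f"Multiples of a one-million-word component: {coverage:.0f}")
```

The same function could be called with the page count returned for any other country domain,
bearing in mind that the result is only as reliable as the search engine's hit count and the
assumed page length.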
He also raised an important issue concerning the collection of the texts. According to him, it
would be difficult to know whether a text was written by someone from a specific country. That
is, you could not be sure that a text obtained from a Gambian website, for instance, was actually
written by someone born in Gambia. For the pilot project, this problem was not encountered:
coming from Mauritius, I could easily tell a text written by a Mauritian citizen from one that
was not, either by looking at the name of the author or by looking at the structure and the
words used, since Mauritian English has its own particularities and often includes words from
other dialects.
However, this could be a potential problem for a full-scale project and the issue would need
further investigation if the proposal were to be funded. Owing to lack of time, this issue could
not be resolved in this project.
6.2 Minimum Requirements
The minimum requirements were:
•
Develop a small-scale prototype of the Mauritian Corpus of English.
Section 3 described how the prototype had been designed and planned, with details of the
different tools and techniques that are available for use. The development of the prototype itself
was detailed in sections 4.1 and 4.2. As the prototype was being developed, some amendments to
the original plan were needed. Overall, the prototype makes up 5 percent of the full-scale
ICE-Mauritius Corpus.
•
Survey of computer technologies for corpus development and processing.
The different technologies available for corpus development and processing have been mentioned
throughout the report. In particular, the different taggers and parsing systems available
worldwide were outlined in section 2.3, while the techniques used specifically for ICE were
described in section 2.2.
The possible extension was:
•
Work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.
To be able to build a work plan for a follow-up project, the pilot project had to be well understood
and documented (which formed part of sections 4.1 and 4.2). Research into the relevant Research
Council, namely the Engineering and Physical Sciences Research Council (EPSRC), also had to be
carried out in order to know the requirements for grant applications. These requirements were
described in section 5.1, while the steps taken in writing the application form and the proposal
were described in section 5.2.
6.3 Project Stages
The overall quality of the project was also assessed by applying the following criteria to each of
the different stages of the project to see if they were appropriate to solve the initial problem, and
their relevance in the development of the solution. The criteria were:
•
Was the background research of a suitable standard, did it help to understand the problem and
did it help to gather the learning requirements?
•
Was the chosen methodology suitable for the project and was it adhered to?
•
Were the requirements gathered effectively and did the final product successfully meet these
requirements?
•
Were the appropriate technologies used in the creation of the pilot corpus?
•
Did the project solve the initial problem, was the final prototype of a sufficient standard to
prove the feasibility of a full-scale project, and was the proposal of a sufficient standard to
send to the EPSRC for funding?
Background research
The background research helped to fully understand the problem and therefore what the project
should actually achieve. It gave an insight into the emergence of corpora and their increasing use
in teaching and research. Research on ICE showed that there are only a few existing corpora and
that many English-speaking countries could benefit from a compilation of their own variety of
English. Findings from the ICE website and from books on corpora were then used to design and
set the standards for ICE-Mauritius. In addition, the different techniques available were
researched to facilitate the collection and annotation of the pilot corpus.
Methodology
The most significant problem encountered in the course of this project was the need to
modify the aims and requirements of the project at the beginning of the second semester. This
also meant changing the work plan and methodology.
The “Feedback Model” used for this project, as described in section 3, proved to be a good choice
throughout the project. Many changes were made to the initial design after flaws became apparent
in the encoding phase of the project. The following steps were taken during the development:
•
First the problem was analysed, that is, the need for a Mauritian Corpus was identified
(section 2).
•
Then a system study was carried out and the findings showed that collecting a corpus is costly
and time-consuming and that using the Internet would be a solution to the problem (sections 1
and 2).
•
The pilot project was designed next and this was explained in section 3.
•
In the following stage, the corpus was collected and annotated (section 4) and the proposal
was written (section 5).
•
As the collection and annotation were carried out, it was found that many changes to the design
were needed (sections 4 and 5).
•
Finally, the pilot project was evaluated as described in section 6.1.
Corpus requirements
The initial requirements for ICE-Mauritius were gathered from the ICE website and from
ICE-related books, which contain almost everything that needs to be included in an ICE sub-corpus.
However, some more detailed requirements, such as the naming scheme for each text or the minimum
amount of information to be included in the header, were not specified. As mentioned above,
Gerald Nelson from UCL agreed to send those details during the Easter break but he did not get
around to doing so. Therefore, only the basic requirements from the official ICE website were
applied to the pilot ICE-Mauritius.
In addition to this, annotation requirements were also gathered from the background research
into the different technologies available. These were general requirements that any corpus
should meet and were not specific to ICE.
The prototype was evaluated by the people mentioned in the section above to see if the initial
design was adequate, as well as to provide additional feedback. As a result, they agreed that
the pilot project met the basic requirements of ICE.
In relation to the proposal, the requirements were taken directly from the EPSRC application
guide. According to the feedback obtained, the proposal met the requirements of the EPSRC
and hence constituted a potential application for a follow-up project.
Technologies
Apart from Microsoft Word, which was used to collect and annotate the corpus manually, other
technologies and tools were discussed throughout the report. It was seen that programs such as
HTMASC could facilitate the stripping of HTML coding from the texts to produce ASCII text files,
while UniRed was used to provide faster and more error-free compilation and saving of the
marked-up texts. However, other specific corpus tools such as ICECUP or ICETREE could not be used
or tested since they are not freely and easily available.
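The HTML-stripping step described above can be illustrated with a short sketch. This is not
HTMASC itself, whose internals are not described in this report; it is a stand-in written with
the Python standard library, and the names TextExtractor and html_to_ascii are hypothetical.

```python
from html.parser import HTMLParser
import unicodedata


class TextExtractor(HTMLParser):
    """Collects text content while skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_ascii(html: str) -> str:
    """Strip HTML tags and reduce the remaining text to plain ASCII."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Decompose accented characters, then drop anything outside ASCII.
    normalized = unicodedata.normalize("NFKD", parser.text())
    return normalized.encode("ascii", "ignore").decode("ascii")


print(html_to_ascii("<h1>School Fees 2004</h1><p>Dear Parents</p>"))
```

Note that reducing the text to ASCII loses accents (for example "Curé" becomes "Cure"), so a
real corpus might prefer to keep such characters rather than discard them.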
Initial problem
The initial problem identified the need for an ICE-Mauritius. Developing a full-scale ICE project
would have been impossible within this project. Therefore this project concentrated on developing
a prototype of the ICE-Mauritius, investigating data sources, instigating data collection and
looking at the different technologies available, in order to investigate the requirements and
feasibility of a larger-scale follow-on project. The pilot project, together with the feedback
obtained, showed that a full-scale ICE-Mauritius is feasible. However, a more ambitious follow-on
project was described in the proposal which, according to the feedback obtained, is a plausible
application to the EPSRC for funding.
6.4 Planning and Schedule
Up until the mid-project report, the schedule was followed very closely and everything was going
according to plan. However, after feedback was obtained from the assessor in January, it became
clear that a new direction for the project had to be devised with some changes to the aims and
requirements. This resulted in a new schedule for the second semester. Both the old and new
schedules were shown in section 1.5.
To meet the new aims and requirements, more work was required in a much more restricted amount
of time. More background research was needed to understand the problem, and the one week
allocated was not enough. Moreover, as the corpus was being developed, it proved difficult to
design the corpus in advance since the categories to be included would vary depending on the
texts collected. Therefore, the texts had to be collected first and then classified accordingly.
Also, while the schedule stated that the feasibility investigation of ICE-Mauritius and the writing
up of the proposal would be done after the pilot project was compiled, drafting the proposal
alongside compiling the corpus was easier, since the different steps taken were noted as they were
carried out and new ideas kept surfacing for the final proposal. Drafting the proposal also meant
that the feasibility of ICE-Mauritius was being addressed at the same time.
The initial schedule had been created without taking into account that, just before the end of the
second term, other projects and essays would have to be submitted, and therefore not much time
would be available to work on the project. The schedule hence had to be revised again to account
for this flaw. With a clearer view of the amount of work the project would entail, the
development of the corpus and the writing up of the proposal were both scheduled to be completed
before the Easter break, to leave enough time to evaluate and write up the rest of the project
during the holidays. This goal was achieved and, with only slight revisions of the corpus and of
the proposal needing to be done during the Easter break, there was enough time to evaluate the
project. However, the time needed to get feedback from the different people to whom the corpus
and the proposal were sent was underestimated. Feedback was obtained only in the last week of the
Easter break, leaving not much time to write the evaluation. Nevertheless, the write-up was
completed with a week to spare before submission and the time was used to revise the final report.
Schedule 3 below shows the revised project schedule.
Schedule 3: Revised Project Schedule for second semester
Dates                Milestones                      Tasks
24/01/05 - 31/01/05  Section 1                       Decide on new aims & objectives and design new plan
01/02/05 - 12/02/05  Section on Background Research  Research on methods available to extend the ICE corpus to Mauritius
13/02/05 - 20/02/05  Appendix C, D                   Collect sample texts from the Internet & send request for copyright permission
20/02/05 - 22/02/05  Appendix B and Section 4        Design layout and text categories of ICE-Mauritius
23/02/05 - 18/03/05  Section 4 and Appendix E        Annotate corpus
23/02/05 - 18/03/05  Appendix F                      Draft a proposal for ICE-Mauritius
01/03/05 - 18/03/05  Section 4 and proposal          Investigate feasibility of ICE-Mauritius
18/03/05 - 18/04/05  Evaluation                      Evaluate corpus & proposal
01/04/05 - 26/04/05  Final Report
7. Conclusion
As stated in the first section, the aim of this project was “to develop a prototype of the Mauritius
component of the International Corpus of English, to demonstrate feasibility and potential
problems for a larger-scale follow-up project”. Throughout this project, both the benefits and
the difficulties of developing a corpus were discovered, together with the techniques and tools
available for its development. The outcome was a prototype of the ICE-Mauritius amounting to
5 percent of the size of a full ICE corpus. In addition, a work plan for the follow-up project
was drawn up, thereby showing the feasibility of an ICE-Mauritius collected only through the
Internet. To summarise, the project fulfilled its minimum requirements as well as its suggested
extension, and it went even further by providing a full proposal to the EPSRC for funding a much
wider and more ambitious ICE-lite project.
Despite some issues which would need further consideration for the ICE-lite, the two evaluators
and field experts mentioned above showed much interest and approval, which indicates its
success. As future work, it is hoped that the proposal will be sent to the EPSRC and that the
prototype will be developed into a larger-scale project.
References:
Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic.
Unpublished MSc thesis. University of Leeds.
Atwell, E. and Al-Sulaiti, L. (2005) Development of the International Corpus of Arabic.
EPSRC Application Form (not yet submitted). University of Leeds.
Atwell, E. (1983) Constituent Likelihood Grammar. ICAME Journal (7) pp34-67.
Atwell, E., Demetriou, G., Hughes, J., Schiffrin, A., Souter, C., and Wilcock, S. (2000)
A comparative evaluation of modern English corpus grammatical annotation
schemes. ICAME Journal (24) pp 7-23
Atwell, E. (2004) Gambian English ICE Corpus. University of Leeds, School of Computing.
[News Group].
Baker, P. et al. (2003) Constructing corpora of South Asian languages. In Proceedings of the
Corpus Linguistics 2003 conference, 16(1), 71-80.
BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th
November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity
Breyer, Y. (2005) Gateway to Corpus Linguistics on the Internet [online]. [Accessed 15th
February 2005]. Available from World Wide Web:
http://www.corpus-linguistics.de/corpora/corp_engl_a_e.html
Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (2005) A Practical Part-of-Speech Tagger.
Palo Alto: Xerox Palo Alto Research Centre.
Department of English Language & Literature, University College London (2002) The
International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004].
Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#
Edwards, J. (1995) Principles and alternative systems in the transcription, coding and mark-up
of spoken discourse. In Leech, G., Myers, G. and Thomas, J. (ed.) (1995) Spoken English on
Computer: Transcription, mark-up and application. Harlow: Longman.
EPSRC, The Engineering and Physical Sciences Research Council (2004) The EPSRC web
site [online]. [Accessed 23rd October 2004]. Available from World Wide Web:
http://www.epsrc.co.uk/
EPSRC (2004) EPSRC Funding Guide web site [online]. [Accessed 7th November 2004].
Available from World Wide Web: http://www.epsrc.co.uk/
EPSRC (2004) EPSRC Research Grants Beginners’ Guide [online]. [Accessed 26th October
2004]. Available from World Wide Web: http://www.epsrc.co.uk/
EPSRC (2004) EPSRC Mock Panel Guidance Notes [online]. [Accessed 6th November 2004].
Available from World Wide Web: http://www.epsrc.co.uk/
EPSRC (2004) Guidance Notes for completing the Je-SRP1 (EPSRC) form [online].
[Accessed 9th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/
Fang, A. (1996) AUTASYS: Grammatical Tagging and Cross-Tagset Mapping. In
Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of
English. Oxford: Clarendon Press.
Garside, R. and Smith, N. (1997) A Hybrid Grammatical Tagger: CLAWS4. In Garside, R.,
Leech, G and McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from
Computer Text Corpora. London; New York: Longman.
Greenbaum, S. (1991b) The development of the International Corpus of English. In Aijmer,
K. and Altenberg, B. (eds.) English Corpus Linguistics. Studies in Honour of Jan Svartvik.
London: Longman. Pp. 83-91.
Greenbaum, S. (1996) Introducing ICE. In Greenbaum, S. (ed.) (1996) Comparing English
Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005].
Available from World Wide Web:
http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD
Kučera, H. and Francis, W.N. (1967) Computational analysis of present-day American
English. Brown University Press, Providence, Rhode Island.
Laudon, K. and Laudon, J. (2002) Management Information Systems – Managing the Digital
Firm. 7th edition. New Jersey: Prentice Hall.
Leech, G. (1997a) Introducing corpus annotation. In Garside, R., Leech, G and McEnery, T.
(ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora.
London; New York: Longman.
Leech, G. (1997b) Grammatical Tagging. In Garside, R., Leech, G and McEnery, T. (ed.)
(1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London;
New York: Longman.
Leech, G. and Eyes, E. (1997) Syntactic Annotation: Treebanks. In Garside, R., Leech, G and
McEnery, T. (ed.) (1997) Corpus Annotation: Linguistic Information from Computer Text
Corpora. London; New York: Longman.
Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993) Building a large annotated corpus
of English: the Penn Treebank, Computational Linguistics, 19(2), 313-30.
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge
University Press.
Nelson, G. (1991a) Manual for Spoken Texts. London: Survey of English Usage, University
College London.
Nelson, G. (1991b) Manual for Written Texts. London: Survey of English Usage, University
College London.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing
English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the
British component of the International Corpus of English. Philadelphia: John Benjamins
Publishing Company.
Novacek, W. (2000) Bite’n’Byte: Software Development [online]. [Accessed 17th February
2005]. Available from World Wide Web: http://www.bitenbyte.com/
Quinn, A. and Porter, N. (1996) ICE Annotation Tools. In Greenbaum, S. (ed.) (1996)
Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon
Press.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available
from the World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sampson, G. (1995) English for the computer: The SUSANNE Corpus and analytic scheme.
Oxford: Clarendon Press.
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In
Archer, D., Wilson, A. and Rayson P. (eds.) Corpus Linguistics Around the World.
Amsterdam: Rodopi.
SourceForge (2005) Project: UniRed: Summary [online]. [Accessed 16th February 2005].
Available from World Wide Web: http://sourceforge.net/projects/unired
The School of Computing, University of Leeds (1998-2004) The University of Leeds web site
[online]. [Accessed 21st October 2004]. Available from World Wide Web:
http://www.comp.leeds.ac.uk/research/index.shtml
APPENDIX A: Personal Experience
Since I am doing a Joint Honours degree in Computing and Management, I was looking for a
project which would allow me to combine knowledge gained from both subjects. My first choice,
which was to develop a training course for research students on how to apply for funding,
therefore seemed ideal at the time. When the project coordinator advised that I should relate my
project more to computing aspects, I thought I would manage to do so by building an on-line
training course. However, after feedback was obtained from the assessor in January, I was in a
total state of shock and disappointment. At that moment, I realised that I should have listened
to the advice I was given. It was clear that I would have to consider a new outline for my
project. This meant that I needed to stop feeling sorry for myself and start working even harder
right away.
Also, being a Joint Honours student and taking on a 40-credit Computing project meant that I
could only take another 20 credits of Computing modules in the final year. In addition, with only
a subset of level 1 and level 2 modules, it was difficult to take other modules that would have
been relevant to the project such as Knowledge Management or Natural Language Processing.
Therefore, from this project, a number of lessons were learnt and the following advice can be
given to future students:
•
Choose a project that meets the requirements of the School. It is important to know what the
School is expecting, what your supervisor and assessor are expecting and, most of all, what
constitutes a good level 3 project. One recommendation is to read the final year project website
carefully, and at least one past project, before deciding on and starting yours.
•
Choose a project with a purpose or that interests you. The project runs over two semesters and
it is guaranteed that your initial enthusiasm will not last over the full course of the project.
Therefore it is important that you choose a topic in which you have at least some interest, or
which you care about and want to get involved in.
•
Choose a project that is relevant to your course. Especially if you are a Joint Honours
student, choose a project that allows you to make some use of the other half of your course,
such as project planning/management for Computing and Management students.
•
Always listen to advice given by your supervisor, the project coordinator and anyone else
involved in the project. These people are more experienced and are here to guide you, so do
not hesitate to contact them when you are confused. Don’t think you know best and can solve
the problems by yourself. Also, the weekly meetings with the supervisor are very useful and
should be attended.
•
Do a considerable amount of background reading. Background reading may seem a waste of
time, but it is very important for delivering a project of high quality. Firstly, the more you
learn about something, the more interested and involved you get; secondly, background reading
helps you understand the problem at an early stage and makes it easier to work on the project.
•
Don’t plan on being able to work consistently. You will still have a lot of coursework and
other assignments to complete, and allowances must be made for these if the rest of your
studies are to be unaffected by the extra work the project requires. Similarly, leave time for
the exam periods and time for yourself to take a break. You will need it!
•
Don’t leave the write-up for the end. Always keep track of what you are doing and write the
report as you go along. Then, it is less likely that you will forget to include something crucial
to your project and it saves you from being stressed nearer the deadline.
•
Allow extra time and effort for evaluation. Any good evaluation relies on other people’s
opinions or experience. However, third parties are very busy people and getting them involved
may take longer than you expect. Therefore, ensure that your schedule is flexible. One
recommendation is to start by requesting their help, then begin your own evaluation of the
project, and drop everything when they are ready to help you.
•
Never give up. There will be times during the course of the project when everything seems to
go wrong and you feel desperate, but remember that there is always a solution and that you are
not the only one going through this nightmare.
My overall experience of this project has had both its good and its chaotic times; it was
difficult to restart the project in the second semester, but I have enjoyed the development of
the pilot corpus and the writing of the proposal. The chance to work on a project of this size
has given me the opportunity to develop project management, time management and report writing
skills, which have already proven vital in my work outside University.
APPENDIX B: Markup Symbols
Written Text Markup Symbols
Symbol                        Description
<#>                           Text unit marker
...                           Subtext marker
<l>                           Linebreak marker
...                           Paragraph marker
<h>...</h>                    Heading
<w>...</w>                    Orthographic word
<X>...</X>                    Extra-corpus text
<?>...</?>                    Uncertain transcription
<O>...</O>                    Untranscribed text
<.>...</.>                    Incomplete word
<->...</->                    Normative deletion
<+>...</+>                    Normative insertion
<=>...</=>                    Original normalization
<}>...</}>                    Normative replacement
<&>...</&>                    Editorial comment
<(>...</(>                    Discontinuous word
<)>...</)>                    Normalized discontinuous word
<@>...</@>                    Changed name or word
<sb>...</sb>                  Subscript
<sp>...</sp>                  Superscript
<ul>...</ul>                  Underline
<it>...</it>                  Italics
<bold>...</bold>              Boldface
<typeface>...</typeface>      Change of typeface
<roman>...</roman>            Roman type
<smallcaps>...</smallcaps>    Small capitals
<footnote>...</footnote>      Footnote
<fnr>...</fnr>                Reference to footnote
<space>                       Orthographic space
<quote>...</quote>            Quotation
<del>...</del>                Deleted text
<marginalia>...</marginalia>  Marginalia
<mention>...</mention>        Mention
<indig>...</indig>            Indigenous word(s)
<foreign>...</foreign>        Foreign word(s)
Spoken Text Markup Symbols
Symbol                        Description
<$A>, <$B>, etc               Speaker identification
...                           Subtext marker
<#>                           Text unit marker
<O>...</O>                    Untranscribed text
<?>...</?>                    Uncertain transcription
<->...</->                    Normative deletion
<+>...</+>                    Normative insertion
<=>...</=>                    Original normalization
<.>...</.>                    Incomplete word
<}>...</}>                    Normative replacement
<[>...</[>                    Overlapping string
<{>...</{>                    Overlapping string set
<,>                           Short pause
<,,>                          Long pause
<(>...</(>                    Discontinuous word
<)>...</)>                    Normalized disc. word
<X>...</X>                    Extra-corpus text
<&>...</&>                    Editorial comment
<@>...</@>                    Changed name or word
<w>...</w>                    Orthographic word
<quote>...</quote>            Quotation
<mention>...</mention>        Mention
<foreign>...</foreign>        Foreign word(s)
<indig>...</indig>            Indigenous word(s)
<unclear>...</unclear>        Unclear word(s)
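As a rough illustration of how text marked up with the symbols above might be processed, the
sketch below strips the markup and counts the remaining words. It is not part of the ICE tools:
the function names and the sample text are hypothetical, and the choice to discard the content
of editorial comments and extra-corpus text before counting is an assumption made for this
example only.

```python
import re

# Assumption: the content of these elements is not counted as corpus text
# (editorial comments and extra-corpus text).
EXCLUDED = ["&", "X"]

# Matches any opening or closing symbol such as <#>, <h>, </h> or <$A>.
TAG = re.compile(r"</?[^<>\s]+>")


def strip_ice_markup(text: str) -> str:
    """Remove excluded elements with their content, then all remaining symbols."""
    for name in EXCLUDED:
        esc = re.escape(name)
        text = re.sub(rf"<{esc}>.*?</{esc}>", " ", text, flags=re.DOTALL)
    text = TAG.sub(" ", text)
    # Collapse the whitespace left behind by the removed symbols.
    return " ".join(text.split())


def word_count(text: str) -> int:
    """Count whitespace-separated words after the markup has been stripped."""
    return len(strip_ice_markup(text).split())


# Hypothetical marked-up fragment, for illustration only.
sample = "<#> <h> School Fees 2004 </h> <#> Dear <it> Parents </it> <&> letterhead omitted </&>"
print(strip_ice_markup(sample))
print(word_count(sample))
```

A count obtained in this way could be compared with the word counts listed for each text in
Appendix D.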
APPENDIX C: Corpus Design layout
[Figure: tree diagram of the ICE-Mauritius corpus design. Node labels recoverable from the
diagram: ICE-Mauritius, Spoken, Dialogue, Public, Monologue, Scripted, Unscripted, Written,
Printed, Academic, Popular, Unprinted, Letters, Student Writing.]
Letters
Student
Writing
Type
ID
-
Author
Prosi
Magazine
Publisher
Mauritius Apr-99
Email
Sent?
various
http://sundayvani.in
tnet.mu/Links/Your
%20voice.htm
Sunday Vani -
-
2006 -
266 rhevateeg@ Yes
lbis.intnet.m
u
Assessment and
Rhevatee
Reports - February Gobin
2003
http://www.lebocag
e.net/circular/Asses
sment%20RG.htm
various
507 LBIS@intn Yes
et.mu
http://www.lebocag LETT01 School Fees 2004 Jean-Paul Le Bocage Moka,
Nov-03
e.net/circular/Circul
de Chazal International Mauritius
ar%20fees2004.htm
School
Mauritius
Le Bocage Moka,
Feb-03
International Mauritius
School
Mauritius
323 info@roger Yes
s.mu
Port-Louis, Oct-04
Mauritius
250 ramchurnco Yes
@intnet.mu
762 prosi@bo Yes
w.intnet.
mu
Publisher Date Word
Place
Message from the R.
Mauritius
Reduit,
2001
President of the
Ramchurn Veterinary Mauritius
Association
Association
The 1998 Illovo
Award project
competition:
Summary of
proposals made by
the winning team
from Dr Maurice
Curé State
Secondary School
Title
http://www.rogers. LETT02 Letter to
Hector
Rogers
mu/
Shareholders (New EspitalierGroup Structure) Noël
http://mva.intnet.m
u/messages_files/an
ee.htm
http://www.prosi. STU01
net.mu/mag99/36
3
Web Address
No
not
delivered
Accept
?
APPENDIX D: List of Texts collected
http://ncb.intnet
.mu/
Speech
ID
Title
Author
Publisher
-
http://mva.intne ACAD02 Diseases of Rabbits in
t.mu/articles.ht
Mauritius
m
http://www.uo
Media and Democracy
m.ac.mu/About
Us/Newsle tter/j
une_04.pdf
-
-
Roukaya University of Reduit,
Jun-04
Kasenally Mauritius
Mauritius
-
-
Government Port-Louis, Dec-04
of Mauritius Mauritius
Government Port-Louis, Jan-96
313 roukaya@u Yes
om.ac.mu
1499 -
993 barthestude
nts @
yahoo.co.u
k
1407 -
2072 -
1130 -
Port-Louis, Nov-03
Mauritius
Sent?
818 ncb01@nc Yes
b.intnet.mu
Email
Port-Louis, Apr-04
Mauritius
Publisher Date Word
Place
R.
University of Reduit,
Ramchurn Mauritius
Mauritius
-
Speech by Hon. A.K.
Gayan, Minister of Tourism
and Leisure on the occasion
of the handing over of
certificates to skippers
Address by the PresidentYear 1996
Academic http://sundayva ACAD01 Problems facing the bar
ni.intnet.mu/Lin
student in Mauritius – Law
ks/views.htm
students threatened with a
lethal blow?
http://tourism.g
ov.mu/speech1.
htm
http://mauritius
assembly.gov.m
u/assem96.htm
Speech by Chairman of the Mr Kemraz National
National Computer Board Mohee
Computer
Board
http://ncb.intnet SPEE01 Address by the Hon. Sushil Hon. Sushil National
.mu/medrc.htm
Khushiram, Minister of
Khushiram Computer
Development, Financial
Board
Services and Corporate
Affairs on E-Business
Web Address
Type
not
delivered
not
delivered
Accept
?
Biological Diversity Samad Rojoa
and approaches to its
Conservation
Environment
Protection
Legislation should
be more business
friendly
The Music Scene
http://pages.intnet.
mu/nathraj/article3.
html
http://www.jecmauritius.org/
http://www.infomau
ritius.com/mauritius
/latest/the_music_sc
ene/?sid=35
Teleservices Ltd, the efficient response...
http://www.serviho
o.com/channels/kin
ews/v3dossier_detai
ls.php?id=61438
POP02
The Mauritius
Kestrel, once the
world's rarest bird
http://www.maurine POP01
t.com/wildlife.html
-
-
-
Armand F.
Pampusa
-
Raj Makoond Prosi
Magazine
-
-
-
-
-
-
Mauritius Oct-96
-
-
The Mauritian COMPNet Port Louis, Wildlife
Mauritius
Foundation
-
-
Email
Sent?
384 dahkiam@i Yes
ntnet.mu
643 jec@intnet. Yes
mu
569 -
210 tplus@intn Yes
et.mu
700 -
405 hema@the Yes
mauritianco
nnection.co
m
549 tplus@intn Yes
et.mu
Publisher Publisher Date Word
Place
Miss Hema
Malini Paupiah
Mauritian Sega
Author
http://www.themaur
itianconnection.com
/culture/sega/index.
html
Title
Popular
ID
The transit of Venus Ricaud
Auckbur
Web Address
Academic http://www.serviho
o.com/channels/kin
ews/v3dossier_detai
ls.php?id=43863
Type
Accept
?
http://www.prosi.net.mu/
mag99/365june/pram365
.htm
http://www.businessmag. REP01
mu/displayNewsContent.
asp?NID=5747&CID=30
Bridges of Hope: A
post-violence social
project
Mr. Philippe Boullé: Jacques
“Mauritius is looked Dinan
at as an important
economic entity”
Defuse this time
bomb
http://www.businessmag.
asp?NID=6232&CID=26
-
-
Circle Cycle Tour
wheels in new
sponsor...
http://www.servihoo.com/
channels/kinews/v3dossi
er_details.php?id=44481
-
National literacy and
numeracy strategy
(NL & NS)
http://ministryeducation.gov.mu/majpro
j/natlit.htm
mu/default.asp?CID=10
Prozi
Magazine
Business
Magazine
Business
Magazine
-
-
-
Feb-03
Mauritius Jun-99
Port Louis, Feb-05
Mauritius
Port Louis, Feb-05
Mauritius
-
-
Port Louis, Feb-05
Mauritius
-
Freeport
Port Louis, Operations Mauritius
(Mauritius)
Ltd
Email
451
-
ntnet.mu
1693 busmag@i Yes
ntnet.mu
2150 busmag@i Yes
t.mu
467 tplus@intne Yes
mail.gov.m
u
928 psaddul@ Yes
uom.ac.mu
1012 k.jankeee@ Yes
1739 -
not
delivered
Sent? Accept
?
278 contact@fo Yes
m.co.mu
Place
M.O.
Rotary Club Bakarkhan
-
Author
Competition in the Dr
Business
banking sector:
Chandan Magazine
further evidence:
Jankee
“Actions speak
louder than words”
Mauritius Drug
Profiles
http://rotary.intnet.mu/
Title
Société Du Port
ID
http://www.freeportmauritius.com/holding/
Web Address
Reportage http://www.businessmag. REP02
Popular
Type
Web Address
MEF believes Budget
should provide more for
skill development
Mr. Assad Bhuglah,
Director, Trade Policy
Unit: “The lobbying for
the recognition of SIDS
by the WTO should start
right from now”
CAC: Without Fear and Sir SatcamMauritius
Favour?
Boolell Times
Whatever happened to Sir SatcamMauritius
the Sachs Commission? Boolell Times
asp?NID=6380&CID=8
asp?NID=6349&CID=26
http://mauritiustimes.co
m/060902ssb.htm
http://mauritiustimes.co
m/041002ssb.htm
Business
Magazine
Business
Magazine
Business
Magazine
Business
Magazine
Business
Magazine
Pointe-aux- Oct-02
Sables,
Mauritius
Pointe-aux- Sep-02
Sables,
Mauritius
Port Louis, Mar-05
Mauritius
Port Louis, Mar-05
Mauritius
Port Louis, Mar-05
Mauritius
Port Louis, Mar-05
Mauritius
Port Louis, Feb-05
Mauritius
Email
1020 mtimes@i
ntnet.mu
1012 mtimes@i
ntnet.mu
1541 busmag@i
ntnet.mu
409 busmag@i
ntnet.mu
1468 busmag@i
ntnet.mu
467 busmag@i
ntnet.mu
1182 busmag@i
ntnet.mu
Place
MCB estimates growth
rate at 4.2% last year and
at 5.2%in 2005
Author
asp?NID=6304&CID=8
MCCI stresses the need
to develop a more
business-friendly
environment
Title
Proposals from the
Printers & Stationery
Manufacturers
Association (PSMA)
ID
asp?NID=6381&CID=8
Reportage http://www.businessmag.
asp?NID=6378&CID=8
Type
Sent? Accept
?
1999 Illovo Award
Inter-College Project
Competition
Salt fish in tomato
sauce
Strategic Plan
Cabinet Decisions
taken on 04 March
2005
http://www.prosi.net.
mu/mag99/366july/ilo
vo366.htm
http://ilemaurice.tripod.com/ro
ugpoisal.htm
http://www.uom.ac.m
u/AboutUs/StrategicP
lan/overview.htm
http://pmo.gov.mu/de
cision.htm
-
-
Madeleine
Philippe
-
-
Pointe-aux- Feb-03
Sables,
Mauritius
Pointe-aux- Sep-02
Sables,
Mauritius
-
Government Port-Louis, Mar-05
-
-
Mauritius Jul-99
University of Reduit,
Mauritius
Mauritius
-
Prozi
Magazine
Email
1434 webmasterportal@mai
l.gov.mu
1501 centraladmi
n@uom.ac.
mu
@cjp.net
563 madeleine Yes
intnet.mu
666 prosi@bow. Yes
m.intnet.mu
Yes
not
delivered
Sent? Accept
?
2010 director@ut Yes
1635 mtimes@in
tnet.mu
1476 mtimes@in
tnet.mu
Publisher Date Word
Place
University of Pointe-aux- Feb-04
Technology, Sables,
Mauritius
Mauritius
Robert Lesage should S.
Mauritius
not allow himself to Modeliar Times
be intimidated by
anybody and least of
all by ICAC and his
arrest
Publisher
Sir SatcamMauritius
Boolell Times
Author
http://mauritiustimes.
com/210203mod.htm
Title
The Choice Cannot
Be Clearer
ID
http://mauritiustimes.
com/200902/200902s
sb.htm
Web Address
Instructional http://www.utm.ac.mu INS01 Admission
/
Regulations
Reportage
Type
For a few at the Madhukar Mauritius
cost of the many Ramlallah Times
The Tide Is
Turning
Democratisation Madhukar Mauritius
Ramlallah Times
Harman Dahl's
Legacy
Not your day to
die
Mauritius and
Sugar
http://mauritiustimes
.com/300802edito.h
tm
http://mauritiustime
s.com/200902/2009
02edit.htm
http://mauritiustimes
.com/040305mr.htm
http://pages.intnet. NOV01
mu/rajbalkeehomep
age/hdcomplete.htm
http://pages.intnet.
mu/rajbalkeehomep
age/n-one.htm
http://www.prosi.net.
mu/simau97/prefac
e.htm
Creative
Jacques
Dinan
Prozi
Magazine
Raj Balkee Oceanic
Publishing
Raj Balkee Oceanic
Publishing
Madhukar Mauritius
Ramlallah Times
Madhukar Mauritius
Ramlallah Times
-
-
Date
Mauritius May-97
Mauritius 1995
Mauritius 2001
Pointe-aux- Mar-05
Sables,
Mauritius
Pointe-aux- Sep-02
Sables,
Mauritius
Pointe-aux- Feb-05
Sables,
Mauritius
Pointe-aux- Aug-02
Sables,
Mauritius
Government Port-Louis,
Subservience
MSM style
Publisher
Place
Government Port-Louis,
Publisher
http://mauritiustimes EDIT01
.com/040205mr.htm
Editorial
-
-
Author
Functions of the
National
Assembly
Title
http://mauritiusasse
mbly.gov.mu/role/f
unction.htm
ID
Our Constitution
Web Address
Instructional http://www.gov.mu/
govt/g_const.htm
Type
Email
Sent?
1857
-
ntnet.mu
1988 rajbalkee@i
ntnet.mu
1890 rajbalkee@i Yes
net.mu
820 mtimes@int
net.mu
709 mtimes@int
net.mu
835 mtimes@int
net.mu
862 mtimes@int Yes
l.gov.mu
l.gov.mu
Word
Yes
Yes
Accept
?
APPENDIX E: Sample of the Letters of Copyright
Sample 1: First letter to explain the purpose of the corpus and for the owners
and authors to keep
10 February 2005
Dear General Director of
Request for permission to use texts for linguistic research
Creation of a Mauritian Corpus of English
I am working on a student project at the University of Leeds that involves collecting English
texts from Mauritian people in electronic form and storing them on a computer to create a
corpus that may be freely available to all via the Web.
I believe that you are the owner of the text(s) of on the website:
I would like to use the text(s) as part of the corpus. People would be able to access your
text(s) and the text(s) of others for further research and teaching. We may also want to use the
text(s) for developing electronic products such as translators and dictionaries.
I would be very grateful if you would grant to myself and the University of Leeds a free and
perpetual non-exclusive licence for the above purposes only.
In consideration for your consent mentioned above, I will gladly acknowledge your
contribution in any relevant material.
If you agree to above and can confirm that there are no other third parties that have any
further rights in the text(s) that I need to contact, please acknowledge your acceptance to this
by returning signed and dated the attached copy of this letter.
Yours faithfully
Dolly Koo
Phone: [0044 - 7818855441]
Email: [jhs2dlyk@leeds.ac.uk]
Address: [c/o Mr Eric Atwell, Senior Lecturer
University of Leeds
Leeds
LS2 9JT
United Kingdom]
Sample 2: Second letter for authors and owners to sign if they agree for their
websites to be used.
10 February 2005
Dear General Director of
Request for permission to use texts for linguistic research
Creation of a Mauritian Corpus of English
I am working on a student project at the University of Leeds that involves collecting English
texts from Mauritian people in electronic form and storing them on a computer to create a
corpus that may be freely available to all via the Web.
I believe that you are the owner of the text(s) of on the website:
I would like to use the text(s) as part of the corpus. People would be able to access your
text(s) and the text(s) of others for further research and teaching. We may also want to use the
text(s) for developing electronic products such as translators and dictionaries.
I would be very grateful if you would grant to myself and the University of Leeds a free and
perpetual non-exclusive licence for the above purposes only.
In consideration for your consent mentioned above, I will gladly acknowledge your
contribution in any relevant material.
If you agree to above and can confirm that there are no other third parties that have any
further rights in the text(s) that I need to contact, please acknowledge your acceptance to this
by returning signed and dated the attached copy of this letter.
This is to confirm to the School of Computing at Leeds University that I agree to give
permission for all the texts on my website to be used as explained to me by the researcher. I
also agree to make the Corpus available for public use by researchers, students and language
engineers.
Name (in block capitals)_____________________________________
Signature: ________________________________________________
Date: ____________________________________________________
APPENDIX F: Template for the header
<tei.2>
<teiHeader id=" ">
<fileDesc >
<titleStmt >
<title > </title >
<author> </author>
<respStmt >
<resp>compiled by </resp>
<name >Dolly Koo</name >
</respStmt >
</titleStmt >
<publicationStmt >
<publisher> </publisher>
<pubPlace> </pubPlace>
<date></date>
</publicationStmt >
<sourceDesc >
created in machine-readable form in “ “ 
</sourceDesc >
</fileDesc >
<encodingDesc >
<projectDesc>
Texts collected for use in the pilot project for ICE-Mauritius, February,
2005
</projectDesc>
<samplingDecl>
Whole text of “ “ words copied from the site 
</samplingDecl>
</encodingDesc >
<profileDesc >
<creation>
<date value=" "> </date>
<rs type="city "> </rs >
</creation>
<textClass>
<textDesc n=" ">
</textDesc>
<particDesc >
<person id=" " sex=" " />
</particDesc >
</textClass>
</profileDesc >
</teiHeader>
<text >
<body >
</body >
</text >
</tei.2>
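As an illustration only, the template above could be filled in programmatically instead of by hand. The Python sketch below uses the standard-library xml.etree.ElementTree module. The function name build_header and the metadata keys are hypothetical, and the sample values are those of the REP14 text shown in Appendices G and H. It is not a record of how the headers in this project were produced.

```python
import xml.etree.ElementTree as ET


def build_header(meta):
    """Build a <tei.2> element following the Appendix F template.

    `meta` is a plain dict; its keys are assumptions made for this sketch.
    """
    tei = ET.Element("tei.2")
    header = ET.SubElement(tei, "teiHeader", id=meta["id"])

    # File description: title, author, compiler and publication details.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = meta["title"]
    ET.SubElement(title_stmt, "author").text = meta["author"]
    resp_stmt = ET.SubElement(title_stmt, "respStmt")
    ET.SubElement(resp_stmt, "resp").text = "compiled by "
    ET.SubElement(resp_stmt, "name").text = "Dolly Koo"

    pub = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub, "publisher").text = meta["publisher"]
    ET.SubElement(pub, "pubPlace").text = meta["place"]
    ET.SubElement(pub, "date").text = meta["year"]
    ET.SubElement(file_desc, "sourceDesc").text = (
        "created in machine-readable form in " + meta["url"]
    )

    # Encoding description: project note and sampling declaration.
    enc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(enc, "projectDesc").text = (
        "Texts collected for use in the pilot project for ICE-Mauritius, "
        "February, 2005"
    )
    ET.SubElement(enc, "samplingDecl").text = (
        "Whole text of %d words copied from the site" % meta["words"]
    )

    # Profile description: creation date and place, text class, participant.
    profile = ET.SubElement(header, "profileDesc")
    creation = ET.SubElement(profile, "creation")
    date = ET.SubElement(creation, "date", value=meta["date_value"])
    date.text = meta["date_text"]
    ET.SubElement(creation, "rs", type="city").text = meta["city"]
    text_class = ET.SubElement(profile, "textClass")
    ET.SubElement(text_class, "textDesc", n=meta["text_desc"])
    partic = ET.SubElement(text_class, "particDesc")
    ET.SubElement(partic, "person", id=meta["person_id"], sex=meta["sex"])

    # Empty text body, to be filled with the encoded text itself.
    text = ET.SubElement(tei, "text")
    ET.SubElement(text, "body")
    return tei


# Sample values taken from the REP14 example in Appendices G and H.
example = {
    "id": "REP14",
    "title": "Robert Lesage should not allow himself to be intimidated by "
             "anybody and least of all by ICAC and his arrest",
    "author": "S. Modeliar",
    "publisher": "Mauritius Times",
    "place": "Pointe-aux-Sables, Mauritius",
    "year": "2003",
    "url": "http://mauritiustimes.com/210203mod.htm",
    "words": 1637,
    "date_value": "2003-02",
    "date_text": "Feb 2003",
    "city": "Pointe-aux-Sables",
    "text_desc": "01",
    "person_id": "P1",
    "sex": "male",
}

root = build_header(example)
xml_text = ET.tostring(root, encoding="unicode")
```

Building the tree element by element, rather than pasting strings into the template, means that characters such as & or < in a title are escaped automatically when the header is serialised.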
APPENDIX G: Example of Encoded Text
An example of a raw text which belongs to the “Reportage” category, with id
“REP14.xml”.
Title: Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest
Author: S. Modeliar
Publisher: Mauritius Times
Publisher Place: Pointe-aux-Sables, Mauritius
Date: February 2003
Source: http://mauritiustimes.com/210203mod.htm
Email: mtimes@intnet.mu
Amount of Words: 1637
Robert Lesage should not allow himself to be intimidated by
anybody and least of all by ICAC and his arrest
“Robert Lesage should himself write out his statement and send it to the police, to ICAC, to the DPP, to
Transparency International and to the President of the Republic. Only then will he acquire some
legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to
demolish will come crumbling down…”
After the Cuttaree affair and his shielding in circumstances which have been repeated ad nauseam, after a
minister of the present government has been arrested on suspicion of corruption and after a complete inaction in
the case of another minister, another scandal has emerged. This time it does not concern any Swiss bank
involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam though Paul Bérenger has already ruled that
these latter two are guilty. This time one of the most respected banks of the country, the Mauritius Commercial
Bank, better known as the MCB and being considered the best, is involved.
Until the ramifications of what is described as a fraud with regard to the National Pensions Fund (NPF) are known
no blame should be attached to anybody. One section of the press has talked about this and has referred to
Minister Choonee. It is to be hoped that such an attitude becomes a general feature of the press and that the
civilised press as opposed to the gutter press and to the partisan press will be prevailed upon when it comes to
the innocence and reputation of people. This philosophy should also be the hallmark of certain politicians who can
blow hot and cold at the same time.
While pontificating about presumption of innocence in the case of those close to the regime, because
only those who espouse the cause of the supreme leader of Mauritius can aspire to be appointed to
posts in the services including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin
Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the MCB scandal has
emerged he is trying to make a connection between the case at the MCB with the Swiss bank affair. On
what basis he is doing that is not clear and yet he is saying that the whole matter will be fully
investigated. Who will investigate the matter? Is it going to be investigated by ICAC? Whether we like it
or not, ICAC is yet to be perceived as a totally independent institution and totally free from political
influences. Even if it is, the perception is otherwise. At times perception of independence is as important
if not more important than independence itself.
In addition to trying to lay the blame for the MCB scandal on the Labour Party through a Swiss connection, Paul
Bérenger is all praise for the MCB. By so doing Paul Bérenger is already brainwashing public opinion against any
malpractice or offence that may have been committed by the MCB or any member of the MSM or MMM because
it should not be forgotten that it appears that the scandal dates as far back as 1992, a time at which Paul
Bérenger was in a coalition with Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger
shout victory too soon as the investigators would have to find out who were the ministers responsible for the NPF
from 1992 up to today. The dates at which the funds have been misused will have to be determined as well as the
companies that benefited from those transfers of funds.
But how will all this be determined? One would expect an impartial approach to the investigation. This is simply
not possible and the perception is that this is not the case right now. Paul Bérenger has already placed a political
coloration on the whole matter. The MCB has already talked of a total absence of any conspiracy at the level of
the bank and is suggesting that the accusation of conspiracy at the bank finds its source in a personal vendetta of
Robert Lesage. Most disturbing is the attitude of ICAC which has indicated yet once more that it is not functioning
as a completely independent body. If proof is needed it is be found in the very revealing statement of both the
Commissioner of ICAC, Navin Beekharry and that of Robert Lesage. Let us hope that the office of the DPP does
not join the bandwagon.
According to reports Robert Lesage is alleged to have stated that “…being given the new approach taken, I have
decided to withdraw my cooperation with the inquiry altogether and not to make any statement. However, I
confirm that I am still willing to continue my cooperation with the inquiry so long as the line taken since the
beginning. But if such cooperation is resumed, I shall tell the truth, the whole truth and nothing but the truth.” In
fact it would appear that what the ICAC investigators have been trying to do is to accept Robert Lesage’s
statement on part of the scandal or investigation. Mr Beekharry, the independent commissioner of ICAC confirms
this view in a statement to the press. What he says is that the statement of Robert Lesage will be taken according
to procedures and according to revelations made. This is a very disturbing and vague statement and defies all
logic.
Surely when an investigation is underway the person who is willing to make a statement should be allowed to say
all that he knows without any form of censorship and, once everything is taken down, then the investigator can
retain whatever is relevant. The procedure that ICAC is propounding may lead to the conclusion that he does not
want Robert Lesage to say all that he knows in order to shield some people. If this is the case or the perception,
then let ICAC be closed down. Perhaps the novel investigative procedure that is put forward by the independent
commission is unprecedented in the history of investigations. Now that Mr Beekharry has himself admitted that
there has been an attempt to censure the statement of Robert Lesage he should explain to the public, in the
name of transparency, and in the interest of ICAC, what he means by censorship. He should also explain in detail
the procedures of any investigations and especially the taking of statements so that in future well-meaning
citizens who want to expose those who have been making money illegally, will know what stand to take vis-à-vis
so-called independent institutions.
The arrest of Robert Lesage is also very revealing. This man has been praised by many of his former friends and
colleagues as somebody who is clean. He went to the ICAC following the discovery of the misuse of the NPF
funds and was not unduly worried as he told the ICAC investigators what he knew. However when he decided to
make a written statement and is confronted by what seemed to be an arbitrary censorship on what he was going
to say, and when he refused to play that kind of game it is only then that he is arrested. One wonders what Mrs
Indira Manrakhan would have been made to endure if she had adopted such a procedure. Why is that Robert
Lesage was not arrested following his oral statement? What additional information has come to light between the
first appearance of Robert Lesage at ICAC and his arrest? On what basis has he been arrested? In the absence
of a clear and unequivocal communiqué from ICAC, the impression would be that he was arrested in order to
exert pressure on him in order to compel him to say only what the ICAC, for reasons best known to it, wants to
hear.
Rumour has it that politicians of all parties have been named by Robert Lesage. The MCB itself has
said that no proper control of the NPF funds could have been made as high profile people were
involved in the management of those funds. A former financial secretary who is very close to the MSM
has an objection to departure against him. The names of officials at the bank have been named. The
siphoning of funds to private companies has been taking place since the late eighties. Paul Bérenger
was in the 1991 government. Questions also relate to those responsible for the audit of the MCB, the
audit of the NPF funds and the overall responsibility of different politicians who had charge of such
funds. It is not going to be a simple inquiry and censorship, the Beekharry style will certainly not help.
Nobody should be spared. No stone should be left unturned to get to the truth because important
government funds and an important bank are involved.
Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest. He is
being legally advised and as a responsible citizen he should go all the way by making public all that he knows. He
should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency
International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and
only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down. For
too long with the MSM or the MMM, there have been selective investigations with regard to fraud on a political
line. It is high time for things to change. If the world population can get the United States to change its mind on
war with Iraq, why can’t the people of Mauritius organise rallies against fraudsters and their occult institutional
allies?
S. MODELIAR
APPENDIX H: Example of Encoded Text
Example of the encoded version of the text “REP14.xml” above.
<?xml version="1.0" encoding="utf-8" ?>
<tei.2>
<teiHeader id="REP14">
<fileDesc >
<titleStmt >
<title >Robert Lesage should not allow himself to be intimidated by anybody and
least of all by ICAC and his arrest</title >
<author>S. Modeliar</author>
<respStmt >
<resp>compiled by </resp>
<name >Dolly Koo</name >
</respStmt >
</titleStmt >
<publicationStmt >
<publisher>Mauritius Times</publisher>
<pubPlace>Pointe-aux-Sables, Mauritius </pubPlace>
<date>2003</date>
</publicationStmt >
<sourceDesc >
created in machine-readable form in
http://mauritiustimes.com/210203mod.htm
</sourceDesc >
</fileDesc >
<encodingDesc >
<projectDesc>
Texts collected for use in the pilot project for ICE-Mauritius, February,
2005
</projectDesc >
<samplingDecl>
Whole text of 1637 words copied from the site
</samplingDecl>
</encodingDesc >
<profileDesc >
<creation>
<date value="2003-02">Feb 2003</date>
<rs type="city ">Pointe-aux-Sables</rs >
</creation>
<textClass>
<textDesc n="01">
</textDesc >
<particDesc >
<person id="P1" sex="male " />
</particDesc >
</textClass>
</profileDesc >
</teiHeader>
<text >
<body>
<bold>
Robert Lesage should not allow himself to be intimidated by anybody and least
of all by ICAC and his arrest
<it >“Robert Lesage should himself write out his statement and send it to the
police, to ICAC, to the DPP, to Transparency International and to the President
of the Republic. Only then will he acquire some legitimacy in his allegations and
only then that the citadel of fraud that some unfortunately do not want to
demolish will come crumbling down…”</it >

</bold>
After the Cuttaree affair and his shielding in circumstances which have been
repeated ad nauseam, after a minister of the present government has been
arrested on suspicion of corruption and after a complete inaction in the case of
another minister, another scandal has emerged. This time it does not concern
any Swiss bank involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam
though Paul Bérenger has already ruled that these latter two are guilty. This
time one of the most respected banks of the country, the Mauritius Commercial
Bank, better known as the MCB and being considered the best, is involved.
Until the ramifications of what is described as a fraud with regard to the
National Pensions Fund (NPF) are known no blame should be attached to
anybody. One section of the press has talked about this and has referred to
Minister Choonee. It is to be hoped that such an attitude becomes a general
feature of the press and that the civilised press as opposed to the gutter press
and to the partisan press will be prevailed upon when it comes to the innocence
and reputation of people. This philosophy should also be the hallmark of certain
politicians who can blow hot and cold at the same time.
 <marginalia > While pontificating about presumption of innocence in the case of
those close to the regime, because only those who espouse the cause of the
supreme leader of Mauritius can aspire to be appointed to posts in the services
including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin
Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the
MCB scandal has emerged he is trying to make a connection between the case
at the MCB with the Swiss bank affair. On what basis he is doing that is not
clear and yet he is saying that the whole matter will be fully investigated. Who
will investigate the matter? Is it going to be investigated by ICAC? Whether we
like it or not, ICAC is yet to be perceived as a totally independent institution and
totally free from political influences. Even if it is, the perception is otherwise. At
times perception of independence is as important if not more important than
independence itself. </marginalia > 
In addition to trying to lay the blame for the MCB scandal on the Labour Party
through a Swiss connection, Paul Bérenger is all praise for the MCB. By so doing
Paul Bérenger is already brainwashing public opinion against any malpractice or
offence that may have been committed by the MCB or any member of the MSM
or MMM because it should not be forgotten that it appears that the scandal
dates as far back as 1992, a time at which Paul Bérenger was in a coalition with
Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger
shout victory too soon as the investigators would have to find out who were the
ministers responsible for the NPF from 1992 up to today. The dates at which the
funds have been misused will have to be determined as well as the companies
that benefited from those transfers of funds. 
But how will all this be determined? One would expect an impartial approach
to the investigation. This is simply not possible and the perception is that this is
not the case right now. Paul Bérenger has already placed a political coloration
on the whole matter. The MCB has already talked of a total absence of any
conspiracy at the level of the bank and is suggesting that the accusation of
conspiracy at the bank finds its source in a personal vendetta of Robert Lesage.
Most disturbing is the attitude of ICAC which has indicated yet once more that it
is not functioning as a completely independent body. If proof is needed it is be
found in the very revealing statement of both the Commissioner of ICAC, Navin
Beekharry and that of Robert Lesage. Let us hope that the office of the DPP
does not join the bandwagon. 
According to reports Robert Lesage is alleged to have stated that “…being
given the new approach taken, I have decided to withdraw my cooperation with
the inquiry altogether and not to make any statement. However, I confirm that
I am still willing to continue my cooperation with the inquiry so long as the line
taken since the beginning. But if such cooperation is resumed, I shall tell the
truth, the whole truth and nothing but the truth.” In fact it would appear that
what the ICAC investigators have been trying to do is to accept Robert Lesage’s
statement on part of the scandal or investigation. Mr Beekharry, the
independent commissioner of ICAC confirms this view in a statement to the
press. What he says is that the statement of Robert Lesage will be taken
according to procedures and according to revelations made. This is a very
disturbing and vague statement and defies all logic.
Surely when an investigation is underway the person who is willing to make a
statement should be allowed to say all that he knows without any form of
censorship and, once everything is taken down, then the investigator can retain
whatever is relevant. The procedure that ICAC is propounding may lead to the
conclusion that he does not want Robert Lesage to say all that he knows in
order to shield some people. If this is the case or the perception, then let ICAC
be closed down. Perhaps the novel investigative procedure that is put forward
by the independent commission is unprecedented in the history of
investigations. Now that Mr Beekharry has himself admitted that there has been
an attempt to censure the statement of Robert Lesage he should explain to the
public, in the name of transparency, and in the interest of ICAC, what he means
by censorship. He should also explain in detail the procedures of any
investigations and especially the taking of statements so that in future well-meaning citizens who want to expose those who have been making money
illegally, will know what stand to take vis-à-vis so-called independent
institutions.
The arrest of Robert Lesage is also very revealing. This man has been praised
by many of his former friends and colleagues as somebody who is clean. He
went to the ICAC following the discovery of the misuse of the NPF funds and
was not unduly worried as he told the ICAC investigators what he knew.
However when he decided to make a written statement and is confronted by
what seemed to be an arbitrary censorship on what he was going to say, and
when he refused to play that kind of game it is only then that he is arrested.
One wonders what Mrs Indira Manrakhan would have been made to endure if
she had adopted such a procedure. Why is that Robert Lesage was not arrested
following his oral statement? What additional information has come to light
between the first appearance of Robert Lesage at ICAC and his arrest? On what
basis has he been arrested? In the absence of a clear and unequivocal
communiqué from ICAC, the impression would be that he was arrested in order
to exert pressure on him in order to compel him to say only what the ICAC, for
reasons best known to it, wants to hear. 
<marginalia > Rumour has it that politicians of all parties have been named by
Robert Lesage. The MCB itself has said that no proper control of the NPF funds
could have been made as high profile people were involved in the management
of those funds. A former financial secretary who is very close to the MSM has an
objection to departure against him. The names of officials at the bank have
been named. The siphoning of funds to private companies has been taking place
since the late eighties. Paul Bérenger was in the 1991 government. Questions
also relate to those responsible for the audit of the MCB, the audit of the NPF
funds and the overall responsibility of different politicians who had charge of
such funds. It is not going to be a simple inquiry and censorship, the Beekharry
style will certainly not help. Nobody should be spared. No stone should be left
unturned to get to the truth because important government funds and an
important bank are involved. </marginalia > 
Robert Lesage should not allow himself to be intimidated by anybody and least
of all by ICAC and his arrest. He is being legally advised and as a responsible
citizen he should go all the way by making public all that he knows. He should
himself write out his statement and send it to the police, to ICAC, to the DPP, to
Transparency International and to the President of the Republic. Only then will
he acquire some legitimacy in his allegations and only then that the citadel of
fraud that some unfortunately do not want to demolish will come crumbling
down. For too long with the MSM or the MMM, there have been selective
investigations with regard to fraud on a political line. It is high time for things
to change. If the world population can get the United States to change its mind
on war with Iraq, why can’t the people of Mauritius organise rallies against
fraudsters and their occult institutional allies?
<h>
<bold>S. MODELIAR</bold>
</h>
</body>
</text >
</tei.2>
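A file encoded in this way can be read back mechanically, for example to check the header fields or to recount the words in the body. The Python sketch below is a minimal illustration using the standard-library xml.etree.ElementTree module. The SAMPLE string is a heavily abbreviated stand-in for REP14.xml and the function name summarise is hypothetical. A whitespace-based count of this kind will not necessarily agree with the word counts recorded in this project, which may have been obtained differently.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the encoded text above (not the full file).
SAMPLE = """<tei.2>
<teiHeader id="REP14">
<fileDesc>
<titleStmt>
<title>Robert Lesage should not allow himself to be intimidated</title>
<author>S. Modeliar</author>
</titleStmt>
</fileDesc>
</teiHeader>
<text>
<body>
<bold>Robert Lesage should not allow himself to be intimidated</bold>
After the Cuttaree affair another scandal has emerged.
<marginalia>Who will investigate the matter?</marginalia>
<h><bold>S. MODELIAR</bold></h>
</body>
</text>
</tei.2>"""


def summarise(xml_source):
    """Return the id, title, author and a simple body word count."""
    root = ET.fromstring(xml_source)
    header = root.find("teiHeader")
    body = root.find("text/body")
    # itertext() walks all text inside <body>, including text nested in
    # <bold>, <it>, <marginalia> and <h>; splitting on whitespace then
    # gives a rough word count.
    words = " ".join(body.itertext()).split()
    return {
        "id": header.get("id"),
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
        "words": len(words),
    }


summary = summarise(SAMPLE)
```

Joining the text fragments with a space avoids fusing words across tag boundaries, at the cost of splitting a word if markup ever falls in the middle of it.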
APPENDIX I: First Draft of the Case for Support for the ICE-lite Proposal
EPSRC Research Proposal:
Development of the ICE-lite
Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT
1. Background
The University of Leeds and University College London have carried out previous research on
the computer analysis of English language texts, also known as English Corpus Linguistics. One
example is the development of a Part-of-Speech analysis system, which is being used on other
research projects such as the International Corpus of English (ICE); ICE includes research teams in
fifteen countries where English is the main language. In many of these English-speaking
countries, the national ICE sub-corpus is a recognised resource used in research and teaching
(ICE, 2002). “The International Corpus of English (ICE) began in 1990 with the primary aim of
collecting material for comparative studies of English worldwide. Fifteen research teams around
the world are preparing electronic corpora of their own national or regional variety of English.
Each ICE corpus consists of one million words of spoken and written English produced after
1989. For most participating countries, the ICE project is stimulating the first systematic
investigation of the national variety. To ensure compatibility among the component corpora, each
team is following a common corpus design, as well as a common scheme for grammatical
annotation.” (ICE, 2002)
Mauritius is one of the many English-speaking African countries, but there is no Mauritian
sub-corpus in ICE yet. English has been used in Mauritius for around 195 years, since the British
settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004).
However, at that time, slaves were imported from Africa and Madagascar, a large number of
labourers from India were brought to work in the sugar cane fields and a small number of Chinese
came to trade, and the influence of the French, who were the rulers before the British, was still very
strong. The languages brought by the Hindu workers and merchants from India include
Bhojpuri, Hindi, Tamil, Telegu, Marathi and Gujerati. The Chinese who came to Mauritius
generally speak Hakka or Cantonese and the Muslim workers from India speak Arabic or Urdu.
The slaves brought Malagasy (the language spoken in Madagascar) and Afrikaans to the country as
well. All those different languages have had quite a big impact on the official language,
English, and their mixture also resulted in a new language, Creole.
Creole is the most widely spoken language on the island and is used by more than half the
population, including many people who are not of Creole descent. However, even though Creole is the
most common language in Mauritius, all official communications and teaching in schools are
done in English. With the influence of the other languages, the traditional English brought by the
British settlers has undergone drastic changes. In many official communications or press reports,
for instance, we come across some French or Creole words. These might be names of individuals
or companies, or they might be used only to put some emphasis on a theme. In school textbooks,
there will often be words in Hindi or Chinese, depending on whether they are used in private or
public schools. It is also important to note that dialectal variation is reflected much more in
spoken than in written Mauritian English. This is because people tend to think in their
native language and then translate what they want to say into English. Therefore the structure and
grammar of the sentences will differ among the different cultures in Mauritius.
There are already numerous Mauritian websites available on the World Wide Web, most of
them written in English, and the government has just started a Cyber City project, which is the
first of a new generation of IT parks in this part of the world (BPML, 2004).
Therefore, it will be feasible to collect at least some written samples of Mauritian English
remotely, via the World Wide Web.
Given the success of the other ICE corpora, namely the sub-corpora from Australia, Great
Britain, India, Hong Kong and East Africa, among others, researching Mauritian English and
developing a sub-corpus for this country should help its development, whether in IT,
research or teaching.
However, the Mauritius sub-corpus will be compiled by collecting texts mostly from the Internet,
and some amendments will have to be made to the standards, since the types of texts available on
the Internet will not match the text categories of ICE, and other information, such as details of
authors and publication, will not be easily available on the Internet. With this approach,
the sub-corpus will not contain all the text categories required in the standard ICE scheme;
instead, we will develop an “ICE-lite” scheme to simplify compilation of an ICE-Mauritius
corpus. Furthermore, a more ambitious extension will be to include other varieties of English from
other English-speaking countries in the corpus. It will be a quick and simple way of compiling a
corpus of ten million words, with around five hundred thousand words for each country. The
twenty English-speaking countries not currently covered by ICE to be included in the corpus are:
Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica,
Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda,
Zambia and Zimbabwe.
2. Objectives
To achieve the goal of developing a Multi-national Corpus of English, the following specific
research objectives have been identified:
• To set up the infrastructure and a prototype sampler corpus for the Multi-national Corpus of
English.
• To collect, mark up and lexico-grammatically annotate different samples of spoken and
written texts in English from the twenty countries: 250 texts of approximately 2,000 words
each for each country, a total of approximately ten million words.
• To use corpus exploration tools to analyse lexical and grammatical variation across the
contributing dialects of English in this sampler corpus.
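The size targets in the second objective can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the three figures come from the objectives above, while the function name `corpus_size` is a hypothetical helper introduced here.

```python
# Illustrative check of the corpus size targets stated in the objectives.
TEXTS_PER_COUNTRY = 250   # from the proposal
WORDS_PER_TEXT = 2_000    # approximate figure from the proposal
COUNTRIES = 20            # from the proposal


def corpus_size(countries, texts_per_country=TEXTS_PER_COUNTRY,
                words_per_text=WORDS_PER_TEXT):
    """Return (words per country, total words) for the sampler corpus."""
    per_country = texts_per_country * words_per_text
    return per_country, per_country * countries


per_country, total = corpus_size(COUNTRIES)
print(per_country, total)  # 500000 10000000
```

This agrees with the Background section: around five hundred thousand words per country and ten million words overall.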
3. Research Work-Plan
The work has been organised into eight activity streams:
WP1: Collection of Spoken and Written Text of English. (24 months RF1, 12 months RF2, 9
months RF3)
The corpus will contain texts from 1990 or later. The total amount of texts needed will be 250
texts of approximately 2,000 words each for each country - a total of approximately ten million
words. The authors and speakers of the texts need to have been brought up and educated through the
medium of English. They must be aged 18 or over and must either have been born in the country or have
immigrated to it at an early age.
1.1 Written Text
All of the written texts will be collected from the Internet and other freely available sources.
However, it will be difficult to obtain non-printed texts such as social letters or student essays.
1.2 Spoken Text
Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will
only be possible if done on site, i.e. in each of the countries. Therefore, we will only seek to
use sources such as radio and TV broadcasts which are available on the Internet. The solution
proposed by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed
material), correspondence & spoken language samples”.
1.3 Copyright issues
Letters of copyright will have to be obtained. This will involve in the first place identifying the
owners of sources and finding the right contact details.
1.4 Classification of texts
Selecting and organising the texts will be a complex task and careful consideration is required.
As far as possible, the text classification will follow the ICE standard categories, but it is expected
that some text categories will not be available on the Internet. It is also important to classify each
text according to the country it comes from.
Deliverable D1: A detailed list of the texts collected, together with information about the authors,
publisher, place of publication and date.
WP2: Transcription (19 months RF1, 12 months RF2, 9 months RF3)
Most of the written texts will be in electronic format already. After the spoken texts have been
collected and permission has been received, they will be transcribed, that is, written on paper
or typed on screen. It is expected that most of the recorded speech will already be in digital format,
since it will be collected from the Internet.
Deliverable D2: A sampler of the raw Multi-national Corpus of English.
WP3: Textual Mark-up (6 months RF1, 2 months RF2, 2 months RF3)
3.1 Encoding of Text
The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost
when they are converted into plain text files on a computer will be encoded. In written texts this
includes features such as boldface, italics and underlining as well as sentence boundaries,
paragraph boundaries and headings. In spoken texts the encoding features will be sentence
boundaries, speaker turns, and pauses (Nelson, 1996a). Paragraphing and header information
(adapted from the ICE standards) regarding author, publisher, etc. will be added. Texts with
different formats (Doc, PDF, HTML) will be converted into a unified framework (XML format)
(Al-Sulaiti, 2004).
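As a rough illustration of this encoding step, the sketch below builds a small XML fragment with Python's standard `xml.etree.ElementTree` module. The element names `text`, `body`, `h` and `bold` mirror those visible in the sample texts of this report; the `p` element, the function `encode_text` and the example sentence are assumptions made for illustration and are not taken from the ICE mark-up manual.

```python
import xml.etree.ElementTree as ET


def encode_text(heading, paragraphs):
    """Build a small XML tree for one written text.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    (content, style) pairs, where style is None for plain text or the
    name of a mark-up element such as "bold".
    """
    text = ET.Element("text")
    body = ET.SubElement(text, "body")
    ET.SubElement(body, "h").text = heading
    for runs in paragraphs:
        p = ET.SubElement(body, "p")  # "p" is an assumed paragraph element
        for content, style in runs:
            if style is not None:
                # Styled run: wrap it in its own element.
                ET.SubElement(p, style).text = content
            elif len(p) == 0:
                # Plain run before any styled element.
                p.text = (p.text or "") + content
            else:
                # Plain run after a styled element goes into its tail.
                p[-1].tail = (p[-1].tail or "") + content
    return text


sample = encode_text(
    "Languages",
    [[("English is the ", None), ("official", "bold"), (" language.", None)]],
)
xml_string = ET.tostring(sample, encoding="unicode")
print(xml_string)
```

The printed result is `<text><body><h>Languages</h><p>English is the <bold>official</bold> language.</p></body></text>`. Header information about author and publisher would be added in the same way, following whatever element names the steering panel agrees.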
3.2 Proofread Text
Both spoken and written text will be proofread on the screen. This task includes deleting extra
and unnecessary material from texts and checking and adjusting paragraphing markers.
Deliverable D3: Multi-national Sampler Corpus ready for distribution.
WP4: Word-class tagging (18 months RF3)
As in the other ICE corpora, the texts will be “automatically tagged for wordclass by the TOSCA
Tagger, developed by the TOSCA Research Group at the University of Nijmegen. This assigns
wordclass tags to each lexical item in the corpus. The tagset has been developed especially for
ICE, and is largely based on Quirk et al (1985) A Comprehensive Grammar of the English
Language” (ICE, 2002). During this stage, each item will be assigned a label or tag, for example,
‘N’ for noun and ‘ADV’ for adverb. Other information, such as singular or plural, or the verb
tense will be added in brackets next to the label or tag.
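The label-plus-features convention described above can be pictured with a small data structure. This is a hypothetical sketch only: the tags `N` and `ADV` come from the text, but the class `TaggedToken`, the feature name `sing` and the exact bracket format are assumptions and do not reproduce the TOSCA tagger's actual output.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedToken:
    word: str
    tag: str                                      # e.g. "N" for noun, "ADV" for adverb
    features: list = field(default_factory=list)  # e.g. ["sing"] (assumed name)

    def label(self):
        """Return the tag with any features in brackets, e.g. N(sing)."""
        if self.features:
            return f"{self.tag}({','.join(self.features)})"
        return self.tag


tokens = [TaggedToken("corpus", "N", ["sing"]), TaggedToken("quickly", "ADV")]
labels = [t.label() for t in tokens]
print(labels)  # ['N(sing)', 'ADV']
```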
Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.
WP5: Syntactic parsing
The tagged corpus from the previous stage forms the input to the next major stage, syntactic
parsing.
5.1 Syntactic marking (18 months RF2)
The corpus will be pre-edited (also known as syntactic marking) before the rest of the parsing stage.
This involves manually marking several high-frequency constructions in order to reduce the
ambiguity of the input, and thereby reduce the number of decisions that the automatic parser
has to make.
5.2 TOSCA parser
The TOSCA parser (Nelson et al., 2002), software developed by the TOSCA
group, will be used to automate this stage. The output from the TOSCA parser will be a series of
labelled syntactic trees, in which the nodes will be labelled for function, category, and features.
5.3 Manual Analysis
The TOSCA parser should yield a complete analysis for around 70% of the parsing units in the
corpus; for the remainder, the analysis will have to be done manually.
Deliverable D5: An analysis of the corpus at phrase, clause, and sentence level, shown in the
form of parse trees.
WP6: Evaluation (7 months RF1, 2 months RF2, 17 months RF3)
6.1 Cross-sectional checking
The syntactic trees will be checked on a cross-sectional, construction-by-construction basis. This
will allow the check to concentrate on just one grammatical construction at a time, and
corrections can be made to each instance of the construction throughout the whole corpus, if
necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.
6.2 Spot-checking
Finally, the corpus will be ‘spot-checked’ before being released.
Deliverable D6: The final Multi-national Corpus of English
WP7: Comparison across dialects (10 months RF1, 16 months RF2, 6 months RF3)
We will use English concordance and corpus exploration tools to analyse lexical and grammatical
variation across the contributing dialects of English in the sampler corpus.
Deliverable D7: Research paper on lexical and grammatical variation in the Multi-National
Sampler Corpus, to be submitted to the International Journal of Corpus Linguistics.
WP8: Dissemination for Exploitation (2 months each)
The normal dissemination route for academic research is journal and conference papers. D5 and
D6 are directly publishable; other papers will need to be written from the other deliverables.
Deliverable D8: Plan for continuing expansion of the Multi-national Corpus of English,
extending to new countries.
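The month figures attached to WP1 to WP8 above can be tabulated per Research Fellow, as sketched below. The figures are copied from the work-package headings; the function name `months_per_fellow` is a hypothetical helper, and the reading of the figures as stream durations is an assumption. On that reading a fellow's sum can exceed the 36-month project length, because the streams run in parallel.

```python
from collections import defaultdict

# Months per work package and Research Fellow, as listed in the work plan.
ALLOCATIONS = {
    "WP1": {"RF1": 24, "RF2": 12, "RF3": 9},
    "WP2": {"RF1": 19, "RF2": 12, "RF3": 9},
    "WP3": {"RF1": 6, "RF2": 2, "RF3": 2},
    "WP4": {"RF3": 18},
    "WP5": {"RF2": 18},
    "WP6": {"RF1": 7, "RF2": 2, "RF3": 17},
    "WP7": {"RF1": 10, "RF2": 16, "RF3": 6},
    "WP8": {"RF1": 2, "RF2": 2, "RF3": 2},
}


def months_per_fellow(allocations):
    """Sum the months listed against each Research Fellow across all streams."""
    totals = defaultdict(int)
    for per_fellow in allocations.values():
        for fellow, months in per_fellow.items():
            totals[fellow] += months
    return dict(totals)


totals = months_per_fellow(ALLOCATIONS)
print(totals)  # {'RF1': 68, 'RF2': 64, 'RF3': 63}
```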
4. Benefits
The research will first be beneficial to the government and the educational system in each of the
twenty countries mentioned above. A comprehensive description of the different varieties of English
can be obtained from the corpus, and each country will therefore be able to develop its own
reference guides to usage, dictionaries and other teaching materials. This can help both schools
and universities to adapt their teaching methods, and especially the way in which English
is taught, so that it is taught and spoken to a better standard. The comparison across the dialects
of English to find any striking similarities or differences will be useful for further research and
teaching methods in each country, and will also benefit people who want to travel to or trade with
other English-speaking countries, since the comparison will provide a useful insight into how they
will have to adapt their language. When the corpus is released, it will also be beneficial to other
research or academic institutions across the world. It can be used for comparison or for further
research by teams working on existing or potential corpora.
Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of developing
basic components for the linguistic society.
• Promoting the different cultures as a whole.
5. Resources
Staff: the development of the project will require the employment of:
Three English corpus linguists as post-doctoral Research Fellows and project managers for three
years.
Consumables:
A powerful laptop PC for each researcher, costing £2000 each.
Consultancy fees of £40,000 for transcription and mark-up of source materials.
Travel and Subsistence:
Results will be reported and published in conference proceedings including Corpus Linguistics (CL’07
Lisbon) and ICAME (ICAME’06, ‘07, ‘08, locations not yet known), estimated total cost of £4,000.
The costs for the International Steering Panel meetings at start, mid and end of project are estimated at
a total of £21,000.
References:
Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc
thesis. University of Leeds.
BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th
November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity
Department of English Language & Literature, University College London (2002) The International
Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World
Wide Web: http://www.ucl.ac.uk/english-usage/ice/#
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University
Press.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English
Worldwide: The International Corpus of English. Oxford: Clarendon Press
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British
component of the International Corpus of English. Philadelphia: John Benjamins Publishing
Company.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from the
World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer,
D., Wilson, A. and Rayson P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.
The School of Computing, University of Leeds (1998-2004). The University of Leeds web site
[online]. [Accessed 21st October 2004]. Available from World Wide Web:
Part 2: DIAGRAMMATIC WORK PLAN
The work has been organised into eight activity streams. The three research fellows will be
working in parallel during the 36-month project starting August 2006.
WP1: Collection of Spoken and Written Text of English (24 months RF1, 12 months RF2,
9 months RF3)
WP4: Word-class tagging (18 months RF3)
WP5: Syntactic parsing (18 months RF2)
[Gantt chart: one panel for each of Research Fellow 1, Research Fellow 2 and Research Fellow 3. Horizontal axis: month, running from August to July in each of the three project years. Rows: WP1 to WP8.]
Appendix J: EPSRC Application Form for ICE-lite
Engineering & Physical Sciences Research Council
Polaris House, North Star Avenue, Swindon, Wiltshire,
United Kingdom, SN2 1ET
Telephone +44 (0) 1793 444000
Web http://www.epsrc.ac.uk/
Je-SRP1 (EPSRC)
v1.1
COMPLIANCE WITH THE DATA PROTECTION ACT 1998
In accordance with the Data Protection Act 1998, the personal data provided on this form will be processed by EPSRC, and may be
held on computerised database and/or manual files. Further details may be found in the guidance notes
EPSRC Reference:
RESEARCH PROPOSAL
1. DETAILS OF PROPOSAL
You should read the separate notes for guidance, the 'EPSRC Funding
Guide’ and any specific call documentation on the EPSRC Web site
before completing any research proposal. Form Je-SRP1 (EPSRC) must
be accompanied by a Case for Support. EPSRC will reject incomplete
research proposals.
A. Organisation Where Grant Would Be Held
Organisation
University of Leeds
Division or Department
School of Computing
Address Line 1
Computer Vision and Language group
Address Line 2
School of Computing
Address Line 3
University of Leeds
Town/City
Leeds
Admin Area/County
West Yorkshire
Research Organisation Reference:
Postal Code
LS2 9JT
B. Investigators
Please give details of each investigator below. Please provide the details of any additional
investigators on a separate sheet using the same format as below.
Details
Principal Investigator (PI)
Title
Mr
Forename(s)
Eric
Surname
Atwell
Organisation
University of Leeds
School of Computing
Post will outlast project (Y/N)
Y
% time committed to project
20
Other commitments (description and average hours per week)
8
Co-Investigator 1
Total number of co-investigators (ie. excluding the PI)
0
C. Recognised Researchers
Please give details of each Recognised Researcher below. Please provide the details
of any additional Recognised Researchers on a separate sheet using the same format as below.
Details
Recognised Researcher 1
Title: Dr
Forename(s): Serge
Surname: Sharoff
Organisation: University of Leeds, Centre for Translation Studies
% time committed to project: 100
Recognised Researcher 2
Title: Ms
Forename(s): Bayan
Surname: Abu Shawar
Organisation: University of Leeds, School of Computing
% time committed to project: 100
Total number of Recognised Researchers: 3
D. Title of Research Project [up to 150 chars]
Development of an ICE-lite
E. Start Date and Duration
a. Proposed start date
August 1st 2005
b. Duration of the grant (months)
36
F. Type of Proposal
Scheme:
Call: n/a
G. Summary of EPSRC Resources Required for Project
a. Financial resources required (Total £)
Staff: 330,072
Travel and Subsistence: 25,000
Consumables: 46,000
Exceptional Items:
Equipment:
Large Capital:
PCTF: 500
Sub-total: 401,572
Indirect Costs: 101,222
Total: 502,794
b. Summary of staff effort requested (Months)
Research: 108
Technician:
Other:
Project Students:
Visiting Researchers:
Total: 108
c. Services (Total £):
H. Related Proposals
EPSRC Reference Number
How related? (one of Continuation,
Follow-up to outline proposal,
Invited resubmission, Uninvited
resubmission)
a. If this proposal is related to a previous proposal to EPSRC,
please give the previous EPSRC research grant proposal
reference number(s) and indicate the type of relationship.
Total Number of
Proposals being
submitted
b. If there is more than one organisation submitting a JeSRP1 (EPSRC) proposal form for this project, please give
the number of proposals involved, the lead Research
Organisation and the project common reference.
Name of Lead
Research Organisation
Common
Reference
I. Research Councils / MoD Joint Research Grants Scheme (JGS) If you have received a
commitment of support from the Defence Science Technology Laboratory (DSTL), please give the following details:
Percentage funding indicated by DSTL
DSTL contact (name and address)
Title/Forename(s)
Surname
Address Line 1
Address Line 2
Address Line 3
Town/City
Administrative Area/County
Postal Code
Telephone
Fax
E-mail
DSTL Reference (please ensure that the letter providing
this reference is attached with the Case for Support)
J. Objectives
List main objectives of the proposed research in order of priority [up to 4000 chars]
We will set up the infrastructure and a prototype sample corpus for the ICE-lite, an International Corpus of
English component which contains a 'lite' version of Englishes from 40 different English-speaking countries.
An international steering panel will establish agreed standards for text types and categories and for
other aspects such as encoding, XML mark-up, tagging and distribution.
We will collect, mark up and annotate different samples of spoken and written texts in English from 40
English-speaking countries: 250 texts of approximately 2,000 words each per country, a total of approximately 20
million words.
We will use corpus exploration tools to analyse lexical and grammatical variation across the contributing
dialects of English in this sample corpus.
K. Summary
Describe the proposed research in a style that would be accessible to an interested 14 year old [up to 4000 chars]
Texts stored on a computer, known as corpora, together with software tools, provide a powerful method to
learn more about language usage. Corpora are useful for studying all aspects of language, such as
grammar, meaning and speech sounds, and for helping dictionary makers to spot new words. Corpus
linguists nowadays analyse the use of structures and investigate the factors that affect our choice of a
particular structure. For instance, the factors may be related to the nature of the writing or speaking, such
as science rather than literature. Other factors that may influence our choice include age, gender, period
of time, text type and medium (spoken or written), and these are being fully examined to get the best results in
our study of language.
The main aim of a corpus linguist is to discover common linguistic patterns in specific contexts
rather than to state whether a pattern is correct or incorrect. With computers able to store a huge
amount of data, this view of language analysis becomes more accessible and gives a good resource to
start with. The corpus can be searched and handled at high speed using special software tools. In
addition, information such as grammar, meaning and speech sounds can be added to it to make it
more useful to examine.
Many large corpora have been developed during the past few years. Some are for general use in
linguistics research and represent different languages such as English, Spanish, French and Russian,
while others are more specialised, such as the Air Traffic Control corpus. English is widely spoken in
different parts of the world and one main corpus that handles its variation is known as the International
Corpus of English (ICE). The main purpose of collecting this corpus is to compare English as spoken
worldwide. Around the world, fifteen research teams are preparing electronic corpora of their own variety
of English, and each one consists of one million words: 60% spoken and 40% written. Each team is
following the same corpus design to ensure compatibility.
Many English-speaking countries do not have a component in ICE yet, and developing a sub-corpus for
each one will be very costly and time-consuming. A better extension to the ICE project will be to collect a
small version of the corpus for each of the English-speaking countries and group them together to form
the ICE-lite. The term “lite” is borrowed from other simplified projects such as “TEI-lite”, which means a
simpler version of TEI, a standard XML mark-up convention for text corpora.
Therefore, the aim of this project is to build a corpus for the ICE-lite which will follow similar conventions to
the full ICE version. To ensure compatibility with the other ICE projects, a “lite” version for the teams
already in ICE will also be included in the corpus, together with 25 other countries. The aim is to collect a
corpus of 20 million words: 250 texts of 2,000 words for each country. An international steering panel will
be appointed to agree on a general design structure for the corpus. The different varieties of English will then
be analysed and compared to find similarities or differences across countries. This will allow an
understanding of the different cultures and will therefore be useful for other research and academic
institutions across the world.
L. Beneficiaries
Describe who will benefit from the research [up to 4000 chars]
The research will first be beneficial to the government and the educational system in each of the twenty
countries mentioned above and to the existing ICE teams. A comprehensive description of the different varieties
of English can be obtained from the corpus, and each country will therefore be able to develop its own
reference guides to usage, dictionaries and other teaching materials. This can help both schools and
universities to adapt their teaching methods, and especially the way in which English is taught, so that
it is taught and spoken to a better standard. The comparison across the dialects of English to find any
striking similarities or differences will be useful for further research and teaching methods in each country,
and will also benefit people who want to travel to or trade with other English-speaking countries, since the
comparison will provide a useful insight into how they will have to adapt their language. When the corpus is
released, it will also be beneficial to other research or academic institutions across the world. It can be used
for comparison or for further research by teams working on existing or potential corpora.
Longer-term impacts of the work to be done include:
• Promoting cooperation between other English-speaking countries for the purpose of developing
basic components for the linguistic society.
• Promoting the different cultures of the 40 countries across the world.
M. Staff
Joint Negotiating Committee For Higher Education Staff (JNCHES – formerly UCEA) Posts
EFFORT ON PROJECT
Columns: Name/Post Identifier; Grade; Starting Spine Point; Effective Date of Salary Scale; Increment Date; Start Date; Period on Project (months); % of Full Time; London Allowance (Y/N); Total cost on grant (£)
i) Research Staff
Serge Sharoff; RAII; 11; 01/08/2005; 01/08/2006; 01/08/2005; 36; 100%; N; 110,024
Bayan Abu Shawar; RAII; 11; 01/08/2005; 01/08/2006; 01/08/2005; 36; 100%; N; 110,024
Sean Wallis at UCL; RAII; 11; 01/08/2005; 01/08/2006; 01/08/2005; 36; 100%; N; 110,024
ii) Technical Staff
iii) Visiting Researchers
Total: 330,072
Non-JNCHES Posts
EFFORT ON PROJECT
Name / Post Identifier
Basic Starting
Salary
Scale
Effective Date of
Salary Scale
Increment
Date
Start
Date
Period on
Project
(months)
% of Full
Time
London
Allowance
(£)
Superannuation
and NI (£)
Total cost
on grant (£)
i) Research Staff
ii) Technical Staff
iii) Other Staff
iv) Visiting Researchers
Total
Ma. Project Studentships
Name/Post Identifier
Start Date
London (Y/N)
Stipend (£)
Total
Mb. Visiting Researchers
Please provide the details of any additional visiting researchers on a separate sheet in
the same format as below.
Details
Visiting Researcher 1
Title
Forename(s)
Surname
Home Organisation
Address Line 1
Address Line 2
Address Line 3
Town/City
Administrative Area/County
Postal Code
Country
Telephone
Fax
E-mail
Post held
a) If you have requested an amount
from EPSRC for the Visiting
Researcher's salary in Section M, will
the Visiting Researcher receive any
other contribution on top of this? (Y/N)
b) If the Visiting Researcher will receive
another contribution, how much will this
be? (£)
c) What annual salary would the host
organisation expect to pay staff of the
Visiting Researcher's status? (£)
Total number of visiting researchers
0
Mc. Public Communication Training Funds (PCTF)
Do you wish to apply for £500 towards Public Communication Training Funds?
YES
NO
N. Travel and Subsistence
Destination and purpose
Total £
(i) Within UK
International Steering Panel review meetings at start, mid and end of project
21,000
(ii) Outside UK
Corpus Linguistics conferences (CL'2007, TALC'06,07,08) to disseminate results
4,000
Total £
25,000
O. Consumables
Description
Total £
Consultancy fee funds for transcription and markup of source materials
40,000
3 laptops for data collection and analysis
6,000
Total £
46,000
P. Exceptional Items
Description
Total £
Total £
Q. Equipment (single items between £3,000 and £99,999, including VAT)
Description
Country of
Manufacture
Delivery
Date
Basic
price £
Import
duty £
VAT £
Total £
Total £
R. Large Capital (single items £100,000 and over, including VAT)
Description
Country of
Manufacture
Delivery
Date
Basic
price £
Import
duty £
VAT £
Total £
Total £
S. Services
Service
Instrument(s)
Units
Cost £
Total
T. Other Support
Give details of any support sought or received from any source for this or related research in the past three years (minimum
£10,000)
Awarding
Organisation
Awarding
Organisation’s
Reference
Title of project
Decision
Made
(Y/N)
Award
Made
(Y/N)
Start
Date
End
Date
Amount
Sought/
Awarded
(£)
Appendix K: Revised Case for Support for the ICE-lite Proposal
EPSRC Research Proposal:
Development of the Multi-National Corpus of English
Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT
1. Background
The University of Leeds (2004) has done previous research on computer analysis of English language
texts, also known as English Corpus Linguistics. For example, the University has developed a
Part-of-Speech analysis system which is being used on other research projects such as the International
Corpus of English (ICE), which includes research teams in fifteen countries where English is the first
language or second official language. In many of these English-speaking countries, the national ICE
sub-corpus is a recognised resource used in research and teaching. ICE “began in 1990 with the
primary aim of collecting material for comparative studies of English worldwide. Fifteen research
teams around the world are preparing electronic corpora of their own national or regional variety of
English. Each ICE corpus consists of one million words of spoken and written English produced
after 1989. For most participating countries, the ICE project is stimulating the first systematic
investigation of the national variety. To ensure compatibility among the component corpora, each
team is following a common corpus design, as well as a common scheme for grammatical
annotation.” (ICE, 2002)
These corpora are collected mostly from standard and more traditional materials such as
books, newspapers and articles. This method allows a wide variety of texts to be obtained; however,
it is very time-consuming and costly, since researchers have to be sent on site and the texts need to
be transcribed and converted into electronic format. This is one of the reasons why five other ICE
projects (Cameroon, Fiji, Ghana, Nigeria and Sierra Leone) have not been able to start collecting any
texts to date. Therefore, an alternative way of quickly and simply compiling a big corpus of
much more than one million words would be to use the World Wide Web. Other attempts at using
this method have proved successful (e.g. Serge Sharoff at Leeds University has developed tools
to extract 100-million-word corpora of Russian, German and Chinese).
A pilot project was undertaken to investigate the possibility of collecting a corpus for Mauritius, one of the many
English-speaking African countries. English has been used in Mauritius for around
195 years, since the British settlers arrived in 1810, and is the official language of the country
(Republic of Mauritius, 2004). However, at that time slaves were imported from Africa and
Madagascar, a large number of labourers were brought from India to work in the sugar cane fields,
a small number of Chinese came to trade, and the influence of the French, who had ruled
before the British, was still very strong. The languages brought by these different settlers have
therefore significantly influenced the official English language of the country. Still, all official
communications and teaching in schools are done in English, although official
communications, press reports and school textbooks, for instance, may contain some
dialect words, such as names of individuals or companies written in French or Hindi.
There are already numerous Mauritian websites available on the World Wide Web, most of them
written in English, and the government has just started a Cyber City project, the first
of a new generation of IT parks in this part of the world (BPML, 2004). It will therefore be
feasible to collect at least some written samples of Mauritian English remotely, via the World Wide
Web. For the pilot project, a sample of 30 texts of between 1,000 and 2,000 words each was collected (a total
of 51,960 words). Each text, including details such as author, publisher and date, took between 15
and 20 minutes to find. From the texts obtained, it was noted that some amendments will have to be
made to the ICE standards, since the types of texts available on the Internet will not match the ICE text
categories or text sizes, and other information about the texts is not easily available on the Internet.
18 requests for permission to use the texts were sent out by email and it took 6 minutes on average to
send each one. The issue of potential commercial use will have to be addressed in more detail, since
the other ICE projects are strictly non-commercial and, according to Gerald Nelson at UCL, a
statement of potential commercial use might cause difficulties in obtaining permissions from owners and might cause problems for
the other ICE teams. The 30 texts were also marked up with a reduced ICE header, and it took between
20 and 30 minutes to “clean up” the source webpage and add the markup to each text. This stage
can be done partly by a program, with human intervention needed only to proofread and post-edit
the draft marked-up texts that are produced. However, for the pilot project, no
such program was available and therefore the estimates are derived from the manual process. The
tagging and parsing will be done automatically, and it was estimated that one million words
will take one and a half weeks to tag and one and a half weeks to parse.
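The pilot timings above can be turned into a rough effort estimate. The following is a minimal sketch, not part of the pilot itself: the function and class names are hypothetical, it assumes that every text needs one permission request (the pilot sent 18 requests for 30 texts), and it uses the work plan's convention of a month as 20 days of 8 hours.

```python
from dataclasses import dataclass

MINUTES_PER_HOUR = 60
HOURS_PER_MONTH = 20 * 8  # work plan convention: 20 days of 8 hours


@dataclass(frozen=True)
class StepTiming:
    """Manual time per text for one stage, in minutes (pilot figures)."""
    name: str
    low_min: int
    high_min: int


# Figures reported for the manual pilot process.
# Assumption: one permission request per text.
PILOT_STEPS = [
    StepTiming("find text and its details", 15, 20),
    StepTiming("send permission request", 6, 6),
    StepTiming("clean up page and add markup", 20, 30),
]


def effort_hours(n_texts, steps=PILOT_STEPS):
    """Return (low, high) manual effort in hours for n_texts texts."""
    low = sum(s.low_min for s in steps) * n_texts / MINUTES_PER_HOUR
    high = sum(s.high_min for s in steps) * n_texts / MINUTES_PER_HOUR
    return low, high


def effort_months(n_texts, steps=PILOT_STEPS):
    """Return (low, high) manual effort in working months."""
    low, high = effort_hours(n_texts, steps)
    return low / HOURS_PER_MONTH, high / HOURS_PER_MONTH


if __name__ == "__main__":
    for label, n in [("pilot, 30 texts", 30), ("one country, 250 texts", 250)]:
        lo_h, hi_h = effort_hours(n)
        print(f"{label}: {lo_h:.1f} to {hi_h:.1f} hours")
```

Under these assumptions the manual stages for the 30 pilot texts come to between 20.5 and 28 hours; a program that automates part of the clean-up stage would reduce the third figure only.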
Evidence from the pilot project showed that, with this Internet collection technique, the sub-corpus will
contain fewer than one million words because of the limited set of text categories available on the World
Wide Web. A better extension will therefore be to include other varieties of English from other
English-speaking countries in the corpus. This will result in an “ICE-lite” with around five hundred
thousand words for each country. The term “lite” is borrowed from other simplified projects such as
“TEI-lite”, a simpler version of TEI, a standard XML mark-up convention for text
corpora (TEI, 2005). The twenty countries which have been chosen to form part of the Multinational Corpus of English are: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman
Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia,
Pakistan, Seychelles, Uganda, Zambia and Zimbabwe.
In each of the abovementioned countries, English is either the national language or one of the main
spoken languages. For instance, in Bermuda and Zimbabwe English is the official language, while
in Liberia English is used mostly for trading purposes. The form of English in Liberia therefore
differs significantly in its word structure and can take some time and practice to master.
As in Mauritius, English in most of these countries has in one way or another been highly influenced
by other languages, either brought by ancestors or derived from the local culture. The ICE-lite corpus
will hence allow an interesting and useful analysis of the variation in English across nations. To
ensure compatibility and provide an enhanced comparison with the existing projects, a “lite”
version for each of the 20 teams already in ICE will also be included in the corpus. For each country,
numerous websites are easily accessible via the World Wide Web, and different text categories are
available. Google provides evidence that, for the Mauritius pilot project, there are 250 million words
of text to select from. A large enough amount of text should therefore be available to collect 250 texts of
2,000 words each for each country, and the corpus will thus aim to contain approximately 20 million
words in total. An important issue which will need further consideration is the distribution method.
The existing ICE corpora are distributed on CD, which reduces accessibility. One
possible solution will be to make the ICE-lite corpus available on the Internet under a public licence.
For instance, it could be distributed under the GNU public licence, analogous to open-source software
freely downloadable from Sourceforge.net.
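The size targets quoted above follow from simple arithmetic, sketched below. The constant and function names are hypothetical; the figures are those given in the text (250 texts of 2,000 words per country, twenty new countries plus the 20 existing ICE teams, and 51,960 pilot words in 30 texts).

```python
TEXTS_PER_COUNTRY = 250
WORDS_PER_TEXT = 2_000
NEW_COUNTRIES = 20          # countries listed for the ICE-lite extension
EXISTING_ICE_TEAMS = 20     # "lite" versions of teams already in ICE
FULL_ICE_CORPUS_WORDS = 1_000_000

PILOT_TEXTS = 30
PILOT_WORDS = 51_960


def words_per_country():
    """Target size of one ICE-lite sub-corpus, in words."""
    return TEXTS_PER_COUNTRY * WORDS_PER_TEXT


def total_words():
    """Target size of the whole ICE-lite corpus, in words."""
    return (NEW_COUNTRIES + EXISTING_ICE_TEAMS) * words_per_country()


def pilot_share_of_full_ice():
    """Pilot size as a fraction of a standard one-million-word ICE corpus."""
    return PILOT_WORDS / FULL_ICE_CORPUS_WORDS


if __name__ == "__main__":
    print(f"words per country: {words_per_country():,}")
    print(f"total words:       {total_words():,}")
    print(f"pilot share:       {pilot_share_of_full_ice():.1%}")
    print(f"mean pilot text:   {PILOT_WORDS / PILOT_TEXTS:.0f} words")
```

This gives 500,000 words per country and 20 million words overall, and shows the pilot sample at roughly 5% of a full ICE component.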
Given the success of the other ICE corpora, namely the sub-corpora from Great Britain, India, Hong
Kong and East Africa, among others, researching English across the different nations and developing
an ICE-lite Corpus will help the development of each of the participating countries, whether in
IT, research or teaching.
2. Objectives
To achieve the goal of developing an ICE-lite Corpus of English, the following specific research
objectives have been identified:
• To set up the infrastructure and a prototype sampler corpus for the ICE-lite Corpus of English.
• To establish, through an international steering panel, agreed standards for text types and categories
and for encoding, XML mark-up, tagging and distribution.
• To collect, mark up and lexico-grammatically annotate samples of spoken and written
texts in English from the 40 countries: 250 texts of approximately 2,000 words each for each
country, a total of approximately 20 million words.
• To use corpus exploration tools to analyse lexical and grammatical variation across the contributing
dialects of English in this sampler corpus.
3. Research Work-Plan
The work has been organised into eight activity streams (assuming that a month consists of 20 days
of 8 hours working time):
WP0: Project Management via International Steering Panel (1 month each over 3 years)
The Panel will establish agreed standards for text types and categories; encoding and XML mark-up;
morphological analysis and Part-of-Speech tagging; and distribution methods. Standards proposals
will be drawn up by the project investigators, subject to approval and improvement by the Panel.
Members will be chosen from the 40 countries that will form part of ICE-lite.
Deliverable D0: ICE-lite International Steering Panel to meet annually to oversee project progress.
WP1: Collection of Spoken and Written Text of English. (10 months RF1, 8 months RF2, 8
months RF3)
The corpus will contain texts from 1990 or later. The total number of texts needed will be 250 texts
of approximately 2,000 words each for each country, a total of approximately 20 million words.
The authors and speakers of the texts need to have been brought up and taught through the medium
of English. They must be aged 18 or over and must either have been born in the country or have
immigrated to it at an early age.
1.1 Written Text
All of the written texts will be collected from the Internet only. However, it will be difficult to obtain
non-printed texts such as social letters or student essays.
1.2 Spoken Text
Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will only
be possible if done at the site, i.e. in each of the countries. Therefore, we will only seek to use
sources such as radio and TV broadcasts, which are available on the Internet. The solution proposed
by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed material),
correspondence & spoken language samples”.
1.3 Copyright issues
Letters of copyright will have to be obtained. This will involve in the first place identifying the
owners of sources and finding the right contact details.
1.4 Classification of texts
Selecting and organising the texts will be a complex task requiring careful consideration. As
far as possible, the text classification will follow the ICE standard categories, but it is expected that
some text categories will not be available on the Internet. It is also important to classify each text
according to the country it comes from.
Deliverable D1: A detailed list of the texts collected, together with information about the author,
publisher, place of publication and date.
The written texts will be in electronic format already. After the spoken texts have been collected and
permission has been received, they will be transcribed, that is, written on paper or typed on
screen. It is expected that most of the recorded speech will already be in digital format, since it
will be collected from the Internet. No sample of spoken texts from Mauritius was available, so it is
expected that only a limited number of spoken texts will be obtained; the time required for
transcription is based only on personal judgement, relative to the time allowed for WP1.
Deliverable D2: A sampler of the raw Multi-national Corpus of English.
3.1 Encoding of Text
The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost when
they are converted into plain text files on a computer will be encoded. In written texts these
features include boldface, italics and underlining, as well as sentence boundaries, paragraph
boundaries and headings. In spoken texts the encoded features will be sentence boundaries, speaker
turns and pauses (Nelson, 1996a). Paragraphing and header information (adapted from the ICE
standards) regarding author, publisher, etc. will be added. Texts in different formats (Doc, PDF,
HTML) will be converted into a unified framework (XML format) (Al-Sulaiti, 2004).
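As an illustration of this encoding step, the sketch below builds a small XML document with a reduced header and paragraph and sentence boundaries, using Python's standard xml.etree.ElementTree module. The element names (text, header, author, publisher, date, body, p, s), the function name and the sample values are assumptions made for the example; they are not the ICE or TEI element set, which the Steering Panel would have to agree.

```python
import xml.etree.ElementTree as ET


def build_text_element(text_id, author, publisher, date, paragraphs):
    """Wrap one cleaned text in XML with a reduced header.

    paragraphs is a list of paragraphs, each a list of sentence strings.
    Element names are illustrative only, not an agreed ICE-lite schema.
    """
    text = ET.Element("text", attrib={"id": text_id})

    header = ET.SubElement(text, "header")
    for tag, value in (("author", author),
                       ("publisher", publisher),
                       ("date", date)):
        ET.SubElement(header, tag).text = value

    body = ET.SubElement(text, "body")
    for sentences in paragraphs:
        p = ET.SubElement(body, "p")          # paragraph boundary
        for sentence in sentences:
            ET.SubElement(p, "s").text = sentence  # sentence boundary
    return text


# Hypothetical sample values, for demonstration only.
sample = build_text_element(
    text_id="sample-001",
    author="Unknown author",
    publisher="Example publisher",
    date="2004",
    paragraphs=[
        ["This is the first sentence.", "Fish & chips are popular."],
        ["A second paragraph follows."],
    ],
)
xml_string = ET.tostring(sample, encoding="unicode")

if __name__ == "__main__":
    print(xml_string)
```

Building the tree through the library rather than by string concatenation means that characters such as "&" in the source text are escaped automatically, so the output stays well-formed.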
3.2 Proofread Text
Both spoken and written text will be proofread on the screen. This task includes deleting extra and
unnecessary material from texts and checking and adjusting paragraphing markers.
Deliverable D3: Multi-national Sampler Corpus ready for distribution.
WP4: Word-class tagging (6 months RF2, 8 months RF3)
As in the other ICE projects, the texts will be “automatically tagged for wordclass by the TOSCA Tagger,
developed by the TOSCA Research Group at the University of Nijmegen. This assigns wordclass
tags to each lexical item in the corpus. The tagset has been developed especially for ICE, and is
largely based on Quirk et al (1985) A Comprehensive Grammar of the English Language” (ICE,
2002). During this stage, each item will be assigned a label or tag, for example, ‘N’ for noun and
‘ADV’ for adverb. Other information, such as singular or plural, or the verb tense will be added in
brackets next to the label or tag.
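The label-plus-brackets format described above can be read into a simple data structure, as in the following sketch. The labels used here, such as N(sing) and V(past), are illustrative stand-ins and are not taken from the actual ICE tagset; the names Tag and parse_tag are hypothetical.

```python
import re
from typing import NamedTuple, Tuple


class Tag(NamedTuple):
    """A wordclass label with optional bracketed features."""
    wordclass: str
    features: Tuple[str, ...]


# An upper-case label, optionally followed by comma-separated features
# in brackets, e.g. "N(sing)" or "ADV".
_TAG_RE = re.compile(r"^([A-Z]+)(?:\(([^()]*)\))?$")


def parse_tag(label):
    """Split a label such as 'N(sing)' into wordclass and features."""
    match = _TAG_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid tag label: {label!r}")
    raw_features = match.group(2) or ""
    features = tuple(f.strip() for f in raw_features.split(",") if f.strip())
    return Tag(match.group(1), features)


if __name__ == "__main__":
    for label in ("N(sing)", "V(past)", "ADV"):
        print(label, "->", parse_tag(label))
```

Keeping the wordclass separate from its features makes it possible to search either for all nouns or only for, say, plural nouns when the tags are checked later.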
Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.
5.1 Cross-sectional checking
The syntactic wordclass tags will be checked on a cross-sectional, construction-by-construction
basis. This will allow the check to concentrate on just one grammatical construction at a time, and
corrections can be made to each instance of the construction throughout the whole corpus, if
necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.
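The idea behind cross-sectional checking can be sketched as an index from each tag to every place it occurs, so that all instances of one category are reviewed together rather than text by text. This is a simplified illustration that groups by wordclass tag only; it does not reproduce ICECUP, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def group_by_tag(tagged_texts):
    """Index every token occurrence by its tag across the whole corpus.

    tagged_texts maps a text identifier to a list of (word, tag) pairs.
    Returns a dict mapping each tag to a list of
    (text_id, position, word) tuples, in corpus order.
    """
    index = defaultdict(list)
    for text_id, tokens in tagged_texts.items():
        for position, (word, tag) in enumerate(tokens):
            index[tag].append((text_id, position, word))
    return dict(index)


# Hypothetical two-text corpus with illustrative tags.
corpus = {
    "text-A": [("dogs", "N"), ("bark", "V"), ("loudly", "ADV")],
    "text-B": [("quickly", "ADV"), ("run", "N")],  # "run" mis-tagged
}

if __name__ == "__main__":
    for tag, occurrences in group_by_tag(corpus).items():
        print(tag, occurrences)
```

Reviewing the list for one tag at a time makes a mis-tagged item, such as the verb tagged as a noun in the sample data, easier to spot and to correct consistently.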
5.2 Spot-checking
Finally, the corpus will be ‘spot-checked’ before being released.
Deliverable D5: The final Multi-national Corpus of English
We will use English concordance and corpus exploration tools to analyse lexical and grammatical
variation across the contributing dialects of English in the sampler corpus.
Deliverable D6: Research paper on lexical and grammatical variation in the Multi-National Sampler
Corpus, to be submitted to the International Journal of Corpus Linguistics.
The normal dissemination route for academic research is journal and conference papers. D6 is
directly publishable; other papers will need to be written from the other deliverables.
Deliverable D7: Plan for continuing expansion of the Multi-national Corpus of English, extending to
new countries.
4. Benefits
The research will first be beneficial to the government and the educational system in each of the
twenty countries mentioned above, and to the existing ICE teams. A comprehensive description of the
different varieties of English can be obtained from the corpus, and each country will therefore be able to
develop its own reference guides to usage, dictionaries and other teaching materials. This can help
both schools and universities to adapt their methods of teaching, and especially the way in which
English is taught, so that it is spoken to a better standard. The comparison across the dialects of English to
find any striking similarities or differences will be useful for further research and teaching methods
in each country, and will also benefit people who want to travel to or trade with other English-speaking
countries, since the comparison will provide a useful insight into how they will have to adapt
their language. When the corpus is released, it will also be beneficial to other research or academic
institutions across the world. It can be used for comparison or for further research by the teams
behind existing or potential corpora.
Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of
developing basic components for the linguistic community.
• Easing the entry of English-speaking countries into different markets.
• Promoting the different cultures of the 40 countries across the world.
5. Resources
Staff: the development of the project will require the employment of:
3 English corpus linguists as post-graduate Research Fellows and project managers for 3 years.
Consumables:
A powerful laptop PC for each researcher, costing £2000 each.
Consultancy fees of £40,000 for transcription and mark-up of source materials.
Travel and Subsistence:
Results will be reported and published in conference proceedings including Corpus Linguistics
(CL’07 Lisbon) and ICAME (ICAME’06, ’07, ’08, locations not yet known), at an estimated total cost of
£4,000. The costs for the International Steering Panel meetings at the start, middle and end of the project are
estimated at a total of £21,000.
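The non-staff costs listed in this section can be totalled as in the sketch below. The names are hypothetical, staff salaries are not included because no figure is given for them, and the laptop line assumes one £2,000 machine for each of the three Research Fellows.

```python
N_RESEARCH_FELLOWS = 3
LAPTOP_COST_GBP = 2_000

# Non-staff items quoted in the Resources section (pounds sterling).
NON_STAFF_COSTS_GBP = {
    "laptops": N_RESEARCH_FELLOWS * LAPTOP_COST_GBP,
    "transcription and mark-up consultancy": 40_000,
    "conference travel and subsistence": 4_000,
    "steering panel meetings": 21_000,
}


def total_non_staff_cost():
    """Sum of all listed non-staff costs, in pounds."""
    return sum(NON_STAFF_COSTS_GBP.values())


if __name__ == "__main__":
    for item, cost in NON_STAFF_COSTS_GBP.items():
        print(f"{item:40s} £{cost:>7,}")
    print(f"{'total (excluding staff)':40s} £{total_non_staff_cost():>7,}")
```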
References:
Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc
thesis. University of Leeds.
BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November
2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity
Department of English Language & Literature, University College London (2002) The International
Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide
Web: http://www.ucl.ac.uk/english-usage/ice/#
Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005]. Available
from World Wide Web: http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English
Worldwide: The International Corpus of English. Oxford: Clarendon Press
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British
component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from the
World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D.,
Wilson, A. and Rayson P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.
SourceForge.net (2005) Project: MinGW - Minimalist GNU for Windows [online]. [Accessed 11th March
2005]. Available from World Wide Web: http://sourceforge.net
The School of Computing, University of Leeds (1998-2004). The University of Leeds web site [online].
[Accessed 21st October 2004]. Available from World Wide Web:
Part 2: DIAGRAMMATIC WORK PLAN
The work has been organised into eight activity streams. The three research fellows will be working
in parallel during the 36-month project starting August 2005.
WP0: Project Management via International Steering Panel (1 month each over 3 years)
WP1: Collection of Spoken and Written Text of English (10 months RF1, 8 months RF2, 8
months RF3)
WP4: Word-class tagging (6 months RF2, 8 months RF3)
[Gantt chart: for each of Research Fellows 1, 2 and 3, work packages WP0 to WP7 are plotted against the 36 project months, August 2005 to July 2008. Legend: Actual Estimate; Possible overflow.]