A PILOT PROJECT FOR ICE-MAURITIUS

Dolly Koo Tee Fong
BSc Computing and Management 2004-2005

SUMMARY

The overall objective of this project was to develop a prototype of the Mauritius component of the International Corpus of English (ICE) to demonstrate feasibility and potential problems for a larger-scale follow-up project. In doing so, a proposal was also drafted in accordance with the EPSRC requirements, with the possibility of submitting it for funding.

The following was achieved in the project:
• Tools and techniques available for corpus development and processing were investigated and discussed, along with the main ones used by ICE.
• The Mauritius component of ICE, named ICE-Mauritius, was collected and compiled up to 5% of the original size of an ICE project.
• A full work plan was written for a follow-up project to develop a full-scale ICE-lite corpus, consisting not only of English from Mauritius but also from 39 other English-speaking countries.
• Finally, the prototype and the work plan were evaluated by three people who are experienced and involved in corpus collection and funding application.

ACKNOWLEDGEMENT

I would like to thank my project supervisor and personal tutor, Eric Atwell, for his help and support throughout this project and also through my whole third year at Leeds University.

I would also like to thank Gerald Nelson and Serge Sharoff for kindly agreeing to take part in evaluating the project and for their advice.

Finally, I would like to thank my boyfriend, family and flatmates for their input, support and encouragement throughout the course of the project.

CONTENTS

1. Introduction
   1.1 Aim
   1.2 Objectives
   1.3 Minimum Requirements
   1.4 Deliverables
   1.5 Initial Project Schedule
2. Survey of computer technologies for corpus development
   2.1 Background to the problem
       2.1.1 Introduction
       2.1.2 What is a Corpus?
       2.1.3 Overview of The International Corpus of English (ICE)
       2.1.4 Other Corpora
       2.1.5 Reasons for Encoding a Corpus
       2.1.6 ICE Corpus Design
   2.2 Corpus Collection and Encoding
       2.2.1 Collecting Data
       2.2.2 Computerising Data
       2.2.3 ICE Markup System
       2.2.4 Corpus Tagging
       2.2.5 Syntactic Parsing
   2.3 Annotation Tools
       2.3.1 The ICE Markup Assistant
       2.3.2 The Different Taggers Available
       2.3.3 The ICE Tag Selection System
       2.3.4 The ICE Syntactic Marking System
       2.3.5 Different Varieties of Syntactic Annotation
       2.3.6 The ICE Syntactic Tree Annotator
3. Methodology
   3.1 Corpus Design
       3.1.1 Methods to be Used
       3.1.2 Copyright Issues
       3.1.3 Corpus Layout
   3.2 Capturing Text in Electronic Format
       3.2.1 Computerising Speech
       3.2.2 Computerising Written Texts
   3.3 Corpus Annotation
       3.3.1 Structural Mark-up
       3.3.2 Procedure for Annotating the Corpus
4. The Pilot Project
   4.1 Collection of Texts
       4.1.1 Search Methods
       4.1.2 Text Collection
       4.1.3 Written Text Classification
       4.1.4 Permission Letters
       4.1.5 Layout of the Pilot Project
   4.2 Corpus Annotation
       4.2.1 TEI-Header
       4.2.2 Texts Encoding
5. The Proposal
   5.1 Funding opportunities
       5.1.1 Research at University of Leeds, School of Computing
       5.1.2 Introduction to the EPSRC
       5.1.3 Eligibility of Investigators
       5.1.4 Research Opportunities
       5.1.5 How to Apply
   5.2 Writing up the proposal
       5.2.1 Original Idea
       5.2.2 Expansion of Corpus Design
       5.2.3 Writing Up Proposal
6. Evaluation
   6.1 Product
   6.2 Minimum Requirements
   6.3 Project Stages
   6.4 Planning and Schedule
7. Conclusion
References
APPENDIX A: Personal Experience
APPENDIX B: Markup Symbols
APPENDIX C: Corpus Design Layout
APPENDIX D: List of Texts Collected
APPENDIX E: Sample of the Letters of Copyright
APPENDIX F: Template for the Header
APPENDIX G: Example of Raw Text
APPENDIX H: Examples of Encoded Text
APPENDIX I: First Draft of the Case for Support for ICE-lite
APPENDIX J: EPSRC Application Form
APPENDIX K: Revised Case for Support for the ICE-lite Proposal

1. Introduction

1.1 Aim

To develop a prototype of the Mauritius component of the International Corpus of English, to demonstrate feasibility and potential problems for a larger-scale follow-up project.

1.2 Objectives

The objectives of the project are to:
• Compare and evaluate the different computer technologies available to extend the International Corpus of English to Mauritius English.
• Investigate data sources and instigate data collection for a Mauritius ICE sub-corpus.
• Research infrastructure and data collection methods.
• Investigate the requirements and feasibility of a larger-scale follow-on project to develop a full-scale ICE-Mauritius Corpus.

1.3 Minimum Requirements

The minimum requirements are:
• Develop a small-scale prototype of the Mauritian Corpus of English.
• Survey of computer technologies for corpus development and processing.
The possible extensions:
• Work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.

1.4 Deliverables

The project deliverables are:
• The project report
• The prototype of the Mauritian Corpus of English

1.5 Initial Project Schedule

The initial project schedule, Schedule 1 below, does not reflect the actual work done: after obtaining the assessor’s feedback in January, it became clear that the project had to take a new direction, with some changes to the aims and requirements. The new aims and requirements were given above. This also resulted in a new work plan for the second semester, as shown in Schedule 2 below.

Schedule 1: Schedule before Christmas Break

Dates | Milestones | Tasks
11/10/04 - 22/10/04 | (none) | Identify and specify aim and minimum requirements
22/10/04 - 08/11/04 | Section on Background Research | Research requirements of EPSRC
08/11/04 - 22/11/04 | Section on Background Research | Research the International Corpus of English
08/11/04 - 22/11/04 | Section on Background Research | Research on Mauritius English
15/11/04 - 24/01/05 | Appendix I, J & K | Draft Proposal
29/11/04 - 10/12/04 | Mid-project report | Collate mid-project report
10/12/04 - 22/12/04 | n/a | Research and evaluate other EPSRC training courses available
10/12/04 - 22/12/04 | n/a | Research and evaluate different training/education techniques
10/12/04 - 24/01/05 | n/a | Christmas break, revision and end of semester 1 exams
24/01/05 - 31/01/05 | n/a | Decide on delivery mechanism
31/01/05 - 07/02/05 | n/a | Research and decide on what to include in training course
07/02/05 - 21/02/05 | n/a | Write up tutorial
21/02/05 - 07/03/05 | n/a | Work on improving aspects of tutorial and the draft proposal
08/03/05 | n/a | Give training course to research staff and students
08/02/05 - 18/03/05 | n/a | Collect feedback on training course
18/03/05 - 01/04/05 | n/a | Analyse feedback and evaluate training course
01/04/05 - 18/04/05 | Final Report | Complete final report. Most chapters should be already partially written up, but may need reworking.

Schedule 2: New schedule for second semester

Dates | Milestones | Tasks
24/01/05 - 31/01/05 | Section 1 | Decide on new aims & objectives and design new plan
01/02/05 - 07/02/05 | Section on Background Research (sections 1 & 2) | Research on methods available to extend the ICE corpus to Mauritius
07/02/05 - 09/02/05 | Appendix C and Section 3 | Design layout and text categories of ICE Mauritius
10/02/05 - 17/02/05 | Section 4 and Appendices D & E | Collect sample texts from the Internet & send request for copyright permission
18/02/05 - 25/02/05 | Section 4 and Appendices F, G & H | Annotate corpus
26/02/05 - 28/02/05 | Section 4 & 5 and Appendices I, J & K | Investigate feasibility of ICE-Mauritius
01/03/05 - 18/03/05 | Appendices I, J & K | Draft a proposal for ICE-Mauritius
18/03/05 - 01/04/05 | Section 6 | Evaluate corpus & proposal
01/04/05 - 18/04/05 | Final Report | Complete final report. Most chapters should be already partially written up, but may need reworking.

2. Survey of computer technologies for corpus development

2.1 Background to the problem

2.1.1 Introduction

‘Leeds has a track record for research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, the University has developed a Part-of-Speech analysis system, used on other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the first language or second official language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching’ (Atwell, 2004).

Mauritius is one of the many English-speaking African countries, but there is no Mauritian sub-corpus in ICE.
However, the government has started a Cyber City project, the first of a new generation of IT parks in this part of the world. ‘The construction of Ebène CyberCity is a historical milestone towards achieving the Government’s objective of transforming Mauritius into a diversified, high-tech, high income services and knowledge economy’ (BPML, 2004). Therefore, it may be feasible to collect at least some samples of Mauritian English remotely, via the World Wide Web.

Mauritius has been chosen because of the special characteristics of the English used there. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004). However, at that time, slaves were imported from Africa and Madagascar, a large number of labourers from India were brought to work in the sugar cane fields, a small number of Chinese came to trade, and the influence of the French, who were the rulers before the British, was still very strong. The languages brought by the different settlers have therefore significantly influenced the official English language of the country, and the mixture of those languages has also resulted in a new language, Creole, which is nowadays spoken by everyone.

Even though Creole is the most widely spoken language in Mauritius, all official communications and teaching in schools are done in English. However, the English used has particular characteristics. For instance, in many official communications and press reports, some French and Creole words are used to emphasise a specific theme. In school textbooks, words from other languages are bound to appear, such as names of individuals or companies written in Chinese or Hindi.
Another characteristic worth noting is that people in Mauritius tend to think in their native language and then translate what they want to say into English, keeping the same structure and grammar as in their native language. This results in great variation in English among the cultures within the same country, and also in young people learning English differently.

Developing a corpus for Mauritius will therefore allow an interesting and useful analysis and understanding of the different types of English used, and will hopefully help the government, teaching organisations and the people in general to understand and build up a common standard of English.

The pilot project has hence investigated what can be done to instigate and start data collection for a Mauritian ICE sub-corpus. A proposal for the infrastructure and data collection methods has also been made in accordance with the EPSRC requirements.

2.1.2 What is a Corpus?

The term ‘corpus’ was traditionally used to designate a body of naturally-occurring or authentic language data, which might consist of written or spoken texts, or samples of spoken and written language in a particular language or language variety. The corpus could be used as a basis for linguistic research.

In the last thirty-five years, the term ‘corpus’ has been used more to describe the electronic form of the set of language material, which may be processed by computer for various purposes such as language research and engineering. This includes the study of all aspects of language such as syntax, semantics, pragmatics and speech, and more recently lexicographic studies.

Due to the small scale of data available in the past, many past theories and interpretations explaining linguistic phenomena, although accurate, were too narrow to be applied to the whole set of languages. In addition, the focus was more on language structure than on the use of the language.
(Leech, 1997a; Al-Sulaiti, 2004)

Due to the recent explosion in technology, corpora have increased dramatically in size, variety and ease of access. The combined use of corpora and computers to study languages has changed the way linguistic phenomena are analysed. Nowadays, corpus linguists are analysing the use of structures and investigating factors that affect our choice of a particular structure. For instance, the factors may be related to the nature of the writing or speaking, such as science or literature, or to the discovery of typical linguistic patterns in some defined contexts. With computers able to store a huge amount of data, this new view of language analysis becomes more accessible and hence the computer corpus is fast becoming a universal resource for language research.

2.1.3 Overview of The International Corpus of English (ICE)

ICE was initiated in 1988 by the late Sydney Greenbaum, the then Director of the Survey of English Usage, UCL. From 1996 to 2001, it was coordinated by Charles Meyer, University of Massachusetts-Boston, and it is now coordinated by Gerald Nelson, who recently returned to UCL from the University of Hong Kong (Nelson et al., 2002).

The ICE’s primary aim is to collect material for comparative studies of English worldwide and its long-term aim is to produce up to twenty one-million-word corpora. Around the world there are fifteen research teams, shown in Table 1 below, who are preparing electronic corpora of their own national or regional variety of English (UCL, 2002). Five other ICE projects, for Cameroon, Fiji, Ghana, Nigeria and Sierra Leone, have been considered but no text has been collected yet.

Table 1: Components of the ICE project

Australia | Canada | East Africa
Great Britain | Hong Kong | India
Ireland | Jamaica | Malaysia
New Zealand | Philippines | Singapore
South Africa | Sri Lanka | USA

Each ICE corpus consists of one million written and spoken words of English.
Each team follows the same corpus design very closely, as well as the same scheme for grammatical annotation, so as to ensure compatibility between the individual corpora in ICE (UCL, 2002). The texts in the corpus date from 1990 or later. The authors and speakers of the texts have grown up and been taught through the English medium. They are aged 18 or over, were either born in or emigrated at an early age to the country in whose corpus they are included, and were educated through the English medium in the country concerned (UCL, 2002).

The corpora in ICE are being annotated at various levels to enhance their value in linguistic research. These levels are Textual Mark-up, Word-class Tagging and Syntactic Parsing.

Despite attempts to achieve conformity, complete identity between corpora is not possible. There are inevitable differences in the samples taken for speech and writing, and in some countries certain categories are difficult to obtain. Information about each author or speaker may also be unavailable, and the projects have different start dates, which results in discrepancies in the dates of the texts. However, for global comparisons, the corpora are similar enough to justify any analysis carried out (Greenbaum, 1996).

2.1.4 Other corpora

The Brown Corpus of Standard American English (Kučera and Francis, 1967) was the first modern electronically readable corpus to be developed. The corpus consists of one million words of American English texts printed in 1961, and the texts are sampled in different proportions from 15 different text categories, among them press, skills and hobbies, religion and fiction. Compared with the various corpora available today, the Brown Corpus of Standard American English is considered to be small. However, it is still used in teaching and as a model for the development of other corpora.
The British National Corpus (BNC) is another large corpus, completed in 1994 (Leech, 1997a). The corpus consists of 100 million words and contains both written and spoken material.

In addition to the British and American English corpora, there are corpora of other varieties of English, such as the Australian Corpus of English (ACE), the Finland Corpus of Early English Correspondence Sampler and others (Breyer, 2005). Many other corpora have also been developed for different languages, such as the Czech National Corpus, CORIS (an Italian Corpus) and the French Corpus.

These corpora are for general use in linguistics research. There exist other corpora which are more specialised, such as the Air Traffic Control (ATC) corpus and the Trains Spoken Dialogue Corpus (Al-Sulaiti, 2004).

Modern-day corpora are of various types and the difference in their composition depends on their use. Balanced corpora like the Brown Corpus, which includes different types of written English, are more valued by individuals who are interested in linguistic description and analysis. In other corpora, size may be more important than balance. One such example is the Penn Treebank. In this case, linguists are more interested in the computational aspects of the corpus, involving research in natural language processing. For instance, these types of corpora have been used in the development of taggers and parsers (Meyer, 2002).

It is useful to note that part of the British component of the ICE and the BNC have been funded by the Engineering and Physical Sciences Research Council (EPSRC) in the UK, and the Brown Corpus has been funded by the equivalent of the EPSRC in America.

2.1.5 Reasons for encoding a Corpus

Two types of corpus can be identified, whether written or spoken: raw corpus and annotated (or marked-up) corpus.
The former is mainly the natural text itself with no additional information, and in the latter the text is “enriched with a variety of information” (Al-Sulaiti, 2004). Although raw corpora can be used with the help of tools to carry out linguistic analysis, annotated corpora support better analysis. Leech (1997a) has identified the following advantages of annotated corpora:
• Extracting information: a piece of language can have various meanings and uses in its orthographic form; for instance, the word ‘left’ can be a noun, an adjective or a verb. Therefore, extracting information becomes easier and more efficient if the corpus is grammatically tagged, since each occurrence of ‘left’ will be accompanied by a label indicating its type.
• Re-usability: once the corpus has been annotated, it can be handed on to other users. This is a valuable advantage since corpus annotation is usually an expensive and time-consuming process.
• Multi-functionality: annotation adds overt linguistic information to a corpus and this makes it useful for a multitude of purposes.

2.1.6 ICE Corpus Design

Length of Corpus

Meyer (2002) has stated that the first questions to ask when designing a corpus are:
1. “What will be the overall length of the corpus?” The lengthier the corpus, the better it is, but it has to be feasible.
2. “How long the corpus needs to be to permit the kinds of studies one envisions for it?”

The standard requirement of ICE is that the core corpus should contain a total of one million written and spoken words of English. However, some regions might want to collect more material in certain text categories or to include additional categories, depending on their needs (Greenbaum, 1991b).
Type of genres

Again, Meyer (2002) has raised an important question: “Why these genres and not others?” To answer this question, we have to consider the different types of corpora that have been created and the purpose of each one. As mentioned above, some corpora are multi-purpose, namely the BNC and the ICE Corpus, which means that they are intended to be used for a variety of different purposes and therefore need to contain a broad range of genres. However, multi-purpose corpora do not always cover a full representation of all genres. Therefore, special genres need to be collected for special-purpose corpora such as the Michigan Corpus of Academic Spoken English (MICASE), which is used to study the type of speech used by individuals conversing in an academic setting (Meyer, 2002).

The ICE corpus is usually divided in the ratio 60:40 between spoken and written English respectively. Within both halves a distinction is made between private (conversation or letter) and public (news report or lecture). Both the private and public sections can be further divided into monologue and dialogue for speech, and scripted, non-printed and printed for written texts (Greenbaum, 1991b). Below are the typical ICE Text Categories, taken from the ICE website.

Table 2: ICE Text Categories
Numbers in brackets indicate the number of 2,000-word texts in each category.
Spoken (300)
   Dialogues (180)
      Private (100)
         Conversations (90)
         Phone calls (10)
      Public (80)
         Class Lessons (20)
         Broadcast Discussions (20)
         Broadcast Interviews (10)
         Parliamentary Debates (10)
         Cross-examinations (10)
         Business Transactions (10)
   Monologues (120)
      Unscripted (70)
         Commentaries (20)
         Unscripted Speeches (30)
         Demonstrations (10)
         Legal Presentations (10)
      Scripted (50)
         Broadcast News (20)
         Broadcast Talks (20)
         Non-broadcast Talks (10)
Written (200)
   Non-printed (50)
      Student Writing (20)
         Student Essays (10)
         Exam Scripts (10)
      Letters (30)
         Social Letters (15)
         Business Letters (15)
   Printed (150)
      Academic (40)
         Humanities (10)
         Social Sciences (10)
         Natural Sciences (10)
         Technology (10)
      Popular (40)
         Humanities (10)
         Social Sciences (10)
         Natural Sciences (10)
         Technology (10)
      Reportage (20)
         Press reports (20)
      Instructional (20)
         Administrative Writing (10)
         Skills/hobbies (10)
      Persuasive (10)
         Editorials (10)
      Creative (20)
         Novels (20)

Length of individual text samples

Each text in the corpus contains about 2,000 words (UCL, 2002), following the sample-size norms of the pioneering Brown and LOB corpora. Therefore, there are 500 texts in each regional corpus, with 10 texts (20,000 words) as the minimum for each text category.

Since most corpora contain relatively short samples of text, text fragments instead of complete texts tend to be stored. Ideally, it would be better to include complete texts in the corpora, but the length of the texts is one of the main reasons why this is not possible. For instance, a book is too lengthy and would take up the whole corpus if it were used in full. If only part of a text is used, the 2,000-word sample can be chosen from any part of the text.

In existing ICE Corpora, many samples also consist of composite texts, that is, a series of complete short texts that total 2,000 words in length (UCL, 2002; Meyer, 2002). These often include personal letters, which are usually less than 2,000 words.
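The design figures above can be checked with some simple arithmetic. The following Python sketch is illustrative only and is not part of any ICE tooling: it copies the text counts from Table 2, confirms that they add up to 500 texts and one million words in a 60:40 spoken-to-written split, and works out the size of a corpus at 5% of the full design, the scale targeted by this pilot. The names ICE_DESIGN, WORDS_PER_TEXT, texts_in, total_texts and scaled_size are hypothetical and introduced purely for this example.

```python
# Text counts per section, copied from Table 2.
ICE_DESIGN = {
    "spoken": {
        "private dialogues": 100,
        "public dialogues": 80,
        "unscripted monologues": 70,
        "scripted monologues": 50,
    },
    "written": {
        "non-printed": 50,
        "printed": 150,
    },
}

# Nominal sample size; actual samples are only "about" 2,000 words.
WORDS_PER_TEXT = 2000


def texts_in(section):
    """Number of texts in one half of the design ('spoken' or 'written')."""
    return sum(ICE_DESIGN[section].values())


def total_texts():
    """Number of texts in a full ICE component."""
    return sum(texts_in(section) for section in ICE_DESIGN)


def scaled_size(fraction):
    """Texts and words in a corpus that is `fraction` of the full design."""
    texts = round(total_texts() * fraction)
    return texts, texts * WORDS_PER_TEXT


if __name__ == "__main__":
    print("spoken texts:  ", texts_in("spoken"))
    print("written texts: ", texts_in("written"))
    print("total texts:   ", total_texts())
    print("total words:   ", total_texts() * WORDS_PER_TEXT)
    print("spoken share:  ", texts_in("spoken") / total_texts())
    print("5% pilot size: ", scaled_size(0.05))
```

Under these figures, a corpus at 5% of the full design corresponds to about 25 texts, or roughly 50,000 words.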
Range of speakers and writers

Meyer (2002) has pointed out that it is “not simply whether one obtains texts from native or nonnative speakers but rather that the texts selected for inclusion are obtained from individuals who accurately reflect actual users of the particular language variety that will make up the corpus”. Greenbaum (1991b) has also emphasised that it is not the language that has to be selected but the people, and that their language should not be excluded on subjective criteria of correctness, adequacy or appropriateness. Therefore, since the ICE project is restricted to educated English, the only criterion for selecting the population should be “adults of eighteen or over who have received formal education through the medium of English to at least the completion of secondary school” (UCL, 2002).

The selection of texts should not be random; population differences and textual differences should be taken into account. Some relevant variables to consider are age, gender, level of education, dialect variation (e.g. urban or rural locations), ethnic group, region, occupation and status in occupation, social contexts and social relationships.

2.2 Corpus Collection and Encoding

2.2.1 Collecting Data

Spoken Texts

Collecting spoken texts, especially spontaneous speech, is the most difficult and frustrating task in the development of the corpus (Sharoff, 2005; Nelson, 1996a). The cooperation of the speakers is required, but often there is the problem of the “observer’s paradox”: people tend to behave differently when being observed (or recorded) and therefore the way they speak may change. According to Meyer (2002), one way around this problem is to record a longer stretch of speech and then choose the most natural part. However, speech collection is already time-consuming, and recording a longer stretch just to obtain a natural part would be too costly.
To record the speech, either an analogue or a digital recorder can be used since both yield satisfactory results. Meyer (2002) has nevertheless recommended using a digital recorder, since the recording is easier to transfer to the computer for manipulation and longer stretches of speech can be recorded. To improve the quality of recordings, the type of microphone used is also an important consideration.

For the other spoken categories, such as broadcast speech, it is best to record directly from radio or television.

Written Texts

According to Nelson (1996a), written texts are easier to obtain than spoken texts, but in Sharoff’s (2005) experience extra effort is needed to obtain private texts, such as personal letters. With the Internet, a wide range of texts is easily and freely available today. Although using electronic texts saves the time and effort of computerising printed texts, some important questions raised by Meyer (2002) also need to be considered: “Are electronic texts essentially the same as traditionally published written texts?”, “Is an article from a personal webpage any different from one that has gone through the editorial process?”

Copyright Issues

Collecting the texts is one complex task, but without copyright permission the texts cannot be used in the corpus, especially if the corpus is going to be made accessible to anyone and used internationally for research purposes. Based on the experience of the other ICE teams, Nelson (1996a) has found that owners of texts are usually willing to help. The only frustration is getting the permission within a short time period. Having other priorities, owners usually take a long time to reply and some may not reply at all. Nelson (1996a) has also discovered that, due to major confidentiality issues, it is more difficult to obtain permission for texts in the commercial sector.
2.2.2 Computerising Data

Computerising Written Data

Nowadays, most texts are readily available in electronic form. Texts downloaded from the Internet, however, contain a significant amount of HTML code. Meyer (2002) has suggested using software such as “HTMASC” (http://www.bitenbyte.com) to automatically strip the HTML coding from the texts and produce an ASCII text file with no coding.

If it is not possible to obtain the texts in electronic form, a printed copy of the text can be converted with an optical scanner. Scanners exist in two types: form-feed and flatbed. Meyer (2002) has encouraged the use of flatbed scanners since experience with ICE-USA has shown that they are slightly more accurate.

Transcription of Spoken Texts

After the spoken texts have been collected in digital form and copyright permissions obtained, the texts need to be written down. This process is known as transcription and, more precisely, as defined by Edwards (1995), “it involves capturing who said what, in what manner, to whom and under what circumstances”.

Software programs such as “Voice Walker 2.0”, used within the Santa Barbara Corpus of Spoken American English, and “Sound-Scriber”, used in the Michigan Corpus of Academic Spoken English (MICASE), have been designed specifically to help in the transcription of digitised speech. These programs (Meyer, 2002) can be downloaded freely from http://www.linguistics.ucsb.edu/resources/computing/download/download.htm and http://www.lsa.umich.edu/eli/micase/soundscriber.html respectively.

2.2.3 ICE Mark-up System

Mark-up is the first stage in the annotation process of ICE corpora. Nelson (1996b) has described mark-up as two distinct types: textual mark-up, which is added to the texts themselves, and bibliographical and biographical mark-up, which is stored externally in the form of a file header for each text.
Two manuals, one for spoken and one for written texts (Nelson, 1991a, 1991b), describe the textual mark-up system, and a third is available for encoding bibliographical and biographical information (Nelson, 1991c). In written texts, mark-up symbols are used to encode typographic features such as boldface, italics and underlining, and structural features such as sentence boundaries, paragraph boundaries and headings. In spoken texts, mark-up is needed to indicate sentence boundaries, speaker turns, overlapping strings and pauses (Nelson et al., 2002).

More recently, with the increasing use of electronic documents, a standard for the mark-up of these types of documents has been developed. This standard, known as Standard Generalized Markup Language (SGML), offers the advantage of computer independence, that is, the corpus can be transferred from computer to computer while keeping its original description. However, although SGML is a flexible language, problems such as the lack of general style sheets arise when transferring the text over the Web. Because of these problems, interest has shifted to a newly emerging mark-up system, the Extensible Markup Language (XML), which has been designed mainly for use in web documents (Meyer, 2002).

In the ICE components, all mark-up symbols are characterised by angled brackets, appearing with an opening symbol <symbol> and a closing symbol </symbol>. Some examples from Nelson’s (1991a, 1991b) manuals are given below.

Written Text:

Boldface <bold> </bold>
  Example: Readers must return all books to the library
  Markup: Readers <bold>must</bold> return all books to the library

Italics <it> </it>
  Example: You must attend every day during term
  Markup: You must attend <it>every</it> day during term

Typeface <typeface> </typeface>
  Example: Warhol is alive and well
  Markup: Warhol is <typeface: courier>alive</typeface: courier> and well

Spoken Text:

Overlapping speech <[> </[> and <{> </{>
  Example: $A's utterance "Nothing stands out" overlaps completely with $B's "Yeah I suppose".
  Markup: <$A> <#><{><[>Nothing stands out</[>
          <$B> <#><[>Yeah I suppose</[></{>

Anthropophonics (non-verbal sounds)
  Examples: <O>cough</O> <O>sneeze</O> <O>laugh</O>

Mark-up can be done manually, but to speed up the process Nelson (1996b) has proposed partially automating it with the Mark-up Assistant program, a set of WordPerfect macros that assigns whole mark-up symbols to single keys. The minimum set of ICE mark-up symbols used in this project is given in Appendix B.

Bibliographical and biographical data

The description of each text is represented as bibliographical and biographical information, and this mark-up is stored separately in a header file. The description includes ‘category’, ‘date’ and ‘publisher’, among others, and the data is enclosed within opening and closing symbols as in the textual mark-up, for example, <date> 1996 </date>. A common standard used in many corpora is the Text Encoding Initiative (TEI) (Al-Sulaiti, 2004), which has been working to incorporate XML within its standard. The TEI header comprises four main components:

• File Description <fileDesc>: includes bibliographic information about the text.
Below is an example from the TEI website:

  <fileDesc>
    <titleStmt>
      <title>Thomas Paine: Common sense, a machine-readable transcript</title>
      <respStmt><resp>compiled by</resp> <name>Jon K Adams</name></respStmt>
    </titleStmt>
    <publicationStmt>
      <distributor>Oxford Text Archive</distributor>
    </publicationStmt>
    <sourceDesc>
      <bibl>The complete writings of Thomas Paine, collected and edited by Phillip S. Foner (New York, Citadel Press, 1945)</bibl>
    </sourceDesc>
  </fileDesc>

• Encoding Description <encodingDesc>: states the relationship between the text and its source. The simplest example from Baker et al. (2003) is shown below:

  <encodingDesc>
    <projectDesc>Text collected for use in EMILLE project</projectDesc>
    <sampleDesc>simple written text only has been transcribed. Diagrams, pictures and tables have been omitted and their place marked with a gap element</sampleDesc>
  </encodingDesc>

• Profile Description <profileDesc>: supplies non-bibliographic information about the text and the participants. The profile description can be divided into two parts: the text description and the person description.
Again an example from the TEI website is given below:

  <profileDesc>
    <textDesc n='novel'>
      <channel mode=w>print; part issues</channel>
      <constitution type=single>
      <derivation type=original>
      <domain type=art>
      <factuality type=fiction>
      <interaction type=none>
      <preparedness type=prepared>
      <purpose type=entertain degree=high>
      <purpose type=inform degree=medium>
    </textDesc>
    <person id=P1 sex=F age='mid'>
      <birth date='1950-01-12'>
        <date>12 Jan 1950</date>
        <name type=place>Shropshire, UK</name>
      </birth>
      <firstLang>English</firstLang>
      <langKnown>French</langKnown>
      <residence>Long term resident of Hull</residence>
      <education>University postgraduate</education>
      <occupation>Unknown</occupation>
      <socecstatus source=PEP code=B2>
    </person>
  </profileDesc>

• Revision Description <revisionDesc>: gives a summary of the history of the text and provides a detailed change log in which each change made to a text may be recorded. Some examples of changes made to a text, adapted from the TEI website, are shown below:

  <revisionDesc>
    <change><date>1996-01-22 <name>CM SMcQ<what>finished proofreading</change>
    <change><date>1995-10-30 <name>L.B. <what>finished proofreading</change>
    <change><date>1995-07-20 <name>R.G. <what>finished proofreading</change>
    <change><date>1995-07-04 <name>R.G. <what>finished data entry</change>
    <change><date>1995-01-15 <name>R.G. <what>began data entry</change>
  </revisionDesc>

For the ICE corpora, Nelson (1996b) has re-classified the above information in the header file into four different levels, with very similar attributes:

• Text Description: specifies the text category and subcategories so that the text can be located in the hierarchy of the corpus.
• Text Source: records bibliographical data about the sources of texts in the corpus, such as source title, publisher, date and place of publication. Copyright statements are also included in this level.
• Text Internals: contains information about the specific extract used in the corpus, for example, title of article, page numbers, relationship between speakers.
• Biographical Information: includes details, such as sex, age and nationality, of each author and speaker in the corpus.

2.2.4 Corpus Tagging

During this stage, each lexical item is assigned a part-of-speech label or tag, for example ‘N’ for noun. In addition, most tags carry further information, which appears in brackets. Together they form the tagset of each item (Nelson et al., 2002).

Leech’s (1997b) principles for creating tagsets are adopted for the ICE components, that is, the tagsets should satisfy the three criteria below:

• Conciseness: labels should be brief
• Perspicuity: labels should be user-friendly and easy to read and remember
• Analysability: labels should be decomposable into their logical parts, for instance, ‘noun’ can occur above more specific tags such as ‘singular’ or ‘present tense’

Over the years, a number of tagging programs have been developed to insert a variety of tagsets, and most taggers are highly accurate, with success rates above 95%. The different tagging software available is discussed later in the report. In the ICE corpora, the texts are automatically tagged using the TOSCA Tagger, developed by the TOSCA Research Group at the University of Nijmegen (UCL, 2004). An example of a grammatically tagged sentence from the ICE website is shown below:

  Each            PRON(univ,sing)
  of              PREP(ge)
  these           PRON(dem,plu)
  is              V(cop,pres)
  the             ART(def)
  responsibility  N(com,sing)
  of              PREP(ge)
  one             NUM(card,sing)
  person          N(com,sing)

2.2.5 Syntactic Parsing

For the ICE components, the tagged corpus from the previous stage forms the input to the parsing stage. However, before the tagged corpus is automatically parsed, it first needs to be pre-edited.
The pre-editing stage, also known as syntactic marking, involves manually marking several high-frequency constructions in order to reduce the ambiguity of the input and hence the number of decisions that the automatic parser has to make (Nelson et al., 2002). Following syntactic marking, the corpus is submitted for syntactic analysis to the automatic parser developed by the TOSCA Research Group at the University of Nijmegen. Every sentence in the corpus is analysed at phrase, clause and sentence level, and the analysis is shown in the form of a parse tree, as in Figure 1 (UCL, 2004).

Figure 1: example of ICE Syntactic Parsing

The parse tree is then analysed with ICETree 2, which is a “dedicated syntactic tree editor” designed especially for the ICE corpora by the Survey of English Usage. ICETree can also be used with other corpora, but some modifications to the data files are required first. Unlike grammatical tagging, “syntactic annotation tends to lack a sense of standard practice”; parsing software has much lower accuracy rates (70-80 percent at best) and requires human intervention at varying levels (Leech and Eyes, 1997). Syntactic parsing is seen as the most difficult and time-consuming stage in the development of a corpus (Nelson et al., 2002).

2.3 Annotation Tools

2.3.1 The ICE Markup Assistant

The ICE Mark-up Assistant reduces the time taken to insert mark-up symbols by automating and simplifying key presses. Generally, it can save up to tens of minutes per text. The program consists of a set of WordPerfect macros, which allow the text unit mark-up to be inserted automatically at probable sentence boundaries, for example, wherever a full stop is followed by a space. Most mark-up types require an opening and a closing symbol, and the ICE Mark-up Assistant also helps to ensure that all mark-up symbols are closed.
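The kind of closing check described here can be illustrated with a minimal sketch. It is not the Mark-up Assistant itself (which is a set of WordPerfect macros); the function name check_markup and the small set of paired symbols are assumptions chosen for illustration, taken from the examples in section 2.2.3.

```python
import re

# Hypothetical subset of paired ICE symbols, used only for this illustration.
PAIRED = {"bold", "it", "O"}
# Matches an opening symbol such as <bold> or a closing symbol such as </bold>.
TAG = re.compile(r"<(/?)([A-Za-z]+)>")


def check_markup(text, paired=PAIRED):
    """Return a list of problems found in the paired mark-up symbols of text."""
    problems = []
    open_symbols = []  # stack of symbols that are currently open
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if name not in paired:
            continue  # ignore symbols outside the illustrative subset
        if not closing:
            if name in open_symbols:
                problems.append(f"<{name}> opened again before being closed")
            open_symbols.append(name)
        elif open_symbols and open_symbols[-1] == name:
            open_symbols.pop()  # correctly closes the most recent symbol
        else:
            problems.append(f"unexpected </{name}>")
    # Anything still on the stack was never closed.
    problems.extend(f"<{name}> is never closed" for name in open_symbols)
    return problems


print(check_markup("Readers <bold>must</bold> return all books to the library"))
print(check_markup("You must attend <it>every day during term"))
```

The stack keeps track of which symbols are open, so the same pass reports both a symbol that is re-opened too early and one that is left unclosed at the end of the text.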
For instance, if the user tries to open the same symbol again before closing it, the Mark-up Assistant reminds the user to close it first (Quinn and Porter, 1996).

Using a reduced system of annotation is another way of minimising the time taken to annotate texts. For those ICE teams that lack the resources to insert all the ICE mark-up that has been developed, the ICE project reduces the structural mark-up required to the most “essential” mark-up (Meyer, 2002).

2.3.2 The Different Taggers Available

Automatic text tagging is an important first step in discovering the linguistic structure of text corpora. For a tagger to function as a practical component in a language processing system, Cutting et al. (2005) believe that a tagger must be:

• Robust: A tagger should be able to deal with ungrammatical constructions, isolated phrases such as titles, and non-linguistic data such as tables and special words (which might be unknown to the tagger).
• Efficient: Because of the large number of words to be analysed and tagged in every corpus, a tagger must be time-efficient, and any training required should also be fast, to allow rapid turnaround with new corpora and new text genres.
• Accurate: A tagger should assign the correct part-of-speech tag to every word it encounters, to reduce human intervention.
• Tunable: A tagger should be able to take different hints to correct systematic errors and to be adapted to different corpora.
• Reusable: A tagger should require a minimal amount of effort to be retargeted to new corpora, new tagsets and new languages.

There are two types of taggers: rule-based and probabilistic (Garside and Smith, 1997; Meyer, 2002). In a rule-based tagger, grammar rules are written into the tagger and tags are inserted on the basis of these rules. The TAGGIT program was among the first rule-based taggers to be developed, followed by the Brill tagger.
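The idea of writing rules into a tagger can be sketched in a few lines. The lexicon, the tag labels and the two context rules below are invented for illustration only; they are not the rules of TAGGIT or of any other tagger named in this section.

```python
# Toy lexicon: each word maps to its possible tags (hypothetical values).
LEXICON = {
    "the": ["ART"],
    "a": ["ART"],
    "they": ["PRON"],
    "dog": ["N"],
    "run": ["N", "V"],
    "walk": ["N", "V"],
}


def tag_rule_based(words):
    """Tag each word using hand-written context rules rather than statistics."""
    tagged = []
    for word in words:
        # Rule 0: words missing from the lexicon default to noun.
        candidates = LEXICON.get(word.lower(), ["N"])
        previous_tag = tagged[-1][1] if tagged else None
        if len(candidates) == 1:
            tag = candidates[0]  # unambiguous word, nothing to decide
        elif previous_tag == "ART" and "N" in candidates:
            tag = "N"  # Rule 1: after an article, choose the noun reading
        elif previous_tag == "PRON" and "V" in candidates:
            tag = "V"  # Rule 2: after a pronoun, choose the verb reading
        else:
            tag = candidates[0]  # Rule 3: otherwise take the first listed tag
        tagged.append((word, tag))
    return tagged


print(tag_rule_based(["the", "run"]))
print(tag_rule_based(["they", "run"]))
```

Every decision in this sketch is made by an explicit rule, so correcting a systematic error means editing or adding a rule by hand.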
However, rule-based taggers are being superseded by probabilistic ones. A probabilistic tagger assigns a tag to a word on the basis of the most likely tag given the word and its immediate neighbours. Garside and Smith (1997) give the example of a sentence beginning the run: the word run has a high probability of being a noun rather than a verb because it is preceded by the.

The most common taggers with which corpus linguists typically work are:

• The TAGGIT program by Greene and Rubin was one of the earliest taggers, developed around 1971 as an aid in the tagging of the Brown Corpus. TAGGIT tagged 77% of the corpus and the rest was done manually over a period of several years. The tags assigned were from a set of some 77 tags (the Brown tags). (Garside and Smith, 1997)
• CLAWS (the Constituent Likelihood Automatic Word-tagging System), another of the first tagging programs, was designed in the early 1980s at the University of Lancaster (Atwell, 1983). CLAWS has consistently achieved 96-97% accuracy. Since then, various versions of the CLAWS program have been developed and used to tag the LOB Corpus (the British counterpart of the Brown Corpus) and the British National Corpus. (Leech, 1997a)
• The TOSCA (Tools for Syntactic Corpus Analysis) tagger has been designed by the TOSCA team at the University of Nijmegen to insert two tagsets, namely the TOSCA tagset, which is used to tag the Nijmegen Corpus, and the ICE tagset, composed of 262 tags. (UCL, 2004; Meyer, 2002)
• Another tagger that can be used to insert the ICE tagset is the AUTASYS tagger (Fang, 1996). The tagger has been developed by Fang and Xiaoli at the Guangzhou Institute of Foreign Languages, China, and it assigns not only ICE tags but also LOB tags and SKELETON tags. AUTASYS has an accuracy rate of 96% and processes words quickly.
• The Brill tagger, a multi-purpose tagger, can be trained to insert any tagset the user is working with. It can also be applied to any language (Garside and Smith, 1997; Atwell et al., 2000).
• EngCG-2 (the Helsinki English Constraint Grammar) is a tagger that has been designed to overcome the problems in the TAGGIT program and other rule-based taggers. It has a wider application and is able to “refer up to sentence boundaries rather than the local context alone” (Meyer, 2002). One main advantage of EngCG-2 is its 99.5% accuracy rate.

2.3.3 The ICE Tag Selection System

TAGSELECT helps users to select among the alternative word-class tags generated by the TOSCA tagger or AUTASYS. The most likely alternative tags for each word are displayed first, so human intervention is needed only if the first tag is not the correct one. Where no correct alternative is provided, a new tag can be chosen from the list of possible tags. TAGSELECT is user-friendly: it runs under Microsoft Windows, so all functions are available through menus, buttons and scroll bars. (Quinn and Porter, 1996)

2.3.4 The ICE Syntactic Marking System

For the ICE projects, syntactic markers are added to the tagged texts prior to parsing by the TOSCA parser or any other parser that requires such pre-editing. This is done with the ICEMARK system. Syntactic markers make the input to the parser simpler and therefore restrict the number of alternative syntax trees generated. Like the ICE Tag Selection System, ICEMARK runs under Microsoft Windows, which makes it user-friendly. (Quinn and Porter, 1996)

2.3.5 Different Varieties of Syntactic Annotation

Tagging and parsing are closely related, and so many parsers have taggers built into them. For instance, the EngFDG (Functional Dependency Grammar of English) parser and the TOSCA parser can assign both syntactic functions and part-of-speech tags to words (Meyer, 2002).
Like taggers, parsers can be either probabilistic or rule-based. Both kinds have been widely used, but in recent years the emphasis has been on developing probabilistic parsers, since they are thought to be more robust in the sense that they are “able to parse rare or aberrant kinds of language, as well as more regular, run-of-the-mill types of sentence structures” (Leech and Eyes, 1997).

The Lancaster/IBM Treebank

A skeleton parsing scheme based on a shallow PS (phrase-structure) model has been used to parse about 3 million words of text. The PS model simply involves analysing every sentence in the corpus and adding labelled brackets to it. A sample of skeleton parsing from the Lancaster/IBM Spoken English Corpus is shown in Figure 2. Note that the tree is incomplete and that the number of bracket labels used is quite small. This is intentional, to speed up the process and to limit the complexity of the parsing (Leech and Eyes, 1997).

  SJ06 298v
  [S But_CCB ,_, [[N the_AT thing_NN1 N][V was_VBDZ V]] ,_, [N you_PPY N] often_RR [V found_VVD [Fn that_CST [Fa although_CS [N you_PPY N][V had_VHD [N a_AT1 reserved_JJ seat_NN1 N]V]Fa] ,_, that_CST there_EX just_RR [V would_VM n’t_XX be_VBO [N room_NN1 N][P on_II [N the_AT train_NN1 N]P]V]Fn]V] ._. S]

Figure 2: Sample from the Lancaster/IBM Spoken English Corpus (Leech and Eyes, 1997)

The Penn Treebank: Phase 1

The Penn Treebank is the largest and best-known treebanking operation available today. It has been developed at the University of Pennsylvania by Mitchell Marcus and his team and is closely modelled on the Lancaster/IBM Treebank. A PS model of parsing is used and incompletely parsed trees are accepted into the Treebank. The differences are that the Penn Treebank tree is displayed vertically, as shown in Figure 3 below, and that the Treebank is generally available throughout the world. (Marcus et al., 1993) Another more ambitious version (Phase 2) of the Penn Treebank is being developed.
In the Phase 2 Treebank, a wider range of additional information, such as functional labels or types of adverbial, will be added.

  ( (S (NP (NP Pierre Vinken)
           ,
           (ADJP (NP 61 years)
                 old
                 ,))
       will
       (VP join
           (NP the board)
           (PP as
               (NP a nonexecutive director))
           (NP Nov. 29)))
    .)

Figure 3: A sentence from the Penn Treebank (Phase 1) (Leech and Eyes, 1997:42)

Nijmegen Treebanks

The TOSCA parsing system predates the Penn Treebank: it was set up in the early 1980s at the Catholic University of Nijmegen, Holland. It uses a grammatical model known as Affix Grammar, and the TOSCA Treebank is integrated with the Linguistic DataBase (LDB), which allows the Treebank to be searched for varied features. One of its main features is that it allows users to correct or change the parse where necessary. Figure 4 gives an example of a sentence from the TOSCA Treebank. (Leech and Eyes, 1997)

  -:TXTU()
  UTT:S(act,indic,inter,montr,pres,unm)
  INTOP:AUX(do,indic,pres){Does}              Does
  SU:NP()
  NPHD:PN(pers,sing){he}                      he
  V:VP(act,do,indic,montr)
  MVB:LV(indic,nfin,montr){realize}           realize
  OD:CL(act,indic,intens,pres,unm,zsub)
  SU:NP()
  NPHD:PN(pers,sing){he}                      he
  V:VP(act,indic,intens,pres)
  MVB:LV(indic,intens,pres){is}               is
  CS:AJP(prd)
  AJHD:ADJ(prd){wrong}                        wrong
  PUNC:PM(qm){?}                              ?

Figure 4: Sentence from the TOSCA Treebank (Leech and Eyes, 1997:44)

The SUSANNE Corpus

Geoffrey Sampson’s SUSANNE Corpus is a Treebank which provides a great deal of parsing information for each sentence. It is the result of manual analysis and “contains much detail within a small compass.” An example from the SUSANNE Corpus is given in Figure 5 below. Moreover, it is freely available to any research community. The only downside is that the texts are old (1961) compared to what people would usually analyse.
(Sampson, 1995)

  N03:0460f N03:0460g N03:0460h N03:0460i N03:0460j N03:0460k N03:0460m N03:0460n N03:0460p N03:0460q N03:0460a N03:0460b
  - YB PPHS1m VVDt AT NN1c II NP1m CC VVDv AT NN1c YF
  <minbrk> He handed the bayonet to Dean and kept the pistol +.
  he hand the bayonet to Dean and keep the pistol -
  [Oh.Oh] [O[S[Nas:s.Nas:s] [Vd.Vd] [Ns:o. .Ns:o] [P:u. [Nns.Nns]p:u] [S+ [Vd:Vd] [Ns:o. [.Ns:o]S+]S] .

Figure 5: A sample from the SUSANNE Corpus (Sampson, 1995:32). The columns of the original figure (reference number, word tag, word, lemma, parse) are given here one per line.

The Helsinki Constraint Grammar

The Helsinki Constraint Grammar parser adopts a dependency grammar model instead of the PS grammar model used by the other parsers mentioned above. The parser also provides a breakdown of the attributes of individual words, such as sub-categorisation information for verbs, and in addition functional labels such as ‘subject’ or ‘object’ are added. A sample of Helsinki parser output is shown in Figure 6 below. (Leech and Eyes, 1997)

  (“<*royal>” (“royal” A ABS (@AN>)))
  (“<*dutch>” (“dutch” <Nominal> A ABS (@AN> @<Nom)))
  (“<*shell>” (“shell” N NOM SG (@SUBJ)))
  (“<$,>”)
  (“<*worth>” (“worth” PREP (@ADVL)))
  (“<*just>” (“just” ADV (@AD-A>)))
  (“<*$500m>” (“$500m” NUM CARD (@<P)))
  (“<*less=than>” (“less=than” <CompPP> PREP (@ADVL))
                  (“less=than” <**CLB> CS (@CS))
                  (“less=than” ADV (@ADVL)))
  (“<*exxon>” (“exxon” <Proper> N NOM SG (@<P)))
  (“<$,>”)
  (“<is>” (“be” <SV><SVC/N><SVC/A> V PRES SG3 VFIN (@+FMAINV)))
  (“<*third>” (“third” NUM ORD (@PCOMPL-S)))
  (“<$.>”)

Figure 6: Output from the Helsinki ENGCG parser (Leech and Eyes, 1997:48)

2.3.6 The ICE Syntactic Tree Annotator

Within the ICE community, two sets of programs are used for the annotation process. First, a parser is applied to produce partial as well as complete analyses, and then an editor is used to correct or complete the analyses.
The parse analysis is represented in ‘tree’ form, and ICETREE is the editor that allows such ‘tree’ form analyses to be manipulated. ICETREE also allows parse trees to be built from scratch, and it can be used as a viewer for complete analyses. (Quinn and Porter, 1996)

3. Methodology

There are many approaches to software development, and one of the main ones is “The Waterfall Model”. It defines a project as a set of stages: from problem definition to requirements analysis, design, implementation, testing and finally maintenance. However, each individual stage in the project must be completed before moving on to the next (Laudon and Laudon, 2002). The “Feedback Model” uses the same development stages as the “Waterfall Model” but allows earlier stages to be re-evaluated if problems arise in later stages. Therefore, the “Feedback Model” is the methodology that has been adopted for this project.

The first stages, namely problem definition and requirements analysis, were carried out during the background research and are described in sections 1 and 2 above. This section therefore describes the design of the project.

3.1 Corpus Design

3.1.1 Methods to be used

Firstly, it was decided that the pilot project would be fully Internet-based, that is, all of the texts would be taken from the Internet. The texts would be collected from the numerous Mauritian websites already available. On the basis of this approach, it was decided that ICE-Mauritius would be composed of the following genres, adapted from the ICE standard text categories.
Table 3: Text Categories to be adapted for ICE-Mauritius

  Spoken
    Dialogues
      Public: Broadcast Discussions, Broadcast Interviews, Parliamentary
    Monologues
      Unscripted: Commentaries, Unscripted Speeches, Legal Presentations
      Scripted: Broadcast News, Broadcast Talks
  Written
    Non-printed
      Student Writing: Student Essays, Exam Scripts
      Letters: Social Letters, Business Letters
    Printed
      Academic: Humanities, Social Sciences, Natural Sciences, Technology
      Popular: Humanities, Social Sciences, Natural Sciences, Technology
      Reportage: Press reports
      Instructional: Administrative Writing, Skills/hobbies
      Persuasive: Editorials
      Creative: Novels

Since the texts would be collected from the Internet only, some categories were removed because they would not be available online. The number of texts to collect for each category was not stated either, since it was difficult to know beforehand how many of those texts would be available online. The main method would be to collect as many texts as possible for any category, including categories not listed above, then to classify the texts accordingly, creating or removing categories where necessary.

For the pilot project, one to two percent of the corpus would be collected. However, the samples collected would have to follow the standard of 2,000 words per text, in order to total the one million words that the full corpus was required to reach. While collecting the material, the three main problems associated with the use of the Internet, as identified by Sharoff (2005), would have to be kept in mind:

1. It cannot be claimed that the material is representative and that there is a balance of text types
2. Search engines address the needs of information retrieval, rather than linguistic search
3. Search engines present search results in a way that also does not correspond to the needs of a linguist

3.1.2 Copyright Issues

Copyright was one of the important issues to consider for the collection of texts. The corpus could not be used publicly unless the owners of the sites granted permission to use the material. “Experience with the creation of the other ICE corpus has shown that, in general, it is quite difficult to obtain permission to use copyrighted material” (Meyer, 2002). This is because some people take months to reply while others do not reply at all. Therefore, extra time would have to be allocated for requesting permission to use the copyrighted materials, and extra texts would have to be collected in case permission was not granted for some of them.

The first stage in compiling ICE-Mauritius would be to identify some suitable websites and to obtain email addresses as well as postal addresses, telephone numbers and fax numbers. Two letters would be prepared: one explaining the purpose of the corpus, for the owners and authors to keep, and the other with a return slip for them to sign if they agreed to their websites being used.

3.1.3 Corpus Layout

“Organising corpus into a series of directories and subdirectories makes working with the corpus much easier and allows the corpus compiler to keep track of the progress being made on corpus as it is being created” (Meyer, 2002). Therefore, the corpus would be organised into directories and subdirectories according to the different text categories. For the proposed diagrammatic layout of the corpus, see Appendix C.

Each text would be assigned a number that designated the specific category in the corpus in which the sample might be included.
For instance, the text number LETT01 would denote the first sample collected for inclusion in the category “Letters”, while B-N01 would denote the first sample collected for inclusion in the category “Broadcast News”. This numbering system would allow the corpus compiler to keep easy records of where a text belonged in the corpus and how many samples had been collected for that part.

3.2 Capturing Text in Electronic Format

3.2.1 Computerising Speech

It was assumed that the spoken texts collected would be in digitised form, since they would be taken from the Internet. The software “Voice Walker 2.0” or “Sound-Scriber” mentioned above would be downloaded freely from the Internet to play back the samples of digitised speech. Since no other alternatives were available, speech would be transcribed manually. This process would take the longest time in the compilation of the corpus, and extra time should therefore be allowed for it.

3.2.2 Computerising written texts

Texts downloaded from the Internet were expected to contain as much HTML coding as text. Since deleting this coding manually would take a considerable amount of time and effort, the software “HTMASC” mentioned above would be used to strip the HTML coding from the text automatically. An ASCII text file with no coding was expected to be produced.

3.3 Corpus Annotation

3.3.1 Structural mark-up

The mark-up of the texts would be carried out by writing minimal encoding and pasting a header using a word processor.
The following components, adapted from the TEI-Header on the Humanities Text Initiative (HTI) website to the ICE standards, would be added to each text:

File Description <fileDesc>

  <fileDesc>
    <titleStmt>
      <title> </title>
      <author> </author>
      <respStmt><resp>compiled by</resp> <name>Dolly Koo</name></respStmt>
    </titleStmt>
    <publicationStmt>
      <publisher> </publisher>
      <pubPlace> </pubPlace>
      <date></date>
    </publicationStmt>
    <sourceDesc>
      <p>created in machine-readable form in http://mauritiustimes.com/040205mr.htm</p>
    </sourceDesc>
  </fileDesc>

Encoding Description <encodingDesc>

  <encodingDesc>
    <projectDesc>
      <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
    </projectDesc>
    <samplingDecl>
      <p>Whole text of 862 words copied from the site</p>
    </samplingDecl>
  </encodingDesc>

A Profile Description would also be added, similar to the one described in section 2.2.3 above.

3.3.2 Procedure for annotating the corpus

As mentioned earlier, the encoding would be done manually, since no program was developed to encode the corpus automatically. For each text, the following steps would be performed:

1. The text would be copied from the Internet and pasted into Microsoft Word. It would be saved as encoded text, choosing Unicode UTF-8 as recommended by Al-Sulaiti (2004), since some of the texts might contain French quotations with special characters.
2. The text would then be encoded with paragraph markers using the FIND/REPLACE option in the Edit menu: Find ^p, Replace </p>^p<p>, in the case of a normal paragraph.
3. After the paragraphs were marked, the adapted TEI-header would be added and the missing information would be filled in.
4. When the text was complete, it would be saved with its ID number as its name. For instance, the text with the ID number LETT01 from the “Letters” category would be saved as LETT01.txt in the “Letters” directory.
5. The text would then be renamed by changing the file extension from .txt to .xml.
6. Finally, to verify the XML file, the text could be opened in Internet Explorer.

4. Corpus Encoding

With the design laid out in section 3 in place, implementation of ICE-Mauritius was started. This section covers the encoding of the pilot project.

4.1 Collection of Texts

4.1.1 Search methods

Keeping in mind the three main problems of collecting texts from the Internet mentioned in section 3.1.1, the following search engines and key words were used:

  No. | Search engine | Key words
  1   | Google        | Mauritius, Mauritius articles/ books/ novels/ business letters/ press/ websites/ educational/ reports/ newspapers/ schools/ stories/ texts, Higher School Certificate/Mauritius exam papers
  2   | Yahoo         | Mauritius, Mauritius articles/ news/ books/ novels
  3   | MSN           | Mauritius, Mauritius articles/ books/ novels/ business letters/ press/ educational reports/ newspapers/ schools/ stories/ texts, Mauritius Higher School Certificate /exam papers

Table 4: Search engines & key words used to collect texts

Some of these searches proved very useful. For instance, searching for “Mauritius” in Google brought up some of the main Mauritian websites, such as the government pages, and other interesting websites containing the texts needed were found. However, with over 20 million results for “Mauritius”, it was difficult to look through all of them. The search had to be refined, and new key words such as “Mauritius newspaper” or “Mauritius schools” were typed in. The ‘Advanced Search’ option and the ‘Preference’ option in Google were also used, but they did not prove very useful. Key words like “Mauritius business letters” matched over 200,000 sites, but none were related to the corpus or written by Mauritian people. The same process was carried out with the search engines Yahoo and MSN.
After a few searches with Yahoo, it was found that it did not yield many results and all the sites it referred to had already been visited through Google. With MSN, more results were obtained and some new materials were collected, but as with Yahoo, many of the sites had already been displayed in Google. Two of the most useful Mauritian websites are listed in Table 5 below:

Website: http://www.servihoo.com/
Description: Website owned by Telecom Plus, the only telephone provider in Mauritius. It has links to other websites such as local newspapers, radio and television, and it contains articles ranging from culture to business to sports.

Website: http://www.mauritiustopsites.com/topsiteshtml/index157.shtml
Description: Website owned by Internet Communication Services Mauritius. It has a list of the 946 most popular websites from the country.

Table 5: Most popular Mauritian websites

4.1.2 Text Collection

Numerous websites related to Mauritius had been searched, but careful attention had to be paid to the author and the publisher. Many of the articles found were not written by Mauritian people. The first few texts took a considerable amount of time to obtain, but once the useful sites were known, the texts were collected more quickly. Due to the lack of time, after only four days of thorough searching with the above-mentioned engines, written texts from fifty websites were collected and the number of words totalled 51,960, comprising 5 percent of the actual size of the corpus (exceeding the target of 1-2% for the pilot project). The author, publisher, place of publication, date and, where available, contact details of the author were also noted for each text.

Some of the texts, such as press reports, were easily obtained from the various newspaper websites. However, letters and student writing proved to be very difficult to find: none of the student essays or exam scripts were available online.
It was also important to note that shorter texts were easier to find than longer texts of 2,000 words each. Appendix D shows the full list of texts and the details that were collected.

Not surprisingly, spoken text was impossible to obtain from the Internet. Only two websites had spoken texts, namely the Mauritius Broadcasting Corporation (http://mbc.intnet.mu/) and TopFM (http://www.topfmradio.com/index.php). The Mauritius Broadcasting Corporation provided live TV news transmission, but only the French version was available online. Both sites provided live radio transmission, but most of the talks were in French too, and saving the spoken texts proved difficult and infeasible given the short amount of time available to compile the pilot project. Hence, no spoken texts were collected. The solution proposed by Sharoff (2005), that is, to increase the amount of ephemera (leaflets, junk mail and typed material) and correspondence, could be attempted in the follow-up project to compensate for the lack of spoken texts and to make the project more balanced.

Alongside collecting the texts, a database file was created in Microsoft Excel (Appendix D) which stored the type, ID number, source, title, author, publisher, place and year of publication and the number of words of each text. This database file was important for the organisation of the texts in the corpus and for counting the words automatically.

4.1.3 Written Text Classification

It proved difficult to decide to which category a text belonged. For example, there was confusion over whether some of the popular printed texts came from press reports or other magazines, and therefore those texts were classified as popular printed texts based only on best judgement. Sinclair (1996) had examined in detail the problems of text classification and had reported that corpus design made use of some internal and external factors to decide on the text category.
He pointed out that many text classifications were based on topic as it was represented in newspapers and magazines.

The classification of the written texts of ICE-Mauritius was based on the ICE standard classifications, but with some amendments due to the lack of texts to cover all the categories. The texts collected from the 50 websites were grouped and classified differently. Those exceeding 1,500 words were considered as whole texts, while those below 1,500 words were grouped together as one text, up to a total of around 2,000 words. However, the texts that were grouped together had to be part of the same initial category. This resulted in 30 final texts ready to be included in the corpus.

The spoken text category was removed completely for the pilot project, even though this resulted in an unbalanced corpus. Much more time would have to be allocated to collecting and encoding the spoken texts for the actual ICE-Mauritius. Table 6 below shows the text categories which were derived from the sources, the number of texts and the total number of words in each category.

Table 6: Number of texts and number of words in each category

Text Categories (Written: Non-printed and Printed)

Category         Description               No. of Texts   No. of Words
Student Writing  Summary of project        1              762
Letters          School/Business/Social    2              3352
Speeches         Formal                    3              5421
Academic         Various Topics            2              3354
Popular          Various Topics            3              5528
Reportage        Press reports             9              16885
Instructional    Administrative/hobbies    5              7897
Persuasive       Editorials                2              3226
Creative         Novels                    3              5735

4.1.4 Permission Letters

As mentioned above, copyright issues were one of the most important aspects to consider for the collection of texts. The authors’ details were recorded alongside the texts when the Internet was searched. However, it was noticed that many texts did not contain any details of their owner.
When compiling the actual ICE-Mauritius, those texts should be rejected from the beginning since they could not be used without permission, but for the pilot project all of the texts collected were used even if no permission was obtained, since they would be kept only temporarily and would not be made available to the public.

Two letters prepared by Al-Sulaiti (2004) to request permission for the use of texts available online were used and sent to the authors of the texts that had been collected. One explained the purpose of the corpus and was for the owners and authors to keep, and the other had a return slip for them to sign if they agreed for their websites to be used. Samples of the two letters can be found in Appendix E.

It took one full day to send twenty-three of these letters out by email. Due to the lack of time, they were sent only by email and replies were expected mostly by email, since it was estimated to take two weeks for a letter to reach Mauritius and another two weeks to get a reply if the author sent it back straight away by post. Out of the twenty-three letters, four were not delivered because the address available on the Internet was wrong. To date, three authors had given their permission and were happy to help, and one of them even asked for comments on his novel. However, one did not agree and asked for formal support from the University of Leeds and a complete CV. The outcomes showed that much more time and effort would be needed to obtain permissions for the follow-up project. Table 7 shows the list of addresses of resources for which copyright permission had been received.
Source: http://mauritiustimes.com/040205mr.htm
Contact: Madhukar Ramlallah, mtimes@intnet.mu

Source: http://pages.intnet.mu/rajbalkeehomepage/hd-complete.htm
Contact: Raj Balkee, rajbalkee@intnet.mu

Source: http://ile-maurice.tripod.com/rougpoisal.htm
Contact: Madeleine Philippe, madeleine@cjp.net

Table 7: Sources with copyright permission

4.1.5 Layout of the Pilot Corpus

The design of the corpus had evolved slightly from the original plan since it was found that it would be useful to have one folder for the marked-up corpus and one for a raw corpus. The latter contained Word documents of the actual text together with a table which included details such as title, author, publisher, place of publication, date, source, email of the author and the number of words, which had been collected and used in the header. The raw corpus folder also contained the HTML files of the texts taken from the source, but with some of the extra coding stripped off manually. Even though building up this separate raw corpus had taken some time, it made the annotation process much quicker and easier and hence would not affect the overall length of the project.

Each small text was marked up individually before being grouped together and was stored within a sub-folder in the corresponding category in the main marked-up corpus folder. Both the raw corpus and the marked-up corpus folders were divided into the following sub-folders for the different categories: Academic, Editorial, Instructional, Letter, Novel, Popular, Reportage, Speech and Student Writing. The texts in the different folders were cross-referenced by name.

4.2 Corpus Annotation

4.2.1 TEI-Header

Some amendments had to be made to the adapted TEI-header since the information required was not available from the Internet and it was impossible to obtain it in such a short time.
The information in the Profile Description which was more concerned with the author’s characteristics, such as age, education, occupation and first language, had to be removed. The new Profile Description that was used for the pilot project is shown below; for a full template of the header, please refer to Appendix F.

Profile Description <profileDesc>

<profileDesc>
  <creation>
    <date value="2005-02">Feb 2005 </date>
    <rs type="city">Pointe Aux Sables, Mauritius </rs>
  </creation>
  <langUsage>English</langUsage>
  <textClass>
    <textDesc n=" ">
      <channel mode="w">print; written</channel>
    </textDesc>
    <particDesc>
      <person id="P1" sex=" "> </person>
    </particDesc>
  </textClass>
</profileDesc>

However, even after careful consideration of which fields to include in the header, some of the information was still missing when the texts were encoded. For many texts, the author, the publisher or the date of publication was not available online. Therefore, those fields were filled in with “unknown”. In fact, among the twenty-nine texts collected, only six were complete. Obtaining this information would be another task requiring extra effort and time in the follow-up project.

4.2.2 Texts Encoding

During the text encoding stage the time taken for processing was calculated. Using the procedure described previously, the time taken to go through the six steps was approximately 20 minutes, depending on the information available to fill in the header but regardless of the length of the text, since the paragraphing was done automatically. If further files from the same site were collected, the header could be reused with some minor adjustments to fit the new text. This took less time than the first file, between 5 and 10 minutes. A sample of a raw text can be found in Appendix G while a sample of the encoded text can be found in Appendix H.
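The six manual steps could in principle be scripted. The following is a minimal Python sketch, not the method used in the pilot project (which was carried out by hand in Microsoft Word). It wraps each paragraph in <p> tags, escapes the characters that XML reserves, and checks that the result is well-formed with the standard library parser instead of opening the file in a browser. The function names, the <document> and <body> wrapper elements and the sample text are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def mark_paragraphs(raw_text):
    """Wrap every non-empty line in <p>...</p>, escaping &, < and >."""
    paragraphs = [line.strip() for line in raw_text.splitlines() if line.strip()]
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


def build_document(header_xml, raw_text):
    """Join an already filled-in header with the marked-up body.

    The <document> and <body> element names are hypothetical wrappers,
    not taken from the ICE or TEI schemes.
    """
    body = mark_paragraphs(raw_text)
    return f"<document>\n{header_xml}\n<body>\n{body}\n</body>\n</document>"


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical header and raw text, for illustration only.
header = "<fileDesc><titleStmt><title>unknown</title></titleStmt></fileDesc>"
raw = "Dear Sir,\n\nSugar & tea prices rose.\n"

document = build_document(header, raw)
print(is_well_formed(document))
```

Because the paragraph text is escaped before it is wrapped, a bare ‘&’ in the source becomes ‘&amp;’ and does not break the parse; the same check run on unescaped text would report the file as not well-formed.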
When encoding the texts, some problems did surface with the viewing of the XML files. Some of the common error messages that were displayed when the XML files were opened with Internet Explorer are shown in Table 8.

Table 8: Errors during encoding of texts

XML File      Error Message                                                  Figure
LETT02.xml    ‘whitespace not allowed’                                       Figure 7
REP05.xml     ‘A semi colon character was expected’                          Figure 8
NOV01.xml     ‘End tag “P” does not match the start tag “h”’                 Figure 9

Figure 7: Error as shown when “LETT02.xml” was opened in Internet Explorer

Figure 8: Error as shown when “REP05.xml” was opened in Internet Explorer

Figure 9: Error as shown when “NOV01.xml” was opened in Internet Explorer

When those problems occurred, the file was opened in a program called UniRed, which is freeware available at http://sourceforge.net/projects/unired. UniRed is a Unicode plain text editor for Windows and it supports many character sets, including UTF-8, and mark-up languages such as XML and HTML. If an error exists in the XML file, the program identifies the error by highlighting it in red. A green highlight means that the code is correct: it has an opening tag and a corresponding closing tag.

Figure 10: Screenshot of “LETT02.xml” in UniRed editor

A screenshot of the error from “LETT02.xml” in the UniRed editor is shown in Figure 10 above. The XML tag that appeared red in the middle of the screenshot (the shaded ‘&’ character) meant that the code was invalid. The error was related to some unusual characters or signs which needed to be modified to be accepted by XML. In this example, the ‘&’ sign needed to be written as ‘and’ or as ‘&amp;’.

Figure 11: Screenshot of “REP05.xml” in UniRed editor

A different error message is displayed for “REP05.xml”.
However, after the file was opened in UniRed (Figure 11), it was noticed that the error was related to the same unusual sign, the ‘&’ sign. The only difference was that the error occurred in the URL address of the source (http://www.businessmag.mu/displayNewsContent.asp?NID=5747&CID=30), and here it could only be changed to ‘&amp;’, since replacing it with ‘and’ would alter the address.

However, in “NOV01.xml”, a completely different error was spotted. The file could not be parsed due to a missing closing tag. In UniRed (Figure 12), it was found that the opening tag <h> in line 5 did not have a matching closing tag </h>. This error was shown by highlighting in red the next opening tag (the shaded “<” character).

Figure 12: Screenshot of “NOV01.xml” in UniRed editor

Since using UniRed was not only faster but also guaranteed that the files were correctly saved and could be viewed in the browser with no problems, this method was tested and compared with the former method, namely, creating the text in Microsoft Word and then converting it to XML. The UniRed method proved to be more efficient, taking only around 5 minutes.

After the errors had been corrected, the three files mentioned above, “LETT02.xml”, “REP05.xml” and “NOV01.xml”, should look as shown in Figures 13, 14 and 15 respectively when opened again in Internet Explorer. For a full display of how a file should look, refer to Appendix H for another example.

Figure 13: Expected output for “LETT02.xml”

Figure 14: Expected output for “REP05.xml”

Figure 15: Expected output for “NOV01.xml”

5. The Proposal

This section is also related to the implementation stage, but instead of a software implementation, it describes how the possible extension laid out in section 1.3 was achieved.
After the compilation of the pilot ICE-Mauritius project, it was possible to write a work plan and a proposal for a follow-up project to develop a full-scale ICE-Mauritius corpus and extend the methodology to a much more ambitious multinational ICE corpus.

5.1 Funding Opportunities

5.1.1 Research at University of Leeds, School of Computing

The School of Computing web site (University of Leeds, 2004) states that the School has been ‘awarded a Grade 5 in the 2001 Research Assessment Exercise (RAE), confirming the School's status as a leading research institute for computing’. The research activity within the School is grouped into five categories, namely, Computer Vision and Language, Knowledge Representation and Reasoning, Scientific Computing and Visualization, Theoretical Computer Science and Informatics. The School may offer scholarships, but most research staff and students who need grants for their research have to apply to Research Councils, namely, to the Engineering and Physical Sciences Research Council (EPSRC). Therefore, to develop the full-scale ICE-Mauritius, an application to the EPSRC will be made. In order to fill in the application form, further research on the requirements of the EPSRC has been carried out and is briefly described below.

5.1.2 Introduction to EPSRC - The Engineering and Physical Sciences Research Council

The Engineering and Physical Sciences Research Council (EPSRC, 2004) is ‘the UK Government's leading funding agency for research and training in engineering and the physical sciences’. The EPSRC operates mostly by funding research projects in universities and other research organisations. The funds are intended to meet the direct costs of the research project, together with a contribution towards the indirect costs involved (EPSRC Funding Guide, 2004). The majority of funding from the EPSRC is provided through the Responsive Mode, but other funding routes are available, for example Fellowships and others.
‘Calls for Proposal’ are also available, where strategic opportunities are announced and researchers can choose from the list provided. There is no minimum or maximum funding, and no minimum or maximum period (EPSRC Funding Guide, 2004).

The EPSRC funds ‘a dynamic and evolving research portfolio, extending from fundamental research in mathematics, chemistry, computer science and physics to more applied topics in engineering and technology’ (EPSRC, 2004). Many of the EPSRC research activities are co-funded between programmes to encourage multidisciplinary collaborations, since major breakthroughs often arise when researchers from related disciplines work together.

5.1.3 Eligibility of Investigators

Principal investigators should be permanent employees of an eligible research organisation (all UK universities and similar research organisations are eligible organisations). Fixed-term employees may be eligible provided that the organisation will give all the support normal for a permanent member of staff and that there is no conflict of interest between the investigator’s obligations to the EPSRC and the other organisation (EPSRC Funding Guide, 2004).

‘Research Assistant can be identified as Co-Investigators if they have made a substantial contribution to the development of the application and will be closely involved with the project, if funded. Then the application can seek funds for the assistant’s salary for the duration of the project’ (EPSRC Funding Guide, 2004). A research assistant cannot be the principal investigator. Moreover, research proposals will not be considered from an applicant who was the principal investigator of another grant and who has not yet finished producing the Final Report.

5.1.4 Research Opportunities

The majority of funding from the EPSRC is provided through the Responsive Mode, where the research idea is determined by the applicant and where proposals can be submitted at any time.
The main criterion against which the proposal is assessed is its ‘intrinsic engineering or scientific excellence’ (EPSRC Funding Guide, 2004) as determined by peer review. The EPSRC especially encourages research proposals that are adventurous, with new concepts and techniques.

First Grant Scheme

The First Grant Scheme is used to assist individuals at the beginning of their academic careers by offering them a research grant. To be eligible for the First Grant Scheme, candidates must have been appointed to their first academic lecturing appointment in a UK university within the previous 24 months and should be within ten years of completing their PhD. Candidates wholly employed as research fellows are not eligible to apply (EPSRC Funding Guide, 2004). The scheme provides up to £120,000 of support. Proposals which have received two or more strong references will be considered by a peer review panel along with the other First Grant applications. First Grant proposals will not be considered against other types of proposals at the same time.

5.1.5 How to Apply

Since 31 March 2005, applications for research grants can only be made via an electronic form through the Je-S (Joint Electronic Submission) system, and each application should be accompanied by a self-contained ‘Case for Support’. The ‘Case for Support’ comprises the following (EPSRC Beginners’ Guide, 2004):

• Previous track records (2 sides A4)
• Description of the proposed research and context (purpose, background, project, resources, applications, collaboration) (6 sides A4)
• Diagrammatic work plan (1 side A4)
• Annexes (CVs, references, letters of support, equipment quotes, illustrations and named research assistants)

Good applications contain a ‘Case for Support’ which is clear, concise and uncluttered with technical jargon.
The main criterion to determine the grade assigned to any grant proposal will be its scientific quality, but ‘viability and planning, cost-effectiveness and dissemination plans can be taken into account’ (EPSRC Mock Panel Guidance Notes, 2004). In addition, for a First Grant proposal the applicant’s own plans for developing their research career and the commitment of the university to career development may be considered.

5.2 Writing up the Proposal

5.2.1 Original Idea

The development of the pilot project up to 5 percent of the actual corpus had shown that a full-scale ICE-Mauritius was feasible just by using the Internet to collect the texts. Therefore, a proposal was drafted in accordance with the EPSRC requirements with a view to compiling a high-standard application which could actually be sent for funding.

First, the Je-SRP1 (EPSRC) application form was downloaded from the EPSRC website and filled in. However, many sections of the form could not be filled in until a full detailed plan of how the pilot project had been developed was written. For instance, sections N (Travel and Subsistence) and O (Consumables) were difficult to fill in without knowing the actual tools and stages needed for the project. Moreover, sections such as J (Objectives) and K (Summary) were not of the best quality when written before more consideration was given to the outline and plan of the project. Therefore, it was decided to begin by writing the Case for Support.

Writing the Case for Support was not an easy task since it had to be clear, concise and attractive. Many details had to be stated in the Case for Support: how the corpus would be collected and annotated, its standards, the tools and staff needed, and the length of the project.
To be able to provide these details, and in order to extrapolate how much time and effort would be needed to collect the full corpus, further research and calculations were made on the process and development of the pilot project. The number of words and the time taken for the collection of texts and for the texts that had been edited and marked up were used to estimate lower and upper bounds of the time and person-months needed for the full corpus. From these estimates, the initial research work plan to collect a one-million-word corpus of Mauritian English was then organised into seven activity streams which would take up to three years to be completed by one postdoctoral research fellow.

As mentioned previously, evidence from the pilot project showed that with this Internet collection technique, the corpus would contain less than one million words due to the limited set of text categories available on the World Wide Web, and this would also result in an unbalanced corpus. One way around this problem was to collect more of the texts that were available to compensate for the missing ones. Another solution was to expand the corpus in a different dimension, and this is explained in the next section.

5.2.2 Expansion of Corpus Design

To compensate for the small number of texts and for the unbalanced text categories, it was decided instead to expand the corpus to include other types of English from other English-speaking countries. This would also result in a more ambitious and adventurous project, which are the characteristics that the EPSRC is looking for. With this new objective for the proposal, more research was needed to find other countries where either English is the official language or where English is one of the main spoken languages.
Twenty countries were chosen to form part of the corpus and they are as follows: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda, Zambia and Zimbabwe. As in Mauritius, English in most of these countries had in one way or another been highly influenced by other languages, either brought by ancestors or derived from their culture. The corpus would hence allow an interesting and useful analysis of the variation in English across the nations.

Since collecting a fully balanced one million words for each country would be impossible, it was decided that the proposal would only target half a million words for each country, resulting in an “ICE-lite”. The term “lite” was borrowed from other simplified projects such as “TEI-lite”, which meant a simpler version of TEI, a standard XML mark-up convention for text corpora (TEI, 2005). To provide compatibility and an enhanced comparison with the other existing projects, a “lite” version of the 20 teams already in ICE (mentioned in section 2.1.3) would also be included in the corpus.

For each country, numerous websites were easily accessible via the World Wide Web, and different text categories were available. Therefore, the corpus would aim to contain approximately 20 million words taken only from the Internet. This meant that more staff would be required and a new work plan was needed. New ambitious estimates were then calculated. This was done by taking the amount of time taken to collect (20 minutes on average per text) and annotate (20 to 30 minutes per text) the thirty texts obtained (figure given in section 4.1.3 above), multiplying by 250 texts to obtain the estimates for one country and then multiplying the result by 40 for the whole ICE-lite corpus.
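The scaling described above can be written out as a short calculation. The Python sketch below is illustrative only: it uses the per-text times, the 250 texts per country and the 40 countries quoted in this section, covers collection and annotation time only, and is not the calculation that appears in the Case for Support. The variable and function names are assumptions.

```python
COLLECT_MIN = 20          # average minutes to collect one text (pilot figure)
ANNOTATE_MIN = (20, 30)   # lower and upper minutes to annotate one text
TEXTS_PER_COUNTRY = 250   # texts targeted for each country
COUNTRIES = 40            # 20 new countries plus the 20 existing ICE teams


def hours_for(texts):
    """Return (lower, upper) bounds in hours to collect and annotate `texts` texts."""
    lower = texts * (COLLECT_MIN + ANNOTATE_MIN[0]) / 60
    upper = texts * (COLLECT_MIN + ANNOTATE_MIN[1]) / 60
    return lower, upper


per_country = hours_for(TEXTS_PER_COUNTRY)
whole_corpus = hours_for(TEXTS_PER_COUNTRY * COUNTRIES)

print(f"One country:  {per_country[0]:.0f} to {per_country[1]:.0f} hours")
print(f"Whole corpus: {whole_corpus[0]:.0f} to {whole_corpus[1]:.0f} hours")
```

Under these figures one country needs roughly 167 to 208 hours and the whole corpus roughly 6,667 to 8,333 hours. Converting hours into person-months would need a further assumption about working hours per month, which is not stated here.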
The overall expected completion time of the project was kept to three years, but instead of only one research fellow, two more would be needed. The new research work plan for the ICE-lite was then organised into eight activities as listed below:

WP1: Collection of Spoken and Written Text of English
WP2: Transcription
WP3: Textual Mark-up
WP4: Word-class tagging
WP5: Syntactic parsing
WP6: Evaluation
WP7: Comparison across dialects
WP8: Dissemination for Exploitation

The expanded project would thus be beneficial to the government and the educational system in each of the twenty countries mentioned above and to the existing ICE teams. A comprehensive description of the different types of English could be obtained from the corpus, and therefore each country would be able to develop its own reference guides to usage, dictionaries and other teaching materials. This could help both schools and universities to adapt their methods of teaching, and especially the structure in which English was taught and spoken, to a better standard. The comparison across the dialects of English to find any striking similarities or differences would be useful for further research and teaching methods in each country, and would also benefit those people who wanted to travel to or trade with other English-speaking countries, since the comparison would provide a useful insight into how they would have to adapt their language.

When the corpus was released, it would also be beneficial to other research or academic institutions across the world. It could be used for comparison or for further research by the existing corpora or other potential corpora. Longer-term impacts of the work to be done included:

• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures of the 40 countries across the world.

5.2.3 Writing Up Proposal

Once the estimates were calculated and the work plan designed, the Case for Support was written more easily, and it also became much easier to fill in the application form since the figures were readily available. The only difficulty was to divide the work among the three research fellows to make the completion of the work possible within three years. This was done by using only the lower limits of the estimates and therefore resulted in quite a tight schedule. Other estimates were calculated concerning costs of travelling, consumables, etc. Details about the cost of staff should be calculated through the COSTA system of the University at http://www.leeds.ac.uk/rsu/COSTA.htm, but because access is restricted for students, the estimated costs were taken from another proposal by Atwell and Al-Sulaiti (2005).

It is important to note that one paper application (in Word format) allows details of only two researchers to be filled in. Therefore, to make the proposal complete, a second application was needed to add the details of the third researcher. However, due to the space limit of this report, the second application form could not be added in the appendices, and sections 2 and 3, which requested personal information about the referees and the investigators, were also omitted. Copies of the first draft of the Case for Support (which was sent for evaluation) and the first application form are shown in Appendices I and J respectively.

6. Evaluation

To measure the success of a project, it needs to be evaluated against a number of relevant criteria. For this particular project, the criteria that were set up are:

• Product: Evaluates the design and compilation of the final product.
• Minimum Requirements: Evaluates what minimum requirements are met.
• Project Stages: Evaluates the methodology used to produce the final product.
• Planning and Schedule: Evaluates the planning of the project from start to finish.

6.1 Product

The product was evaluated by three subject experts, namely, Eric Atwell, Gerald Nelson and Serge Sharoff. Eric Atwell was the supervisor of the project. His evaluation is not discussed in this report since he provided feedback throughout the course of the whole project. Gerald Nelson, from UCL, is the coordinator of the International Corpus of English and has been directly involved in the development of ICE-GB, the British component of ICE. Serge Sharoff, from the Centre for Translation Studies of Leeds University, has been involved in several corpus developments, such as a Russian corpus and a Chinese corpus, which he has collected only through the Internet.

Evaluation from Gerald Nelson:

Both the proposal and part of the pilot project were sent to Gerald Nelson and his first explicit comment was “May I say, first of all, that I am very impressed by this proposal. It shows an amazing knowledge of corpus linguistics, and of issues in world Englishes.” Therefore it can be said that both the proposal and the pilot project met the requirements and were of a good standard.

In his feedback, Gerald Nelson also implicitly suggested some improvements that could be made before the proposal is sent to the EPSRC and some issues that should be addressed concerning the ICE-lite if the funding is obtained.

Issues about the ICE-lite are:

• The text files should be named according to the ICE coding scheme, not as LETT01, etc. as described above.
• The TEI headers should be stored externally as separate files.
• The details in the headers should follow the ICE scheme.
• The permission letters could cause problems for other ICE teams, which are strictly non-commercial, whereas the permission letters sent stated “We may also want to use the text(s) for developing electronic products such as translators and dictionaries”.
• The distribution method of the ICE-lite.

Gerald Nelson agreed in his emails to send full details of the ICE filename and header conventions but, given his busy schedule, he was not able to do so before the report was due. No improvement could therefore be made to the pilot project. Also, for the purpose of this project, it was decided to set aside the issues of the non-commercial corpus and of distribution until funding is obtained.

Improvements to the proposal include:

• Gerald Nelson suggested that the parsing should be dropped altogether, since syntactic parsing of the whole corpus is unrealistic given the timescale involved. For ICE-GB, it took about three years to parse one million words, with six or seven part-timers working on it. He also suggested that the aim should be to produce a fully checked POS-tagged corpus and to treat the parsing as another follow-up project.
• Changes to the wording of the proposal, such as:
   o Page 1, paragraph 1: "where English is the main language" to be changed to "where English is the first language or second official language".
   o Page 2, line 1: Delete "Australia" as it is not yet available.
   o Page 2, line 5: "and other freely available sources": more details should be given.
   o Page 4, line 3: "a software" should be changed to "a program".
   o Page 5, Staff: It is unlikely that post-doctoral researchers could be recruited to work on this project. Therefore "post-doctoral" should be changed to "post-graduate".
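
One of the issues listed above, storing the TEI headers externally as separate files, could be handled with a small script. The sketch below is illustrative only and was not part of the pilot project: it assumes that each pilot text is a single file whose header is wrapped in a <teiHeader> element, and the function names and the .hdr suffix are hypothetical, since the actual ICE filename and header conventions were not available for this report.

```python
import re
from pathlib import Path

# Assumption: the header of each text is wrapped in <teiHeader>...</teiHeader>.
# The report does not show the real file layout, so this pattern is a guess.
HEADER_RE = re.compile(r"<teiHeader\b.*?</teiHeader>\s*", re.DOTALL)


def split_header(document: str) -> tuple[str, str]:
    """Return (header, body); header is '' when no TEI header is found."""
    match = HEADER_RE.search(document)
    if match is None:
        return "", document
    header = match.group(0).strip()
    # Keep everything before and after the header as the text body.
    body = document[:match.start()] + document[match.end():]
    return header, body


def externalise(path: Path) -> None:
    """Move the header of one text file into a sibling file.

    The '.hdr' suffix is a hypothetical naming choice, not an ICE convention.
    """
    header, body = split_header(path.read_text(encoding="utf-8"))
    if header:
        path.with_suffix(".hdr").write_text(header + "\n", encoding="utf-8")
        path.write_text(body, encoding="utf-8")
```

A regular expression is adequate here only under the stated assumption of one non-nested header per file; a text with a different layout would need an XML-aware approach instead.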
Despite the small changes needed, and based on the comment Gerald Nelson added at the end of his feedback, “As I said, this is a very impressive proposal, and you can count on my full support (and the Survey's) if it gets funded”, the pilot project and the proposal proved to be successful and to have great potential for a follow-up project.

Evaluation from Serge Sharoff:

After reading the proposal, Serge Sharoff signalled his approval implicitly by saying “I read the proposal with interest”. He also showed that he wanted to participate, and that he considered the proposal feasible and worth following up, by commenting on possible extensions and on how he could contribute to the project. He proposed to contribute in two ways:

• In WP6 (Evaluation), he proposed to add a lexical comparison of the new ICE-lite against the British National Corpus (similar to what he had done in one of his papers on Internet corpora).
• In WP8 (Dissemination), he proposed to disseminate data through his web interface, which he referred to as the Leeds CQP interface. There is no publication on it yet, but he is more than willing to write a paper on it if the project goes ahead.

Another useful suggestion from Serge Sharoff was to use Google to estimate the size of the source texts available for each country. He had tried finding English texts from Mauritius by typing “allintext: that OR in OR for site:.mu” as the Google query, which returned 125,000 English pages, corresponding to about 250 million words (if an average Internet page is about 2000 words). This method could therefore be used to estimate the size of the texts available online for each country in the ICE-lite project.

He also raised an important issue concerning the collection of the texts. According to him, it would be difficult to know whether a text was written by someone from a specific country.
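
Returning briefly to the size estimate quoted above: it is simple arithmetic and could be repeated for any other country. The following is a minimal sketch under stated assumptions, not something that was run for the pilot project. The function names are hypothetical, the hit count has to be read from the search engine by hand, and the 2000-word average page length is the assumption given above.

```python
def build_query(cc_tld: str) -> str:
    """Build the query pattern quoted above for a country-code domain.

    The pattern is copied from the query used for Mauritius (site:.mu);
    applying it to other domains is an assumption of this sketch.
    """
    return f"allintext: that OR in OR for site:.{cc_tld}"


def estimate_words(pages: int, words_per_page: int = 2000) -> int:
    """Rough corpus-size estimate: pages found times an assumed page length."""
    return pages * words_per_page


mauritius_pages = 125_000  # hit count reported for the site:.mu query
print(build_query("mu"))
print(f"{estimate_words(mauritius_pages):,} words")  # 250,000,000 words
```

The result depends entirely on the assumed average page length, so it should be read as an order of magnitude rather than a measurement.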
That is, one could not be sure that a text obtained from a Gambian website, for instance, was actually written by someone born in Gambia. This problem was not encountered in the pilot project: coming from Mauritius, I could easily tell a text written by a Mauritian citizen from one that was not, either by looking at the name of the author or by looking at the structure and the words used, since Mauritian English has its own particularities and often includes words from other dialects. However, this could be a problem for a full-scale project, and the issue would need further investigation if the proposal were funded. Due to lack of time, it could not be resolved in this project.

6.2 Minimum Requirements

The minimum requirements were:

• Develop a small-scale prototype of the Mauritian Corpus of English.
Section 3 described how the prototype had been designed and planned, with details of the different tools and techniques available. The development of the prototype itself was detailed in sections 4.1 and 4.2. As the prototype was being developed, some amendments to the original plan were needed. Overall, the prototype makes up 5 percent of the full-scale ICE-Mauritius corpus.

• Survey of computer technologies for corpus development and processing.
The different technologies available for corpus development and processing had been mentioned throughout the report; in particular, the different taggers and parsing systems available worldwide were outlined in section 2.3, while the techniques used specifically for ICE were described in section 2.2.

The possible extension was:

• Work plan for a follow-up project to develop a full-scale ICE-Mauritius corpus.
To be able to build a work plan for a follow-up project, the pilot project had to be well understood and documented (which formed part of sections 4.1 and 4.2 above).
Also, research into the relevant Research Council, namely the Engineering and Physical Sciences Research Council (EPSRC), had to be carried out in order to know the requirements and how to apply for grants. These requirements were described in section 5.1, while the steps taken in writing the application form and the proposal were described in section 5.2.

6.3 Project Stages

The overall quality of the project was also assessed by applying the following criteria to each stage of the project, to see whether the stages were appropriate for solving the initial problem and relevant to the development of the solution. The criteria were:

• Was the background research of a suitable standard, did it help to understand the problem and did it help to gather the learning requirements?
• Was the chosen methodology suitable for the project and was it adhered to?
• Were the requirements gathered effectively and did the final product successfully meet these requirements?
• Were the appropriate technologies used in the creation of the pilot corpus?
• Did the project solve the initial problem, was the final prototype of sufficient standard to prove the feasibility of a full-scale project, and was the proposal of sufficient standard to send to the EPSRC for funding?

Background research

The background research helped to fully understand the problem and therefore what the project should actually achieve. It gave an insight into the emergence of corpora and their increasing use in teaching and research. Research on ICE showed that there are only a small number of existing corpora and that many English-speaking countries could benefit from the compilation of a corpus of their English. Findings from the ICE website and from books on corpora were then used to design and set the standards for ICE-Mauritius. In addition, the different techniques available were researched to facilitate the collection and annotation of the pilot corpus.

Methodology

The most significant problem encountered in the course of this project was the need to modify the aims and requirements of the project at the beginning of the second semester. This also meant changing the work plan and methodology. The “Feedback Model” used for this project, as described in section 3, proved to be a good choice throughout. Many changes were made to the initial design after flaws became apparent in the encoding phase of the project. The following steps were taken during the development:

• First, the problem was analysed; that is, the need for a Mauritian corpus was identified (section 2).
• Then a system study was carried out, and the findings showed that collecting a corpus is costly and time-consuming and that using the Internet would be a solution to the problem (sections 1 and 2).
• The pilot project was designed next, as explained in section 3.
• In the following stage, the corpus was collected and annotated (section 4) and the proposal written (section 5).
• As the collection and annotation were carried out, it was found that many changes to the design were needed (sections 4 and 5).
• Finally, the pilot project was evaluated as described in section 6.1.

Corpus requirements

The initial requirements for ICE-Mauritius were gathered from the ICE website and other ICE-related books, which contain almost everything that needs to be included in an ICE sub-corpus. However, some more detailed requirements, such as the naming scheme for each text or the minimum amount of information to be included in the header, were not specified. As mentioned above, Gerald Nelson from UCL agreed to send those details during the Easter break but was not able to do so. Therefore, only the basic requirements from the official ICE website were applied to the pilot ICE-Mauritius.
In addition to this, annotation requirements were also obtained through the background research into the different technologies available. These were general requirements that any corpus should meet and were not related to ICE.

The prototype was evaluated by the people mentioned in the section above to see whether the initial design was adequate, as well as to provide additional feedback. As a result, they agreed that the pilot project did meet the basic requirements of ICE.

In relation to the proposal, the requirements were taken directly from the EPSRC application guide. According to the feedback obtained, the proposal did meet the requirements of the EPSRC and hence constituted a potential application for a follow-up project.

Technologies

Apart from Microsoft Word, which was used to collect and annotate the corpus manually, other technologies and tools were discussed throughout the report. Programs such as HTMASC could facilitate stripping the HTML coding from the texts to produce ASCII text files, while UniRed was used to provide faster and more error-free compilation and saving of the marked-up texts. However, other corpus-specific tools such as ICECUP or ICETREE could not be used and tested since they are not freely and easily available.

Initial problem

The initial problem identified the need for an ICE-Mauritius. Developing a full-scale ICE project would have been impossible within this project. This project therefore concentrated on developing a prototype of ICE-Mauritius, investigating data sources, instigating data collection, and looking at the different technologies available in order to investigate the requirements and feasibility of a larger-scale follow-on project. The pilot project, together with the feedback obtained, showed that a full-scale ICE-Mauritius was feasible.
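
To illustrate the HTML-stripping step mentioned under Technologies, a minimal sketch using the html.parser module from the Python standard library is given below. It is not HTMASC and does not claim to reproduce its behaviour; the class and function names are hypothetical, and the list of block-level tags is an assumption chosen to keep words from adjacent headings and paragraphs apart.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so that neighbouring blocks do not run
# together. This list is an assumption of the sketch, not an exhaustive one.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse the runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML page as a single plain string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_html on a downloaded page returns its visible text with the tags removed and whitespace normalised; the result would still need to be checked by hand before any ICE markup is added.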
The proposal nevertheless described a more ambitious follow-on project which, according to the feedback obtained, was a plausible application to the EPSRC for funding.

6.4 Planning and Schedule

Up until the mid-project report, the schedule was followed very closely and everything went according to plan. However, after feedback was obtained from the assessor in January, it became clear that a new direction for the project had to be devised, with some changes to the aims and requirements. This resulted in a new schedule for the second semester. Both the old and new schedules were shown in section 1.5.

To meet the new aims and requirements, more work was required in a much more restricted amount of time. More background research was needed to understand the problem, and the one week allocated was not enough. Moreover, as the corpus was being developed, it proved difficult to design the corpus in advance, since the categories to be included would vary depending on the texts collected. The texts therefore had to be collected first and then classified accordingly. Also, although the schedule stated that the feasibility investigation of ICE-Mauritius and the writing up of the proposal would be done after the pilot corpus was compiled, drafting the proposal alongside compiling the corpus proved easier, since the steps taken could be noted as they were carried out and new ideas kept surfacing for the final proposal. Drafting the proposal also meant that the feasibility of ICE-Mauritius was being addressed at the same time.

The initial schedule had been created without taking into account that, just before the end of the second term, other projects and essays would have to be submitted, so that not much time would be available to work on the project. The schedule therefore had to be revised again to account for this flaw.
With a clearer view of the amount of work the project would entail, the development of the corpus and the writing up of the proposal were both scheduled to be completed before the Easter break, to leave enough time to evaluate and write up the rest of the project during the holidays. This goal was achieved and, with only slight revisions of the corpus and of the proposal needed during the Easter break, there was enough time to evaluate the project. However, the time needed to get feedback from the people to whom the corpus and the proposal were sent was underestimated. Feedback was obtained only in the last week of the Easter break, leaving little time to write the evaluation. Nevertheless, the write-up was completed with a week to spare before submission, and that time was used to revise the final report.

Schedule 3 below shows the revised project schedule.

Schedule 3: Revised Project Schedule for second semester

Dates                  Milestones                       Tasks
24/01/05 - 31/01/05    Section 1                        Decide on new aims & objectives and design new plan
01/02/05 - 12/02/05    Section on Background Research   Research on methods available to extend the ICE corpus to Mauritius
13/02/05 - 20/02/05    Appendix C, D                    Collect sample texts from the Internet & send request for copyright permission
20/02/05 - 22/02/05    Appendix B and Section 4         Design layout and text categories of ICE-Mauritius
23/02/05 - 18/03/05    Section 4 and Appendix E         Annotate corpus
23/02/05 - 18/03/05    Appendix F                       Draft a proposal for ICE-Mauritius
01/03/05 - 18/03/05    Section 4 and proposal           Investigate feasibility of ICE-Mauritius
18/03/05 - 18/04/05    Evaluation                       Evaluate corpus & proposal
01/04/05 - 26/04/05    Final Report                     Complete final report. Most chapters should be already partially written up, but may need reworking.

7. Conclusion

As stated in the first section, the aim of this project was “to develop a prototype of the Mauritius component of the International Corpus of English, to demonstrate feasibility and potential problems for a larger-scale follow-up project”. Throughout this project, both the benefits and the difficulties of developing a corpus were discovered, together with the techniques and tools available for its development. The outcome was a prototype of ICE-Mauritius at 5 percent of the size of a full ICE corpus. In addition, a work plan for the follow-up project was drawn up, thereby showing the feasibility of an ICE-Mauritius collected solely through the Internet.

To summarise, the project fulfilled its minimum requirements as well as its suggested extended requirements, and it went further by providing a full proposal for an application to the EPSRC to fund a much wider and more ambitious ICE-lite project. Despite some issues that would need further consideration for the ICE-lite, much interest and approval were obtained from the two evaluators and field experts mentioned above, demonstrating its success. As future work, it is therefore hoped that the proposal will be sent to the EPSRC and that the prototype will be developed into a larger-scale project.

References:

Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.

Atwell, E. and Al-Sulaiti, L. (2005) Development of the International Corpus of Arabic. EPSRC Application Form (not yet submitted). University of Leeds.

Atwell, E. (1983) Constituent Likelihood Grammar. ICAME Journal (7), pp. 34-67.

Atwell, E., Demetriou, G., Hughes, J., Schiffrin, A., Souter, C., and Wilcock, S. (2000) A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal (24), pp. 7-23.

Atwell, E. (2004) Gambian English ICE Corpus.
University of Leeds, School of Computing. [News Group].

Baker, P. et al. (2003) Constructing corpora of South Asian languages. In Proceedings of the Corpus Linguistics 2003 conference, 16(1), 71-80.

BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity

Breyer, Y. (2005) Gateway to Corpus Linguistics on the Internet [online]. [Accessed 15th February 2005]. Available from World Wide Web: http://www.corpus-linguistics.de/corpora/corp_engl_a_e.html

Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (2005) A Practical Part-of-Speech Tagger. Palo Alto: Xerox Palo Alto Research Centre.

Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#

Edwards, J. (1995) Principles and alternative systems in the transcription, coding and mark-up of spoken discourse. In Leech, G., Myers, G. and Thomas, J. (eds.) (1995) Spoken English on Computer: Transcription, mark-up and application. Harlow: Longman.

EPSRC, The Engineering and Physical Sciences Research Council (2004) The EPSRC web site [online]. [Accessed 23rd October 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Funding Guide web site [online]. [Accessed 7th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Research Grants Beginners’ Guide [online]. [Accessed 26th October 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) EPSRC Mock Panel Guidance Notes [online]. [Accessed 6th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

EPSRC (2004) Guidance Notes for completing the Je-SRP1 (EPSRC) form [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.epsrc.co.uk/

Fang, A. (1996) AUTASYS: Grammatical Tagging and Cross-Tagset Mapping. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Garside, R. and Smith, N. (1997) A Hybrid Grammatical Tagger: CLAWS4. In Garside, R., Leech, G. and McEnery, T. (eds.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.

Greenbaum, S. (1991b) The development of the International Corpus of English. In Aijmer, K. and Altenberg, B. (eds.) English Corpus Linguistics. Studies in Honour of Jan Svartvik. London: Longman. pp. 83-91.

Greenbaum, S. (1996) Introducing ICE. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD

Kučera, H. and Francis, W.N. (1967) Computational analysis of present-day American English. Providence, Rhode Island: Brown University Press.

Laudon, K. and Laudon, J. (2002) Management Information Systems – Managing the Digital Firm. 7th edition. New Jersey: Prentice Hall.

Leech, G. (1997a) Introducing corpus annotation. In Garside, R., Leech, G. and McEnery, T. (eds.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.

Leech, G. (1997b) Grammatical Tagging. In Garside, R., Leech, G. and McEnery, T. (eds.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.

Leech, G. and Eyes, E. (1997) Syntactic Annotation: Treebanks. In Garside, R., Leech, G. and McEnery, T. (eds.) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora. London; New York: Longman.

Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 313-30.

Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.

Nelson, G. (1991a) Manual for Spoken Texts. London: Survey of English Usage, University College London.

Nelson, G. (1991b) Manual for Written Texts. London: Survey of English Usage, University College London.

Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.

Novacek, W. (2000) Bite’n’Byte: Software Development [online]. [Accessed 17th February 2005]. Available from World Wide Web: http://www.bitenbyte.com/

Quinn, A. and Porter, N. (1996) ICE Annotation Tools. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm

Sampson, G. (1995) English for the computer: The SUSANNE Corpus and analytic scheme. Oxford: Clarendon Press.

Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson, P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.

SourceForge (2005) Project: UniRed: Summary [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://sourceforge.net/projects/unired

The School of Computing, University of Leeds (1998-2004) The University of Leeds web site [online]. [Accessed 21st October 2004].
Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml

APPENDIX A: Personal Experience

Since I am doing a Joint Honours degree in Computing and Management, when selecting a project I was looking for a task that would allow me to combine knowledge gained from both subjects. My first choice, which was to develop a training course for research students on how to apply for funding, therefore seemed ideal at the time. When the project coordinator advised me to relate my project more closely to computing, I thought I would manage to do so by building an online training course. However, after feedback was obtained from the assessor in January, I was in a total state of shock and disappointment. At that moment, I realised that I should have listened to the advice I was given. It was clear that I would have to consider a new outline for my project. This meant that I needed to stop feeling sorry for myself and start working even harder right away.

Also, being a Joint Honours student and taking on a 40-credit Computing project meant that I could take only another 20 credits of Computing modules in the final year. In addition, having taken only a subset of the level 1 and level 2 modules, it was difficult to take other modules that would have been relevant to the project, such as Knowledge Management or Natural Language Processing.

A number of lessons were therefore learnt from this project, and the following advice can be given to future students:

• Choose a project that meets the requirements of the School. It is important to know what the School is expecting, what your supervisor and assessor are expecting and, most of all, what constitutes a good level 3 project. One recommendation is to read the final year project website and at least one past project carefully before deciding on and starting yours.
• Choose a project with a purpose or that interests you.
The project runs over two semesters and it is guaranteed that your initial enthusiasm will not last over its full course. It is therefore important that you choose a topic in which you have at least some interest, or which you care about and want to get involved in.
• Choose a project that is relevant to your course. Especially if you are a Joint Honours student, choose a project that allows you to make some use of the other half of your course, such as project planning/management for Computing and Management students.
• Always listen to advice given by your supervisor, the project coordinator and anyone else involved in the project. These people are more experienced and are there to guide you, so do not hesitate to contact them when you are confused. Don’t think you know best and can solve the problems by yourself. Also, the weekly meetings with the supervisor are very useful and should be attended.
• Do a considerable amount of background reading. Background reading may seem a waste of time, but it is very important for delivering a project of high quality. Firstly, the more you learn about something, the more interested and involved you get; secondly, background reading helps you understand the problem at an early stage and makes it easier to work on the project.
• Don’t plan on being able to work consistently. You will still have a lot of coursework and other assignments to complete, and allowances must be made for these if the rest of your studies are to be unaffected by the extra work the project requires. Similarly, leave time for the exam periods, and time for yourself and a break. You will need it!
• Don’t leave the write-up for the end. Always keep track of what you are doing and write the report as you go along. You are then less likely to forget to include something crucial to your project, and it saves you from being stressed nearer the deadline.
• Allow extra time and effort for evaluation. Any good evaluation relies on other people’s opinions or experience. However, third parties are very busy people and getting them involved may take longer than you expect. Therefore, ensure that your schedule is flexible. One recommendation is to start by requesting their help, then begin your own evaluation of the project, and drop everything when they are ready to help you.
• Never give up. There will be times during the project when everything seems to go wrong and you feel desperate, but remember that there is always a solution and that you are not the only one going through this nightmare.

My overall experience of this project has had both its good and its chaotic times; it was difficult to restart the project in the second semester, but I have enjoyed developing the pilot corpus and writing the proposal. The chance to work on a project of this size has given me the opportunity to develop project management, time management and report writing skills, which have already proven vital in my work outside University.

APPENDIX B: Markup Symbols

Written Text Markup Symbols

<#>                                Text unit marker
<I>...</I>                         Subtext marker
<l>                                Linebreak marker
<p>...</p>                         Paragraph marker
<h>...</h>                         Heading
<w>...</w>                         Orthographic word
<X>...</X>                         Extra-corpus text
<?>...</?>                         Uncertain transcription
<O>...</O>                         Untranscribed text
<.>...</.>                         Incomplete word
<->...</->                         Normative deletion
<+>...</+>                         Normative insertion
<=>...</=>                         Original normalization
<}>...</}>                         Normative replacement
<&>...</&>                         Editorial comment
<(>...</(>                         Discontinuous word
<)>...</)>                         Normalized discontinuous word
<@>...</@>                         Changed name or word
<sb>...</sb>                       Subscript
<sp>...</sp>                       Superscript
<ul>...</ul>                       Underline
<it>...</it>                       Italics
<bold>...</bold>                   Boldface
<typeface>...</typeface>           Change of typeface
<roman>...</roman>                 Roman type
<smallcaps>...</smallcaps>         Small capitals
<footnote>...</footnote>           Footnote
<fnr>...</fnr>                     Reference to footnote
<space>                            Orthographic space
<quote>...</quote>                 Quotation
<del>...</del>                     Deleted text
<marginalia>...</marginalia>       Marginalia
<mention>...</mention>             Mention
<indig>...</indig>                 Indigenous word(s)
<foreign>...</foreign>             Foreign word(s)

Spoken Text Markup Symbols

<$A>, <$B>, etc                    Speaker identification
<I>...</I>                         Subtext marker
<#>                                Text unit marker
<O>...</O>                         Untranscribed text
<?>...</?>                         Uncertain transcription
<->...</->                         Normative deletion
<+>...</+>                         Normative insertion
<=>...</=>                         Original normalization
<.>...</.>                         Incomplete word
<}>...</}>                         Normative replacement
<[>...</[>                         Overlapping string
<{>...</{>                         Overlapping string set
<,>                                Short pause
<,,>                               Long pause
<(>...</(>                         Discontinuous word
<)>...</)>                         Normalized disc. word
<X>...</X>                         Extra-corpus text
<&>...</&>                         Editorial comment
<@>...</@>                         Changed name or word
<w>...</w>                         Orthographic word
<quote>...</quote>                 Quotation
<mention>...</mention>             Mention
<foreign>...</foreign>             Foreign word(s)
<indig>...</indig>                 Indigenous word(s)
<unclear>...</unclear>             Unclear word(s)

APPENDIX C: Corpus Design layout

[Figure: tree diagram of the ICE-Mauritius corpus design. Node labels recoverable from the original: ICE-Mauritius; Spoken; Written; Dialogue; Monologue; Public; Scripted; Unscripted; Printed; Unprinted; Academic; Popular; Letters; Student Writing.]

Letters Student Writing Type ID 60 - Author Prosi Magazine Publisher Mauritius Apr-99 Email Sent? various http://sundayvani.in tnet.mu/Links/Your %20voice.htm Sunday Vani - - 2006 - 266 rhevateeg@ Yes lbis.intnet.m u Assessment and Rhevatee Reports - February Gobin 2003 http://www.lebocag e.net/circular/Asses sment%20RG.htm various 507 LBIS@intn Yes et.mu http://www.lebocag LETT01 School Fees 2004 Jean-Paul Le Bocage Moka, Nov-03 e.net/circular/Circul de Chazal International Mauritius ar%20fees2004.htm School Mauritius Le Bocage Moka, Feb-03 International Mauritius School Mauritius 323 info@roger Yes s.mu Port-Louis, Oct-04 Mauritius 250 ramchurnco Yes @intnet.mu 762 prosi@bo Yes w.intnet. mu Publisher Date Word Place Message from the R. Mauritius Reduit, 2001 President of the Ramchurn Veterinary Mauritius Association Association The 1998 Illovo Award project competition: Summary of proposals made by the winning team from Dr Maurice Curé State Secondary School Title http://www.rogers. LETT02 Letter to Hector Rogers mu/ Shareholders (New EspitalierGroup Structure) Noël http://mva.intnet.m u/messages_files/an ee.htm http://www.prosi. STU01 net.mu/mag99/36 3 Web Address No not delivered Accept ?
Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS APPENDIX D: List of Texts collected http://ncb.intnet .mu/ Speech ID Title Author Publisher 61 - http://mva.intne ACAD02 Diseases of Rabbits in t.mu/articles.ht Mauritius m http://www.uo Media and Democracy m.ac.mu/About Us/Newsle tter/j une_04.pdf - - Roukaya University of Reduit, Jun-04 Kasenally Mauritius Mauritius - - Government Port-Louis, Dec-04 of Mauritius Mauritius Government Port-Louis, Jan-96 of Mauritius Mauritius 313 roukaya@u Yes om.ac.mu 1499 - 993 barthestude nts @ yahoo.co.u k 1407 - 2072 - 1130 - Port-Louis, Nov-03 Mauritius Sent? 818 ncb01@nc Yes b.intnet.mu Email Port-Louis, Apr-04 Mauritius Publisher Date Word Place R. University of Reduit, Ramchurn Mauritius Mauritius - Speech by Hon. A.K. Gayan, Minister of Tourism and Leisure on the occasion of the handing over of certificates to skippers Address by the PresidentYear 1996 Academic http://sundayva ACAD01 Problems facing the bar ni.intnet.mu/Lin student in Mauritius – Law ks/views.htm students threatened with a lethal blow? http://tourism.g ov.mu/speech1. htm http://mauritius assembly.gov.m u/assem96.htm Speech by Chairman of the Mr Kemraz National National Computer Board Mohee Computer Board http://ncb.intnet SPEE01 Address by the Hon. Sushil Hon. Sushil National .mu/medrc.htm Khushiram, Minister of Khushiram Computer Development, Financial Board Services and Corporate Affairs on E-Business Web Address Type not delivered not delivered Accept ? Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS 62 Biological Diversity Samad Rojoa and approaches to its Conservation Environment Protection Legislation should be more business friendly The Music Scene http://pages.intnet. mu/nathraj/article3. html http://www.jecmauritius.org/ http://www.infomau ritius.com/mauritius /latest/the_music_sc ene/?sid=35 Teleservices Ltd, the efficient response... 
http://www.serviho o.com/channels/kin ews/v3dossier_detai ls.php?id=61438 POP02 The Mauritius Kestrel, once the world's rarest bird http://www.maurine POP01 t.com/wildlife.html - - - Armand F. Pampusa - Raj Makoond Prosi Magazine - - - - - - Mauritius Oct-96 - - The Mauritian COMPNet Port Louis, Wildlife Mauritius Foundation - - Email Sent? 384 dahkiam@i Yes ntnet.mu 643 jec@intnet. Yes mu 569 - 210 tplus@intn Yes et.mu 700 - 405 hema@the Yes mauritianco nnection.co m 549 tplus@intn Yes et.mu Publisher Publisher Date Word Place Miss Hema Malini Paupiah Mauritian Sega Author http://www.themaur itianconnection.com /culture/sega/index. html Title Popular ID The transit of Venus Ricaud Auckbur Web Address Academic http://www.serviho o.com/channels/kin ews/v3dossier_detai ls.php?id=43863 Type Accept ? Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS 63 http://www.prosi.net.mu/ mag99/365june/pram365 .htm http://www.businessmag. REP01 mu/displayNewsContent. asp?NID=5747&CID=30 Bridges of Hope: A post-violence social project Mr. Philippe Boullé: Jacques “Mauritius is looked Dinan at as an important economic entity” Defuse this time bomb http://www.businessmag. mu/displayNewsContent. asp?NID=6232&CID=26 - - Circle Cycle Tour wheels in new sponsor... http://www.servihoo.com/ channels/kinews/v3dossi er_details.php?id=44481 - National literacy and numeracy strategy (NL & NS) http://ministryeducation.gov.mu/majpro j/natlit.htm mu/default.asp?CID=10 Prozi Magazine Business Magazine Business Magazine - - - Feb-03 Mauritius Jun-99 Port Louis, Feb-05 Mauritius Port Louis, Feb-05 Mauritius - - Port Louis, Feb-05 Mauritius - Freeport Port Louis, Operations Mauritius (Mauritius) Ltd Email 451 - ntnet.mu 1693 busmag@i Yes ntnet.mu 2150 busmag@i Yes t.mu 467 tplus@intne Yes mail.gov.m u 928 psaddul@ Yes uom.ac.mu 1012 k.jankeee@ Yes 1739 - not delivered Sent? Accept ? 278 contact@fo Yes m.co.mu Publisher Publisher Date Word Place M.O. 
Rotary Club Bakarkhan - Author Competition in the Dr Business banking sector: Chandan Magazine further evidence: Jankee “Actions speak louder than words” Mauritius Drug Profiles http://rotary.intnet.mu/ Title Société Du Port ID http://www.freeportmauritius.com/holding/ Web Address Reportage http://www.businessmag. REP02 Popular Type Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS Web Address MEF believes Budget should provide more for skill development Mr. Assad Bhuglah, Director, Trade Policy Unit: “The lobbying for the recognition of SIDS by the WTO should start right from now” CAC: Without Fear and Sir SatcamMauritius Favour? Boolell Times Whatever happened to Sir SatcamMauritius the Sachs Commission? Boolell Times http://www.businessmag. mu/displayNewsContent. asp?NID=6380&CID=8 http://www.businessmag. mu/displayNewsContent. asp?NID=6349&CID=26 64 http://mauritiustimes.co m/060902ssb.htm http://mauritiustimes.co m/041002ssb.htm Business Magazine Business Magazine Business Magazine Business Magazine Business Magazine Pointe-aux- Oct-02 Sables, Mauritius Pointe-aux- Sep-02 Sables, Mauritius Port Louis, Mar-05 Mauritius Port Louis, Mar-05 Mauritius Port Louis, Mar-05 Mauritius Port Louis, Mar-05 Mauritius Port Louis, Feb-05 Mauritius Email 1020 mtimes@i ntnet.mu 1012 mtimes@i ntnet.mu 1541 busmag@i ntnet.mu 409 busmag@i ntnet.mu 1468 busmag@i ntnet.mu 467 busmag@i ntnet.mu 1182 busmag@i ntnet.mu Publisher Publisher Date Word Place MCB estimates growth rate at 4.2% last year and at 5.2%in 2005 Author http://www.businessmag. mu/displayNewsContent. asp?NID=6304&CID=8 MCCI stresses the need to develop a more business-friendly environment Title Proposals from the Printers & Stationery Manufacturers Association (PSMA) ID http://www.businessmag. mu/displayNewsContent. asp?NID=6381&CID=8 Reportage http://www.businessmag. mu/displayNewsContent. asp?NID=6378&CID=8 Type Sent? Accept ? 
Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS 65 1999 Illovo Award Inter-College Project Competition Salt fish in tomato sauce Strategic Plan Cabinet Decisions taken on 04 March 2005 http://www.prosi.net. mu/mag99/366july/ilo vo366.htm http://ilemaurice.tripod.com/ro ugpoisal.htm http://www.uom.ac.m u/AboutUs/StrategicP lan/overview.htm http://pmo.gov.mu/de cision.htm - - Madeleine Philippe - - Pointe-aux- Feb-03 Sables, Mauritius Pointe-aux- Sep-02 Sables, Mauritius - Government Port-Louis, Mar-05 of Mauritius Mauritius - - Mauritius Jul-99 University of Reduit, Mauritius Mauritius - Prozi Magazine Email 1434 webmasterportal@mai l.gov.mu 1501 centraladmi n@uom.ac. mu @cjp.net 563 madeleine Yes intnet.mu 666 prosi@bow. Yes m.intnet.mu Yes not delivered Sent? Accept ? 2010 director@ut Yes 1635 mtimes@in tnet.mu 1476 mtimes@in tnet.mu Publisher Date Word Place University of Pointe-aux- Feb-04 Technology, Sables, Mauritius Mauritius Robert Lesage should S. Mauritius not allow himself to Modeliar Times be intimidated by anybody and least of all by ICAC and his arrest Publisher Sir SatcamMauritius Boolell Times Author http://mauritiustimes. com/210203mod.htm Title The Choice Cannot Be Clearer ID http://mauritiustimes. com/200902/200902s sb.htm Web Address Instructional http://www.utm.ac.mu INS01 Admission / Regulations Reportage Type Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS For a few at the Madhukar Mauritius cost of the many Ramlallah Times The Tide Is Turning Democratisation Madhukar Mauritius Ramlallah Times Harman Dahl's Legacy Not your day to die Mauritius and Sugar http://mauritiustimes .com/300802edito.h tm http://mauritiustime s.com/200902/2009 02edit.htm 66 http://mauritiustimes .com/040305mr.htm http://pages.intnet. NOV01 mu/rajbalkeehomep age/hdcomplete.htm http://pages.intnet. mu/rajbalkeehomep age/n-one.htm http://www.prosi.net. 
mu/simau97/prefac e.htm Creative Jacques Dinan Prozi Magazine Raj Balkee Oceanic Publishing Raj Balkee Oceanic Publishing Madhukar Mauritius Ramlallah Times Madhukar Mauritius Ramlallah Times - - Date Mauritius May-97 Mauritius 1995 Mauritius 2001 Pointe-aux- Mar-05 Sables, Mauritius Pointe-aux- Sep-02 Sables, Mauritius Pointe-aux- Feb-05 Sables, Mauritius Pointe-aux- Aug-02 Sables, Mauritius Government Port-Louis, of Mauritius Mauritius Subservience MSM style Publisher Place Government Port-Louis, of Mauritius Mauritius Publisher http://mauritiustimes EDIT01 .com/040205mr.htm Editorial - - Author Functions of the National Assembly Title http://mauritiusasse mbly.gov.mu/role/f unction.htm ID Our Constitution Web Address Instructional http://www.gov.mu/ govt/g_const.htm Type Email Sent? 1857 - ntnet.mu 1988 rajbalkee@i ntnet.mu 1890 rajbalkee@i Yes net.mu 820 mtimes@int net.mu 709 mtimes@int net.mu 835 mtimes@int net.mu 862 mtimes@int Yes 993 webmasterportal@mai l.gov.mu 730 webmasterportal@mai l.gov.mu Word Yes Yes Accept ? Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS Koo Tee Fong, Dolly A PILOT PROJECT FOR ICE-MAURITIUS APPENDIX E: Sample of the Letters of Copyright Sample 1: First letter to explain the purpose of the corpus and for the owners and authors to keep 10 February 2005 Dear General Director of Request for permission to use texts for linguistic research Creation of a Mauritian Corpus of English I am working on a student project at the University of Leeds that involves collecting English texts from Mauritian people in electronic form and storing them on a computer to create a corpus that may be freely available to all via the Web. I believe that you are the owner of the text(s) of on the website: I would like to use the text(s) as part of the corpus. People would be able to access your text(s) and the text(s) of others for further research and teaching. 
We may also want to use the text(s) for developing electronic products such as translators and dictionaries.

I would be very grateful if you would grant to myself and the University of Leeds a free and perpetual non-exclusive licence for the above purposes only. In consideration for your consent mentioned above, I will gladly acknowledge your contribution in any relevant material.

If you agree to above and can confirm that there are no other third parties that have any further rights in the text(s) that I need to contact, please acknowledge your acceptance to this by returning signed and dated the attached copy of this letter.

Yours faithfully
Dolly Koo
Phone: [0044 - 7818855441]
Email: [jhs2dlyk@leeds.ac.uk]
Address: [c/o Mr Eric Atwell, Senior Lecturer, University of Leeds, Leeds LS2 9JT, United Kingdom]

Sample 2: Second letter for authors and owners to sign if they agree for their websites to be used.

10 February 2005

Dear General Director of

Request for permission to use texts for linguistic research
Creation of a Mauritian Corpus of English

I am working on a student project at the University of Leeds that involves collecting English texts from Mauritian people in electronic form and storing them on a computer to create a corpus that may be freely available to all via the Web. I believe that you are the owner of the text(s) of on the website: I would like to use the text(s) as part of the corpus. People would be able to access your text(s) and the text(s) of others for further research and teaching. We may also want to use the text(s) for developing electronic products such as translators and dictionaries.

I would be very grateful if you would grant to myself and the University of Leeds a free and perpetual non-exclusive licence for the above purposes only. In consideration for your consent mentioned above, I will gladly acknowledge your contribution in any relevant material.
If you agree to above and can confirm that there are no other third parties that have any further rights in the text(s) that I need to contact, please acknowledge your acceptance to this by returning signed and dated the attached copy of this letter.

This is to confirm to the School of Computing at Leeds University that I agree to give permission for all the texts on my website to be used as explained to me by the researcher. I also agree to make the Corpus available for public use by researchers, students and language engineers.

Name (in block capitals)_____________________________________
Signature: ________________________________________________
Date: ____________________________________________________

APPENDIX F: Template for the header

<tei.2>
  <teiHeader id=" ">
    <fileDesc>
      <titleStmt>
        <title> </title>
        <author> </author>
        <respStmt>
          <resp>compiled by </resp>
          <name>Dolly Koo</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher> </publisher>
        <pubPlace> </pubPlace>
        <date></date>
      </publicationStmt>
      <sourceDesc>
        <p>created in machine-readable form in “ “ </p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
      </projectDesc>
      <samplingDecl>
        <p>Whole text of “ “ words copied from the site </p>
      </samplingDecl>
    </encodingDesc>
    <profileDesc>
      <creation>
        <date value=" "> </date>
        <rs type="city"> </rs>
      </creation>
      <langUsage>English</langUsage>
      <textClass>
        <textDesc n=" ">
          <channel mode="w">print; written</channel>
        </textDesc>
        <particDesc>
          <person id=" " sex=" " />
        </particDesc>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
    </body>
  </text>
</tei.2>

APPENDIX G: Example of Encoded Text

An example of a raw text which belongs to the “Reportage” category, with id “REP14.xml”.
Title: Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest
Author: S. Modeliar
Publisher: Mauritius Times
Publisher Place: Pointe-aux-Sables, Mauritius
Date: February 2003
Source: http://mauritiustimes.com/210203mod.htm
Email: mtimes@intnet.mu
Amount of Words: 1637

Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest

“Robert Lesage should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down…”

After the Cuttaree affair and his shielding in circumstances which have been repeated ad nauseam, after a minister of the present government has been arrested on suspicion of corruption and after a complete inaction in the case of another minister, another scandal has emerged. This time it does not concern any Swiss bank involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam though Paul Bérenger has already ruled that these latter two are guilty. This time one of the most respected banks of the country, the Mauritius Commercial Bank, better known as the MCB and being considered the best, is involved.

Until the ramifications of what is described as a fraud with regard to the National Pensions Fund (NPF) are known no blame should be attached to anybody. One section of the press has talked about this and has referred to Minister Choonee. It is to be hoped that such an attitude becomes a general feature of the press and that the civilised press as opposed to the gutter press and to the partisan press will be prevailed upon when it comes to the innocence and reputation of people. This philosophy should also be the hallmark of certain politicians who can blow hot and cold at the same time.
While pontificating about presumption of innocence in the case of those close to the regime, because only those who espouse the cause of the supreme leader of Mauritius can aspire to be appointed to posts in the services including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the MCB scandal has emerged he is trying to make a connection between the case at the MCB with the Swiss bank affair. On what basis he is doing that is not clear and yet he is saying that the whole matter will be fully investigated. Who will investigate the matter? Is it going to be investigated by ICAC? Whether we like it or not, ICAC is yet to be perceived as a totally independent institution and totally free from political influences. Even if it is, the perception is otherwise. At times perception of independence is as important if not more important than independence itself.

In addition to trying to lay the blame for the MCB scandal on the Labour Party through a Swiss connection, Paul Bérenger is all praise for the MCB. By so doing Paul Bérenger is already brainwashing public opinion against any malpractice or offence that may have been committed by the MCB or any member of the MSM or MMM because it should not be forgotten that it appears that the scandal dates as far back as 1992, a time at which Paul Bérenger was in a coalition with Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger shout victory too soon as the investigators would have to find out who were the ministers responsible for the NPF from 1992 up to today. The dates at which the funds have been misused will have to be determined as well as the companies that benefited from those transfers of funds.

But how will all this be determined? One would expect an impartial approach to the investigation.
This is simply not possible and the perception is that this is not the case right now. Paul Bérenger has already placed a political coloration on the whole matter. The MCB has already talked of a total absence of any conspiracy at the level of the bank and is suggesting that the accusation of conspiracy at the bank finds its source in a personal vendetta of Robert Lesage. Most disturbing is the attitude of ICAC which has indicated yet once more that it is not functioning as a completely independent body. If proof is needed it is be found in the very revealing statement of both the Commissioner of ICAC, Navin Beekharry and that of Robert Lesage. Let us hope that the office of the DPP does not join the bandwagon. According to reports Robert Lesage is alleged to have stated that “…being given the new approach taken, I have decided to withdraw my cooperation with the inquiry altogether and not to make any statement. However, I confirm that I am still willing to continue my cooperation with the inquiry so long as the line taken since the beginning. But if such cooperation is resumed, I shall tell the truth, the whole truth and nothing but the truth.” In fact it would appear that what the ICAC investigators have been trying to do is to accept Robert Lesage’s statement on part of the scandal or investigation. Mr Beekharry, the independent commissioner of ICAC confirms this view in a statement to the press. What he says is that the statement of Robert Lesage will be taken according to procedures and according to revelations made. This is a very disturbing and vague statement and defies all logic. Surely when an investigation is underway the person who is willing to make a statement should be allowed to say all that he knows without any form of censorship and, once everything is taken down, then the investigator can retain whatever is relevant. 
The procedure that ICAC is propounding may lead to the conclusion that he does not want Robert Lesage to say all that he knows in order to shield some people. If this is the case or the perception, then let ICAC be closed down. Perhaps the novel investigative procedure that is put forward by the independent commission is unprecedented in the history of investigations. Now that Mr Beekharry has himself admitted that there has been an attempt to censure the statement of Robert Lesage he should explain to the public, in the name of transparency, and in the interest of ICAC, what he means by censorship. He should also explain in detail the procedures of any investigations and especially the taking of statements so that in future well-meaning citizens who want to expose those who have been making money illegally, will know what stand to take vis-à-vis so-called independent institutions. The arrest of Robert Lesage is also very revealing. This man has been praised by many of his former friends and colleagues as somebody who is clean. He went to the ICAC following the discovery of the misuse of the NPF funds and was not unduly worried as he told the ICAC investigators what he knew. However when he decided to make a written statement and is confronted by what seemed to be an arbitrary censorship on what he was going to say, and when he refused to play that kind of game it is only then that he is arrested. One wonders what Mrs Indira Manrakhan would have been made to endure if she had adopted such a procedure. Why is that Robert Lesage was not arrested following his oral statement? What additional information has come to light between the first appearance of Robert Lesage at ICAC and his arrest? On what basis has he been arrested? In the absence of a clear and unequivocal communiqué from ICAC, the impression would be that he was arrested in order to exert pressure on him in order to compel him to say only what the ICAC, for reasons best known to it, wants to hear. 
Rumour has it that politicians of all parties have been named by Robert Lesage. The MCB itself has said that no proper control of the NPF funds could have been made as high profile people were involved in the management of those funds. A former financial secretary who is very close to the MSM has an objection to departure against him. The names of officials at the bank have been named. The siphoning of funds to private companies has been taking place since the late eighties. Paul Bérenger was in the 1991 government. Questions also relate to those responsible for the audit of the MCB, the audit of the NPF funds and the overall responsibility of different politicians who had charge of such funds. It is not going to be a simple inquiry and censorship, the Beekharry style will certainly not help. Nobody should be spared. No stone should be left unturned to get to the truth because important government funds and an important bank are involved. Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest. He is being legally advised and as a responsible citizen he should go all the way by making public all that he knows. He should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down. For too long with the MSM or the MMM, there have been selective investigations with regard to fraud on a political line. It is high time for things to change. If the world population can get the United States to change its mind on war with Iraq, why can’t the people of Mauritius organise rallies against fraudsters and their occult institutional allies? S. 
MODELIAR

APPENDIX H: Example of Encoded Text

Example of the encoded version of the text “REP14.xml” above.

<?xml version="1.0" encoding="utf-8" ?>
<tei.2>
  <teiHeader id="REP14">
    <fileDesc>
      <titleStmt>
        <title>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest</title>
        <author>S. Modeliar</author>
        <respStmt>
          <resp>compiled by </resp>
          <name>Dolly Koo</name>
        </respStmt>
      </titleStmt>
      <publicationStmt>
        <publisher>Mauritius Times</publisher>
        <pubPlace>Pointe-aux-Sables, Mauritius</pubPlace>
        <date>2003</date>
      </publicationStmt>
      <sourceDesc>
        <p>created in machine-readable form in http://mauritiustimes.com/210203mod.htm</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc>
        <p>Texts collected for use in the pilot project for ICE-Mauritius, February, 2005</p>
      </projectDesc>
      <samplingDecl>
        <p>Whole text of 1637 words copied from the site</p>
      </samplingDecl>
    </encodingDesc>
    <profileDesc>
      <creation>
        <date value="2003-02">Feb 2003</date>
        <rs type="city">Pointe-aux-Sables</rs>
      </creation>
      <langUsage>English</langUsage>
      <textClass>
        <textDesc n="01">
          <channel mode="w">print; written</channel>
        </textDesc>
        <particDesc>
          <person id="P1" sex="male" />
        </particDesc>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <bold>
        <p>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest</p>
        <p> <it>“Robert Lesage should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic.
Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down…”</it> </p>
      </bold>
      <p>After the Cuttaree affair and his shielding in circumstances which have been repeated ad nauseam, after a minister of the present government has been arrested on suspicion of corruption and after a complete inaction in the case of another minister, another scandal has emerged. This time it does not concern any Swiss bank involving Eric Stauffer, Vasant Bunwaree and Navin Ramgoolam though Paul Bérenger has already ruled that these latter two are guilty. This time one of the most respected banks of the country, the Mauritius Commercial Bank, better known as the MCB and being considered the best, is involved.</p>
      <p>Until the ramifications of what is described as a fraud with regard to the National Pensions Fund (NPF) are known no blame should be attached to anybody. One section of the press has talked about this and has referred to Minister Choonee. It is to be hoped that such an attitude becomes a general feature of the press and that the civilised press as opposed to the gutter press and to the partisan press will be prevailed upon when it comes to the innocence and reputation of people. This philosophy should also be the hallmark of certain politicians who can blow hot and cold at the same time.</p>
      <p> <marginalia> While pontificating about presumption of innocence in the case of those close to the regime, because only those who espouse the cause of the supreme leader of Mauritius can aspire to be appointed to posts in the services including ICAC, Paul Bérenger has already found Vasant Bunwaree and Navin Ramgoolam guilty of offences in relation to the Swiss bank affair. Now that the MCB scandal has emerged he is trying to make a connection between the case at the MCB with the Swiss bank affair.
On what basis he is doing that is not clear and yet he is saying that the whole matter will be fully investigated. Who will investigate the matter? Is it going to be investigated by ICAC? Whether we like it or not, ICAC is yet to be perceived as a totally independent institution and totally free from political influences. Even if it is, the perception is otherwise. At times perception of independence is as important if not more important than independence itself. </marginalia> </p>
      <p>In addition to trying to lay the blame for the MCB scandal on the Labour Party through a Swiss connection, Paul Bérenger is all praise for the MCB. By so doing Paul Bérenger is already brainwashing public opinion against any malpractice or offence that may have been committed by the MCB or any member of the MSM or MMM because it should not be forgotten that it appears that the scandal dates as far back as 1992, a time at which Paul Bérenger was in a coalition with Sir Anerood Jugnauth before being booted out in 1993. So let not Paul Bérenger shout victory too soon as the investigators would have to find out who were the ministers responsible for the NPF from 1992 up to today. The dates at which the funds have been misused will have to be determined as well as the companies that benefited from those transfers of funds. </p>
      <p>But how will all this be determined? One would expect an impartial approach to the investigation. This is simply not possible and the perception is that this is not the case right now. Paul Bérenger has already placed a political coloration on the whole matter. The MCB has already talked of a total absence of any conspiracy at the level of the bank and is suggesting that the accusation of conspiracy at the bank finds its source in a personal vendetta of Robert Lesage. Most disturbing is the attitude of ICAC which has indicated yet once more that it is not functioning as a completely independent body.
If proof is needed it is be found in the very revealing statement of both the Commissioner of ICAC, Navin Beekharry and that of Robert Lesage. Let us hope that the office of the DPP does not join the bandwagon. </p> <p>According to reports Robert Lesage is alleged to have stated that “…being given the new approach taken, I have decided to withdraw my cooperation with the inquiry altogether and not to make any statement. However, I confirm that I am still willing to continue my cooperation with the inquiry so long as the line taken since the beginning. But if such cooperation is resumed, I shall tell the truth, the whole truth and nothing but the truth.” In fact it would appear that what the ICAC investigators have been trying to do is to accept Robert Lesage’s statement on part of the scandal or investigation. Mr Beekharry, the independent commissioner of ICAC confirms this view in a statement to the press. What he says is that the statement of Robert Lesage will be taken according to procedures and according to revelations made. This is a very disturbing and vague statement and defies all logic.</p> <p>Surely when an investigation is underway the person who is willing to make a statement should be allowed to say all that he knows without any form of censorship and, once everything is taken down, then the investigator can retain whatever is relevant. The procedure that ICAC is propounding may lead to the conclusion that he does not want Robert Lesage to say all that he knows in order to shield some people. If this is the case or the perception, then let ICAC be closed down. Perhaps the novel investigative procedure that is put forward by the independent commission is unprecedented in the history of investigations. Now that Mr Beekharry has himself admitted that there has been an attempt to censure the statement of Robert Lesage he should explain to the public, in the name of transparency, and in the interest of ICAC, what he means by censorship. 
He should also explain in detail the procedures of any investigations and especially the taking of statements so that in future well-meaning citizens who want to expose those who have been making money illegally, will know what stand to take vis-à-vis so-called independent institutions.</p>
      <p>The arrest of Robert Lesage is also very revealing. This man has been praised by many of his former friends and colleagues as somebody who is clean. He went to the ICAC following the discovery of the misuse of the NPF funds and was not unduly worried as he told the ICAC investigators what he knew. However when he decided to make a written statement and is confronted by what seemed to be an arbitrary censorship on what he was going to say, and when he refused to play that kind of game it is only then that he is arrested. One wonders what Mrs Indira Manrakhan would have been made to endure if she had adopted such a procedure. Why is that Robert Lesage was not arrested following his oral statement? What additional information has come to light between the first appearance of Robert Lesage at ICAC and his arrest? On what basis has he been arrested? In the absence of a clear and unequivocal communiqué from ICAC, the impression would be that he was arrested in order to exert pressure on him in order to compel him to say only what the ICAC, for reasons best known to it, wants to hear. </p>
      <p><marginalia> Rumour has it that politicians of all parties have been named by Robert Lesage. The MCB itself has said that no proper control of the NPF funds could have been made as high profile people were involved in the management of those funds. A former financial secretary who is very close to the MSM has an objection to departure against him. The names of officials at the bank have been named. The siphoning of funds to private companies has been taking place since the late eighties. Paul Bérenger was in the 1991 government.
Questions also relate to those responsible for the audit of the MCB, the audit of the NPF funds and the overall responsibility of different politicians who had charge of such funds. It is not going to be a simple inquiry and censorship, the Beekharry style will certainly not help. Nobody should be spared. No stone should be left unturned to get to the truth because important government funds and an important bank are involved. </marginalia> </p>
      <p>Robert Lesage should not allow himself to be intimidated by anybody and least of all by ICAC and his arrest. He is being legally advised and as a responsible citizen he should go all the way by making public all that he knows. He should himself write out his statement and send it to the police, to ICAC, to the DPP, to Transparency International and to the President of the Republic. Only then will he acquire some legitimacy in his allegations and only then that the citadel of fraud that some unfortunately do not want to demolish will come crumbling down. For too long with the MSM or the MMM, there have been selective investigations with regard to fraud on a political line. It is high time for things to change. If the world population can get the United States to change its mind on war with Iraq, why can’t the people of Mauritius organise rallies against fraudsters and their occult institutional allies?</p>
      <h> <bold>S. MODELIAR</bold> </h>
    </body>
  </text>
</tei.2>

APPENDIX I: First Draft of the Case for Support for the ICE-lite Proposal

EPSRC Research Proposal: Development of the ICE-lite

Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT

1. Background

The University of Leeds and University College London have done previous research on computer analysis of English language texts, also known as English Corpus Linguistics.
One example is the development of a Part-of-Speech analysis system, which is being used in other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the main language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching (ICE, 2002).

“The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.” (ICE, 2002)

Mauritius is one of the many English-speaking African countries, but there is no Mauritian sub-corpus in ICE yet. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004). However, at that time, slaves were imported from Africa and Madagascar, a large number of labourers from India were brought to work in the sugar cane fields, a small number of Chinese came to trade, and the influence of the French, who were the rulers before the British, was still very strong. The languages brought by the Hindu workers and merchants from India include Bhojpuri, Hindi, Tamil, Telegu, Marathi and Gujerati. The Chinese who came to Mauritius generally spoke Hakka or Cantonese, and the Muslim workers from India spoke Arabic or Urdu. The slaves brought Malagasy (the language spoken in Madagascar) and Afrikaans to the country as well.
All those different languages have had quite a big impact on the official language, which is English, but the mixture of those languages also resulted in a new language, which is Creole. Creole is the most widely spoken language on the island and it is used by more than half the population, including many people who are not of Creole descent. However, even though Creole is the most common language in Mauritius, all official communications and teaching in schools are done in English.

With the influence of the other languages, the traditional English brought by the British settlers has undergone drastic changes. In many official communications or press reports, for instance, we come across some French or Creole words. These may be names of individuals or companies, or they may be used only to put some emphasis on a theme. School textbooks often contain words in Hindi or Chinese, depending on whether they are used in private or public schools. It is also important to note that dialectal variation is reflected much more in spoken than in written Mauritian English. This is because people tend to think in their native language and then translate what they want to say into English. Therefore the structure and grammar of sentences differ among the different cultures in Mauritius.

There are already numerous Mauritian websites available on the World Wide Web, most of them written in English, and the government has just started a Cyber City project, which is the first of a new generation of IT parks in this part of the world (BPML, 2004). Therefore, it will be feasible to collect at least some written samples of Mauritian English remotely, via the World Wide Web.
Judging from the success of the other ICE corpora, namely the sub-corpora from Australia, Great Britain, India, Hong Kong and East Africa, among others, researching Mauritian English and developing a sub-corpus for the country will support its development, whether in IT, research or teaching. However, the Mauritius sub-corpus will be compiled by collecting texts mostly from the Internet, and some amendments will have to be made to the standards, since the types of text available on the Internet will not match the text categories of ICE, and other information, such as details of authors and publication, will not be easily available on the Internet. With this technique, the sub-corpus will not contain all the text categories required in the standard ICE scheme; instead, we will develop an “ICE-lite” scheme to simplify compilation of an ICE-Mauritius corpus.

Furthermore, a more ambitious extension will be to include other types of English from other English-speaking countries in the corpus. It will be a quick and simple way of compiling a corpus of ten million words, with around five hundred thousand words for each country. The twenty English-speaking countries not currently covered by ICE to be included in the corpus are: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda, Zambia and Zimbabwe.

2. Objectives

To achieve the goal of developing a Multi-national Corpus of English, the following specific research objectives have been identified:
• To set up the infrastructure and a prototype sampler corpus for the Multi-national Corpus of English.
• To collect, mark up and lexico-grammatically annotate different samples of spoken and written texts in English from the twenty countries: 250 texts of approximately 2,000 words each for each country, a total of approximately ten million words.
• To use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sampler corpus.

3. Research Work-Plan

The work has been organised into eight activity streams:

WP1: Collection of Spoken and Written Text of English (24 months RF1, 12 months RF2, 9 months RF3)
The corpus will contain texts from 1990 or later. The total amount of text needed will be 250 texts of approximately 2,000 words each for each country, a total of approximately ten million words. The authors and speakers of the texts need to have been brought up and taught through the English medium. They must be aged 18 or over and must either have been born in the country or have immigrated to it at an early age.

1.1 Written Text
All of the written texts will be collected from the Internet and other freely available sources. However, it will be difficult to obtain non-printed texts such as social letters or student essays.

1.2 Spoken Text
Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will only be possible if done on site, i.e. in each of the countries. Therefore, we will only seek to use sources such as radio and TV broadcasts, which are available on the Internet. The solution proposed by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed material), correspondence & spoken language samples”.

1.3 Copyright issues
Letters of copyright will have to be obtained. This will involve, in the first place, identifying the owners of sources and finding the right contact details.

1.4 Classification of texts
Selecting and organising the texts will be a complex task and careful consideration is required. As far as possible, the text classification will follow the ICE standard categories, but it is expected that some text categories will not be available on the Internet.
It is also important to classify each text according to the country it comes from.

Deliverable D1: A detailed list of the texts collected, together with information about the authors, publisher, and place and date of publication.

WP2: Transcription (19 months RF1, 12 months RF2, 9 months RF3)
Most of the written texts will be in electronic format already. After the spoken texts have been collected and permission is received, the spoken texts will be transcribed, that is, written on paper or typed on screen. It is expected that most of the recorded speech will be in digital format already, since it will be collected from the Internet.

Deliverable D2: A sampler of the raw Multi-national Corpus of English.

WP3: Textual Mark-up (6 months RF1, 2 months RF2, 2 months RF3)

3.1 Encoding of Text
The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost when they are converted into plain text files on a computer will be encoded. In written texts this includes features such as boldface, italics and underlining, as well as sentence boundaries, paragraph boundaries and headings. In spoken texts the encoded features will be sentence boundaries, speaker turns and pauses (Nelson, 1996a). Paragraphing and header information (adapted from the ICE standards) regarding author, publisher, etc. will be added. Texts in different formats (Doc, PDF, HTML) will be converted into a unified framework (XML format) (Al-Sulaiti, 2004).

3.2 Proofread Text
Both spoken and written texts will be proofread on screen. This task includes deleting extra and unnecessary material from the texts and checking and adjusting paragraph markers.

Deliverable D3: Multi-national Sampler Corpus ready for distribution.

WP4: Word-class tagging (18 months RF3)
As in the other ICE components, the texts will be “automatically tagged for wordclass by the TOSCA Tagger, developed by the TOSCA Research Group at the University of Nijmegen.
This assigns wordclass tags to each lexical item in the corpus. The tagset has been developed especially for ICE, and is largely based on Quirk et al (1985) A Comprehensive Grammar of the English Language” (ICE, 2002). During this stage, each item will be assigned a label or tag, for example, ‘N’ for noun and ‘ADV’ for adverb. Other information, such as singular or plural, or the verb tense, will be added in brackets next to the label or tag.

Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.

WP5: Syntactic parsing
The tagged corpus from the previous stage forms the input to the next major stage, the syntactic parsing.

5.1 Syntactic marking (18 months RF2)
The corpus is pre-edited (also known as syntactic marking) before the rest of the parsing stage. This involves manually marking several high-frequency constructions in order to reduce the ambiguity of the input, and thereby reduce the number of decisions that the automatic parser has to make.

5.2 TOSCA parser
The TOSCA parser (Nelson et al., 2002), software developed by the TOSCA group, will be used to automate this stage. The output from the TOSCA parser will be a series of labelled syntactic trees, in which the nodes are labelled for function, category and features.

5.3 Manual Analysis
The TOSCA parser should yield a complete analysis for around 70% of the parsing units in the corpus; for the remainder, the analysis will have to be done manually.

Deliverable D5: An analysis of the corpus at phrase, clause and sentence level, shown in the form of parse trees.

WP6: Evaluation (7 months RF1, 2 months RF2, 17 months RF3)

6.1 Cross-sectional checking
The syntactic trees will be checked on a cross-sectional, construction-by-construction basis.
This will allow the check to concentrate on just one grammatical construction at a time, and corrections can be made to each instance of the construction throughout the whole corpus, if necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.

6.2 Spot-checking
Finally, the corpus will be ‘spot-checked’ before being released.

Deliverable D6: The final Multi-national Corpus of English.

WP7: Comparison across dialects (10 months RF1, 16 months RF2, 6 months RF3)
We will use English concordance and corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in the sampler corpus.

Deliverable D7: Research paper on lexical and grammatical variation in the Multi-National Sampler Corpus, to be submitted to the International Journal of Corpus Linguistics.

WP8: Dissemination for Exploitation (2 months each)
The normal dissemination route for academic research is journal and conference papers. D5 and D6 are directly publishable; other papers will need to be written from the other deliverables.

Deliverable D8: Plan for continuing expansion of the Multi-national Corpus of English, extending to new countries.

4. Benefits

The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above. A comprehensive description of the different types of English can be obtained from the corpus, and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their methods of teaching, and especially the structure in which English is taught and spoken, to a better standard.
The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods in each country. It will also benefit people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language. When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used for comparison or for further research by the existing corpora or other potential corpora.

Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures as a whole.

5. Resources

Staff: the development of the project will require the employment of three English corpus linguists as post-doctoral Research Fellows and project managers for three years.

Consumables: A powerful laptop PC for each researcher, costing £2000 each. Consultancy fees of £40,000 for transcription and mark-up of source materials.

Travel and Subsistence: Results will be reported and published in conference proceedings including Corpus Linguistics (CL’07 Lisbon) and ICAME (ICAME’06, ’07, ’08, locations not yet known), at an estimated total cost of £4,000. The costs for the International Steering Panel meetings at the start, middle and end of the project are estimated at a total of £21,000.

References:

Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.

BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004].
Available from World Wide Web: http://e-cybercity.mu/cybercity

Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#

Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.

Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.

Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm

Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson, P. (eds.) Corpus Linguistics Around the World. Amsterdam: Rodopi.

The School of Computing, University of Leeds (1998-2004) The University of Leeds web site [online]. [Accessed 21st October 2004]. Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml

Part 2: DIAGRAMMATIC WORK PLAN

The work has been organised into eight activity streams. The three research fellows will be working in parallel during the 36-month project starting August 2006.
WP1: Collection of Spoken and Written Text of English (24 months RF1, 12 months RF2, 9 months RF3)
WP2: Transcription (19 months RF1, 12 months RF2, 9 months RF3)
WP3: Textual Mark-up (6 months RF1, 2 months RF2, 2 months RF3)
WP4: Word-class tagging (18 months RF3)
WP5: Syntactic parsing (18 months RF2)
WP6: Evaluation (7 months RF1, 2 months RF2, 17 months RF3)
WP7: Comparison across dialects (10 months RF1, 16 months RF2, 6 months RF3)
WP8: Dissemination for Exploitation (2 months each)

[Chart: month-by-month schedule, from August over the 36 months of the project, of WP1 to WP8 for Research Fellow 1, Research Fellow 2 and Research Fellow 3]

Appendix J: EPSRC Application Form for ICE-lite

Engineering & Physical Sciences Research Council
Polaris House, North Star Avenue, Swindon, Wiltshire, United Kingdom, SN2 1ET
Telephone +44 (0) 1793 444000
Web http://www.epsrc.ac.uk/

Je-SRP1 (EPSRC) v1.1

COMPLIANCE WITH THE DATA PROTECTION ACT 1998
In accordance with the Data Protection Act 1998, the personal data provided on this form will be processed by EPSRC, and may be held on computerised database and/or manual files. Further details may be found in the guidance notes.

EPSRC Reference:

RESEARCH PROPOSAL

1. DETAILS OF PROPOSAL

You should read the separate notes for guidance, the 'EPSRC Funding Guide’ and any specific call documentation on the EPSRC Web site before completing any research proposal. Form Je-SRP1 (EPSRC) must be accompanied by a Case for Support. EPSRC will reject incomplete research proposals.

A.
Organisation Where Grant Would Be Held
Organisation: University of Leeds
Division or Department: School of Computing
Address Line 1: Computer Vision and Language group
Address Line 2: School of Computing
Address Line 3: University of Leeds
Town/City: Leeds
Admin Area/County: West Yorkshire
Postal Code: LS2 9JT
Research Organisation Reference:

B. Investigators
Please give details of each investigator below. Please provide the details of any additional investigators on a separate sheet using the same format as below.

Principal Investigator (PI)
Title: Mr
Forename(s): Eric
Surname: Atwell
Organisation: University of Leeds
Division or Department: School of Computing
Post will outlast project (Y/N): Y
% time committed to project: 20
Other commitments (description and average hours per week): 8

Co-Investigator 1:

Total number of co-investigators (i.e. excluding the PI): 0

C. Recognised Researchers
Please give details of each Recognised Researcher below. Please provide the details of any additional Recognised Researchers on a separate sheet using the same format as below.

Recognised Researcher 1
Title: Dr
Forename(s): Serge
Surname: Sharoff
Organisation: University of Leeds
Division or Department: Centre for Translation Studies
% time committed to project: 100

Recognised Researcher 2
Title: Ms
Forename(s): Bayan
Surname: Abu Shawar
Organisation: University of Leeds
Division or Department: School of Computing
% time committed to project: 100

Total number of Recognised Researchers: 3

D. Title of Research Project [up to 150 chars]
Development of an ICE-lite

E. Start Date and Duration
a. Proposed start date: August 1st 2005
b. Duration of the grant (months): 36

F. Type of Proposal
Scheme:
Call: n/a

G. Summary of EPSRC Resources Required for Project

a. Financial resources required (Total £)
Staff: 330,072
Travel and Subsistence: 25,000
Consumables: 46,000
Exceptional Items:
Equipment:
Large Capital:
PCTF: 500
Sub-total: 401,572
Indirect Costs: 101,222
Total: 502,794

b. Summary of staff effort requested (Months)
Research: 108
Technician:
Other:
Project Students:
Visiting Researchers:
Total: 108

c. Services (Total £):

H. Related Proposals
a. If this proposal is related to a previous proposal to EPSRC, please give the previous EPSRC research grant proposal reference number(s) and indicate the type of relationship.
EPSRC Reference Number:
How related? (one of Continuation, Follow-up to outline proposal, Invited resubmission, Uninvited resubmission):
b. If there is more than one organisation submitting a JeSRP1 (EPSRC) proposal form for this project, please give the number of proposals involved, the lead Research Organisation and the project common reference.
Total Number of Proposals being submitted:
Name of Lead Research Organisation:
Common Reference:

I. Research Councils / MoD Joint Research Grants Scheme (JGS)
If you have received a commitment of support from the Defence Science Technology Laboratory (DSTL), please give the following details:
Percentage funding indicated by DSTL:
DSTL contact (name and address): Title/Forename(s); Surname; Address Line 1; Address Line 2; Address Line 3; Town/City; Administrative Area/County; Postal Code; Telephone; Fax; E-mail
DSTL Reference (please ensure that the letter providing this reference is attached with the Case for Support):

J. Objectives
List main objectives of the proposed research in order of priority [up to 4000 chars]

We will set up the infrastructure and a prototype sample corpus for the ICE-lite, an International Corpus of English component which contains a 'lite' version of Englishes from 40 different English-speaking countries. An international steering panel will establish agreed standards for text types and categories, and for other aspects such as encoding, XML mark-up, tagging and distribution.
We will collect, mark up and annotate different samples of spoken and written texts in English from 40 English-speaking countries: 250 texts of approximately 2,000 words each per country, a total of approximately 20 million words. We will use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sample corpus.

K. Summary
Describe the proposed research in a style that would be accessible to an interested 14 year old [up to 4000 chars]

Texts stored on a computer, known as corpora, together with software tools, provide a powerful method to learn more about language usage. Corpora are useful for studying all aspects of language, such as grammar, meaning and speech sounds, and for helping dictionary makers to spot new words. Corpus linguists are nowadays analysing the use of structures and investigating factors that affect our choice of a particular structure. For instance, the factors may be related to the nature of the writing or speaking, such as science rather than literature. Other factors that may influence our choice include age, gender, period of time, text type and medium (spoken or written), and these are being fully examined to get the best results in our study of language. The main aim of a corpus linguist is to discover common linguistic patterns in specific contexts rather than to state whether a pattern is correct or incorrect. Therefore, with the computer storing a huge amount of data, this view of language analysis becomes more accessible and gives a good resource to start with. The corpus can be searched and handled at high speed using a special software tool. In addition, information such as grammar, meaning and speech sound can be added to it to make it more useful to examine. Many large corpora have been developed during the past few years.
Some are for general use in linguistics research and represent different languages such as English, Spanish, French and Russian, while others are more specialised, such as the Air Traffic Control corpus. English is widely spoken in different parts of the world, and one main corpus that handles its variation is known as the International Corpus of English (ICE). The main purpose of collecting this corpus is to compare English as spoken worldwide. Around the world, fifteen research teams are preparing electronic corpora of their own variety of English, and each one consists of one million words: 60% spoken and 40% written. Each team is following the same corpus design to ensure compatibility. Many English-speaking countries do not have a component in ICE yet, and developing a full sub-corpus for each one would be very costly and time-consuming. A better extension to the ICE project will be to collect a small version of the corpus for each of the English-speaking countries and to group them together to form the ICE-lite. The term “lite” is borrowed from other simplified projects such as “TEI-lite”, which means a simpler version of TEI, a standard XML mark-up convention for text corpora. Therefore, the aim of this project is to build a corpus for the ICE-lite which will follow similar conventions to the full ICE version. To ensure compatibility with the other ICE projects, a “lite” version for each of the teams already in ICE will also be included in the corpus, together with 25 other countries. The aim is to collect a corpus of 20 million words: 250 texts of 2,000 words for each country. An international steering panel will be appointed to agree on a general design structure for the corpus. The different types of English will then be analysed and compared to find similarities or differences across countries. This will allow an understanding of the different cultures and will therefore be useful for other research and academic institutions across the world.

L.
Beneficiaries
Describe who will benefit from the research [up to 4000 chars]

The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above and to the existing ICE teams. A comprehensive description of the different types of English can be obtained from the corpus, and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their methods of teaching, and especially the structure in which English is taught and spoken, to a better standard. The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods in each country. It will also benefit people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language. When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used for comparison or for further research by the existing corpora or other potential corpora.

Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures of the 40 countries across the world.

M.
Staff

Joint Negotiating Committee For Higher Education Staff (JNCHES – formerly UCEA) Posts

EFFORT ON PROJECT
Columns: Name/Post Identifier | Grade | Starting Spine Point | Effective Date of Salary Scale | Increment Date | Start Date | Period on Project (months) | % of Full Time | London Allowance (Y/N) | Total cost on grant (£)

i) Research Staff
Serge Sharoff | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100% | N | 110,024
Bayan Abu Shawar | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100% | N | 110,024
Sean Wallis at UCL | RAII | 11 | 01/08/2005 | 01/08/2006 | 01/08/2005 | 36 | 100% | N | 110,024
ii) Technical Staff
iii) Visiting Researchers
Total: 330,072

Non-JNCHES Posts

EFFORT ON PROJECT
Columns: Name/Post Identifier | Basic Starting Salary | Scale Effective Date of Salary Scale | Increment Date | Start Date | Period on Project (months) | % of Full Time | London Allowance (£) | Superannuation and NI (£) | Total cost on grant (£)

i) Research Staff
ii) Technical Staff
iii) Other Staff
iv) Visiting Researchers
Total:

Ma. Project Studentships
Columns: Name/Post Identifier | Start Date | London (Y/N) | Stipend (£)
Total:

Mb. Visiting Researchers
Please provide the details of any additional visiting researchers on a separate sheet in the same format as below.

Details (Visiting Researcher 1, Visiting Researcher 2): Title; Forename(s); Surname; Home Organisation; Division or Department; Address Line 1; Address Line 2; Address Line 3; Town/City; Administrative Area/County; Postal Code; Country; Telephone; Fax; E-mail; Post held

a) If you have requested an amount from EPSRC for the Visiting Researcher's salary in Section M, will the Visiting Researcher receive any other contribution on top of this? (Y/N)
b) If the Visiting Researcher will receive another contribution, how much will this be?
(£)
c) What annual salary would the host organisation expect to pay staff of the Visiting Researcher's status? (£)
Total number of visiting researchers: 0

Mc. Public Communication Training Funds (PCTF)
Do you wish to apply for £500 towards Public Communication Training Funds? YES / NO

N. Travel and Subsistence
Destination and purpose | Total £
(i) Within UK: International Steering Panel review meetings at start, mid and end of project | 21,000
(ii) Outside UK: Corpus Linguistics conferences (CL'2007, TALC'06,07,08) to disseminate results | 4,000
Total £ | 25,000

O. Consumables
Description | Total £
Consultancy fee funds for transcription and markup of source materials | 40,000
3 laptops for data collection and analysis | 6,000
Total £ | 46,000

P. Exceptional Items
Description | Total £
Total £ |

Q. Equipment (single items between £3,000 and £99,999, including VAT)
Description | Country of Manufacture | Delivery Date | Basic price £ | Import duty £ | VAT £ | Total £
Total £ |

R. Large Capital (single items £100,000 and over, including VAT)
Description | Country of Manufacture | Delivery Date | Basic price £ | Import duty £ | VAT £ | Total £
Total £ |

S. Services
Service | Instrument(s) | Units | Cost £
Total |

T. Other Support
Give details of any support sought or received from any source for this or related research in the past three years (minimum £10,000)
Awarding Organisation | Awarding Organisation’s Reference | Title of project | Decision Made (Y/N) | Award Made (Y/N) | Start Date | End Date | Amount Sought/Awarded (£)

Appendix K: Revised Case for Support for the ICE-lite Proposal

EPSRC Research Proposal: Development of the Multi-National Corpus of English

Part 1: DESCRIPTION OF THE PROPOSED RESEARCH AND ITS CONTEXT

1.
Background

The University of Leeds (2004) has done previous research on computer analysis of English language texts, also known as English Corpus Linguistics. For example, the University has developed a Part-of-Speech analysis system which is being used in other research projects such as the International Corpus of English (ICE), which includes research teams in fifteen countries where English is the first language or second official language. In many of these English-speaking countries, the national ICE sub-corpus is a recognised resource used in research and teaching. ICE “began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Fifteen research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.” (ICE, 2002)

These corpora are collected mostly from standard and more traditional materials such as books, newspapers and articles. This method allows a wide variety of texts to be obtained; however, it is very time-consuming and costly, since researchers have to be sent on site and the texts need to be transcribed and converted into electronic format. This is one of the reasons why five other ICE projects (Cameroon, Fiji, Ghana, Nigeria and Sierra Leone) have not been able to start collecting any texts to date. Therefore, an alternative way of quickly and simply compiling a big corpus of much more than one million words would be to use the World Wide Web. Other attempts at using this method have proved to be successful (e.g.
Serge Sharoff at Leeds University has developed tools to extract 100-million-word corpora of Russian, German and Chinese).

A pilot project was undertaken to investigate the possibility of collecting a corpus for Mauritius, one of the many English-speaking African countries. English has been used in Mauritius for around 195 years, since the British settlers arrived in 1810, and is the official language of the country (Republic of Mauritius, 2004). However, at that time, slaves were imported from Africa and Madagascar, a large number of labourers were brought from India to work in the sugar cane fields, a small number of Chinese came to trade, and the influence of the French, who were the rulers before the British, was still very strong. The languages brought by the different settlers have therefore significantly influenced the official English language of the country. Still, all official communications and teaching in schools are done in English, even though official communications, press reports or school textbooks, for instance, may contain some dialect words, such as names of individuals or companies written in French or Hindi.

There are already numerous Mauritian websites on the World Wide Web, most of them written in English, and the government has just started a Cyber City project, the first of a new generation of IT parks in this part of the world (BPML, 2004). Therefore, it will be feasible to collect at least some written samples of Mauritian English remotely, via the World Wide Web.

For the pilot project, a sample of 30 texts of between 1,000 and 2,000 words each was collected (a total of 51,960 words). Each text, including its details such as author, publisher and date, took between 15 and 20 minutes to find.
From the texts obtained, it was noted that some amendments will have to be made to the ICE standards, since the types of texts available on the Internet will not match the ICE text categories or text sizes, and other information about the texts is not easily available on the Internet. Eighteen requests for permission to use the texts were sent out by email, and each took 6 minutes on average to send. The issue of potential commercial use will have to be addressed in more detail, since the other ICE projects are strictly non-commercial and, according to Gerald Nelson at UCL, this statement might cause difficulties in obtaining permissions from owners and might cause problems for the other ICE teams.

The 30 texts were also marked up with a reduced ICE header, and it took between 20 and 30 minutes to “clean up” the source webpage and add the markup to each text. This stage can be done partly by a program, with human intervention needed only to proofread and post-edit or correct the draft marked-up texts that are produced. However, for the pilot project, no such program was available, and therefore the estimates are derived from the manual process. The tagging and parsing will be done automatically, and hence it was estimated that one million words will take one and a half weeks to tag and one and a half weeks to parse.

Evidence from the pilot project showed that, with this Internet collection technique, the sub-corpus will contain fewer than one million words, owing to the limited set of text categories available on the World Wide Web. Therefore, a better extension will be to include other varieties of English from other English-speaking countries in the corpus. This will result in an “ICE-lite” with around five hundred thousand words for each country.
The term “lite” is borrowed from other simplified projects such as “TEI-lite”, a simpler version of TEI, a standard XML mark-up convention for text corpora (TEI, 2005).

The twenty countries which have been chosen to form part of the Multinational Corpus of English are: Bahamas, Bangladesh, Barbados, Bermuda, Botswana, Cayman Islands, Cyprus, Dominica, Gambia, Gibraltar, Grenada, Liberia, Malta, Mauritius, Namibia, Pakistan, Seychelles, Uganda, Zambia and Zimbabwe. In each of these countries, English is either the national language or one of the main spoken languages. For instance, in Bermuda and Zimbabwe, English is the official language, while in Liberia English is used mostly for trading purposes. Therefore, the form of English in Liberia has significant differences in terms of its word structure, and it can take some time and practice to master. As in Mauritius, English in most of these countries has in one way or another been highly influenced by other languages, either brought by ancestors or derived from their culture. The ICE-lite corpus will hence allow an interesting and useful analysis of the variation in English across the nations. To ensure compatibility and provide an enhanced comparison with the other existing projects, the “lite” version of the 20 teams already in ICE will also be included in the corpus.

For each country, numerous websites are easily accessible via the World Wide Web, and different text categories are available. Google provides evidence that for the Mauritius pilot project there are 250 million words of text to select from. Therefore, enough text should be available to collect 250 texts of 2,000 words each for each country, and thus the corpus will aim to contain approximately 20 million words in total (250 texts × 2,000 words × 40 countries).

An important issue which will need further consideration is the distribution method. The existing ICE corpora are distributed on CD and this results in reduced accessibility.
One possible solution will be to make the ICE-lite corpus available on the Internet via a public licence. For instance, it can be distributed under the GNU public licence, analogous to open-source software freely downloadable from Sourceforge.net.

Judging from the success of the other ICE corpora, namely the sub-corpora from Great Britain, India, Hong Kong and East Africa, among others, researching English across the different nations and developing an ICE-lite Corpus will help in the development of each of the participating countries, whether it be in IT, research or teaching.

2. Objectives

To achieve the goal of developing an ICE-lite Corpus of English, the following specific research objectives have been identified:
• To set up infrastructure and a prototype sampler corpus for the ICE-lite Corpus of English.
• To have an international steering panel establish agreed standards for text types and categories and the other annotation standards such as encoding and XML mark-up, tagging and distribution.
• To collect, mark up and lexico-grammatically annotate different samples of spoken and written texts in English from the 40 countries: 250 texts of approximately 2,000 words each for each country, a total of approximately 20 million words.
• To use corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in this sampler corpus.

3. Research Work-Plan

The work has been organised into eight activity streams (assuming that a month consists of 20 days of 8 hours working time):

WP0: Project Management via International Steering Panel (1 month each over 3 years)
The Panel will establish agreed standards for text types and categories; encoding and XML mark-up; morphological analysis and Part-of-Speech tagging; and distribution methods. Standards proposals will be drawn up by the project investigators but subject to approval and improvement by the Panel.
Members will be chosen from the 40 countries that will form part of the ICE-lite.
Deliverable D0: ICE-lite International Steering Panel to meet annually to oversee project progress.

WP1: Collection of Spoken and Written Text of English (10 months RF1, 8 months RF2, 8 months RF3)
The corpus will contain texts from 1990 or later. The total amount of text needed will be 250 texts of approximately 2,000 words each for each country, a total of approximately 20 million words. The authors and speakers of the texts need to have been brought up and taught through the English medium. They must be aged 18 or over and must either have been born in the country or have immigrated to it at an early age.

1.1 Written Text
All of the written texts will be collected from the Internet only. However, it will be difficult to obtain non-printed texts such as social letters or student essays.

1.2 Spoken Text
Recording spoken texts is labour-intensive, time-consuming and costly (Meyer, 2002) and will only be possible if done on site, i.e. in each of the countries. Therefore, we will only seek to use sources such as radio and TV broadcasts, which are available on the Internet. The solution proposed by Sharoff (2005) is to “increase the amount of ephemera (leaflets, junk mail and typed material), correspondence & spoken language samples”.

1.3 Copyright issues
Letters of copyright permission will have to be obtained. This will involve, in the first place, identifying the owners of the sources and finding the right contact details.

1.4 Classification of texts
Selecting and organising the texts will be a complex task and careful consideration is required. As far as possible, the text classification will follow the ICE standard categories, but it is expected that some text categories will not be available on the Internet. It is also important to classify each text according to the country it comes from.
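As a rough cross-check on the WP1 allocation, the pilot timings reported in the Background (15 to 20 minutes to find each text and about 6 minutes to send each permission email) can be scaled up to the full corpus. The sketch below is illustrative only: it assumes one permission email per text (the pilot sent 18 for 30 texts, so this is an upper bound), that spoken texts take as long to find as written ones, and that the work stays fully manual. The function and constant names are hypothetical.

```python
# Scale the pilot's per-text timings up to the full ICE-lite corpus.
# Assumptions (not from the proposal): one permission email per text,
# and the same search time for spoken and written texts.

TEXTS_PER_COUNTRY = 250
COUNTRIES = 40
HOURS_PER_MONTH = 20 * 8       # work plan: a month is 20 days of 8 hours

FIND_MINUTES = (15, 20)        # pilot: time to find one text and its details
PERMISSION_MINUTES = 6         # pilot: average time to send one permission email
WP1_ALLOCATED_MONTHS = 10 + 8 + 8  # RF1 + RF2 + RF3


def person_months(minutes_per_text: float, texts: int) -> float:
    """Convert a per-text time in minutes into person-months."""
    return minutes_per_text * texts / 60 / HOURS_PER_MONTH


def wp1_estimate() -> tuple:
    """Return (low, high) person-month estimates for finding texts and requesting permission."""
    texts = TEXTS_PER_COUNTRY * COUNTRIES
    low = person_months(FIND_MINUTES[0] + PERMISSION_MINUTES, texts)
    high = person_months(FIND_MINUTES[1] + PERMISSION_MINUTES, texts)
    return low, high


if __name__ == "__main__":
    low, high = wp1_estimate()
    print(f"Estimated effort: {low:.1f} to {high:.1f} person-months")
    print(f"WP1 allocation:   {WP1_ALLOCATED_MONTHS} person-months")
```

Under these assumptions the estimate comes out at roughly 22 to 27 person-months, which brackets the 26 months allocated to WP1.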
Deliverable D1: A detailed list of the texts collected together with information about the authors, publisher, place of publication and date.

WP2: Transcription (5 months RF1, 3 months RF2, 2 months RF3)
The written texts will be in electronic format already. After the spoken texts have been collected and permission has been received, the spoken texts will be transcribed, that is, written on paper or typed on screen. It is expected that most of the recorded speech will already be in digital format, since it will be collected from the Internet. No sample of spoken texts from Mauritius was available, so it is expected that only a limited number of spoken texts will be obtained, and the time required for transcription is based only on personal judgement and is relative to the time allowed for WP1.
Deliverable D2: A sampler of the raw Multi-national Corpus of English.

WP3: Textual Mark-up (10 months RF1, 8 months RF2, 8 months RF3)

3.1 Encoding of Text
The texts will be encoded with XML mark-up, i.e. the features of the original texts that are lost when they are converted into plain text files on a computer will be encoded. In written texts this includes features such as boldface, italics and underlining, as well as sentence boundaries, paragraph boundaries and headings. In spoken texts the encoded features will be sentence boundaries, speaker turns and pauses (Nelson, 1996a). Paragraphing and header information (adapted from the ICE standards) regarding author, publisher, etc. will be added. Texts in different formats (Doc, PDF, HTML) will be converted into a unified framework (XML format) (Al-Sulaiti, 2004).

3.2 Proofread Text
Both spoken and written texts will be proofread on screen. This task includes deleting extra and unnecessary material from the texts and checking and adjusting paragraphing markers.
Deliverable D3: Multi-national Sampler Corpus ready for distribution.
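The kind of encoding described in WP3 can be sketched with Python's standard `xml.etree.ElementTree` module. This is a minimal illustration only: the element names (`text`, `header`, `body`, `p`, `bold`, `italic`) and the sample content are hypothetical placeholders, not the ICE markup scheme or TEI, and the agreed element set would be decided by the Steering Panel.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# A run is a stretch of text plus an optional style such as "bold" or "italic".
Run = Tuple[str, Optional[str]]


def encode_text(header: Dict[str, str], paragraphs: List[List[Run]]) -> str:
    """Build an XML document with a reduced header and styled paragraphs."""
    root = ET.Element("text")

    # Header information such as author, publisher and date.
    head = ET.SubElement(root, "header")
    for field, value in header.items():
        ET.SubElement(head, field).text = value

    body = ET.SubElement(root, "body")
    for runs in paragraphs:
        p = ET.SubElement(body, "p")  # paragraph boundary
        last = None                   # most recent styled child element
        for content, style in runs:
            if style is not None:
                # Styled run: wrap it in its own element.
                last = ET.SubElement(p, style)
                last.text = content
            elif last is None:
                # Plain text before any styled element.
                p.text = (p.text or "") + content
            else:
                # Plain text after a styled element goes in its tail.
                last.tail = (last.tail or "") + content

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml = encode_text(
        {"author": "A. Writer", "publisher": "Example Press", "date": "2004"},
        [[("Sugar remains ", None), ("vital", "bold"), (" to the economy.", None)]],
    )
    print(xml)
```

Keeping the header fields in a plain dictionary means that a reduced header, like the one used in the pilot, needs no code changes when fields are added or dropped.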

WP4: Word-class tagging (6 months RF2, 8 months RF3)
As in the other ICE corpora, the texts will be “automatically tagged for wordclass by the TOSCA Tagger, developed by the TOSCA Research Group at the University of Nijmegen. This assigns wordclass tags to each lexical item in the corpus. The tagset has been developed especially for ICE, and is largely based on Quirk et al (1985) A Comprehensive Grammar of the English Language” (ICE, 2002). During this stage, each item will be assigned a label or tag, for example, ‘N’ for noun and ‘ADV’ for adverb. Other information, such as singular or plural, or the verb tense, will be added in brackets next to the label or tag.
Deliverable D4: Proofread and manually-corrected tagged Multi-national Sampler Corpus.

WP5: Evaluation (6 months RF1, 4 months RF2, 2 months RF3)

5.1 Cross-sectional checking
The wordclass tags will be checked on a cross-sectional, construction-by-construction basis. This will allow the check to concentrate on just one grammatical construction at a time, and corrections can be made to each instance of the construction throughout the whole corpus, if necessary. ICECUP (ICE, 2002) can be used for the cross-sectional checking.

5.2 Spot-checking
Finally, the corpus will be ‘spot-checked’ before being released.
Deliverable D5: The final Multi-national Corpus of English.

WP6: Comparison across dialects (3 months RF1, 3 months RF2, 5 months RF3)
We will use English concordance and corpus exploration tools to analyse lexical and grammatical variation across the contributing dialects of English in the sampler corpus.
Deliverable D6: Research paper on lexical and grammatical variation in the Multi-National Sampler Corpus, to be submitted to the International Journal of Corpus Linguistics.

WP7: Dissemination for Exploitation (2 months each)
The normal dissemination route for academic research is journal and conference papers.
D6 is directly publishable; other papers will need to be written from the other deliverables.
Deliverable D7: Plan for continuing expansion of the Multi-national Corpus of English, extending to new countries.

4. Benefits

The research will first be beneficial to the government and the educational system in each of the twenty countries mentioned above, and to the existing ICE teams. A comprehensive description of the different varieties of English can be obtained from the corpus, and therefore each country will be able to develop its own reference guides to usage, dictionaries and other teaching materials. This can help both schools and universities to adapt their teaching methods, and especially the structure in which English is taught, so that English is spoken to a better standard. The comparison across the dialects of English to find any striking similarities or differences will be useful for further research and teaching methods in each country. It will also benefit people who want to travel to or trade with other English-speaking countries, since the comparison will provide a useful insight into how they will have to adapt their language.

When the corpus is released, it will also be beneficial to other research or academic institutions across the world. It can be used for comparison or for further research by the existing corpus projects or by other potential corpus projects.

Longer-term impacts of the work to be done include:
• Promoting cooperation between English-speaking countries for the purpose of developing basic components for the linguistic society.
• Easing the entrance requirements of English-speaking countries into the different markets.
• Promoting the different cultures of the 40 countries across the world.

5. Resources

Staff: the development of the project will require the employment of 3 English corpus linguists as post-graduate Research Fellows and project managers for 3 years.
Consumables: A powerful laptop PC for each researcher, costing £2,000 each. Consultancy fees of £40,000 for transcription and mark-up of source materials.
Travel and Subsistence: Results will be reported and published in conference proceedings including Corpus Linguistics (CL’07 Lisbon) and ICAME (ICAME’06, ’07, ’08, locations not yet known), at an estimated total cost of £4,000. The costs for the International Steering Panel meetings at start, mid and end of project are estimated at a total of £21,000.

References:
Al-Sulaiti, L. (2004) Designing and Developing a Corpus of Contemporary Arabic. Unpublished MSc thesis. University of Leeds.
BPML (2004) Cybercity Mauritius - The Ebène CyberCity web site [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://e-cybercity.mu/cybercity
Department of English Language & Literature, University College London (2002) The International Corpus of English (ICE) web site [online]. [Accessed 9th November 2004]. Available from World Wide Web: http://www.ucl.ac.uk/english-usage/ice/#
Humanities Text Initiative (2005) The TEI Header [online]. [Accessed 16th February 2005]. Available from World Wide Web: http://www.hti.umich.edu/cgi/t/tei/tei-idx?type=pointer&value=HD
Meyer, C. (2002) English Corpus Linguistics, an Introduction. Cambridge: Cambridge University Press.
Nelson, G. (1996a) The Design of the Corpus. In Greenbaum, S. (ed.) (1996) Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Nelson, G., Wallis, S. and Aarts, B. (2002) Exploring Natural Language: working with the British component of the International Corpus of English. Philadelphia: John Benjamins Publishing Company.
Republic of Mauritius (2004) History [online]. [Accessed 20th November 2004]. Available from World Wide Web: http://www.gov.mu/abtmtius/history.htm
Sharoff, S. (2004) Methods and tools for development of the Russian Reference Corpus. In Archer, D., Wilson, A. and Rayson P. (eds.)
Corpus Linguistics Around the World. Amsterdam: Rodopi.
SourceForge.net (2005) Project: MinGW - Minimalist GNU for Windows [online]. [Accessed 11th March 2005]. Available from World Wide Web: http://sourceforge.net
The School of Computing, University of Leeds (1998-2004). The University of Leeds web site [online]. [Accessed 21st October 2004]. Available from World Wide Web: http://www.comp.leeds.ac.uk/research/index.shtml

Part 2: DIAGRAMMATIC WORK PLAN

The work has been organised into eight activity streams. The three research fellows will be working in parallel during the 36-month project starting August 2005.

WP0: Project Management via International Steering Panel (1 month each over 3 years)
WP1: Collection of Spoken and Written Text of English (10 months RF1, 8 months RF2, 8 months RF3)
WP2: Transcription (5 months RF1, 3 months RF2, 2 months RF3)
WP3: Textual Mark-up (10 months RF1, 8 months RF2, 8 months RF3)
WP4: Word-class tagging (6 months RF2, 8 months RF3)
WP5: Evaluation (6 months RF1, 4 months RF2, 2 months RF3)
WP6: Comparison across dialects (3 months RF1, 3 months RF2, 5 months RF3)
WP7: Dissemination for Exploitation (2 months each)

[Gantt chart: one panel for each of Research Fellow 1, Research Fellow 2 and Research Fellow 3, with rows WP0 to WP7 plotted against the 36 months from August to July over three years. Legend: Actual, Estimate, Possible overflow.]
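The allocations listed in the work plan can be tallied per research fellow to check them against the 36-month project length. The sketch below is only a consistency check under two stated assumptions: "1 month each over 3 years" (WP0) is read as one month per fellow in total, and WP4, which lists no time for RF1, is treated as zero for that fellow. The names `ALLOCATION` and `totals_per_fellow` are illustrative.

```python
# Person-months per work package, as (RF1, RF2, RF3).
# Assumptions: WP0 is 1 month per fellow in total; WP4 is 0 for RF1.
ALLOCATION = {
    "WP0": (1, 1, 1),
    "WP1": (10, 8, 8),
    "WP2": (5, 3, 2),
    "WP3": (10, 8, 8),
    "WP4": (0, 6, 8),
    "WP5": (6, 4, 2),
    "WP6": (3, 3, 5),
    "WP7": (2, 2, 2),
}
PROJECT_MONTHS = 36


def totals_per_fellow(allocation: dict) -> tuple:
    """Sum the months allocated to each research fellow across all work packages."""
    return tuple(sum(column) for column in zip(*allocation.values()))


if __name__ == "__main__":
    totals = totals_per_fellow(ALLOCATION)
    for number, months in enumerate(totals, start=1):
        print(f"RF{number}: {months} months (project length {PROJECT_MONTHS})")
    print(f"Total: {sum(totals)} of {3 * PROJECT_MONTHS} available person-months")
```

Under this reading the three fellows are allocated 37, 35 and 36 months respectively, so the plan balances overall at 108 person-months, although the load is not spread perfectly evenly.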