Weather Forecast Corpus and a Sample Dictionary
Siwarak Paengpho and Jirapa Vitayapirak
Proceedings of the International Conference: Doing Research in Applied Linguistics

Abstract

This study classifies the word lists of the Weather Forecast Corpus (WFC) into general, academic, and technical vocabulary and analyzes the word classes, collocations, phrases, and abbreviations found in the corpus. The corpus consists of 555,818 words of weather forecast texts. The concordance software Wordsmith Tools was used to process the data in terms of word frequency and percentage. The results showed that the vocabulary in the WFC was highly general: general vocabulary accounted for 76.37 % of tokens, academic vocabulary 5.86 %, technical vocabulary 5.30 %, and other vocabulary 12.58 %. Some general words were found to function as technical words, and many general words in the corpus become multi-word items with a meteorological meaning when they collocate with other words. These results led to the design of a sample meteorological dictionary, which can offer some guidelines for English teachers and students in English courses at the Thai Meteorological Institute.

Introduction

Nowadays, English is a universal language, or lingua franca, that plays a crucial role in higher education in Thailand, especially English for science and technology. The Thai Meteorological Institute (TMI) is a higher educational institute under the administration of the Thai Meteorological Department (TMD). The institute offers many courses in meteorology for TMD meteorologists and for meteorological students who need to prepare themselves for future careers as meteorological officers or meteorologists. English for meteorology is required for all students. Thus, all meteorologists and meteorological students in TMI have to study English for Specific Purposes (ESP). ESP is a major activity in English language teaching around the world. It is an enterprise involving education, training and practice.
It draws upon three major realms of knowledge: language, pedagogy and the students’ or participants’ specialist areas of interest such as pharmacology, gastronomy, photography, meteorology, and so on (Robinson, 1991: 1, 18). However, one important problem for ESP learners seems to be lexical knowledge of English, which affects reading, writing, and translating. Vocabulary, in terms of both meaning and usage, is at the core of the learning process. When ESP learners encounter difficult vocabulary such as technical terminology, they often look it up in technical dictionaries.

As technology has changed, computers have been used in vocabulary studies and in lexicography for several decades (Landau, 2001). Corpus-based study is now widespread in learning specialized vocabulary and in making technical dictionaries from authentic texts. In other words, corpus-based study is a way of using computer-readable texts for linguistic research and dictionary making. Specialized corpora offer several advantages, including searching, sorting, retrieving, and calculating linguistic data with high speed and accuracy (Mallikamas, 1999). In the past, a technical dictionary had to be carefully compiled by hand by experts in the particular field. Today, technical dictionaries have been revolutionized by the introduction of corpus-based techniques. They are now usually based upon huge corpora of English (e.g. the Bank of English (COBUILD) and the British National Corpus (BNC)), from which words, forms, spellings, meanings and grammatical behavior are extracted, thus allowing lexicographers to appeal directly to the observed facts of language (Trask, 1999: 166). Clearly, corpus-based analysis can also be adopted in the field of meteorology.
Interestingly, little previous research has been carried out on the design of specialized corpora. None of the specialized corpora so far has been applied to bilingual (English and Thai) meteorological terminology. In Thailand, corpus-based analysis has never been applied in any of the courses or research on English for Meteorology in the Thai Meteorological Institute, and no meteorological dictionary for Thais has been compiled from corpus research in a way that meets the needs of its users. As a result, bilingual meteorological dictionaries have not been well developed and very few references in this area are available. This study thus applies corpus-based analysis to English weather forecast texts in order to provide insights into the language of weather forecasts using authentic texts. It could also be used as a guideline for designing a meteorological dictionary and for teaching English to meteorological students.

Objectives

1) To establish the Weather Forecast Corpus (WFC).
2) To analyze word classes, collocations, compound nouns, and abbreviations.
3) To classify word lists in the corpus into general, academic, and technical vocabularies.
4) To design a sample meteorological dictionary based on the technical vocabularies found in the Weather Forecast Corpus (WFC).

Materials and Methods

1) Data Collection

The English weather forecast texts were collected from three main sources: 1) websites: the Thai Meteorological Department’s website and the World Meteorological Organization’s website; 2) newspapers: Bangkok Post; and 3) documents in working units at the Thai Meteorological Department. The English weather forecast texts were classified using the weather forecast text classification of the Thai Meteorological Department (TMD, 2007: Online). These weather forecast texts were divided into two main categories, i.e.
Thailand weather forecasts (WF) and meteorological documents used in Thai Meteorological Department working units (MD). Thailand weather forecasts (WF) were divided into seven sub-categories: Weather Advisory (WF1), Daily Forecast (WF2), Shipping Forecast (WF3), Weekly Forecast (WF4), Monthly Forecast (WF5), Sunrise-Sunset and Moonrise-Moonset Forecast (WF6), and Water Level Forecast (WF7). The meteorological documents (MD) were divided into six sub-categories: Weather News (MD1), Weather Report (MD2), Meteorological Journal (MD3), Meteorological Magazine (MD4), Meteorological Textbook and Manual (MD5), and Other Meteorological Documents (MD6). The sample of 555,818 words of weather forecast texts was randomly selected using a proportional stratified random sampling procedure. The frequency and percentage of text coverage were used to select and classify the data in the corpus.

2) Research Instruments

In this study, the concordance software Wordsmith Tools (WST) Version 3.0 (Scott, 1998) was used to process the data: making indexes and word lists, counting word frequencies, comparing different usages of each word, analyzing keywords, and finding phrases or idioms (word clusters) throughout the whole corpus. Additionally, the selection of technical vocabulary was chiefly based on the American Meteorological Society (AMS)’s Online Glossary of Meteorology (2000), which was used to define English technical vocabulary. The Glossary of Meteorological Terms (1979) and the Online Glossary of Meteorology (2007), both written by the Thai Meteorological Department (TMD), were used as references to define Thai technical vocabulary.

3) Data Analysis Procedures

Firstly, all samples were collected and scanned as plain text files (*.txt). There were 13 files, one for each sub-type of weather forecast text in the WFC.
Next, optical character recognition (OCR) software was used in the scanning procedure to recognize the scanned characters and represent the texts electronically. Afterwards, all text files were transferred into Microsoft Word 2007 before being saved as document files (*.doc). At this stage, the spell checker was used to check the spelling of all words in the texts. The Wordlist Tool was also used to compute the frequency of running words (tokens) and word types, including the type/token ratio. Finally, the results obtained from both the word frequency counts and the analysis of the collected information were interpreted.

Results and Discussion

The word frequency list comprises 555,818 tokens and 13,172 word types, and the proportion of types to tokens was 2.37 % or 1:29, as shown in Figure 1.

Figure 1 Statistical detail of each sub-type of weather forecast texts in the WFC

The words analyzed in this study were those occurring more than 55 times, which amounted to 593 lemmas. All 593 lemmas were divided into 4 groups: general vocabularies, academic vocabularies, technical vocabularies, and others (those that did not fit any of the other groups according to Bauman and Culligan (1995) and Coxhead (2000)). Then, the word frequency list was lemmatized. The results are illustrated in Figure 2.

Figure 2 Proportion of 4 groups of vocabularies used in this study

Figure 2 shows the proportions of the 4 groups of vocabularies counted in the WFC. Of the total of 268,600 tokens in the 4 groups, 205,125 tokens (or 76.37 %) were classified as general vocabularies while 13,876 tokens (or 5.86 %) were sorted as academic vocabularies. In addition, 14,252 tokens (or 5.30 %) were defined as technical vocabularies whereas 33,792 tokens (or 12.58 %) were categorized as others.
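The counting steps described above (building a word frequency list, computing the type/token ratio, and measuring how many tokens each vocabulary group covers) can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used with Wordsmith Tools: the sample sentence, the tiny word lists in GROUPS, and the function names are all assumptions made for the example, and real work would use the full GSL, AWL, and a meteorological glossary.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(n_types, n_tokens):
    """Return the number of types as a percentage of the number of tokens."""
    return 100.0 * n_types / n_tokens


def coverage_by_group(freq, groups):
    """Return the share of tokens (in %) covered by each word-list group.

    A word is assigned to the first group that contains it; any word
    that matches no group is counted under 'other'.
    """
    total = sum(freq.values())
    counts = Counter()
    for word, n in freq.items():
        label = next((g for g, words in groups.items() if word in words), "other")
        counts[label] += n
    return {g: round(100.0 * n / total, 2) for g, n in counts.items()}


# Hypothetical miniature word lists, for illustration only.
GROUPS = {
    "technical": {"thundershowers", "monsoon"},
    "academic": {"area", "period"},
    "general": {"the", "and", "of", "in", "with", "rain", "over", "during"},
}

# Invented sample sentence, not taken from the WFC.
sample = "Isolated thundershowers and rain over the area during the period of the monsoon"
freq = Counter(tokenize(sample))

print(freq.most_common(3))                                # most frequent words first
print(type_token_ratio(len(freq), sum(freq.values())))   # TTR of the sample, in %
print(coverage_by_group(freq, GROUPS))                    # token coverage per group
print(round(type_token_ratio(13172, 555818), 2))          # counts reported for the WFC
```

The last line applies the same ratio to the type and token counts reported for the WFC. Assigning each word to the first matching group is a simplification: a word such as ‘isolated’ is matched as a single form only, regardless of the context it appears in.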
Among the 4 groups of vocabularies in this study, the general group had the highest proportion while the technical group had the lowest. In addition, the proportion of the academic group was lower than that of the other group. Interestingly, we found that some words categorized as general, academic, or other vocabulary could also carry technical meanings in different contexts. This finding is consistent with the idea of Chung and Nation (2003) that many technical words can be created from common words, including words from the General Service List (GSL) and the Academic Word List (AWL). Hence, it may not be entirely correct to identify certain words in the corpus as belonging only to the GSL or the AWL. In particular, some general or other words in the corpus were found to be technical ones: high, pressure, low, cold, cool, storm, station, ice, shower, moderate, isolated, calm, etc. A single word can also be part of a multi-word lexical unit, for example, ‘station temperature’, ‘sea surface temperature’, ‘extreme minimum temperature’, ‘torrential rain’, ‘fairly widespread rain’, ‘widespread thundershowers’, ‘isolated thundershowers’, ‘high pressure system’, ‘low pressure cell’, ‘heat wave’, ‘moderate sea’, etc.

In terms of word classes, two main types were found in the corpus, namely closed class and open class. Closed-class words such as ‘the’, ‘and’, ‘of’, ‘in’, ‘to’, and ‘with’ dominated the top ten of the word frequency list in the WFC, as shown in Figure 3.

Figure 3 The top 10 word frequency lists in the WFC

Some words in the WFC have more than one function. Overlaps between nouns and full verbs, nouns and auxiliary verbs, nouns and abbreviations, auxiliary verbs and abbreviations, and pronouns and abbreviations were found in the WFC and could be identified by using the concordance data from the corpus, as displayed in the example in Figure 4.
Figure 4 Concordance sample of ‘am’ in the WFC

In this study, ‘am’ occurred 751 times as the abbreviation for Ante Meridiem and once as a verb. Therefore, ‘am’ in the WFC was used almost exclusively as the abbreviation. This is probably because the verb ‘am’ is used mainly in spoken texts, whereas the WFC consists of written texts.

Regarding collocations, both lexical and grammatical collocations were found in this study. Some technical noun phrases containing words from the GSL, AWL, and other vocabularies were used technically with meteorological meanings. It can be seen that words from the GSL, AWL, and other vocabularies are not limited to the common meanings they have across academic texts. Some are used together with words from their own group or from other groups and take on a meteorological meaning. In short, many words from the GSL, AWL, and other vocabularies in the WFC can carry meteorological meanings as technical noun phrases when they collocate with other words, as shown in the example in Figure 5.

Figure 5 Collocations sample of ‘temperature’ in the WFC

As can be seen in Figure 5, the noun ‘temperature’ often co-occurs with adjectives and nouns, as in minimum temperature, virtual temperature, sea surface temperature, maximum temperature, and soil temperature. These noun phrases were created from GSL and AWL words, but they carry meteorological meanings: when ‘temperature’ collocated with the words ‘virtual’, ‘maximum’, ‘minimum’, ‘sea’ and ‘surface’, and ‘soil’, the combinations became technical noun phrases in meteorology. Interestingly, a number of low-frequency words in the WFC are technical vocabulary, so low-frequency words should not be ignored in a specialized corpus. Furthermore, separating a single word from the phrase it belongs to could distort the meaning it has within that phrase.
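The kind of concordance and collocation listing discussed above can be approximated with a short script. The following Python sketch is an illustration under stated assumptions rather than a reproduction of the Wordsmith Tools output: the sample text is invented, the function names kwic and left_collocates are hypothetical, and sentence boundaries are ignored when the context window is taken.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def left_collocates(tokens, node, span=1):
    """Count the words found within `span` positions to the left of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
    return counts


# Invented sample text, not taken from the WFC.
text = (
    "Minimum temperature will drop tonight. Maximum temperature remains high. "
    "Sea surface temperature is rising. Minimum temperature is expected at dawn."
)
tokens = tokenize(text)

# Print a simple keyword-in-context listing for 'temperature'.
for left, node, right in kwic(tokens, "temperature"):
    print(f"{left:>30} | {node} | {right}")

# List the words that most often stand immediately before 'temperature'.
print(left_collocates(tokens, "temperature").most_common())
```

Widening `span` to 2 would also pick up ‘sea’ in ‘sea surface temperature’, which is one simple way of surfacing multi-word candidates for manual checking against a glossary.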
To obtain more accurate proportions of the vocabulary categories, single words and multi-word items should therefore be analyzed separately, and both should be identified by checking them not only against the general and academic word lists but also against their meanings in context.

Abbreviations accounted for 10.85 % or 30,784 tokens (74 lemmas) in this study. They could be categorized into 6 types, i.e. 1) clippings, 2) acronyms, 3) initialisms, 4) contractions, 5) substitutions, and 6) symbols. Table 1 shows the top 10 abbreviations in the WFC.

Table 1 The Top 10 Abbreviations in the WFC

No.  Rank  Abbreviation  Frequency  Percentage  Meaning
1    6     °C            4,244      0.76        Celsius
2    11    km            3,119      0.56        Kilometer
3    14    hr            2,765      0.50        Hour
4    46    Precip        1,239      0.22        Precipitation
5    47    °F            1,212      0.22        Fahrenheit
6    63    P.M./p.m.     849        0.15        Post Meridiem
7    67    AZ            790        0.14        Azimuth
8    70    *A.M./a.m.    751        0.14        Ante Meridiem
9    10    m             727        0.13        Meter
10   75    UTC           727        0.13        Universal Time Coordinated

1) The Design of a Sample of Meteorological Dictionary

A good bilingual dictionary usually gives information about meaning, pronunciation, word classes or parts of speech (POS), word grammar, collocations, example sentences, and synonyms (Redman and Edwards, 1997: 8). Ellendersen (2007) pointed out that information on word classes, morphology, and syntax ought to be included in a Language for Specific Purposes (LSP) dictionary.

2) The Corpus Inputs to the Dictionary

The corpus is a primary source of information about the way words behave. It shows how words combine with each other (syntactically and collocationally). It also provides information about word frequencies, grammatical patterns, and collocations. It is the main source of the example sentences shown in the dictionary (Macmillan: Online, 2009).
Other information in the dictionary includes signs and symbols, lists of abbreviations, and foreign words and phrases that were found in the corpus. The evidence of the WFC shows that information about word frequencies is very important for vocabulary grading and selection. Therefore, the technical vocabularies, compound nouns, and abbreviations with high frequencies in this study are proposed for inclusion in a sample meteorological dictionary. Each entry in the sample meteorological dictionary should consist of the following parts: the English headword, abbreviation, grammatical information, Thai pronunciation, Thai synonym, Thai definition, English synonym, example of usage, and illustration, as shown in Figure 6.

Headword: Cumulonimbus
Abbreviation: (Cb.)
Grammatical information: n.
Pronunciation: /คิวมูโลนิมบัส/
Thai synonym: เมฆฟ้าคะนอง [thunderstorm cloud]
Illustration: (picture in the original)
Thai definition: เมฆที่มีลักษณะเป็นเมฆก้อนใหญ่รูปร่างคล้ายภูเขาใหญ่ มียอดเมฆแผ่ออกเป็นรูปร่างคล้ายทั่ง ฐานเมฆต่ำ มีสีดำมืด เป็นเมฆหนา มืดทึบ มีฟ้าแลบ ฟ้าร้อง อาจอยู่กระจัดกระจายหรือรวมกันอยู่ มักมีฝนตกลงมา [A cloud that appears as a large mass shaped like a great mountain, with a top that spreads out in the shape of an anvil. The cloud base is low and dark. It is a thick, dense cloud with lightning and thunder, may occur scattered or merged together, and usually brings rain.]
Example of use: Ex. After sunset, the Cumulonimbus clouds are often being transferred over the sea in Norway and that was obvious on 3rd September.
English synonym: S. Cumulonimbus cloud, Thundercloud, Cumulonimbus incus, Cumulonimbus calvus, Cumulonimbus with mammatus

Figure 6 A sample of meteorological dictionary entry

Conclusions

The study aimed to compile the Weather Forecast Corpus of English vocabulary, to analyze word classes, collocations, compound nouns, and abbreviations, and to classify the word lists in the corpus into general, academic, and technical groups in order to design a sample meteorological dictionary. The texts used in weather forecasting tasks were divided into 2 main types of products: 1) weather forecasts and 2) meteorological documents. The texts were then separated into 13 sub-types.
The study results showed all word lists (groups of vocabularies) in terms of word frequency and percentage. According to the statistical analysis of the WFC, the whole corpus consists of 555,818 tokens, the total number of word types in the WFC was 13,172, and the proportion of types to tokens was 2.37 % or 1:29.

This research, however, is only a tentative study of the technical words in synoptic meteorology, which is just one discipline within meteorology. Future studies in other disciplines of meteorology (such as aeronautical meteorology, agricultural meteorology, hydrological meteorology, maritime meteorology, biometeorology, astrometeorology, military meteorology, medical meteorology, and highways meteorology) ought to be carried out in order to achieve more reliable results. The meteorological dictionary used in Thailand should be compiled by means of corpus-based analysis so that representative headwords are selected. Further studies on meteorological symbols (both letters and numbers) as well as multi-word units in meteorological texts should be carried out. In conclusion, using a corpus to design technical dictionaries gives confidence that such dictionaries can become a better tool for satisfying users’ demands for technical vocabulary.

Acknowledgements

I am genuinely grateful to Assoc. Dr. Jirapa Vitayapirak, my advisor, for her invaluable guidance, precious time, beneficial comments, and patience in editing this paper. I would also like to thank Mr. Mana Tosatjawong for giving advice on software use as well as for sharing all relevant experience, knowledge, and resources.

References

Aarts, B., & McMahon, A. (2006). The handbook of English linguistics. United Kingdom: Blackwell.
American Meteorological Society. (2000). Glossary of meteorology. Retrieved from http://amsglossary.allenpress.com/glossary
Bauman, J., & Culligan, B. (1995). The General Service List. Retrieved from http://Jbauman.com/gsl.html
Chung, T. M., & Nation, P. (2003). Technical vocabulary in specialised texts. Reading in a Foreign Language, 15(2), 103-116.
Coxhead, A. (2000). A New Academic Word List. TESOL Quarterly, 34, 213-238.
Ellendersen, J. (2007). Grammar in dictionaries of languages for special purposes. Retrieved from http://pure.au.dk/portal-asb-student/files/1462/000161028-161028.pdf
Gouws, R. H., & Prinsloo, D. J. (2005). Left-expanded article structures in Bantu with special reference to Isizulu and Sepedi. International Journal of Lexicography, 18(1), 25-46.
Landau, B. (2001). Perceptual units and their mapping with language. New York: Elsevier.
Macmillan. (2009). How was the corpus used in creating the dictionary? Retrieved from http://www.macmillandictionary.com/about.html
Mallikamas, P. (1999). Application of Corpora in Language Teaching. Thai TESOL Bulletin, February, 1-17.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: description, acquisition and pedagogy (pp. 6-19).
Redman, S., & Edwards, L. (2003). English vocabulary in use (pre-intermediate and intermediate) (2nd ed.). United Kingdom: Cambridge University Press.
Robinson, P. C. (1991). ESP today: a practitioner’s guide. New York: Prentice Hall.
Scott, M. (1998). Wordsmith Tools. Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Thai Meteorological Department. (1979). Glossary of meteorological terms. Bangkok: Thai Meteorological Department Press.
Thai Meteorological Department. (2006). Thai Meteorological Department’s Visions. Retrieved from http://www.tmd.go.th/en/aboutus/vision.pp
Thai Meteorological Department. (2007). Glossary of Meteorology. Retrieved from http://www.tmd.go.th/met_dict.php
Trask, R. L. (1999). Language: the basics (2nd ed.). Routledge.
West, M. (1953). A General Service List of English Words. London: Longman, Green.

The Authors

Siwarak Paengpho was born in Nakhonratchasima, Thailand, in 1984. Education: 2002-2004, Bachelor’s degree in Political Science, Ramkhamhaeng University; 2008-present, M.A. student at King Mongkut’s Institute of Technology Ladkrabang. Work experience: 2004-present, meteorological officer in the Northeastern Meteorological Center, at PakChong Agrometeorological Station.

Jirapa Vitayapirak is a professor at the Faculty of Industrial Education, King Mongkut’s Institute of Technology Ladkrabang (KMITL).