Using corpus analysis software to analyse specialised texts
1. What is a corpus?
In corpus linguistics, a corpus (plural: 'corpora') can be generally defined as 'a collection of naturally-occurring texts in a computer-readable format which can be retrieved and analyzed using corpus analysis software'.
2. Sources of language corpora
• Subscribe to a large corpus provider such as the British National Corpus (BNC) http://www.natcorp.ox.ac.uk/
• Use web concordancing
- http://corpus.leeds.ac.uk/protected/query.html (general corpus; English)
- http://corpus.byu.edu/ (general corpus; American/British English)
- http://lextutor.ca/conc/eng/ (general and specialized corpora; English)
- http://www.arts.chula.ac.th/~ling/TNCII/ (general corpus; Thai)
- http://www.arts.chula.ac.th/~ling/ParaConc/index.html (English-Thai parallel concordance)
• Compile your own corpora and analyse the data using corpus analysis software
- 'Antconc' (http://www.antlab.sci.waseda.ac.jp/software.html) (for monolingual corpora)
- 'Wordsmith' (http://www.lexically.net/wordsmith/) (for monolingual corpora)
- 'Paraconc' (http://www.athel.com/para.html) (for multilingual corpora)
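To make concrete what a concordancer does with a self-compiled corpus, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration only, not how Antconc or Wordsmith are implemented; the function names `tokenize` and `kwic`, the simple tokenization rule, and the sample sentence are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical two-sentence "corpus", used only to show the output shape.
sample = ("A corpus is a collection of texts. "
          "A specialised corpus covers one subject field.")
tokens = tokenize(sample)
hits = kwic(tokens, "corpus", width=2)

# Print the hits with the keyword aligned in a column, as concordancers do.
for left, kw, right in hits:
    print(f"{left:>20} [{kw}] {right}")
```

Dedicated tools add sorting, wildcard search, and collocation statistics on top of this basic idea, so a sketch like this is mainly useful for understanding the output they produce.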
3. Designing a specialized corpus
Corpus size
• There are no fixed rules; the size depends on the research purposes and on the availability of data and time.
• Large, general corpora may be less useful than small, focused corpora if searches are made on context-specific terms.
• Corpora that are 'too small' have limitations, e.g. they may not yield enough hits to support sound generalizations, or may not cover enough of the concepts, terms, or patterns under investigation.
• It is preferable to create a 'monitor' or 'open' corpus, i.e. one that can be updated continually, because specialized words and usage are dynamic.
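One rough way to check whether a corpus is 'too small' for a given term is to count its tokens and the hits for that term. The sketch below assumes a simple letters-only tokenizer; the function name `term_report` and the `min_hits` threshold of 5 are arbitrary choices for illustration, not an established standard.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_report(texts, term, min_hits=5):
    """Summarize corpus size and how often `term` occurs in it.

    `min_hits` is an arbitrary illustrative threshold, not a fixed rule.
    """
    tokens = [tok for text in texts for tok in tokenize(text)]
    hits = Counter(tokens)[term]
    per_million = hits / len(tokens) * 1_000_000 if tokens else 0.0
    return {
        "tokens": len(tokens),
        "hits": hits,
        "per_million": per_million,
        "enough": hits >= min_hits,
    }

# Hypothetical mini-corpus of two short texts.
texts = ["The enzyme binds the substrate.",
         "Each enzyme has one active site."]
report = term_report(texts, "enzyme")
print(report)
```

The normalized frequency (hits per million tokens) makes it possible to compare a small specialized corpus with a large general one, which is where the advantage of a focused corpus for context-specific terms tends to show.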
Text extracts vs. full texts
• Depends on the aim of corpus compilation.
• Whole text offers more coverage because words or terms to be looked at may be randomly
distributed throughout the text.
• Specific sections may be helpful if we are looking for words or phrases under particular
content areas or want to create purposeful sub-corpora.
Number of texts
• Choices can be made between collecting a few large texts or a larger number of smaller texts.
• Choices can also be made between selecting texts written by one or two key writers or sources, or texts retrieved from different sources or written by different authors.
• Depends on your research focus e.g. to study overall language use or to study linguistic choices preferred by particular writers.
Medium
• Can be spoken or written texts or mixed.
• Depends on research questions.
• Some practical factors should also be considered, e.g. compiling spoken corpora can be time-consuming and needs special types of tagging (= giving codes to the data, e.g. for paralinguistic features).
Subject and text type
• Should mainly focus on the specialized text under investigation, although this is less clear-cut
in multidisciplinary subjects.
• Texts may come from different subjects if the research focus is on the study of particular language features rather than term extraction.
• Text types within a specialized subject field may vary from 'expert-to-expert' texts to 'expert-to-non-expert' texts, or in other words, from technical to popular texts.
Other considerations
• Authorship: Texts written by experts in a field tend to present more reliable and authentic examples of specialised language.
• Language: Specialised texts can be stored and retrieved in the form of monolingual, comparable, or parallel corpora.
• Publication date: Texts should come from recent publications unless queries are made in relation to particular periods of time.
4. Sources of specialized texts
• Printed materials (must be converted to text files using a scanner with good OCR (Optical Character Recognition) software)
• Word document texts (must be converted to text files, e.g. using 'save as' or by cutting and pasting the text into Notepad)
• CD-ROMs (must be converted to text files)
• Texts on the Web (must be converted to text files and/or have the HTML mark-up removed)
• Online databases (Word or PDF documents must be converted to text files)
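For texts taken from the Web, removing the HTML mark-up can be automated. The following is a minimal sketch using Python's standard `html.parser` module; the class name `TextExtractor`, the helper `html_to_text`, and the sample page are hypothetical, and real pages (menus, adverts, broken mark-up) will usually need extra cleaning by hand.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping scripts and style sheets."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the text content of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

# Hypothetical page, used only to show the effect.
page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Corpus</h1><p>Texts &amp; terms</p></body></html>")
plain = html_to_text(page)
print(plain)
```

The resulting string can then be saved as a plain-text file (UTF-8 is a sensible encoding choice) and loaded into corpus analysis software. Note that this sketch joins text fragments with spaces, so a word split by inline tags would be broken apart; such cases should be checked manually.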
5. Getting started with Antconc
Download the latest version of Antconc and watch the YouTube tutorials linked from http://www.antlab.sci.waseda.ac.jp/antconc_index.html
6. Creating a specialised corpus profile