Sunday, 23 September 2018

Week 5 : Using corpus analysis software to analyse specialised texts


1. What is a corpus?
In corpus linguistics, a corpus (plural: 'corpora') can be broadly defined as 'a collection of naturally-occurring texts in a computer-readable format which can be retrieved and analyzed using corpus analysis software'.

2. Sources of language corpora
• Subscribe to a large corpus provider such as the British National Corpus (BNC): http://www.natcorp.ox.ac.uk/
• Use web concordancing:
- http://corpus.leeds.ac.uk/protected/query.html (general corpus; English)
- http://corpus.byu.edu/ (general corpus; American/British English)
- http://lextutor.ca/conc/eng/ (general and specialized corpora; English)
- http://www.arts.chula.ac.th/~ling/TNCII/ (general corpus; Thai)
- http://www.arts.chula.ac.th/~ling/ParaConc/index.html (English-Thai parallel concordance)
• Compile your own corpora and analyse the data using corpus analysis software:
- 'Antconc' (http://www.antlab.sci.waseda.ac.jp/software.html) (for monolingual corpora)
- 'Wordsmith' (http://www.lexically.net/wordsmith/) (for monolingual corpora)
- 'Paraconc' (http://www.athel.com/para.html) (for multilingual corpora)
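
Concordancing, which the web tools and desktop software listed above all offer, means displaying every occurrence of a search word together with its surrounding context (keyword in context, or KWIC). The following is a minimal Python sketch of the idea only, not a description of how any of those tools work internally. The function name `kwic`, the `width` parameter, and the sample sentence are illustrative assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    # Whole-word, case-insensitive search (an assumption; real tools offer more options).
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text, for illustration only.
sample = (
    "A corpus is a collection of texts. Each corpus can be analysed "
    "with software, and a small corpus may still be useful."
)

for line in kwic(sample, "corpus"):
    print(line)
```

Aligning the keyword in a central column is what makes recurring patterns to its left and right easy to spot when scanning many hits.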

3. Designing a specialized corpus
Corpus size
• There are no fixed rules; size depends on the research purposes, the availability of data, and the time available.
• Large, general corpora may be less useful than small, focused corpora if searches are made on context-specific terms.
• Corpora that are 'too small' have limitations, e.g. not enough hits to make sound generalizations, or insufficient coverage of the concepts, terms, or patterns under investigation.
• It is preferable to create a 'monitor' or 'open' corpus, i.e. one that can be added to over time, because specialized words and usage are dynamic.
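
One rough way to judge whether a corpus is 'too small' is to count its running words (tokens), its distinct words (types), and the number of hits for a term of interest. The sketch below assumes English texts and a very simple letters-only tokenizer; the function name `corpus_stats` and the two sample sentences are hypothetical. A language written without spaces, such as Thai, would need a proper word segmenter instead.

```python
import re
from collections import Counter

def corpus_stats(texts):
    """Count tokens (running words) and types (distinct words) across texts."""
    counts = Counter()
    for text in texts:
        # Simplistic tokenizer (assumption): lowercase runs of a-z, with an
        # optional internal apostrophe as in "don't".
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return {"tokens": sum(counts.values()), "types": len(counts), "counts": counts}

# Hypothetical mini-corpus, for illustration only.
texts = [
    "The patient received a low dose. The dose was increased later.",
    "A higher dose may cause side effects.",
]

stats = corpus_stats(texts)
print("tokens:", stats["tokens"])
print("types:", stats["types"])
print("hits for 'dose':", stats["counts"]["dose"])
```

If the term under investigation returns only a handful of hits, that is a sign that more texts are needed before generalizing.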

Text extracts vs. full texts
• Depends on the aim of corpus compilation.
• Whole texts offer more coverage because the words or terms of interest may be distributed randomly throughout the text.
• Specific sections may be helpful if we are looking for words or phrases in particular content areas or want to create purposeful sub-corpora.

Number of texts
• Choices can be made between collecting a few large texts or a larger number of smaller texts. Choices can also be made between selecting texts written by one or two key writers or sources, or texts retrieved from different sources or written by different authors.
• The choice depends on your research focus, e.g. studying overall language use or studying the linguistic choices preferred by particular writers.

Medium
• The corpus can consist of spoken texts, written texts, or a mixture of both.
• The choice depends on the research questions.
• Some practical factors should also be considered, e.g. compiling spoken corpora can be time-consuming and needs special types of tagging (= giving codes to the data, e.g. for paralinguistic features).

Subject and text type
• The corpus should mainly focus on the specialized texts under investigation, although this is less clear-cut in multidisciplinary subjects.
• Texts may come from different subjects if the research focus is on the study of particular language features rather than term extraction.
• Text types within a specialized subject field may vary from 'expert-to-expert' texts to 'expert-to-non-expert' texts, or in other words, from technical to popular texts.

Other considerations
• Authorship: Texts written by experts in a field tend to present more reliable and authentic examples of specialised language.
• Language: Specialised texts can be stored and retrieved in the form of monolingual, comparable, or parallel corpora.
• Publication date: Texts should come from recent publications unless queries are made in relation to particular periods of time.

4. Sources of specialized texts
• Printed materials (must be converted to text files using a scanner with good OCR (Optical Character Recognition) software)
• Word documents (must be converted to text files, e.g. using 'Save as' or by cutting and pasting the text into Notepad)
• CD-ROMs (must be converted to text files)
• Texts on the Web (must be converted to text files and/or have the HTML mark-up removed)
• Online databases (Word or PDF documents must be converted into text files)
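
For texts taken from the Web, removing the HTML mark-up can be scripted rather than done by hand. The sketch below is one possible approach using only Python's standard `html.parser` module; the class name `TextExtractor`, the helper `html_to_text`, and the sample page are illustrative assumptions. Real web pages usually also contain menus and adverts that need further manual cleaning.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML string with the mark-up removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page, for illustration only.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Dosage</h1><p>Take <b>one</b> tablet daily.</p></body></html>"
)

print(html_to_text(page))
```

The resulting string can then be saved as a plain-text file (UTF-8 is a sensible choice, especially for Thai) so that corpus analysis software can open it.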

5. Getting started with Antconc
Download the latest version of Antconc and watch the YouTube tutorials from http://www.antlab.sci.waseda.ac.jp/antconc_index.html


6. Creating a specialised corpus profile
A sample profile



