Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text 'genres' as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to correct this lack of evidence. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of 'genre' categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, in comparison to their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.
ASJC Scopus subject areas
- Immunology and Microbiology(all)
- Pathology and Forensic Medicine
- Safety, Risk, Reliability and Quality
- Information Systems
- Linguistics and Language