Methodological issues regarding corpus-based analyses of linguistic variation

Research output: Contribution to journal › Article

70 Citations (Scopus)

Abstract

Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text 'genres' as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to correct this lack of evidence concerning these issues. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of 'genre' categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, in comparison to their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.

Original language: English (US)
Pages (from-to): 257-269
Number of pages: 13
Journal: Literary and Linguistic Computing
Volume: 5
Issue number: 4
DOI: 10.1093/llc/5.4.257
State: Published - 1990

ASJC Scopus subject areas

  • Pharmacology
  • Neuroscience(all)
  • Immunology and Microbiology(all)
  • Pathology and Forensic Medicine
  • Safety, Risk, Reliability and Quality
  • Information Systems
  • Linguistics and Language

Cite this

Methodological issues regarding corpus-based analyses of linguistic variation. / Biber, Douglas E.

In: Literary and Linguistic Computing, Vol. 5, No. 4, 1990, p. 257-269.

@article{0e19115def674c21968b2cabba30bee0,
title = "Methodological issues regarding corpus-based analyses of linguistic variation",
abstract = "Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text 'genres' as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to correct this lack of evidence concerning these issues. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of 'genre' categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, in comparison to their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.",
author = "Biber, {Douglas E}",
year = "1990",
doi = "10.1093/llc/5.4.257",
language = "English (US)",
volume = "5",
pages = "257--269",
journal = "Literary and Linguistic Computing",
issn = "0268-1145",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Methodological issues regarding corpus-based analyses of linguistic variation

AU - Biber, Douglas E

PY - 1990

Y1 - 1990

N2 - Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text 'genres' as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to correct this lack of evidence concerning these issues. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of 'genre' categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, in comparison to their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.

UR - http://www.scopus.com/inward/record.url?scp=0039049274&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0039049274&partnerID=8YFLogxK

U2 - 10.1093/llc/5.4.257

DO - 10.1093/llc/5.4.257

M3 - Article

AN - SCOPUS:0039049274

VL - 5

SP - 257

EP - 269

JO - Literary and Linguistic Computing

JF - Literary and Linguistic Computing

SN - 0268-1145

IS - 4

ER -