Representativeness in corpus design

Research output: Contribution to journal › Article

278 Citations (Scopus)

Abstract

The present paper addresses a number of issues related to achieving 'representativeness' in linguistic corpus design, including: discussion of what it means to 'represent' a language, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyses the distributions of several particular features, and it discusses the implications of these distributions for corpus design. The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analysed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.

Original language: English (US)
Pages (from-to): 243-257
Number of pages: 15
Journal: Literary and Linguistic Computing
Volume: 8
Issue number: 4
DOI: 10.1093/llc/8.4.243
State: Published - 1993

ASJC Scopus subject areas

  • Pharmacology
  • Neuroscience(all)
  • Immunology and Microbiology(all)
  • Pathology and Forensic Medicine
  • Safety, Risk, Reliability and Quality
  • Information Systems
  • Linguistics and Language

Cite this

Representativeness in corpus design. / Biber, Douglas E.

In: Literary and Linguistic Computing, Vol. 8, No. 4, 1993, p. 243-257.

@article{0c6aec50cbb547448413edc7579a0883,
title = "Representativeness in corpus design",
abstract = "The present paper addresses a number of issues related to achieving 'representativeness' in linguistic corpus design, including: discussion of what it means to 'represent' a language, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyses the distributions of several particular features, and it discusses the implications of these distributions for corpus design. The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analysed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.",
author = "Biber, {Douglas E}",
year = "1993",
doi = "10.1093/llc/8.4.243",
language = "English (US)",
volume = "8",
pages = "243--257",
journal = "Literary and Linguistic Computing",
issn = "0268-1145",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Representativeness in corpus design

AU - Biber, Douglas E

PY - 1993

Y1 - 1993

AB - The present paper addresses a number of issues related to achieving 'representativeness' in linguistic corpus design, including: discussion of what it means to 'represent' a language, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyses the distributions of several particular features, and it discusses the implications of these distributions for corpus design. The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analysed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.

UR - http://www.scopus.com/inward/record.url?scp=0039049276&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0039049276&partnerID=8YFLogxK

U2 - 10.1093/llc/8.4.243

DO - 10.1093/llc/8.4.243

M3 - Article

AN - SCOPUS:0039049276

VL - 8

SP - 243

EP - 257

JO - Literary and Linguistic Computing

JF - Literary and Linguistic Computing

SN - 0268-1145

IS - 4

ER -