The American national corpus

More than the web can provide

Nancy Ide, Randi Reppen, Keith Suderman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

16 Citations (Scopus)

Abstract

The American National Corpus (ANC) project is developing a corpus comparable to the British National Corpus (BNC), covering American English. Recent interest in the web as a source of corpus materials has caused some in the language processing community to suggest that the development of a corpus of American English is unnecessary. However, we argue that far from being rendered superfluous by the availability of web materials, the ANC is likely to provide a resource for developing web acquisition techniques to support tasks such as genre and language detection and automatic annotation. This paper presents a comparison of the ANC in terms of both content and format with a test corpus compiled from web data, and a discussion of points of intersection and divergence.

Original language: English (US)
Title of host publication: Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002
Publisher: European Language Resources Association (ELRA)
Pages: 839-844
Number of pages: 6
State: Published - Jan 1 2002
Event: 3rd International Conference on Language Resources and Evaluation, LREC 2002 - Las Palmas, Canary Islands, Spain
Duration: May 29 2002 to May 31 2002

Other

Other: 3rd International Conference on Language Resources and Evaluation, LREC 2002
Country: Spain
City: Las Palmas, Canary Islands
Period: 5/29/02 to 5/31/02

Fingerprint

language
divergence
genre
resources
community
World Wide Web
American English

ASJC Scopus subject areas

  • Linguistics and Language
  • Language and Linguistics
  • Education
  • Library and Information Sciences

Cite this

Ide, N., Reppen, R., & Suderman, K. (2002). The American national corpus: More than the web can provide. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002 (pp. 839-844). European Language Resources Association (ELRA).

@inproceedings{b60a7af836af41e9afab1250e0427f30,
title = "The American national corpus: More than the web can provide",
author = "Nancy Ide and Randi Reppen and Keith Suderman",
year = "2002",
month = "1",
day = "1",
language = "English (US)",
pages = "839--844",
booktitle = "Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - The American national corpus

T2 - More than the web can provide

AU - Ide, Nancy

AU - Reppen, Randi

AU - Suderman, Keith

PY - 2002/1/1

Y1 - 2002/1/1

UR - http://www.scopus.com/inward/record.url?scp=33846277160&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33846277160&partnerID=8YFLogxK

M3 - Conference contribution

SP - 839

EP - 844

BT - Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002

PB - European Language Resources Association (ELRA)

ER -