Exploring the composition of the searchable web

A corpus-based taxonomy of web registers

Douglas E Biber, Jesse Egbert, Mark Davies

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

One major challenge for Web-As-Corpus research is that a typical Web search provides little information about the register of the documents that are searched. Previous research has attempted to address this problem (e.g., through the Automatic Genre Identification initiative), but with only limited success. As a result, we currently know surprisingly little about the distribution of registers on the web. In this study, we tackle this problem through a bottom-up user-based investigation of a large, representative corpus of web documents. We base our investigation on a much larger corpus than those used in previous research (48,571 web documents), obtained through random sampling from across the full range of documents that are publicly available on the searchable web. Instead of relying on individual expert coders, we recruit typical end-users of the Web for register coding, with each document in the corpus coded by four different raters. End-users identify basic situational characteristics of each web document, coded in a hierarchical manner. Those situational characteristics lead to general register categories, which eventually lead to lists of specific sub-registers. By working through a hierarchical decision tree, users are able to identify the register category of most Internet texts with a high degree of reliability. After summarising our methodological approach, this paper documents the register composition of the searchable web. Narrative registers are found to be the most prevalent, while Opinion and Informational Description/Explanation registers are also found to be extremely common. One of the major innovations of the approach adopted here is that it permits an empirical identification of 'hybrid' documents, which integrate characteristics from multiple general register categories (e.g., opinionated-narrative). These patterns are described and illustrated through sample Internet documents.
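
As a rough illustration of how four independent rater codes per document might be combined into a single register label or a 'hybrid' label, here is a minimal Python sketch. The agreement thresholds, the function name `combine_ratings`, and the "no agreement" fallback are assumptions made for illustration only; the abstract does not state the rule the authors actually used.

```python
from collections import Counter


def combine_ratings(ratings):
    """Combine the general-register codes given to one document by four raters.

    Assumed rule (illustrative, not taken from the paper):
      - three or four raters agree -> that single register
      - a 2-2 split                -> a hybrid of the two registers
      - any other pattern          -> "no agreement"
    """
    if len(ratings) != 4:
        raise ValueError("expected exactly four rater codes per document")

    counts = Counter(ratings).most_common()
    top_label, top_count = counts[0]

    if top_count >= 3:
        return top_label
    if len(counts) == 2 and top_count == 2:
        # Two registers with two votes each: treat as a hybrid document.
        return "+".join(sorted(label for label, _ in counts))
    return "no agreement"


if __name__ == "__main__":
    print(combine_ratings(["Narrative", "Narrative", "Narrative", "Opinion"]))
    print(combine_ratings(["Opinion", "Narrative", "Opinion", "Narrative"]))
```

Under these assumed thresholds, a document coded as Opinion by two raters and Narrative by the other two would be labelled as an opinion/narrative hybrid, which loosely mirrors the 'opinionated-narrative' example mentioned in the abstract.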

Original language: English (US)
Pages (from-to): 11-45
Number of pages: 35
Journal: Corpora
Volume: 10
Issue number: 1
DOI: 10.3366/cor.2015.0065
State: Published - Apr 1 2015
Externally published: Yes

Fingerprint

taxonomy
Internet
World Wide Web
Taxonomy
Corpus-based
narrative
coding
genre
expert
innovation

Keywords

  • Hybrid registers
  • Informational registers
  • Internet language
  • Mechanical turk
  • Narrative
  • Opinion
  • Web registers
  • Web-As-Corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Exploring the composition of the searchable web : A corpus-based taxonomy of web registers. / Biber, Douglas E; Egbert, Jesse; Davies, Mark.

In: Corpora, Vol. 10, No. 1, 01.04.2015, p. 11-45.

Biber, Douglas E ; Egbert, Jesse ; Davies, Mark. / Exploring the composition of the searchable web : A corpus-based taxonomy of web registers. In: Corpora. 2015 ; Vol. 10, No. 1. pp. 11-45.
@article{c50e7c6b409043eebb000649e405465d,
title = "Exploring the composition of the searchable web: A corpus-based taxonomy of web registers",
keywords = "Hybrid registers, Informational registers, Internet language, Mechanical turk, Narrative, Opinion, Web registers, Web-As-Corpus",
author = "Biber, {Douglas E} and Jesse Egbert and Mark Davies",
year = "2015",
month = "4",
day = "1",
doi = "10.3366/cor.2015.0065",
language = "English (US)",
volume = "10",
pages = "11--45",
journal = "Corpora",
issn = "1749-5032",
publisher = "Edinburgh University Press",
number = "1",

}

TY - JOUR

T1 - Exploring the composition of the searchable web

T2 - A corpus-based taxonomy of web registers

AU - Biber, Douglas E

AU - Egbert, Jesse

AU - Davies, Mark

PY - 2015/4/1

Y1 - 2015/4/1

KW - Hybrid registers

KW - Informational registers

KW - Internet language

KW - Mechanical turk

KW - Narrative

KW - Opinion

KW - Web registers

KW - Web-As-Corpus

UR - http://www.scopus.com/inward/record.url?scp=84927754107&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84927754107&partnerID=8YFLogxK

U2 - 10.3366/cor.2015.0065

DO - 10.3366/cor.2015.0065

M3 - Article

VL - 10

SP - 11

EP - 45

JO - Corpora

JF - Corpora

SN - 1749-5032

IS - 1

ER -