TY - JOUR
T1 - Exploring the composition of the searchable web
T2 - A corpus-based taxonomy of web registers
AU - Biber, Douglas
AU - Egbert, Jesse
AU - Davies, Mark
N1 - Publisher Copyright: © Edinburgh University Press. Copyright 2015 Elsevier B.V., All rights reserved.
PY - 2015/4/1
Y1 - 2015/4/1
AB - One major challenge for Web-As-Corpus research is that a typical Web search provides little information about the register of the documents that are searched. Previous research has attempted to address this problem (e.g., through the Automatic Genre Identification initiative), but with only limited success. As a result, we currently know surprisingly little about the distribution of registers on the web. In this study, we tackle this problem through a bottom-up user-based investigation of a large, representative corpus of web documents. We base our investigation on a much larger corpus than those used in previous research (48,571 web documents), obtained through random sampling from across the full range of documents that are publicly available on the searchable web. Instead of relying on individual expert coders, we recruit typical end-users of the Web for register coding, with each document in the corpus coded by four different raters. End-users identify basic situational characteristics of each web document, coded in a hierarchical manner. Those situational characteristics lead to general register categories, which eventually lead to lists of specific sub-registers. By working through a hierarchical decision tree, users are able to identify the register category of most Internet texts with a high degree of reliability. After summarising our methodological approach, this paper documents the register composition of the searchable web. Narrative registers are found to be the most prevalent, while Opinion and Informational Description/Explanation registers are also found to be extremely common. One of the major innovations of the approach adopted here is that it permits an empirical identification of 'hybrid' documents, which integrate characteristics from multiple general register categories (e.g., opinionated-narrative). These patterns are described and illustrated through sample Internet documents.
KW - Hybrid registers
KW - Informational registers
KW - Internet language
KW - Mechanical Turk
KW - Narrative
KW - Opinion
KW - Web registers
KW - Web-As-Corpus
UR - http://www.scopus.com/inward/record.url?scp=84927754107&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84927754107&partnerID=8YFLogxK
U2 - 10.3366/cor.2015.0065
DO - 10.3366/cor.2015.0065
M3 - Article
AN - SCOPUS:84927754107
VL - 10
SP - 11
EP - 45
JO - Corpora
JF - Corpora
SN - 1749-5032
IS - 1
ER -