Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

Jeffrey J. Werner, Omry Koren, Philip Hugenholtz, Todd Z. Desantis, William A. Walters, James G Caporaso, Largus T. Angenent, Rob Knight, Ruth E. Ley

Research output: Contribution to journal › Article

249 Citations (Scopus)

Abstract

Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.

Original language: English (US)
Pages (from-to): 94-103
Number of pages: 10
Journal: ISME Journal
Volume: 6
Issue number: 1
DOIs: https://doi.org/10.1038/ismej.2011.82
State: Published - Jan 2012
Externally published: Yes

Fingerprint

rRNA Genes
ribosomal RNA
taxonomy
anaerobic digesters
gene
digestive system
Databases
genes
Synergistetes
mice
Soil
Tenericutes
Chloroflexi
Boidae
nucleotide sequences
Python
automation
Microbiota
train

Keywords

  • Greengenes
  • microbiome
  • naïve Bayesian classifier
  • pyrosequencing
  • taxonomy

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Microbiology

Cite this

Werner, J. J., Koren, O., Hugenholtz, P., Desantis, T. Z., Walters, W. A., Caporaso, J. G., ... Ley, R. E. (2012). Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. ISME Journal, 6(1), 94-103. https://doi.org/10.1038/ismej.2011.82

@article{3fba8eed125e4d7bb2eca208f855f033,
title = "Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys",
abstract = "Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a na{\"i}ve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50{\%} reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.",
keywords = "Greengenes, microbiome, na{\"i}ve Bayesian classifier, pyrosequencing, taxonomy",
author = "Werner, {Jeffrey J.} and Omry Koren and Philip Hugenholtz and Desantis, {Todd Z.} and Walters, {William A.} and Caporaso, {James G} and Angenent, {Largus T.} and Rob Knight and Ley, {Ruth E.}",
year = "2012",
month = "1",
doi = "10.1038/ismej.2011.82",
language = "English (US)",
volume = "6",
pages = "94--103",
journal = "ISME Journal",
issn = "1751-7362",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

AU - Werner, Jeffrey J.

AU - Koren, Omry

AU - Hugenholtz, Philip

AU - Desantis, Todd Z.

AU - Walters, William A.

AU - Caporaso, James G

AU - Angenent, Largus T.

AU - Knight, Rob

AU - Ley, Ruth E.

PY - 2012/1

Y1 - 2012/1

AB - Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.

KW - Greengenes

KW - microbiome

KW - naïve Bayesian classifier

KW - pyrosequencing

KW - taxonomy

UR - http://www.scopus.com/inward/record.url?scp=84355166737&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84355166737&partnerID=8YFLogxK

U2 - 10.1038/ismej.2011.82

DO - 10.1038/ismej.2011.82

M3 - Article

VL - 6

SP - 94

EP - 103

JO - ISME Journal

JF - ISME Journal

SN - 1751-7362

IS - 1

ER -