Species abundance information improves sequence taxonomy classification accuracy

Benjamin D. Kaehler, Nicholas A. Bokulich, Daniel McDonald, Rob Knight, J. Gregory Caporaso, Gavin A. Huttley

Research output: Contribution to journal (Article)

Abstract

Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
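The effect of the uniform-prior assumption described in the abstract can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation and not the q2-clawback API: the function `classify`, the species labels, and all likelihood and abundance values are hypothetical and chosen only to show how replacing a uniform class prior with abundance-based weights can change a naive Bayes call.

```python
import math


def classify(log_likelihoods, priors):
    """Return the most probable label and the normalised posteriors.

    log_likelihoods: label -> log P(sequence | label)
    priors:          label -> P(label); every prior must be > 0
    """
    # Bayes' rule in log space: log posterior = log likelihood + log prior,
    # up to a constant shared by all labels.
    scores = {
        label: ll + math.log(priors[label])
        for label, ll in log_likelihoods.items()
    }
    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    posteriors = {
        label: math.exp(s - top) / total for label, s in scores.items()
    }
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical likelihoods: the query sequence fits species_a slightly better.
log_likelihoods = {
    "species_a": math.log(0.6),
    "species_b": math.log(0.4),
}

# Uniform prior: every reference species is assumed equally likely.
uniform = {"species_a": 0.5, "species_b": 0.5}

# Hypothetical environment-specific weights: species_b is far more abundant.
weighted = {"species_a": 0.1, "species_b": 0.9}

uniform_call, uniform_post = classify(log_likelihoods, uniform)
weighted_call, weighted_post = classify(log_likelihoods, weighted)

print("uniform prior: ", uniform_call, uniform_post)
print("weighted prior:", weighted_call, weighted_post)
```

With these made-up numbers the uniform prior favours `species_a`, while the abundance-weighted prior favours `species_b`, because the prior outweighs the small difference in likelihood.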

Original language: English (US)
Article number: 4643
Journal: Nature Communications
Volume: 10
Issue number: 1
DOI: 10.1038/s41467-019-12669-6
State: Published, Dec 1 2019

Fingerprint

Taxonomies
Classifiers
Databases

ASJC Scopus subject areas

  • Chemistry(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Physics and Astronomy(all)

Cite this

Species abundance information improves sequence taxonomy classification accuracy. / Kaehler, Benjamin D.; Bokulich, Nicholas A.; McDonald, Daniel; Knight, Rob; Caporaso, J. Gregory; Huttley, Gavin A.

In: Nature Communications, Vol. 10, No. 1, 4643, 01.12.2019.

@article{de8ee97a5ab84c3f92a4f6cc171ac317,
title = "Species abundance information improves sequence taxonomy classification accuracy",
abstract = "Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25{\%} to 14{\%}, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16{\%}). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.",
author = "Kaehler, {Benjamin D.} and Bokulich, {Nicholas A.} and McDonald, Daniel and Knight, Rob and Caporaso, {J. Gregory} and Huttley, {Gavin A.}",
year = "2019",
month = "12",
day = "1",
doi = "10.1038/s41467-019-12669-6",
language = "English (US)",
volume = "10",
journal = "Nature Communications",
issn = "2041-1723",
publisher = "Nature Publishing Group",
number = "1",

}

TY - JOUR

T1 - Species abundance information improves sequence taxonomy classification accuracy

AU - Kaehler, Benjamin D.

AU - Bokulich, Nicholas A.

AU - McDonald, Daniel

AU - Knight, Rob

AU - Caporaso, J. Gregory

AU - Huttley, Gavin A.

PY - 2019/12/1

Y1 - 2019/12/1

AB - Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in the species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which is favourably comparable to the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.

UR - http://www.scopus.com/inward/record.url?scp=85073157132&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073157132&partnerID=8YFLogxK

U2 - 10.1038/s41467-019-12669-6

DO - 10.1038/s41467-019-12669-6

M3 - Article

C2 - 31604942

AN - SCOPUS:85073157132

VL - 10

JO - Nature Communications

JF - Nature Communications

SN - 2041-1723

IS - 1

M1 - 4643

ER -