Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

Jai Ram Rideout, Yan He, Jose A. Navas-Molina, William A. Walters, Luke K. Ursell, Sean M. Gibbons, John Chase, Daniel McDonald, Antonio Gonzalez, Adam Robbins-Pianka, Jose C. Clemente, Jack A. Gilbert, Susan M. Huse, Hong Wei Zhou, Rob Knight, James G. Caporaso

Research output: Contribution to journal › Article

238 Citations (Scopus)

Abstract

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by "classic" open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
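
The abstract does not enumerate the steps of the algorithm. The toy sketch below assumes the four-step structure commonly described for subsampled open-reference OTU picking: (1) closed-reference assignment against the reference database, (2) de novo clustering of a random subsample of the reads that failed, whose centroids become new reference sequences, (3) closed-reference assignment of all failed reads against those new references, and (4) de novo clustering of whatever still fails. All function names, parameter names, and OTU ID prefixes here are hypothetical, and the position-wise `identity` function is only a stand-in for uclust's alignment-based similarity; this is not the QIIME implementation.

```python
import random


def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment-based identity)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference that meets the threshold.

    Each read is handled independently of the others, which is why this
    step can be split across many workers.
    """
    hits, failures = {}, []
    for read_id, seq in reads.items():
        for ref_id, ref_seq in refs.items():
            if identity(seq, ref_seq) >= threshold:
                hits.setdefault(ref_id, []).append(read_id)
                break
        else:
            failures.append(read_id)
    return hits, failures


def de_novo(reads, ids, threshold, prefix):
    """Greedy centroid clustering; serial, since each read is compared
    against the centroids discovered so far."""
    centroids, clusters = {}, {}
    for read_id in ids:
        seq = reads[read_id]
        for cid, cseq in centroids.items():
            if identity(seq, cseq) >= threshold:
                clusters[cid].append(read_id)
                break
        else:
            cid = f"{prefix}{len(centroids)}"
            centroids[cid] = seq
            clusters[cid] = [read_id]
    return centroids, clusters


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Return a dict mapping OTU ID -> list of read IDs (assumed four-step scheme)."""
    rng = random.Random(seed)
    # Step 1: closed-reference assignment against the reference database.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: de novo clustering of a random subsample of the failures;
    # only the centroids are kept, as new reference sequences.
    k = max(1, int(len(failures) * fraction)) if failures else 0
    new_refs, _ = de_novo(reads, rng.sample(failures, k), threshold, "New.Ref")
    # Step 3: closed-reference assignment of all failures against the new references.
    failed_reads = {read_id: reads[read_id] for read_id in failures}
    hits, still_failing = closed_reference(failed_reads, new_refs, threshold)
    otus.update(hits)
    # Step 4: de novo clustering of the reads that still have no OTU.
    _, clusters = de_novo(reads, still_failing, threshold, "New.Clean")
    otus.update(clusters)
    return otus
```

In this sketch only steps 2 and 4 are serial, and both operate on a reduced set of reads, which is consistent with the abstract's point that more of the work can run in parallel than in "classic" open-reference OTU picking.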

Original language: English (US)
Article number: e545
Journal: PeerJ
Volume: 2014
Issue number: 1
DOIs: https://doi.org/10.7717/peerj.545
State: Published - 2014

Fingerprint

Cluster Analysis
ribosomal RNA
Programming Languages
Workflow
Software packages
Computer programming languages
microbial communities
Software
Genes
Databases
genetic markers
Datasets
sampling

Keywords

  • Bioinformatics
  • Microbial ecology
  • Microbiome
  • OTU picking
  • QIIME

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)
  • Neuroscience(all)

Cite this

Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. / Rideout, Jai Ram; He, Yan; Navas-Molina, Jose A.; Walters, William A.; Ursell, Luke K.; Gibbons, Sean M.; Chase, John; McDonald, Daniel; Gonzalez, Antonio; Robbins-Pianka, Adam; Clemente, Jose C.; Gilbert, Jack A.; Huse, Susan M.; Zhou, Hong Wei; Knight, Rob; Caporaso, James G.

In: PeerJ, Vol. 2014, No. 1, e545, 2014.

Rideout, JR, He, Y, Navas-Molina, JA, Walters, WA, Ursell, LK, Gibbons, SM, Chase, J, McDonald, D, Gonzalez, A, Robbins-Pianka, A, Clemente, JC, Gilbert, JA, Huse, SM, Zhou, HW, Knight, R & Caporaso, JG 2014, 'Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences', PeerJ, vol. 2014, no. 1, e545. https://doi.org/10.7717/peerj.545
Rideout, Jai Ram ; He, Yan ; Navas-Molina, Jose A. ; Walters, William A. ; Ursell, Luke K. ; Gibbons, Sean M. ; Chase, John ; McDonald, Daniel ; Gonzalez, Antonio ; Robbins-Pianka, Adam ; Clemente, Jose C. ; Gilbert, Jack A. ; Huse, Susan M. ; Zhou, Hong Wei ; Knight, Rob ; Caporaso, James G. / Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. In: PeerJ. 2014 ; Vol. 2014, No. 1.
@article{7ac334bfc9c04a3cacad3eec9dad23b6,
title = "Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences",
abstract = "We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to {"}classic{"} open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, {"}classic{"} open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by {"}classic{"} open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by {"}classic{"} open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.",
keywords = "Bioinformatics, Microbial ecology, Microbiome, OTU picking, QIIME",
author = "Rideout, {Jai Ram} and Yan He and Navas-Molina, {Jose A.} and Walters, {William A.} and Ursell, {Luke K.} and Gibbons, {Sean M.} and John Chase and Daniel McDonald and Antonio Gonzalez and Adam Robbins-Pianka and Clemente, {Jose C.} and Gilbert, {Jack A.} and Huse, {Susan M.} and Zhou, {Hong Wei} and Rob Knight and Caporaso, {James G}",
year = "2014",
doi = "10.7717/peerj.545",
language = "English (US)",
volume = "2014",
journal = "PeerJ",
issn = "2167-8359",
publisher = "PeerJ",
number = "1",

}

TY - JOUR

T1 - Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

AU - Rideout, Jai Ram

AU - He, Yan

AU - Navas-Molina, Jose A.

AU - Walters, William A.

AU - Ursell, Luke K.

AU - Gibbons, Sean M.

AU - Chase, John

AU - McDonald, Daniel

AU - Gonzalez, Antonio

AU - Robbins-Pianka, Adam

AU - Clemente, Jose C.

AU - Gilbert, Jack A.

AU - Huse, Susan M.

AU - Zhou, Hong Wei

AU - Knight, Rob

AU - Caporaso, James G

PY - 2014

Y1 - 2014

N2 - We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by "classic" open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

KW - Bioinformatics

KW - Microbial ecology

KW - Microbiome

KW - OTU picking

KW - QIIME

UR - http://www.scopus.com/inward/record.url?scp=84920714710&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84920714710&partnerID=8YFLogxK

U2 - 10.7717/peerj.545

DO - 10.7717/peerj.545

M3 - Article

VL - 2014

JO - PeerJ

JF - PeerJ

SN - 2167-8359

IS - 1

M1 - e545

ER -