Open-Source Sequence Clustering Methods Improve the State Of the Art

Evguenia Kopylova, Jose A. Navas-Molina, Céline Mercier, Zhenjiang Zech Xu, Frédéric Mahé, Yan He, Hong Wei Zhou, Torbjørn Rognes, James G Caporaso, Rob Knight

Research output: Contribution to journal (Article)

64 Citations (Scopus)

Abstract

Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.

IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).

Original language: English (US)
Article number: e00003-15
Journal: mSystems
Volume: 1
Issue number: 1
DOI: https://doi.org/10.1128/mSystems.00003-15
State: Published - Jan 1 2016

Keywords

  • Amplicon sequencing
  • Microbial community analysis
  • Operational taxonomic units
  • Sequence clustering

ASJC Scopus subject areas

  • Molecular Biology
  • Physiology
  • Genetics
  • Biochemistry
  • Modeling and Simulation
  • Computer Science Applications
  • Ecology, Evolution, Behavior and Systematics
  • Microbiology

Cite this

Kopylova, E., Navas-Molina, J. A., Mercier, C., Xu, Z. Z., Mahé, F., He, Y., ... Knight, R. (2016). Open-Source Sequence Clustering Methods Improve the State Of the Art. mSystems, 1(1), [e00003-15]. https://doi.org/10.1128/mSystems.00003-15

@article{1f93a7003fd740c8b4b9f8523ceb6c93,
title = "Open-Source Sequence Clustering Methods Improve the State Of the Art",
keywords = "Amplicon sequencing, Microbial community analysis, Operational taxonomic units, Sequence clustering",
author = "Evguenia Kopylova and Navas-Molina, {Jose A.} and C{\'e}line Mercier and Xu, {Zhenjiang Zech} and Fr{\'e}d{\'e}ric Mah{\'e} and Yan He and Zhou, {Hong Wei} and Torbj{\o}rn Rognes and Caporaso, {James G} and Rob Knight",
year = "2016",
month = "1",
day = "1",
doi = "10.1128/mSystems.00003-15",
language = "English (US)",
volume = "1",
journal = "mSystems",
issn = "2379-5077",
publisher = "American Society for Microbiology",
number = "1",

}

TY - JOUR

T1 - Open-Source Sequence Clustering Methods Improve the State Of the Art

AU - Kopylova, Evguenia

AU - Navas-Molina, Jose A.

AU - Mercier, Céline

AU - Xu, Zhenjiang Zech

AU - Mahé, Frédéric

AU - He, Yan

AU - Zhou, Hong Wei

AU - Rognes, Torbjørn

AU - Caporaso, James G

AU - Knight, Rob

PY - 2016/1/1

Y1 - 2016/1/1

KW - Amplicon sequencing

KW - Microbial community analysis

KW - Operational taxonomic units

KW - Sequence clustering

UR - http://www.scopus.com/inward/record.url?scp=85041918285&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041918285&partnerID=8YFLogxK

U2 - 10.1128/mSystems.00003-15

DO - 10.1128/mSystems.00003-15

M3 - Article

AN - SCOPUS:85041918285

VL - 1

JO - mSystems

JF - mSystems

SN - 2379-5077

IS - 1

M1 - e00003-15

ER -