Quality control of microbiota metagenomics by k-mer analysis

Research output: Contribution to journal, Journal article, Research, peer-review

Standard

Quality control of microbiota metagenomics by k-mer analysis. / Plaza Onate, Florian; Batto, Jean Michel; Juste, Catherine; Fadlallah, Jehane; Fougeroux, Cyrielle; Gouas, Doriane; Pons, Nicolas; Kennedy, Sean; Levenez, Florence; Dore, Joel; Ehrlich, S. Dusko; Gorochov, Guy; Larsen, Martin.

In: BMC Genomics, Vol. 16, No. 1, 183, 14.03.2015.

Harvard

Plaza Onate, F, Batto, JM, Juste, C, Fadlallah, J, Fougeroux, C, Gouas, D, Pons, N, Kennedy, S, Levenez, F, Dore, J, Ehrlich, SD, Gorochov, G & Larsen, M 2015, 'Quality control of microbiota metagenomics by k-mer analysis', BMC Genomics, vol. 16, no. 1, 183. https://doi.org/10.1186/s12864-015-1406-7

APA

Plaza Onate, F., Batto, J. M., Juste, C., Fadlallah, J., Fougeroux, C., Gouas, D., Pons, N., Kennedy, S., Levenez, F., Dore, J., Ehrlich, S. D., Gorochov, G., & Larsen, M. (2015). Quality control of microbiota metagenomics by k-mer analysis. BMC Genomics, 16(1), [183]. https://doi.org/10.1186/s12864-015-1406-7

Vancouver

Plaza Onate F, Batto JM, Juste C, Fadlallah J, Fougeroux C, Gouas D et al. Quality control of microbiota metagenomics by k-mer analysis. BMC Genomics. 2015 Mar 14;16(1):183. https://doi.org/10.1186/s12864-015-1406-7

Author

Plaza Onate, Florian ; Batto, Jean Michel ; Juste, Catherine ; Fadlallah, Jehane ; Fougeroux, Cyrielle ; Gouas, Doriane ; Pons, Nicolas ; Kennedy, Sean ; Levenez, Florence ; Dore, Joel ; Ehrlich, S. Dusko ; Gorochov, Guy ; Larsen, Martin. / Quality control of microbiota metagenomics by k-mer analysis. In: BMC Genomics. 2015 ; Vol. 16, No. 1.

Bibtex

@article{f5ec47f4a8b040f4a7814d9136bb5632,
title = "Quality control of microbiota metagenomics by k-mer analysis",
abstract = "Background: The biological and clinical consequences of the tight interactions between host and microbiota are rapidly being unraveled by next generation sequencing technologies and sophisticated bioinformatics, also referred to as microbiota metagenomics. The recent success of metagenomics has created a demand to rapidly apply the technology to large case-control cohort studies and to studies of microbiota from various habitats, including habitats relatively poor in microbes. It is therefore of foremost importance to enable a robust and rapid quality assessment of metagenomic data from samples that challenge present technological limits (sample numbers and size). Here we demonstrate that the distribution of overlapping k-mers of metagenome sequence data predicts sequence quality as defined by gene distribution and efficiency of sequence mapping to a reference gene catalogue. Results: We used serial dilutions of gut microbiota metagenomic datasets to generate well-defined high to low quality metagenomes. We also analyzed a collection of 52 microbiota-derived metagenomes. We demonstrate that k-mer distributions of metagenomic sequence data identify sequence contaminations, such as sequences derived from {"}empty{"} ligation products. Of note, k-mer distributions were also able to predict the frequency of sequences mapping to a reference gene catalogue not only for the well-defined serial dilution datasets, but also for 52 human gut microbiota derived metagenomic datasets. Conclusions: We propose that k-mer analysis of raw metagenome sequence reads should be implemented as a first quality assessment prior to more extensive bioinformatics analysis, such as sequence filtering and gene mapping. With the rising demand for metagenomic analysis of microbiota it is crucial to provide tools for rapid and efficient decision making. 
This will eventually lead to a faster turn-around time, improved analytical quality including sample quality metrics and a significant cost reduction. Finally, improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.",
keywords = "Metagenomics, Next generation sequencing, Quality control, Sample size limits, Sampling bias",
author = "{Plaza Onate}, Florian and Batto, {Jean Michel} and Catherine Juste and Jehane Fadlallah and Cyrielle Fougeroux and Doriane Gouas and Nicolas Pons and Sean Kennedy and Florence Levenez and Joel Dore and Ehrlich, {S. Dusko} and Guy Gorochov and Martin Larsen",
note = "Funding Information: The authors acknowledge the funding agencies and the volunteers providing samples for the study. The study was funded by INSERM, the University Pierre et Marie Curie {\textquotedblleft}EMERGENCE{\textquotedblright} program, Fondation pour l{\textquoteright}Aide a la Recherche sur la Sclerose En Plaques (ARSEP), ARTHRITIS Fondation COURTIN and Agence nationale de la recherche (ANR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: {\textcopyright} Plaza Onate et al.",
year = "2015",
month = mar,
day = "14",
doi = "10.1186/s12864-015-1406-7",
language = "English",
volume = "16",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central Ltd.",
number = "1",

}

RIS

TY - JOUR

T1 - Quality control of microbiota metagenomics by k-mer analysis

AU - Plaza Onate, Florian

AU - Batto, Jean Michel

AU - Juste, Catherine

AU - Fadlallah, Jehane

AU - Fougeroux, Cyrielle

AU - Gouas, Doriane

AU - Pons, Nicolas

AU - Kennedy, Sean

AU - Levenez, Florence

AU - Dore, Joel

AU - Ehrlich, S. Dusko

AU - Gorochov, Guy

AU - Larsen, Martin

N1 - Funding Information: The authors acknowledge the funding agencies and the volunteers providing samples for the study. The study was funded by INSERM, the University Pierre et Marie Curie “EMERGENCE” program, Fondation pour l’Aide a la Recherche sur la Sclerose En Plaques (ARSEP), ARTHRITIS Fondation COURTIN and Agence nationale de la recherche (ANR). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publisher Copyright: © Plaza Onate et al.

PY - 2015/3/14

Y1 - 2015/3/14

N2 - Background: The biological and clinical consequences of the tight interactions between host and microbiota are rapidly being unraveled by next generation sequencing technologies and sophisticated bioinformatics, also referred to as microbiota metagenomics. The recent success of metagenomics has created a demand to rapidly apply the technology to large case-control cohort studies and to studies of microbiota from various habitats, including habitats relatively poor in microbes. It is therefore of foremost importance to enable a robust and rapid quality assessment of metagenomic data from samples that challenge present technological limits (sample numbers and size). Here we demonstrate that the distribution of overlapping k-mers of metagenome sequence data predicts sequence quality as defined by gene distribution and efficiency of sequence mapping to a reference gene catalogue. Results: We used serial dilutions of gut microbiota metagenomic datasets to generate well-defined high to low quality metagenomes. We also analyzed a collection of 52 microbiota-derived metagenomes. We demonstrate that k-mer distributions of metagenomic sequence data identify sequence contaminations, such as sequences derived from "empty" ligation products. Of note, k-mer distributions were also able to predict the frequency of sequences mapping to a reference gene catalogue not only for the well-defined serial dilution datasets, but also for 52 human gut microbiota derived metagenomic datasets. Conclusions: We propose that k-mer analysis of raw metagenome sequence reads should be implemented as a first quality assessment prior to more extensive bioinformatics analysis, such as sequence filtering and gene mapping. With the rising demand for metagenomic analysis of microbiota it is crucial to provide tools for rapid and efficient decision making. 
This will eventually lead to a faster turn-around time, improved analytical quality including sample quality metrics and a significant cost reduction. Finally, improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.

AB - Background: The biological and clinical consequences of the tight interactions between host and microbiota are rapidly being unraveled by next generation sequencing technologies and sophisticated bioinformatics, also referred to as microbiota metagenomics. The recent success of metagenomics has created a demand to rapidly apply the technology to large case-control cohort studies and to studies of microbiota from various habitats, including habitats relatively poor in microbes. It is therefore of foremost importance to enable a robust and rapid quality assessment of metagenomic data from samples that challenge present technological limits (sample numbers and size). Here we demonstrate that the distribution of overlapping k-mers of metagenome sequence data predicts sequence quality as defined by gene distribution and efficiency of sequence mapping to a reference gene catalogue. Results: We used serial dilutions of gut microbiota metagenomic datasets to generate well-defined high to low quality metagenomes. We also analyzed a collection of 52 microbiota-derived metagenomes. We demonstrate that k-mer distributions of metagenomic sequence data identify sequence contaminations, such as sequences derived from "empty" ligation products. Of note, k-mer distributions were also able to predict the frequency of sequences mapping to a reference gene catalogue not only for the well-defined serial dilution datasets, but also for 52 human gut microbiota derived metagenomic datasets. Conclusions: We propose that k-mer analysis of raw metagenome sequence reads should be implemented as a first quality assessment prior to more extensive bioinformatics analysis, such as sequence filtering and gene mapping. With the rising demand for metagenomic analysis of microbiota it is crucial to provide tools for rapid and efficient decision making. 
This will eventually lead to a faster turn-around time, improved analytical quality including sample quality metrics and a significant cost reduction. Finally, improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.

KW - Metagenomics

KW - Next generation sequencing

KW - Quality control

KW - Sample size limits

KW - Sampling bias

UR - http://www.scopus.com/inward/record.url?scp=84925351919&partnerID=8YFLogxK

U2 - 10.1186/s12864-015-1406-7

DO - 10.1186/s12864-015-1406-7

M3 - Journal article

C2 - 25887914

AN - SCOPUS:84925351919

VL - 16

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 183

ER -
