1. Introduction
Over the years, the microbiome has become the focus of an increasing number of studies in different fields and applications, from environmental research to various interdisciplinary fields, e.g., agriculture, food science, biotechnology, bioeconomy, mathematics (informatics, statistics, modeling), plant pathology, and especially human medicine [1]. In particular, the human gut microbiome has gained attention due to its critical role in human health and disease. It functions as an additional organ in our body and has a vital role in physiology, metabolism, and immune responses, establishing a symbiotic relationship with the host [2]. Hence, the recent interest in investigating how the composition and function of the microbiome vary in response to diseases or how they may influence the onset of diseases. Moreover, this newly acquired knowledge has been used to develop new and innovative strategies for the prevention, diagnosis, and treatment of various disorders affecting human health [3]. In the past decade, metagenomics has greatly improved our understanding of the microbial communities, including prokaryotes, fungi, viruses, and protozoans, that inhabit the human body and other environments, although there is still a large fraction of uncharacterized micro-organisms that are sometimes called “microbial dark matter” [4]. According to Pérez-Cobas et al. [5], two major methods, amplicon-based and shotgun sequencing, represent valid approaches for exploring the microbiome, using high-throughput sequencing technologies [6]. These approaches are constantly evolving along with the rapid upgrade and development of new high-throughput sequencing technologies, which have progressed from second-generation sequencing or next-generation sequencing (NGS), to third-generation sequencing (TGS) technologies, shifting from short-read to long-read sequencing [7]. Considering the remarkable genome plasticity of eubacteria, where specific functions/features are associated with strain-specific genome tracts possibly originated by lateral gene transfer, the main challenge now is to achieve the most fine-grained taxonomic classification to gain a better understanding of the composition and function of the microbiome. [8,9,10,11]. To date, the amplicon-based sequencing approach (also known as DNA metabarcoding), based on the analysis of a target gene/genomic region, specific for the taxonomic domain of interest (e.g., 16S rRNA gene for prokaryotic characterization), remains the most common and cost-effective strategy with great potential for microbiome profiling. This is due to several peculiarities, such as (i) high sensitivity; (ii) less risk of host contamination related to the specificity of the target gene used; (iii) possibility of checking and reducing the presence of false positives [12,13]; (iv) availability of computational error correction tools; (v) several publicly available and user-friendly suites, like QIIME 2 [14] and Mothur [15]; and (vi) the lower cost compared to the shotgun sequencing approach. Until recently, most studies have focused on the amplification and sequencing of just one or a few selected regions of the entire 16S rRNA gene [16,17,18,19,20], but the obtained results did not provide an exhaustive representation of the biodiversity. In fact, the choice of specific regions of a gene and the corresponding primer pairs, as well as the sequencing methods employed, introduce bias and variability into the results. This can lead to differences in specificity and sensitivity during the analysis, ultimately influencing the overall outcome of the study [21,22,23,24]. Hence, the single-region amplicon-based approach has evolved into two new approaches: the multiplex and the full-length. The multiplex approach tries to mitigate common issues of the single-region approach by simultaneously targeting different hypervariable regions of the selected marker; this approach is expected to overcome the sensitivity limits of a single primer pair, but still relies on short-read sequencing. Conversely, TGS technologies enable the full-length approach, which covers the whole target gene, by using long-read sequencing. Also, a faster library set up and running times, accuracy, and cost-efficiency are TGS advantages. These improvements, together with the development of high-resolution computational methods, have made the full-length approach highly competitive compared to short-read sequencing methods. However, long-read sequencing studies still face limitations. These include higher sequencing error rates, systematic errors, and a lack of mature bioinformatic resources for interpreting the data [25,26,27].
This study compares three different amplicon-based sequencing methods, represented by (a) the single-region approach, testing three different hypervariable regions within the 16S rDNA, V3V4 [28,29], V5V6 [30], and V4 [31,32]; (b) the multiplex approach, and (c) the full-length approach, adopting second- and third-generation sequencing platforms. The aim of this work was to identify the best and the most efficient approach in terms of the processing time, contamination risks, sequencing quality, downstream analysis, high coverage and resolution at the species level, and costs. Firstly, we used a prokaryotic mock community with a known composition, and then the results were validated with biological samples from a mouse model of intestinal inflammation.
Considering the broad impact of microbiome research across various fields that often requires the processing of many samples, our goal was to enhance the efficiency of microbial characterization and offer practical guidelines for obtaining a fast and cost-effective snapshot of the microbiome.
2. Materials and Methods
2.1. Samples
The commercial mock microbial community, ATCC® 20 Strain Even Mix Genomic Material (MSA-1002TM, ATCC®, Manassas, VA, USA, https://www.atcc.org/products/msa-1002, accessed on 1 March 2021), composed of a mix of genomic DNA belonging to 20 fully sequenced, characterized, and authenticated ATCC Genuine Cultures (5% for each strain), was used as a benchmark for testing the three different amplicon-based sequencing approaches. The bacterial content of this mock microbial community is described in .
The same three approaches were tested on real biological samples: 78 samples (n. 56 feces and n. 22 intestinal content) from an in vivo experimental mouse model of intestinal inflammation. The DNAs were extracted with DNeasy Power Soil Pro kit (Qiagen, Germantown, MD, USA) following the manufacturer’s protocol. The DNAs were provided by the company Postbiotica s.r.l. [29].
2.2. 16S rRNA Gene Amplicon-Based Library Preparation and Sequencing
Three different 16S rRNA gene amplicon-based sequencing methods were applied to the mock community and to the DNA extracted from feces/intestinal content.
All the sample library preparations for Illumina NGS, according to the single-region and multiplex approaches, were performed, starting from 1 ng of DNA. The QIAseq 16S/ITS Region Panels kit (QIAGEN) was used for targeting the V3V4 hypervariable regions of the bacterial 16S rRNA gene, according to the manufacturer’s instructions. The V5V6 and V4 hypervariable regions of the 16S rDNA were amplified using two overhang primer pairs, BV5/AV6 and 515F/806R, respectively, using the Phusion High-Fidelity DNA Polymerase and following the protocols described in Manzari et al. [33].
For the multiplex approach, the Swift amplicon® 16S+ITS Panel Kit (Swift Biosciences, Inc., Ann Arbor, MI, USA) was used, following the manufacturer’s instructions, by applying 22 amplification cycles in the first PCR step and by introducing a purification step of the libraries with AMPure XP beads (0.8×, v/v) (Beckman Coulter, Inc., Brea, CA, USA) at the end of the protocol. Each NGS library was quality checked through TapeStation 4200 (Agilent Technologies, Santa Clara, CA, USA) or 1.2% agarose gel electrophoresis and quantified using fluorometric assay (Qubit dsDNA HS assay kit, Thermo Fisher Scientific, Waltham, MA, USA). For each different method, the libraries were pooled at equimolar concentrations and sequenced on the MiSeq Illumina platform. The 2 × 300 bp paired-end sequencing strategy (MiSeq Reagent Kit v3, 600-cycle) was used for the V3V4 sequencing on the MiSeq platform. Conversely, the 2 × 250 bp paired-end sequencing strategy (MiSeq Reagent Kit v2, 500-cycle) was used for the V5V6, V4, and multiplex sequencing approaches.
Sample library preparation for PacBio TGS was performed, starting from 2 ng of total DNA. The 27F/1492R universal primer pair [34] was used to amplify the full-length 16S rRNA gene (corresponding to the region from V1 to V9), with both primers tailed with sample-specific PacBio barcode sequences for multiplexed sequencing (for the barcode sequences, see the Appendix of the PacBio protocol “Amplification of Full-Length 16S Gene with Barcoded Primers for multiplexed SMRTbell® Library Preparation and Sequencing”—Version 04—January 2021). Amplifications were performed using a 25 µL mixture containing 2 ng of DNA, 1× Buffer HF, 0.2 mM dNTPs, 0.375 µM of each primer (Fwd-Rev), and 1U/L of Phusion High-Fidelity DNA Polymerase. The cycling parameters for amplification of the full-length 16S rRNA gene were standardized as follows: initial denaturation at 98 °C for 30 s, followed by 10 cycles of denaturation at 98 °C for 10 s, annealing at 58 °C for 30 s, extension at 72 °C for 1 min, and subsequently, 15 cycles of denaturation at 98 °C for 10 s, annealing at 62 °C for 30 s, extension at 72 °C for 1 min, with a final extension step of 7 min at 72 °C. The PCR products (~1.5 kb long) were quality checked by 1.2% agarose gel electrophoresis and quantified using Qubit 1× dsDNA HS Assay kit. Each barcoded PCR reaction product was pooled at equimolar concentrations. Then, the SMRTbell library construction was performed according to the manufacturer’s instructions of the “Amplification of Full-Length 16S Gene with Barcoded Primers for multiplexed SMRTbell® Library Preparation and Sequencing” (Version 04—January 2021), starting from the step “Pooling Barcoded Amplicons”. The SMRTbell® Express Template Prep Kit 2.0, the Binding Kit 2.1, the Sequel II Sequencing Kit 2.0, and a single SMRT® Cell 8M (Pacific Biosciences, Menlo Park, CA, USA) were used for library preparation and sequencing on the PacBio Sequel II System.
All PCR reactions were performed in the presence of a negative control (Molecular Biology Grade Water, RNase/DNase-free water), to exclude contamination during library preparation steps. The negative controls were evaluated qualitatively and quantitatively, just like the other samples. As the analyses confirmed the absence of contaminations, these controls were not sequenced.
2.3. Bioinformatic Data Analysis
3. Results
3.1. Mock Benchmark Analysis
A mock community was used as an experimental and bioinformatic benchmark in order to evaluate the efficiency of the three amplicon-based sequencing methods, i.e., the single-region, multiplex, and full-length approaches, to provide an accurate microbiome characterization. The mock microbiome community that we analyzed contained bacteria that were both frequent and rare in the human microbiome, under eubiosis and dysbiosis conditions. It covers the phyla Bacteroidota (Bacteroides vulgatus, Porphyromonas gingivalis), Actinobacteriota (Bifidobacterium adolescentis, Cutibacterium acnes, Schaalia odontolytica), Firmicutes (Bacillus pacificus, Clostridium beijerinckii, Enterococcus faecalis, Lactobacillus gasseri, Staphylococcus aureus, Staphylococcus epidermidis, Streptococcus agalactiae, Streptococcus mutans), Proteobacteria (Acinetobacter baumannii, Escherichia coli, Helicobacter pylori, Neisseria meningitidis, Pseudomonas paraeruginosa, Cereibacter sphaeroides), and Deinococcota (Deinococcus radiodurans), for a total of 18 genera and 20 bacterial species . For each method, the amplicon libraries were sequenced by a specific platform and the results were compared at genus and species level. As for the genus-level comparisons , the V3V4, V4, and full-length approaches gave the best results since they identified the 18 expected genera. Moreover, a strong linear correlation (Pearson’s r = 0.76 for V3V4, r = 0.70 for V4, and r = 0.71 for full-length; p-value < 0.001) was observed between the expected and observed 16S rRNA relative abundances identified at the genus level (Figure 1). Otherwise, the V5V6 single-region and the multiplex approach identified only 10/18 and 10/18 genera, respectively, without a significant statistical correlation between the expected and observed 16S rRNA abundances, as shown in Figure 1.
Figure 1. Correlation between the expected and observed 16S rRNA relative abundances (%) at the genus level for each sequencing approach. Correlation for (a) the V3V4 region; (b) the V5V6 region; (c) the V4 region; (d) the multiplex approach; (e) the full-length approach. For each method, the adjusted R2, linear model coefficient (coef.), Pearson correlation coefficient (r), and p-value are shown.
Moving to the species-level analysis , in the single-region approach, 4 species, 12 species, and 9 species of the mock community were identified by the V5V6, the V3V4, and the V4 targets, respectively. Additionally, only for the V3V4 target was a weak correlation between the expected and observed 16S relative abundances evaluated (Pearson’s r = 0.26; p-value = 0.251), even if no statistical significance was revealed (Figure 2). The multiplex approach detected only four species, and the 16S rDNA-associated counts were a lot lower than expected (Figure 2, ). The full-length approach detected 3 more species than the V3V4 single-region approach, represented by Clostridium beijerinckii, Enterococcus faecalis, and Pseudomonas aeruginosa, for a total of 15 observed species. The observed counts by the full-length approach show a weak linear correlation (Pearson’s r = 0.28; p-value = 0.223) (Figure 2). No methods identified the two species of the genus Staphylococcus (S. aureus and S. epidermidis), except for S. epidermidis which was identified only by the V5V6 region. On the contrary, only the V3V4, V4, and full-length approaches were able to distinguish the species Streptococcus agalactiae and Streptococcus mutans.
Figure 2. Correlation between the expected and observed 16S rRNA relative abundances (%) at the species level for each sequencing approach. (a) Correlation for the V3V4 region; (b) correlation for the V5V6 region; (c) correlation for the V4 region; (d) correlation for the multiplex approach; (e) correlation for the full-length approach. For each method, the adjusted R2, linear model coefficient (coef.), Pearson correlation coefficient (r), and p-value are shown.
To further investigate the failure in the classification of some taxa and identify the eventual limiting step of the entire workflow, we initially mapped the raw PE reads, retained following the primer trimming, on the mock species reference genomes. For short-read mapping, we considered only PE reads mapping on the same 16S rRNA genes or on an identical copy. PE reads mapping on different non-identical copies of the SSU genes were considered ambiguous and not considered for subsequent analysis. Finally, we also considered PE reads mapped in the genomic region out of those annotated to contain 16S rRNA genes and unmapped ones. The largest amount of read mapping on 16S rRNA genes was observed in full-length (99.98%) and V4 (98.76%) . V5V6 obtained the lowest amount of PE reads mapping unambiguously on SSU genes (15.04%) and the highest rate of ambiguous ones (78.40%) . Reads mapping on unexpected regions were observed only in V4 (0.03%) and V3V4 (0.001%) . Finally, the topmost unmapped rates were observed in multiplex (13.14%) and V5V6 (6.57%) . In and are shown the relative abundances of reads mapping on 16S rRNA genes per each species represented in the mock community. It was only via the multiplex approach that we did not find reads mapping to all the mock species (in particular D. radiodurans and A. baumannii). A relevant positive Pearson correlation among the expected and observed relative abundances was observed for V3V4 (Pearson’s r = 0.61; p-value = 0.004), V4 (Pearson’s r = 0.60; p-value = 0.005), V5V6 (Pearson’s r = 0.51; p-value = 0.022), and full-length (Pearson’s r = 0.54; p-value = 0.015). Moreover, considering the Swift bioinformatic workflow implements a preliminary denoising step before the selection of 16S rRNA sequences from the reference collection, we decided to evaluate pASVs and map them on the reference mock genomes. We observed these pASVs mapped on 12 out of 20 species. Finally, we have also evaluated and compared the observed 16S rRNA gene coverage in multiplex raw reads and pASVs . Regarding the raw reads, in 6 out of 18 species (namely B. adolescentis, C. beijerinckii, E. fecalis, E. coli, L. gasseri, and B. vulgatus), the expected pattern, covering the whole 16S rRNA gene, was revealed. The same pattern was also observed at the pASV level, with the exception of C. beijerinckii. Furthermore, regardless of the analyzed species, the coverage pattern was uneven. Regarding the mapped pASVs, we overall observed the same coverage pattern as seen at the raw reads level in B. pacificus, N. meningitidis, R. sphaeroides, S. aureus, S. agalactiae, S. epidermidis, and S. mutant.
S. odontolytica had a different trend, showing a profile of pattern coverage higher than the one observed in the raw reads.
Subsequently, the ASVs derived from each single-region and full-length approach and 16S rRNA full-length sequences selected by the multiplex approach were mapped to the available reference genomes of the mock species (BLAST analysis with query coverage 90% and identity 97%) . For the V3V4 and full-length methods, all of the generated ASVs (20 out of 20) were aligned to the reference genomes of the mock microbiome. For the V5V6 and V4 methods, 19 ASVs were aligned, and only 6 ASVs were identified using the multiplex method. The analysis also highlighted the presence of background noise, including false positives, associated with these methods . The precision, accuracy, and recall (sensitivity) values for each method are provided in and .
3.2. Validation on Real Microbiome Samples
Taking into account the results of the benchmark study, the three different 16S rDNA amplicon-based approaches (i.e., the V3V4 single-region approach, see Discussion, the multiplex, and the full-length methods) were used to analyze 78 real samples obtained from the feces/intestinal content of a mouse model of intestinal inflammation.
4. Discussion
To study the prokaryotic microbiome, the most widely used approach is amplicon-based sequencing that focuses on one or a few regions of the 16S rRNA gene. This approach has been progressively improved following technological advances and the demands of modern research [11]. Currently, two amplicon-based approaches represent a possible alternative: the multiplex and the full-length methods [34,51,52]. The first represents a compromise that targets different regions of the marker gene relying on short-read sequencing, whereas the second takes advantage of long-read sequencing to cover the whole length of the gene.
NGS has lower analytical costs and uses established bioinformatic pipelines and databases for downstream analysis, but nowadays, TGS has become more competitive thanks to cost reduction and the development of efficient analysis methods [22,24]. In both cases, the experimental workflows lead to sample handling improvements by reducing execution times and contamination risk. The full-length method is especially advantageous because it handles pooled samples from the beginning of library preparation. Therefore, the main challenge remains to identify the amplicon-based sequencing approach that can best reliably characterize the microbiome up to lower taxonomic levels, such as the species level. So, the present study investigates the effectiveness of three different experimental approaches, the single-region and multiplex using the NGS platform Illumina MiSeq, and the full-length using the TGS platform PacBio. We first performed the microbial characterization on a prokaryotic mock community as a benchmark for both the experimental and bioinformatics steps. Then, we applied the same methods to a cohort of complex murine samples to validate the results.
A benchmarking study is crucial to understand the relative performance of taxonomic profiling methods for different purposes, providing a ”ground truth” to which results can be compared [53,54]. The prokaryotic mock microbiome considered is composed of 20 bacterial species, representative of 18 genera. The results show the different sensitivity of each hypervariable region of the 16S rRNA gene in microbiome profiling, supporting that the choice of a region can affect the results [21,22,55] and lead to intrinsic analytical limitations due to the sequence features of the reference collection. Indeed, it is evident, already at the genus level, that the taxa identified by the single-region approach vary depending on the hypervariable region chosen (V3V4, V5V6, and V4) . Nevertheless, the full-length approach identifies with high fidelity 18/18 genera, the same as the V3V4 and V4 methods. It is worthy to note that for both Clostridium beijerinckii and Schaalia odontolytica, the genera associated with the reference taxonomy were Clostridium senso stricto 1 and Actinomyces, respectively. This represents a classification issue and classifier performances are greatly affected by the reference database [56] and this becomes even more evident at the species level. The full-length approach identifies the highest number of species (15/20), overcoming the discrimination power of the others . Despite the sequenced ASVs matching with the available reference sequences of the mock community, some species are not classified. Indeed, the SILVA database [57], with more reference sequences than other 16S rRNA gene databases such as GreenGenes and RDP [58,59], includes a considerable number of taxonomies that do not have the resolution to the species level. Missing a species in the database can result in misclassification [60], limiting the classifiers’ performance. Therefore, the results at the species level in the mock community are not related to the experimental protocol but to a classification issue [56]. In addition, we detected unexpected ASVs not matching with mock reference genome sequences . These data support the use of ≥1% relative abundance as a suitable threshold able to eliminate the background noise in taxonomic analysis in the case of single-region analysis. This threshold may become ≤0.1% in the case of FL analysis, thus remarkably improving detection sensitivity and specificity.
Despite the multiplex approach being characterized by a higher gene coverage compared to the single-region approach, its resolution at the genus level was comparable to the V5V6 target, but its performance was limited compared to the other approaches and . This limitation could be related to both the experimental procedure and the different steps of the specific bioinformatic workflow. Indeed, when mapping the raw reads on reference mock genomes, we found that only 18 out of 20 species were detected and this supports the idea of a possible issue in multiplexed PCR efficiency. Regarding the bioinformatic approach, it involves the selection of 16S full-length sequences from a reference collection that have a higher probability to be observed in the analyzed dataset. This selection relies on the observed pASVs and in our analysis, we demonstrated the impact of the denoising step. Indeed, considering pASVs, only 12/20 species were detected and probably the implemented approach fails in discriminating among noise and real sequence variability and this results in an aggregation of sequences belonging to different taxa. A remarkable example is S. odontolytica for which we observed more sequences associated with pASVs compared to raw ones. These results support the thesis that issues in both experimental and bioinformatic steps limit the reliability of the multiplex approach. In our study, the mock community, which is usually used as an internal control to monitor the entire sequencing workflow, was sequenced and analyzed with real intestinal biological samples. This might influence the multiplex approach by affecting the capacity of DADA2 to distinguish real biodiversity from noise. Indeed, in our results, we clearly show how, following the denoising step, we were unable to map pASVs to eight species. Once again, it is crucial to note that denoising is preliminary to 16S full-length sequence selection, and the loss of data introduces biases in the subsequent analytical steps. Furthermore, the shown data represent cumulative results obtained following an experimental and bioinformatic workflow. Consequently, it is difficult to define whether we are observing, for instance, an issue related to primer amplification bias or to bioinformatic analysis.
The mock benchmark has enabled the optimization of amplicon-based sequencing workflows, but the microbiome communities of biological samples are much more complex than a mock community [24]. So, the previous results have been validated using a community of 78 fecal samples deriving from a specific stratified sampling. Considering the higher resolution achieved in the preliminary benchmark analysis by V3V4 compared to the other single regions, for the single-region approach, just the V3V4 region has been chosen as the target for the validation.
As explained by the rarefaction curves (Figure 3), both the full-length and the V3V4 approach can reliably estimate all the community diversity, instead of the unreached plateau for the multiplex data. These data are confirmed by Pielou’s index calculation which shows a greater evenness of the community for the full-length approach, which is therefore able to capture a higher biodiversity [61]. At the same time, the higher value of the inverse Simpson index recorded for the multiplex approach seems to not be associated with a real greater biodiversity but probably with the overestimation of ASVs (https://github.com/swiftbiosciences/16S-SNAPP-py3, accessed on 1 January 2020), as demonstrated by the related low Pielou’s index inference (Figure 4). This discrepancy is additionally confirmed by the α diversity analysis performed with both indexes at the genus level (Figure 5), highlighting the lower biodiversity captured by the multiplex method compared to the other methods.
Figure 3. Rarefaction curves of sequencing data. The figure shows the rarefaction curves of (a) V3V4 sequencing data with a threshold of 100,000 sequences; (b) multiplex sequencing data with a threshold of 52,000 sequences; (c) full-length sequencing data with a threshold of 7427 sequences. In brown are shown the rarefaction curves of feces samples, whereas in blue are shown the rarefaction curves of intestinal content samples.
Figure 4. Box plot of α diversity indices calculated for each sequencing method. (a) α diversity measured using the inverse Simpson index. (b) α diversity measured using Pielou’s evenness index. α diversity scores were calculated by using rarefied ASV counts for each approach. Box plots and points represent the overall data distribution and single samples, respectively. Yellow: full-length approach; violet: multiplex approach; water-green: V3V4 approach. The group means comparison was performed by using the paired Student’s t-test (“**”: p-value < 0.01; “***”: p-value < 0.001; “****”: p-value < 0.0001).
Figure 5. Box plot of α diversity indices calculated for each sequencing method at genus-level counts. (a) α diversity measured using the inverse Simpson index. (b) α diversity measured using Pielou’s evenness index. α diversity scores were calculated by using rarefied genus-level counts for each approach. Box plots and points represent the overall data distribution and samples, respectively. Yellow: full-length approach; violet: multiplex approach; water-green: V3V4 approach. The group means comparison was performed by using the paired Student’s t-test (“ns”: p-value > 0.05; “***”: p-value < 0.001; “****”: p-value < 0.0001).
The ability of each approach to reliably detect the specific biodiversity is also revealed by the β diversity analysis. The biodiversity measured by the full-length and V3V4 approaches overlaps at the higher taxonomic levels and shows a good separation only at the species level (Figure 6). On the contrary, the multiplex data cluster separately at all the taxonomic levels analyzed. They show a different community composition from the one identified by the other two methods (Figure 6). Focusing on the taxonomical analysis, the three methods identify the same five phyla, represented by Bacteroidota, Firmicutes, Verrucomicrobiota, Actinobacteriota, and Proteobacteria (Figure 7 and Figure 8). In the multiplex data, an increase in the phyla Proteobacteria and Actinobacteriota and a decrease in the phylum Verrucomicrobiota are observed. Moving to lower taxonomic levels, the relationship between the V3V4 and full-length methods becomes evident. In fact, at the family and genus level, all the taxa identified by the full-length method match with those found by the V3V4 method. On the contrary, the multiplex method shared only 9 taxa at the family and 10 taxa at the genus level with the V3V4 and full-length methods. Finally, the highest number of species were identified by the full-length method, of which 14 taxa were exclusive, 29 taxa were shared with the V3V4 method, and 3 taxa were shared with the multiplex method. In particular, the full-length method can classify two exclusive species represented by Bifidobacterium pseudolongum and Lactobacillus johnsonii, five species common to the V3V4 method as Akkermansia muciniphila, Bacteroides acidifaciens, Lactobacillus murinus, Lactobacillus reuteri, and one species (Muribaculum intestinale) also identified by the multiplex method. Within the genus Bifidobacterium, the full-length and V3V4 methods recognized two different species, represented by Bifidobacterium pseudolongum and Bifidobacterium choerinum, respectively. In the case of the multiplex, the exclusive species Lactobacillus vaginalis is misclassified to the genus Bacillus, creating a large variability in the overall composition of the microbial community, compared to the other methods. These data confirm the trend observed in the β diversity analysis and highlight the ability of the full-length method to capture the same microbiome profile identified by the single-region method but at the same time, overcoming the resolution of both the single-region and multiplex methods.
Figure 6. PCoA plot for taxonomic-level data. The figure shows the β diversity at different taxonomic levels, from phylum (a) to family (b), genus (c), and species (d) levels (Aitchison distance, using CLR-transformed sample abundances). Point colors represent the three sequencing method sets: the full-length approach in yellow; the multiplex approach in violet; the V3V4 approach in water-green, respectively.
Figure 7. Venn diagrams using filtered taxa with relative abundance ≥ 1%. Circular colors represent the three sequencing method sets: the full-length approach in yellow; the multiplex approach in violet; the V3V4 approach in water-green, respectively. Shared taxa are represented as overlapping circles with merged colors.
Figure 8. Donut charts of the taxonomic assignments at different taxonomic levels. For each sequencing method, single-region (V3V4), multiplex (MPX), and full-length (FL), the taxonomic assignments and the average relative abundances at phylum, family, genus, and species levels are plotted. Taxa with relative abundances < 1% are collapsed into “Others”.
The last important aspect to consider is the rate of unclassified taxa for each method (Figure 8). The full-length approach does not have any unclassified ASVs up to the family level and maintains the lowest rate at the genus level. The multiplex method, on the other hand, has a higher rate of unclassified taxa, starting from the phylum level. These last results confirm the limits associated with downstream analysis and the pipeline used for analyzing multiple small amplicons and reconstructing the 16S rRNA gene.
5. Conclusions
This comparative assessment between the three amplicon-based methods verified the reliability of the single-region-based study. However, it also demonstrated the better performance of the complete target analysis and its higher effectiveness on complex biological communities. In fact, the analysis, supported by a case study, highlighted the greatest discriminating power of the full-length 16S rRNA approach up to the species level. It also benefited from its less laboriousness, lower execution time, and contamination risk, at a similar cost to the standard single-region approach. Hence, the amplification of the whole 16S rRNA gene and the use of TGS demonstrated an improvement, in both experimental and downstream analysis, compared to previous methods. Despite the issues with reference databases [56], this approach was able to identify more species of the known composition benchmark and to exclusively classify B. pseudolongum and L. johnsonii within the real dataset, as compared to the short-read sequencing approaches. On the other hand, the multiplex method presented remarkable flaws, such as sequencing depth and sampling, inference of ASVs, and 16S reference collection (https://github.com/swiftbiosciences/16S-SNAPP-py3, accessed on 1 January 2020).
Considering all these factors, this study supports the transition from NGS to TGS for the study of the intestinal microbiome, opening a new frontier in biomedical research to revolutionize the way to act against disease conditions [62]. Further studies may be performed to demonstrate the effectiveness of this approach for samples of a different nature and taxonomic complexity, such as environmental samples.
References
- Berg, G.; Rybakova, D.; Fischer, D.; Cernava, T.; Vergès, M.-C.C.; Charles, T.; Chen, X.; Cocolin, L.; Eversole, K.; Corral, G.H.; et al. Microbiome Definition Re-Visited: Old Concepts and New Challenges. Microbiome 2020, 8, 103. [Google Scholar] [CrossRef]
- Malard, F.; Dore, J.; Gaugler, B.; Mohty, M. Introduction to Host Microbiome Symbiosis in Health and Disease. Mucosal. Immunol. 2021, 14, 547–554. [Google Scholar] [CrossRef]
- Durack, J.; Lynch, S.V. The Gut Microbiome: Relationships with Disease and Opportunities for Therapy. J. Exp. Med. 2019, 216, 20–40. [Google Scholar] [CrossRef] [PubMed]
- Thomas, A.M.; Segata, N. Multiple Levels of the Unknown in Microbiome Research. BMC Biol. 2019, 17, 48. [Google Scholar] [CrossRef] [PubMed]
- Pérez-Cobas, A.E.; Gomez-Valero, L.; Buchrieser, C. Metagenomic Approaches in Microbial Ecology: An Update on Whole-Genome and Marker Gene Sequencing Analyses. Microb. Genom. 2020, 6. [Google Scholar] [CrossRef]
- Zhang, L.; Chen, F.; Zeng, Z.; Xu, M.; Sun, F.; Yang, L.; Bi, X.; Lin, Y.; Gao, Y.; Hao, H.; et al. Advances in Metagenomics and Its Application in Environmental Microorganisms. Front. Microbiol. 2021, 12, 766364. [Google Scholar] [CrossRef]
- Amarasinghe, S.L.; Su, S.; Dong, X.; Zappia, L.; Ritchie, M.E.; Gouil, Q. Opportunities and Challenges in Long-Read Sequencing Data Analysis. Genome Biol. 2020, 21, 30. [Google Scholar] [CrossRef]
- Earl, J.P.; Adappa, N.D.; Krol, J.; Bhat, A.S.; Balashov, S.; Ehrlich, R.L.; Palmer, J.N.; Workman, A.D.; Blasetti, M.; Sen, B.; et al. Species-Level Bacterial Community Profiling of the Healthy Sinonasal Microbiome Using Pacific Biosciences Sequencing of Full-Length 16S RRNA Genes. Microbiome 2018, 6, 190. [Google Scholar] [CrossRef]
- Martin, T.C.; Visconti, A.; Spector, T.D.; Falchi, M. Conducting Metagenomic Studies in Microbiology and Clinical Research. Appl. Microbiol. Biotechnol. 2018, 102, 8629–8646. [Google Scholar] [CrossRef]
- Nygaard, A.B.; Tunsjø, H.S.; Meisal, R.; Charnock, C. A Preliminary Study on the Potential of Nanopore MinION and Illumina MiSeq 16S RRNA Gene Sequencing to Characterize Building-Dust Microbiomes. Sci. Rep. 2020, 10, 3209. [Google Scholar] [CrossRef]
- Bharti, R.; Grimm, D.G. Current Challenges and Best-Practice Protocols for Microbiome Analysis. Brief. Bioinform. 2021, 22, 178–193. [Google Scholar] [CrossRef] [PubMed]
- Ficetola, G.F.; Taberlet, P.; Coissac, E. How to Limit False Positives in Environmental DNA and Metabarcoding? Mol. Ecol. Resour. 2016, 16, 604–607. [Google Scholar] [CrossRef] [PubMed]
- Garrido-Sanz, L.; Àngel Senar, M.; Piñol, J. Drastic Reduction of False Positive Species in Samples of Insects by Intersecting the Default Output of Two Popular Metagenomic Classifiers. PLoS ONE 2022, 17, e0275790. [Google Scholar] [CrossRef] [PubMed]
- Bolyen, E.; Rideout, J.R.; Dillon, M.R.; Bokulich, N.A.; Abnet, C.C.; Al-Ghalith, G.A.; Alexander, H.; Alm, E.J.; Arumugam, M.; Asnicar, F.; et al. Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2. Nat. Biotechnol. 2019, 37, 852–857. [Google Scholar] [CrossRef]
- Wensel, C.R.; Pluznick, J.L.; Salzberg, S.L.; Sears, C.L. Next-Generation Sequencing: Insights to Advance Clinical Investigations of the Microbiome. J. Clin. Investig. 2022, 132, e154944. [Google Scholar] [CrossRef]
- Caporaso, J.G.; Lauber, C.L.; Walters, W.A.; Berg-Lyons, D.; Lozupone, C.A.; Turnbaugh, P.J.; Fierer, N.; Knight, R. Global Patterns of 16S RRNA Diversity at a Depth of Millions of Sequences per Sample. Proc. Natl. Acad. Sci. USA 2011, 108, 4516–4522. [Google Scholar] [CrossRef]
- Liu, X.; Fan, H.; Ding, X.; Hong, Z.; Nei, Y.; Liu, Z.; Li, G.; Guo, H. Analysis of the Gut Microbiota by High-Throughput Sequencing of the V5–V6 Regions of the 16S RRNA Gene in Donkey. Curr. Microbiol. 2014, 68, 657–662. [Google Scholar] [CrossRef]
- Sinclair, L.; Osman, O.A.; Bertilsson, S.; Eiler, A. Microbial Community Composition and Diversity via 16S RRNA Gene Amplicons: Evaluating the Illumina Platform. PLoS ONE 2015, 10, e0116955. [Google Scholar] [CrossRef]
- Hamad, I.; Abou Abdallah, R.; Ravaux, I.; Mokhtari, S.; Tissot-Dupont, H.; Michelle, C.; Stein, A.; Lagier, J.-C.; Raoult, D.; Bittar, F. Metabarcoding Analysis of Eukaryotic Microbiota in the Gut of HIV-Infected Patients. PLoS ONE 2018, 13, e0191913. [Google Scholar] [CrossRef]
- Tsang, C.-C.; Teng, J.L.L.; Lau, S.K.P.; Woo, P.C.Y. Rapid Genomic Diagnosis of Fungal Infections in the Age of Next-Generation Sequencing. J. Fungi 2021, 7, 636. [Google Scholar] [CrossRef]
- Fouhy, F.; Clooney, A.G.; Stanton, C.; Claesson, M.J.; Cotter, P.D. 16S RRNA Gene Sequencing of Mock Microbial Populations- Impact of DNA Extraction Method, Primer Choice and Sequencing Platform. BMC Microbiol. 2016, 16, 123. [Google Scholar] [CrossRef]
- Palkova, L.; Tomova, A.; Repiska, G.; Babinska, K.; Bokor, B.; Mikula, I.; Minarik, G.; Ostatnikova, D.; Soltys, K. Evaluation of 16S RRNA Primer Sets for Characterisation of Microbiota in Paediatric Patients with Autism Spectrum Disorder. Sci. Rep. 2021, 11, 6781. [Google Scholar] [CrossRef] [PubMed]
- Tremblay, J.; Singh, K.; Fern, A.; Kirton, E.S.; He, S.; Woyke, T.; Lee, J.; Chen, F.; Dangl, J.L.; Tringe, S.G. Primer and Platform Effects on 16S RRNA Tag Sequencing. Front. Microbiol. 2015, 6, 771. [Google Scholar] [CrossRef] [PubMed]
- Johnson, J.S.; Spakowicz, D.J.; Hong, B.-Y.; Petersen, L.M.; Demkowicz, P.; Chen, L.; Leopold, S.R.; Hanson, B.M.; Agresta, H.O.; Gerstein, M.; et al. Evaluation of 16S RRNA Gene Sequencing for Species and Strain-Level Microbiome Analysis. Nat. Commun. 2019, 10, 5029. [Google Scholar] [CrossRef] [PubMed]
- Xiao, T.; Zhou, W. The Third Generation Sequencing: The Advanced Approach to Genetic Diseases. Transl. Pediatr. 2020, 9, 163–173. [Google Scholar] [CrossRef] [PubMed]
- Athanasopoulou, K.; Boti, M.A.; Adamopoulos, P.G.; Skourou, P.C.; Scorilas, A. Third-Generation Sequencing: The Spearhead towards the Radical Transformation of Modern Genomics. Life 2021, 12, 30. [Google Scholar] [CrossRef]
- Hoang, M.T.V.; Irinyi, L.; Hu, Y.; Schwessinger, B.; Meyer, W. Long-Reads-Based Metagenomics in Clinical Diagnosis With a Special Focus on Fungal Infections. Front. Microbiol. 2022, 12, 708550. [Google Scholar] [CrossRef]
- Lloyd-Price, J.; Mahurkar, A.; Rahnavard, G.; Crabtree, J.; Orvis, J.; Hall, A.B.; Brady, A.; Creasy, H.H.; McCracken, C.; Giglio, M.G.; et al. Strains, Functions and Dynamics in the Expanded Human Microbiome Project. Nature 2017, 550, 61–66. [Google Scholar] [CrossRef]
- Algieri, F.; Tanaskovic, N.; Rincon, C.C.; Notario, E.; Braga, D.; Pesole, G.; Rusconi, R.; Penna, G.; Rescigno, M. Lactobacillus Paracasei CNCM I-5220-Derived Postbiotic Protects from the Leaky-Gut. Front. Microbiol. 2023, 14, 1157164. [Google Scholar] [CrossRef] [PubMed]
- Marzano, M.; Fosso, B.; Colliva, C.; Notario, E.; Passeri, D.; Intranuovo, M.; Gioiello, A.; Adorini, L.; Pesole, G.; Pellicciari, R.; et al. Farnesoid X Receptor Activation by the Novel Agonist TC-100 (3α, 7α, 11β-Trihydroxy-6α-Ethyl-5β-Cholan-24-Oic Acid) Preserves the Intestinal Barrier Integrity and Promotes Intestinal Microbial Reshaping in a Mouse Model of Obstructed Bile Acid Flow. Biomed. Pharmacother. 2022, 153, 113380. [Google Scholar] [CrossRef]
- The Human Microbiome Project Consortium Structure. Function and Diversity of the Healthy Human Microbiome. Nature 2012, 486, 207–214. [Google Scholar] [CrossRef] [PubMed]
- Piancone, E.; Fosso, B.; Marzano, M.; De Robertis, M.; Notario, E.; Oranger, A.; Manzari, C.; Bruno, S.; Visci, G.; Defazio, G.; et al. Natural and after Colon Washing Fecal Samples: The Two Sides of the Coin for Investigating the Human Gut Microbiome. Sci. Rep. 2022, 12, 17909. [Google Scholar] [CrossRef] [PubMed]
- Manzari, C.; Fosso, B.; Marzano, M.; Annese, A.; Caprioli, R.; D’Erchia, A.M.; Gissi, C.; Intranuovo, M.; Picardi, E.; Santamaria, M.; et al. The Influence of Invasive Jellyfish Blooms on the Aquatic Microbiome in a Coastal Lagoon (Varano, SE Italy) Detected by an Illumina-Based Deep Sequencing Strategy. Biol. Invasions 2015, 17, 923–940. [Google Scholar] [CrossRef]
- Callahan, B.J.; Wong, J.; Heiner, C.; Oh, S.; Theriot, C.M.; Gulati, A.S.; McGill, S.K.; Dougherty, M.K. High-Throughput Amplicon Sequencing of the Full-Length 16S RRNA Gene with Single-Nucleotide Resolution. Nucleic Acids Res. 2019, 47, e103. [Google Scholar] [CrossRef]
- Martin, M. CUTADAPT Removes Adapter Sequences from High-Throughput Sequencing Reads. EMBnet.J. 2011, 17, 10–12. [Google Scholar] [CrossRef]
- Callahan, B.J.; McMurdie, P.J.; Rosen, M.J.; Han, A.W.; Johnson, A.J.A.; Holmes, S.P. DADA2: High-Resolution Sample Inference from Illumina Amplicon Data. Nat. Methods 2016, 13, 581–583. [Google Scholar] [CrossRef]
- Pruesse, E.; Quast, C.; Knittel, K.; Fuchs, B.M.; Ludwig, W.; Peplies, J.; Glockner, F.O. SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB. Nucleic Acids Res. 2007, 35, 7188–7196. [Google Scholar] [CrossRef] [PubMed]
- Wang, Q.; Garrity, G.M.; Tiedje, J.M.; Cole, J.R. Naïve Bayesian Classifier for Rapid Assignment of RRNA Sequences into the New Bacterial Taxonomy. Appl. Environ. Microbiol. 2007, 73, 5261–5267. [Google Scholar] [CrossRef]
- Edgar, R.C.; Flyvbjerg, H. Error Filtering, Pair Assembly and Error Correction for next-Generation Sequencing Reads. Bioinformatics 2015, 31, 3476–3482. [Google Scholar] [CrossRef]
- Li, H. Minimap2: Pairwise Alignment for Nucleotide Sequences. Bioinformatics 2018, 34, 3094–3100. [Google Scholar] [CrossRef]
- Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve Years of SAMtools and BCFtools. GigaScience 2021, 10, giab008. [Google Scholar] [CrossRef]
- Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic Local Alignment Search Tool. J. Mol. Biol. 1990, 215, 403–410. [Google Scholar] [CrossRef] [PubMed]
- Altschul, S. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef] [PubMed]
- McGinnis, S.; Madden, T.L. BLAST: At the Core of a Powerful and Diverse Set of Sequence Analysis Tools. Nucleic Acids Res. 2004, 32, W20–W25. [Google Scholar] [CrossRef] [PubMed]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2012, 12, 2825–2830. [Google Scholar]
- Davis, N.M.; Proctor, D.M.; Holmes, S.P.; Relman, D.A.; Callahan, B.J. Simple Statistical Identification and Removal of Contaminant Sequences in Marker-Gene and Metagenomics Data. Microbiome 2018, 6, 226. [Google Scholar] [CrossRef]
- McMurdie, P.J.; Holmes, S. Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE 2013, 8, e61217. [Google Scholar] [CrossRef]
- Oksanen, J.; Guillaume, F.B.; Roeland, K.; Legendre, P.; Peter, M.; O’Hara, R.B.; Gavin, S.; Peter, S.; Stevenes, M.H.H.; Helene, W. Vegan: Community Ecology Package; R Package Version 2.5-6; The R Project for Statistical Computing: Ames, IA, USA, 2015. [Google Scholar]
- Gloor, G.B.; Macklaim, J.M.; Pawlowsky-Glahn, V.; Egozcue, J.J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 2017, 8, 2224. [Google Scholar] [CrossRef]
- Lin, H.; Peddada, S.D. Analysis of Compositions of Microbiomes with Bias Correction. Nat. Commun. 2020, 11, 3514. [Google Scholar] [CrossRef]
- Nejman, D.; Livyatan, I.; Fuks, G.; Gavert, N.; Zwang, Y.; Geller, L.T.; Rotter-Maskowitz, A.; Weiser, R.; Mallel, G.; Gigi, E.; et al. The Human Tumor Microbiome Is Composed of Tumor Type–Specific Intracellular Bacteria. Science 2020, 368, 973–980. [Google Scholar] [CrossRef]
- Ciuffreda, L.; Rodríguez-Pérez, H.; Flores, C. Nanopore Sequencing and Its Application to the Study of Microbial Communities. Comput. Struct. Biotechnol. J. 2021, 19, 1497–1511. [Google Scholar] [CrossRef] [PubMed]
- Portik, D.M.; Brown, C.T.; Pierce-Ward, N.T. Evaluation of Taxonomic Classification and Profiling Methods for Long-Read Shotgun Metagenomic Sequencing Datasets. BMC Bioinform. 2022, 23, 541. [Google Scholar] [CrossRef] [PubMed]
- Tourlousse, D.M.; Narita, K.; Miura, T.; Ohashi, A.; Matsuda, M.; Ohyama, Y.; Shimamura, M.; Furukawa, M.; Kasahara, K.; Kameyama, K.; et al. Characterization and Demonstration of Mock Communities as Control Reagents for Accurate Human Microbiome Community Measurements. Microbiol. Spectr. 2022, 10, e01915-21. [Google Scholar] [CrossRef] [PubMed]
- Liu, P.-Y.; Wu, W.-K.; Chen, C.-C.; Panyod, S.; Sheen, L.-Y.; Wu, M.-S. Evaluation of Compatibility of 16S RRNA V3V4 and V4 Amplicon Libraries for Clinical Microbiome Profiling. BioRxiv 2020. [Google Scholar] [CrossRef]
- Hsieh, Y.-P.; Hung, Y.-M.; Tsai, M.-H.; Lai, L.-C.; Chuang, E.Y. 16S-ITGDB: An Integrated Database for Improving Species Classification of Prokaryotic 16S Ribosomal RNA Sequences. Front. Bioinform. 2022, 2, 905489. [Google Scholar] [CrossRef]
- Quast, C.; Pruesse, E.; Yilmaz, P.; Gerken, J.; Schweer, T.; Yarza, P.; Peplies, J.; Glöckner, F.O. The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools. Nucleic Acids Res. 2012, 41, D590–D596. [Google Scholar] [CrossRef] [PubMed]
- Maidak, B.L.; Cole, J.R.; Parker, C.T.; Garrity, G.M.; Larsen, N.; Li, B.; Lilburn, T.G.; McCaughey, M.J.; Olsen, G.J.; Overbeek, R.; et al. A New Version of the RDP (Ribosomal Database Project). Nucleic Acids Res. 1999, 27, 171–173. [Google Scholar] [CrossRef] [PubMed]
- DeSantis, T.Z.; Hugenholtz, P.; Larsen, N.; Rojas, M.; Brodie, E.L.; Keller, K.; Huber, T.; Dalevi, D.; Hu, P.; Andersen, G.L. Greengenes, a Chimera-Checked 16S RRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol. 2006, 72, 5069–5072. [Google Scholar] [CrossRef]
- Hiergeist, A.; Ruelle, J.; Emler, S.; Gessner, A. Reliability of Species Detection in 16S Microbiome Analysis: Comparison of Five Widely Used Pipelines and Recommendations for a More Standardized Approach. PLoS ONE 2023, 18, e0280870. [Google Scholar] [CrossRef]
- Finotello, F.; Trajanoski, Z. Quantifying Tumor-Infiltrating Immune Cells from Transcriptomics Data. Cancer Immunol. Immunother. 2018, 67, 1031–1040. [Google Scholar] [CrossRef]
- Hou, K.; Wu, Z.-X.; Chen, X.-Y.; Wang, J.-Q.; Zhang, D.; Xiao, C.; Zhu, D.; Koya, J.B.; Wei, L.; Li, J.; et al. Microbiota in Health and Diseases. Sig. Transduct. Target Ther. 2022, 7, 135. [Google Scholar] [CrossRef] [PubMed]