Pre-print Review of NanoAmpli-Seq

I was excited to see the preprint from Calus and colleagues describing NanoAmpli-Seq ¹. This is a method of sequencing long amplicons using the Oxford Nanopore sequencing platform. For my set of applications within microbial ecology, this exciting sequencing platform still appears to be a method in search of an application. This preprint lays out an improved method of sequencing full-length 16S rRNA genes. This is an important issue because (as they note) the number of full-length sequences going into our reference databases is slowing and is unlikely to be representative of the diversity we are now seeing in surveys using MiSeq to sequence fragments of the 16S rRNA gene. Further, we’d really like to have longer reads for improved classifications. Reading the Introduction, one will see that my previous work developing methods for sequencing 16S rRNA genes using the MiSeq and PacBio figures prominently in their motivation. It should also be noted that I do not know the current status of this manuscript and have not been invited to review it for a journal.

The authors do an admirable job of tempering expectations and pointing out that the sequence quality is still not to the level that we find on other platforms. The authors mention that they get a sequencing accuracy of 99.5%, in contrast to the 99.98% accuracy we see with the other methods. In some ways this manuscript reads like, “We’ve done our best to solve the error rate problem, here’s where we are, take it from here.” These types of “landmark” papers are important, but I can’t help but think of things to try. Perhaps other approaches were attempted (they mention three INC-Seq aligners), but they don’t seem to be mentioned and there is not an extensive description of any parameter sweep tests.

I think it would be helpful if the authors could improve their legend for Figure 3 - this is the critical figure for describing the method. The authors should note that the A, B, C of the legend seem to correspond with the three shaded boxes, not the A, B, C, … J within those boxes. The method appears to run the output from the Nanopore sequencer through the INC-Seq software and use that as the starting point for their flow with chopSeq. My understanding of the first step in D is that it re-orients the reads and trims them to start and end with the correct primers. They then remove the tandem repeats. Instead, I wonder why the authors didn’t start over with the INC-Seq software to make a better assembler that is aware of the primers and other issues from sequencing 16S rRNA genes. In our development of the PacBio pipeline, the creation of the consensus sequence made the biggest impact. As PacBio improved their assembler, the data quality became far better than anything we could do. If they did this, the authors could calculate better quality scores and assess an aggregate consensus sequence quality score that could be used to filter the consensus sequences.

On P14 they state, “This suggests that consensus sequence accuracy is reliably high only for OTUs where a minimum of 50 reads are available for use in constructing the consensus sequence” and on the next page that they used a three concatemer threshold set for INC-Seq. Given the ability to generate massively long reads on the Nanopore, why not run the sequencer longer to sequence more concatemers? Also, what happens to the error rate when the authors require more concatemers? Again, the PacBio aggregate quality score for a consensus sequence is linked to the level of coverage. I’m wondering if such information could be obtained either from the INC-Seq software or from making their own version of the assembler.

As mentioned above, I found the overall description of the bioinformatics methods to be jargony and a bit light on details. First, I was a bit confused by the authors’ description of why they partitioned the consensus reads into thirds for the nanoClust step. I’m also not clear how this would work - did they cluster the three partitions separately and then bring them back together somehow? Second, they removed singletons, which probably deflates their error rate relative to my reported PacBio error rates. I know that this is contentious, but I think that removing singletons from a ‘real’ sample would be pretty risky and likely to create a bias against rare organisms in poorly sequenced samples. Third, I wonder why the authors didn’t align the sequences prior to getting a consensus sequence using something like a NAST-based profile alignment. They could then cluster similar sequences together using something like oligotyping or our pre-clustering method. This should be considerably faster (and I suspect more robust) than VSEARCH followed by MAFFT.

Another problem that the authors do not mention is the possible biased abundances generated by RCA. They assume that RCA followed by fragmentation and debranching would yield the same number of fragments per piece of circularized DNA. I don’t know that this is true. I wonder whether random barcodes could be added to the PCR primers so that when the fragments are amplified, circularized, fragmented, and sequenced, it would be possible to know which fragments came from the same RCA reaction. That way each RCA reaction could only be used once in downstream analyses.
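
To make the barcode idea a bit more concrete, here is a minimal sketch in Python of how fragments might be collapsed so that each RCA reaction is counted once. This is my own illustration, not something from the preprint: the `(barcode, sequence)` input format, the function name `collapse_by_barcode`, and the choice of the longest fragment as the representative are all assumptions, and the toy sequences are made up.

```python
from collections import defaultdict


def collapse_by_barcode(reads):
    """Group fragments by random barcode and keep one representative each.

    `reads` is an iterable of (barcode, sequence) pairs. This pairing is a
    hypothetical input format; a real pipeline would first have to extract
    the barcode from each read.
    Returns a dict: barcode -> (representative sequence, fragment count).
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)

    collapsed = {}
    for barcode, sequences in groups.items():
        # Assumption: use the longest fragment as the representative.
        # Building a consensus from all fragments would be an alternative.
        representative = max(sequences, key=len)
        collapsed[barcode] = (representative, len(sequences))
    return collapsed


# Toy data: two fragments share a barcode, so they came from the same
# circularized molecule and should only be counted once.
reads = [
    ("ACGTAC", "AGAGTTTGATCCTGG"),
    ("ACGTAC", "AGAGTTTGATCC"),
    ("TTGACA", "AGAGTTTGATCATGG"),
]
collapsed = collapse_by_barcode(reads)
for barcode, (sequence, n_fragments) in collapsed.items():
    print(barcode, sequence, n_fragments)
```

The fragment count per barcode would also give a direct look at how uneven the RCA yield is across molecules, which is the assumption I am unsure about.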

I wonder whether the authors included chimeric sequences when calculating their error rates. Chimeras are not sequencing errors and should not be included in calculating the error rates. This may help to reduce their error rate a bit.
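
As a rough illustration of why this matters, the sketch below computes a pooled error rate with and without reads flagged as chimeric. The record fields (`mismatches`, `aligned_length`, `is_chimera`) and the numbers are hypothetical and chosen only to show the effect; they are not taken from the preprint.

```python
def error_rate(reads, include_chimeras=False):
    """Pooled error rate: total mismatches divided by total aligned bases.

    `reads` is a list of dicts with hypothetical keys 'mismatches',
    'aligned_length', and 'is_chimera'. Reads flagged as chimeric are
    skipped unless include_chimeras is True.
    """
    kept = [r for r in reads if include_chimeras or not r["is_chimera"]]
    total_errors = sum(r["mismatches"] for r in kept)
    total_bases = sum(r["aligned_length"] for r in kept)
    return total_errors / total_bases if total_bases else float("nan")


# Made-up example: one chimera with many apparent "errors" against its
# best-matching reference inflates the pooled rate.
reads = [
    {"mismatches": 5, "aligned_length": 1000, "is_chimera": False},
    {"mismatches": 10, "aligned_length": 1000, "is_chimera": False},
    {"mismatches": 85, "aligned_length": 1000, "is_chimera": True},
]
with_chimeras = error_rate(reads, include_chimeras=True)
without_chimeras = error_rate(reads)
print(f"with chimeras:    {with_chimeras:.4f}")
print(f"without chimeras: {without_chimeras:.4f}")
```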

The authors are to be commended for providing their detailed methods as supplemental materials; this is excellent. One thing we learned from publishing our Kozich methods was that, in addition to this, it would be great to provide a link to a GitHub page that has the “live” version of the method with any recent updates they’ve made to the protocol. We have a GitHub page for ours now, but wish we had included the link to the page in our paper, since the one in the supplement is now quite out of date.

Some smaller points…

There are other methods besides PacBio for generating full-length 16S rRNA gene sequences using HiSeq. Perhaps it would be worth mentioning these in passing? They cite using EMIRGE to extract 16S rRNA gene sequences from metagenomic libraries, but it has also been used to stitch together short amplicon data (doi: 10.1371/journal.pone.0056018).
It’s hard to keep track of what generation we’re on! Instead of using “Second” and “Third” generation in the abstract and introduction, how about just using the platform names? Also, the generation model implies one generation is better than another when the authors’ data indicate that “better” still depends on the application.
The Abstract is jargony. There are a lot of terms used that are not defined when someone reads only the abstract. What is INC-Seq? The acronym is spelled out, but can the authors give a brief description of what it is? What is “chopSeq”? What is “nanoClust”?
On P11: “Inspections of the read to reference alignment length ratio indicated that the major source of sequence error for both INC-Seq and chopSeq corrected reads originated from deletions; i.e. percent similarity of the read to the reference decreased in proportion to the read to reference alignment ratio for all experiments and INC-Seq aligners us”. I don’t see how the “i.e.” clause explains the first part of the sentence.

¹ I have posted a copy of this review at bioRxiv. Please post any comments there.