CIDER-Seq stands for Circular DNA Enrichment Sequencing.
This is kind of obvious but it bears repeating. Viral nucleic acids in host organisms can be very diverse and this makes it difficult to sequence them accurately.
Currently there are two major methods to sequence viruses: a) short-read next-generation sequencing (NGS) and b) cloning and Sanger sequencing. Both have drawbacks: short-read NGS cannot perform de novo sequencing of full-length viral genomes (see Q1), and Sanger sequencing is expensive and requires a lot of manual work. In this study we've come up with a method that we think balances (at least to some extent!) the depth and economy of short-read NGS with the accuracy of Sanger sequencing.
We had some design criteria in mind when coming up with CIDER-Seq. We didn't want to require prior knowledge of a target virus (i.e. conserved sequences for PCR or restriction digests), and we didn't want to rely on a pre-existing (usually Sanger-sequenced) reference genome for assembly. In the end, our method does require some prior knowledge, primarily about the virus genome structure.
Our lab works on cassava, a very important tropical food crop. The majority of cassava production today occurs in Africa (though the crop is originally from South America). The problem in Africa, though, is that cassava plants are vulnerable to periodic virus diseases, primarily cassava mosaic disease. This disease is caused by the cassava geminiviruses, single-stranded DNA viruses that are actually composed of two genomic DNAs in a single infectious particle. For this study, we sampled symptomatic cassava leaves from 6 individuals in a field trial in Kenya (red circle on map) and attempted to sequence whole genomes of the infecting viruses.
Our new method has two main steps: enrichment and sequencing. The pipeline described above allows us to selectively enrich viral DNA from total DNA extracts of the infected host. It does this via an automated size-selection step followed by random circular amplification and de-branching steps. Random circular amplification is a very popular technique that uses the phage Phi29 DNA polymerase to amplify circular molecules primed with random oligonucleotides. This results in long, branched amplicons which often comprise multiple concatenated copies of the original templates. After enrichment, we use long-read SMRT sequencing (Pacific Biosciences Inc.). We sequenced two libraries, one in which step 8 is skipped (called NSS for non-size-selected) and one in which step 8 is performed (called SS for size-selected).
The raw sequence data we obtained from the SMRT run needed to be processed, mainly because we sequenced hyperbranched products arising from the random amplification step. Since no one had ever deep-sequenced these products directly before, we had to develop a custom analysis pipeline and de-concatenation algorithm (unimaginatively christened DeConcat) to get sensible-looking genome sequences.
Looking at the SMRT sequencing reads that met our quality and size thresholds, we found that we'd attained nearly complete enrichment of viral DNA in both libraries! Comparing the size distributions of sequencing reads before and after the DeConcat algorithm, we found that DeConcat is very effective at reducing concatenated sequences to their individual, real components.
Here's an overview of how DeConcat works. In step 1, we divide the raw sequence into two fragments, cutting first at the 30th nucleotide from the start. Next, we align the two fragments, either as they are or after reverse-complementing one of them. We record the alignment score and go back to step 1, repeating the cutting and alignment process with the cut position moved along in increments of 30 nucleotides. Once all possible cut positions have been tried, we select the highest-scoring alignment and proceed to the next step. Here, we observed that alignments can take the form of one of 8 possible cases. Each is resolved in the manner shown. The case resolution is done so as to maintain the original sequence topology as much as possible. Once the resolution is completed, we pass the now shorter sequence back through the entire process, starting with step 1.
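To make the cut-and-align scan a bit more concrete, here is a minimal Python sketch of that first stage only. It is not the DeConcat code itself: the function names are made up for illustration, and the alignment score is a simple stand-in (the length of the longest substring shared by the two fragments) rather than whatever aligner the real pipeline uses. The resolution of the 8 alignment cases is shown on the poster and is not reproduced here.

```python
from difflib import SequenceMatcher

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_score(a, b):
    """Stand-in alignment score (an assumption, not the real scoring):
    the length of the longest substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def best_cut(seq, step=30):
    """Scan cut positions at multiples of `step` and return a tuple
    (score, cut, orientation) for the highest-scoring split."""
    best = (0, None, None)
    for cut in range(step, len(seq), step):
        left, right = seq[:cut], seq[cut:]
        candidates = (
            ("forward", right),
            ("reverse", reverse_complement(right)),
        )
        for orientation, other in candidates:
            score = overlap_score(left, other)
            if score > best[0]:
                best = (score, cut, orientation)
    return best


# Toy 60-nt "genome" used to build artificial concatemers.
unit = (
    "ATGGCGTACC" "TGAAGTCCGA" "TTAGCCATGC"
    "AAGTCTGGAT" "CCGTTACGAG" "CTTAACGGTC"
)
head_to_tail = best_cut(unit + unit)
inverted = best_cut(unit + reverse_complement(unit))
```

On these toy inputs the scan picks the cut at the junction between the two copies, in the forward orientation for the head-to-tail concatemer and in the reverse-complement orientation for the inverted one. A real implementation would then resolve the winning alignment and feed the shortened sequence back into the scan.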
From our dataset we obtained a total of 270 full-length cassava geminiviral genomes. We phased these genomes so that they all start at the same position and then performed taxonomic and phylogenetic analyses to work out the likely identities of the virus species and strains in our dataset.
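Phasing a circular genome amounts to rotating its linearised sequence so that it begins at a chosen landmark. The sketch below illustrates the idea in Python; the function name is hypothetical, and the choice of anchor is an assumption. The example uses the nonanucleotide TAATATTAC that is conserved at the geminivirus origin of replication, but the poster does not say which start position was actually used. Only the given strand is searched, so a sequence in the opposite orientation would need to be reverse-complemented first.

```python
def phase_circular(seq, anchor):
    """Rotate a linearised circular sequence so that it starts at the
    first occurrence of `anchor` (assumed to be a non-empty string)."""
    # Append the first len(anchor) - 1 bases so that an anchor spanning
    # the linearisation point can still be found.
    extended = seq + seq[:len(anchor) - 1]
    pos = extended.find(anchor)
    if pos == -1:
        raise ValueError("anchor not found on this strand")
    return seq[pos:] + seq[:pos]


# Two linearisations of the same toy circle give the same phased result.
ANCHOR = "TAATATTAC"  # assumed landmark, see the note above
phased_a = phase_circular("GGTAATATTACCC", ANCHOR)
phased_b = phase_circular("TTACCCGGTAATA", ANCHOR)
```

Because every genome is rotated to the same landmark, sequences of the same molecule that were linearised at different points become directly comparable for alignment and phylogenetics.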
Our marketing slide! We obtained full-length genomes, each from sequencing a single molecule. This means we've completely eliminated all assembly steps. We also show a cost comparison with the previous Sanger sequencing approach used by our lab and others. Finally, in order to annotate our genomes, we did have to use the reference sequence, but this is not required for simply obtaining the genome sequences, and annotation can also be performed using distant 'reference' viruses.

CIDER-Seq: unbiased virus enrichment and single-read, full-length genome sequencing

By Devang Mehta, Matthias Hirsch-Hoffmann, Andrea Patrignani, Wilhelm Gruissem, Hervé Vanderschuren

Link to Manuscript


Questions:

Our method is optimised for circular DNA viruses, of which there are a lot (see below). In terms of impact, the most important potential target is the Papillomaviruses. However, we can also imagine sequencing plasmids with this method, perhaps directly from environmental samples. We also have some thoughts on adapting this method to other virus types including RNA viruses. 

As with any NGS experiment, our method is not completely error-free. We apply a very high initial quality cut-off to our raw PacBio reads (99.9% predicted accuracy), which, for the PacBio-literate, means our reads have gone through Circular Consensus Sequencing with a minimum of 13 passes!
We did, however, find errors when we tried to annotate our sequences. These were mainly frameshift errors, which we confirmed by analysing reads with lower quality thresholds. Even so, we detect only 1 frameshift error per sequence in our SS library, in line with the predicted accuracy of 99.9%.
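As a rough sanity check on those numbers, the short Python sketch below converts a predicted accuracy into a Phred-style quality value and into an expected number of errors per genome. It assumes that the 99.9 cut-off means 99.9% per-base accuracy and that a genome component is about 2,800 nucleotides long; both figures are illustrative assumptions rather than values taken from the poster, and the function names are made up.

```python
import math


def phred_from_accuracy(accuracy):
    """Phred-scaled quality for a per-base accuracy between 0 and 1."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(accuracy, length):
    """Expected number of erroneous bases in a read of `length` bases,
    assuming errors are independent and uniformly distributed."""
    return (1 - accuracy) * length


ASSUMED_ACCURACY = 0.999      # assumption: 99.9 read as 99.9% accuracy
ASSUMED_GENOME_LENGTH = 2800  # assumption: approximate component size in nt

quality = phred_from_accuracy(ASSUMED_ACCURACY)
errors = expected_errors(ASSUMED_ACCURACY, ASSUMED_GENOME_LENGTH)
```

Under these assumptions the cut-off corresponds to roughly Q30 and to a handful of expected errors of any kind per genome, so seeing about one frameshift per sequence is not surprising.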

More details in the preprint:

Short-read metagenomic data is pretty great for profiling diversity, but you're limited to 250 base-pair-long contigs, and even shorter raw reads. These then have to be assembled into full-length genomes, often several thousand base-pairs long. Since, at least for the viruses we work on, individual strains differ by only around 100 base-pairs (and these of course are constantly evolving), it can be next to impossible to assemble individual strains correctly starting from short sequencing reads. This problem has recently been recognised by several virologists in a consensus statement:

Simmonds, P. et al. Consensus statement: Virus taxonomy in the age of metagenomics. Nat. Rev. Microbiol. 15, 161–168 (2017).

Cassava is a fascinating crop; I recommend the following resources.


A couple of recent reviews from our labs on cassava diseases and how to tackle them: 

1. Annual Review of Virology ($)

2. Current Opinion in Plant Biology ($)

Check out the attached pre-print; we have a pretty comprehensive methods section. We'll also be uploading protocols soon.

Yes, as long as you have access to some computing resources! Take a look at our GitHub page, which has complete installation and usage instructions for our customisable software.