CIDER-Seq stands for Circular DNA Enrichment Sequencing.
This is kind of obvious but it bears repeating. Viral nucleic acids in host organisms can be very diverse and this makes it difficult to sequence them accurately.
Currently there are two major methods to sequence viruses: a) short-read next-generation sequencing (NGS) and b) cloning and Sanger sequencing. Both have drawbacks: short-read NGS cannot perform de novo sequencing of full-length viral genomes (see Q1), and Sanger sequencing is terribly expensive and requires a lot of manual work.
In this study we've come up with a method that we think balances (at least to some extent!) the depth/economy of short-read NGS with the accuracy of Sanger sequencing.
We had some design criteria in mind when coming up with CIDER-Seq. We didn't want to require users to have prior knowledge of a target virus, i.e. conserved sequences for PCR or restriction digests, and we didn't want to rely on a pre-existing (usually Sanger-sequenced) reference genome for assembly.
In the end, our method does require some prior knowledge, primarily about the virus genome structure.
Our lab works on cassava, a very important tropical food crop. A majority of cassava production today occurs in Africa (though the crop is originally from South America). The problem in Africa, though, is that cassava plants are vulnerable to periodic virus diseases, primarily cassava mosaic disease. This disease is caused by the cassava geminiviruses, single-stranded DNA viruses that are actually composed of two genomic DNAs in a single infectious particle.
For this study, we sampled symptomatic cassava leaves from 6 individuals in a field trial in Kenya (red circle on map) and attempted to sequence whole genomes of the infecting viruses.
Our new method has two main steps: enrichment and sequencing. The pipeline described above allows us to selectively enrich viral DNA from total DNA extracts of the infected host. It does this via an automated size-selection step followed by random circular amplification and de-branching steps. Random circular amplification is a very popular technique that uses the phage Phi29 DNA polymerase to amplify circular molecules primed with random oligonucleotides. This results in long, branched amplicons, which often comprise multiple concatenated copies of the original templates. After enrichment, we use long-read SMRT sequencing (Pacific Biosciences Inc.). We sequenced two libraries: one in which step 8 is skipped (called NSS, for non-size-selected) and one in which step 8 is performed (called SS, for size-selected).
The raw sequence data we obtained from the SMRT run needed to be processed, mainly because we sequenced hyperbranched products arising from the random amplification step. Since no one had ever deep-sequenced these products directly before, we had to develop a custom analysis pipeline and de-concatenation algorithm (unimaginatively christened DeConcat) to get sensible looking genome sequences.
Looking at the SMRT sequencing reads that met our quality and size thresholds, we found that we'd attained nearly complete enrichment of viral DNA in both libraries!
Looking at the size distributions of sequencing reads before and after the DeConcat algorithm, we found that DeConcat is very effective at reducing concatenated sequences to their individual, real components.
Here's an overview of how DeConcat works. Basically, we first cut the raw sequence into two parts at the 30th nucleotide from the start. Next, we align these two fragments, either as they are or after reverse complementing one of them. We record the alignment score and go back to step 1, repeating the cutting and alignment process with the cut position moved along in increments of 30 nucleotides. Once all possible iterations are completed, we select the highest-scoring alignment and proceed to the next step. Here, we observed that alignments can take the form of one of 8 possible cases. Each is resolved in the manner shown. The case resolution is done so as to maintain the original sequence topology as much as possible. Once the resolution is completed, we pass the now shorter sequence back through the entire process, starting with step 1.
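To make the cut-and-align scan a bit more concrete, here is a minimal Python sketch of that first stage only. It is not the actual DeConcat code: the function names, the pure-Python local alignment scorer and its match/mismatch/gap values are all assumptions made for illustration, and the 8-case resolution step is left out entirely.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def local_alignment_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best Smith-Waterman local alignment score between a and b.

    The scoring values are illustrative assumptions, not the ones
    used in the study.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def best_cut(read, step=30):
    """Scan cut positions at multiples of `step` and return the best one.

    At each cut the read is split into a left and a right fragment, and
    the left fragment is aligned against the right fragment both as-is
    and reverse complemented. Returns (score, cut, orientation).
    """
    best = (0, None, None)
    for cut in range(step, len(read), step):
        left, right = read[:cut], read[cut:]
        candidates = (
            ("forward", right),
            ("reverse", reverse_complement(right)),
        )
        for orientation, candidate in candidates:
            score = local_alignment_score(left, candidate)
            if score > best[0]:
                best = (score, cut, orientation)
    return best


# Toy example: a 60 nt "genome" concatenated with itself.
rng = random.Random(0)
unit = "".join(rng.choice("ACGT") for _ in range(60))
score, cut, orientation = best_cut(unit + unit)
print(score, cut, orientation)
```

In this toy case the scan picks the cut at position 60 in the forward orientation, because that is the only cut where both fragments are complete copies of the unit. A real implementation would then classify the winning alignment into one of the 8 cases and trim the read accordingly before running the scan again.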
From our dataset we obtained a total of 270 full-length cassava geminiviral genomes. We phased these genomes so that they all start at the same position and then performed some taxonomic and phylogenetic analysis to determine the likely identities of the virus species and strains in our dataset.
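Phasing a circular genome just means rotating it so that every sequence begins at the same landmark. The sketch below shows one way this could be done. It assumes the landmark is the conserved geminivirus nonanucleotide TAATATTAC, which is our choice for illustration and not necessarily what the study's pipeline uses; the function name `phase_genome` is likewise hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def phase_genome(genome, anchor="TAATATTAC"):
    """Rotate a circular genome so that it starts at `anchor`.

    Both strands are checked, and the search wraps around the end of
    the linearised sequence so an anchor spanning the junction is
    still found. Returns None if the anchor is absent from both
    strands. The default anchor is an assumption for illustration.
    """
    for strand in (genome, reverse_complement(genome)):
        # Append the first len(anchor) - 1 bases to cover the junction.
        wrapped = strand + strand[: len(anchor) - 1]
        idx = wrapped.find(anchor)
        if idx != -1:
            return strand[idx:] + strand[:idx]
    return None


# Two linearisations of the same toy circle give the same phased result.
print(phase_genome("GGCCTAATATTACAAGT"))
print(phase_genome("TTACAAGTGGCCTAATA"))
```

Once all genomes start at the same position and are on the same strand, they can be aligned and compared directly, without rotation or strand differences showing up as spurious variation.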
Our marketing slide!
We obtained full length genomes, each from sequencing single molecules. This means we've completely eliminated all assembly steps.
A cost comparison with the previous Sanger sequencing approach used by our lab and others.
Finally, in order to annotate our genomes, we did have to use the reference sequence, but this is not required for simply obtaining the genome sequences. And annotation can also be performed using distant 'reference' viruses.