PhD Proposal: Genome scaffolding using emerging sequencing technologies and graph based methods

Jay Ghurye
04.21.2017 10:00 to 11:30
CBCB 3118

Genome assembly is a critical step in most biological sequence analyses, as it gives a nearly complete picture of the genome sequence. All current sequencing technologies share a fundamental limitation: the sequences read from the genome by a sequencer are much shorter than the entire genome.

Next generation sequencing has dramatically reduced the cost of sequencing compared to the traditional whole genome sequencing approach, thereby enabling the generation of billions of reads at a very low cost per sequenced base. Despite recent progress in de novo assembly algorithms, the quality of short read assemblies is far from what is necessary for effective further analysis because of a fundamental limit: the read length is shorter than the repeat length for the majority of repeat families.

Recent advances in single molecule sequencing technologies have provided reads almost 100 times longer than next generation methods. These long sequences have tremendous potential to dramatically improve genome and transcriptome assembly. Although long read technologies have made the resolution of highly repetitive regions possible, contigs generated from long read assemblies do not always span a complete chromosome or even an arm of the chromosome. To address this issue, we have developed a new method called SALSA that exploits the information of genomic proximity in Hi-C data sets for long range scaffolding of de novo genome assemblies.

The scaffolding problem is further complicated in the context of metagenomic assembly of microbial communities, where the goal is to assemble multiple genomes simultaneously. The complicating factors include inter-genomic repeats, coverage ambiguities and polymorphic regions in the genomes. Most existing mate pair based scaffolding methods are designed to work with single genome assemblies and hence do not address these issues. We have developed Bambus 3, a scaffolding pipeline that accurately detects repeats and variations in metagenomic sequences.

As genome assembly quality is instrumental for most aspects of bioinformatics, we envision these developments will help novel biological discovery.

Examining Committee:

Chair: Dr. Mihai Pop

Dept rep: Dr. Aravind Srinivasan

Members: Dr. Michael Cummings