Please use this identifier to cite or link to this item: http://hdl.handle.net/10125/33378

Multiple-Genome Annotation of Genome Fragments Using Hidden Markov Model Profiles

File: ThesisMarkFinal.pdf
Size: 2.63 MB
Format: Adobe PDF

Item Summary

Title: Multiple-Genome Annotation of Genome Fragments Using Hidden Markov Model Profiles
Authors: Menor, M.; Baek, K.; Poisson, G.
Issue Date: 01 Jan 2008
Series/Report no.: ICS2008-01-01
Abstract: To learn more about microbes and overcome the limitations of standard culture-based methods, microbial communities are being studied in an uncultured state. In such metagenomic studies, genetic material is sampled from the environment and sequenced using the whole-genome shotgun sequencing technique. This results in thousands of DNA fragments that need to be identified so that the composition and inner workings of the microbial community can begin to be understood. Those fragments are then assembled into longer sequences. However, the high diversity present in an environment and the often low level of genome coverage achieved by the sequencing technology result in a low number of assembled fragments (contigs) and many unassembled fragments (singletons). The identification of contigs and singletons is usually done using BLAST, which searches a database for sequences similar to them. An expert may then manually read these results and decide whether the function and taxonomic origin of each fragment can be determined. In this report, an automated system called Anacle is developed to annotate the unassembled fragments according to a taxonomy before the assembly process. Knowledge of which proteins can be found in each taxon is built into Anacle by clustering all known proteins of that taxon. The annotation performance obtained with Markov clustering (MCL) and with Self-Organizing Maps (SOM) is investigated and compared. Each resulting protein cluster can be represented by a Hidden Markov Model (HMM) profile. Thus a "skeleton" of the taxon is generated, with the profile HMMs providing a summary of the taxon's genetic content. The experiments show that (1) MCL is superior to SOMs in both annotation performance and running time, (2) Anacle achieves good performance in taxonomic annotation, and (3) Anacle is able to generalize, since it can correctly annotate fragments from genomes not present in the training dataset. These results indicate that Anacle can be very useful to metagenomics projects.
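The clustering step named in the abstract can be illustrated with a minimal sketch of Markov clustering (MCL) in plain Python. This is not Anacle's implementation. The function names, the self-loop weight, the inflation value of 2.0, and the toy similarity matrix are all assumptions made for illustration, and the abstract does not state how the pairwise protein similarities are obtained.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster items from a symmetric similarity matrix (hypothetical helper).

    Alternates expansion (matrix squaring) and inflation (elementwise power
    followed by column normalization) until the matrix stops changing.
    """
    n = len(similarity)
    # Assumption: add a self-loop of weight 1.0 to every node before normalizing.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        expanded = matmul(m, m)
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row marks an attractor; its non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: proteins 0-2 are mutually similar, as are proteins 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
protein_clusters = mcl(toy_similarity)
```

In a pipeline like the one the abstract describes, each resulting cluster would then be summarized by a profile HMM. That step is outside the scope of this sketch.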
URI/DOI: http://hdl.handle.net/10125/33378
Rights: CC0 1.0 Universal
Appears in Collections: Technical Reports


