Designing Filtering Strategies for Faster Protein and RNA Annotation
Thursday, April 17, 2008
9:45 a.m.-10:45 a.m.
Host: Rong Jin
With the availability of sequenced genomes for multiple species, an urgent task
today is to decipher the biological functions of these sequences.
Annotating genomic sequence function helps us understand the genetic background
of complex diseases and thus aids drug design. The state-of-the-art method for
function annotation is to compare a query sequence against a database of
sequences with known functions. However, the
high computational cost of comparison algorithms and the sheer amount of
genomic data pose a great challenge for genome function analysis. For example,
it takes several CPU months to compare a bacterial genome with a database of
noncoding RNA sequence families.
In this talk, I will present systematic filter design methods for accelerating protein and noncoding RNA function annotation. A filter excludes the large portion of the database that is unlikely to be related to the query, so that the expensive comparisons are conducted only on regions likely to share functional similarity. The computational challenge lies in choosing, from a large design space, filters with an optimal tradeoff between sensitivity and specificity. I will first present our filters based on regular expression patterns and weight matrices for protein annotation. Then, I will focus on designing secondary structure profiles to accelerate noncoding RNA annotation. Our experiments demonstrate that, with our designed filters, a protein sequence annotation program based on profile hidden Markov models achieves a 20- to 35-fold speedup, and a noncoding RNA annotation program based on stochastic context-free grammars achieves a speedup of over 100-fold on average. I will conclude with an overview of my research interests and plans for future work.
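The general filter-then-compare idea can be illustrated with a minimal Python sketch. This is not the speaker's method: the function names (`filtered_search`, `filter_stats`), the toy motif `C..C`, and the cysteine-counting stand-in for an expensive scoring model are all assumptions made up for illustration. A cheap regular-expression filter decides which database sequences reach the costly scorer, and a second helper measures the filter's sensitivity and specificity against a known set of true family members.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (sequence name, sequence); hypothetical layout


def filtered_search(
    database: Iterable[Record],
    pattern: str,
    expensive_score: Callable[[str], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Run expensive_score only on sequences that pass the cheap regex filter."""
    regex = re.compile(pattern)
    hits = []
    for name, seq in database:
        if regex.search(seq) is None:
            continue  # filtered out: the expensive comparison is skipped
        score = expensive_score(seq)
        if score >= threshold:
            hits.append((name, score))
    return hits


def filter_stats(
    database: Iterable[Record], pattern: str, positives: Set[str]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of the regex filter alone."""
    regex = re.compile(pattern)
    tp = fn = fp = tn = 0
    for name, seq in database:
        passed = regex.search(seq) is not None
        if name in positives:
            tp += passed
            fn += not passed
        else:
            fp += passed
            tn += not passed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy_db = [
        ("seq1", "MKCAACGH"),
        ("seq2", "MKLVAGH"),
        ("seq3", "CAWCQQC"),
        ("seq4", "MCKLLLH"),
    ]
    # Stand-in for a costly model such as a profile HMM: count cysteines.
    toy_score = lambda seq: seq.count("C")
    print(filtered_search(toy_db, r"C..C", toy_score, threshold=3))
    print(filter_stats(toy_db, r"C..C", positives={"seq3"}))
```

In this toy setting a stricter pattern would send fewer sequences to the scorer but risks dropping true family members, which is the sensitivity/specificity tradeoff the abstract refers to.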