Most statistical gene prediction programs require a set of
parameters, estimated based on a training set of DNA sequences with genes
clearly marked. What are the two major experimental methods used to reliably
find a gene?
A single nucleotide substitution at which position in a codon would most likely have the
greatest impact on the function of the encoded protein: the first, the
second, or the third? Why?
Which of the following point mutations would most likely have
the greatest impact on the function of the encoded protein: a
single-nucleotide substitution (e.g., A mutates to G) or a
single-nucleotide deletion (e.g., A is deleted from the sequence)? Why?
Find the intron(s) in the "world's shortest
intron-containing gene". In addition, spell out the amino acid sequence
it encodes.
ATGCCGTCTAGGTAA
Although the genetic code is nearly universal, organisms usually have their
own preferences for codon usage. For example, the web site http://www.molbiol.ox.ac.uk/~cocallag/refdata_html/codonusagetable.shtml
gives statistics on the codon usage of Escherichia coli. Your colleague has
an EST fragment from E. coli with the following sequence: AAGUCAUUAUUUUCG.
Assuming this is the coding strand, can you help her identify the most likely
translation frame?
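One way to organize the comparison is to split the fragment into its three forward reading frames and score each frame against a codon usage table. The sketch below is only illustrative: the function names `reading_frames` and `frame_log_score` are hypothetical, and the `usage` dictionary (codon to relative frequency) is a placeholder that you would fill in yourself from the E. coli table; no frequencies are supplied here.

```python
import math

def reading_frames(rna):
    """Return the complete codons of the three forward reading frames."""
    frames = []
    for offset in range(3):
        # Step through the sequence in triplets, dropping any incomplete tail.
        codons = [rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)]
        frames.append(codons)
    return frames

def frame_log_score(codons, usage, floor=1e-6):
    """Sum of log relative codon frequencies for one frame.

    `usage` maps codon -> relative frequency (assumed to be supplied by the
    reader); codons missing from the table receive the small `floor` value.
    """
    return sum(math.log(usage.get(codon, floor)) for codon in codons)

frames = reading_frames("AAGUCAUUAUUUUCG")
for offset, codons in enumerate(frames):
    print(offset, codons)
```

Summing log frequencies treats codons as independent, which is a simplifying assumption. Frames with different numbers of complete codons are not directly comparable unless the score is normalized, for example by the number of codons.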
The identification of exon-intron junctions is a major challenge to gene
prediction algorithms. Conventionally, position weight matrices (PWMs)
or profiles have been used for this task. One bioinformatics graduate
student, knowing that the branch point 20-50 nucleotides upstream of the
3' splice site is a more biologically important signal than the 3' splice
site itself, wants to build a PWM for the 60 nucleotides upstream of the
3' splice site. The student wishes to capture the signal around the branch
point, but finds nothing. Can you explain why?
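For reference, a PWM can be built from a set of equal-length sequences that are aligned on a common anchor, so that each column describes one fixed offset from that anchor. The following is a minimal sketch under stated assumptions: the four training windows are made-up toy strings (not real splice sites), the background is uniform, the pseudocount is 1, and the names `build_pwm` and `score_site` are hypothetical.

```python
import math
from collections import Counter

def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    total = len(aligned) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        # Count the bases observed in this alignment column.
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in alphabet
        }
        pwm.append(column)
    return pwm

def score_site(pwm, site):
    """Sum the per-column log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))

# Toy, invented windows ending in AG; for illustration only.
toy_sites = ["TTTCAG", "CTTCAG", "TCTTAG", "TTCCAG"]
pwm = build_pwm(toy_sites)
print(score_site(pwm, "TTTCAG"), score_site(pwm, "GGGGGG"))
```

A window that resembles the training set scores higher than an unrelated one. The score depends entirely on which base sits in which column.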
There are two general strategies for performing gene prediction:
similarity-based approaches and statistics-based approaches. Explain which
genes are likely to be missed by the statistics-based approach and by the
similarity-based approach.
Sequence homology or similarity information is used in both the
similarity-based and the comparative genomics approaches for gene prediction.
What is the difference between these two approaches?
What is a pseudogene? Why do gene prediction algorithms have difficulty
distinguishing pseudogenes from true genes?
Genscan
Genscan is one of the best-known gene-finding programs. The UCSC Genome Browser
(http://genome.ucsc.edu) is a convenient graphical visualization tool for genome
annotations. In this practical exercise you are given three sequences from the human genome (genomic_dna1.fa, genomic_dna2.fa, genomic_dna3.fa).
Your tasks will be to use the gene prediction program Genscan (http://genes.mit.edu/GENSCAN.html,
or http://bioweb.pasteur.fr/seqanal/interfaces/genscan.html)
to find potential genes in these genomic sequences, and validate
your findings against the annotations using the UCSC browser.
Run Genscan. Go to the Genscan web site and paste in your sequence. Be sure
to check "Print predicted coding sequences (-cds)" so that you get the
predicted CDS, then submit and wait for the result.
Display the genomic regions on the UCSC Genome Browser.
You can display the predicted CDS on the UCSC Genome Browser. To display the
exon structure of the predicted CDS, you will need to use the BLAT program by
Jim Kent, which quickly searches the human genome for a sequence and returns
the genomic regions with high similarity to your query sequence.
Go to the UCSC Genome Browser home page, select BLAT and the human genome,
paste your predicted CDS, and submit. You will be brought to the display
window of the genome browser, where your sequence from the BLAT search is
shown together with several other annotation tracks. Take some time to
explore the browser, as it integrates a lot of information. Click on your
sequence in the BLAT search track to see the base-by-base display. Make sure
you can recognize the signals (start codon, stop codon, splicing signals, etc.).
Compare your predictions against the annotation for known genes and
predictions by other gene-finding algorithms.
Compare your predictions against the known genes track.
If the known genes track is not shown, you can display it by using the
drop-down controls under the browser window. Analyze the prediction results
for each sequence:
What is the performance at the exon level? That is, how many exons
are predicted? How many are missed? How accurate are the predicted exon
boundaries?
Genscan gives a probability score for each predicted exon. Are the
probability values for exons predicted by Genscan informative?
What is the accuracy at the nucleotide level? That is, how many nucleotides
in the coding regions are correctly predicted? You can go to the browser home
page, select Table Browser, and select the knownGene table to extract the
exact exon locations for the known genes.
Genomic sequences 1 and 2 share the same gene. What is this gene? Did Genscan
successfully predict this gene in both cases? How do the results compare with
the Genscan gene track in the browser? Can you explain why Genscan
gives different answers for this gene?
Why is your Genscan result for the third genomic sequence so different from
the annotated known genes? (Hint: look at other annotation tracks.)
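The exon-level and nucleotide-level comparisons asked for above can be tallied with a few lines of code once both exon lists are in hand. The sketch below is one possible bookkeeping scheme, not a prescribed method. It assumes exons are given as half-open (start, end) intervals on the same strand and in the same coordinate system. It counts an exon as correct only if both boundaries match exactly. It reports sensitivity (fraction of annotated items recovered) and specificity in the sense commonly used in gene-prediction evaluation (fraction of predicted items that are correct). The function names and the example coordinates are invented for illustration.

```python
def exon_level(predicted, annotated):
    """Exact-match exon sensitivity and specificity for (start, end) tuples."""
    pred, true = set(predicted), set(annotated)
    correct = len(pred & true)  # exons with both boundaries right
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity

def covered(exons):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in exons for pos in range(start, end)}

def nucleotide_level(predicted, annotated):
    """Per-base sensitivity and specificity over coding positions."""
    pred, true = covered(predicted), covered(annotated)
    true_positives = len(pred & true)
    sensitivity = true_positives / len(true) if true else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity

# Invented coordinates, for illustration only.
predicted_exons = [(100, 200), (300, 400), (500, 560)]
annotated_exons = [(100, 200), (310, 400), (700, 750)]
print(exon_level(predicted_exons, annotated_exons))
print(nucleotide_level(predicted_exons, annotated_exons))
```

Expanding intervals into sets of positions is simple and adequate for sequences of this size. For whole chromosomes an interval-overlap computation would be more economical.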
Algorithmic Questions
Smith and Waterman modified the global alignment algorithm by
allowing "free rides" from the start to anywhere in the middle of
the alignment grid, yielding the local alignment algorithm. When aligning
two genomic sequences containing orthologous genes from two organisms,
you may want to allow free rides from one node to any node downstream of it
(rightward or downward) in the alignment grid, in order to jump over introns.
Can you design an efficient dynamic programming algorithm that allows any
number of free rides? What about allowing at most k free rides?
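As a starting point, the standard local alignment recurrence that the question builds on can be written compactly. The sketch below computes only the best local score with a linear gap penalty; the scoring values (match 1, mismatch -1, gap -1) are arbitrary assumptions, and the code does not implement the intron-jumping free rides, which are left as the exercise.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option is the free ride from the start: an alignment
            # may begin at any cell at no cost.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            # Tracking the maximum over all cells lets the alignment end anywhere.
            best = max(best, score[i][j])
    return best

print(smith_waterman("TTACGTT", "GGACGGG"))
```

The table has (n + 1) by (m + 1) cells and each cell is filled in constant time, so the running time is proportional to the product of the sequence lengths.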
Finding a Family of Genes.
Suppose the genome of an organism has just been sequenced. Usually, a
general-purpose gene prediction algorithm would be used for de novo
annotation of this new genome sequence. This time, however, your task is to
design a computational strategy to find all genes of one family in this
genome. You can take advantage of all known gene sequences of the family
in other organisms. Read the articles by Manning et al. (2002) and
Claudel-Renard et al. (2003) and discuss how you would design your algorithm.
Suggested Reading
Burge, C. and Karlin, S. (1997) Prediction of complete
gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
Claudel-Renard, C., Chevalet, C., Faraut, T. and Kahn, D. (2003)
Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res.
31(22), 6633-6639.
Rivas, E. and Eddy, S.R. (2001) Noncoding RNA gene
detection using comparative sequence analysis. BMC Bioinformatics 2, 8.