Dot matrix plots provide a quick way to visualize the similarities between two
sequences. The following plots were made with a Java applet available at
http://arbl.cvmbs.colostate.edu/molkit/dnadot/. A quick tutorial on dot plots
is also available on that website. Two common parameters that are adjusted to
increase the readability of the plots are the window length and the number of
allowable mismatches per window.
Comment on the relative values of these parameters in the plots above. Which
plot do you think uses the larger window, assuming the same number of
mismatches per window?
Would you expect these sequences to have a strong (high scoring) global
alignment? Would a global alignment capture all significant similarities
between these two sequences? Can you draw the path that the global alignment
would likely travel through?
Why are dot matrix plots created using a window instead of simply plotting a
black pixel for a match and a grey pixel for a mismatch?
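To experiment with the two parameters discussed above, a windowed dot plot can be sketched in a few lines of Python. This is only an illustrative sketch, not the applet's implementation: the function names, the example sequences, and the text rendering are all assumptions made here.

```python
def dot_plot(seq_a, seq_b, window=1, max_mismatches=0):
    """Return a grid of booleans for a windowed dot plot.

    grid[i][j] is True when the windows of length `window` starting at
    seq_a[i] and seq_b[j] differ in at most `max_mismatches` positions.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            mismatches = sum(
                1 for k in range(window) if seq_a[i + k] != seq_b[j + k]
            )
            row.append(mismatches <= max_mismatches)
        grid.append(row)
    return grid


def count_dots(grid):
    """Number of plotted dots (True cells) in the grid."""
    return sum(1 for row in grid for cell in row if cell)


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in grid
    )


if __name__ == "__main__":
    # Hypothetical example sequences, chosen only for illustration.
    a = "ACGTTGCAACGT"
    b = "TTGCAACGTACG"
    print(render(dot_plot(a, b, window=1)))
    print()
    print(render(dot_plot(a, b, window=4, max_mismatches=1)))
```

Comparing the two printed plots for the same pair of sequences shows how the window setting changes the amount of scattered background.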
Why is using percent identity alone not the best approach for assessing the
quality of a local alignment?
The two alignments above were created using the same gap opening penalty, but
different gap extension penalties. What can you say about the relative values
of the gap extension penalties in the alignments above?
Describe the alignment you would expect if the gap opening penalty were
infinite. What kind of alignment would you expect if the gap opening penalty
and gap extension penalty were both zero?
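The interplay between gap opening and gap extension penalties can be explored by scoring a fixed alignment under different settings. The sketch below assumes one common convention (the first position of a gap costs gap_open and every further position costs gap_extend); the tools used in this assignment may define the penalties differently, and the default values here are arbitrary.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings, where '-' marks a gap.

    Assumed convention: the first column of a gap costs gap_open and
    each additional consecutive column of the same gap costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap = None  # row holding the gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            score += gap_extend if prev_gap == gap_in else gap_open
            prev_gap = gap_in
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    # One gap of length two versus two gaps of length one.
    print(score_alignment("AC--GT", "ACTTGT"))
    print(score_alignment("A-C-G", "ATCTG"))
```

Rescoring the same two alignments with other values of gap_extend shows when one long gap becomes cheaper than several short ones.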
You are shown the two alignments below. One is an alignment
of two DNA sequences with an identity of 36%. The other alignment is of two
amino acid sequences with an identity of 28%. Which of the two alignments
represents greater biological similarity between sequences?
Fill out the following dynamic programming tables using the following
parameters: (match +1, mismatch -1, insertion/deletion -1). Write out the
optimal alignment at the end and compute its score. Show the path
corresponding to the optimal alignment in each case.
Global Alignment:

         -    A    A    C    G    T    T    A    C
    -
    C
    G
    A
    T
    A
    A
    C
Local Alignment:

         -    A    A    C    G    T    T    A    C
    -
    C
    G
    A
    T
    A
    A
    C
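For checking hand-filled tables, the recurrence behind both table types can be sketched in Python. This is a minimal sketch with a single linear insertion/deletion penalty, as in the parameters above; the function names are made up for illustration, and when several optimal paths exist the traceback simply returns one of them (preferring the diagonal move).

```python
def fill_table(seq_rows, seq_cols, match=1, mismatch=-1, gap=-1, local=False):
    """Fill the dynamic programming table for a global or local alignment."""
    n, m = len(seq_rows), len(seq_cols)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are penalized along the borders.
        for i in range(1, n + 1):
            table[i][0] = i * gap
        for j in range(1, m + 1):
            table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            best = max(table[i - 1][j - 1] + sub,   # diagonal
                       table[i - 1][j] + gap,       # gap in seq_cols
                       table[i][j - 1] + gap)       # gap in seq_rows
            # Local alignment: scores are never allowed to drop below zero.
            table[i][j] = max(best, 0) if local else best
    return table


def traceback(table, seq_rows, seq_cols, match=1, mismatch=-1, gap=-1,
              local=False):
    """Recover one optimal alignment and its score from a filled table."""
    if local:
        # Start from the highest-scoring cell anywhere in the table.
        cells = [(i, j) for i in range(len(table))
                 for j in range(len(table[0]))]
        i, j = max(cells, key=lambda c: table[c[0]][c[1]])
    else:
        i, j = len(seq_rows), len(seq_cols)
    score = table[i][j]
    top, bottom = [], []
    while i > 0 or j > 0:
        if local and table[i][j] == 0:
            break  # a local alignment ends where the score reaches zero
        if i > 0 and j > 0:
            sub = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + sub:
                top.append(seq_rows[i - 1])
                bottom.append(seq_cols[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            top.append(seq_rows[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq_cols[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score


if __name__ == "__main__":
    # Toy sequences, deliberately different from the ones in the tables.
    rows, cols = "ACGT", "AGT"
    print(traceback(fill_table(rows, cols), rows, cols))
```

Passing local=True to both functions switches from the global to the local recurrence.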
Discovering Similarities between Oncogenes
Russell Doolittle (http://juno.ucsd.edu/) pioneered the application of
sequence analysis algorithms in the late 1970s and early 1980s. Doolittle used
an early database of biological sequences to run queries to identify genes
with similar functions. In the exercise below, we will follow the steps
Doolittle took to discover functional roles for the v-mos oncogene of the
Moloney Murine Sarcoma Virus. Not long after the v-mos gene was sequenced at
the Salk Institute, a group studying the v-src oncogene of the Rous Sarcoma
Virus published their findings along with the sequence in 1980. An early
attempt was made by the group to find similarities between the sequences of
the two genes, but none were found.
The Biology Workbench (http://workbench.sdsc.edu/) is a suite of web-based
programs for sequence-based analysis. You will be required to create a new
account. After creating an account, log on. (NOTE: Unless otherwise specified,
use the default parameters for each algorithm.)
Using the Nucleotide Tools menu, upload the two sequences found in the files
vmos.fasta and src.fasta on the class website. (vmos.fasta contains the v-mos
oncogene from the Moloney Murine Sarcoma Virus; src.fasta contains the v-src
gene from the Rous Sarcoma Virus.)
Using the ALIGN tool for global alignment and then the LALIGN tool for local
alignment, align the sequences and comment on the results. Write down the
percent identity and the alignment score for each algorithm.
Would you consider the sequences homologous based on your alignments from
above? (HINT: What do you think the percent identity would be for random
sequences?)
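The hint can be explored with a small simulation. The sketch below makes strong simplifying assumptions that are not part of the assignment: letters are drawn uniformly at random, the two sequences have equal length, and they are compared position by position without gaps. Real aligners insert gaps to maximize the score, so the identity they report for unrelated sequences can be higher than this baseline.

```python
import random


def percent_identity(a, b):
    """Percent of positions that match in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def random_identity(alphabet, length=1000, trials=200, seed=0):
    """Average ungapped percent identity between pairs of random sequences.

    Assumes every letter of `alphabet` is equally likely, which is a
    simplification of real sequence composition.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        total += percent_identity(a, b)
    return total / trials


if __name__ == "__main__":
    print("DNA alphabet:    ", random_identity("ACGT"))
    print("Protein alphabet:", random_identity("ACDEFGHIKLMNPQRSTVWY"))
```

Running it for both alphabets gives a rough baseline against which the identities reported by ALIGN and LALIGN can be judged.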
Next we will translate the nucleotide sequences into amino acid sequences one
at a time. This is accomplished using the SIXFRAME tool. Why does the SIXFRAME
tool give you six possible amino acid translations? Which should you choose
and why? Select the frame for each gene that you think is promising and import
them using the button at the bottom of the page.
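What a six-frame translation computes can be sketched as follows. This is not the SIXFRAME tool itself: the sketch assumes the standard genetic code, writes stop codons as '*', marks any codon containing a non-ACGT character as 'X', and uses frame labels (+1 to +3, -1 to -3) chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate both strands at each of the three possible offsets."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Length of the longest run of residues not interrupted by a stop."""
    return max(len(part) for part in protein.split("*"))


if __name__ == "__main__":
    # Short made-up sequence, used only to show the output format.
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein, longest_open_stretch(protein))
```

One possible heuristic for picking a promising frame, offered only as a suggestion, is to look for the translation with the longest stretch uninterrupted by stop codons.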
Next, align the two amino acid sequences (one from each original gene) using
the ALIGN and LALIGN programs in the Protein Tools section. If you are unsure
whether you chose the correct frame in the SIXFRAME step, select a different
frame until you get the best alignment score possible. Compare this alignment
with the nucleotide alignment. Would you consider this to serve as better
evidence of homology than the nucleotide alignment? Why or why not?
Global vs. Local Alignment
You have been given the amino acid sequence of an unknown mouse gene. You have
decided that the gene may have some similarities with two human genes, DAPK1
and CDH1. Their definitions from LocusLink (www.ncbi.nlm.nih.gov/LocusLink)
are given below.
DAPK1 - Death-associated protein kinase 1 is a positive mediator of
gamma-interferon
induced programmed cell death. DAPK1 encodes a structurally unique 160-kD
calmodulin dependent serine-threonine kinase that carries 8 ankyrin repeats and
2 putative P-loop consensus sites. It is a tumor suppressor candidate.
CDH1 - This gene is a classical cadherin from the cadherin superfamily. The
encoded protein is a
calcium dependent cell-cell adhesion glycoprotein comprised of five
extracellular cadherin repeats, a transmembrane region and a highly conserved
cytoplasmic tail. Mutations in this gene are correlated with gastric, breast,
colorectal, thyroid and ovarian cancer. Loss of function is thought to
contribute to progression in cancer by increasing proliferation, invasion,
and/or metastasis. The ectodomain of this protein mediates bacterial adhesion
to mammalian cells and the cytoplasmic domain is required for internalization.
Identified transcript variants arise from mutation at consensus splice sites.
We will compare the unknown gene to each of the two given
genes to see if we can learn more about the function of the unknown gene.
Find the global alignments (Protein Tools => ALIGN) between the unknown
sequence and DAPK, and between the unknown sequence and CDH1. Which of the
sequences (DAPK or CDH1) is more similar (higher alignment score) to the
unknown?
Next, perform local alignments between the unknown sequence and DAPK, and
between the unknown sequence and CDH1. Local alignments can be performed with
the LALIGN tool. Which two genes give the best local alignment? Which tool
(global or local alignment) do you think is more useful for inferring
function?
Pfam is a database of protein families and alignment information. Go to the
Pfam website (http://www.sanger.ac.uk/Software/Pfam/index.shtml). Click on the
"Protein Search" tab at the top of the screen. From the best local alignment
you generated in the previous step, copy the portion of one protein's sequence
that is covered by the alignment into the window on the Pfam protein search
page and click on "Search Pfam". Describe the result you get. Is the result
surprising given the definitions of the genes above?
Try making a dot plot of DAPK against itself. What can you infer about the
sequence of DAPK from the plot? What can you say about repeats within this
protein (how many, and of what size)? Look up DAPK on the Pfam website and
give the name of the repeated domain.