GAACTCATGGTG
AAAAGCACGGTC
TCAAAGCAAGGC
CCTAATCAGGGC
AAGTATGGACTC
ACTAAGCAGGGT
TCTCACGGCCCA
CCTCGTGGTGGG
TACCGTATGGTT
ACCACTCGTCGA
A biologist at your university has found 15 target genes that she thinks are co-regulated. She gives you 15 upstream regions, each 50 base pairs long, in FASTA format (file DNASample50.txt) and asks you to identify the motif and, if possible, the potential regulating protein. She tells you the sequences are from Homo sapiens, and her intuition is that the motif has length 8. She wants you to suggest only the single best candidate motif.
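Before running the motif-finding programs, it can help to sanity-check the input. The following is a minimal sketch, assuming DNASample50.txt is a standard FASTA file; the function names (`parse_fasta`, `kmer_support`, `top_kmers_from_file`) are hypothetical helpers introduced here for illustration. It only counts exact 8-mers shared across sequences, which is a naive baseline and not a substitute for the probabilistic tools used in the exercise, since real motif instances usually contain mismatches.

```python
from collections import Counter


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def kmer_support(sequences, k=8):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # Use a set so a k-mer repeated within one sequence counts once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support


def top_kmers_from_file(path, k=8, n=5):
    """Hypothetical convenience wrapper: the n best-supported exact k-mers."""
    with open(path) as handle:
        records = parse_fasta(handle.read())
    return kmer_support([seq for _, seq in records], k).most_common(n)
```

Calling `top_kmers_from_file("DNASample50.txt")` would list the exact 8-mers found in the most sequences. If no 8-mer appears in more than a few sequences, that suggests the motif is degenerate and that exact matching alone will not recover it.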
Attach ALL output files with results. Record all your parameters and collect all output files. For each program, make a decision regarding the one motif that you think is best.
Consider all motifs generated, select the best motif and perform the following:
After you have run all the programs, your biologist friend confesses that she is not sure whether her intuition about the motif length was correct. Re-run all the tools above without specifying the motif length. Do you get the same results?
Describe a biological experiment to validate your hypothesis. How many hours would you estimate the experiment requires?
A popular experimental technique to confirm motif binding and determine protein-DNA interaction is chromatin immunoprecipitation (ChIP). A high-throughput variant of ChIP, ChIP-on-chip, was developed by Iyer et al (2001) and Ren et al (2000), and is reviewed by Nal et al (2001).
A simplifying assumption for motif finders is that nucleotide positions are independent. Several groups have developed approaches that do not require that independence. Refer to Barash et al (2003) and Keich and Pevzner (2002) for computational approaches to handle dependencies.
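To make the independence assumption concrete, here is a minimal sketch of a position weight matrix (PWM) score. Because each position is treated independently, the log-odds score of a candidate site is simply the sum of per-position terms. The function names, the pseudocount of 1, and the uniform background of 0.25 per nucleotide are assumptions made for this illustration and are not taken from any particular tool.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    `sites` is a list of equal-length DNA strings. A pseudocount is added
    to every cell so that unseen nucleotides do not get probability zero.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def log_odds_score(pwm, candidate, background=0.25):
    """Score a candidate site as a sum of independent per-position terms.

    The additive form is exactly the independence assumption: no term
    depends on which nucleotide occurs at any other position.
    """
    return sum(
        math.log2(column[base] / background)
        for column, base in zip(pwm, candidate)
    )
```

For example, with `pwm = build_pwm(["ACGT", "ACGT", "ACGA"])`, the call `log_odds_score(pwm, "ACGT")` scores higher than `log_odds_score(pwm, "TTTT")`. Models that capture dependencies between positions replace this simple sum with terms conditioned on other positions.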
Suggested Reading
Bailey T. and Elkan C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36.
Barash Y., Elidan G., Friedman N., and Kaplan T. (2003) Modeling Dependencies in Protein-DNA Binding Sites. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, 28-37.
Eskin E. and Pevzner P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18, S354-63.
Hertz G.Z. and Stormo G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563-77.
Iyer V.R., Horak C.E., Scafe C.S., Botstein D., Snyder M., and Brown P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533-8.
Keich U. and Pevzner P.A. (2002) Finding motifs in the twilight zone. Bioinformatics, 18, 1374-81.
Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., and Wootton J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-14. (available via JSTOR)
Mandel-Gutfreund Y., Baron A., and Margalit H. (2001) A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac Symp Biocomput., 139-50.
Nal B., Mohr E., and Ferrier P. (2001) Location analysis of DNA-bound proteins at the whole-genome level: untangling transcriptional regulatory networks. Bioessays, 23, 473-6.
Orlando V. (2000) Mapping chromosomal proteins in vivo by formaldehyde-crosslinked-chromatin immunoprecipitation. Trends Biochem Sci., 25, 99-104.
Ren B., Robert F., Wyrick J.J., Aparicio O., Jennings E.G., Simon I., Zeitlinger J., Schreiber J., Hannett N., Kanin E., Volkert T.L., Wilson C.J., Bell S.P., and Young R.A. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306-9.
Schneider T.D. and Stephens R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18, 6097-100.
Matys V., Fricke E., Geffers R., Gossling E., Haubrock M., Hehl R., Hornischer K., Karas D., Kel A.E., Kel-Margoulis O.V., Kloos D.U., Land S., Lewicki-Potapov B., Michael H., Munch R., Reuter I., Roter S., Saxel H., Scheer M., Thiele S., and Wingender E. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31, 374-8.
Stormo G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 15-23.