Biological Sequence Analysis
Biological sequence analysis refers to the extraction and use of
information derived from the sequences of biological macromolecules.
DNA (deoxyribonucleic acid), RNA (ribonucleic acid) and proteins are
macromolecules: unbranched polymers built up from smaller
units. In the case of DNA these units are the 4 nucleotide residues A
(adenine), C (cytosine), G (guanine) and T (thymine), while for RNA the
units are the 4 nucleotide residues A, C, G and U (uracil).
For proteins the units are the 20 amino acid residues:
| A | (alanine) | I | (isoleucine) | R | (arginine) | G | (glycine) |
| C | (cysteine) | K | (lysine) | S | (serine) | P | (proline) |
| D | (aspartic acid) | L | (leucine) | T | (threonine) | Y | (tyrosine) |
| E | (glutamic acid) | M | (methionine) | V | (valine) | H | (histidine) |
| F | (phenylalanine) | N | (asparagine) | W | (tryptophan) | Q | (glutamine) |
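The three alphabets above can be written down directly. The following is a minimal sketch in Python (the function name `possible_types` is hypothetical, chosen for illustration) that reports which of the alphabets could have produced a given string. It also shows that the one-letter codes overlap: a string such as ACGT is a valid DNA sequence but also a valid, if unlikely, protein sequence.

```python
# One-letter codes for the three alphabets described in the text.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 amino acid residues


def possible_types(sequence):
    """Return the names of every alphabet that contains all residues of `sequence`."""
    residues = set(sequence.upper())
    candidates = []
    for name, alphabet in (
        ("DNA", DNA_ALPHABET),
        ("RNA", RNA_ALPHABET),
        ("protein", PROTEIN_ALPHABET),
    ):
        # An empty sequence matches nothing; otherwise require a subset.
        if residues and residues <= alphabet:
            candidates.append(name)
    return candidates


if __name__ == "__main__":
    for seq in ("ACGT", "ACGU", "MKVLW"):
        print(seq, possible_types(seq))
```

Because the alphabets overlap, a check like this can only rule sequence types out; it cannot confirm which molecule a string actually came from.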
While DNA serves largely as a means to store and transfer information,
proteins serve as units that perform the basic functions of life.
RNA serves at least two functions, on the one hand to transiently store
and transfer the information that is encoded in the DNA and on the other
hand to translate this information to the world of the proteins.
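The flow of information from DNA through RNA to protein can be illustrated with a short Python sketch. It assumes the input DNA is the coding strand (so transcription reduces to replacing T with U), uses the standard genetic code, and stops at the first stop codon; the function names are illustrative only, and real transcription and translation involve many steps this sketch ignores.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA copy of a DNA coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate an RNA sequence codon by codon until the first stop codon."""
    dna_like = rna.upper().replace("U", "T")  # the table is keyed on DNA letters
    protein = []
    for i in range(0, len(dna_like) - 2, 3):
        amino_acid = CODON_TABLE[dna_like[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")
    print(rna, translate(rna))
```

Any trailing bases that do not form a complete codon are ignored.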
To a considerable extent, the biological functions of
DNA, RNA and protein molecules are encoded in the linear
sequence of these basic units: their primary structure.
Analysis of sequences of DNA, RNA or proteins can reveal a lot of
information about their function. Some common analyses include
the determination of the functional units (genes) from the raw sequence of a
completely sequenced genome, and the prediction of the three-dimensional
structure and function of a protein from its primary structure.
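As a toy illustration of locating candidate functional units in raw sequence, the sketch below scans the three forward reading frames of a DNA string for open reading frames (an ATG followed in-frame by a stop codon). This is a deliberately naive assumption-laden example: it ignores the reverse strand, introns and all statistical signals, and it is not how the gene prediction programs mentioned on this page work. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    `start` is the index of the A in ATG and `end` is one past the stop
    codon, so dna[start:end] is the ORF including its stop codon.
    `min_codons` counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATAGCC"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```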
Our current research in biological sequence analysis has two broad
thrusts: the development of general purpose tools, and the development of
tools specially tailored to the needs of collaborators in WEHI and
elsewhere. Examples of the former include the gene prediction program
Phat and the coiled-coil prediction program Marcoil, while we have yet
to publish any examples of the latter.
A wide range of mathematical, statistical and computational problems arise
in our work in the area of biological sequence analysis. We make
extensive routine use of publicly available tools, both for standard
problems and as building blocks for more specific multi-step problems.
These tools include BLAST, FASTA and Clustal W for fast local and careful multiple
alignment, MEME/MAST for motif identification, Genscan and GeneId for
single species gene prediction in human and mouse, SLAM and Twinscan for
joint human-mouse gene prediction, and a host of tools for the prediction
of transmembrane domains, promoters, and other special
motifs/domains/signals.
Selected Recent Publications
- Speed TP.
Biological sequence analysis.
Proceedings of the International Congress of Mathematicians, volume II,
Higher Education Press, Beijing, 2002, pp. 97-106.
- Delorenzi M, Speed T.
An HMM model for coiled-coil domains and a comparison with PSSM-based
predictions.
Bioinformatics. 2002 Apr;18(4):617-25.
- Cawley SE, Wirth AI, Speed TP.
Phat--a gene finding program for Plasmodium falciparum.
Mol Biochem Parasitol. 2001 Dec;118(2):167-74.
Software
Phat
Phat is a program for finding genes in eukaryotic organisms.
Marcoil
MARCOIL is a hidden MARkov model-based program that predicts existence and
location of potential coiled-coil domains in protein sequences.
GeneNest
GeneNest provides a visualization of gene indices of Human, Mouse, Arabidopsis, Zebrafish and Drosophila.
MEME/MAST
Discover motifs (highly conserved regions) in groups of related DNA or protein sequences using MEME,
and search sequence databases with those motifs using MAST.
TWINSCAN
TWINSCAN is a modern gene prediction system designed to analyze eukaryotic genomic sequences.
SLAM
SLAM is a comparative-based annotation and alignment tool for syntenic genomic sequences
that performs gene finding and alignment simultaneously and predicts in both sequences symmetrically.
GENSCAN
Genscan predicts the locations and exon-intron structures of genes in genomic sequences from a variety of organisms.
GENEID
GENEID is a program, designed with a hierarchical structure, to predict genes in anonymous genomic sequences.
BLAT
BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. Protein BLAT
works in a similar manner.
TMpred
The TMpred program makes a prediction of membrane-spanning regions and their orientation.
MEROPS
The MEROPS database provides a catalogue and structure-based classification of proteases,
together with additional information about them.
Comments/Questions? Contact bioinf@wehi.edu.au.