Phat (a Pretty Handy Annotation Tool)

                Anthony Wirth, Simon Cawley and Terry Speed

       (Questions and comments are welcome, email simon_cawley@affymetrix.com)



INTRODUCTION

Phat is a HMM-based genefinder, originally developed for genefinding in
Plasmodium falciparum.  Care has been taken to make it easy to supply different
parameter files to run it on different organisms.  This release comes with
parameter sets ready-made for:
  Homo sapiens
  Plasmodium falciparum
  Plasmodium vivax

Phat has been tested on sequences up to just over 1M in length, and in principle
there is no limit to the size of the input sequence (though memory requirements
and execution time increase linearly with sequence length).

Phat comes with code to re-estimate parameters for other organisms.
(The code is in the subdirectory "Retrain" and some documentation may be
found in the file Retrain/DOC).

Phat comes in two flavours, one which operates on each strand independently
and one which processes both strands at the same time.  These are called
halfphat and fullphat respectively.  In the case of Plasmodium falciparum
fullphat performs significantly better.  Coding regions on one strand look
similar to coding sequence on the reverse strand and so halfphat often predicts
a gene on each strand for each real gene present.  Reducing these overlapping
predictions to a non-overlapping set is a difficult problem.  Fullphat gets
around this by considering genes on both strands in its model.  It takes
more time and memory to run, but comes up with better predictions.


UPDATES

release 20001212
  - Changed format of the posterior coding probability files.  The old
  format gave a probability for every base being coding, which could be
  a big file if the input sequence is big.  The new format only records
  the points at which the posterior probability changes.  Since it doesn't
  change much, this is a much more compact file.  For backwards compatability,
  the old format can still be produced by using the -oldprb option.


EXTRACTING AND COMPILING

  After downloading the code, extract it by typing

      gzcat phat.tar.gz | tar xvf -

  cd into the new directory Phat and compile the code by typing 
 
      make

  The code has been successfully compiled on a number of systems and
  hopefully will compile with no trouble.  If you experience problems
  compiling please email me.



RUNNING

  PART 1: running phat directly from the command line.

  Note that there is a Perl script phat.pl which extends the functionality
  of the program to include more input formats and multiple sequences per
  file.  See part 2 for details (requires both Perl and Bioperl)

  Phat takes its parameters from a bunch of files all located in the same
  directory.  One such directory is supplied with the code, it is called 
  Pars.P.falciparum.  (These parameters were estimated from the annotation
  of chromosomes 2 and 3 of Plasmodium falciparum, as found in Genbank with
  accessions NC_000910 and NT_002521)  

  There are two ways to tell the program which set of parameters to use.
  One is to specify the directory on the command line:

       fullphat -p Pars.P.falciparum myseq.fasta
       halfphat -p Pars.P.falciparum myseq.gbk

  Another way is to set the environment variable PHATPARS

       setenv PHATPARS Pars.P.falciparum
       fullphat myseq.fasta

  The -p option will override the environment variable if it is already set.

  As the examples imply, input sequences can be in fasta or genbank format.
  If a multiple fasta file is passed in only the first sequence in the
  file will be analyzed.

  Code to estimate Phat parameters from annotated genbank files is currently 
  being developed, and will shortly be part of the release.  The aim is to
  make it easy for users to retrain Phat for their own particular uses.
  


AN EXAMPLE

  There should be a directory called Sequence in the Phat directory, it 
  contains a genbank P. falciparum sequence of about 50kb in length
  (AF030694.gbk), and a GFF file specifying the correct exon locations
  in the file (AF030694.true.gff).

  After compiling the code, cd to the Sequence directory and run fullphat 
  on the sequence:

       ../fullphat -p ../Pars.P.falciparum AF030694.gbk -noextras

  This may take up to a minute or two, depending on the speed of the machine
  you're using.  The -noextras option limits the amount of output data 
  fullphat produces.

  There should now be a file called AF030694.gff containing the fullphat
  predictions.  If you have a copy of gff2ps handy, you can make a postscript
  plot of the predicted and true exons with the following command:

       gff2ps -v AF030694.true.gff AF030694.gff > AF030694.ps

  (download gff2ps from http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html)
  The plot is written to AF030694.ps and can be viewed with a postscript
  viewer such as gv or ghostview.

       gv AF030694.ps
  

PHAT OUTPUT

  Phat writes a number of output files and prints a summary of the predicted 
  exons to the screen.  The base name of the output files will be the same as
  the base name of the input sequence.  The output files are:

  .rna  A multiple fasta format file of predicted spliced mRNAs (without UTRs).

  .pep  A multiple fasta format file of predicted protein sequences.

  .gff  A GFF (General Features Format) file of the exon predictions.
       (for a definition of GFF, see www.sanger.ac.uk/Software/formats/GFF/)
       The GFF format is very useful, a bunch of tools for processing it
       have been developed.  In particular, there's an excellent free program
       gff2ps for converting .gff formatted predictions to postscript plots.
       (see http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html)
     
  .prb/.prf  Probabilities of being coding for each base.  Phat computes the
       probability of each base being some kind of coding exon.  In the future
       a plot of these probabilities will be produced rather than a text file
       of the values.  These values offer another method identifying coding
       regions, though they won't give a parse.  They could be useful in 
       identifying candidate regions in which to design PCR primers.  For 
       fullphat just one file (.prb) is produced.  For halfphat, .prf and .prb
       will contain the probabilities for the forward and backward strands
       respectively.

   Finally, phat will print output to the screen, something like the following:

  Gene Exon Dir Type   Start    Stop Length Fr Ph   Pr
     1    1   + init    1849    3279   1431  0  0 0.83
     1    2   + term    3417    3539    123  2  0 0.84
     2    1   + init    3716    3832    117  1  0 0.54
     2    2   + term    4103    4909    807  1  0 1.00
     3    1   - sing    5852    8053   2202  1  0 1.00
     4    1   - sing    8139    8597    459  2  0 0.79
     5    1   - term    9741   10469    729  2  0 0.99
     5    2   - init   10596   10625     30  2  0 0.96
     6    1   - sing   11754   12377    624  2  0 0.93
     7    1   + sing   16000   18597   2598  0  0 0.76

There is one line for each predicted exon.  The columns are defined as follows:
Gene:   Index of the gene to which the exon belongs.
Exon:   Index of the exon in its gene.
Dir:    Direction, + for the direct strand, - for the reverse.
Type:   One of sing/init/intl/term for a single/initial/internal/terminal exon.
Start:  The start coordinate of the exon.
Stop:   The stop coordinate of the exon. (Start is always less that Stop).
Length: Exon length = Stop-Start+1
Fr:     Frame: if the "rightmost" base of one of the exon's codons ends at 
          position k, the frame is (k mod 3).
Phase:  The number of bases to skip in the exon (going 5' -> 3') before 
          reaching the first base of the next codon.
Pr:     Exon probability: the probability of the prediction being a real
          exon, taking into account the entire sequence.  This is a good
          indicator of the exon being correct.
       


PHAT COMMAND-LINE OPTIONS

  [ -2 ]  (Only works with halfphat) Predict on both strands.  The default 
   is to predict in only the direct strand).

  [ -acc accfile ]  Use the acceptor splice site model in accfile, which should
    be in the parameter directory being used. The default is "acceptor.dat"
    Allows the use of alternative acceptor site models.

  [ -debug ]  Debug mode - prints *lots* of output to the screen.  This option
    was mainly used for debugging the code.  You probably don't want to use it,
    unless you're tinkering around with the code.

  [ -don donfile ]  Use the donor splice site model in donfile, which should
    be in the parameter directory being used. The default is "donor.dat"
    Allows the use of alternative donor site models.

  [ -f ]  Reverse complement ("flip") the sequence before predicting.

  [ -geo ]  Use geometric length distributions for exons, rather than the 
          general probability distributions specified in the parameter files
          exon.len.*.dat located in the parameter directory.  The mean of each
          geometric distribution used is chosen to be the same as the mean of
          the distribution supplied in the exon.len.*.dat file.  This doesn't
          make things run any faster or slower, its just available to make it 
          possible to compare the performance of HMM's with Generalized HMMs.

  [ -seq seqname ]  Use seqname (instead of the base name for output files)
          as the name of the sequence in the .gff file.  Can be useful (along
          with the -o option) if you're processing a large sequence in separate
          batches and then concatenating the resulting .gff files.

  [ -noext ]  Don't add the ".number" extension to the names given the objects
          (proteins, mRNAs, etc) in the output files.  The default is that
          they will be given a numeric extension.

  [ -norna ]  Don't print the .rna file containing the predicted spliced
          mRNA (without untranslated regions).  The default is to print 
          it (in multiple fasta format).

  [ -noextras ]  Don't print "extra" output files, just write the .gff file 
          and print the predictions to the screen.  If processing a large 
          sequence the output files can be very big, especially the exon
          probability file.

  [ -nopep ]  Don't print the .pep file containing the predicted peptide 
          sequences.  The default is to print it (in multiple fasta format).

  [ -noprob ]  Don't print the .prb file (in the case of fullphat) or the 
          .prf and .prb files (in the case of halfphat).  These files contain
          probabilities of each base in the sequence being some kind of 
          coding exon, and since there is one value per base they can be
          quite large.  The default is to print it or them.

  [ -o filename ]  Any output files will be written with the base name filename.
          The default is to use the base name of the sequence file being
          processed.

  [ -offset offset ]  Adds offset to all coordinates in the output written to
          stdout and to the .gff file.  Can be useful for changing coordinate 
          systems.  Reported frame information will be relative to the new 
          coordinate system.

  [ -okstops ]  (fullphat only)  If invoked, in-frame stop codons are given
          very low probabilities, but they're not completely forbidden.  You
          might want to use this if the sequence being analyzed is prone to
          errors, it could stop an erroneous stop codon breaking up an
          otherwise likely exon.  Without this option in-frame stops are 
          completely forbidden in fullphat.  halfphat will be modified to
          operate the same way in the future, currently it will allow
          in-frame stops.  For both phats, a stop codon spanning an intron
          can't be ruled out.  This is an unfortunate inherent feature of
          all current HMM genefinders.

  [ -p dir ]  Specifies the directory where the parameter files are located.
          dir can be an absolute or relative pathname and it doesn't matter
          whether or not it is terminated with a "/".  In addition to using 
          organism-specific parameters, you might want to use different 
          parameter sets for different regions.

  [ -r ]  (Only works with halfphat) Reverse complement the sequence before
          predicting.  All output coordinates will be in terms of the reverse 
          complemented sequence.

  [ -v ]  Run in verbose mode.  A fair bit of output will be printed to the 
          screen.  Most users can ignore this.


  PART 2: running phat.pl

  phat.pl is a perl script which uses bioperl modules to make it easier to
  run fullphat/halfphat, and to extend their functionality.  Advantages of
  using the phat.pl script are:
    1) Handles any input format that Bioperl can handle (Fasta,GenBank,GCG,etc)
    2) Handles multiple sequence entries per input file.
    3) Supports a simple type of parallel processing.

  You will need to have recent versions of Perl and Bioperl installed to run
  phat.pl.  Tests to see the absolute minimum required versions of Perl and
  Bioperl have not been conducted, but Perl 5.005_03 and Bioperl 0.6.1 or
  later should do.  They can be downloaded (for free):
    Perl     http://www.perl.com
    Bioperl  http://bio.perl.org

  The phat.pl script will probably need to be edited a little to reflect where
  you choose to store the fullphat/halfphat binaries on your system.  It also
  uses some rules determining which format is specified by which file
  extension, these are easy to customise in the script.  The default number 
  of CPUs is just 1, if you want to process sequences in parallel change
  the value either in the script or when calling phat.pl with the "-cpu"
  option.

  Output naming conventions: Phat output files will be written in the current
  directory.  If the input file contains one sequence only the phat output
  files will use the same base name.  If the input file (called "inFile") 
  contains multiple sequences (named "seq1", "seq2" and so on) then phat
  output files will have the base name inFile.seq1, inFile.seq2 and so on.

  If there are any whitespace characters in the name, the name will be 
  shortened by dropping everything after and including the first space.

  A word about parallel processing: it would be nicest to implement this
  using Perl threads, but this is still an experimental feature in Perl
  and is often not compiled in to Perl when it is installed.  Instead we
  choose to batch up to nCPUs jobs at a time and wait for them all to 
  finish.  True parallel processing wouldn't wait around for the other
  jobs to finish.  Nevertheless, what is implemented here should speed
  things up.  In the future this will be improved.

  EXAMPLES

    phat.pl Sequence/AF030694.gbk -ph "-noprob -norna"
      Runs fullphat on AF030694.gbk, suppresses generation
      of .prb and .rna files.

    phat.pl Sequence/AE00*.gbk -ph "-noextras" -cpu 4
      Runs fullphat on AE00*.gbk sequences, suppresses generation of 
      everything other than the .gff files, processes up to 4 sequences
      in parallel.

    phat.pl Sequence/AF030694.gbk -ph "-noprob -nopep" -type half
      Runs halfphat on AF030694.gbk, suppresses generation
      of .prb and .pep files.


  PHAT.PL COMMAND-LINE OPTIONS  (call with no options to see help info)

  [ -v ]  Verbose mode

  [ -h ]  Help - prints out some usage info

  [ -type half/full ]  Tells phat.pl whether to use halfphat or fullphat

  [ -ph "-opt1 -opt2 ... " ]  Passes on the specified options list to the
    call to halfphat or fullphat.

  [ -cpu nCPUs ] The max number of sequence that will be analyzed in parallel.