Phat (a Pretty Handy Annotation Tool) Anthony Wirth, Simon Cawley and Terry Speed (Questions and comments are welcome, email simon_cawley@affymetrix.com) INTRODUCTION Phat is a HMM-based genefinder, originally developed for genefinding in Plasmodium falciparum. Care has been taken to make it easy to supply different parameter files to run it on different organisms. This release comes with parameter sets ready-made for: Homo sapiens Plasmodium falciparum Plasmodium vivax Phat has been tested on sequences up to just over 1M in length, and in principle there is no limit to the size of the input sequence (though memory requirements and execution time increase linearly with sequence length). Phat comes with code to re-estimate parameters for other organisms. (The code is in the subdirectory "Retrain" and some documentation may be found in the file Retrain/DOC). Phat comes in two flavours, one which operates on each strand independently and one which processes both strands at the same time. These are called halfphat and fullphat respectively. In the case of Plasmodium falciparum fullphat performs significantly better. Coding regions on one strand look similar to coding sequence on the reverse strand and so halfphat often predicts a gene on each strand for each real gene present. Reducing these overlapping predictions to a non-overlapping set is a difficult problem. Fullphat gets around this by considering genes on both strands in its model. It takes more time and memory to run, but comes up with better predictions. UPDATES release 20001212 - Changed format of the posterior coding probability files. The old format gave a probability for every base being coding, which could be a big file if the input sequence is big. The new format only records the points at which the posterior probability changes. Since it doesn't change much, this is a much more compact file. For backwards compatability, the old format can still be produced by using the -oldprb option. EXTRACTING AND COMPILING After downloading the code, extract it by typing gzcat phat.tar.gz | tar xvf - cd into the new directory Phat and compile the code by typing make The code has been successfully compiled on a number of systems and hopefully will compile with no trouble. If you experience problems compiling please email me. RUNNING PART 1: running phat directly from the command line. Note that there is a Perl script phat.pl which extends the functionality of the program to include more input formats and multiple sequences per file. See part 2 for details (requires both Perl and Bioperl) Phat takes its parameters from a bunch of files all located in the same directory. One such directory is supplied with the code, it is called Pars.P.falciparum. (These parameters were estimated from the annotation of chromosomes 2 and 3 of Plasmodium falciparum, as found in Genbank with accessions NC_000910 and NT_002521) There are two ways to tell the program which set of parameters to use. One is to specify the directory on the command line: fullphat -p Pars.P.falciparum myseq.fasta halfphat -p Pars.P.falciparum myseq.gbk Another way is to set the environment variable PHATPARS setenv PHATPARS Pars.P.falciparum fullphat myseq.fasta The -p option will override the environment variable if it is already set. As the examples imply, input sequences can be in fasta or genbank format. If a multiple fasta file is passed in only the first sequence in the file will be analyzed. Code to estimate Phat parameters from annotated genbank files is currently being developed, and will shortly be part of the release. The aim is to make it easy for users to retrain Phat for their own particular uses. AN EXAMPLE There should be a directory called Sequence in the Phat directory, it contains a genbank P. falciparum sequence of about 50kb in length (AF030694.gbk), and a GFF file specifying the correct exon locations in the file (AF030694.true.gff). After compiling the code, cd to the Sequence directory and run fullphat on the sequence: ../fullphat -p ../Pars.P.falciparum AF030694.gbk -noextras This may take up to a minute or two, depending on the speed of the machine you're using. The -noextras option limits the amount of output data fullphat produces. There should now be a file called AF030694.gff containing the fullphat predictions. If you have a copy of gff2ps handy, you can make a postscript plot of the predicted and true exons with the following command: gff2ps -v AF030694.true.gff AF030694.gff > AF030694.ps (download gff2ps from http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html) The plot is written to AF030694.ps and can be viewed with a postscript viewer such as gv or ghostview. gv AF030694.ps PHAT OUTPUT Phat writes a number of output files and prints a summary of the predicted exons to the screen. The base name of the output files will be the same as the base name of the input sequence. The output files are: .rna A multiple fasta format file of predicted spliced mRNAs (without UTRs). .pep A multiple fasta format file of predicted protein sequences. .gff A GFF (General Features Format) file of the exon predictions. (for a definition of GFF, see www.sanger.ac.uk/Software/formats/GFF/) The GFF format is very useful, a bunch of tools for processing it have been developed. In particular, there's an excellent free program gff2ps for converting .gff formatted predictions to postscript plots. (see http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html) .prb/.prf Probabilities of being coding for each base. Phat computes the probability of each base being some kind of coding exon. In the future a plot of these probabilities will be produced rather than a text file of the values. These values offer another method identifying coding regions, though they won't give a parse. They could be useful in identifying candidate regions in which to design PCR primers. For fullphat just one file (.prb) is produced. For halfphat, .prf and .prb will contain the probabilities for the forward and backward strands respectively. Finally, phat will print output to the screen, something like the following: Gene Exon Dir Type Start Stop Length Fr Ph Pr 1 1 + init 1849 3279 1431 0 0 0.83 1 2 + term 3417 3539 123 2 0 0.84 2 1 + init 3716 3832 117 1 0 0.54 2 2 + term 4103 4909 807 1 0 1.00 3 1 - sing 5852 8053 2202 1 0 1.00 4 1 - sing 8139 8597 459 2 0 0.79 5 1 - term 9741 10469 729 2 0 0.99 5 2 - init 10596 10625 30 2 0 0.96 6 1 - sing 11754 12377 624 2 0 0.93 7 1 + sing 16000 18597 2598 0 0 0.76 There is one line for each predicted exon. The columns are defined as follows: Gene: Index of the gene to which the exon belongs. Exon: Index of the exon in its gene. Dir: Direction, + for the direct strand, - for the reverse. Type: One of sing/init/intl/term for a single/initial/internal/terminal exon. Start: The start coordinate of the exon. Stop: The stop coordinate of the exon. (Start is always less that Stop). Length: Exon length = Stop-Start+1 Fr: Frame: if the "rightmost" base of one of the exon's codons ends at position k, the frame is (k mod 3). Phase: The number of bases to skip in the exon (going 5' -> 3') before reaching the first base of the next codon. Pr: Exon probability: the probability of the prediction being a real exon, taking into account the entire sequence. This is a good indicator of the exon being correct. PHAT COMMAND-LINE OPTIONS [ -2 ] (Only works with halfphat) Predict on both strands. The default is to predict in only the direct strand). [ -acc accfile ] Use the acceptor splice site model in accfile, which should be in the parameter directory being used. The default is "acceptor.dat" Allows the use of alternative acceptor site models. [ -debug ] Debug mode - prints *lots* of output to the screen. This option was mainly used for debugging the code. You probably don't want to use it, unless you're tinkering around with the code. [ -don donfile ] Use the donor splice site model in donfile, which should be in the parameter directory being used. The default is "donor.dat" Allows the use of alternative donor site models. [ -f ] Reverse complement ("flip") the sequence before predicting. [ -geo ] Use geometric length distributions for exons, rather than the general probability distributions specified in the parameter files exon.len.*.dat located in the parameter directory. The mean of each geometric distribution used is chosen to be the same as the mean of the distribution supplied in the exon.len.*.dat file. This doesn't make things run any faster or slower, its just available to make it possible to compare the performance of HMM's with Generalized HMMs. [ -seq seqname ] Use seqname (instead of the base name for output files) as the name of the sequence in the .gff file. Can be useful (along with the -o option) if you're processing a large sequence in separate batches and then concatenating the resulting .gff files. [ -noext ] Don't add the ".number" extension to the names given the objects (proteins, mRNAs, etc) in the output files. The default is that they will be given a numeric extension. [ -norna ] Don't print the .rna file containing the predicted spliced mRNA (without untranslated regions). The default is to print it (in multiple fasta format). [ -noextras ] Don't print "extra" output files, just write the .gff file and print the predictions to the screen. If processing a large sequence the output files can be very big, especially the exon probability file. [ -nopep ] Don't print the .pep file containing the predicted peptide sequences. The default is to print it (in multiple fasta format). [ -noprob ] Don't print the .prb file (in the case of fullphat) or the .prf and .prb files (in the case of halfphat). These files contain probabilities of each base in the sequence being some kind of coding exon, and since there is one value per base they can be quite large. The default is to print it or them. [ -o filename ] Any output files will be written with the base name filename. The default is to use the base name of the sequence file being processed. [ -offset offset ] Adds offset to all coordinates in the output written to stdout and to the .gff file. Can be useful for changing coordinate systems. Reported frame information will be relative to the new coordinate system. [ -okstops ] (fullphat only) If invoked, in-frame stop codons are given very low probabilities, but they're not completely forbidden. You might want to use this if the sequence being analyzed is prone to errors, it could stop an erroneous stop codon breaking up an otherwise likely exon. Without this option in-frame stops are completely forbidden in fullphat. halfphat will be modified to operate the same way in the future, currently it will allow in-frame stops. For both phats, a stop codon spanning an intron can't be ruled out. This is an unfortunate inherent feature of all current HMM genefinders. [ -p dir ] Specifies the directory where the parameter files are located. dir can be an absolute or relative pathname and it doesn't matter whether or not it is terminated with a "/". In addition to using organism-specific parameters, you might want to use different parameter sets for different regions. [ -r ] (Only works with halfphat) Reverse complement the sequence before predicting. All output coordinates will be in terms of the reverse complemented sequence. [ -v ] Run in verbose mode. A fair bit of output will be printed to the screen. Most users can ignore this. PART 2: running phat.pl phat.pl is a perl script which uses bioperl modules to make it easier to run fullphat/halfphat, and to extend their functionality. Advantages of using the phat.pl script are: 1) Handles any input format that Bioperl can handle (Fasta,GenBank,GCG,etc) 2) Handles multiple sequence entries per input file. 3) Supports a simple type of parallel processing. You will need to have recent versions of Perl and Bioperl installed to run phat.pl. Tests to see the absolute minimum required versions of Perl and Bioperl have not been conducted, but Perl 5.005_03 and Bioperl 0.6.1 or later should do. They can be downloaded (for free): Perl http://www.perl.com Bioperl http://bio.perl.org The phat.pl script will probably need to be edited a little to reflect where you choose to store the fullphat/halfphat binaries on your system. It also uses some rules determining which format is specified by which file extension, these are easy to customise in the script. The default number of CPUs is just 1, if you want to process sequences in parallel change the value either in the script or when calling phat.pl with the "-cpu" option. Output naming conventions: Phat output files will be written in the current directory. If the input file contains one sequence only the phat output files will use the same base name. If the input file (called "inFile") contains multiple sequences (named "seq1", "seq2" and so on) then phat output files will have the base name inFile.seq1, inFile.seq2 and so on. If there are any whitespace characters in the name, the name will be shortened by dropping everything after and including the first space. A word about parallel processing: it would be nicest to implement this using Perl threads, but this is still an experimental feature in Perl and is often not compiled in to Perl when it is installed. Instead we choose to batch up to nCPUs jobs at a time and wait for them all to finish. True parallel processing wouldn't wait around for the other jobs to finish. Nevertheless, what is implemented here should speed things up. In the future this will be improved. EXAMPLES phat.pl Sequence/AF030694.gbk -ph "-noprob -norna" Runs fullphat on AF030694.gbk, suppresses generation of .prb and .rna files. phat.pl Sequence/AE00*.gbk -ph "-noextras" -cpu 4 Runs fullphat on AE00*.gbk sequences, suppresses generation of everything other than the .gff files, processes up to 4 sequences in parallel. phat.pl Sequence/AF030694.gbk -ph "-noprob -nopep" -type half Runs halfphat on AF030694.gbk, suppresses generation of .prb and .pep files. PHAT.PL COMMAND-LINE OPTIONS (call with no options to see help info) [ -v ] Verbose mode [ -h ] Help - prints out some usage info [ -type half/full ] Tells phat.pl whether to use halfphat or fullphat [ -ph "-opt1 -opt2 ... " ] Passes on the specified options list to the call to halfphat or fullphat. [ -cpu nCPUs ] The max number of sequence that will be analyzed in parallel.