MkMat Manual






Using MakeMatrix (mkmat)

The MakeMatrix program is used to create gene-finding matrices from sample protein-coding and noncoding sequences for use with GeneMark and GeneMark.hmm analysis.
A CAUTIONARY NOTE: Please read these manual pages completely before generating your own matrix files for use with GeneMark. The program is very sensitive to the quality of the information used to prepare the matrices, and failing to follow the guidelines outlined here will yield very poor gene prediction results.

The general format for executing MakeMatrix is:

mkmat [-x] <coding file> <noncoding file> <order> <output file>
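As an illustration, the command line can be assembled from a script. The following is a minimal Python sketch that places the arguments in the order shown above. The file names are hypothetical placeholders, the order is assumed to be passed as a plain integer, and the sketch assumes mkmat is installed and on the PATH.

```python
import subprocess

def build_mkmat_command(coding, noncoding, order, output, new_format=False):
    """Assemble the mkmat argument list in the documented order."""
    cmd = ["mkmat"]
    if new_format:
        cmd.append("-x")  # request the newer matrix file format
    cmd += [coding, noncoding, str(order), output]
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_mkmat_command("coding.fa", "noncoding.fa", 2, "sample.mat",
                          new_format=True)
print(" ".join(cmd))

# To actually run the program (requires mkmat to be installed):
# subprocess.run(cmd, check=True)
```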

Options

-x Generate matrices using the newer matrix file format. Systems with older versions of GeneMark will not be able to read this format; other systems may require it. On some systems, mkmat generates only this format.

Input Files

The mkmat program requires two input files: a coding file containing sample coding sequences, and a noncoding file containing sample noncoding sequences. The sequences must all come from the same species and should be as free of errors as reasonably possible.

The format for the sequence files is quite simple and based on the popular FASTA file format. A file may contain one or more sequence records, each representing a different sequence fragment. Each sequence record should begin with a comment line starting with '>', followed by the sequence (numbers and whitespace characters are ignored). There should be one or more blank lines between records. For example:
>coding gene GT-0032A
1 ATGCGATCGA ATGCGATCGA ATGCGATCGA
31 ATGCGATCGA ATGCGATCGA ATGCGATCGA
61 ATGCGATCGA ATGCGATCGA ATGCGATCGA
91 ATGCGATCGA ATGCGATCGA ATGCGATCGA
121 NNNNNN

Lines may be of any length, and symbols representing ambiguous nucleotide assignments are allowed. Any numbers, punctuation, or whitespace characters are ignored.
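To make the format rules concrete, here is a minimal Python sketch of a reader for such files. It is not the parser used by mkmat. It treats each line starting with '>' as the start of a new record and keeps only letters from the sequence lines, so digits, punctuation, and whitespace are dropped. Converting the sequence to upper case is an assumption made for this sketch.

```python
def read_records(text):
    """Parse FASTA-like text into a list of (comment, sequence) pairs."""
    records = []
    comment, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if comment is not None:
                records.append((comment, "".join(chunks)))
            comment, chunks = line[1:].strip(), []
        elif comment is not None:
            # Keep letters only: digits, punctuation and whitespace are ignored.
            chunks.append("".join(ch for ch in line if ch.isalpha()).upper())
    if comment is not None:
        records.append((comment, "".join(chunks)))
    return records

sample = """>coding gene GT-0032A
1 ATGCGATCGA ATGCGATCGA ATGCGATCGA
31 ATGCGATCGA ATGCGATCGA ATGCGATCGA

>second fragment
1 atgnnn
"""

for comment, seq in read_records(sample):
    print(comment, len(seq))
```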

The Coding File

The coding file must contain sample protein coding sequences from the subject organism. Ideally, these sequences should be experimentally verified as protein coding or cDNA sequences. In a pinch, you may be able to extract large open reading frames from long high-fidelity contiguous sequences, but this is generally not advised. The following considerations should be taken into account when generating your coding sample file:
o Avoid including "putative" coding regions in your set of sample coding sequences.

o Include PROTEIN CODING REGIONS ONLY. The program is searching for bases that are ultimately transcribed and translated. Including extraneous noncoding data will interfere with the gene-finding algorithm.

o Don't include sequences with in-frame stops. The program ignores any sequence that contains them and prints a warning.

o Coding samples should appear "in-frame" -- the first base of each sample sequence should be the first base of a codon (though the sequence need not begin with a start codon).

o Keep in mind that some organisms may have different classes of genes, sometimes associated with local GC content. Splitting your sample by GC content or some a priori method of classification may improve gene-finding performance.

o Also, be sure to avoid including the same sequence more than once in the sample. Inadvertent over-representation of sequence patterns in the coding sample set may bias the matrix.
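Two of the considerations above, in-frame stops and duplicated sequences, can be checked before running mkmat. The following Python sketch is one possible pre-screen, not part of the program itself. It assumes the standard genetic code (stop codons TAA, TAG, TGA) and flags every stop codon in frame 1, including one at the very end of a sequence; this manual does not say how mkmat treats a terminal stop. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def in_frame_stops(seq):
    """Return 0-based offsets of stop codons in reading frame 1."""
    return [i for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]

def screen_coding(records):
    """Split (name, sequence) pairs into kept records and (name, reason) rejects."""
    kept, rejected, seen = [], [], set()
    for name, seq in records:
        seq = seq.upper()
        if seq in seen:
            rejected.append((name, "duplicate sequence"))
        elif in_frame_stops(seq):
            rejected.append((name, "in-frame stop"))
        else:
            seen.add(seq)
            kept.append((name, seq))
    return kept, rejected

# Made-up sequences, for illustration only.
records = [
    ("a", "ATGGCCAAA"),
    ("b", "ATGTAAGCC"),   # TAA at offset 3 is an in-frame stop
    ("c", "atggccaaa"),   # same bases as "a"
]
kept, rejected = screen_coding(records)
print(kept)
print(rejected)
```

Only exact duplicates are detected here; near-identical sequences that over-represent a pattern would need a similarity search.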

The Noncoding File

The noncoding file should contain samples of sequence known, or reasonably believed, not to code for proteins, such as introns. The format of the data in this file is the same as that of the coding file (see above). NOTE: Short sequences (those less than thirty bases in length) are ignored during the matrix calculation procedure.

As with the coding sample data, the data should not include unrepresentative repetitions of similar sequences as this may inaccurately bias the resulting matrix.
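Because sequences shorter than thirty bases are ignored, the amount of noncoding data that actually contributes to the matrix can be smaller than the file suggests. The Python sketch below estimates the usable total. The function name is hypothetical, and the sketch assumes the cutoff applies to the sequence length after numbers and whitespace have been removed.

```python
MIN_LENGTH = 30  # sequences shorter than this are ignored, per the note above

def usable_noncoding(records, min_length=MIN_LENGTH):
    """Return the records long enough to be used and their total base count."""
    kept = [(name, seq) for name, seq in records if len(seq) >= min_length]
    total = sum(len(seq) for _, seq in kept)
    return kept, total

# Made-up sequences, for illustration only.
records = [
    ("intron-1", "A" * 29),      # too short, would be ignored
    ("intron-2", "A" * 30),
    ("intron-3", "ACGT" * 10),
]
kept, total = usable_noncoding(records)
print([name for name, _ in kept], total)
```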

Matrix Order

The third parameter to the matrix generation program is the order of the matrix to be generated. Things to consider when selecting a matrix order:
o Higher order matrices generally yield better gene prediction results.

o Higher order matrices require more sample information, so prediction accuracy will degrade if there is insufficient information to support creation of the model.
IMPORTANT: In order to create a matrix of order n, 90 * 4^(n+1) bases of coding sequence and 30 * 4^(n+1) bases of noncoding sample sequence are required (e.g., for a 2nd order matrix, you would need at least 5760 bases of coding data and 1920 bases of noncoding data). Using smaller samples will generate less accurate predictions.
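The sample-size rule can be turned into a quick calculation. The Python sketch below computes the minimums, 90 * 4^(n+1) coding bases and 30 * 4^(n+1) noncoding bases, and finds the highest order that a given pair of sample sizes supports. The function names and the upper limit of 8 on the order are assumptions made for this illustration; the range of orders mkmat accepts is not stated here.

```python
def required_bases(order):
    """Return (coding, noncoding) minimum sample sizes for a matrix of this order."""
    cells = 4 ** (order + 1)
    return 90 * cells, 30 * cells

def highest_supported_order(coding_bases, noncoding_bases, max_order=8):
    """Return the largest order whose minimums are met, or None if none is."""
    best = None
    for order in range(max_order + 1):
        need_coding, need_noncoding = required_bases(order)
        if coding_bases >= need_coding and noncoding_bases >= need_noncoding:
            best = order
    return best

print(required_bases(2))  # (5760, 1920), matching the example above
print(highest_supported_order(100_000, 20_000))
```

Note that either file can be the limiting one: a large coding sample does not compensate for a noncoding sample that falls below its own minimum.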

BUGS

Although this program has been tested thoroughly, no software can be guaranteed to be error-free. If you encounter a bug or have suggestions for improving the program, please email us at:

custserv@genepro.com


Gene Probe, Inc.
1106 Wrights Mill Court
Atlanta, GA 30324

PH: +1 (404) 579 - 2975
FX: +1 (404) 255 - 2067
