GeneMark Manual

The GeneMark Output

GeneMark produces two basic forms of output: a text report and Postscript graphics. The options that control this output are discussed on the page "GeneMark Options".

Interpreting the GeneMark Report

GeneMark can be instructed to generate reports of open reading frames (ORFs), regions of interest, and estimated exon boundaries. ORF reports may include ribosome binding site (RBS) evaluations and frameshift indications.

The Report Header

Each report generated by GeneMark has a header describing the parameters and matrix used in the analysis. This information is purely for recordkeeping purposes. Here's a sample header:

GENEMARK PREDICTIONS
Sequence file: cya
Sequence length: 2100
GC Content: 51.65%
Window length: 96
Window step: 12
Threshold value: 0.500
---
Matrix: E. coli (NCBI/FR-1), Order - 4
Matrix author: JDM (Amiga-TransMatrix)
Matrix order: 4
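
For recordkeeping in scripts, the header can be read into a dictionary. The following is a minimal sketch, assuming every header field has the form "Key: value" as in the sample above; the function name parse_report_header is hypothetical and is not part of GeneMark.

```python
def parse_report_header(text):
    """Collect 'Key: value' lines from a GeneMark report header into a dict."""
    fields = {}
    for line in text.splitlines():
        # The title line, blank lines, and the '---' separator have no colon.
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


header = """GENEMARK PREDICTIONS
Sequence file: cya
Sequence length: 2100
GC Content: 51.65%
Window length: 96
Window step: 12
Threshold value: 0.500
---
Matrix: E. coli (NCBI/FR-1), Order - 4
Matrix author: JDM (Amiga-TransMatrix)
Matrix order: 4"""

fields = parse_report_header(header)
print(fields["Sequence length"], fields["Window step"])
```

All values are kept as strings; convert them (for example with int or float) only where a numeric value is needed.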

The Open Reading Frames List

If you have instructed GeneMark to list open reading frames (regions from a start codon to a stop codon where the coding function exceeds the threshold), you will see a list such as the following:

List of Open reading frames predicted as CDSs, shown with alternate starts
(regions from start to stop codon w/ coding function >0.50)

Left   Right  DNA         Coding  Avg   Start  RBS   RBS   RBS
end    end    Strand      Frame   Prob  Prob   Prob  Site  Seq
-----  -----  ----------  ------  ----  -----  ----  ----  ------
    3    308  direct      fr 3    0.82  ....   0.00     0  ....
  195    308  direct      fr 3    0.60  0.04   0.74   177  CCGCAG
  348    668  complement  fr 2    0.90  0.96   0.98   680  CAGGAT
 1368   2102  direct      fr 3    0.90  0.98   0.96  1359  TTGGAG
 1371   2102  direct      fr 3    0.91  0.96   0.96  1359  TTGGAG
 1386   2102  direct      fr 3    0.93  0.63   0.91  1367  AATGAT
 1410   2102  direct      fr 3    0.96  0.90   0.76  1401  AACGAT
 1509   2102  direct      fr 3    0.98  0.27   0.51  1490  AGGGTT
 1578   2102  direct      fr 3    0.97  0.11   0.73  1567  ATGGCA
 1620   2102  direct      fr 3    0.97  0.11   0.16  1601  GCGCTG

The 'Left end' and 'Right end' columns denote the ends of the indicated open reading frame relative to the beginning of the sequence (5' end of the direct strand). 'DNA Strand' indicates which strand the signal was found on, and 'Coding Frame' indicates the reading frame, relative to the beginning of the sequence, in which the signal was found. The 'Avg Prob' column denotes the average coding potential over the indicated range. NOTE: GeneMark does not indicate whether an ORF extends past the sequence provided. An ORF whose left end falls at positions 1 to 3, or whose right end falls between length-2 and length, may therefore extend past the ends of the sample sequence.
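
The check described in the NOTE can be applied to each row of the list. The following is a minimal sketch; the function name may_extend_past_ends is hypothetical, and the sequence length of 2102 used in the example is an assumption made only for illustration.

```python
def may_extend_past_ends(left_end, right_end, seq_length):
    """Return (left_open, right_open) for one ORF.

    left_open is True when the left end lies at positions 1 to 3;
    right_open is True when the right end lies in the last three positions.
    """
    left_open = 1 <= left_end <= 3
    right_open = seq_length - 2 <= right_end <= seq_length
    return left_open, right_open


# Assumed sequence length of 2102, chosen only for this illustration.
print(may_extend_past_ends(3, 308, 2102))
print(may_extend_past_ends(1368, 2102, 2102))
print(may_extend_past_ends(348, 668, 2102))
```

A True flag only means the ORF touches the edge of the sequence; it does not prove that the gene continues beyond it.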

The 'Start Prob' column is an assessment of the likelihood that the start of the open reading frame is the actual start. This value is equal to the coding potential half a window into the ORF multiplied by one minus the coding potential half a window before the ORF. If no value is given, the start is too close to the end of the sequence for the value to be calculated.
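
The arithmetic can be written out as follows. This is a minimal sketch of the formula as stated in the text, not GeneMark's own code; the function name start_probability and the input values are hypothetical, and None stands for a coding potential that cannot be measured because the start is too close to the end of the sequence.

```python
def start_probability(coding_after, coding_before):
    """Start Prob = P(half a window into the ORF) * (1 - P(half a window before it))."""
    if coding_after is None or coding_before is None:
        # Too close to the end of the sequence: no value can be given.
        return None
    return coding_after * (1.0 - coding_before)


# Hypothetical coding potentials, for illustration only.
print(round(start_probability(0.98, 0.02), 4))
print(start_probability(0.98, None))
```

The value is high only when the coding potential is high just inside the ORF and low just upstream of it, which is the pattern expected at a true start.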

If you specified an RBS pattern file, RBS site evaluation is also performed. The 'RBS Prob' value is a score indicating how well the RBS pattern was matched upstream of the putative start site. The position of the best match for the indicated start and the matched sequence are given in the next two columns. If the start site is adjacent to the end of the sequence, the RBS site cannot be evaluated and null data are given (see the first ORF in the table above).
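
A row of the list can be split into the columns described above. The following is a minimal sketch, assuming the nine columns appear in the order shown in the sample and that a run of dots marks a value that could not be evaluated; the function name parse_orf_row and the dictionary keys are hypothetical.

```python
def parse_orf_row(line):
    """Split one row of the ORF list into named fields."""
    tokens = line.split()
    # 'fr 3' splits into two tokens, so a complete row has 10 tokens.
    if len(tokens) != 10 or tokens[3] != "fr":
        raise ValueError(f"not an ORF row: {line!r}")

    def value(token, cast):
        # A run of dots marks a field that could not be evaluated.
        return None if set(token) == {"."} else cast(token)

    return {
        "left_end": int(tokens[0]),
        "right_end": int(tokens[1]),
        "strand": tokens[2],
        "frame": int(tokens[4]),
        "avg_prob": float(tokens[5]),
        "start_prob": value(tokens[6], float),
        "rbs_prob": value(tokens[7], float),
        "rbs_site": value(tokens[8], int),
        "rbs_seq": value(tokens[9], str),
    }


row = parse_orf_row("1371 2102 direct fr 3 0.91 0.96 0.96 1359 TTGGAG")
print(row["left_end"], row["start_prob"], row["rbs_seq"])
```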

The Regions of Interest List

If you have instructed GeneMark to indicate regions of interest (areas between in-frame stop codons where a high coding potential occurred), you will see a list such as the following:

List of Regions of interest
(regions from stop to stop codon w/ a signal in between)

LEnd REnd Strand Frame
-------- -------- ----------- -----
3 308 direct fr 3
348 686 complement fr 2
1092 1334 direct fr 3
1365 2102 direct fr 3

The 'LEnd' column indicates the left end of the region (the 5' end on the direct strand) and 'REnd' the right end of the region. The 'Strand' column indicates whether the region lies on the direct or the reverse complement strand. The 'Frame' column indicates the reading frame on the indicated strand in which the signal occurred.
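
For direct-strand rows, the frame can be related to the left end by counting from base 1. The following is a sketch under an assumed numbering, frame = ((LEnd - 1) mod 3) + 1, which agrees with the direct-strand rows in the sample list but is not stated by this manual. The convention for the complement strand is not covered here, and the function name direct_strand_frame is hypothetical.

```python
def direct_strand_frame(left_end):
    """Frame (1, 2, or 3) of a direct-strand region, counted from base 1.

    Assumed numbering: a region starting at base 1 is in frame 1,
    at base 2 in frame 2, at base 3 in frame 3, and so on in cycles of three.
    """
    if left_end < 1:
        raise ValueError("positions are 1-based")
    return (left_end - 1) % 3 + 1


# Left ends of the direct-strand regions in the sample list.
for left_end in (3, 1092, 1365):
    print(left_end, "fr", direct_strand_frame(left_end))
```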

Possible Frameshift Detection

When a GeneMark report is generated, it may contain a section similar to this:

POSSIBLE SEQUENCE FRAMESHIFTS DETECTED
From To  
Frame Frame At base...
----- ----- ----------
2 1 31152 +/- 11 bp (complement)
2 1 63372 +/- 11 bp (direct)
3 2 75528 +/- 11 bp (complement)

Such a notice indicates a sudden shift in coding potential from one reading frame to another. This situation may occur when there is an insertion or deletion in the middle of a coding region. The table gives the frame in which the signal started, the frame in which the signal continues, and the approximate location of the error (the precision of which is determined by the step size parameter used).
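
Each row of this section can be turned into a range of positions to inspect. The following is a minimal sketch, assuming rows follow the layout of the sample above; the function name parse_frameshift_row and the dictionary keys are hypothetical.

```python
import re

# Matches rows such as: 2 1 31152 +/- 11 bp (complement)
FRAMESHIFT_ROW = re.compile(
    r"^\s*(\d)\s+(\d)\s+(\d+)\s+\+/-\s+(\d+)\s+bp\s+\((direct|complement)\)\s*$"
)


def parse_frameshift_row(line):
    """Return the frames, strand, and position window for one frameshift row."""
    match = FRAMESHIFT_ROW.match(line)
    if match is None:
        raise ValueError(f"not a frameshift row: {line!r}")
    from_frame, to_frame, position, margin = (int(match.group(i)) for i in range(1, 5))
    return {
        "from_frame": from_frame,
        "to_frame": to_frame,
        "strand": match.group(5),
        # Inclusive range of bases in which the error is expected.
        "window": (position - margin, position + margin),
    }


shift = parse_frameshift_row("2 1 31152 +/- 11 bp (complement)")
print(shift["from_frame"], shift["to_frame"], shift["strand"], shift["window"])
```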

The Approximate Exon Location List

The current version of GeneMark uses a "coding potential only" exon designation that can indicate approximate exon boundaries and suggest exon locations. (A modified version of GeneMark with more accurate exon prediction will be available in the near future.) Each exon is denoted by two pairs of putative acceptor/donor sites and the mean coding potential between the sites of each pair:

List of Protein-Coding Exons
(regions between acceptor and donor site w/ coding function >0.50)

Left     Right
End      End      Strand       Frame  Prob
-------  -------  -----------  -----  ------
     50      300  direct       fr 3   0.8566
     63      247                      0.9998

    365      666  complement   fr 2   0.9415
    378      657                      0.9780

   1201     1277  direct       fr 3   0.8722
   1225     1254                      0.9986

   1377     1377  direct       fr 3   0.9085
   1434     2042                      0.9780

In general, this approach is quite good at finding larger exons. Searching for smaller exons, however, requires smaller window sizes (which decrease the accuracy of prediction but allow smaller exons to be detected) and good matrix data.
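
The two lines that make up each exon entry can be grouped when reading the list. The following is a minimal sketch, assuming that a line carrying strand and frame opens an entry and that the next three-number line holds its second pair of sites; the function name parse_exon_list and the dictionary keys are hypothetical.

```python
def parse_exon_list(text):
    """Group the data rows of the exon list into one record per exon."""
    exons = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 6 and tokens[3] == "fr":
            # First line of an entry: ends, strand, frame, mean coding potential.
            exons.append({
                "strand": tokens[2],
                "frame": int(tokens[4]),
                "pairs": [(int(tokens[0]), int(tokens[1]), float(tokens[5]))],
            })
        elif len(tokens) == 3 and exons:
            # Second line of an entry: ends and mean coding potential only.
            exons[-1]["pairs"].append(
                (int(tokens[0]), int(tokens[1]), float(tokens[2]))
            )
    return exons


rows = """50 300 direct fr 3 0.8566
63 247 0.9998

365 666 complement fr 2 0.9415
378 657 0.9780"""

exons = parse_exon_list(rows)
for exon in exons:
    print(exon["strand"], exon["frame"], exon["pairs"])
```

Only data rows should be passed in; the caption and column headings of the report are not handled by this sketch.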

 

Viewing and Printing the Postscript Graphics

The Postscript graphics generated by GeneMark may be viewed using any Postscript previewer. On the Solaris platform the applications imagetool (/usr/openwin/bin/imagetool) or pageview (/usr/openwin/bin/pageview) can be used to view the graphic output. You can also use the lp command to send the graphics to a Postscript printer (check with your system administrator to make sure you have a printer that supports this feature).

If you are interested in viewing or printing the Postscript graphics on another platform or do not have the imagetool or pageview application installed, we suggest that you download Aladdin Ghostscript from the Internet; it is available for a variety of computing platforms and allows you to print the Postscript graphics on non-Postscript printers.
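
With Ghostscript installed, the graphics can also be converted to PDF from a script. The following is a minimal sketch that builds a command line using Ghostscript's standard pdfwrite device; the file names are hypothetical, and the executable is assumed to be available as gs on the search path.

```python
import subprocess


def ghostscript_pdf_command(ps_path, pdf_path):
    """Build the Ghostscript command that converts a Postscript file to PDF."""
    return [
        "gs",
        "-dBATCH",             # exit when the input has been processed
        "-dNOPAUSE",           # do not pause between pages
        "-sDEVICE=pdfwrite",   # write PDF output
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]


def convert_to_pdf(ps_path, pdf_path):
    """Run the conversion; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(ghostscript_pdf_command(ps_path, pdf_path), check=True)


# Hypothetical file names, for illustration only.
command = ghostscript_pdf_command("genemark_plot.ps", "genemark_plot.pdf")
print(" ".join(command))
```

Printing the command first makes it easy to check before calling convert_to_pdf, which actually runs Ghostscript.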

An example of the graphical output generated by GeneMark is given below, with each feature indicated in red. The coding potential function is plotted in six frames: three on the direct strand and three on the reverse complement strand. A high coding potential represents a high likelihood of protein coding in that region.

[Example GeneMark Output]





Gene Probe, Inc.
1106 Wrights Mill Court
Atlanta, GA 30324

PH: +1 (404) 579 - 2975
FX: +1 (404) 255 - 2067
