3.4. Input file formats

3.4.1. General Feature Format

The General Feature Format (GFF) is a text format for describing annotations on biological sequences. The format is specified by the Sanger Center in the GFF (General Feature Format) Specifications Document.

We assume that the reader is familiar with the GFF record format described in that document, and we do not repeat it here. A compliant GFF file for Caryoscope consists of a list of GFF records organized into three consecutive sections:

  1. Chromosome definition;
  2. Centromere definition; and
  3. Feature definition.

The sections must appear in this order, and records from different sections must not be interspersed. The interpretation of each record field is defined below, section by section.
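The ordering rule can be checked before a file is loaded. The following is a minimal sketch, not part of Caryoscope: the function names are hypothetical, and we assume tab-separated records, "#" comment lines, and a case-sensitive match on the <feature> column.

```python
def section_of(feature):
    """Map the <feature> column to a section index (0, 1, or 2)."""
    if feature == "chromosome":
        return 0
    if feature == "centromere":
        return 1
    return 2  # anything else is treated as a feature record


def check_section_order(lines):
    """Return True if records run chromosomes, then centromeres, then features."""
    current = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines are skipped
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError("expected at least 8 tab-separated fields: %r" % line)
        section = section_of(fields[2])
        if section < current:
            return False  # a record from an earlier section appeared too late
        current = section
    return True
```

The check only rejects files whose sections are reordered or interspersed; it does not decide whether an empty section is acceptable, since that is not stated here.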

3.4.1.1. Chromosome definition

<seqname>

The name of a chromosome

<source>

Ignored

<feature>

Should always be equal to chromosome

<start>

Should always be equal to 1

<end>

The length of the chromosome

<score>

Ignored

<strand>

Ignored

<frame>

Ignored

<attributes>

Ignored

<comments>

Ignored
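As an illustration of the fields above, the following sketch reads one chromosome record and returns the chromosome name and length. The function name and the sample values in the usage note are hypothetical, and we assume tab-separated fields.

```python
def parse_chromosome_record(line):
    """Return (name, length) from one tab-separated chromosome record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("record is missing required fields: %r" % line)
    seqname, feature = fields[0], fields[2]
    start, end = int(fields[3]), int(fields[4])
    if feature != "chromosome":
        raise ValueError("<feature> should be 'chromosome', got %r" % feature)
    if start != 1:
        raise ValueError("<start> should be 1, got %d" % start)
    # <end> holds the chromosome length; all other fields are ignored.
    return seqname, end
```

For example, a record for a made-up chromosome "chrA" with <end> equal to 5000 yields ("chrA", 5000).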

3.4.1.2. Centromere definition

<seqname>

The name of the chromosome on which the centromere resides

<source>

Ignored

<feature>

Should always be equal to centromere

<start>

The 1-based starting location of the centromere

<end>

The 1-based ending location of the centromere

<score>

Ignored

<strand>

Ignored

<frame>

Ignored

<attributes>

Ignored

<comments>

Ignored
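A centromere record can be read the same way. The sketch below also checks that the centromere lies within its chromosome; this bounds check is our assumption about sensible input, not documented Caryoscope behavior, and the function name and the chromosome_lengths argument are hypothetical.

```python
def parse_centromere_record(line, chromosome_lengths):
    """Return (name, start, end) from one tab-separated centromere record.

    chromosome_lengths maps chromosome names to lengths, as collected
    from the chromosome definition section.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("record is missing required fields: %r" % line)
    seqname, feature = fields[0], fields[2]
    start, end = int(fields[3]), int(fields[4])
    if feature != "centromere":
        raise ValueError("<feature> should be 'centromere', got %r" % feature)
    if seqname not in chromosome_lengths:
        raise ValueError("unknown chromosome %r" % seqname)
    # Assumption: 1-based, inclusive coordinates with start <= end.
    if not (1 <= start <= end <= chromosome_lengths[seqname]):
        raise ValueError("centromere lies outside chromosome %r" % seqname)
    return seqname, start, end
```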

3.4.1.3. Feature definition

<seqname>

The name of the chromosome on which the feature resides

<source>

Ignored

<feature>

A string identifying the "type" of the feature, such as exon

<start>

The 1-based starting location of the feature

<end>

The 1-based ending location of the feature

<score>

Ignored

<strand>

Ignored

<frame>

Ignored

<attributes>

Parsed as described below

<comments>

Ignored

The "<attributes>" field is a set of keywords, each of which may be associated with one or more values. These are parsed as follows:

  1. All keyword values are parsed as annotations on the data, and may be used in the Feature tooltip expression and Feature URL expression.

  2. Any value that parses correctly as a number is added to a dataset named after the keyword, and the values in that dataset are available for display (see Section 2.5, “Choose a dataset”).
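The two rules can be sketched as follows. This is an illustration only: we assume the common GFF layout of semicolon-separated "keyword value value ..." groups with free text in double quotes, and we use Python's float() as a stand-in for whatever numeric parsing Caryoscope applies. The function name is hypothetical.

```python
import shlex


def parse_attributes(text):
    """Split an <attributes> string into annotations and numeric datasets."""
    annotations = {}  # keyword -> all values, kept as strings (rule 1)
    datasets = {}     # keyword -> values that parse as numbers (rule 2)
    # Simplification: a ";" inside a quoted value is not handled here.
    for group in text.split(";"):
        tokens = shlex.split(group)  # honors double-quoted values
        if not tokens:
            continue
        keyword, values = tokens[0], tokens[1:]
        annotations.setdefault(keyword, []).extend(values)
        for value in values:
            try:
                number = float(value)
            except ValueError:
                continue  # non-numeric values remain annotations only
            datasets.setdefault(keyword, []).append(number)
    return annotations, datasets
```

Under these assumptions, a keyword with mixed values contributes every value as an annotation but only its numeric values to the dataset of the same name.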

3.4.2. Spreadsheet-compatible text

Caryoscope accepts comma- or tab-delimited text files of the kind commonly exported and imported by spreadsheet software. The columns in these files follow a layout very similar to that of the GFF file described in Section 3.4.1, “General Feature Format”.

A compatible text file has three sections, as in the GFF case, for chromosomes, centromeres, and features, respectively. The columns listed below are mandatory; any further columns to the right are treated as annotations, in the same way as the "<attributes>" field in the GFF case.

SEQNAME

The name of the chromosome on which the feature resides

SOURCE

Ignored

FEATURE

A string identifying the "type" of the feature, such as exon

START

The 1-based starting location of the feature

END

The 1-based ending location of the feature

SCORE

Ignored

STRAND

Ignored

FRAME

Ignored

COMMENTS

Ignored

subsequent columns

Annotation columns
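The column layout above can be read with a standard delimited-text parser. The sketch below is not Caryoscope's reader: the function name is hypothetical, and we assume that the first row is a header whose entries name the annotation columns, which this section does not state.

```python
import csv
import io

# The nine mandatory columns, in the order listed above.
MANDATORY = ["SEQNAME", "SOURCE", "FEATURE", "START", "END",
             "SCORE", "STRAND", "FRAME", "COMMENTS"]


def read_records(text, delimiter="\t"):
    """Parse delimited text into a list of record dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # assumption: first row is a header
    annotation_names = header[len(MANDATORY):]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        records.append({
            "seqname": row[0],
            "feature": row[2],
            "start": int(row[3]),
            "end": int(row[4]),
            # Columns to the right of COMMENTS become annotations.
            "annotations": dict(zip(annotation_names, row[len(MANDATORY):])),
        })
    return records
```

Pass delimiter="," for comma-delimited files. Annotation values are kept as strings here; numeric values could then be collected into datasets as in the GFF case.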