
Types of annotation

As with the Features table, MacVector follows the GenBank format very strictly so that annotated sequences remain compatible with other sequence analysis applications and with sequence databases such as those at the NCBI.

Locus

The locus line contains the locus name (a unique identifier for the sequence), the sequence length, the molecule type, the GenBank division to which the record belongs (indicated by a three-letter abbreviation), and the modification date.
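As a rough illustration, the five fields described above could be pulled out of a locus line as follows. This is only a sketch: it assumes the fields are separated by whitespace and that the length is followed by a unit such as "bp"; the locus name EXAMPLE01 and the helper names are invented for the example and are not part of MacVector.

```python
from typing import NamedTuple


class LocusLine(NamedTuple):
    name: str
    length: int
    molecule_type: str
    division: str
    date: str


def parse_locus(line: str) -> LocusLine:
    """Split a LOCUS line into the five fields described in the text."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError(f"not a LOCUS line: {line!r}")
    # Assumed layout: LOCUS name length unit molecule [extra] division date.
    # Division and date are read from the end of the line so that an
    # optional extra token between molecule type and division does not
    # shift them.
    return LocusLine(
        name=tokens[1],
        length=int(tokens[2]),
        molecule_type=tokens[4],
        division=tokens[-2],
        date=tokens[-1],
    )


# Hypothetical record used only for illustration.
example = "LOCUS       EXAMPLE01    5028 bp    DNA             PLN       21-JUN-1999"
print(parse_locus(example))
```

Reading the division and date from the end of the line is a design choice made here for robustness, not something the format description requires.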

definition

This is a short description of the sequence entry. It starts with the common name of the source organism, followed by the criteria that distinguish this sequence from other parts of the source genome (gene name and what the gene codes for, the protein name and mRNA, or a description of the sequence's function if it is a noncoding region). The definition line of coding regions can end with a completeness qualifier such as complete sequence, complete genome, or cds (complete coding sequence). MacVector limits this annotation to 254 characters.

accession

This contains one or more accession numbers that apply to the sequence entry. Accession numbers are assigned by GenBank, and you should ordinarily change them only if you know there is an error in an existing number, or if the sequence is new and you have just received its accession number assignment directly from GenBank. Each accession number consists of an alphabetic character followed by five digits. A space character is used to separate multiple accession numbers. The first accession number listed is unique to this particular sequence entry. MacVector limits this annotation to 254 characters.
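The rules above (one letter plus five digits, space-separated, at most 254 characters) can be checked with a few lines of code. The sketch below encodes only the format as stated in this topic; the function and constant names are hypothetical, and it does not attempt to cover any other accession formats.

```python
import re

# Format as described in the text: one letter followed by five digits.
ACCESSION_RE = re.compile(r"[A-Za-z]\d{5}")
MAX_FIELD_LENGTH = 254  # limit stated in the text


def split_accessions(field: str) -> list[str]:
    """Return the accession numbers in a field, or raise ValueError."""
    if len(field) > MAX_FIELD_LENGTH:
        raise ValueError("accession field exceeds 254 characters")
    numbers = field.split()
    for number in numbers:
        if not ACCESSION_RE.fullmatch(number):
            raise ValueError(f"unexpected accession format: {number!r}")
    return numbers


# The first number in the list is the one unique to this entry.
numbers = split_accessions("U49845 M12345")
print(numbers[0])
```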

version

This field contains a compound identifier, consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI. This keyword is mandatory and appears exactly once in each record.
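To make the structure of the compound identifier concrete, the following sketch splits a version field into its parts. It assumes the accession and version are joined by a period and that the GI is written as "GI:" followed by digits; treating the GI as optional is an assumption made here for tolerance, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: ACCESSION.VERSION, optionally followed by GI:<integer>.
VERSION_RE = re.compile(
    r"^(?P<accession>[A-Za-z0-9_]+)\.(?P<version>\d+)(?:\s+GI:(?P<gi>\d+))?$"
)


def parse_version(field: str) -> Tuple[str, int, Optional[int]]:
    """Return (accession, version, gi) from a version field."""
    match = VERSION_RE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognised version field: {field!r}")
    gi = match.group("gi")
    return (
        match.group("accession"),
        int(match.group("version")),
        int(gi) if gi is not None else None,
    )


print(parse_version("U49845.1  GI:1293613"))
```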

NID/PID

An alternative method of presenting the NCBI GI identifier (described above) for nucleic acids and proteins. NID and PID are now obsolete, but are included for backwards compatibility.

DBLink

The DBLink line (DR line in SwissProt) points to information that relates to SwissProt and GenBank entries but is held in other data collections.

keywords

This field is not found in sequence files extracted from the floppy disk versions of the databases, but text files extracted from the Internet versions of GenBank will have it. It consists of short phrases that provide information about the sequence entry. Use semicolons to separate the keyword phrases and a period after the last keyword phrase. MacVector limits this field to 254 characters.
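The punctuation rules for the keywords field lend themselves to a small round-trip example. This is a sketch under the rules stated above (semicolon separators, a closing period, a 254-character limit); the function names and the sample phrases are invented for illustration.

```python
MAX_FIELD_LENGTH = 254  # limit stated in the text


def parse_keywords(field: str) -> list[str]:
    """Split a keywords field into its phrases."""
    text = field.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the period after the last phrase
    return [phrase.strip() for phrase in text.split(";") if phrase.strip()]


def format_keywords(phrases: list[str]) -> str:
    """Join phrases with semicolons and close the field with a period."""
    field = "; ".join(phrases) + "."
    if len(field) > MAX_FIELD_LENGTH:
        raise ValueError("keywords field exceeds 254 characters")
    return field


phrases = parse_keywords("kinase; membrane protein.")
print(phrases)
print(format_keywords(phrases))
```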

segment

This field is found only in segmented entries. It is used if two or more sequence entries of known relative orientation are separated by a short (less than 10 kb) segment of DNA. The format for the annotation is: n of total, where n is the segment number of the current entry and total is the total number of segments. Usually, you would not use this annotation.
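The "n of total" format can be read and sanity-checked as shown in this sketch. The function name is hypothetical, and the check that n lies between 1 and total is an assumption that follows from the description rather than a documented MacVector rule.

```python
import re
from typing import Tuple

SEGMENT_RE = re.compile(r"^\s*(\d+)\s+of\s+(\d+)\s*$")


def parse_segment(field: str) -> Tuple[int, int]:
    """Return (n, total) from a segment annotation such as '2 of 5'."""
    match = SEGMENT_RE.match(field)
    if match is None:
        raise ValueError(f"expected 'n of total', got {field!r}")
    n, total = int(match.group(1)), int(match.group(2))
    if not 1 <= n <= total:
        raise ValueError(f"segment number {n} is outside 1..{total}")
    return n, total


print(parse_segment("2 of 5"))
```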

source

This field has up to three subfields:

- Source may contain an abbreviated form of the organism name and a molecule type.

- Organism contains the scientific name for the source organism (genus and species, where applicable). In sequence files from the tape version of GenBank, the Organism subfield will also contain a list of all the taxonomic classification levels for the organism, separated by semicolons and ending with a period.

- If the sequence is from a parasitic organism, you can enter the name of the parasite's host organism in the optional Host subfield. MacVector limits each subfield of the Source annotation to 254 characters.

reference

This field has six subfields:

- The optional Title subfield contains the title of the cited reference. This subfield will not be present in files that were extracted from the floppy disk version of GenBank.

- Type a number in the Reference subfield (references are numbered sequentially in a GenBank file, starting with 1).

- The Author subfield is a list of authors in the order that they appear in the cited reference. Each name is listed in the form "lastname, A.A". The names are separated by a comma followed by a space. There is no comma after the penultimate name and the final name is preceded by the word "and".

- The Journal subfield contains the name of the journal, book, or thesis where the citation was published, or unpublished if the sequence has not been published. MacVector limits each subfield to 254 characters.

- The MEDLINE ID subfield contains the MEDLINE references.

- The REMARK subfield contains accredited comments that have been added by the database managers.
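The punctuation rule for the Author subfield described above (names separated by a comma and a space, with "and" instead of a comma before the final name) can be sketched as a small formatter. The function name and the sample names are invented for illustration; the 254-character check reflects the subfield limit mentioned in the list.

```python
MAX_SUBFIELD_LENGTH = 254  # limit stated in the text


def format_authors(names: list[str]) -> str:
    """Join names already written in the 'lastname, A.A' form."""
    if not names:
        return ""
    if len(names) == 1:
        text = names[0]
    else:
        # Comma and space between names, but "and" before the final name
        # with no comma after the penultimate one.
        text = ", ".join(names[:-1]) + " and " + names[-1]
    if len(text) > MAX_SUBFIELD_LENGTH:
        raise ValueError("Author subfield exceeds 254 characters")
    return text


print(format_authors(["Smith, J.A", "Jones, B.C", "Brown, D.E"]))
```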

comment

This field is an optional, free-form text section. MacVector limits this annotation to 32,767 characters.

base count

The user cannot change the contents of this section; the base count line can only be added to or removed from the file. If the base count field is present, MacVector automatically updates the base count whenever you edit a sequence.
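For readers who want to see what a base count involves, the sketch below tallies the bases in a sequence string. It is not MacVector's implementation; the function name is hypothetical, and grouping every letter other than a, c, g, and t under "others" is an assumption made for the example.

```python
from collections import Counter


def base_count(sequence: str) -> dict[str, int]:
    """Count a, c, g, and t in a sequence, case-insensitively."""
    counts = Counter(sequence.lower())
    result = {base: counts.get(base, 0) for base in "acgt"}
    # Assumption: any other letter (for example an ambiguity code such
    # as n) is reported under a single "others" entry.
    result["others"] = sum(
        n for ch, n in counts.items() if ch.isalpha() and ch not in "acgt"
    )
    return result


print(base_count("ACGTTn"))
```

Recomputing the count from the full sequence after each edit, as here, is the simplest approach; an editor could equally adjust the totals incrementally.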

origin

This field specifies how the first base is located within the genome. The Origin field for pBR322, for example, reads EcoRI site. This field may be left blank or the word Unreported may be entered. MacVector limits this field to 254 characters.

Related Topics.

Single Sequence Files

Single Sequence editor

Features table

Annotations

Editing annotations