Aligning sequences with Align to Reference

Quick Start

You can align sequences against a reference by:

The Algorithm

The first step in an Align to Reference alignment is to find all the short runs of perfect matches of "Hash Value" length, and then to extend each match left and right to the end of the perfect alignment. Each such run is a "diagonal". A diagonal is scored by the number of perfectly matching residues it contains. The best-scoring diagonal is then extended in both directions, past the mismatches at each end, to find the best possible score. An extension terminates at the end of one of the sequences, or when the "current" score falls "X-dropoff" below the best score.
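The seed-and-extend idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not MacVector's implementation: the function names find_perfect_diagonals and extend_right are hypothetical, the extension shown is ungapped and one-directional (the real algorithm also handles insertions and deletions), and the defaults are simply the values listed under Parameters.

```python
from collections import defaultdict


def find_perfect_diagonals(reference, sample, hash_value=8):
    """Return maximal perfect-match runs as (ref_start, sample_start, length)."""
    # Lookup table: every word of length hash_value in the reference.
    index = defaultdict(list)
    for i in range(len(reference) - hash_value + 1):
        index[reference[i:i + hash_value]].append(i)

    seen = set()
    diagonals = []
    for j in range(len(sample) - hash_value + 1):
        for i in index.get(sample[j:j + hash_value], ()):
            # Extend left to the start of the perfect run.
            start_i, start_j = i, j
            while (start_i > 0 and start_j > 0
                   and reference[start_i - 1] == sample[start_j - 1]):
                start_i -= 1
                start_j -= 1
            if (start_i, start_j) in seen:
                continue  # this run was already reported from an earlier seed
            seen.add((start_i, start_j))
            # Extend right to the end of the perfect run.
            end_i, end_j = i + hash_value, j + hash_value
            while (end_i < len(reference) and end_j < len(sample)
                   and reference[end_i] == sample[end_j]):
                end_i += 1
                end_j += 1
            diagonals.append((start_i, start_j, end_i - start_i))
    return diagonals


def extend_right(reference, sample, i, j, match=2, mismatch=-3, x_dropoff=25):
    """Ungapped extension to the right of (i, j).

    Returns (best_score_gain, length_at_best). Stops at the end of either
    sequence, or once the current score is x_dropoff below the best score.
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(reference) and j + k < len(sample):
        score += match if reference[i + k] == sample[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score >= x_dropoff:
            break
    return best, best_len
```

In this sketch a lower x_dropoff makes extend_right give up sooner in a poorly matching region, while a higher value lets it carry on and possibly recover a better score further along.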

The cDNA alignment algorithm takes a more complex approach. It needs to deal with multiple segments, so instead of just taking the best diagonal, it keeps up to "Max Diagonals" diagonals in an array. Each of these diagonals is extended separately to find the best possible score for that diagonal. Note that the most CPU-intensive part of the algorithm is the extension of the "perfect" diagonals, which has to account for mismatches, insertions, deletions and so on. With a lot of diagonals this can take a substantial amount of time, especially with a high "Sensitivity" value. The "Minimum Score" for a diagonal is the score that must be exceeded before a diagonal is saved. With the current default of 25 and a residue "Match" score of 2, at least 13 perfect matches are needed (13 x 2 = 26) before a diagonal is saved.
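As a rough illustration of how "Minimum Score" and "Max Diagonals" might interact, the sketch below filters a list of perfect diagonals and keeps the best few. The function name select_diagonals and the tuple layout are hypothetical, and keeping the highest-scoring diagonals is an assumption; the text above only says that up to "Max Diagonals" are kept.

```python
import heapq


def select_diagonals(diagonals, match=2, minimum_score=25, max_diagonals=50):
    """Filter and rank perfect diagonals.

    diagonals: iterable of (ref_start, sample_start, length).
    Returns up to max_diagonals tuples of
    (score, ref_start, sample_start, length), best score first.
    """
    scored = [(length * match, ref_start, sample_start, length)
              for ref_start, sample_start, length in diagonals]
    # A diagonal is saved only if its score EXCEEDS the minimum score.
    kept = [d for d in scored if d[0] > minimum_score]
    # Assumption: when there are too many, the highest-scoring ones are kept.
    return heapq.nlargest(max_diagonals, kept, key=lambda d: d[0])
```

For example, with the defaults a 12-residue diagonal (score 24) is discarded, while a 13-residue diagonal (score 26) is saved for later extension.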

The default values for cDNA alignments differ from those for sequence confirmation. The Hash value is bigger (6) so the algorithm needs at least 6 perfect matches before a diagonal is extended - this speeds the initial search somewhat. The Sensitivity is also reduced to 3, to speed up the extensions. The Mismatch and Gap Penalty are also higher, to decrease the likelihood of ragged ends. However, this can be a disadvantage if you are aligning trace files of cDNA clones against a genome, in which case the Mismatch and Gap Penalty values should be dropped back to 3 or 4.

The CRISPR INDEL Detection preset differs in that it ensures gaps are kept together. This slows the alignment when you have a large coverage depth over shorter references, so use this preset only when required.

The Consensus is also generated during the initial alignment.

Parameters

Before aligning, a dialog appears with the following parameter settings:

- Match. (default 2). This is the score the algorithm assigns when it finds a match between a residue in the sample sequence and a residue in the reference sequence. Only AGCT matches get this score.

- Mismatch. (default -3). This score is assigned whenever two residues mismatch.

- Ambiguous Match. (default 0). This is the score assigned whenever there is an ambiguous match such as N vs A or G vs R. The standard IUPAC ambiguity code is used.

- Gap Penalty (default 4). This score is assigned whenever a gap has to be inserted into either the sample or the reference sequence to maintain an optimal alignment.

- Hash Value (default 8). This is used to create a lookup table to speed up comparisons. The value translates to the number of perfect matches required between reference and sample before the algorithm tries to extend the match. Smaller values create slower, more sensitive searches. Larger values are faster and somewhat less sensitive, but require more memory.

- Sensitivity (default 6). This value determines how far ahead the algorithm should look when it encounters a mismatch to work out the best way to extend an alignment. A value of 4 means that the algorithm can handle a stretch of 4 mismatches and still maintain the optimal alignment. Longer stretches of mismatches/gaps require a higher value. Because the algorithm is recursive, a high value can slow the alignment exponentially. Exceeding a value of 8 is not recommended unless you are aligning just one or two sequences, have a VERY fast machine, or are extremely patient!

- Score Threshold (default 50). This is the minimum score that an alignment must have before the algorithm considers that a sample sequence truly aligns to the reference. This ties in closely with the Match/Mismatch scores. The defaults (Match=2, Score Threshold=50) mean that the alignment must have the equivalent of 25 perfect consecutive matches (2 x 25 = 50) to be considered valid. If a mismatch or gap interrupts the alignment, then this must be compensated for by additional matches to bring the overall score up to or above 50.

- X Dropoff. (default 25). When the algorithm is extending an alignment, it needs to know when to give up if it encounters a region of dissimilarity. It does this by making a note of the best score it was able to achieve and giving up when the current score drops X Dropoff below the best score. Setting this to a low value speeds up computation, but the algorithm may miss alignments that extend through a region of poor similarity.
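To show how the scoring parameters above combine, here is a minimal Python sketch that scores an already-gapped pair of sequences column by column. It is an illustration under stated assumptions, not MacVector's code: the function names are hypothetical, "-" is assumed to mark a gap, the Gap Penalty is assumed to be subtracted once per gap position, and two ambiguity codes with no base in common are assumed to count as a mismatch.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_column(ref_base, sample_base,
                 match=2, mismatch=-3, ambiguous=0, gap_penalty=4):
    """Score one aligned column."""
    if ref_base == "-" or sample_base == "-":
        return -gap_penalty  # assumption: penalty applied per gap position
    ref_set, sample_set = set(IUPAC[ref_base]), set(IUPAC[sample_base])
    if len(ref_set) == 1 and len(sample_set) == 1:
        # Only unambiguous A/G/C/T identities earn the Match score.
        return match if ref_set == sample_set else mismatch
    # At least one ambiguity code: compatible codes get the Ambiguous Match
    # score; incompatible codes are treated as a mismatch (assumption).
    return ambiguous if ref_set & sample_set else mismatch


def score_alignment(ref_aligned, sample_aligned, **params):
    """Sum the column scores of two equal-length, gapped sequences."""
    return sum(score_column(r, s, **params)
               for r, s in zip(ref_aligned.upper(), sample_aligned.upper()))


def passes_threshold(score, score_threshold=50):
    """An alignment is accepted when its score reaches the threshold."""
    return score >= score_threshold
```

With the defaults, 25 consecutive perfect matches score exactly 50 and pass, whereas 24 matches plus one mismatch score 45 and would need three more matches to reach the threshold.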

cDNA Alignment extra parameters

- Max. Diagonals (Valid range 1 - 1000, default 50) This is the maximum number of diagonals that MacVector should keep and evaluate as potential exons during cDNA alignment. If this is set too low, MacVector may miss some short exon sequences. Setting this to a large value will lead to longer computation times.

- Minimum Score. (Valid range 1 - 1000, default 25) This is the minimum acceptable score for a diagonal to be saved. The default setting of 25, in conjunction with the default residue Match score of 2, means that a diagonal must span at least 13 perfect matches before it is saved for later evaluation.
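The relationship between Minimum Score and diagonal length can be worked out for other settings with a one-line calculation. The helper name below is hypothetical, and the sketch assumes that the score of a perfect diagonal is simply its length multiplied by the Match score and that the score must strictly exceed the Minimum Score.

```python
def min_perfect_matches(minimum_score=25, match=2):
    """Smallest diagonal length n for which n * match exceeds minimum_score."""
    return minimum_score // match + 1


# Print the required length for a few illustrative Match scores.
for match_score in (1, 2, 5):
    print(match_score, min_perfect_matches(25, match_score))
```

With the defaults this gives 13, matching the figure quoted above; raising the Match score lowers the number of perfect residues a diagonal needs.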

Clicking on the Defaults button resets the parameters to their default values. Clicking on Cancel dismisses the dialog without affecting the alignment. Clicking on the OK button initiates the calculation.

During alignment, a progress dialog is displayed. If you click on the Cancel button, the alignment will be aborted and the original assembly will be restored. Otherwise, when the calculation is complete, the Sequence Confirmation window will refresh with the new alignment.

If a sample sequence could not be aligned, it appears in the assembly in italic text. If the ends of a sample sequence did not have a good match to the reference, these are indicated by greyed-out text in the display. These masked ends, and the unaligned sequences, are not included in consensus calculations.

Related Topics.

Align to Reference

The main window

Editing assemblies

Saving assemblies

Consensus

Heterozygote Analysis of Sanger Trace files