IMSA V1.00
(Online Documentation)
Sumedha Gunewardena,
Oxford University Computing Laboratory,
Wolfson Building, Parks Road,
Oxford, OX1 3QD, UK.
Sumedha.Gunewardena@comlab.ox.ac.uk
Introduction
IMSA is a multiple sequence alignment tool that accepts, as input,
prior knowledge of known regions of homology or known structural or
functional elements. The program annotates the input sequences
based on this knowledge, and the annotation is then used to guide
the alignment of the sequences. The program tries to capture two
biologically reasonable conjectures that can vastly improve the
sensitivity of the alignments. The first is the need to preserve
certain biologically distinguishable structures during the
alignment process. The second is the need to align residues of
certain distinguishable segments of sequence with each other with
higher probability than the substitution matrix alone would
specify.
The multiple sequence alignment algorithm used in IMSA is modified
from the standard iterative pair-wise alignment algorithm. We
use what we call 'sequence tags' to tag the input sequence. This
is an efficient and robust method to tag biological sequences
that we developed for this purpose.
Downloading and Installing IMSA
IMSA is written in ANSI C. It consists of the files imsa.c, utility.c,
autobind.c, readfile.c, tag.c and the header file imsa.h. These
files and their Makefile can be downloaded from here. Download
these files to your UNIX directory. Use the make command to
compile the source files.
IMSA Input File Format
The following is an illustration of the file format used by IMSA.
Strings appearing within [ ] are identifiers and must appear as
they are (including the square brackets). Strings appearing within
{ } define motif sets. Where |text| is present, it must be
replaced with a digit, character or string. The scoring matrix
should be defined as a left triangular matrix, with its elements
corresponding to the order in which the alphabet is defined. Text
following % in the illustration is a comment and is not part of
the actual input file.
[SPS]
{ |motif| |motif| · · · |motif| }        % Motif sets
{ |motif| |motif| · · · |motif| }
  ·
  ·
{ |motif| |motif| · · · |motif| }
[E-SPS]
[MAT]
{ |char| |char| · · · |char| }           % The alphabet
|digit|                                  % The substitution matrix
|digit| |digit|
  ·
  ·
|digit| |digit| · · · |digit|
[E-MAT]
> |string|                               % Sequence name
|string|                                 % Sequence
> |string|                               % Sequence name
|string|                                 % Sequence
  ·
  ·
> |string|                               % Sequence name
|string|                                 % Sequence
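As an illustration of the motif-set syntax, here is a minimal sketch of how one line of the form { |motif| |motif| · · · |motif| } might be split into its motifs. It is not taken from the IMSA source: the function name parse_motif_set is hypothetical, and the sketch assumes that the braces and motifs are separated by whitespace.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of IMSA.
 * Splits one motif-set line of the form "{ m1 m2 ... mk }" in place.
 * Assumption: braces and motifs are separated by whitespace.
 * Stores pointers to the motifs in out[] and returns their count,
 * or -1 if a brace is missing or more than max_out motifs are found. */
int parse_motif_set(char *line, char *out[], int max_out)
{
    int count = 0;
    char *tok;

    assert(line != NULL && out != NULL);
    tok = strtok(line, " \t\r\n");
    if (tok == NULL || strcmp(tok, "{") != 0)
        return -1;                  /* opening brace missing */
    while ((tok = strtok(NULL, " \t\r\n")) != NULL) {
        if (strcmp(tok, "}") == 0)
            return count;           /* closing brace ends the set */
        if (count == max_out)
            return -1;              /* too many motifs for out[] */
        out[count++] = tok;
    }
    return -1;                      /* closing brace never seen */
}
```

Because strtok modifies its argument, the caller should pass a writable copy of the line rather than a string literal.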
Example:
The following is an example file containing 8 protein sequences to
be aligned. It provides 7 motif sets, with 8 motifs in each set, as
input.
[SPS]
{ VLTQPP VLTQPP QLVQSG QLEQSG SVFLFP SVTLFP SVFPLA QVYTLP }
{ TISCTG TISCSG RLSCSS SLTCTV EVTCVV TLVCLI ALGCLV SLTCLV }
{ NVKWY TVNWY AMYWV YWTWV KFNWY TVAWK TVSWN AVEWE }
{ SVSKS SGSKS TISRN TMLVN KTKPR GVETT GVHTF NYKTT }
{ TSATLAI ASASLAI NTLFLQM NQFSLRL VVSVLTV ASSYLSL LSSVVTV LYSKLTV }
{ YYCQSY YYCAAW YFCARD YYCARN YKCKVS YSCQVT YICNVN FSCSVM }
{ VFG VFG YWG VWG IEK VEK VDK TQK }
[E-SPS]
[MAT]
{ C S T P A G N D E Q H R K M I L V F Y W }
14
-1 6
-5 2 7
-6 1 -1 10
-5 2 2 1 6
-8 1 -3 -3 1 8
-8 2 0 -3 -1 -1 7
-11 -1 -2 -4 -1 -1 4 8
-11 -2 -3 -3 0 -2 1 5 8
-11 -3 -3 -1 -2 -5 -1 1 4 9
-6 -4 -5 -2 -5 -7 2 -1 -2 4 11
-6 -1 -4 -2 -5 -8 -3 -6 -5 1 1 10
-11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8
-11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13
-5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9
-12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9
-4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8
-10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7 -11 -2 0 0 -5 12
-2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1 -10 -10 -8 -4 -5 -6 6 13
-13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9 -11 -12 -7 -14 -2 -2 19
[E-MAT]
> FABVL
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQA
EDEADYYCQSYDRSLRVFGGGTKLTVLR
> FB4VL
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLA
IGGLQSEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQ
> FB4VH
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDS
KNTLFLQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSS
> FABVH
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSK
NQFSLRLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSS
> FCCH2
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQN
WLDGKEYKCKVSNKALPAPIEKTISKAKG
> FABCL
QPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTPSKQSNNKYAASSYLSLTP
EQWKSHKSYSCQVTHEGSTVEKTVAPTSCS
> FABCH1
ASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYIC
NVNHKPSNTKVDKKVEPKSA
> FCCH3
QPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRW
QQGNVFSCSVMHEALHNHYTQKSLSL
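The left triangular layout of the [MAT] section can be expanded into a full symmetric lookup table. The following is a minimal sketch of one way to do that, not IMSA's actual implementation: the type SubstMatrix and the functions fill_from_triangle and pair_score are hypothetical names, and the sketch assumes that scores are integers and that each alphabet symbol is a single character.

```c
#include <assert.h>
#include <string.h>

#define MAX_ALPHABET 32

/* Hypothetical container; IMSA's own data structures are not
 * described in this document. */
typedef struct {
    char alphabet[MAX_ALPHABET + 1];  /* symbols in [MAT] order */
    int  size;                        /* number of symbols */
    int  score[MAX_ALPHABET][MAX_ALPHABET];
} SubstMatrix;

/* Fill a symmetric matrix from a left triangular one.
 * tri holds the rows back to back: row i contributes i + 1 values. */
void fill_from_triangle(SubstMatrix *m, const char *alphabet,
                        const int *tri)
{
    int i, j, k = 0;
    int n = (int)strlen(alphabet);

    assert(n > 0 && n <= MAX_ALPHABET);
    strcpy(m->alphabet, alphabet);
    m->size = n;
    for (i = 0; i < n; i++) {
        for (j = 0; j <= i; j++) {
            m->score[i][j] = tri[k];  /* entry as written in the file */
            m->score[j][i] = tri[k];  /* mirrored entry */
            k++;
        }
    }
}

/* Score for a pair of residues; both must be in the alphabet. */
int pair_score(const SubstMatrix *m, char a, char b)
{
    const char *pa = strchr(m->alphabet, a);
    const char *pb = strchr(m->alphabet, b);

    assert(a != '\0' && b != '\0' && pa != NULL && pb != NULL);
    return m->score[pa - m->alphabet][pb - m->alphabet];
}
```

With the first three rows of the example matrix (alphabet C, S, T), pair_score returns 14 for the pair (C, C) and -1 for both (C, S) and (S, C).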
Parameters
The input to IMSA is a set of parameters followed by the file
containing the sequences to be aligned. Five parameters can be
specified at the command line:
- Gap start-up cost (g) :
The gap start-up cost is the first part of a two-tier gap cost. It specifies the cost of starting a new gap in the alignment.
- Gap extension cost (s) :
The gap extension cost is the second part of the two-tier gap cost. It specifies the cost of aligning a residue with a gap. It is usually set to a value much smaller than the gap start-up cost.
- Binding penalty (b) :
The binding penalty specifies how tightly the given motifs are to be preserved throughout the alignment. The higher the binding penalty, the more tightly the motifs are preserved.
- Attraction weight (w) :
The attraction weight defines the strength of attraction between residues of motifs in the same motif set.
- Number of realignments (r) :
The number of realignments specifies the number of times each sequence should be removed from the alignment and realigned afresh with the existing set of sequences.
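To make the two-tier gap cost concrete, here is a minimal sketch of how the cost of a single gap might be computed from g and s. The function gap_cost is a hypothetical name. The sketch assumes that the start-up cost is charged once per gap and the extension cost once per gapped residue. This is one common convention, and this document does not state which convention IMSA uses.

```c
#include <assert.h>

/* Cost of one contiguous gap of `length` residues.
 * Assumption (not confirmed by the IMSA documentation):
 * g is charged once, s is charged once per gapped residue. */
int gap_cost(int g, int s, int length)
{
    assert(length >= 0);
    if (length == 0)
        return 0;               /* no gap, no cost */
    return g + s * length;
}
```

Under this assumption, with g = 8 and s = 4, one gap of three residues costs 8 + 3*4 = 20, while three separate one-residue gaps cost 3*(8 + 4) = 36. This is why a small extension cost favours fewer, longer gaps.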
Running IMSA
IMSA is run from the command line. The program name should be followed
by the list of parameters (optional) and the input file
name. Here is an example command line statement to run IMSA on
the sequence file "alignment_file.msa".
$ imsa -g8 -s4 -b5 -w25 alignment_file.msa
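The option style used above (a single letter followed directly by its value) can be parsed with a few lines of C. The following is a minimal sketch and not IMSA's own argument handling: the Options type and parse_args function are hypothetical, values are assumed to be integers, and the -r flag is assumed to follow the same pattern as the four flags shown in the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical option record; field names are illustrative only.
 * The caller sets any defaults before calling parse_args. */
typedef struct {
    int g, s, b, w, r;
    const char *input_file;   /* last non-option argument, or NULL */
} Options;

/* Parse arguments of the form: -g8 -s4 -b5 -w25 file
 * Returns 0 on success, -1 on an unknown flag, a non-numeric
 * value, or a missing input file name. */
int parse_args(int argc, char *argv[], Options *opt)
{
    int i;

    assert(opt != NULL);
    opt->input_file = NULL;
    for (i = 1; i < argc; i++) {
        const char *arg = argv[i];
        char *end;
        long value;

        if (arg[0] != '-') {          /* not a flag: the input file */
            opt->input_file = arg;
            continue;
        }
        if (arg[1] == '\0')
            return -1;                /* a lone "-" is not accepted */
        value = strtol(arg + 2, &end, 10);
        if (end == arg + 2 || *end != '\0')
            return -1;                /* value missing or not a number */
        switch (arg[1]) {
        case 'g': opt->g = (int)value; break;
        case 's': opt->s = (int)value; break;
        case 'b': opt->b = (int)value; break;
        case 'w': opt->w = (int)value; break;
        case 'r': opt->r = (int)value; break;  /* assumed flag */
        default:  return -1;
        }
    }
    return opt->input_file != NULL ? 0 : -1;
}
```

Flags that are not given leave the corresponding field untouched, so the caller can preload whatever defaults are appropriate.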
Please send questions, comments, and bug reports to the author.