Tuesday, March 20, 2007

Sequence Alignment - An introduction


Sequence alignment is one of the most important and basic concepts in computational genomics. Given 2 sequences, what is the best way to arrange the letters of the sequences one below the other so that there are maximum matches between them? Putting it another way, given 2 sequences s and t, find s' and t' such that
1. by removing the gaps from s' and t', we can retrieve s & t.
2. there is no i such that s[i] and t[i] are gaps.
3. length(s') = length(t') >= max(length(s),length(t)) and <= sum(length(s),length(t)) In the above definition, s[i] means the letter in sequence s at the ith position. length(s') means length of sequence s' max(length(s),length(t)) means maximum of length of s and length of t sum(length(s),length(t)) means sum of the lengths of s and t I think things will get absolutely clear with some examples here. Mind you, I am trying to explain keeping a layman in mind. Some of the terms used might sound too technical to some while to others my explanations might seem too kiddish. I hope to attain the right balance. Let us assume we have 2 DNA sequences s = ACCT and t = ACGT. For those who dont know about DNA click here.

When we try to align these 2 sequences,

A C C T
| | : |
A C G T

The A,C,T match whereas there is a mismatch at position 3. Naturally, there are many alignments possible for any such pair. (Real data would consist of much longer sequences). The alignment chosen is the one with the highest alignment score.

The score is calculated adding the scores for the matches and mismatches(usually negative) and penalizing for gaps. For example, if match(M) = +1, mismatch(m) = -1 and gap penalty(g) = 2,

A C C - T
| | |
A C - G T

"-" are gaps. The alignment scores for the 2 alignments are

1. M+M+m+M = 1+1-1+1 = 2
2. M+M-g-g+M = 1+1-2-2+1 = -1

Clearly, the first option is preferable.

There are 2 types of alignments based on number of sequences to align:
1. Pairwise alignments
2. Multiple Sequence Alignments

There are 2 types of alignments based on parts of the sequences aligned:
1. Global alignments - align entire sequences
2. Local Alignments - align regions of the sequences

I will cover each of these types in latter posts.


References and Further Reading:
1. Bioinformatics - Sequence Analysis by David Mount
2. Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453

No comments: