Kodomo

Пользователь

(Simple?..) Approach to inferring Karlin-Atschul statistics

Problem in two words

Speaking informally, the problem is as follows: in a real pairwise alignment of two protein sequences of lengthes M and N, you've observed two similar segments of consecutive aminoacids and the degree of similarity is characterized by so-called weight of this pair of segments S: it's a number and the higher is S - the higher is similarity. Now, you ask a question: given two randomly and independently generated sequences of lengthes M and N, what is the probability to find within them a pair of similar segments with weight greater or equal to S? If this probability turns to be very low, you believe that there is a homology between the observed segments in real sequences, otherwise, you are not sure, that the observed segments are related, because you could've found such a level of similarity just by chance in any two sequences.

More detailed definition of the problem

Suppose, that you have two randomly generated sequences of letters, so that each next letter in every sequence is independently chosen from an alphabet An={a1, a2, a3,..., an} with probabilities Pn={p1, p2, p3,..., pn}. For protein sequences, there are 20 letters-aminoacids and the probabilities, for instance, can be determined as frequencies of aminoacids in the wild.

For instance, you've generated sequences HANDEL and VIVALDI (you can calculate the probability of generating these sequences as p(H)*p(A)*p(N)*p(D)*p(E)*p(L) and p(V)*p(I)*p(V)*p(A)*p(L)*p(D)*p(I)). We're interested in detection of similar stretches of consecutive letters in our sequences, let's call them segments. For instance, you see a pair of segments AND - ALD (hANDel vivALDi) within your sequences, which, you think, are similar. You want to calculate the weight of pair of segments - S, which is the measure of similarity of segments.

This weight of pair of segments is calculated as the sum of weights of pairs of letters, corresponding to each other within the segments: S(AND-ALD) = w(A,A) + w(N,L) + w(D,D).

Weight of a pair of letters w(X,Y) reflects their similarity. For instance w(A,D) = -3, because these letters-aminoacids are not similar, w(I,L) = +3, because they are highly similar, and w(L,L)=+5, because identical aminoacids are very highly similar :). So, for the proteins there are 20 possible letters-aminoacids and 400 possible pairs of letters with 400 respective weights, which are given to you as a table, e.g. BLOSUM matrix.

In two real protein sequences of lengths M and N you've observed pair of segments with weight S. You want to calculate the probability p(M,N,S), that in two randomly generated sequences of lengths M and N, you'll find a pair of segments with weight not less than S. Here I'll describe, how to do that.

Inferring the Karlin-Altschul statistics

There will be 3 major steps in the discourse.

1(*I don't know, how to do this yet). I'll show how to reduce the problem of search of the desired probability to a problem of estimation of probability to find a segment with score higher than S in ONE randomly generated sequence of i.i.d. random variables (where each random variable can take one of 400 values, each of them is a weight of pair of letters).

2. I'll show that the second probability can be found as the sum of probabilities of states of Markov process, which generates all the sequences of i.i.d. random variables, that don't contain segment pairs with weights >= S, after generation of that sequences. I'll show that the transition matrix of that Markov process is close to Toeplitz matrices, which, in turn, can be asymptotically approximated by Circulant matrices.

3(*I don't know, how to do this yet). I'll show, that the ratio between the main eigenvalues of Toeplitz and Circulant matrices, upon the weight S approaching ininity is C*exp(e-S).



1. From two sequences to one sequence

First of all, instead of searching for p(M,N,S), let's look for 1-p(M,N,S). That is the probability, that there is no pair of segments with score higher than S in a pair of randomly generated sequences.

Secondarily, let's exhaust all the possible segment pairs. If you've already generated a pair of sequences, you can assign one sequence to another one in all the possible ways, let's call them shifts:

      HANDEL
VIVALDI

     HANDEL
VIVALDI

    HANDEL
VIVALDI

   HANDEL
VIVALDI

  HANDEL
VIVALDI

 HANDEL
VIVALDI

HANDEL
VIVALDI

HANDEL
 VIVALDI

HANDEL
  VIVALDI

HANDEL
   VIVALDI

HANDEL
    VIVALDI

HANDEL
     VIVALDI

Let's say we have shift 1, shift 2 and so on , totally N+M-1 shifts. Within each shift there are some positions, where sequences are assigned corresponding to each other, for instance for

 HANDEL
VIVALDI

there are 5 such positions, let's call it length of shift. So, each letter pair of a shift is assigned a weight and probability.

In the example above we've first fixed the sequences' pair and then exhausted all the possible shifts. But, in order to calculate 1-p(M,N,S) we can do that in reverse order: first fix a shift and for each shift check all the possible sequences generated not to contain any segment pairs with score > S.

And so, 1-p(M,N,S) is multiplication of probabilities, that you don't get the pair of segments with weight > S for each shift: 1-p(M,N,S) = (1-p(M,N,S, shift 1)) * (1-p(M,N,S, shift 2) ...).

We'll consider the probabilities of obtaining no segments with too high scores within each shift independent. !!!This is the weak point, which needs to be proved - I believe, that this is nearly obvious for continuous distribution, because if you know that weight of certain position is 4, you can say nothing about the pair of residues, which was present there!!!

2. Calculate the probability of obtaining no high scoring segment pairs within one sequence

Now, we have a sequence of i.i.d. random variables, each of them represents aminoacid pairs and can take values s(i,j) from BLOSUM matrix with probability p(i)*p(j). We need to calculate the probability, that it contains no segments with weight equal to or higher than S.

For example, let S equal to 15 and consider a sequence, which satisfies this condition:

A N I G N E C I
A R L F W Q G L
4 4 6 3 0 2 0 2

w(A,A) = 4; w(N,R) = 0; w(I,L) = 2; w(G,F) = -3; w(N,W) = -4; w(E,Q) = 2; w(C,G) = -3; w(I,L) = 2

Under each aminoacid pair you can see a number, which is the weight of the highest scoring segment, which ends on current position. The pair N-W has weight -4, so note that the highest scoring segment, which ends on the N-W position has the weight 0, since it doesn't contain any pairs in fact (if it contained some, it would've had negative weight, which is worse than just starting over again).

So, the process of generation of all the possible sequences of pairs of aminoacids with no segment pairs > S is, in fact, a Markov process with the number of states equal to S; each state is the weight of highest scoring segment, which ends on current position, so the allowed states are weights from zero to S-1 (if the matrix contains integer numbers only).

To explain, how does the transition matrix of this process look like, let's consider a very primitive alphabet: let our alphabet contain only 3 symbols: Y-Y, which gets the weight 2 and has probability 0,16 , R-R with the weight 1 and probability 0,36 and Y-R == R-Y with the weight -3 and probability 0,48. Let our weight limit S be 7 and let's consider the transition matrix.

If at the previous moment of time we were at the state 0, with probability 0,16 we'll transfer to the state 2, with probability 0,36 to the state 1 and with probability 0,48 we'll stay at the same place, because instead of getting -3 we return to zero. Nearly the same will happen, if the previous state was 1, 2 or 3. If it were 4, than from this state we'll move to 1 with probability 0,48. If it were 5, than we can't transfer to 7, because it's our limit, so the Y-Y pair can not be created, if we're in this state. For the state 6 no pairs with positive weights are allowed and the only option is to go back to state 3. So the transition matrix looks like (the state-vector is a right vector for this matrix):

http://kodomo.cmm.msu.ru/~thewatcher/ConciseMath/KarlinAltschul/Transition_matrix_Altschul.gif

Note that except by the leftmost column of several first elements, the matrix is a banded Toeplitz matrix - each row of it is a repetition of the previous one, such that the elements are shifted by one to the right. If we increase S, the number of matrix's dimensions will increase by one, too, but its structure will stay the same. If S approaches infinity, the norm of this matrix is known to approach the norm of respective circulant matrix:

http://kodomo.cmm.msu.ru/~thewatcher/ConciseMath/KarlinAltschul/Circulant_matrix_Altschul.gif

http://ee.stanford.edu/~gray/toeplitz.pdf

Circulant matrices describe the circular convolution, their eigenvectors consist of the natural powers of roots of unity. Their eigenvalues are dot products of eigenvectors and the first row of the matrix. As the 1st row of the matrix represents the distribution of a single random variable (probability of weight of a pair to be equal to something), the eigenvalues of the circulant matrix are Discrete Fourier Transformations of our distribution.

If the values of BLOSUM matrix were not the integer values, we would've used continuous state space and infinite matrix, but its overall structure would've stayed the same. As n-th row of the matrix represents the distribution of a single random variable, shifted by n positions to the right, the eigenvalues of the circulant matrix in the continuous case would represent the continuous Fourier integrals.

As the number of i.i.d. random variables in our segment pair approaches infinity, the probability of obtaining no interval of weight >= S approaches lambda(S)length of series of i.i.d.r.v.*coefficient of main eigenvector in Fourier series * sum of coordinates of the main eigenvector(S), where lambda is the main eigenvalue of our transition matrix, which approaches the main eigenvalue of circulant matrix, and length of series of i.i.d.r.v. is M (length of the shorter sequence) in the case of BLAST. As we need to ensure, the absence of weight > S for all the N shifts, our probability reaches

coefficient of main eigenvector in Fourier series * sum of coordinates of the main eigenvector(S) * lambda(S)NM