A preliminary study on the BLAST algorithm: Basic Local Alignment Search Tool

Sequence Database Search

BLAST is a heuristic algorithm: it does not guarantee that it finds the optimal solution, but it aims to find a good enough solution in much less time.

Global alignment: dynamic programming (DP)

$$
\begin{array}{l}
F(0,0)=0 \\
F(i, j)=\max \left\{\begin{array}{l}
F(i-1, j-1)+s\left(x_{i}, y_{j}\right) \\
F(i-1, j)+d \\
F(i, j-1)+d
\end{array}\right.
\end{array}
\qquad \text{(Global Alignment)}
$$
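The recurrence can be sketched in a few lines of Python. This is only a minimal illustration: the match/mismatch scores standing in for s(x_i, y_j), the gap penalty d = -2, and the boundary rows F(i,0) = i·d and F(0,j) = j·d are assumptions that the notes above do not spell out.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, d=-2):
    """Fill the global-alignment DP matrix F and return F[len(x)][len(y)].

    match/mismatch play the role of s(x_i, y_j); d is the (negative) gap
    penalty. All three values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align x_i with y_j
                F[i - 1][j] + d,      # x_i against a gap
                F[i][j - 1] + d,      # y_j against a gap
            )
    return F[n][m]
```

The global score is always read from the bottom-right cell, because every residue of both sequences must take part in the alignment.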

Why local alignment was proposed:

  1. Functionally related proteins may differ greatly in overall sequence yet share the same functional domain. Such sequence fragments can carry out independent biological functions, and a global alignment cannot capture them.
  2. connotation

Local alignment: shrink the DP problem to the best-scoring region. Compute the optimal alignment path, find the local optimum, and floor every cell at 0 so that only local similarity is scored.

$$
\begin{array}{l}
F(0,0)=0 \\
F(i, j)=\max \left\{\begin{array}{l}
F(i-1, j-1)+s\left(x_{i}, y_{j}\right) \\
F(i-1, j)+d \\
F(i, j-1)+d \\
0
\end{array}\right.
\end{array}
\qquad \text{(Local Alignment)}
$$

Traceback: Decode the Local Alignment.

Traceback begins at the highest score in the matrix and continues until a cell with score 0 is reached.

There may be multiple local matching results.
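The local recurrence and its traceback can be sketched together as follows. This is a minimal sketch under assumed scores (match = 2, mismatch = -1, gap d = -2), and it reports only the first best-scoring cell it finds, so other equally good local alignments are not returned.

```python
def local_alignment(x, y, match=2, mismatch=-1, d=-2):
    """Return (best score, aligned part of x, aligned part of y).

    Scores are illustrative assumptions. Only one best local alignment
    is reported even when several exist.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + d, F[i][j - 1] + d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: start at the highest score, stop at the first 0.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `local_alignment("TTTACGTTT", "GGGACGGGG")` picks out the shared `ACG` core and ignores the unrelated flanks.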

BLAST idea: seeding and extending

  1. Find matches (seeds) between the query and the subject.

    (Find the seeds first.)

  2. Extend each seed into a High-scoring Segment Pair (HSP): run the Smith-Waterman algorithm on the specified region only.

    (Extend in both directions from the seed and construct an alignment.)

  3. Assess the reliability of the alignment.

    (Calculate statistical significance.)

Seeding

For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into consecutive, overlapping "seed words" of length w.
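Slicing the query into seed words is a one-liner. The function name `seed_words` and the example query are made up for illustration.

```python
def seed_words(query, w):
    """Return every overlapping word of length w with its start position."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]
```

A query of length L yields L - w + 1 words, so `seed_words("MKTAY", 3)` gives the three words `MKT`, `KTA` and `TAY`.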

Speedup: Index database

The database is pre-indexed so that all positions of a given seed word in the database can be located quickly (roughly constant time per lookup, with the index itself built in time linear in the database size).

Align seeds to pre-indexed sequences
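One simple way to picture the index is a hash table from each word to the places where it occurs. The sketch below uses a plain Python dictionary. The real BLAST database format is more elaborate, and the names `build_index` and `find_hits` are hypothetical.

```python
from collections import defaultdict


def build_index(db, w):
    """Map every length-w word to the (sequence id, position) pairs where it occurs.

    db is assumed to be a dict of {sequence id: sequence string}.
    """
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_hits(query, index, w):
    """Look up each query word in the index; return (query pos, seq id, subject pos)."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for seq_id, spos in index.get(query[qpos:qpos + w], []):
            hits.append((qpos, seq_id, spos))
    return hits
```

The index is built once per database and reused for every query, which is where the speedup comes from.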

Diagonal and Two-hits

An ungapped alignment runs parallel to the main diagonal of the query-subject matrix, and the best-scoring path tends to pass through hits that share a diagonal. Isolated hits can therefore be discarded, and only clusters of two or more hits on the same diagonal are kept, which shrinks the search space.
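The diagonal filter can be sketched as follows for a single subject sequence. A hit at query position q and subject position s lies on diagonal s - q. The sketch keeps pairs of neighbouring, non-overlapping hits on the same diagonal that are at most `window` positions apart. The function name and the default window of 40 are assumptions for illustration.

```python
from collections import defaultdict


def two_hit_filter(hits, w, window=40):
    """Keep pairs of non-overlapping hits that share a diagonal.

    hits: list of (query pos, subject pos) for one subject sequence.
    Returns a list of (diagonal, first query pos, second query pos).
    """
    by_diag = defaultdict(list)
    for qpos, spos in hits:
        by_diag[spos - qpos].append(qpos)

    kept = []
    for diag, qs in by_diag.items():
        qs.sort()
        # Compare each hit with the next one on the same diagonal.
        for a, b in zip(qs, qs[1:]):
            if w <= b - a <= window:  # non-overlapping and close enough
                kept.append((diag, a, b))
    return kept
```

Hits that are alone on their diagonal never appear in the output, so they are never extended.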

Insert image description here

Starting from a hit cluster, extend in both the left and right directions, and stop once the total score has dropped by more than X below the best score seen so far.

The algorithm in the lower right corner can be used to expand the area.
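A minimal sketch of the drop-off rule for ungapped extension is given below. The match/mismatch scores and the default X are assumptions, and real BLAST scores residue pairs with a substitution matrix instead.

```python
def xdrop_extend(query, subject, qpos, spos, w, x_drop=5, match=2, mismatch=-3):
    """Extend a seed of length w without gaps in both directions.

    Returns (total score, query start, query end) with the end exclusive.
    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen in that direction.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed_score = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))

    def walk(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            qi += step
            si += step
        return best, best_len  # trim back to the best-scoring prefix

    right_gain, right_len = walk(qpos + w, spos + w, 1)
    left_gain, left_len = walk(qpos - 1, spos - 1, -1)
    return seed_score + left_gain + right_gain, qpos - left_len, qpos + w + right_len
```

With the seed `GTA` at position 5 of both `TTTACGTACGAAA` and `GGGACGTACGCCC`, the extension grows to the shared `ACGTACG` and then gives up after two mismatches on each side.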

Insert image description here

Speedup: mask low-complexity regions

Low-complexity sequences yield false positives.

  • CACACACACACACACA, K=0.36
  • KLKLKLKLKLKL

Mask repetitive low-complexity regions to avoid generating too many false-positive hits.
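To make the idea concrete, here is a rough sketch that masks windows with low Shannon entropy. This is a stand-in chosen for brevity: it is not the K complexity measure quoted in the list above, nor the SEG or DUST filters used in practice, and the window size and entropy threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy (in bits) of the letter composition of s."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="X"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)
```

A two-letter repeat such as `KLKLKLKLKLKL` has an entropy of 1 bit and is masked completely, while a typical mixed protein segment is left untouched.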

Whether masking is appropriate cannot be decided in advance, so it may not suit every search.

Neighbourhood words

To improve sensitivity, in addition to the seed word itself, BLAST also uses highly similar "neighbourhood words" (based on the substitution matrix) for seeding.

Specifically, every possible variant of the seed word is scored against it using the substitution matrix.

  • Seed: DKT
  • DRT = 6 + 2 + 5 = 13, etc. Only variants with a score >= 11 are kept (to reduce false positives).
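The DKT example can be reproduced with a small sketch. To keep it self-contained, the scores below are a hand-copied excerpt of BLOSUM62 restricted to six residues (D, E, K, R, S, T). That excerpt is an assumption and should be checked against the full matrix before being relied on. The resulting neighbourhood is therefore only a subset of what a 20-letter alphabet would give.

```python
from itertools import product

ALPHABET = "DEKRST"  # reduced alphabet, for illustration only

# Assumed excerpt of BLOSUM62 (upper triangle; lookups are symmetric).
_UPPER = {
    "DD": 6, "DE": 2, "DK": -1, "DR": -2, "DS": 0, "DT": -1,
    "EE": 5, "EK": 1, "ER": 0, "ES": 0, "ET": -1,
    "KK": 5, "KR": 2, "KS": 0, "KT": -1,
    "RR": 5, "RS": -1, "RT": -1,
    "SS": 4, "ST": 1,
    "TT": 5,
}


def sub_score(a, b):
    """Substitution score for residues a and b (order does not matter)."""
    return _UPPER[a + b] if a + b in _UPPER else _UPPER[b + a]


def word_score(u, v):
    """Sum of position-wise substitution scores of two equal-length words."""
    return sum(sub_score(a, b) for a, b in zip(u, v))


def neighbourhood(seed, threshold=11):
    """All words over ALPHABET scoring at least threshold against seed."""
    return sorted(
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(seed))
        if word_score(seed, "".join(cand)) >= threshold
    )
```

Here `word_score("DKT", "DRT")` reproduces 6 + 2 + 5 = 13, so DRT is kept, while DTT scores 6 - 1 + 5 = 10 and is dropped.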

Assess statistical significance (quality assessment)

Given the large data volume, it is critical to provide some measure of the statistical significance of a given hit.

Quality is assessed after the final alignment has been obtained.

The goal is to ensure that the alignment did not arise by chance (when the database is large enough, even randomly generated sequences can produce matches).

E-value: how likely a match is to arise by chance

The E-value is the number of alignments with a given score that would be expected to occur at random in the database that has been searched.

In other words, it is the number of alignments expected by chance with a score equal to or higher than the current alignment score.

  • e.g. if E=10, 10 matches with scores this high are expected to be found by chance

$$E = k m n e^{-\lambda S}$$

  • E is an expectation, not a probability, so it can be greater than 1
  • m is the query sequence length
  • n is the database size
  • S is the alignment score
  • k and λ depend on the scoring matrix and act as normalization factors.

The larger the database (n), the greater the chance of a random match. E is also proportional to m (the query sequence length), because BLAST is a local alignment and does not require a full-length match. E and S are negatively correlated: the higher the score, the lower the probability of a chance match. k and λ adjust for the effect of different scoring matrices and search spaces on the result.
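The formula is easy to try numerically. In the sketch below the default values of k and λ are placeholders of the magnitude usually quoted for ungapped BLOSUM62 scoring. Treat them as assumptions, since the right values depend on the scoring system actually used.

```python
import math


def e_value(score, m, n, k=0.134, lam=0.3176):
    """E = k * m * n * exp(-lam * score).

    m: query length, n: database size, score: alignment score S.
    k and lam are illustrative defaults, not authoritative constants.
    """
    return k * m * n * math.exp(-lam * score)
```

Doubling the database size doubles E, while each extra score point multiplies E by exp(-λ), which is why high-scoring hits remain significant even in very large databases.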

Converting between the E-value and the p-value

Insert image description here

For ease of interpretation, the p-value and the E-value can be converted into each other. As the figure shows, when E is below 0.1 the E-value and the p-value (a probability) are almost equal. In particular, when p is 0.05 the corresponding E-value is 0.0513, so 0.05 is often used as the E-value cut-off.
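The numbers above are consistent with the usual Poisson relationship p = 1 - e^{-E}, which the notes do not state explicitly, so it is taken as an assumption here. A small sketch of the conversion in both directions:

```python
import math


def e_to_p(e):
    """Probability of at least one chance hit, assuming p = 1 - exp(-E)."""
    return -math.expm1(-e)


def p_to_e(p):
    """Inverse conversion: E = -ln(1 - p)."""
    return -math.log1p(-p)
```

For small values the two are nearly identical (`p_to_e(0.05)` is about 0.0513), but they diverge for large E: an E-value of 10 corresponds to a p-value just below 1.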

Unlike dynamic programming-based algorithms such as Needleman-Wunsch and Smith-Waterman, BLAST is a heuristic algorithm: it does not guarantee to find the optimal solution, but aims to find a good enough solution in much less time. Specifically, by using the seeding-and-extending strategy, BLAST applies dynamic programming only in a limited region, which greatly reduces the amount of computation and increases speed. The gain in speed comes at the cost of reduced sensitivity, a trade-off shared by many heuristic algorithms.

Reference: Peking University Bioinformatics

Origin blog.csdn.net/qq_52038588/article/details/128039166