Chou和Fasman提出了二级结构的经验规则,其基本思想是在序列中寻找规则二级结构的成核位点和终止位点。在具体预测二级结构的过程中,首先扫描待预测的氨基酸序列,利用一组规则发现可能成为特定二级结构成核区域的短序列片段,然后对于成核区域进行扩展,不断扩大成核区域,直到二级结构类型可能发生变化为止,最后得到的就是一段具有特定二级结构的连续区域。下面是4个简要的规则:
1、α螺旋规则
a ,寻找核,沿着蛋白质序列寻找α螺旋核,相邻的6个残基中如果有至少4个残基倾向于形成α螺旋,即有4个残基对应的Pα>100,则认为是螺旋核。注意,是6个中的任意4个大于100即可,不需要连续。 b,扩展。从螺旋核向两端延伸,直至四肽a片段Pα的平均值小于100为止,注意,这里是指连续四个残基的平均值。
2、β折叠规则
a:寻找核,如果相邻5个残基中若有3个倾向于形成β折叠,即有3个残基对应的Pβ>100,则认β为是折叠核。b,扩展,折叠核向两端延伸直至4个残基Pβ的平均值小于100为止。
3、重叠规则
假如预测出的螺旋区域和折叠区域存在重叠,则按照重叠区域Pα均值和Pβ均值的相对大小进行预测,若Pα的均值大于Pβ的均值,则预测为螺旋;反之,预测为折叠。如果折叠区域重新分配后,剩下的螺旋或者折叠的长度小于5,则取消其分配的二级结果。
4、转角规则
转角的模型为四肽组合模型,要考虑每个位置上残基的组合概率,即特定残基在四肽模型中各个位置的概率。在计算过程中,对于从第i个残基开始的连续4个残基片段,将上述概率相乘,根据计算结果判断是否是转角。如果a,f(i)×f(i+1)×f(i+2)×f(i+3)>7.5×10-5,b, 四肽片段Pt的平均值大于100,c,并且Pt的均值同时大于Pα的均值以及Pβ的均值,则可以预测这样连续的4个残基形成转角。
Chou-Fasman预测方法原理简单明了,二级结构参数的物理意义明确,该方法中二级结构的成核、延伸和终止规则基本上反映了真实蛋白质中二级结构形成的过程。该方法的预测准确率在50%左右。
具体代码见:
#! /usr/bin/env python
# Summary:
# An implementation of the Chou-Fasman algorithm
# Authors:
# Samuel A. Rebelsky (layout of the program see:
# http://www.cs.grinnell.edu/~rebelsky/ExBioPy/Projects/project-7.5.html)
# Nicolas Girault
import string
import sys
protein1 = 'MKIDAIVGRNSAKDIRTEERARVQLGNVVTAAALHGGIRISDQTTNSVETVVGKGESRVLIGNEYGGKGFWDNHHHHHH'
protein2 = 'MRRYEVNIVLNPNLDQSQLALEKEIIQRALENYGARVEKVAILGLRRLAYPIAKDPQGYFLWYQVEMPEDRVNDLARELRIRDNVRRVMVVKSQEPFLANA'
protein3 = 'MVGLTTLFWLGAIGMLVGTLAFAWAGRDAGSGERRYYVTLVGISGIAAVAYVVMALGVGWVPVAERTVFAPRYIDWILTTPLIVYFLGLLAGLDSREFGIVITLNTVVMLAGFAGAMVPGIERYALFGMGAVAFLGLVYYLVGPMTESASQRSSGIKSLYVRLRNLTVILWAIYPFIWLLGPPGVALLTPTVDVALIVYLDLVTKVGFGFIALDAAATLRAEHGESLAGVDTDAPAVAD'
# The Chou-Fasman table, with rows of the table indexed by amino acid name.
# Data copied, pasted, and reformatted from
# http://prowl.rockefeller.edu/aainfo/chou.htm
# Columns are SYM,P(a), P(b),P(turn), f(i), f(i+1), f(i+2), f(i+3)
CF = {}
CF['Alanine'] = ['A', 142, 83, 66, 0.06, 0.076, 0.035, 0.058]
CF['Arginine'] = ['R', 98, 93, 95, 0.070, 0.106, 0.099, 0.085]
CF['Aspartic Acid'] = ['N', 101, 54, 146, 0.147, 0.110, 0.179, 0.081]
CF['Asparagine'] = ['D', 67, 89, 156, 0.161, 0.083, 0.191, 0.091]
CF['Cysteine'] = ['C', 70, 119, 119, 0.149, 0.050, 0.117, 0.128]
CF['Glutamic Acid'] = ['E', 151, 37, 74, 0.056, 0.060, 0.077, 0.064]
CF['Glutamine'] = ['Q', 111, 110, 98, 0.074, 0.098, 0.037, 0.098]
CF['Glycine'] = ['G', 57, 75, 156, 0.102, 0.085, 0.190, 0.152]
CF['Histidine'] = ['H', 100, 87, 95, 0.140, 0.047, 0.093, 0.054]
CF['Isoleucine'] = ['I', 108, 160, 47, 0.043, 0.034, 0.013, 0.056]
CF['Leucine'] = ['L', 121, 130, 59, 0.061, 0.025, 0.036, 0.070]
CF['Lysine'] = ['K', 114, 74, 101, 0.055, 0.115, 0.072, 0.095]
CF['Methionine'] = ['M', 145, 105, 60, 0.068, 0.082, 0.014, 0.055]
CF['Phenylalanine'] = ['F', 113, 138, 60, 0.059, 0.041, 0.065, 0.065]
CF['Proline'] = ['P', 57, 55, 152, 0.102, 0.301, 0.034, 0.068]
CF['Serine'] = ['S', 77, 75, 143, 0.120, 0.139, 0.125, 0.106]
CF['Threonine'] = ['T', 83, 119, 96, 0.086, 0.108, 0.065, 0.079]
CF['Tryptophan'] = ['W', 108, 137, 96, 0.077, 0.013, 0.064, 0.167]
CF['Tyrosine'] = ['Y', 69, 147, 114, 0.082, 0.065, 0.114, 0.125]
CF['Valine'] = ['V', 106, 170, 50, 0.062, 0.048, 0.028, 0.053]
aa_names = ['Alanine', 'Arginine', 'Asparagine', 'Aspartic Acid',
'Cysteine', 'Glutamic Acid', 'Glutamine', 'Glycine',
'Histidine', 'Isoleucine', 'Leucine', 'Lysine',
'Methionine', 'Phenylalanine', 'Proline', 'Serine',
'Threonine', 'Tryptophan', 'Tyrosine', 'Valine']
Pa = { }
Pb = { }
Pturn = { }
F0 = { }
F1 = { }
F2 = { }
F3 = { }
# Convert the Chou-Fasman table above to more convenient formats
# Note that for any amino acid, aa CF[aa][0] gives the abbreviation
# of the amino acid.
for aa in aa_names:
Pa[CF[aa][0]] = CF[aa][1]
Pb[CF[aa][0]] = CF[aa][2]
Pturn[CF[aa][0]] = CF[aa][3]
F0[CF[aa][0]] = CF[aa][4]
F1[CF[aa][0]] = CF[aa][5]
F2[CF[aa][0]] = CF[aa][6]
F3[CF[aa][0]] = CF[aa][7]
def CF_find_alpha(seq):
"""Find all likely alpha helices in sequence. Returns a list
of [start,end] pairs for the alpha helices."""
start = 0
results = []
# Try each window
while (start + 6 < len(seq)):
# Count the number of "good" amino acids (those likely to be
# in an alpha helix).
numgood = 0
for i in range(start, start+6):
if (Pa[seq[i]] > 100):
numgood = numgood + 1
if (numgood >= 4):
[estart,end] = CF_extend_alpha(seq, start, start+6)
#print "Exploring potential alpha " + str(estart) + ":" + str(end)
#if (CF_good_alpha(seq[estart:end])):
if [estart,end] not in results:
results.append([estart,end])
# Go on to the next frame
start = start + 1
# That's it, we're done
return results
def CF_extend_alpha(seq, start, end):
"""Extend a potential alpha helix sequence. Return the endpoints
of the extended sequence.
"""
# We extend the region in both directions until the average propensity for a set of four
# contiguous residues has Pa( ) < 100, which means we assume the helix ends there
# seq[end-3:end+1] is: x | x | x | END
while ( float(sum([Pa[x] for x in seq[end-3:end+1]])) / float(4) ) > 100 and end < len(seq)-1:
end += 1
# seq[start:start+4] is: START | x | x | x
while ( float(sum([Pa[x] for x in seq[start:start+4]])) / float(4) ) > 100 and start > 0:
start -= 1
return [start,end]
#def CF_good_alpha(subseq):
# """Determine if a subsequence appears to be an alpha helix."""
# sum_Pa = 0
# for aa in subseq:
# sum_Pa = sum_Pa + Pa[aa]
# ave_Pa = sum_Pa/len(subseq)
# # Criteria need to be extended
# return (ave_Pa > 100)
def CF_find_beta(seq):
"""Find all likely beta strands in seq. Returns a list
of [start,end] pairs for the beta strands."""
start = 0
results = []
# Try each window
while (start + 5 < len(seq)):
# Count the number of "good" amino acids (those likely to be
# in an beta sheet).
numgood = 0
for i in range(start, start+5):
if (Pb[seq[i]] > 100):
numgood = numgood + 1
if (numgood >= 3):
[estart,end] = CF_extend_beta(seq, start, start+5)
#print "Exploring potential alpha " + str(estart) + ":" + str(end)
#if (CF_good_alpha(seq[estart:end])):
if [estart,end] not in results:
results.append([estart,end])
# Go on to the next frame
start = start + 1
# That's it, we're done
return results
def CF_extend_beta(seq, start, end):
"""Extend a potential beta helix sequence. Return the endpoints
of the extended sequence.
"""
# We extend the region in both directions until the average propensity for a set of four
# contiguous residues has Pa( ) < 100, which means we assume the helix ends there
# seq[end-3:end+1] is: x | x | x | END
while ( float(sum([Pb[x] for x in seq[end-3:end+1]])) / float(4) ) > 100 and end < len(seq)-1:
end += 1
# seq[start:start+4] is: START | x | x | x
while ( float(sum([Pb[x] for x in seq[start:start+4]])) / float(4) ) > 100 and start > 0:
start -= 1
return [start,end]
def CF_find_turns(seq):
"""Find all likely beta turns in seq. Returns a list of positions
which are likely to be turns."""
result = []
for i in range(len(seq)-3):
# CONDITION 1
c1 = F0[seq[i]]*F1[seq[i+1]]*F2[seq[i+2]]*F3[seq[i+3]] > 0.000075
# CONDITION 2
c2 = ( float(sum([Pturn[x] for x in seq[i:i+4]])) / float(4) ) > 100
# CONDITION 3
c3 = sum([Pturn[x] for x in seq[i:i+4]]) > max(sum([Pa[x] for x in seq[i:i+4]]),sum([Pb[x] for x in seq[i:i+4]]))
if c1 and c2 and c3:
result.append(i)
return result
def region_overlap(region_a, region_b):
"""Given two regions, represented as two-element lists, determine
if the two regions overlap.
"""
return (region_a[0] <= region_b[0] <= region_a[1]) or \
(region_b[0] <= region_a[0] <= region_b[1])
def region_merge(region_a, region_b):
"""Given two regions, represented as two-element lists, return
the minimum region that contains both regions.
"""
return [min(region_a[0], region_b[0]), max(region_a[1], region_b[1])]
def region_intersect(region_a, region_b):
"""Given two regions, represented as two-element lists, return
the intersection of the two regions.
"""
return [max(region_a[0], region_b[0]), min(region_a[1], region_b[1])]
def region_difference(region_a, region_b):
"""Given two regions, represented as two-element lists, return
the part of region_a which in not in region_b.
It can be one or two regions depending on the position
of region_b and its size.
"""
# region_a start before region_b and stop before region_b
if region_a[0] < region_b[0] and region_a[1] <= region_b[1]:
return [[region_a[0], region_b[0]-1]]
# region_a start after region_b and stop after region_b
elif region_a[0] >= region_b[0] and region_a[1] > region_b[1]:
return [[region_b[1]+1,region_a[1]]]
# region_b is included in region_a => return 2 regions
elif region_a[0] < region_b[0] and region_a[1] > region_b[1]:
return [[region_a[0], region_b[0]-1],[region_b[1]+1,region_a[1]]]
# region_a is included in region_b
else:
return []
def ChouFasman(seq):
"""Analyze seq using the Chou-Fasman algorithm and display
the results. A represents 'alpha helix'. B represents
'beta strand'. T represents "turn". Space represents
'coil structure'.
"""
# Find probable locations of alpha helices, beta strands,
# and beta turns.
alphas = CF_find_alpha(seq)
#print "Alphas = " + str(alphas)
betas = CF_find_beta(seq)
#print "Betas = " + str(betas)
turns = CF_find_turns(seq)
#print "Turns = " + str(turns)
# Handle overlapping regions between alpha helix and beta strands
# SEE COMMENT IN MY REPORT: WHY I DONT MERGE THE ALPHA AND BETA REGIONS TOGETHER
'''# First we merge the alpha helix regions together
x = 0
while x < len(alphas)-1:
if region_overlap(alphas[x],alphas[x+1]):
alphas[x] = region_merge(alphas[x],alphas[x+1])
alphas.pop(x+1)
else:
x += 1
print "Potential alphas = " + str(alphas)
# The same for beta strand regions
x = 0
while x < len(betas)-1:
if region_overlap(betas[x],betas[x+1]):
betas[x] = region_merge(betas[x],betas[x+1])
betas.pop(x+1)
else:
x += 1
print "Ptential betas = " + str(betas)'''
# Then it's really messy!
alphas2 = []
alphas_to_test = alphas
betas_to_test = betas
while len(alphas_to_test) > 0:
alpha = alphas_to_test.pop()
# a_shorten record if the alpha helix region has been shorten
a_shorten = False
for beta in betas_to_test:
if region_overlap(alpha,beta):
inter = region_intersect(alpha,beta)
print 'Now studying overlap: '+str(inter)
sum_Pa = sum([Pa[seq[i]] for i in range(inter[0],inter[1]+1)])
sum_Pb = sum([Pb[seq[i]] for i in range(inter[0],inter[1]+1)])
if sum_Pa > sum_Pb:
# No more uncertainty on this overlap region: it will be a alpha helix
diff = region_difference(beta,alpha)
print '\tAlpha helix WIN - beta sheet region becomes: '+str(diff)
for d in diff:
if d[1]-d[0] > 4:
betas_to_test.append(d)
betas_to_test.remove(beta)
else:
# No more uncertainty on this overlap region: it will be a beta strand
a_shorten = True
diff = region_difference(alpha,beta)
print '\tBeta sheet WIN - alpha helix region becomes: '+str(diff)
for d in diff:
if d[1]-d[0] > 4:
alphas_to_test.append(d)
if not a_shorten:
alphas2.append(alpha)
alphas = alphas2
betas = betas_to_test
print 'final alphas: '+str(alphas)
print 'final betas: '+str(betas)
# Build a sequence of spaces of the same length as seq.
analysis = [' ' for i in xrange(len(seq))]
# Fill in the predicted alpha helices
for alpha in alphas:
for i in xrange(alpha[0], alpha[1]):
analysis[i] = 'A'
# Fill in the predicted beta strands
for beta in betas:
for i in xrange(beta[0], beta[1]):
analysis[i] = 'B'
# Fill in the predicted beta turns
for turn in turns:
analysis[turn] = 'T'
# Turn the analysis and the sequence into strings for ease
# of printing
astr = string.join(analysis, '')
sstr = string.join(seq, '')
print astr
if len(sys.argv) != 2:
print 'Usage: %s [protein1, protein2, protein3]' %sys.argv[0]
elif sys.argv[1] == '-h' or sys.argv[1] == '--help':
print 'Usage: %s [protein1, protein2, protein3]' %sys.argv[0]
elif sys.argv[1] == 'protein1':
seq = protein1
ChouFasman(seq)
elif sys.argv[1] == 'protein2':
seq = protein2
ChouFasman(seq)
elif sys.argv[1] == 'protein3':
seq = protein3
ChouFasman(seq)
else:
print 'Usage: %s [protein1, protein2, protein3]' %sys.argv[0]