敏感性、特异性、假阳性、假阴性（sensitivity and specificity）

医学、机器学习等等，在统计结果时时长会用到这两个指标来说明数据的特性。

定义

敏感性：在金标准判断有病（阳性）人群中，检测出阳性的几率。真阳性。（检测出确实有病的能力）
特异性：在金标准判断无病（阴性）人群中，检测出阴性的几率。真阴性。（检测出确实没病的能力）
假阳性率：得到了阳性结果，但这个阳性结果是假的。即在金标准判断无病（阴性）人群中，检测出为阳性的几率。（没病，但却检测结果说有病），为误诊率。
假阴性率：得到了阴性结果，但这个阴性结果是假的。即在金标准判断有病（阳性）人群中，检测出为阴性的几率。（有病，但却检测结果说没病），为漏诊率。

计算方法

Sensitivity and specificity：完整定义

 
           True Positive （真正, TP）被模型预测为正的正样本；可以称作判断为真的正确率 
          
           True Negative（真负 , TN）被模型预测为负的负样本 ；可以称作判断为假的正确率 
          
           False Positive （假正, FP）被模型预测为正的负样本；可以称作误报率 
          
           False Negative（假负 , FN）被模型预测为负的正样本；可以称作漏报率 
          
           True Positive Rate（真正率 , TPR）或灵敏度（sensitivity）  
          
           TPR = TP /（TP + FN）  
          
           正样本预测结果数 / 正样本实际数  
          
           True Negative Rate（真负率 , TNR）或特指度（specificity）  
          
           TNR = TN /（TN + FP）  
          
           负样本预测结果数 / 负样本实际数  
          
           False Positive Rate （假正率, FPR）  
          
           FPR = FP /（FP + TN）  
          
           被预测为正的负样本结果数 /负样本实际数 
          
           False Negative Rate（假负率 , FNR）  
          
           FNR = FN /（TP + FN）  
          
           被预测为负的正样本结果数 / 正样本实际数

假阳性率=假阳性人数÷金标准阴性人数

即：假阳性率=b÷（b+d）

		金标准	金标准
		阳性（+）	阴性（-）	合计
某筛检方法	阳性（+）	a	b	a+b
某筛检方法	阴性（-）	c	d	c+d
合计		a+c	b+d	N

公式为:假阳性率=b/(b+d)×100%

(b:筛选为阳性,而标准分类为阴性的例数;d:阴性一致例数)

假阴性率=假阴性人数÷金标准阳性人数

即： β=c÷（a+c）

终于要用到这个玩意了，很激动，主要统计假阴假阳性率。

我的任务：

1. 评估Pacbio MHC variation calling 结果（CCS/non-CCS）与Hiseq数据结果的一致性。
2. 分别在不同深度梯度的区域完成以上评估，推断PB MHC做variation calling的最低深度。

这里要将一个位点分为SNP、REF 和 LowQual，然后只去 SNP 和 REF 进行统计。

这是我一下午写出来的统计代码：

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

 
           #!/usr/bin/env python 
          
 
           # Author: LI ZHIXIN 
          

              
          
 
           import 
           sys 
          
 
           import 
           pysam 
          
 
           from 
           collections  
           import 
           OrderedDict 
          

              
          
 
           def 
           classify_DP(depth): 
          
 
                
           if 
           depth >  
           101 
           : 
          
 
                    
           return 
           21 
          
 
                
           return 
           ((depth 
           - 
           1 
           ) 
           / 
           / 
           5 
           + 
           1 
           ) 
          

              
          
 
           def 
           parse_rec(rec): 
          
 
                
           sample  
           = 
           list 
           (rec.samples)[ 
           0 
           ] 
          
 
                
           # filter the Invalid line 
          
 
                
           if 
           not 
           ( 
           'GQ' 
           or 
           'GT' 
           or 
           'DP' 
           )  
           in 
           rec.samples[sample].keys()  
           or 
           len 
           (rec.alleles) < 
           = 
           1 
           : 
          
 
                    
           # continue 
          
 
                    
           return 
           1 
           ,  
           "LowQual" 
           , rec.pos 
          
 
                
           # filter the LowQual 
          
 
                
           if 
           rec.samples[sample][ 
           'GQ' 
           ] <  
           30 
           : 
          
 
                    
           return 
           rec.samples[sample][ 
           'DP' 
           ],  
           "LowQual" 
           , rec.pos 
          
 
                
           # filter the indel 
          
 
                
           flag  
           = 
           0 
          
 
                
           for 
           one  
           in 
           rec.alleles: 
          
 
                    
           if 
           len 
           (one) ! 
           = 
           len 
           (rec.ref): 
          
 
                        
           flag  
           = 
           1 
          
 
                
           if 
           flag  
           = 
           = 
           1 
           : 
          
 
                    
           return 
           rec.samples[sample][ 
           'DP' 
           ],  
           "LowQual" 
           , rec.pos 
          
 
                
           if 
           rec.samples[sample][ 
           'GT' 
           ] ! 
           = 
           ( 
           0 
           ,  
           0 
           ):  
           # rec.qual > 30 
          
 
                    
           # variation_dict[rec.pos] = ["snp", rec.alleles] 
          
 
                    
           return 
           rec.samples[sample][ 
           'DP' 
           ],  
           "snp" 
           , rec.pos   
          
 
                
           elif 
           rec.samples[sample][ 
           'GT' 
           ]  
           = 
           = 
           ( 
           0 
           ,  
           0 
           ): 
          
 
                    
           # variation_dict[rec.pos] = ["ref", rec.alleles] 
          
 
                    
           return 
           rec.samples[sample][ 
           'DP' 
           ],  
           "ref" 
           , rec.pos 
          

              
          
 
           def 
           read_gvcf(gvcf_file_path): 
          
 
                
           variation_dict  
           = 
           OrderedDict() 
          
 
                
           for 
           i  
           in 
           range 
           ( 
           1 
           , 
           22 
           ): 
          
 
                    
           variation_dict[i]  
           = 
           {} 
          
 
                    
           for 
           j  
           in 
           ( 
           'LowQual' 
           ,  
           'snp' 
           ,  
           'ref' 
           ): 
          
 
                        
           variation_dict[i][j]  
           = 
           [] 
          
 
                
           # pos_list = [] 
          
 
                
           gvcf_file  
           = 
           pysam.VariantFile(gvcf_file_path) 
          
 
                
           for 
           rec  
           in 
           gvcf_file.fetch( 
           'chr6' 
           , 
           28477796 
           , 
           33448354 
           ): 
          
 
                    
           DP, pos_type, pos  
           = 
           parse_rec(rec) 
          
 
                    
           if 
           DP <  
           1 
           or 
           DP >  
           20 
           : 
          
 
                        
           continue 
          
 
                    
           # DP = classify_DP(DP) 
          
 
                    
           variation_dict[DP][pos_type].append(pos) 
          
 
                    
           # print(pos, DP, pos_type) 
          
 
                
           gvcf_file.close() 
          
 
                
           # return variation_dict, pos_list 
          
 
                
           return 
           variation_dict 
          

              
          
 
           def 
           read_hiseq_gvcf(gvcf_file_path): 
          
 
                
           variation_dict  
           = 
           OrderedDict() 
          
 
                
           # for i in range(1,22): 
          
 
                
           # variation_dict[i] = {} 
          
 
                
           for 
           j  
           in 
           ( 
           'LowQual' 
           ,  
           'snp' 
           ,  
           'ref' 
           ): 
          
 
                    
           variation_dict[j]  
           = 
           [] 
          
 
                
           # pos_list = [] 
          
 
                
           gvcf_file  
           = 
           pysam.VariantFile(gvcf_file_path) 
          
 
                
           for 
           rec  
           in 
           gvcf_file.fetch( 
           'chr6' 
           , 
           28477796 
           , 
           33448354 
           ): 
          
 
                    
           DP, pos_type, pos  
           = 
           parse_rec(rec) 
          
 
                    
           DP  
           = 
           classify_DP(DP) 
          
 
                    
           variation_dict[pos_type].append(pos) 
          
 
                    
           # print(pos, DP, pos_type) 
          
 
                
           gvcf_file.close() 
          
 
                
           # return variation_dict, pos_list 
          
 
                
           return 
           variation_dict 
          

              
          
 
           def 
           show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2): 
          
 
                
           for 
           DP  
           in 
           range 
           ( 
           1 
           , 
           21 
           ): 
          
 
                    
           Hiseq_snp  
           = 
           set 
           (Hiseq_unified_variation_dict[ 
           'snp' 
           ]) 
          
 
                    
           Hiseq_ref  
           = 
           set 
           (Hiseq_unified_variation_dict[ 
           'ref' 
           ]) 
          
 
                    
           Hiseq_lowqual  
           = 
           set 
           (Hiseq_unified_variation_dict[ 
           'LowQual' 
           ]) 
          
 
                    
           PB_snp  
           = 
           PB_non_CCS_variation_dict[DP][ 
           'snp' 
           ] 
          
 
                    
           PB_ref  
           = 
           PB_non_CCS_variation_dict[DP][ 
           'ref' 
           ] 
          
 
                    
           PB_lowqual  
           = 
           PB_non_CCS_variation_dict[DP][ 
           'LowQual' 
           ] 
          
 
                    
           total  
           = 
           set 
           (PB_snp  
           + 
           PB_ref  
           + 
           PB_lowqual) 
          
 
                    
           Hiseq_snp  
           = 
           total & Hiseq_snp 
          
 
                    
           Hiseq_ref  
           = 
           total & Hiseq_ref 
          
 
                    
           Hiseq_lowqual  
           = 
           total & Hiseq_lowqual 
          
 
                    
           PB_snp  
           = 
           set 
           (PB_snp) 
          
 
                    
           PB_ref  
           = 
           set 
           (PB_ref) 
          
 
                    
           PB_lowqual  
           = 
           set 
           (PB_lowqual) 
          
 
                    
           a  
           = 
           len 
           (Hiseq_snp & PB_snp) 
          
 
                    
           b  
           = 
           len 
           (Hiseq_ref & PB_snp) 
          
 
                    
           c  
           = 
           len 
           (Hiseq_lowqual & PB_snp) 
          
 
                    
           d  
           = 
           len 
           (Hiseq_snp & PB_ref) 
          
 
                    
           e  
           = 
           len 
           (Hiseq_ref & PB_ref) 
          
 
                    
           f  
           = 
           len 
           (Hiseq_lowqual & PB_ref) 
          
 
                    
           g  
           = 
           len 
           (Hiseq_snp & PB_lowqual) 
          
 
                    
           h  
           = 
           len 
           (Hiseq_ref & PB_lowqual) 
          
 
                    
           i  
           = 
           len 
           (Hiseq_lowqual & PB_lowqual) 
          
 
                    
           Low_total  
           = 
           (g 
           + 
           h 
           + 
           i) 
           / 
           (a 
           + 
           b 
           + 
           c 
           + 
           d 
           + 
           e 
           + 
           f 
           + 
           g 
           + 
           h 
           + 
           i) 
          
 
                    
           if 
           (a 
           + 
           b)  
           = 
           = 
           0 
           : 
          
 
                        
           PPV  
           = 
           "NA" 
          
 
                    
           else 
           : 
          
 
                        
           PPV  
           = 
           a 
           / 
           (a 
           + 
           b) 
          
 
                        
           PPV  
           = 
           "%.4f" 
           % 
           (PPV) 
          
 
                    
           if 
           (a 
           + 
           d)  
           = 
           = 
           0 
           : 
          
 
                        
           TPR  
           = 
           "NA" 
          
 
                    
           else 
           : 
          
 
                        
           TPR  
           = 
           a 
           / 
           (a 
           + 
           d) 
          
 
                        
           TPR  
           = 
           "%.4f" 
           % 
           (TPR) 
          
 
                    
           print 
           ( 
           str 
           (DP) 
           + 
           " :\n" 
           , a,b,c, 
           "\n" 
           ,d,e,f, 
           "\n" 
           ,g,h,i, 
           "\n" 
           ,  
           file 
           = 
           outf2, sep 
           = 
           '\t' 
           , end 
           = 
           '\n' 
           ) 
          
 
                    
           print 
           (DP, TPR, PPV,  
           "%.4f" 
           % 
           Low_total,  
           file 
           = 
           outf, sep 
           = 
           '\t' 
           , end 
           = 
           '\n' 
           ) 
          

              
          
 
           with  
           open 
           ( 
           "./depth_stat.txt" 
           ,  
           "w" 
           ) as outf: 
          
 
                
           print 
           ( 
           "Depth" 
           ,  
           "TPR" 
           ,  
           "PPV" 
           ,  
           "Low_total" 
           ,  
           file 
           = 
           outf, sep 
           = 
           '\t' 
           , end 
           = 
           '\n' 
           ) 
          
 
                
           outf2  
           = 
           open 
           ( 
           "raw.txt" 
           ,  
           "w" 
           ) 
          
 
                
           Hiseq_unified_variation_dict  
           = 
           read_hiseq_gvcf( 
           "./hiseq_call_gvcf/MHC_Hiseq.unified.gvcf.gz" 
           ) 
          
 
                
           PB_non_CCS_variation_dict  
           = 
           read_gvcf( 
           "./non_CCS_PB_call_gvcf/MHC_non_CCS.unified.gvcf.gz" 
           ) 
          
 
                
           show_dict_diff_DP(Hiseq_unified_variation_dict, PB_non_CCS_variation_dict, outf, outf2) 
          
 
                
           outf2 
          

又碰到一个高级python语法：在双层循环中如何退出外层循环？我用了一个手动的flag，有其他好方法吗？

如何统计下机数据的覆盖度和深度？当然要比对之后才能统计，而且还要对比对做一些处理。

在计算一个位点是否是SNP、indel、Ref时，不仅要考虑ref、alts、qual、GQ，而且必须要把GT、DP考虑在内，所以说还是比较复杂的。

最后如何分析第二个问题，call variation的最低深度？

统计不同深度下的假阴假阳性率，看在什么深度下其达到饱和。

敏感性、特异性、假阳性、假阴性（sensitivity and specificity）

定义

计算方法

猜你喜欢