Functional Enrichment Analysis: An Overview

Functional enrichment analysis of genes has become a routine step in analyzing high-throughput omics data, and it is important for revealing the molecular mechanisms behind biological and medical questions. Terms such as GO, KEGG, and GSEA come up constantly, and many online tutorials teach you how to run a GO analysis, a GSEA analysis, and so on. But we should know not only how to run these analyses but also why they work. I found two review articles on enrichment analysis, so let's study them together.

As usual, here are the two articles first:
Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges
Research Progress in Gene Function Enrichment Analysis


First, why do functional enrichment analysis?

With the development of high-throughput technologies, biomedical research has entered the genomics era, and studying individual genes no longer meets researchers' needs. At the same time, extracting and analyzing information efficiently from such large datasets poses new challenges. Take sequencing data as an example: a sequencing or protein analysis typically yields a list of differentially expressed genes or proteins. For many researchers, linking such a long list to the biological phenomenon under study and its underlying mechanism is very difficult. One way to meet this challenge is to divide the gene or protein list into several groups, which reduces the complexity of the analysis. To settle what the groups should be, researchers have developed a number of annotation databases. To settle how to assign genes to the groups, researchers usually perform gene function enrichment analysis, hoping to find the biological pathways that play a key role in the process under study and thereby reveal and understand its underlying molecular mechanisms. A wide variety of software has been developed along the way.

Functional enrichment analysis assigns hundreds of genes, proteins, or other molecules to different pathways, which reduces the complexity of the analysis. Moreover, a pathway that is activated differently under two experimental conditions is clearly more convincing than a plain list of genes or proteins.

Second, annotation databases and software for gene function enrichment analysis

Commonly used annotation databases: GO, KEGG, Reactome, BioCarta, MSigDB, etc.
Commonly used software:

 
Figure 1

Third, methods of gene function enrichment analysis

Currently, the main methods of functional enrichment analysis fall into four categories:

ORA: over-representation analysis
FCS: functional class scoring
PT: pathway topology
NT: network topology

 
Figure 2

1. ORA method

Also known as the "2×2 method".
First, obtain a set of genes of interest (typically differentially expressed genes). Next, intersect this gene list with the gene set of a given pathway to find the genes they share and count them. Finally, use a statistical test to assess whether the observed count is significantly higher than expected by chance, that is, whether the gene set under test is significantly enriched in the gene list. The most commonly used statistical tests are based on the hypergeometric distribution, the chi-square test, and the binomial distribution.

Commonly used software and websites for this method include DAVID.
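The three ORA steps can be sketched in a few lines of Python. This is only a minimal illustration using the hypergeometric test; the function name `ora_p_value` and the toy gene names are made up for the example and do not come from DAVID or any other tool.

```python
from math import comb

def ora_p_value(gene_list, gene_set, background):
    """One-sided hypergeometric p-value for over-representation.

    gene_list:  genes of interest (e.g. differentially expressed genes)
    gene_set:   genes annotated to the pathway being tested
    background: all genes that could have ended up in gene_list
    """
    background = set(background)
    gene_list = set(gene_list) & background
    gene_set = set(gene_set) & background
    N = len(background)            # background size
    K = len(gene_set)              # pathway genes in the background
    n = len(gene_list)             # genes of interest
    k = len(gene_list & gene_set)  # overlap (the observed count)
    # P(X >= k) when n genes are drawn without replacement from N
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)

# Toy data (hypothetical): 20 background genes, a 5-gene pathway,
# and 4 genes of interest, 3 of which fall in the pathway.
background = [f"g{i}" for i in range(1, 21)]
pathway = ["g1", "g2", "g3", "g4", "g5"]
de_genes = ["g1", "g2", "g3", "g10"]
overlap, p = ora_p_value(de_genes, pathway, background)
print(overlap, round(p, 4))
```

Note that the result depends on the choice of background: a whole-genome background and a "genes detected in this experiment" background can give quite different p-values for the same overlap.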

Advantages

Built on well-established statistical theory; the results are robust and reliable.

Disadvantages

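(1) It uses only gene counts and ignores expression levels or the magnitude of differential expression, and a threshold has to be set manually to obtain the genes of interest or the differentially expressed genes.
(2) ORA usually uses only the most significant genes and ignores those whose differences are not significant. Obtaining the genes of interest usually requires choosing a suitable threshold, which may discard genes that are less significant but still important, lowering detection sensitivity.
(3) Genes are all treated equally. ORA assumes every gene is independent, ignoring the differing biological roles of genes within a pathway (for example, regulators versus regulated genes) and the complex interactions among genes.
(4) ORA assumes that pathways are independent of one another, but this assumption is false.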

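2. FCS method

First, score or rank the expression differences of all genes in the genome based on the expression profiles under case and control conditions, or directly input an already-ranked gene expression profile. Second, convert the scores of the individual genes in the gene set under test into a score or statistic for the whole gene set, using a specific statistical model. Finally, use the background distribution of that gene-set statistic, obtained by random sampling, to test the significance of the observed statistic and decide whether the gene set has changed in a statistically significant way between the case and control conditions.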
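The three FCS steps above can be sketched as follows. This is a deliberately simple stand-in, not GSEA or any published method: the gene-set statistic is just the mean of the gene-level scores, and the background distribution comes from random gene sets of the same size. The function name `fcs_p_value` and the toy scores are hypothetical.

```python
import random
from statistics import mean

def fcs_p_value(scores, gene_set, n_perm=2000, seed=0):
    """Mean-score gene-set statistic with a random-gene-set null.

    scores:   dict gene -> gene-level score (e.g. a t-statistic), step 1
    gene_set: genes in the set under test
    """
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in gene_set if g in scores]
    # Step 2: collapse gene-level scores into one gene-set statistic
    observed = mean(scores[g] for g in members)
    # Step 3: background distribution from random sets of the same size
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if mean(scores[g] for g in sample) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 30 genes, the first five strongly up-regulated.
scores = {f"g{i}": ((i * 7) % 11 - 5) / 10 for i in range(6, 31)}
scores.update({"g1": 3.0, "g2": 2.5, "g3": 2.8, "g4": 3.2, "g5": 2.9})
observed, p = fcs_p_value(scores, ["g1", "g2", "g3", "g4", "g5"])
print(round(observed, 2), p)
```

Unlike ORA, no cutoff on the gene-level scores is needed here: every gene contributes to the statistic, including those with small changes.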

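Besides the case-versus-control comparison described above, FCS also includes a class of single-sample methods, such as PLAGE, ZSCORE, and ssGSEA. A major advantage of these methods is that, by adjusting for the relevant covariates, they make it relatively simple to analyze quite complex multi-sample designs, such as those that include a time course.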
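To give a feel for what a single-sample score looks like, here is a rough sketch in the spirit of the combined z-score approach: each gene is standardized across samples, and the z-scores of the set's genes are summed per sample and divided by the square root of the set size. It is an illustration under that assumption, not the code of any package; `zscore_set_scores` and the toy expression matrix are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def zscore_set_scores(expr, gene_set):
    """Per-sample gene-set score from combined z-scores.

    expr: dict gene -> list of expression values, one per sample
          (each gene must vary across samples, otherwise stdev is 0)
    """
    members = [g for g in gene_set if g in expr]
    n_samples = len(next(iter(expr.values())))
    z = {}
    for g in members:
        mu, sd = mean(expr[g]), stdev(expr[g])
        z[g] = [(x - mu) / sd for x in expr[g]]  # standardize across samples
    # One score per sample: sum of member z-scores / sqrt(set size)
    return [sum(z[g][j] for g in members) / sqrt(len(members))
            for j in range(n_samples)]

# Toy data (hypothetical): 3 genes measured in 4 samples;
# all three genes are highest in the last sample.
expr = {
    "gA": [1.0, 2.0, 3.0, 6.0],
    "gB": [2.0, 2.0, 4.0, 8.0],
    "gC": [0.0, 1.0, 1.0, 6.0],
}
set_scores = zscore_set_scores(expr, ["gA", "gB", "gC"])
print([round(s, 2) for s in set_scores])
```

Because the output is one score per sample, it can be passed to an ordinary regression or time-course model together with covariates, which is what makes this class of methods convenient for complex designs.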

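Advantages

Overall, FCS is a clear theoretical advance over ORA: it takes the gene expression values into account, and it tests the gene set as a whole, which makes the results more sensitive.

Disadvantages

(1) Like ORA, FCS still analyzes each pathway independently. But one gene can belong to several pathways, so the genes of different pathways overlap, and other pathways may appear significantly enriched only because of the genes they share.
(2) FCS still treats each gene in the gene set as an independent unit, ignoring the biological properties of genes and the complex interactions among them.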


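3. PT method

When ORA and FCS perform pathway enrichment analysis, both treat each gene in a pathway as an independent unit. In reality, the genes in a pathway act together, through regulating, being regulated, interacting, and other complex relationships, to influence biological processes such as cell development, differentiation, or disease. Pathway enrichment analysis, especially of gene expression data, therefore needs to take the biological properties of the genes in the pathway into account. For example, in a regulatory pathway, a change in the expression level of an upstream gene clearly affects the whole pathway far more than a change in a downstream gene. PT enrichment methods, which are based on pathway topology, combine information such as a gene's position in the pathway (upstream or downstream), its degree of connection to other genes, and the type of regulation to assess each gene's contribution to the pathway and assign it a weight, and then integrate these weights into the enrichment analysis. Different PT methods compute the weights in different ways.

The gene sets in annotation databases such as GO contain no topology information; they provide only a list of all the genes that may belong to the same pathway.
Therefore, PT methods cannot be used for enrichment analysis of GO gene sets.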
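As a rough illustration of the weighting idea, the sketch below gives each gene a weight of one plus the number of genes downstream of it, then computes a weighted average of the absolute expression changes. This weighting scheme is an assumption made up for the example; real PT methods each define their own weights, and the names `downstream_counts` and `weighted_pathway_score` are hypothetical.

```python
def downstream_counts(edges):
    """Number of genes reachable downstream of each gene.

    edges: dict regulator -> list of direct targets (a directed pathway graph)
    """
    genes = set(edges) | {t for targets in edges.values() for t in targets}
    counts = {}
    for g in genes:
        seen, stack = set(), list(edges.get(g, []))
        while stack:  # depth-first walk along regulatory edges
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(edges.get(t, []))
        seen.discard(g)  # a feedback loop should not count the gene itself
        counts[g] = len(seen)
    return counts

def weighted_pathway_score(edges, changes):
    """Weighted mean of |expression change|, weight = 1 + downstream genes."""
    weights = {g: 1 + c for g, c in downstream_counts(edges).items()}
    total = sum(weights.values())
    return sum(w * abs(changes.get(g, 0.0)) for g, w in weights.items()) / total

# Toy pathway (hypothetical): A regulates B and D, B regulates C.
pathway = {"A": ["B", "D"], "B": ["C"]}
upstream_hit = weighted_pathway_score(pathway, {"A": 2.0})    # only A changes
downstream_hit = weighted_pathway_score(pathway, {"C": 2.0})  # only C changes
print(upstream_hit, downstream_hit)
```

The same two-fold change scores four times higher when it hits the top of the pathway than when it hits the bottom, which is exactly the kind of distinction ORA and FCS cannot make. It also shows why a plain gene list, as in GO, is not enough input for this class of methods.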

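Advantages

For well-studied pathways with complete topology, PT-based enrichment algorithms yield stronger significance.

Disadvantages

These methods depend on pathway topology and are less robust for pathways that are poorly studied or incompletely annotated. The incompleteness of current pathway annotation is therefore an important factor limiting the further development of PT-based enrichment analysis.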


4. NT method

Current NT methods follow several different ideas:
(1) Some enrichment methods based on biological network topology use the gene-gene interactions recorded in databases to bring the biological properties of genes into the enrichment analysis indirectly. The main idea is to take an existing genome-wide biological network, such as HPRD, FunCoup, or STRING, extract the interaction relationships between genes (including properties such as a gene's connectivity in the network), and compute how closely a given gene list and the gene set under test are connected in the network, in order to infer whether the gene list is closely related to that gene set. Examples are NEA and EnrichNet.
(2) Other methods use network topology to calculate the importance of each gene for a specific biological pathway and assign corresponding weights, then use a conventional ORA or FCS method to assess the enrichment of that pathway. Examples are LEGO and GANPA.
(3) Still others directly recast the question of whether a gene list is enriched for a function as a question of gene function enrichment in the network. An example is NOA.
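Idea (1) can be illustrated with a very simplified sketch: count the network links between the gene list and the gene set, and compare that count with what random gene lists of the same size achieve. This is not the actual algorithm of NEA or EnrichNet (the null model here, random gene lists, is an assumption chosen for brevity), and the names `count_links`, `network_enrichment_p`, and the toy network are hypothetical.

```python
import random

def count_links(edges, gene_list, gene_set):
    """Number of network edges joining a gene-list member to a gene-set member."""
    a, b = set(gene_list), set(gene_set)
    return sum(1 for u, v in edges if (u in a and v in b) or (v in a and u in b))

def network_enrichment_p(edges, gene_list, gene_set, n_perm=2000, seed=0):
    """Empirical p-value for the observed link count, using random gene lists."""
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    observed = count_links(edges, gene_list, gene_set)
    hits = sum(
        count_links(edges, rng.sample(nodes, len(gene_list)), gene_set) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Toy network (hypothetical): three list genes each interact with all three
# pathway genes; twenty unrelated genes form a simple chain.
edges = [(f"l{i}", f"p{j}") for i in range(1, 4) for j in range(1, 4)]
edges += [(f"o{i}", f"o{i + 1}") for i in range(1, 20)]
gene_list = ["l1", "l2", "l3"]
gene_set = ["p1", "p2", "p3"]
links, p = network_enrichment_p(edges, gene_list, gene_set)
print(links, p)
```

In this toy case the gene list and the gene set share no genes at all, so ORA would report an overlap of zero, yet the network view still detects that the two groups are tightly connected.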

Advantages

Compared with traditional methods, network-based enrichment analysis adds system-level information about how genes are related and how important they are, which makes the results more accurate and reliable.

Disadvantages

Adding more information easily makes the algorithms overly complex and slows computation.


Note that each approach has its own advantages and disadvantages. Researchers should understand the basics of enrichment analysis and choose the method that fits.

All of that is easier said than done: if choosing were that simple, there would not be more than 100 enrichment analysis tools and websites. Most of the time, the familiar GO and GSEA analyses are more or less enough.



Author: Seng Credit family
Link: https://www.jianshu.com/p/5a4bda169247
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, please contact the author for authorization and credit the source.


Origin www.cnblogs.com/wangshicheng/p/11131086.html