Introduction to WGCNA
- WGCNA (Weighted Gene Co-expression Network Analysis) identifies sets of genes with similar expression patterns (modules), analyzes the relationship between modules and sample phenotypes, maps the regulatory network among modules, and identifies key regulatory genes
- WGCNA is suited to complex transcriptome data with a large number of samples
- Typical uses: studying developmental regulation across organs, tissue types, and stages, and the time-course response mechanisms to biotic and abiotic stress
WGCNA principle
Principle summary
1. Construct the gene co-expression network: use the expression profiles of each pair of genes to calculate a correlation coefficient between them, then build the gene network from these correlation coefficients;
2. Identify modules: once the gene relationships are built, use a threshold to pick out the genes that are closely related to one another and group them into a module;
3. Associate modules with external information: characterize each module, including assigning it an eigengene and running GO enrichment on its genes to explore its function;
4. Study the relationships between modules: screen for the key modules related to the biological question using each module's expression pattern and function;
5. Identify regulatory genes in key modules: analyze the genes inside the key modules, including the functions of the annotated genes and their regulatory relationships, to identify the key regulatory genes of each module.
Build a genetic relationship network
Calculate correlation between genes
- Similarity between genes: calculate the correlation between every pair of genes from their expression across samples, using the Pearson correlation coefficient
- Gene co-expression similarity matrix: $S = [s_{ij}]$, where $s_{ij}$ is the Pearson correlation coefficient of gene i and gene j
- Hard threshold: a one-size-fits-all cutoff for deciding whether two genes are similar. With a threshold of 0.8, for example, a correlation above 0.8 counts as similar and 0.79 does not. This is poorly suited to biological problems, because 0.79 and 0.81 barely differ biologically
- Soft threshold: transform the correlation coefficient with a weighting function (the adjacency function) to form an adjacency matrix whose elements are continuous
- Adjacency function: a power function
$a_{ij} = \mathrm{power}(s_{ij}, \beta) = |s_{ij}|^{\beta}$
The key step is choosing the parameter β of the adjacency function. The choice rests on the scale-free network principle: the gene expression network is assumed to follow the power-law degree distribution of a scale-free network.
A power adjacency function can make the elements of the matrix conform to the scale-free principle.
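To make the difference between the two thresholds concrete, here is a minimal sketch in plain Python. The function names, the toy similarity values, and β = 6 are illustrative assumptions, not part of the WGCNA package.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def similarity_matrix(expr):
    """expr[i] holds the expression of gene i across samples; returns S = [s_ij]."""
    return [[pearson(gi, gj) for gj in expr] for gi in expr]

def hard_threshold(sim, tau):
    """Binary adjacency: 1 if |s_ij| > tau, else 0."""
    return [[1 if abs(s) > tau else 0 for s in row] for row in sim]

def soft_threshold(sim, beta):
    """Weighted adjacency a_ij = |s_ij| ** beta."""
    return [[abs(s) ** beta for s in row] for row in sim]

# Toy similarity values on either side of a 0.8 cutoff (made-up numbers).
sim = [[1.0, 0.79, 0.81],
       [0.79, 1.0, 0.30],
       [0.81, 0.30, 1.0]]
hard = hard_threshold(sim, tau=0.8)  # 0.79 and 0.81 land on opposite sides
soft = soft_threshold(sim, beta=6)   # 0.79 and 0.81 stay close; 0.30 shrinks toward 0
```

With the hard threshold the pair at 0.79 is dropped and the pair at 0.81 is kept, while the soft threshold gives them nearly the same weight and pushes the weak 0.30 correlation close to zero.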
Scale-free network (why the power function is biologically meaningful)
- Networks in mathematics: the points in a network diagram are its nodes, and the degree of a node is the number of edges (connections) attached to it
- Random network: node degrees are fairly even. The number of connections per node follows a Poisson distribution, so most nodes have a number of connections near the middle. This typical value is called the scale of the random network
- Scale-free network: a small number of nodes have a degree far above the rest. These nodes are called hubs; a few hubs connect to many other nodes and hold the whole network together. A scale-free network therefore has a clear hierarchy, and the hubs are its most important nodes. Many real-world networks are scale-free, such as flight routes.
- In a biological regulatory network, a few genes play very important regulatory roles, while the others sit at lower regulatory levels. The advantage of this structure is that damage to unimportant peripheral genes alone does not destroy the organism's main functions, so the organism can keep responding for some time under stress or external damage.
- The adjacency function is chosen so that the gene co-expression network conforms to a scale-free network
- What characterizes a scale-free network, and why does the adjacency function make the gene network conform to the scale-free principle?
[1] Power-law distribution of a scale-free network: the number of nodes h with k connections falls as k grows, following a power function of k, so h and k are negatively correlated. Most nodes have few connections and a few nodes have many. The network has no characteristic scale, that is, no typical number of connections per node
[2] Gene correlations: after the power transformation, the few strong correlations are unaffected or only slightly reduced, while the weak correlations, raised to the power β, shrink markedly
[3] Scale-free: a random network has a scale, the typical number of connections around which most nodes cluster. A scale-free network has no such value, because most nodes have very few connections and a few nodes have many
Determine the key parameter β
- Find a β that makes the gene expression network scale-free: nodes with high degree are few, and nodes with low degree are many
- The degree k and the number of nodes h with that degree follow a power law
- Power law: a variable follows a power-law distribution if its distribution function is a power function. Because a probability density must be normalizable (integrate to 1), the exponent is generally required to be less than -1. Power laws are common in nature. Take earthquakes: the smaller the earthquake, the more frequent it is, and the larger, the rarer. With earthquake size as the independent variable and frequency (or probability) of occurrence as the dependent variable, the relationship follows a (negative) power function.
- WGCNA provides a routine that tries β = 1, 2, ... one value at a time, fits the model for each, and then determines which β works best
- Judging whether β is appropriate:
For a given β, compute log(k), the logarithm of the node degree k, and log(p(k)), the logarithm of the probability that a node has that degree. In a scale-free network the two are negatively correlated, and the correlation coefficient is generally required to be greater than 0.8. When β is set to 8, the relationship between node count and degree fits a scale-free network well.
To check whether the chosen β gives a scale-free network, plot log10(p(k)) against log10(k). For a better evaluation, square the correlation coefficient between the two to obtain $R^2$; if the model $R^2$ is close to 1, the two have a good linear relationship.
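The check described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's own routine: the function names are hypothetical, connectivities are grouped into equal-width bins (the bin count is an arbitrary choice), and the fit is an ordinary least-squares line through the log-log points.

```python
import math

def connectivity(adj):
    """k_i = sum of a_ij over all other genes j."""
    n = len(adj)
    return [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]

def linear_fit(xs, ys):
    """Least-squares slope and R^2 of ys against xs."""
    n = len(xs)
    if n < 2:
        return 0.0, 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # degenerate case: no spread to fit
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

def scale_free_fit(k_values, n_bins=10):
    """Bin the connectivities, then fit log10(p(k)) against log10(k)."""
    lo, hi = min(k_values), max(k_values)
    if hi == lo:
        return 0.0, 0.0
    width = (hi - lo) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for k in k_values:
        b = min(int((k - lo) / width), n_bins - 1)
        sums[b] += k
        counts[b] += 1
    log_k, log_p = [], []
    for s, c in zip(sums, counts):
        if c > 0 and s > 0:
            log_k.append(math.log10(s / c))              # mean k in the bin
            log_p.append(math.log10(c / len(k_values)))  # fraction of nodes in the bin
    return linear_fit(log_k, log_p)

def scan_betas(sim, betas, n_bins=10):
    """Return (beta, slope, R^2) for each candidate beta."""
    results = []
    for beta in betas:
        adj = [[abs(s) ** beta for s in row] for row in sim]
        slope, r2 = scale_free_fit(connectivity(adj), n_bins)
        results.append((beta, slope, r2))
    return results
```

One way to use the result, following the 0.8 rule of thumb above, is to take the smallest β whose fit has a negative slope and an R^2 above 0.8.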
Calculate the expression relationship between genes (indirect relationship)
- So far we have considered only the direct relationship between each pair of genes
- Relationships between genes in an organism: direct relationships + indirect relationships
- TOM: the topological overlap measure (TOM) quantifies the degree of association between genes. Besides the relationship between the two genes themselves, it takes into account the connections each of them has with other genes, which is biologically meaningful.
- The TOM matrix is built on top of the adjacency function and considers both the direct and the indirect relationship between two genes
In the TOM formula, the relationship between genes i and j accounts not only for the direct relationship between i and j but also for the indirect relationship through a third gene u
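As an illustration, the sketch below assumes the commonly used unsigned TOM definition, $\omega_{ij} = (\sum_{u} a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with $k_i = \sum_{u \neq i} a_{iu}$ and the sum running over u other than i and j. The function name and the toy network are hypothetical.

```python
def tom_matrix(adj):
    """Topological overlap from a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(adj)
    # Connectivity of each gene, excluding the self-connection.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # a gene overlaps fully with itself
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Indirect relationship: connection strength shared through third genes u.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom

# Toy network: genes 0 and 1 are not directly connected, but both connect to gene 2.
adj = [[0, 0, 1],
       [0, 0, 1],
       [1, 1, 0]]
tom = tom_matrix(adj)
```

In this toy network genes 0 and 1 have adjacency 0, yet their TOM is above zero because they share gene 2 as a neighbor, which is the indirect relationship the text describes.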
Building gene modules
Gene modules are built from the TOM values using the dynamic tree cut method
Hierarchical clustering
- Gene modules are delimited by how sparsely or densely genes are connected. The TOM matrix (similarity) is converted into a dissimilarity matrix: $d^{\omega}_{ij} = 1 - \omega_{ij}$ (for convenience when plotting the matrix)
- The cluster tree is built by hierarchical clustering on this TOM-based dissimilarity
- Overview of methods: static tree cut and dynamic tree cut (the dynamic tree method and the dynamic hybrid method). WGCNA generally uses dynamic tree cutting; the R package uses the dynamic hybrid method
Static tree cut: cut the cluster tree at a fixed, predefined height, so that each continuous branch below the cut becomes one cluster. This identifies gene modules with good specificity but low sensitivity, and it easily misses genes at the edges of modules
Dynamic tree method: "top-down". Several larger modules are first obtained with the static method, and the final modules are identified by repeatedly splitting and recombining them (an iterative process)
Dynamic hybrid cut:
[1] Identify the preliminary modules that meet the set conditions
(1) The module contains at least the predefined minimum number of genes
(2) Genes too far from the cluster are removed, even if they lie on the same branch
(3) Each cluster is clearly distinct from its surrounding clusters
(4) The core genes of each cluster, at the tips of the branches, are tightly connected
[2] Test step
(1) Test the unassigned genes and assign each one to a preliminary cluster if it is close enough to it
(2) WGCNA usually builds modules with the dynamic hybrid method
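The sketch below illustrates only the simpler static cut, not the dynamic hybrid method: it converts TOM to dissimilarity, merges clusters by average linkage, and stops at a fixed height. The linkage choice, the function names, and the toy values are assumptions made for illustration.

```python
def tom_to_dissimilarity(tom):
    """d_ij = 1 - w_ij for every pair of genes."""
    return [[1.0 - w for w in row] for row in tom]

def static_cut_modules(dissim, cut_height, min_module_size=1):
    """Average-linkage clustering that stops merging at cut_height.

    Returns (modules, unassigned): clusters with at least min_module_size
    genes, and the genes left in clusters that are too small.
    """
    clusters = [[i] for i in range(len(dissim))]

    def average_distance(a, b):
        return sum(dissim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cut_height:
            break  # every remaining merge lies above the cut
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    modules = [sorted(c) for c in clusters if len(c) >= min_module_size]
    unassigned = sorted(i for c in clusters if len(c) < min_module_size for i in c)
    return modules, unassigned

# Toy dissimilarities: genes 0-1 and 2-3 form tight pairs, gene 4 is far from everything.
dissim = [[0.0, 0.1, 0.9, 0.9, 0.95],
          [0.1, 0.0, 0.9, 0.9, 0.95],
          [0.9, 0.9, 0.0, 0.2, 0.95],
          [0.9, 0.9, 0.2, 0.0, 0.95],
          [0.95, 0.95, 0.95, 0.95, 0.0]]
modules, unassigned = static_cut_modules(dissim, cut_height=0.5, min_module_size=2)
```

Gene 4 ends up unassigned here, which mirrors the low sensitivity of the static cut noted above: a gene on the edge of a module is lost rather than tested for reassignment.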
Parameters of the module-building process
- Minimum number of genes in a module (minModuleSize)
- Minimum distance for merging modules (mergeCutHeight): calculate the eigengene of each module, build a tree from the module eigengenes, and merge modules that are very close together, for example at a height < 0.2
- Module eigengene: principal component analysis (PCA) is performed on all genes in the module, and the first principal component is the eigengene. It represents the overall expression pattern of the genes in the module. If the module is regarded as a single gene, its eigengene can be regarded as the expression value of that gene.
The eigengenes of all modules are used to build a tree that captures the correlation between modules, and modules are merged according to the height separating them.
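A rough sketch of the eigengene calculation in plain Python, assuming each gene is standardized across samples and the first principal component is found by power iteration (a stand-in for a full PCA or SVD). The function names, the starting vector, and the sign convention (orienting the eigengene along the average expression) are choices made for this illustration.

```python
import math

def standardize(values):
    """Center and scale one gene's expression across samples (needs non-constant values)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def module_eigengene(expr, n_iter=200):
    """First principal component across samples for the genes of one module.

    expr[g] is the expression of gene g across samples. Returns a unit-length
    vector with one value per sample.
    """
    z = [standardize(gene) for gene in expr]
    n_genes, n_samples = len(z), len(z[0])
    v = [1.0 + 0.1 * s for s in range(n_samples)]  # arbitrary non-constant start
    for _ in range(n_iter):
        # One power-iteration step on Z^T Z, computed as Z^T (Z v).
        scores = [sum(z[g][s] * v[s] for s in range(n_samples)) for g in range(n_genes)]
        v = [sum(z[g][s] * scores[g] for g in range(n_genes)) for s in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # Orient the eigengene so it follows the average standardized expression.
    mean_profile = [sum(z[g][s] for g in range(n_genes)) / n_genes for s in range(n_samples)]
    if sum(a * b for a, b in zip(v, mean_profile)) < 0:
        v = [-x for x in v]
    return v

# Toy module: two genes rising across four samples, one falling.
eigengene = module_eigengene([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]])
```

The resulting vector has one value per sample and can then be correlated with traits or with other module eigengenes.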
Screening gene modules
Method 1: Expression pattern analysis. Analyze the expression pattern of each module across all samples
Method 2: Phenotype association analysis. Analyze the relationship between gene modules and phenotype data (calculate the correlation coefficient between the two)
Method 3: Enrichment analysis. Perform GO and KEGG functional enrichment analysis on the genes in each module
Method 4: Target genes. Screen modules based on the target genes of interest
Method 1: Expression pattern analysis of the module eigengene
- Module expression pattern analysis: the level of the module eigengene in each sample
- Module eigengene: as defined above, the first principal component from a PCA of all genes in the module. It represents the overall expression pattern of the module and can be treated as the expression value of the module when the module is regarded as a single gene
- If the module eigengene has a large positive or negative value in a sample, the module is closely related to that sample
Method 2: Association analysis of modules and phenotypic traits
- Module significance (MS): the average of the gene significance values of all genes in the module
- Gene significance (GS): the correlation coefficient between a gene's expression level and the dependent variable, that is, between the expression of the gene and a phenotype. Alternatively, use a t-test to obtain the P value for differential expression of each gene between phenotype groups (or the P value of the Pearson correlation), and define GS as the base-10 logarithm of that P value (conventionally negated, -log10 P, so that larger values mean stronger significance)
- Calculate the MS of each module for a given phenotypic trait. If the MS of one module is markedly higher than that of the other modules, that module is associated with the trait
- Module eigengene significance (ES): the correlation coefficient between the eigengene of a module and a trait; select the module most strongly correlated with the trait
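The three quantities can be sketched in plain Python as below, using the correlation-based definition of GS. Averaging the absolute GS values for MS is an assumption made here so that positive and negative correlations do not cancel; the function names and the toy data are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def gene_significance(expr, trait):
    """GS of each gene: correlation between its expression and the trait."""
    return [pearson(gene, trait) for gene in expr]

def module_significance(gs_values):
    """MS: mean absolute GS over the genes of one module (assumed convention)."""
    return sum(abs(gs) for gs in gs_values) / len(gs_values)

def eigengene_significance(eigengene, trait):
    """ES: correlation between the module eigengene and the trait."""
    return pearson(eigengene, trait)

# Toy data: one trait measured in four samples, three genes from one module.
trait = [1, 2, 3, 4]
module_expr = [[2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
gs = gene_significance(module_expr, trait)
ms = module_significance(gs)
```

Computing `ms` for every module against the same trait gives the values to compare when looking for the module that stands out.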
Method 3: Functional enrichment analysis of module genes
- Perform GO and KEGG functional enrichment analysis on each module, and pick the module whose functions are most relevant to the trait under study for in-depth exploration
Method 4: Screening modules based on target genes
- When the research purpose, previous results, or published literature point to target genes of particular interest, the module containing those genes can be selected directly for the next step of analysis
Identify key genes
Method 1: Analysis of gene connectivity within the module
Method 2: Analysis of genes with specific functions (types)
Method 3: Target gene association analysis
Method 1: Analysis of gene connectivity within the module
- Connectivity (degree): the sum of a gene's connections to all other genes (direct + indirect). It describes how strongly a gene is associated with the other genes and is generally denoted k
- Intramodular connectivity (IC): the degree of association (co-expression) between a gene in a module and the other genes in that module, denoted kIM. It can be used to measure module membership (MM)
- Module membership (MM), or eigengene-based connectivity (kME): the correlation between a gene's expression profile across all samples and the eigengene of a given module, which measures how strongly the gene belongs to that module
- kME close to 0 means the gene is not a member of the module; kME close to 1 or -1 means the gene is closely related to the module (positively or negatively)
- kME can be calculated for any gene relative to any module, not only for the members of that module
- Difference between kME and kIM: IC measures the identity of genes in a specific module, and MM measures the position of genes in the global network
- kME and kIM are highly correlated: a hub gene with a high kIM in a module usually also has a high kME for that module
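A small sketch of the two measures in plain Python, under the assumption that kIM is the sum of a gene's adjacencies to the other genes of its module and kME is the Pearson correlation with the module eigengene. The function names and all numbers are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def intramodular_connectivity(adj, module_genes):
    """kIM: sum of adjacencies between each module gene and the other module genes."""
    return {i: sum(adj[i][j] for j in module_genes if j != i) for i in module_genes}

def module_membership(expr, eigengene):
    """kME: correlation of each gene's expression profile with the module eigengene."""
    return [pearson(gene, eigengene) for gene in expr]

# Toy adjacency for four genes; genes 0, 1, 2 form the module, gene 3 lies outside it.
adj = [[1.0, 0.8, 0.6, 0.9],
       [0.8, 1.0, 0.2, 0.9],
       [0.6, 0.2, 1.0, 0.9],
       [0.9, 0.9, 0.9, 1.0]]
k_im = intramodular_connectivity(adj, [0, 1, 2])

# kME can be computed for any gene, member of the module or not.
k_me = module_membership([[1, 2, 3], [3, 2, 1], [1, 3, 1]], eigengene=[-1, 0, 1])
```

Connections to gene 3 do not count toward kIM because it is outside the module, whereas kME is defined for every gene regardless of module assignment.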
Screen key genes:
[1]
- Two genes are considered connected only if their TOM value (the weight in the module's regulatory relationship table) exceeds a threshold (default 0.15). The connectivity of each gene is then calculated; that is, relationships are first filtered for sufficient strength, and connectivity is calculated afterwards
- Select the genes whose intramodular connectivity (kME or kIM) ranks in the top 30% or 10% of the module
- Cytoscape network diagrams are generally drawn from the weight (TOM) values
[2]
- Draw a scatter plot of module membership (MM) against gene significance (GS), and select the genes in the upper right corner, with both high MM and high GS, for further analysis
- Gene significance (GS): the correlation coefficient between a gene's expression level and the dependent variable, which measures how strongly the gene is associated with the phenotypic trait. The higher the absolute GS, the stronger the association with the phenotype and the more biologically meaningful the gene. GS can be positive or negative (positive or negative correlation)
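Both screening routes can be sketched in a few lines of plain Python. The helper names, the toy values, and the MM and GS cutoffs are hypothetical; only the 0.15 TOM threshold and the top 30% or 10% rule come from the text.

```python
def tom_edges(tom, names, threshold=0.15):
    """Edge list (gene_a, gene_b, weight) for pairs whose TOM exceeds the threshold."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if tom[i][j] > threshold:
                edges.append((names[i], names[j], tom[i][j]))
    return edges

def degree_from_edges(edges, names):
    """Number of retained edges per gene."""
    degree = {name: 0 for name in names}
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def top_fraction(scores, fraction):
    """Genes ranked in the top fraction by score (at least one gene is returned)."""
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda g: scores[g], reverse=True)
    return ranked[:n_keep]

def high_mm_and_gs(mm, gs, mm_cut, gs_cut):
    """Genes whose |MM| and |GS| both reach their cutoffs."""
    return [g for g in mm if abs(mm[g]) >= mm_cut and abs(gs[g]) >= gs_cut]

names = ["g1", "g2", "g3", "g4"]
tom = [[1.0, 0.5, 0.3, 0.1],
       [0.5, 1.0, 0.1, 0.05],
       [0.3, 0.1, 1.0, 0.1],
       [0.1, 0.05, 0.1, 1.0]]
edges = tom_edges(tom, names)             # route [1]: filter by TOM first
degree = degree_from_edges(edges, names)  # then count connections
hubs = top_fraction(degree, 0.3)

mm = {"g1": 0.95, "g2": 0.85, "g3": -0.9, "g4": 0.2}
gs = {"g1": 0.7, "g2": 0.1, "g3": -0.6, "g4": 0.8}
candidates = high_mm_and_gs(mm, gs, mm_cut=0.8, gs_cut=0.5)  # route [2]
```

The edge list has the from, to, weight shape that network tools such as Cytoscape typically import, and `top_fraction` works equally well on kME or kIM scores in place of the edge counts.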
Method 2: Gene analysis for specific functions (types)
- Genes with high connectivity generally sit upstream in the regulatory network; genes with low connectivity generally sit downstream
- Upstream elements are generally regulatory factors such as transcription factors; downstream elements are generally functional enzymes or other proteins
- Focus on genes with regulatory functions, typically transcription factors; these are often the key genes
Method 3: Target gene association analysis
- Depending on the research purpose, select genes closely related to the target gene, for example the 10 genes with the highest TOM values with the target gene, or all genes whose TOM with the target gene exceeds 0.2 (the threshold is adjustable)
- This can pinpoint candidate genes that have upstream or downstream regulatory relationships with the target gene
- When the target gene itself is not highly connected, select genes that have a high TOM with the target gene and are also highly connected
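The selection rules above can be sketched in plain Python. The function names, the toy TOM values, and the cutoffs are illustrative; taking a gene's connectivity to be the sum of its TOM values with all other genes is an assumption of this sketch.

```python
def top_tom_partners(tom, names, target, n_top=10, min_tom=None):
    """Genes ranked by TOM with the target gene, optionally filtered by a TOM cutoff."""
    t = names.index(target)
    partners = [(names[j], tom[t][j]) for j in range(len(names)) if j != t]
    if min_tom is not None:
        partners = [(gene, w) for gene, w in partners if w > min_tom]
    partners.sort(key=lambda pair: pair[1], reverse=True)
    return partners[:n_top]

def well_connected_partners(tom, names, target, min_tom, min_connectivity):
    """Partners of the target that are themselves highly connected.

    Connectivity is taken here as the sum of a gene's TOM values with all
    other genes (an assumption of this sketch).
    """
    n = len(names)
    k = {names[i]: sum(tom[i][j] for j in range(n) if j != i) for i in range(n)}
    partners = top_tom_partners(tom, names, target, n_top=n, min_tom=min_tom)
    return [gene for gene, _ in partners if k[gene] >= min_connectivity]

names = ["g1", "g2", "g3", "g4"]
tom = [[1.0, 0.5, 0.3, 0.1],
       [0.5, 1.0, 0.1, 0.05],
       [0.3, 0.1, 1.0, 0.25],
       [0.1, 0.05, 0.25, 1.0]]
top2 = top_tom_partners(tom, names, "g4", n_top=2)
above_cutoff = top_tom_partners(tom, names, "g4", min_tom=0.2)
connected = well_connected_partners(tom, names, "g4", min_tom=0.08, min_connectivity=0.8)
```

In the toy data g4 is weakly connected, so the last call keeps only the partner that is itself well connected, which corresponds to the final bullet above.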