What you need to know about population genetics
Today I am sharing a note on population genetics, compiled mainly from public online material and published literature. It covers an overview of population genetics, research methods, application fields, analysis workflows, statistical principles, population structure assessment, and more.
What is the difference between a population and an individual?
In genetics, populations and individuals are two important concepts. A population is a group of individuals that share common genetic characteristics, while an individual is a single organism.
First, a population is made up of many individuals. Genetic exchange and gene flow can occur between the individuals in a population, and this can change the gene frequencies of the population.
Second, population genetics studies how genes are distributed and how they change in a population, while individual genetics studies the hereditary characteristics and genetic variation of individuals.
Population genetics focuses on the frequency and distribution of genes in a population, and seeks to understand the population's genetic structure and evolution by studying its genetic composition.
Molecular-level studies of organisms mainly look at changes in single genes or in the whole transcriptome at the individual level. Population-level research builds on these individual-level studies: population genetics mainly studies the genetic rules of populations composed of different individuals.
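The gene frequencies discussed in this section can be made concrete with a small calculation. The following is a minimal sketch, assuming a diploid locus with two alleles A and a; the function name and the genotype counts are hypothetical and serve only as an illustration.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q): the frequencies of alleles A and a at a diploid locus."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("at least one individual is required")
    # Each AA individual carries two copies of A, each Aa individual one copy.
    p = (2 * n_AA + n_Aa) / total_alleles
    return p, 1.0 - p


# Hypothetical genotype counts, for illustration only.
p, q = allele_frequencies(n_AA=30, n_Aa=50, n_aa=20)
print(f"p(A) = {p:.2f}, q(a) = {q:.2f}")
```

Tracking how p and q change between generations or between subgroups is the starting point for most of the statistics described later in this note.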
Why do population genetic research?
Theoretical system
Before the rapid development of sequencing technology, populations were mainly studied through phenotypes. For example, the 13 species of finches in the Galapagos Islands have different beaks, and Darwin believed that this was the result of natural selection.
The core idea of Darwin's theory of evolution can be summarized simply as "natural selection, survival of the fittest", and it remains the most widely known theory of evolution.
It was not until 1968 that the Japanese geneticist Motoo Kimura proposed the neutral theory of molecular evolution, also called the neutral theory.
The neutral theory can be understood like this: a group of people enter a lottery, and if nobody has inside information, everyone has an equal probability of winning the first prize. That probability has nothing to do with the height, age, hobbies or other attributes of the participants. In population genetic studies, the neutral theory is often used as the null hypothesis when calculating various other statistics.
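The lottery analogy can be illustrated with a small simulation. This is a minimal sketch of neutral drift under a Wright-Fisher style model, which is an assumption made here for illustration and is not described in the original note; the function name, population size and seed are hypothetical.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate neutral drift of one allele in a diploid population.

    Each generation, 2N allele copies are drawn at random from the previous
    generation. No allele has a fitness advantage (the "fair lottery").
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Every copy in the next generation is the focal allele with probability p.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory


traj = wright_fisher(p0=0.5, pop_size=50, generations=100)
print(f"start {traj[0]:.2f}, end {traj[-1]:.2f}")
```

Even without any selection the frequency wanders from generation to generation, which is why statistics computed on real data are compared against this neutral expectation.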
Technical means
The price of sequencing has dropped significantly. According to data released by the NIH, sequencing has become widely accessible in recent years. Second-generation high-throughput sequencing has become an essential tool for genetic research, so the technical conditions are now in place for the genetic analysis of population resources.
Population genetics based on resequencing
Resequencing yields genotype information for a set of samples and identifies key sites of variation. With resequencing data, the frequency distribution of particular genes in the population, and how it changes, can be analyzed to uncover the genetic patterns of the population.
Types of genetic variation
Common variant types include SNP, InDel, SV, CNV, etc. Resequencing studies focus most on SNPs, followed by InDels. Other structural variations are studied less often. (Structural variation usually needs to be studied separately and is not covered here.)
Whole genome resequencing
Whole-genome sequencing of species with a reference genome is called resequencing, while whole-genome sequencing of species without a reference genome requires de novo assembly. As the price of sequencing decreases, reference genomes of more and more species have been sequenced and assembled.
In population genetics research, many of the species studied have reference genomes. Common plant examples include Arabidopsis thaliana, rice, wheat, and corn.
Resequencing analysis process
Population evolutionary selection
Positive selection
Positive selection is easiest to explain in terms of natural selection: if a gene or locus gives an individual greater viability or fertility, that individual will leave more offspring. In this way, the gene or site becomes more and more common in the population.
Positive selection can spread beneficial mutation sites in the population, but at the same time reduce the polymorphism level of this site in the population.
In other words, the nucleotide composition around this site was originally diverse. After positive selection, the diversity of nucleotides around this site gradually became homogeneous.
This is like a field that originally contained rice, barnyard grass and other weeds. As the adaptability of barnyard grass increases, the barnyard grass gradually increases, the rice gradually decreases, and finally only barnyard grass remains.
This reduction in polymorphism after selection is called a selective sweep.
Negative selection
Negative selection is the opposite of positive selection. If an individual in the population carries a lethal mutation that causes it or its offspring to be eliminated from the population, the polymorphism at that site in the population is also reduced.
It is as if I had 100 rice plants and one of them died during growth: for my small rice population, the site unique to the lost plant is gone, and overall polymorphism is reduced.
Balancing selection
Balancing selection refers to the retention of multiple alleles in a population's gene pool at a higher frequency than expected from genetic drift, for example through heterozygote advantage.
BetaScan2, a tool for detecting balancing selection, is a Python script whose input only requires filtered SNP data.
Population Genetics Statistical Indicators
Population polymorphism parameter
Parameter definition: $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per site.
Number of segregating sites
The number of segregating sites gives an estimate of $\theta$ and counts the positions at which the gene of interest is polymorphic in a multiple sequence alignment:
$$\theta_W = \frac{S}{a_n}, \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i}$$
where $S$ is the number of segregating sites, such as the number of SNPs, and $a_n$ is the sum of the reciprocals of the integers from 1 to $n-1$, with $n$ the number of sampled sequences.
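The estimator based on segregating sites (Watterson's estimator) can be sketched in a few lines. This is a minimal illustration, assuming $S$ segregating sites observed among $n$ sampled sequences (for diploids, $n$ is twice the number of individuals); the function name and the example numbers are hypothetical.

```python
def watterson_theta(num_segregating_sites, num_sequences):
    """Watterson's estimator: theta_W = S / a_n, with a_n = sum_{i=1}^{n-1} 1/i."""
    if num_sequences < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, num_sequences))
    return num_segregating_sites / a_n


# Hypothetical example: 16 SNPs observed among 10 sampled sequences.
theta_w = watterson_theta(num_segregating_sites=16, num_sequences=10)
print(f"theta_W = {theta_w:.3f}")
```

Dividing the result by the length of the region gives a per-site value that can be compared between regions of different sizes.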
Nucleotide diversity
$\pi$ refers to nucleotide diversity. The larger the value, the higher the nucleotide diversity. It is often used to measure nucleotide diversity within a population and can also be used to infer evolutionary relationships.
$\pi$ can be understood as counting the nucleotide differences for every pair of sequences within the group, and then taking the mean over all pairs. The commonly used software is vcftools.
As shown in the example above, the nucleotide diversity of the Sh4 gene (which controls grain shattering in rice) is reduced in all subpopulations, indicating that this gene is under selection in all of them, which may be related to artificial selection during breeding.
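The pairwise definition of nucleotide diversity can be written out directly. The following is a minimal sketch for a small set of aligned sequences of equal length; it is not how vcftools computes the statistic internally, and the function name and toy sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean number of pairwise differences per site among aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every pair of sequences.
    total_diff = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diff / (len(pairs) * length)


# Toy alignment, for illustration only.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
pi = nucleotide_diversity(toy)
print(f"pi = {pi:.4f}")
```

In practice the statistic is computed in sliding windows along the genome, so that a local drop such as the one described for Sh4 stands out against the genome-wide background.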
Within-group selection test
Tajima's D
Tajima's D is a statistical test proposed by the Japanese scholar Fumio Tajima in 1989 to test whether DNA sequences have evolved according to the neutral model.
The D value has the following three biological interpretations:
- D > 0: balancing selection or a sudden population contraction. [Rare alleles are scarce]
- D < 0: a bottleneck followed by population expansion. [Rare alleles are in excess]
- D = 0: neutral evolution at equilibrium, with no evidence of selection.
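The three cases above follow from comparing two estimates of diversity: the mean number of pairwise differences and the estimate based on the number of segregating sites. The following is a minimal sketch using the standard constants from Tajima's 1989 formulation, written here from the textbook definition rather than taken from the original note; the function name and toy alignments are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for a list of aligned, equal-length sequences (n >= 4)."""
    n = len(sequences)
    if n < 4:
        # With fewer than four sequences the variance term is zero.
        raise ValueError("need at least four sequences")
    # S: number of segregating (polymorphic) alignment columns.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")
    # k_hat: mean number of differences over all sequence pairs.
    pairs = list(combinations(sequences, 2))
    k_hat = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k_hat - seg_sites / a1) / sqrt(variance)


# Toy alignments: every variant is a singleton vs. every variant is at 50%.
rare_variants = ["AAAA", "TAAA", "ATAA", "AATA"]
common_variants = ["AAAA", "AAAA", "TTTA", "TTTA"]
print(tajimas_d(rare_variants), tajimas_d(common_variants))
```

The first toy alignment contains only singleton variants and gives a negative D, while the second contains only intermediate-frequency variants and gives a positive D, matching the interpretations listed above.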
Divergence between groups
$F_{ST}$ is called the fixation index. It estimates the difference between the average polymorphism within subpopulations and the polymorphism of the entire population, and so reflects the population structure.
The value range of $F_{ST}$ is [0,1]. When $F_{ST}=1$, the subgroups are completely differentiated. The higher the value, the higher the degree of differentiation.
Under neutral evolution, the size of $F_{ST}$ mainly depends on factors such as genetic drift and migration. If a certain allele has higher fitness in a specific environment and undergoes adaptive selection, its frequency will increase in that population, the level of differentiation between populations will increase, and $F_{ST}$ will rise.
$F_{ST}$ values can be analyzed together with GWAS results. Regions exceeding a certain threshold often coincide with the sites identified by GWAS.
As shown above in the resequencing population genetic analysis of cotton, the significant GWAS peak overlaps with the $F_{ST}$ peak, and the two signals corroborate each other.
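The fixation index described in this section can be sketched for a single biallelic locus. This is a minimal illustration using one common definition, $F_{ST} = (H_T - H_S)/H_T$, where $H_T$ is the expected heterozygosity of the pooled population and $H_S$ is the mean expected heterozygosity within subpopulations. Equal subpopulation sizes are assumed, and the function name and frequencies are hypothetical.

```python
def fst_biallelic(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus.

    subpop_freqs holds the frequency of the same allele in each
    subpopulation; subpopulations are assumed to be of equal size.
    """
    if not subpop_freqs:
        raise ValueError("need at least one subpopulation")
    p_bar = sum(subpop_freqs) / len(subpop_freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is monomorphic in the whole population
    h_s = sum(2 * p * (1 - p) for p in subpop_freqs) / len(subpop_freqs)
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two subpopulations.
print(fst_biallelic([0.5, 0.5]))  # identical frequencies
print(fst_biallelic([0.8, 0.2]))  # moderately diverged
print(fst_biallelic([1.0, 0.0]))  # fixed for different alleles
```

Genome scans typically compute this kind of value in windows and look for regions in the upper tail of the distribution, which is how the overlap with GWAS peaks is found.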
Between-group divergence test
ROD identifies selection signals based on the difference in nucleotide diversity between the wild population and the domesticated population, and it also measures the polymorphism that the domesticated population has lost relative to the wild population.
ROD, like $F_{ST}$, can be combined with GWAS analysis. Around an important, significantly associated site, the nucleotide diversity and the differentiation index usually change markedly, because nearby sites are tightly linked.
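The original note does not give a formula for ROD. A commonly used form is $\mathrm{ROD} = 1 - \pi_{\text{domesticated}} / \pi_{\text{wild}}$, which is assumed in the sketch below; the function name and the per-window diversity values are hypothetical.

```python
def reduction_of_diversity(pi_wild, pi_domesticated):
    """ROD = 1 - pi_domesticated / pi_wild (assumed definition)."""
    if pi_wild <= 0:
        raise ValueError("pi_wild must be positive")
    return 1.0 - pi_domesticated / pi_wild


# Hypothetical (pi_wild, pi_domesticated) values for three genomic windows.
windows = [(0.0040, 0.0036), (0.0050, 0.0005), (0.0030, 0.0027)]
rod = [reduction_of_diversity(wild, dom) for wild, dom in windows]

# The window with the largest ROD is the strongest candidate for
# selection during domestication.
candidate = max(range(len(rod)), key=rod.__getitem__)
print(rod, candidate)
```

Values near 1 mean that almost all of the wild diversity has been lost in the domesticated population for that window.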
Population structure analysis
Evolutionary trees, PCA and population structure plots are the three musketeers of population genetic analysis. Their purpose is to display population structure information, such as how the materials group, how they are related, and how they cluster.
Evolutionary tree
An evolutionary tree is a diagram that connects individuals according to how closely related they are. A rooted tree means that all individuals share a common ancestor. The closer two branches are, the more closely related the samples are.
Out-group rooting method: when the differences between individuals in the group are small, another species can be introduced as the root.
An unrooted tree only shows the distances between individuals and does not specify a common ancestor. Its topology can be freely rearranged to change the shape of the tree.
Drawing method: commonly used tree-building software includes PHYLIP and SNPhylo. Software for editing and polishing trees includes MEGA, ggtree, etc. The web-based tool iTOL is recommended and can be used online.
PCA principal component analysis
PCA is a very common dimensionality reduction method that shows the distribution of samples clearly. The closer two points are in the scatter plot (by straight-line distance), the closer the relationship between the samples. Many programs can compute PCA, and plink can compute it directly from VCF files.
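To show what the PCA step does to genotype data, here is a minimal sketch using NumPy. It assumes a samples-by-SNPs matrix with genotypes coded as 0, 1 or 2 copies of the alternate allele and simply centers each SNP before a singular value decomposition. Dedicated tools such as plink apply their own normalization, so this is an illustration of the idea rather than a reproduction of their output; the function name and toy matrix are hypothetical.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components of a genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]  # sample coordinates
    explained = s**2 / np.sum(s**2)  # fraction of variance per component
    return coords, explained[:n_components]


# Toy data: rows are samples, columns are SNPs; the first three samples
# and the last three samples come from two different groups.
toy = np.array([
    [0, 0, 0, 2, 2, 2],
    [0, 0, 1, 2, 2, 2],
    [0, 1, 0, 2, 2, 1],
    [2, 2, 2, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [2, 1, 2, 0, 1, 0],
])
coords, explained = genotype_pca(toy)
print(coords[:, 0])
print(explained)
```

Plotting the first column of `coords` against the second gives the familiar PCA scatter plot, in which the two groups of the toy data fall on opposite sides of the first axis.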
Grouping based on PCA
Materials are grouped according to the scatter of points in the PCA plot. For example, in the figure from the soybean resequencing article, points of different colors clearly show different distribution patterns, and each color represents a different subgroup.
Outlier detection based on PCA
An outlier is a sample that lies far away from the other samples in the PCA plot. Its genetic background may be very different from that of the other samples, or the sample may have been mixed up, for example a wild-type sample labeled as a domesticated one for sequencing.
Infer subgroup evolutionary relationships based on PCA
PCA shows how different individuals are distributed relative to each other, which is usually related to geographical factors. For example, because of the spatial distance between Europe and Asia, the two subgroups differ considerably, and their points lie relatively far apart in the PCA results.
Population structure plot
Evolutionary trees and PCA can tell whether a population is structured, but they cannot tell how many subgroups it should be divided into, nor do they show the genetic exchange between subgroups. Don't worry, this is where the population structure plot comes in.
The population structure plot is essentially a stacked bar chart. Each bar is a sample and shows the ancestry composition of that sample. The number of colors in a bar indicates how many ancestral groups the sample derives from.
If a bar has only one color, the individual is of pure ancestry. If a block of bars has a uniform color, the samples in that block share similar ancestry and should belong to the same subgroup.
Linkage disequilibrium analysis
Linkage disequilibrium (LD) combines two terms, "linkage" and "disequilibrium", which stand in a unity of opposites. From a certain perspective, LD represents the correlation between variants, and this correlation can be measured using a correlation coefficient.
LD is an indicator of whether the genotypes of two molecular markers change in step, that is, whether they are correlated. If two SNP markers are adjacent, their genotypes also tend to move in lockstep across the population. For example, take two loci with two alleles each: A/a and B/b.
If two loci are linked, we will see that certain genotypes tend to be co-inherited, i.e. certain haplotypes will be more frequent than expected.
LD calculation method
LD between two loci is usually expressed with $D'$ and $r^2$. Suppose two linked loci A and B have the alleles A, a, B and b, with the corresponding frequencies written as $p$ plus a subscript. For example, $p_{Ab}$ is the frequency of haplotype Ab. (There are 4 alleles and 4 haplotypes.)
The difference between the observed haplotype frequency and the expected haplotype frequency is then calculated as:
$$D = p_{AB} - p_A p_B$$
The correlation coefficient is calculated as:
$$r^2 = \frac{D^2}{p_A \, p_a \, p_B \, p_b}$$
$D'$ is calculated as:
$$D' = \frac{|D|}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min(p_A p_b,\; p_a p_B) & D > 0 \\ \min(p_A p_B,\; p_a p_b) & D < 0 \end{cases}$$
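The LD measures can be computed directly from the four haplotype frequencies. The following is a minimal sketch, assuming the standard definitions $D = p_{AB} - p_A p_B$, $r^2 = D^2 / (p_A p_a p_B p_b)$ and $D' = |D| / D_{\max}$; the function name and the example frequencies are hypothetical.

```python
def ld_stats(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies."""
    if abs(p_AB + p_Ab + p_aB + p_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    denom = p_A * p_a * p_B * p_b
    if denom <= 0:
        raise ValueError("both loci must be polymorphic")
    d = p_AB - p_A * p_B  # observed minus expected frequency of haplotype AB
    r2 = d * d / denom
    # D_max depends on the sign of D; it is positive when both loci are polymorphic.
    d_max = min(p_A * p_b, p_a * p_B) if d >= 0 else min(p_A * p_B, p_a * p_b)
    return d, abs(d) / d_max, r2


# Hypothetical haplotype frequencies for AB, Ab, aB, ab.
d, d_prime, r2 = ld_stats(0.4, 0.1, 0.1, 0.4)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

When all four haplotypes occur at the frequencies expected from the allele frequencies, all three values are zero; when only two haplotypes exist, $r^2$ reaches 1.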
LD decay analysis
As the distance between markers increases, the average LD decreases. This is called LD decay.
LD decay can be used to compare the diversity of populations. Generally, LD decays faster in wild populations than in domesticated populations. Whether the number of markers used in GWAS is sufficient can be judged by comparing the LD decay distance with the average distance between markers.
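An LD decay distance can be estimated by averaging $r^2$ in distance bins. The sketch below assumes one common convention, which the original note does not specify: the decay distance is the midpoint of the first bin whose mean $r^2$ falls to half of the maximum binned value. The function name, bin size and marker pairs are hypothetical.

```python
from collections import defaultdict


def ld_decay_distance(pairs, bin_size):
    """Bin (distance, r2) pairs and find where mean r2 drops to half its maximum.

    Returns (curve, distance), where curve is a list of
    (bin midpoint, mean r2) and distance is None if no bin qualifies.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dist, r2 in pairs:
        b = int(dist // bin_size)
        sums[b] += r2
        counts[b] += 1
    curve = [((b + 0.5) * bin_size, sums[b] / counts[b]) for b in sorted(sums)]
    if not curve:
        return curve, None
    half = max(r for _, r in curve) / 2
    for midpoint, mean_r2 in curve:
        if mean_r2 <= half:
            return curve, midpoint
    return curve, None


# Hypothetical (distance in bp, r2) values for marker pairs.
pairs = [(500, 0.8), (800, 0.7), (1500, 0.5), (1800, 0.45), (2500, 0.3), (3500, 0.2)]
curve, half_distance = ld_decay_distance(pairs, bin_size=1000)
print(curve, half_distance)
```

If the average spacing between markers is smaller than this distance, neighboring markers are still in appreciable LD, which is the check described above for marker density in GWAS.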
GWAS genome-wide association analysis
Genome-wide association analysis is commonly used in medicine and agriculture. Put simply, it tests for association between genetic markers such as SNPs and phenotypic data, detects the loci related to the phenotype, and then goes back to find the corresponding genes and study their effect on the phenotype. In medicine, the phenotypes studied are often diseases; in agronomy, they are often agronomic traits of interest, such as rice plant height, yield, number of grains per panicle, etc.
GWAS mathematical model
The above is only a brief introduction. Please refer to relevant information for specific mathematical models and methods.
GWAS result information
GWAS results usually come with two plots: a Manhattan plot and a QQ plot. The QQ plot is generally examined first; only if the QQ plot looks normal are the results in the Manhattan plot meaningful.
QQ plot
In a normal QQ plot, the points follow the diagonal and lift slightly upward at the tail. If the QQ plot is abnormal, consider changing the model or algorithm and trying again.
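The coordinates behind a QQ plot are simple to compute. The following is a minimal sketch, assuming that under the null hypothesis p-values are uniformly distributed and using $(i - 0.5)/n$ as the expected value of the $i$-th smallest p-value; the function name and example p-values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot."""
    n = len(p_values)
    if n == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    observed = sorted(-math.log10(p) for p in p_values)
    # Under the null, the i-th smallest of n p-values is about (i - 0.5) / n.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical p-values; the smallest one is more extreme than expected.
points = qq_points([0.9, 0.6, 0.4, 0.0001])
for expected, observed in points:
    print(f"{expected:.2f}\t{observed:.2f}")
```

Points that sit on the diagonal across most of the range, with only the largest values rising above it, correspond to the "normal" shape described above.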
Manhattan plot
A Manhattan plot is essentially a scatter plot. Each point represents a site, and the higher the point, the more significant the site. With many points of varying heights, the plot looks like the jumbled skyline of high-rises in Manhattan.
The figure above shows the GWAS results for cotton resequencing. The key peaks mark the candidate positions for study, which are then followed up with functional validation experiments.
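Picking the peaks out of a Manhattan plot amounts to comparing each site's $-\log_{10}(p)$ with a significance line. The sketch below assumes a Bonferroni threshold of $\alpha / n$ for $n$ tested sites, which is one common choice and is not specified in the original note; the function name and the association results are hypothetical.

```python
import math


def significant_sites(results, alpha=0.05):
    """Return the threshold height and the sites above it, highest first.

    results is a list of (chromosome, position, p_value) tuples.
    """
    if not results:
        raise ValueError("no results given")
    threshold = alpha / len(results)  # Bonferroni correction (assumed)
    hits = [(chrom, pos, -math.log10(p)) for chrom, pos, p in results if p < threshold]
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return -math.log10(threshold), hits


# Hypothetical association results.
results = [("chr1", 100, 0.2), ("chr1", 200, 1e-9), ("chr2", 50, 0.03), ("chr2", 80, 1e-4)]
line_height, hits = significant_sites(results)
print(line_height, hits)
```

The returned height is where the horizontal significance line would be drawn, and the listed sites are the candidates to take forward to functional validation.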
Finally, thank you for reading this far! This note compiles part of the content published by "Researching Monk Xiaolan Ge" on Jianshu. I hope it helps with your study of population genetics. If you find it useful, please forward it and share it with others.
References:
https://www.jianshu.com/p/807e54278539
https://zhuanlan.zhihu.com/p/541850657
https://www.jianshu.com/p/9793e14c0d08