Essential knowledge for population genetics

What you need to know about population genetics

Today I am sharing a note on population genetics, compiled mainly from public online material and published literature. It covers an overview of population genetics, research methods, application areas, analysis workflows, statistical principles, population structure assessment, and more.


What is the difference between a population and an individual?

In genetics, the population and the individual are two important concepts. A population is a group of individuals that share common genetic characteristics, while an individual is a single organism.


First, a population is made up of multiple individuals. Genetic exchange and gene flow can occur between individuals in a population, and this can change the gene frequencies of the population.

Second, population genetics studies how genes are distributed in a population and how that distribution changes, while individual genetics studies hereditary characteristics and genetic variation within an individual.

Population genetics focuses on the frequency and distribution of genes in a population, and uses the genetic composition of the population to understand its genetic structure and evolution.

Molecular-level studies of organisms mainly look at changes in single genes and in whole transcripts at the individual level. Population-level research builds on the study of individuals: population genetics mainly studies the genetic patterns of groups composed of different individuals.

Why study population genetics?

Theoretical system

Before the rapid development of sequencing technology, populations were studied mainly through phenotype. For example, the 13 finch species on the Galápagos Islands have differently shaped beaks, which Darwin attributed to natural selection.

Darwin's theory of evolution can be summarized simply as "natural selection, survival of the fittest", and it remains the most widely known theory of evolution.


It was not until 1968 that the Japanese geneticist Motoo Kimura proposed the neutral theory of molecular evolution, also called the neutral theory.

The neutral theory can be understood like this: a group of people enter a lottery, and if nobody has inside information, everyone has an equal probability of winning the first prize. That probability has nothing to do with the height, age, hobbies, or other traits of the participants. In population genetic studies the neutral theory often serves as the null hypothesis from which other statistics are calculated.
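The lottery analogy can be made concrete with a small simulation. What follows is a minimal sketch of a Wright-Fisher style model of neutral drift. That model is an assumption of this example and is not described in the note itself; the function name `wright_fisher`, the population size, and the seed are all illustrative.

```python
import random

def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate neutral drift of one allele in a diploid population.

    Each generation, 2 * pop_size gene copies are drawn at random from the
    previous generation. No allele has a fitness advantage, so every copy
    has the same chance of being passed on (the "fair lottery").
    """
    rng = random.Random(seed)
    copies = 2 * pop_size
    freq = p0
    trajectory = [freq]
    for _ in range(generations):
        # Each new copy carries the allele with probability equal to its
        # current frequency.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
        trajectory.append(freq)
    return trajectory

trajectory = wright_fisher(p0=0.5, pop_size=50, generations=100)
print(trajectory[-1])
```

Running it with different seeds gives different trajectories even though no allele is favored, which is the behavior a neutral null hypothesis describes.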

Technical means

The cost of sequencing has dropped sharply, as data released by the NIH show, and sequencing has become widely accessible in recent years. Second-generation high-throughput sequencing is now an essential tool for genetic research, so the technical conditions for genetic analysis of population resources are in place.


Population genetics based on resequencing

Resequencing yields genotype information for a set of samples and identifies key variant sites. With resequencing data, the frequency distribution of genes in a population, and how it changes, can be analyzed to uncover the genetic patterns of the population.

Types of genetic variation

Common variant types include SNP, InDel, SV, and CNV. Resequencing studies focus mostly on SNPs, followed by InDels. Other structural variants are studied less often. (Structural variation usually needs to be studied separately and is not covered here.)


Whole genome resequencing

Whole-genome sequencing of species with a reference genome is called resequencing, while whole-genome sequencing of species without a reference genome requires de novo assembly. As the price of sequencing decreases, reference genomes of more and more species have been sequenced and assembled.


More and more of the species studied in population genetics now have reference genomes. Common plant examples include Arabidopsis thaliana, rice, wheat, and corn.

Resequencing analysis process


Selection in population evolution

Positive selection

Positive selection is well explained by natural selection: if a gene or locus gives an individual greater viability or fertility, that individual leaves more offspring, so the gene or locus becomes more and more common in the population.


Positive selection can spread beneficial mutation sites in the population, but at the same time reduce the polymorphism level of this site in the population.

In other words, the nucleotide composition around this site was originally diverse. After positive selection, the diversity of nucleotides around this site gradually became homogeneous.

This is like a field that originally contained rice, barnyard grass and other weeds. As the adaptability of barnyard grass increases, the barnyard grass gradually increases, the rice gradually decreases, and finally only barnyard grass remains.

This reduction in polymorphism following selection is called a selective sweep.

Negative selection

Negative selection is the opposite of positive selection. If an individual in the population carries a lethal mutation that causes it or its offspring to be eliminated from the population, the polymorphism at that site in the population also decreases.

It is as if I had 100 rice plants and one of them died during growth. For my small rice population, the sites unique to the lost plant are now missing, so the overall polymorphism is reduced.

Balancing selection

Balancing selection refers to multiple alleles being maintained in a population's gene pool at frequencies higher than expected under genetic drift, for example through heterozygote advantage.


BetaScan2, a tool for detecting balancing selection, is a Python script whose input file requires only filtered SNP data.

Population Genetics Statistical Indicators

Population polymorphism parameter

Parameter definition: $\theta = 4 N_e \mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per site.

Number of segregating sites

The number of segregating sites gives an estimate of $\theta$ and counts the positions at which the gene is polymorphic in a multiple sequence alignment.

$$\theta_W = \frac{K}{a_n}, \qquad a_n = \sum_{i=1}^{n-1} \frac{1}{i}$$

where $K$ is the number of segregating sites, such as the number of SNPs, $n$ is the number of sampled sequences, and $a_n$ is the sum of the reciprocals of $1$ through $n-1$.
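As a rough illustration of the estimator based on segregating sites (Watterson's estimator), the sketch below counts polymorphic columns in a toy alignment and divides by the harmonic sum. The function name `watterson_theta` and the four example sequences are made up for this example, and the input is assumed to be equal-length, already aligned sequences with no gaps or missing data.

```python
def watterson_theta(sequences):
    """Watterson's estimator for a list of aligned, equal-length sequences."""
    n = len(sequences)
    # K: number of alignment columns in which more than one base is observed.
    k = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    # a_n: sum of 1/i for i = 1 .. n-1.
    a_n = sum(1.0 / i for i in range(1, n))
    return k / a_n

# Toy alignment (hypothetical data): columns 2 and 5 are polymorphic.
sequences = ["ATGCA", "ATGCT", "ACGCA", "ATGCA"]
theta_w = watterson_theta(sequences)
print(theta_w)
```

The value returned here is for the whole region; dividing by the alignment length would give a per-site estimate.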

Nucleotide diversity

$\pi$ denotes nucleotide diversity. The larger the value, the higher the nucleotide diversity. It is often used to measure diversity within a population and can also be used to infer evolutionary relationships.

$$\pi = \frac{1}{\binom{n}{2}} \sum_{i<j} d_{ij}$$

where $d_{ij}$ is the proportion of nucleotide sites that differ between sequences $i$ and $j$, and $n$ is the number of sequences. In other words, $\pi$ can be understood as computing the difference for every pair of sequences in the population and then taking the mean. A commonly used tool is vcftools.
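The "compare every pair, then average" idea can be sketched in a few lines. This is an illustrative toy, not how vcftools computes its statistics internally; the function name `nucleotide_diversity` and the example sequences are assumptions, and the input is assumed to be aligned sequences of equal length.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Mean proportion of differing sites over all pairs of sequences."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every pair of sequences.
    total_differences = sum(
        sum(a != b for a, b in zip(seq1, seq2)) for seq1, seq2 in pairs
    )
    # Average over pairs, then scale to a per-site value.
    return total_differences / len(pairs) / length

# Toy alignment (hypothetical data).
sequences = ["ATGCA", "ATGCT", "ACGCA", "ATGCA"]
pi = nucleotide_diversity(sequences)
print(pi)
```

In practice the same calculation is usually run in sliding windows along the genome so that regions of unusually low diversity stand out.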

In the example figure, the nucleotide diversity of the Sh4 gene (which controls grain shattering in rice) is reduced in all subpopulations. This indicates that the gene is under selection in all subpopulations, which may be related to artificial selection during breeding.

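Within-population selection test

Tajima's D is a statistical test proposed by the Japanese scholar Fumio Tajima in 1989 to test whether DNA sequences follow a neutral evolution model.

$$D = \frac{\pi - \theta_W}{\sqrt{\mathrm{Var}(\pi - \theta_W)}}$$

where $\pi$ is the mean number of pairwise differences and $\theta_W$ is the estimate based on the number of segregating sites, both computed over the same region.

The D value has the following three biological meanings:

  • D > 0: balancing selection, or a sudden population contraction. [Rare alleles are scarce and intermediate-frequency alleles are in excess]
  • D < 0: a bottleneck followed by population expansion, or a recent selective sweep. [Rare alleles are in excess]
  • D = 0: consistent with neutral evolution at equilibrium, no evidence of selection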
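To show where the sign of D comes from, here is a minimal sketch that follows the standard formulas from Tajima (1989). The function name `tajimas_d` and both toy alignments are assumptions of this example, and the input is assumed to be aligned sequences without missing data. With so few sequences the values carry no statistical weight; only the sign is of interest.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(sequences):
    """Tajima's D for aligned, equal-length sequences (None if no variation)."""
    n = len(sequences)
    seg_sites = sum(1 for col in zip(*sequences) if len(set(col)) > 1)
    if seg_sites == 0:
        return None  # D is undefined without segregating sites
    pairs = list(combinations(sequences, 2))
    # Mean number of pairwise differences over the region.
    k_hat = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    ) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k_hat - seg_sites / a1) / sqrt(variance)

# Hypothetical data: every variant is carried by a single sequence.
singletons = ["ATGCA", "ATGCT", "ACGCA", "ATGCA"]
# Hypothetical data: variants split the sample evenly (intermediate frequency).
balanced = ["AA", "AA", "TT", "TT"]
print(tajimas_d(singletons), tajimas_d(balanced))
```

The first alignment, dominated by rare variants, gives a negative D, and the second, with intermediate-frequency variants, gives a positive D, matching the interpretation in the list.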
Differentiation between populations

$F_{ST}$, the fixation index, compares the average polymorphism within subpopulations with the average polymorphism of the entire population, reflecting changes in population structure.

$F_{ST}$ ranges over [0, 1]. The higher the value, the greater the differentiation; $F_{ST} = 1$ indicates complete differentiation between subpopulations.

Under neutral evolution, the size of $F_{ST}$ depends mainly on factors such as genetic drift and migration. If an allele has higher fitness in a specific environment and undergoes adaptive selection, its frequency in that population increases, the level of differentiation between populations increases, and $F_{ST}$ rises.
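One common textbook form of the index is $F_{ST} = (H_T - H_S) / H_T$, where $H_S$ is the mean expected heterozygosity within subpopulations and $H_T$ is the expected heterozygosity of the pooled population. The sketch below applies it to a single biallelic site. It assumes equally sized subpopulations and known allele frequencies, and the function name `fst_biallelic` is illustrative. Resequencing pipelines generally use sample-size-corrected estimators, so treat this only as a way to see how the value moves between 0 and 1.

```python
def fst_biallelic(freqs):
    """F_ST = (H_T - H_S) / H_T for one biallelic site.

    freqs: frequency of the same allele in each subpopulation
    (subpopulations are assumed to be equally sized).
    """
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    if h_t == 0:
        return 0.0  # site is monomorphic; 0.0 is a convention chosen here
    return (h_t - h_s) / h_t

print(fst_biallelic([0.5, 0.5]))  # identical frequencies
print(fst_biallelic([0.8, 0.2]))  # partly differentiated
print(fst_biallelic([1.0, 0.0]))  # fixed for different alleles
```

Identical frequencies give 0, fixation for different alleles gives 1, and intermediate cases fall in between.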

The $F_{ST}$ values can be analyzed together with GWAS results. Regions whose $F_{ST}$ exceeds a chosen threshold often coincide with sites identified by GWAS.

In the resequencing-based population genetic analysis of cotton shown in the figure, the significant GWAS peak overlaps with the $F_{ST}$ peak, and the two signals corroborate each other.

Population divergence test

ROD (reduction of diversity) identifies selection signals from the difference in nucleotide polymorphism between the wild population and the domesticated population, and it also measures how much polymorphism the domesticated population has lost relative to the wild population.

Like $F_{ST}$, ROD can be combined with GWAS. Around an important site with a significant association, nucleotide diversity and the selection differentiation index usually both change markedly, and the two signals are closely linked.
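ROD is commonly computed per genomic window as $\mathrm{ROD} = 1 - \pi_{dom} / \pi_{wild}$. The formula in the original note was lost, so this form is an assumption here. The sketch below applies it to made-up window values; the function name and the numbers are purely illustrative.

```python
def reduction_of_diversity(pi_wild, pi_dom):
    """ROD per window: 1 - pi_dom / pi_wild (None where pi_wild is zero)."""
    return [
        1 - dom / wild if wild > 0 else None
        for wild, dom in zip(pi_wild, pi_dom)
    ]

# Hypothetical nucleotide diversity per window for each population.
pi_wild = [0.004, 0.005, 0.004, 0.006]
pi_dom = [0.003, 0.001, 0.004, 0.0]
rod = reduction_of_diversity(pi_wild, pi_dom)
print(rod)
```

Windows with ROD close to 1 have lost most of their diversity in the domesticated population and are candidates for domestication-related selection.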

Population structure analysis

Phylogenetic trees, PCA, and population structure plots are the "three musketeers" of population genetic analysis. They all display population structure information, such as how materials are grouped, kinship between samples, and clustering.

Phylogenetic tree

A phylogenetic tree is a diagram that connects individuals according to how closely they are related. A rooted tree implies that all individuals share a common ancestor. The closer two branches are, the more closely related the samples.

Outgroup rooting: when the differences between individuals within the population are small, another species can be introduced as the root.

An unrooted tree shows only the distances between individuals and does not specify a common ancestor. Its layout can be rearranged freely to change the shape of the tree.

Drawing tools: PHYLIP and SNPhylo are commonly used to build trees. Tools for refining and visualizing trees include MEGA and ggtree, and the web-based tool iTOL is recommended for working online.

PCA (principal component analysis)

PCA is a very common dimensionality reduction method that shows clearly how samples are distributed. The closer two points are in the scatter plot, the more closely related the samples. Many programs can compute PCA; PLINK, for example, can compute it directly from VCF files.
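To see what PCA does with genotype data, here is a minimal sketch that finds only the first principal component by power iteration, using the standard library alone. It is not how PLINK implements PCA. The function name `first_pc`, the 0/1/2 genotype coding, and the six toy samples are assumptions of this example.

```python
import random

def first_pc(genotypes, iterations=200, seed=0):
    """Score of each sample on the first principal component.

    genotypes: list of samples, each a list of 0/1/2 alternate-allele counts.
    """
    n_samples = len(genotypes)
    n_sites = len(genotypes[0])
    # Center every site (column) on its mean across samples.
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_sites)]
    centered = [[row[j] - means[j] for j in range(n_sites)] for row in genotypes]
    # Power iteration on X^T X to find the leading loading vector.
    rng = random.Random(seed)
    loadings = [rng.random() for _ in range(n_sites)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]
        loadings = [
            sum(centered[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_sites)
        ]
        norm = sum(w * w for w in loadings) ** 0.5
        loadings = [w / norm for w in loadings]
    # Project each centered sample onto the leading direction.
    return [sum(x * w for x, w in zip(row, loadings)) for row in centered]

# Hypothetical genotypes: the first three samples resemble each other,
# and so do the last three.
genotypes = [
    [0, 0, 2, 2, 0],
    [0, 1, 2, 2, 0],
    [0, 0, 2, 1, 0],
    [2, 2, 0, 0, 1],
    [2, 2, 0, 0, 0],
    [2, 1, 0, 0, 0],
]
scores = first_pc(genotypes)
print(scores)
```

The two groups of samples end up on opposite sides of zero along the first component, which is the separation a PCA scatter plot displays.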

Grouping based on PCA

Samples are grouped according to how the points scatter in the PCA plot. For example, in a figure from a soybean resequencing paper, points of different colors show clearly different distributions, each representing a different subpopulation.

Outlier detection based on PCA

An outlier is a sample that sits far from the other samples in the PCA plot. Its genetic background may be very different from the others, or the sample may have been mixed up, for example a wild-type sample labeled as a domesticated one before sequencing.

Inferring subpopulation evolutionary relationships based on PCA

PCA shows how individuals are distributed relative to one another, which is usually related to geography. For example, because of the spatial distance between Europe and Asia, the two subpopulations differ greatly, and their points lie relatively far apart in the PCA result.

Population structure plot

Phylogenetic trees and PCA can show whether a population is stratified, but they cannot tell how many groups the population should be divided into, nor do they show genetic exchange between groups. Not to worry: this is where the population structure plot comes in.

The population structure plot is essentially a stacked bar chart. Each bar is a sample and shows that sample's ancestry composition; the number of colors in a bar indicates how many ancestral populations the sample derives from.

If a bar has only one color, the individual is of pure ancestry. If a block of adjacent bars has a uniform color, the samples in that block share similar ancestry and should belong to the same subpopulation.

Linkage disequilibrium analysis

The term linkage disequilibrium (LD) combines two words, "linkage" and "disequilibrium". The two form a unity of opposites. From one perspective, LD represents the correlation between variants, and this correlation can be measured with a correlation coefficient.

LD measures whether the genotypes of two molecular markers vary in step, that is, whether they are correlated. If two SNP markers are adjacent, their genotypes also tend to move in lockstep in the population. For example, consider two loci, each with two alleles: A/a and B/b.

If the two loci are linked, certain genotypes tend to be inherited together, so certain haplotypes are more frequent than expected.

LD calculation method

$D'$ and $r^2$ are usually used to express the level of LD between two loci. Suppose two linked loci A and B have alleles A, a and B, b, with frequencies written as $P$ plus a subscript. For example, $P_{Ab}$ is the frequency of haplotype Ab. (There are 4 alleles and 4 haplotypes.)

The difference between the observed haplotype frequency and the expected haplotype frequency is:

$$D = P_{AB} - P_A P_B$$

The correlation coefficient is calculated as:

$$r^2 = \frac{D^2}{P_A P_a P_B P_b}$$

$D'$ is calculated as:

$$D' = \frac{D}{D_{\max}}$$

where $D_{\max} = \min(P_A P_b,\, P_a P_B)$ when $D > 0$ and $D_{\max} = \min(P_A P_B,\, P_a P_b)$ when $D < 0$.
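The three LD measures can be checked with a short worked example. This is a minimal sketch that assumes the haplotype frequency $P_{AB}$ and the allele frequencies $P_A$ and $P_B$ are already known (real data usually require phasing or estimation first). The function name `ld_stats` is illustrative, and it reports the absolute value of $D'$ so that the result lies in [0, 1].

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of haplotype AB
    p_a, p_b: frequencies of alleles A and B (both strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

print(ld_stats(0.25, 0.5, 0.5))  # independent loci
print(ld_stats(0.40, 0.6, 0.5))  # partial LD
print(ld_stats(0.50, 0.5, 0.5))  # complete LD
```

Independent loci give 0 for all three values, while the third case, in which A always travels with B, gives $|D'| = 1$ and $r^2 = 1$.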
LD decay analysis

As the distance between markers increases, the average level of LD decreases. This is called LD decay.

LD decay can be used to compare diversity between populations. In general, LD decays faster in a wild population than in a domesticated one. Whether enough markers were used in a GWAS is judged by comparing the LD decay distance with the average distance between markers.

GWAS (genome-wide association study)

Genome-wide association studies are widely used in medicine and agriculture. Put simply, a GWAS tests genetic markers such as SNPs for association with phenotypic data, detects loci related to the phenotype, and then traces them back to the corresponding genes to study their effect on the phenotype. In medicine, the phenotypes studied are often diseases; in agronomy, they are often agronomic traits of interest, such as rice plant height, yield, and number of grains per panicle.

GWAS mathematical model

The above is only a brief introduction. Please refer to relevant information for specific mathematical models and methods.
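The model figures did not survive in this note, so what follows is only a minimal sketch of the simplest case, a single-marker linear regression $y = \mu + x\beta + e$, and not the specific model used in the sources. Real GWAS tools typically add covariates for population structure and kinship. The function name, the toy data, and the normal approximation used for the p-value are all assumptions of this example.

```python
from math import erfc, sqrt

def single_marker_test(genotypes, phenotypes):
    """Regress phenotype on allele count (0/1/2) for one SNP.

    Returns (slope, t statistic, approximate two-sided p-value).
    Assumes the genotypes vary and the fit is not perfect.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(genotypes, phenotypes))
    std_err = sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err
    # Normal approximation to the t distribution (crude for small n).
    p_value = erfc(abs(t_stat) / sqrt(2))
    return slope, t_stat, p_value

# Hypothetical phenotype (e.g. plant height) and two SNPs.
phenotypes = [10.1, 9.8, 11.2, 10.9, 12.1, 11.8, 10.0, 11.0, 12.2, 11.1]
snp_associated = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
snp_unrelated = [0, 2, 1, 0, 1, 2, 2, 0, 1, 1]
print(single_marker_test(snp_associated, phenotypes))
print(single_marker_test(snp_unrelated, phenotypes))
```

Repeating this test for every marker and plotting $-\log_{10}(p)$ against genomic position gives the raw material for the plots described in the next section.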

GWAS result information

GWAS results are usually presented as two plots: a Manhattan plot and a QQ plot. The QQ plot is generally examined first; only if it looks normal are the results in the Manhattan plot meaningful.

QQ plot

In a normal QQ plot, the points follow the diagonal and turn slightly upward at the tail. If the QQ plot looks abnormal, consider changing the model or algorithm and rerunning the analysis.
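The coordinates of a QQ plot are straightforward to compute: observed $-\log_{10}(p)$ values are sorted and paired with the values expected if all p-values were uniformly distributed. The sketch below is a minimal version with an illustrative function name and made-up p-values; the plotting position $(i + 0.5)/n$ is one common choice, not the only one.

```python
from math import log10

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted in ascending order."""
    n = len(p_values)
    observed = sorted(-log10(p) for p in p_values)
    # Expected quantiles of a uniform distribution on (0, 1).
    expected = sorted(-log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))

# Hypothetical p-values from an association scan.
p_values = [0.8, 0.03, 0.5, 0.0001, 0.3]
points = qq_points(p_values)
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points that stay on the diagonal and lift away from it only at the far right suggest a well-behaved test with a few true signals; points that rise above the diagonal across the whole range suggest inflation, for example from uncorrected population structure.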

Manhattan plot

A Manhattan plot is essentially a scatter plot. Each point represents a site, and the higher the point, the more significant the association. With many points at uneven heights, the plot looks like the crowded skyline of high-rises in Manhattan. (Researchers have an elegant streak.)

The figure shows the GWAS results from cotton resequencing. The key peaks mark the target positions for research, which are then followed up with functional validation experiments.


Finally, thank you for reading this far! This note compiles part of the content published by "Researching Monk Xiaolan Ge" on Jianshu. I hope it helps with your study of population genetics. If you find it useful, please forward it and share it with others.

References:
https://www.jianshu.com/p/807e54278539
https://zhuanlan.zhihu.com/p/541850657
https://www.jianshu.com/p/9793e14c0d08


Origin blog.csdn.net/ZaoJewin/article/details/132994679