Key points of health statistics-preventive medicine

1. Health statistics: The science that applies the basic principles and methods of probability theory and mathematical statistics to collect, organize, and analyze data on the health status of residents and on health services.

2. Homogeneity: In statistics, observed objects that share the same characteristics or attributes are called homogeneous. Otherwise they are called heterogeneous.

3. Variation: The difference between homogeneous things is called variation. [Without individual variation, there would be no statistics!]

4. Population: The entire group of homogeneous observation units determined according to the purpose of the research.

5. Sample: It is a collection of representative partial observation units randomly selected from the population.

6. Sample size: The number of observation units included in the sample.

7. Parameter: An indicator that reflects a characteristic of the population. Features: unknown, unique, and represented by Greek letters, such as the population mean and the population rate.

8. Statistic: An indicator calculated from sample observations. Features: known, not unique (it varies from sample to sample), and represented by Latin letters, such as the sample mean and the sample rate.

9. Variable: A characteristic or attribute that the researcher observes or measures on each observation unit is called a variable.

10. Value of a variable: The observed or measured value of a variable is called the variable value or observed value.

11. Data: A collection of variable values is called data.

12. Quantitative data: The variable values are quantitative and expressed as numerical magnitudes. Characteristics: they generally have units of measurement and are generally continuous.

13. Qualitative data: The observations are qualitative and appear as mutually exclusive categories or attributes. Characteristics: they generally have no units of measurement and are generally discrete. They can be further divided into count data and ordinal data.

14. Count data: Data obtained by grouping observation units according to certain categories or attributes and counting the number of units in each group. They can be further divided into binary (two-category) data and unordered multi-category data.

15. Ordinal data: Data obtained by grouping observation units according to the degree or rank order of a trait or attribute and counting the units in each group. The categories are mutually exclusive and differ in degree.

16. Sampling research: A research method that randomly selects a sample from the population and infers the characteristics of the population from the sample information.

17. Sampling error: The difference, caused by random sampling, between a sample statistic and the population parameter, and the differences among sample statistics.

18. Probability: Probability is a numerical measure of the likelihood of a random event occurring. Usually represented by P. The size is between 0 and 1, that is, 0 ≤ P ≤ 1.

19. Small-probability event: In medical research, an event with probability less than or equal to 0.05 or 0.01 is called a small-probability event.

20. Principle of small probability: A small-probability event is not impossible, but it is considered not to occur in a single trial.

21. Simple random sampling: First number all observation units of the survey population consecutively, then randomly select n (the sample size) numbers using a random number table, statistical software, or drawing lots. The n observation units corresponding to these numbers constitute the research sample.

22. Systematic sampling: Also known as mechanical sampling or equal-interval sampling. All observation units in the population are first arranged in a certain order and divided into n (the sample size) equal parts of m units each. The i-th unit is randomly selected from the first part, and then one unit is taken mechanically from each of parts 2 through n at equal intervals of m to form the sample.

23. Stratified sampling: First select one or several characteristics that strongly influence the observed indicator and divide the population into several strata, so that the measured values vary little within a stratum and greatly between strata. Then randomly select a certain number of observation units from each stratum and combine them to form the sample.

24. Cluster sampling: The population is divided into groups (primary observation units), each composed of secondary observation units. Groups are randomly selected, and all secondary observation units in the selected groups are investigated.
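The four sampling schemes defined above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the fixed seed, and the unit numbers 1 to 100 are hypothetical, and the systematic version assumes the population size is a multiple of n.

```python
import random

def simple_random_sample(units, n, seed=0):
    """Draw n units without replacement; every unit is equally likely."""
    return random.Random(seed).sample(units, n)

def systematic_sample(units, n, seed=0):
    """Split the ordered population into n parts of m units, pick a random
    position i in the first part, then take every m-th unit after it."""
    m = len(units) // n
    i = random.Random(seed).randrange(m)
    return [units[i + k * m] for k in range(n)]

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample inside every stratum and pool the results."""
    rng = random.Random(seed)
    return [u for units in strata.values() for u in rng.sample(units, n_per_stratum)]

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

population = list(range(1, 101))  # hypothetical unit numbers 1..100
print(systematic_sample(population, 10))
```

In the systematic sample every selected number is exactly m = 10 apart, which is why the method is also called equal-interval sampling.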

25. Reliability: The dependability of a measurement tool, that is, the degree to which repeated measurements of the same object with the same tool (such as a questionnaire) are close to their mean.

26. Validity: The extent to which a measurement tool, indicator, or observation reflects the objective reality of what is measured, that is, how close the observed results are to the intended target. It is a test of the effectiveness of the measurement tool (such as a questionnaire).

27. Experimental study: A research method in which the researcher, according to the research purpose, deliberately applies treatment factors to the subjects (humans or animals), controls confounding factors, and observes and summarizes the effects of the treatment factors.

28. Treatment factor (study factor): A factor that the researcher imposes on the subjects.

29. Level: The different degrees, in quantity or intensity, of the same treatment factor.

30. Study subjects: The objects on which the treatment factors act.

31. Experimental effect: The response and outcome after the treatment factors act on the subjects. It is reflected through the selection and observation of indicators.

32. Bias: The systematic error part of the research error is called bias.

33. Average: Expresses the average level or central position of a set of homogeneous quantitative data. Commonly used averages include the arithmetic mean, geometric mean, median, mode, and harmonic mean.

34. Arithmetic mean: Often simply called the mean, it is the sum of a set of observed values divided by the number of observations. x̄ commonly denotes the sample mean and μ the population mean.

35. Geometric mean: Denoted by G, it is the n-th root of the product of n observations, also called the mean of multiples. Applicable to: ① data whose logarithms are symmetrically distributed; ② data in geometric progression, such as blood antibody concentrations. [The observations cannot include 0]

36. Median: Denoted by M, it is the value in the middle position after a set of observed values is sorted from smallest to largest. The median is a positional average.

37. Percentile: Denoted by Px. After a set of observations is sorted from smallest to largest and divided into 100 equal parts, the value at each division point is called a percentile.
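As a small illustration of the averages defined above, the sketch below uses Python's standard `statistics` module on a made-up list of antibody titers (the data are hypothetical). Percentile conventions differ between textbooks and software, so the "inclusive" interpolation used here is only one possible choice.

```python
import statistics

titers = [10, 20, 40, 40, 160]  # hypothetical reciprocal antibody titers

mean = statistics.mean(titers)            # sum of the values divided by n
g = statistics.geometric_mean(titers)     # n-th root of the product (no zeros allowed)
median = statistics.median(titers)        # middle value after sorting

# quantiles(n=100) returns the 99 cut points P1..P99
p = statistics.quantiles(titers, n=100, method="inclusive")
p25, p75 = p[24], p[74]

print(mean, round(g, 2), median, p25, p75)
```

For data in geometric progression like these titers, the geometric mean and the median sit well below the arithmetic mean, which is pulled up by the single large value.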

38. Medical reference value range: Also known as the normal value range, it refers to the range within which anatomical, physiological, biochemical, and other indicators of "normal people" fluctuate.

39. Rate: An indicator of the frequency or intensity with which a certain phenomenon occurs; it will not be greater than 1.

40. Proportion: Indicates the share of each component within a whole, often expressed as a percentage, so it is also called the percentage composition.

41. Relative ratio: The ratio of two indicators A and B. A and B can both be absolute numbers or both be relative numbers, can be two indicators of the same nature, and can have the same or different units.

42. Dynamic series: A series of statistical indicators arranged in chronological order to illustrate how things change and develop over time.

43. Standardization of rates: It is a method of comparing rates under a specified standard composition condition. Significance: When two rates are to be compared, and the difference in the internal composition of the two groups of objects to be compared is enough to affect the conclusion, the rate standardization method can be applied to eliminate this effect and make the two rates comparable.
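One common way to standardize is the direct method, sketched below. The two groups, their age-specific rates, and the choice of the pooled population as the standard are all hypothetical and only meant to show how differing composition distorts the crude rates.

```python
def crude_rate(stratum_rates, sizes):
    """Overall rate of a group, weighted by the group's own composition."""
    return sum(r * n for r, n in zip(stratum_rates, sizes)) / sum(sizes)

def directly_standardized_rate(stratum_rates, standard_sizes):
    """Apply each stratum-specific rate to a common standard composition."""
    expected = sum(r * n for r, n in zip(stratum_rates, standard_sizes))
    return expected / sum(standard_sizes)

# Hypothetical age-specific rates (younger, older) and group sizes
rates_a, sizes_a = [0.10, 0.30], [800, 200]
rates_b, sizes_b = [0.12, 0.32], [200, 800]
standard = [a + b for a, b in zip(sizes_a, sizes_b)]  # pooled groups as the standard

print(crude_rate(rates_a, sizes_a), crude_rate(rates_b, sizes_b))
print(directly_standardized_rate(rates_a, standard),
      directly_standardized_rate(rates_b, standard))
```

The crude rates differ by a factor of two mainly because group B is older; after standardization the two rates differ only by the small gap present in every age stratum.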

44. Standard error (SE): The standard deviation of a sample statistic is usually called the standard error.

45. Confidence interval (CI): An interval that estimates, with a given probability or confidence level (1-α), the range containing the population parameter. This range is called the confidence interval with confidence level 1-α.

46. Poisson distribution: The Poisson distribution is a limiting form of the binomial distribution. When π is very small (< 0.05) and n is very large, the binomial distribution approaches the Poisson distribution.
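The limiting relationship can be checked numerically. The sketch below compares the two probability functions for an arbitrary, hypothetical choice of n = 1000 and π = 0.002, so that the Poisson mean is nπ = 2.

```python
import math

def binomial_pmf(k, n, pi):
    """P(X = k) for a binomial distribution with n trials and success probability pi."""
    return math.comb(n, k) * pi**k * (1 - pi) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, pi = 1000, 0.002  # hypothetical: large n, very small pi
for k in range(5):
    print(k, round(binomial_pmf(k, n, pi), 4), round(poisson_pmf(k, n * pi), 4))
```

With these values the two columns agree to about three decimal places, which is the practical meaning of "approaches" here.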

47. Type Ⅰ error: H0 is actually true, but because of sampling it is rejected. This "rejecting the truth" error is called a type Ⅰ error, and its maximum probability is α.

48. Type Ⅱ error: H0 is actually false, but the hypothesis test does not reject it. This "accepting the false" error is called a type Ⅱ error, and its probability is denoted by β.

49. Power of a test: 1-β is the ability to detect a difference at the α level when the two population parameters really differ, that is, the degree of certainty of correctly concluding that H1 is true.

50. P value: The probability, under random sampling from the population specified by H0, of obtaining a test statistic equal to or greater than the value actually observed.

51. Parametric test: When the type of population distribution is known, estimating or testing its unknown parameters is called parametric statistics or a parametric test.

52. Nonparametric test: A test that does not depend on the type of population distribution and does not infer population parameters, but only uses the sample observations to compare the distributions or locations of the populations. It is therefore also called a distribution-free test.

53. Linear correlation: Also known as simple correlation, a statistical method used to describe the linear relationship between two variables x and y.

54. Linear correlation coefficient: Also known as the Pearson product-moment correlation coefficient, an indicator that quantitatively describes the direction and closeness of the linear relationship between two variables. The population correlation coefficient is denoted by ρ and the sample correlation coefficient by r.

55. Death event: Also known as a failure event or endpoint event, the characteristic event that reflects the failure of a treatment measure.

56. Survival time: The observed duration of survival, which can be recorded in days, weeks, months, years, or other time units and is commonly denoted by t.

57. Complete data: The time elapsed from the starting point of observation to the occurrence of the death event.

58. Censored data: Also known as censored values. The observation of survival time ends not because of the death event but for other reasons, which is called censoring. There are three main reasons: loss to follow-up, withdrawal, and termination of the study.

59. Survival curve: A curve, with observation (follow-up) time on the horizontal axis and survival rate on the vertical axis, that connects the survival rates at each time point to describe the survival process.

60. Life table: A statistical table compiled from the age-specific mortality rates of a specific population to describe the course of human life under those age-specific mortality conditions.

1. Main contents of health statistics:

⑴ Statistical design; ⑵ Statistical analysis; ⑶ Vital statistics; ⑷ Introduction to commonly used statistical analysis software.

2. Statistical analysis:

⑴ Statistical description: statistical description of quantitative data and qualitative data, statistical tables and charts.

⑵ Statistical inference: mainly includes parameter estimation and hypothesis testing.

3. Basic steps of statistical work: ⑴ design; ⑵ collect data; ⑶ organize data; ⑷ analyze data.

4. Parameters vs. statistics: Parameters are indicators that reflect population characteristics; statistics are sample indicators.

5. Data types:

⑴ Quantitative data

⑵ Qualitative data: ① count data; ② ordinal data.

6. Sources of sampling error: individual differences

7. The main medical research methods include: ⑴ survey research; ⑵ experimental research; ⑶ literature research.

8. Commonly used sampling methods: ⑴ Simple random sampling; ⑵ Systematic sampling; ⑶ Stratified sampling; ⑷ Cluster sampling.

9. Sampling error, from smallest to largest: Stratified sampling < Systematic sampling < Simple random sampling < Cluster sampling

10. Basic principles of experimental design:

⑴Control principle; ⑵Random principle; ⑶Repetition principle; ⑷Balance principle.

11. Commonly used experimental design schemes: ⑴ Completely randomized design; ⑵ Paired design; ⑶ Randomized block design; ⑷ Crossover design; ⑸ Factorial design; ⑹ Repeated measures design.

12. Three elements of experimental design: ⑴ treatment factors; ⑵ subjects; ⑶ experimental effects.

13. The median is suitable for: skewed data, data without exact values at one or both ends, and data whose population distribution is unknown.

14. Indicators of variation commonly used to describe dispersion: range, interquartile range, variance, standard deviation, and coefficient of variation.

15. Parameters of normal distribution: ⑴ mean μ; ⑵ standard deviation σ.

① μ is a position parameter. When σ is constant, the larger μ is, the more the curve moves to the right;

σ is a shape parameter. When μ is constant, the larger σ is, the flatter and wider the curve is.

② Standard normal distribution: μ=0, σ=1.

16. Standard deviation vs. standard error

⑴ Standard deviation represents the size of individual differences, describes the frequency distribution of data, and can be used to formulate medical reference value ranges.

⑵ The standard error describes the degree of variation of the sample mean, illustrates the size of the sampling error, and is used for interval estimation and hypothesis testing of the population mean.

17. Conditions for interval estimation of population rate using normal approximation method:

⑴ n is large enough;

⑵ Both p and 1-p are not too small;

⑶ np and n(1-p) are both greater than 5.
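A minimal sketch of the normal approximation interval, assuming the usual form p ± u·√(p(1-p)/n). The function name and the example counts are hypothetical; the check on np and n(1-p) mirrors condition ⑶ above.

```python
import math
from statistics import NormalDist

def rate_ci_normal(x, n, confidence=0.95):
    """Normal-approximation interval p ± u * sqrt(p(1-p)/n) for a population rate."""
    p = x / n
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("np and n(1-p) must both exceed 5 for the normal approximation")
    u = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    se = math.sqrt(p * (1 - p) / n)                 # standard error of the sample rate
    return p - u * se, p + u * se

print(rate_ci_normal(40, 200))  # hypothetical: 40 positives among 200 subjects
```

When the check fails (for example 2 positives among 50), an exact method based on the binomial distribution is the usual alternative.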

18. Characteristics of Poisson distribution: The variance is the same as the mean.

19. Causes of a difference between sample statistics: ⑴ sampling error; ⑵ a real (essential) difference between the populations.

20. Effect of the continuity correction on the χ² test: the χ² value becomes smaller and the P value becomes larger.

21. For statistical comparison of ordinal data: use the rank sum test or Ridit analysis.

22. If data that satisfy the conditions of a parametric test are analyzed with a nonparametric test instead: the power of the test is reduced and the probability of a type Ⅱ error increases.

23. Characteristics of the survival curve:

It is a descending curve. A flatter curve indicates a higher survival rate or longer survival; a steeper curve indicates a lower survival rate or shorter survival.

24. To compare the size of the contributions of the independent variables in multiple regression, use the standardized partial regression coefficient.

25. Sample size estimation:

⑴ The closer the population rate is to 0.5, the larger the required sample size.

⑵ For the same test requirements, the required sample size is smallest when the two groups have equal numbers of cases.

⑶ α can be two-sided or one-sided; β is always one-sided.

26. Factors affecting test power:

⑴ Sample size; ⑵ the size of the real difference; ⑶ the size of the variation between individuals; ⑷ the α value.

27. How to increase test power:

⑴ Increase α; ⑵ increase the sample size.

Special topic: the normal distribution

1. The normal curve is highest at the mean above the horizontal axis, gradually decreases to both sides, and is symmetrical with the mean as the center, but its two ends never intersect with the horizontal axis, forming a bell-shaped curve.

2. The normal distribution has two parameters, the mean and the standard deviation. μ is a position parameter. When σ is constant, the larger μ is, the more the curve moves to the right; σ is a shape parameter. When μ is constant, the larger σ is, the flatter the curve is.

3. The area under the normal curve follows certain rules: the area under the curve within an interval represents the proportion (frequency) of observations falling in that interval out of all observations, or the probability that an observation falls within that interval:

① The area between the normal curve and the horizontal axis is always equal to 1 or 100%;

② The normal distribution is a symmetric distribution, and the area on both sides of the symmetry axis is 50%;

③ The area of the interval (μ-σ, μ+σ) is 68.27%;

The area of the interval (μ-1.96σ, μ+1.96σ) is 95%;

The area of the interval (μ-2.58σ, μ+2.58σ) is 99%.
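These areas can be reproduced with the standard library's `NormalDist`. The helper name `area_within` is hypothetical; the result does not depend on the particular μ and σ chosen.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    d = NormalDist(mu, sigma)
    return d.cdf(mu + k * sigma) - d.cdf(mu - k * sigma)

for k in (1, 1.96, 2.58):
    print(k, round(area_within(k), 4))
```

The 99% figure is a rounding: 2.58 standard deviations cover slightly more than 99% of the area.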

[Principles, common methods and applicable conditions for establishing medical reference values]

1. Principles: ① Select a sufficiently large sample of "normal people"; ② choose an appropriate percentage range according to the research purpose and intended use, such as 80%, 90%, 95%, or 99%, with 95% the most common; ③ decide on one-sided or two-sided limits based on professional knowledge; ④ choose an appropriate calculation method based on the distribution of the data.

2. Commonly used methods and applicable conditions:

① Normal distribution method: suitable for data with normal or approximately normal distribution

Two-sided limits (95%): x̄ ± 1.96s

One-sided upper limit (95%): x̄ + 1.645s. One-sided lower limit (95%): x̄ - 1.645s

② Percentile method: often used for skewed data and for data without exact values at one or both ends.

The two-sided 95% reference value range is: P2.5~P97.5

The one-sided upper limit is P95; the one-sided lower limit is P5
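Both methods can be sketched with the standard library. The version below assumes a two-sided range, uses the mean ± u·s form for the normal distribution method, and uses "inclusive" interpolation for the percentiles, which is one of several conventions. The function names are hypothetical.

```python
import statistics

def reference_range_normal(values, coverage=0.95):
    """Two-sided range mean ± u*s, for normal or approximately normal data."""
    u = statistics.NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95%
    m, s = statistics.mean(values), statistics.stdev(values)
    return m - u * s, m + u * s

def reference_range_percentile(values):
    """Two-sided 95% range P2.5 ~ P97.5, for skewed data."""
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    return cuts[24], cuts[974]  # the 25/1000 and 975/1000 cut points

values = list(range(1, 1002))  # hypothetical measurements 1..1001
print(reference_range_percentile(values))
```

The percentile method makes no assumption about the shape of the distribution but needs a larger sample to estimate the extreme percentiles stably.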

Type Ⅰ vs. type Ⅱ errors

1. To focus on reducing type Ⅰ errors, choose a small α, such as 0.01; to focus on reducing type Ⅱ errors, choose a larger α, such as 0.2.

2. For a fixed sample size, the larger α is, the smaller the type Ⅱ error probability β and the greater the power 1-β.

3. When P≤α and H0 is rejected, only a type Ⅰ error can be made; when P>α and H0 is not rejected, only a type Ⅱ error can be made.

4. If the two-sided test gives P≤α, the one-sided test must also give P≤α; if the one-sided test gives P>α, the two-sided test must also give P>α.

5. One-sided tests are more prone to type Ⅰ errors and two-sided tests to type Ⅱ errors; a one-sided test has higher power than a two-sided test.
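The trade-off between α, β, and sample size can be made concrete with a one-sample z test. This is a simplified sketch: it assumes a known σ and a true shift `delta` above the hypothesized mean, and for the two-sided case it ignores the negligible chance of rejecting in the wrong direction. All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05, two_sided=True):
    """Approximate power 1-β of a one-sample z test when the true mean
    exceeds the hypothesized mean by delta (delta > 0, sigma known)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    # Reject when the standardized sample mean exceeds the critical value
    return 1 - z.cdf(crit - delta * math.sqrt(n) / sigma)

print(round(z_test_power(0.5, 1.0, 25), 3))                   # two-sided, α = 0.05
print(round(z_test_power(0.5, 1.0, 25, two_sided=False), 3))  # one-sided, α = 0.05
print(round(z_test_power(0.5, 1.0, 25, alpha=0.2), 3))        # larger α
print(round(z_test_power(0.5, 1.0, 50), 3))                   # larger n
```

Each of the last three calls gives higher power than the first, matching the statements about α, one-sided tests, and sample size.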

Hypothesis testing

Prerequisites

① The population is homogeneous; ② the sample is representative and the groups are comparable.

Basic steps

① State the hypotheses and set the significance level;

② Choose the test method and calculate the test statistic;

③ Determine the P value and draw a conclusion.

Precautions

1. There should be a rigorous research design:

① Each research individual in the population should be homogeneous; ② The sample data should be representative; ③ The compared groups should be comparable.

2. Correctly understand the meaning of α level and P value

3. Correctly understand the statistical significance of conclusions

4. The conclusion of hypothesis testing cannot be absolute

t test

Definition

A hypothesis testing method for quantitative (measurement) data that is based on the t distribution and uses the t value as the test statistic.

Basic idea

Assuming that H0 is true, use the t distribution to find the probability P of obtaining, by random sampling, a t value at least as extreme as the one computed from the current sample. Compare P with the preset significance level α to decide whether to reject H0.

Application conditions

① Independence; ② Normality (can be confirmed by normality test); ③ Homogeneity of variances (can be confirmed by homogeneity of variance test).

Main uses

① Comparing a single sample mean with a known population mean;

② Comparing the mean of the differences in paired-design data with a population mean of 0;

③ Comparing the means of two samples in a group (completely randomized) design.
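The three uses correspond to three t statistics, sketched below with the standard library only. The functions return the t value and its degrees of freedom; turning these into a P value requires a t table or a statistics package, which is left out here. The function names are hypothetical, and the two-sample version assumes equal variances (pooled variance).

```python
import math
import statistics

def one_sample_t(values, mu0):
    """t = (sample mean - mu0) / standard error, with n - 1 degrees of freedom."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu0) / se, n - 1

def paired_t(before, after):
    """Paired design: a one-sample t test of the differences against 0."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, 0.0)

def two_sample_t(x, y):
    """Two independent groups with a pooled variance (assumes equal variances)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se, n1 + n2 - 2

print(one_sample_t([2, 4, 6], 2))
print(paired_t([1, 2, 3], [2, 4, 6]))
print(two_sample_t([1, 2, 3], [2, 3, 4]))
```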

Analysis of variance

Definition

A hypothesis testing method for quantitative (measurement) data that is based on analyzing the variation in the data and uses the F value as the test statistic.

Basic idea

Decompose the total variation between all observations into two or more components according to the design type, and make statistical inferences with the help of the F distribution by comparing the mean squares of different sources of variation.

Application conditions

① Independence; ② Normality (can be confirmed by normality test); ③ Homogeneity of variances (can be confirmed by homogeneity of variance test).

Main uses

Comparing the means of multiple (three or more) samples
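For a completely randomized design, the decomposition of variation can be sketched as follows. The function name and the three small groups are hypothetical; the P value would come from the F distribution with the returned degrees of freedom.

```python
def one_way_anova_f(groups):
    """Split total variation into between-group and within-group parts and
    return (F, df_between, df_within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between  # mean square between groups
    ms_within = ss_within / df_within     # mean square within groups
    return ms_between / ms_within, df_between, df_within

print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

If the group means were equal in the population, the two mean squares would estimate the same variance and F would be close to 1.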

χ² test

Definition

A hypothesis testing method for count data that is based on the χ² distribution and uses the χ² value as the test statistic.

Basic idea

The χ² value reflects how closely the actual frequencies (A) agree with the theoretical frequencies (T). If H0 is true, A and T should not differ greatly; if they differ greatly, H0 is unlikely to be true.

Application conditions

① The observations are independent; ② the data are count (categorical) data; ③ the sample size and theoretical frequencies are large enough (see the notes on 2×2 and R×C tables below).

Main uses

① Inferring whether two or more population rates (or proportions) differ;

② Testing whether two categorical variables are associated; ③ Testing the goodness of fit of a frequency distribution.

Notes on the χ² test for a 2×2 table

① When n ≥ 40 and every T ≥ 5, use the basic formula or the special 2×2 formula to calculate the χ² value;

② When n ≥ 40 but some 1 ≤ T < 5, use the continuity-corrected formula to calculate the χ² value;

③ When n < 40 or any T < 1, the χ² value should not be calculated; use Fisher's exact probability method to compute the probability directly.
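The decision rules above can be written as a short function. This is a sketch: the cell labels a, b, c, d, the function name, and the return format are hypothetical, the special 2×2 formula (ad-bc)²n / [(a+b)(c+d)(a+c)(b+d)] is used, and Fisher's exact method itself is not implemented. The P value uses the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi2_2x2(a, b, c, d):
    """Choose the 2×2 method from n and the smallest theoretical frequency T,
    then return (method, chi2, P); chi2 and P are None when Fisher is needed."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    t_min = min(r * k / n for r in rows for k in cols)
    if n < 40 or t_min < 1:
        return "fisher_exact", None, None
    corrected = t_min < 5
    diff = abs(a * d - b * c) - (n / 2 if corrected else 0)
    diff = max(diff, 0.0)  # the correction must not push the difference below 0
    chi2 = diff**2 * n / (rows[0] * rows[1] * cols[0] * cols[1])
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of χ² with 1 degree of freedom
    return ("corrected" if corrected else "basic"), chi2, p

print(chi2_2x2(20, 30, 30, 20))  # all T = 25: basic formula
print(chi2_2x2(2, 18, 10, 30))   # smallest T = 4: corrected formula
print(chi2_2x2(1, 9, 5, 5))      # n = 20: Fisher's exact method
```

For the second table the uncorrected value would be 1.875, while the corrected value is about 1.05, so the correction lowers χ² and raises P.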

Notes on the χ² test for an R×C table

① No more than 1/5 of the cells may have a theoretical frequency of at least 1 but less than 5, and no cell may have a theoretical frequency less than 1;

② If more than 1/5 of the cells have a theoretical frequency of at least 1 but less than 5, or if any cell has a theoretical frequency less than 1, the sample size can be increased.

[Advantages and disadvantages of nonparametric tests]

Advantages: ① Applicable to data of any distribution;

② Not restricted by the requirement of equal population variances;

③ Can be used for the statistical analysis of ordinal data;

④ Some problems have no suitable parametric test, and a nonparametric test can handle them.

Disadvantages: ① Because they do not make full use of the information in the original data, their power is lower;

② The results are somewhat approximate.

Wilcoxon signed-rank test

Definition

Also known as the signed rank sum test or the Wilcoxon paired method, it is a nonparametric test (it does not depend on the type of population distribution and does not infer population parameters, but only uses the sample observations to infer whether the population distributions or their locations differ).

Basic idea

If H0 is true, the population distribution of the differences (paired differences, or differences between each sample value and the known population median M0) is symmetric with a median of 0, and T+ and T- should both be close to n(n+1)/4. If the positive and negative rank sums differ greatly, there is reason to doubt H0.

Applicable conditions

① Data that do not meet the conditions for a parametric test and cannot be made to meet them by variable transformation;

② Data that cannot be measured precisely, such as data with uncertain values at one or both ends;

③ Data whose distribution type is unknown.

Main uses

① Inferring whether the population median of the differences in paired-design data is 0;

② Inferring whether the median of the population from which the sample comes equals a known population median.
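The rank sums T+ and T- can be computed as sketched below. The handling shown (zero differences dropped, tied absolute differences given their mean rank) is a common convention assumed here, and the function name and data are hypothetical. Deciding significance still requires a T critical-value table or a normal approximation.

```python
def signed_rank_sums(differences):
    """Return (T+, T-, n): zero differences are dropped and tied absolute
    differences share the mean of the ranks they occupy."""
    d = [x for x in differences if x != 0]
    order = sorted(d, key=abs)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and abs(order[j]) == abs(order[i]):
            j += 1
        ranks[abs(order[i])] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    t_plus = sum(ranks[abs(x)] for x in d if x > 0)
    t_minus = sum(ranks[abs(x)] for x in d if x < 0)
    return t_plus, t_minus, len(d)

t_plus, t_minus, n = signed_rank_sums([3, -1, 2, 0, -2, 5])
print(t_plus, t_minus, n, n * (n + 1) / 4)
```

T+ and T- always add up to n(n+1)/2, so under H0 each is expected to be near half of that, n(n+1)/4.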

Rank correlation: scope of application

1. Data that do not follow a bivariate normal distribution;

2. Data whose population distribution type is unknown;

3. Ordinal data.

Linear regression analysis

Definition

Linear regression is a statistical analysis method that studies the linear dependence between two continuous variables. The linear regression equation is used to describe the quantitative relationship between the changes in the two variables, which belongs to the category of bivariate analysis.

Prerequisites

① Independence; ② Normality (can be confirmed by normality test); ③ Homogeneity of variances (can be confirmed by homogeneity of variance test).

Application conditions

①The changing trend of the two variables is a linear trend;

②The dependent variable y is a random variable from a normal distribution, and x can be a regularly changing or artificially selected value (type I regression), or it can be a random variable (type II regression);

③For Type I regression, when x takes different values, the distributions of y are all normal distributions, and the variances of these distributions are equal; for Type II regression, x and y are required to obey the bivariate normal distribution.

Precautions

①Regression analysis must have practical significance;

② For linear regression analysis data, the response variable y is generally required to be a random variable from a normal population;

③When performing regression analysis, a scatter plot should be drawn first;

④ Deal with outliers (identified from the scatter plot);

⑤ Avoid extrapolating beyond the observed range of the data.

[Differences and connections between linear regression and linear correlation analysis]

Differences

Data requirements

Linear correlation: both variables are required to follow a bivariate normal distribution.

Linear regression: the dependent variable y is required to follow a normal distribution, and the independent variable x is a variable that can be accurately measured or controlled.

Statistical meaning

Linear correlation: reflects the association between two variables. The relationship is mutual and symmetric, and it is not necessarily causal.

Linear regression: reflects the dependence between two variables, which are distinguished as independent and dependent. Generally the "cause", or the variable that is easier to measure and has smaller variation, is designated the independent variable. The dependence may be a causal relationship or a subordinate relationship.

Purpose of analysis

Linear correlation: use a statistical index to express the closeness and direction of the linear relationship between two variables.

Linear regression: express quantitatively, as a function, the relationship between the independent variable and the dependent variable.

Connections

The direction of the relationship is consistent: for the same data, r and b have the same sign.

The hypothesis tests are equivalent: for the same sample, t_r = t_b.

The values of r and b can be converted into each other (see the textbook).

Regression explains correlation: the coefficient of determination in regression analysis is numerically equal to the square of the correlation coefficient, that is, r².
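These connections can be checked numerically on a small made-up data set. The sketch below uses the usual sums of squares and products, written here as lxx, lyy, and lxy; the function name, the returned dictionary, and the data are hypothetical.

```python
import math
import statistics

def correlation_and_regression(x, y):
    """Compute r, the regression line y = a + b*x, R², and the two t statistics."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    lxx = sum((xi - mx) ** 2 for xi in x)
    lyy = sum((yi - my) ** 2 for yi in y)
    lxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r = lxy / math.sqrt(lxx * lyy)   # correlation coefficient
    b = lxy / lxx                    # regression coefficient (slope)
    a = my - b * mx                  # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / lyy     # coefficient of determination
    t_r = r * math.sqrt((n - 2) / (1 - r**2))
    t_b = b / math.sqrt(ss_res / (n - 2) / lxx)
    return {"r": r, "b": b, "a": a, "r_squared": r_squared, "t_r": t_r, "t_b": t_b}

res = correlation_and_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({k: round(v, 4) for k, v in res.items()})
```

In the output r and b share the same sign, r_squared equals r squared, and t_r equals t_b. One form of the conversion between the two coefficients is b = r·√(lyy/lxx).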

Commonly used multivariate statistical methods

1. Quantitative dependent variable following a normal distribution: multiple linear regression analysis;

2. Categorical dependent variable: logistic regression analysis;

3. Time-to-event variable (including censored data): Cox regression analysis;

4. Exploring how to classify the data: cluster analysis;

5. The classes are already defined, and we want to assign observations to them using certain indicators: discriminant analysis;

6. There are many indicators to study, and a few composite variables are needed to reflect the information in the data: principal component analysis and factor analysis.

Origin blog.csdn.net/qq_67692062/article/details/134814713