【Basic theory】Basic concepts of descriptive statistics

1. Description

Statistics is a branch of mathematics concerned with the collection, organization, analysis and interpretation of data. This blog aims to answer the following questions:

1. What is descriptive statistics?
2. Types of descriptive statistics?
3. Measures of central tendency (mean, median, mode)
4. Spread/dispersion measures (standard deviation, mean deviation, variance, percentile, quartile, interquartile range)
5. What is skewness?
6. What is kurtosis?
7. What is correlation?

Today, let us understand descriptive statistics once and for all. Let's start!

2. What is descriptive statistics?

        Descriptive statistics involves summarizing and organizing data so that they are easier to understand. Unlike inferential statistics, descriptive statistics describe the data at hand and do not attempt to make inferences from a sample to the entire population. Here, we generally describe the data in the sample. This usually means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.

2.1 Types of descriptive statistics?

        Descriptive statistics fall into two categories: measures of central tendency and measures of variability (spread). Note that this two-way split is a simplification.

2.2 Measures of Central Tendency

        Central tendency is the idea that there is one number that best summarizes an entire set of measurements, and that this number is in some way the "center" of that set.

2.2.1 Mean/Average

        The mean, or average, is the central tendency of the data, the number around which the entire data spreads out. In a way, it is a single number that estimates the value of the entire dataset.

        Let's calculate the mean of a dataset with 8 integers.
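
The dataset for this example is not shown in the text, so the sketch below assumes one. The values 12, 51, 67, 67 and 99 are implied by figures quoted later (the range 99 - 12, the middle values 51 and 67, the mode 67 occurring twice); 24, 41 and 85 are assumptions chosen so that the standard deviation and mean deviation quoted later (29.62 and 23.75) are reproduced.

```python
# Assumed dataset: reconstructed from figures quoted in the text, not given in it.
data = [12, 24, 41, 51, 67, 67, 85, 99]

# The mean is the sum of the values divided by how many there are.
mean = sum(data) / len(data)
print(mean)  # 55.75
```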

 

2.2.2 Median

        The median is the value that divides the data into 2 equal parts, that is, when the data are sorted in ascending or descending order, there are the same number of items on the right as on the left.

        Note: Sorting the data in descending order does not affect the median, but quartiles read off in that order would give a negative IQR. We will discuss IQR later in this blog.

        If the number of items is odd, the median will be the middle item. If the number of items is even, the median will be the average of the middle 2 items.

        The median is 59, which divides the set of numbers into two equal parts. Since the set has an even number of items, the median is the average of the two middle numbers, 51 and 67.
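
The odd/even rule above can be sketched in a few lines. The dataset is an assumption (the original figure is missing); its two middle values, 51 and 67, are the ones quoted in the text.

```python
import statistics

# Assumed dataset: only 12, 51, 67, 67 and 99 are implied by the text.
data = [12, 24, 41, 51, 67, 67, 85, 99]

ordered = sorted(data)
n = len(ordered)
mid = n // 2

if n % 2 == 1:
    # Odd count: the median is the single middle item.
    median = ordered[mid]
else:
    # Even count: the median is the average of the two middle items.
    median = (ordered[mid - 1] + ordered[mid]) / 2

print(median)  # 59.0
```

The standard library function `statistics.median(data)` applies the same rule and returns the same value.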

Note: When the values are in an arithmetic progression (the difference between consecutive terms is constant; here it is 2), the median is always equal to the mean.

        The mean of these 5 numbers is 6, and so is the median.

2.2.3 Mode

        The mode is the item that occurs most often in the dataset, i.e., the item with the highest frequency.

        In this dataset, the mode is 67 because it occurs more often than any other value, i.e. twice.

        But it is possible to have a dataset with no mode at all, because all values occur the same number of times. A dataset is bimodal if two values occur equally often and more often than the others. A dataset is trimodal if three values occur equally often and more often than the others, and in general a dataset with several modes is multimodal.
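
A minimal sketch of these cases, using a hypothetical helper named `modes` (not from the original post). It returns an empty list when every distinct value is tied, matching the "no mode" case described above. The first dataset is an assumption reconstructed from the figures in the text.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when every value is tied."""
    counts = Counter(values)
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    # If all distinct values share the top count, there is no mode.
    if len(winners) == len(counts) and len(counts) > 1:
        return []
    return winners

data = [12, 24, 41, 51, 67, 67, 85, 99]  # assumed dataset
print(modes(data))             # [67]   -> one mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2] -> bimodal
print(modes([1, 2, 3]))        # []     -> no mode
```

Note that `statistics.multimode` in the standard library behaves differently in the last case: it returns all values when they are tied.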

2.3 Measures of Spread/Dispersion

        Measures of spread describe how much the data vary.

2.3.1 Standard deviation

        Standard deviation is a measure of the typical distance between each value and the mean. That is, it shows how far the data are spread out from the mean. A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread over a wider range of values.

        In some cases we have to choose between sample or population standard deviation.

        When we are asked to find the SD of only a part of a population, we use the sample standard deviation:

        s = sqrt( Σ (xᵢ − x̅)² / (n − 1) )

        where x̅ is the mean of the sample and n is the sample size.

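        But when we have to deal with the whole population, we use the population standard deviation:

        σ = sqrt( Σ (xᵢ − μ)² / N )

        where μ is the mean of the population and N is the population size.

        Since a sample is part of the population, one might expect the two formulas to be the same, but they are not: the sample formula divides by n − 1 instead of N (Bessel's correction) to compensate for the fact that x̅ is itself estimated from the sample.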

        As you know, in descriptive statistics we usually deal with data available in a sample, not in a population. So if we take the previous dataset and substitute the values in the sample formula, the answer is 29.62.
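
The difference between the two formulas is only the divisor, which the sketch below makes explicit. The dataset is an assumption: it is consistent with the figures quoted in the text (including 29.62), but the original values are not shown.

```python
import math
import statistics

# Assumed dataset: chosen to reproduce the figures quoted in the text.
data = [12, 24, 41, 51, 67, 67, 85, 99]

n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

# Sample SD divides by n - 1; population SD divides by n.
sample_sd = math.sqrt(squared_deviations / (n - 1))
population_sd = math.sqrt(squared_deviations / n)

print(round(sample_sd, 2))  # 29.62
print(population_sd < sample_sd)  # True: dividing by n gives the smaller value
```

The standard library offers the same two calculations as `statistics.stdev` (sample) and `statistics.pstdev` (population).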

2.3.2 Mean Deviation/Mean Absolute Deviation

        It is the average of the absolute differences between each value in a set and the mean of all values in that set.

        So if we use the previous dataset and substitute the values, the answer is 23.75.
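
As a sketch of the calculation, again on an assumed dataset that reproduces the quoted result (the original values are not shown in the text):

```python
# Assumed dataset: chosen to reproduce the figures quoted in the text.
data = [12, 24, 41, 51, 67, 67, 85, 99]

mean = sum(data) / len(data)

# Average of the absolute distances from the mean.
mean_absolute_deviation = sum(abs(x - mean) for x in data) / len(data)
print(mean_absolute_deviation)  # 23.75
```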

2.3.3 Variance

        The variance is the average of the squared distances between each value and the mean. That is, it is the square of the standard deviation.

        The answer is 877.34.
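
A short sketch of the sample variance on the same assumed dataset (the original values are not shown in the text). Computed directly, the result is about 877.36; squaring the already rounded standard deviation 29.62 gives 877.34, which may explain the figure above.

```python
import statistics

# Assumed dataset: chosen to reproduce the figures quoted in the text.
data = [12, 24, 41, 51, 67, 67, 85, 99]

n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

print(round(variance, 2))
print(round(29.62 ** 2, 2))  # square of the rounded standard deviation
```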

2.3.4 Range

        Range is one of the simplest descriptive statistical techniques. It is the difference between the lowest value and the highest value.

        The range is 99–12 = 87

2.3.5 Percentile

        A percentile is a way of representing a value's position in a data set. To calculate percentiles, the values in the dataset should always be in ascending order.

        In the dataset above, 4 of the 8 values are less than the median 59. So we can say that 59 is the 50th percentile, because 50% of the total items are less than 59. In general, if k is the nth percentile, it means that n% of the total items are less than k.
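
The "n% of the items are less than k" reading can be sketched with a hypothetical helper, `percent_below` (not from the original post), on an assumed dataset consistent with the figures in the text. Other percentile conventions exist; this one follows the definition given above.

```python
def percent_below(values, k):
    """Percentage of items in values that are strictly less than k."""
    below = sum(1 for x in values if x < k)
    return 100 * below / len(values)

# Assumed dataset: only some of these values are implied by the text.
data = [12, 24, 41, 51, 67, 67, 85, 99]

print(percent_below(data, 59))  # 50.0 -> 59 is the 50th percentile
```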

2.3.6 Quartile

        In statistics and probability, quartiles are values that divide data into quarters, provided the data are sorted in ascending order.

        Quartiles [Image 14] (Image courtesy: IQR | Intro to Statistical Methods )

        There are three quartile values. The first quartile is the 25th percentile, the second quartile is the 50th percentile and the third quartile is the 75th percentile. The second quartile (Q2) is the median of the entire data. The first quartile (Q1) is the median of the lower half of the data. The third quartile (Q3) is the median of the upper half of the data.

So here,

Q2 = 67: is the 50th percentile of the entire data, which is the median.

Q1 = 41: is the 25th percentile of the data.

Q3 = 85: is the 75th percentile of the data.

Interquartile Range (IQR)  = Q3 - Q1 = 85 - 41 = 44

Note: Quartiles are defined on data sorted in ascending order, so the IQR (Q3 - Q1) is never negative. If you read the quartiles off data sorted in descending order, the same subtraction gives -44: the magnitude is the same and only the sign differs. That is why we prefer ascending order.
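
The median-of-halves method can be sketched as follows. The figure with the original data is missing, so the 11 values below are purely hypothetical, chosen only so that the quartiles match the 41, 67 and 85 quoted above. For an odd number of items, this sketch leaves the median itself out of both halves; other quartile conventions exist and can give slightly different values.

```python
import statistics

# Hypothetical dataset chosen to reproduce Q1 = 41, Q2 = 67, Q3 = 85.
data = [12, 24, 41, 51, 55, 67, 67, 79, 85, 99, 110]

ordered = sorted(data)  # quartiles are defined on ascending order
n = len(ordered)
half = n // 2

lower = ordered[:half]
# With an odd count, skip the middle item so the halves have equal size.
upper = ordered[half + 1:] if n % 2 else ordered[half:]

q1 = statistics.median(lower)
q2 = statistics.median(ordered)
q3 = statistics.median(upper)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 41 67 85 44
```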

3. Skewness

3.1 Definition of Skewness

        Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Skewness values can be positive, zero, negative, or undefined.

        In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each other.

        When the distribution is skewed to the left, the tail on the left side of the curve is longer than the tail on the right side, and the mean is smaller than the mode. This condition is also known as negative skewness.

        When the distribution is skewed to the right, the right tail of the curve is longer than the left tail, and the mean is greater than the mode. This condition is also known as positive skewness.

        Skewness [Picture 16] (Image courtesy of: Skewness - Clojure for Data Science [Book] )

3.2 How to calculate the skewness coefficient?

        To calculate the skewness coefficient of a sample, there are two methods:

        1] Pearson's first coefficient of skewness (mode skewness): Sk1 = (mean − mode) / standard deviation

        2] Pearson's second coefficient of skewness (median skewness): Sk2 = 3 × (mean − median) / standard deviation

        Interpretation:

  • The direction of skewness is given by the sign. Zero means no skewness at all.
  • Negative values indicate that the distribution is negatively skewed. Positive values indicate that the distribution is positively skewed.
  • This coefficient compares the sample distribution to a normal distribution. The larger the absolute value, the more the distribution differs from the normal distribution.

Example problem: Use Pearson's coefficients #1 and #2 to find the skewness of data with the following characteristics:

  • Mean = 50.
  • Median = 56.
  • Mode = 60.
  • Standard Deviation = 8.5.

Pearson's first skewness coefficient: (50 − 60) / 8.5 ≈ -1.18.

Pearson's second skewness coefficient: 3 × (50 − 56) / 8.5 ≈ -2.118.
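
The example can be checked with a short sketch. The function names `pearson_first` and `pearson_second` are illustrative, not from the original post; the formulas are the standard ones, (mean − mode) / SD and 3 × (mean − median) / SD.

```python
def pearson_first(mean, mode, sd):
    """Pearson's first (mode) skewness coefficient."""
    return (mean - mode) / sd

def pearson_second(mean, median, sd):
    """Pearson's second (median) skewness coefficient."""
    return 3 * (mean - median) / sd

sk1 = pearson_first(mean=50, mode=60, sd=8.5)
sk2 = pearson_second(mean=50, median=56, sd=8.5)

print(sk1)  # about -1.176
print(sk2)  # about -2.1176
```

Rounded, these are about -1.18 and -2.118; truncating instead of rounding gives -1.17 and -2.117. Both are negative, so the data are negatively skewed.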

Note: Pearson's first skewness coefficient uses the mode. Therefore, if the mode occurs only a few times, it will not give a stable measure of central tendency. For example, the mode is 4 in both of the following sets of data:

1, 2, 3, 4, 4, 5, 6, 7, 8, 9.

In the first set of data, the mode appears only twice. Therefore, it is not a good idea to use Pearson's first skewness coefficient. But in the second set,

1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10, 12, 12, 13.

the mode 4 occurs 8 times. So Pearson's first skewness coefficient might give you a reasonable result.

4. Kurtosis

4.1 Definition of kurtosis

        The exact interpretation of the measurement of kurtosis was once disputed, but it is now resolved. It's about the presence of outliers. Kurtosis is a measure of whether data is heavy-tailed (lots of outliers) or light-tailed (lack of outliers) relative to a normal distribution.

Kurtosis [Image 19] (Image courtesy of: MVP Programs Help — MVP Programs Help Files )

4.2 Three types of kurtosis

4.2.1 Mesokurtic

        A mesokurtic distribution has kurtosis similar to that of the normal distribution, i.e. its excess kurtosis is zero.

4.2.2 Leptokurtic

        A leptokurtic distribution has greater kurtosis than a mesokurtic distribution. Its tails are thick and heavy. If the distribution curve is more peaked than the mesokurtic curve, it is called a leptokurtic curve.

4.2.3 Platykurtic

        A platykurtic distribution has less kurtosis than a mesokurtic distribution. Its tails are thin. If the peak of the distribution curve is lower than that of the mesokurtic curve, it is called a platykurtic curve.

The key difference between skewness and kurtosis is that skewness refers to the degree of asymmetry, whereas kurtosis refers to the degree of presence of outliers in a distribution.
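
To make the three types concrete, here is a minimal sketch of excess kurtosis computed from moments (fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores zero). It uses the population form without any small-sample correction, which is an assumption of this sketch; the post does not give a formula. The two small datasets are made up for illustration.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (zero for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                         # evenly spread, no outliers
with_outlier = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # one extreme value

print(excess_kurtosis(flat) < 0)          # True: platykurtic
print(excess_kurtosis(with_outlier) > 0)  # True: leptokurtic
```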

5. Correlation

        Correlation is a statistical technique that shows whether pairs of variables are related and how strong the relationship is.

Correlation [Image 20] (Image courtesy of: Correlation in Statistics: Correlation Analysis Explained - Statistics How To )

        The main result of correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the stronger the correlation between the two variables.

        If r is close to 0, it means that there is no linear relationship between the variables. If r is positive, it means that when one variable gets larger, the other variable gets larger too. If r is negative, it means that as one gets larger, the other gets smaller (often called a "negative" correlation).
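
A minimal sketch of the Pearson correlation coefficient, written out by hand so the calculation is visible. The function name `pearson_r` and the example values are illustrative, not from the original post.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like numerator and the two spread terms of the denominator.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # close to +1: y grows with x
print(pearson_r(x, [10, 8, 6, 4, 2]))  # close to -1: y shrinks as x grows
```

On Python 3.10 and later, `statistics.correlation` in the standard library computes the same coefficient.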

        I hope I have given you an idea of what exactly descriptive statistics are. This is a basic overview of some basic statistical techniques that can help you understand data science in the long run.


Origin blog.csdn.net/gongdiwudu/article/details/131746889