Study notes are for study use only.
Table of contents
1- What is tidy data?
2- Wide table to long table
3- Long table to wide table
1- What is tidy data?
According to Hadley Wickham, tidy data has the following characteristics:
- Each variable forms a column, that is, values measuring the same attribute are stored in one column;
- Each observation forms a row;
- Each value, that is, one variable for one observation, forms a cell.
Data that does not meet these conditions is called dirty or untidy data. It often shows one or more of the following problems:
- Column names are values rather than variable names;
- Multiple variables are stored in one column;
- Variables are stored in both rows and columns;
- Multiple types of observational units are stored in the same table;
- A single observational unit is spread across multiple tables.
Data reshaping: The functions in the tidyverse packages operate on tidy data frames, so untidy data must be made tidy first. This process is called data reshaping.
Data reshaping includes converting between wide and long tables, splitting and merging columns, and rectangling. Wide/long conversion uses the pivot_longer() and pivot_wider() functions.
Untidy data examples and descriptions:
In the first example, male and female are both values of gender, so the two columns should be combined into a single variable. The table violates the first tidy data requirement, that each variable forms a column.
In the second example, two variables, age and weight, are stored in one column. A delimiter separates them, so a human can read the values using common sense, but the computer cannot: it treats what should be two numeric columns as one string column. This violates the third tidy data rule, that each value forms its own cell.
In the third example, none of the three tidy data requirements is met.
The key to making data tidy is to learn to distinguish variables, observations, and values.
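The "two variables in one column" problem described above can be sketched as follows. This is only an illustration: the data frame `people`, its column names, and the "/" delimiter are hypothetical stand-ins, and tidyr's separate() is used as one possible way to split the column.

```r
library(tidyr)

# Hypothetical untidy data: age and weight share one column, separated by "/"
people <- tibble::tibble(
  name = c("Ann", "Bob"),
  age_weight = c("25/60.5", "31/72")
)

# Split the combined column in two; convert = TRUE turns the pieces into numbers
tidy_people <- separate(
  people,
  col = age_weight,
  into = c("age", "weight"),
  sep = "/",
  convert = TRUE
)
tidy_people
```

After the split, age and weight are separate numeric columns, so each cell holds exactly one value.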
2- Wide table to long table
Wide table: a data set in which values that belong in cells have been used as column names, which makes the table relatively wide. For example, male and female should be cell values of a gender column, but in a wide table they appear as the names of two separate columns.
Long table: a data set in which such categories are stored as the values of a categorical variable, so the table has fewer columns and more rows.
Use the pivot_longer() function in the tidyr package to convert a wide table into a long table. pivot_wider() is the inverse transformation: it converts a long table into a wide table.
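A minimal sketch of the male/female case, assuming a hypothetical table `wide` with made-up counts (none of these names or numbers come from the sample data):

```r
library(tidyr)

# Hypothetical wide table: the gender values "male" and "female" are column names
wide <- tibble::tibble(
  class = c("A", "B"),
  male = c(12, 15),
  female = c(14, 11)
)

# Gather the two columns into one categorical variable plus one value column
long <- pivot_longer(
  wide,
  cols = c(male, female),
  names_to = "gender",
  values_to = "count"
)
long
```

The long table has one row per class and gender, with the former column names stored in the gender column.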
Syntax:
pivot_longer(data, cols, names_to, values_to, values_drop_na, ...)
where:
- data: the data frame to reshape;
- cols: the columns to reshape, chosen with tidy-select syntax;
- names_to: the name of the new column (or columns, depending on the problem) that stores the column names of the reshaped columns;
- values_to: the name of the new column that stores the cell values of the reshaped columns;
- values_drop_na: whether to drop rows whose value is missing (NA, not available) after reshaping;
- If the names of the reshaped columns contain more than the desired content, such as a prefix, several variable names joined by a separator, or a pattern that needs regular-expression capture groups, use names_prefix, names_sep, or names_pattern to extract the desired part of each column name.
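The cols and values_drop_na parameters listed above can be tried on a small table. The sketch below assumes a hypothetical `scores` data frame with one missing quiz result; the two calls differ only in how the columns are selected and whether NA rows are kept.

```r
library(tidyr)

# Hypothetical scores: one row per student, one column per quiz
scores <- tibble::tibble(
  student = c("A", "B"),
  quiz1 = c(90, 85),
  quiz2 = c(NA, 70)
)

# Select the columns to reshape with a tidy-select helper; NA rows are kept
long_all <- pivot_longer(
  scores,
  cols = starts_with("quiz"),
  names_to = "quiz",
  values_to = "score"
)

# Select "everything except student" and drop rows whose value is NA
long_obs <- pivot_longer(
  scores,
  cols = -student,
  names_to = "quiz",
  values_to = "score",
  values_drop_na = TRUE
)
```

Both selections pick the same columns here; only the second result omits the row for student A's missing quiz2 score.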
Example 1:
Wide table to long table (store the names of the reshaped columns in one column)
> library(tidyverse)  # loads tidyr (pivot_longer, pivot_wider) and the %>% pipe
> df <- read.csv("配套数据/分省年度GDP.csv")
> df
地区 X2019年 X2018年 X2017年
1 北京市 35371.28 33105.97 28014.94
2 天津市 14104.28 13362.92 18549.19
3 河北省 35104.52 32494.61 34016.32
4 黑龙江省 13612.68 12846.48 15902.68
> df %>%
+ pivot_longer(-地区, names_to = "年份", values_to="GDP")
# A tibble: 12 × 3
地区 年份 GDP
<chr> <chr> <dbl>
1 北京市 X2019年 35371.
2 北京市 X2018年 33106.
3 北京市 X2017年 28015.
4 天津市 X2019年 14104.
5 天津市 X2018年 13363.
6 天津市 X2017年 18549.
7 河北省 X2019年 35105.
8 河北省 X2018年 32495.
9 河北省 X2017年 34016.
10 黑龙江省 X2019年 13613.
11 黑龙江省 X2018年 12846.
12 黑龙江省 X2017年 15903.
df is a wide table. Every column except 地区 (region) is to be reshaped. The first argument of pivot_longer() is the data frame; it is omitted here because the pipe supplies it. The second argument, -地区, selects the columns to reshape: all columns except 地区. The third argument, names_to = "年份" (year), names the new column that stores the names of the reshaped columns. The fourth argument, values_to = "GDP", names the new column that stores the cell values of the reshaped columns.
As this example shows, reshaping takes the names of the reshaped columns and puts them in the cells of a new column, repeating them in a cycle for each row of the original table: X2019年, X2018年, X2017年 form one cycle.
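The year values in the result still carry the "X" that read.csv() prepends to column names starting with a digit. One possible cleanup, sketched here on a hypothetical ASCII stand-in (`gdp`, with columns `region`, `X2019`, `X2018`), uses names_prefix and names_transform. The real column names also end in 年, which this simplified sketch does not handle.

```r
library(tidyr)

# Hypothetical stand-in for the GDP table, with ASCII column names
gdp <- tibble::tibble(
  region = c("Beijing", "Tianjin"),
  X2019 = c(35371.28, 14104.28),
  X2018 = c(33105.97, 13362.92)
)

gdp_long <- pivot_longer(
  gdp,
  cols = -region,
  names_to = "year",
  names_prefix = "X",                        # strip the leading "X" from each name
  names_transform = list(year = as.integer), # store the year as an integer
  values_to = "gdp"
)
gdp_long
```

With the year stored as an integer, it can be sorted and filtered numerically instead of as text.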
Example 2:
Wide table to long table (split the names of the reshaped columns into multiple parts)
Raw data: wide table
Goal: convert it into the following long table
Analysis:
This data set records the children of each family. For example, family 1 has two children, child1 and child2, and the date of birth and gender of each child are recorded.
In the original data, the columns to reshape are all columns except family. Their names have two parts separated by an underscore, such as dob_child1. The target table has three new columns: child, dob (date of birth), and gender. dob and gender stay as value columns that keep their names, while child1 and child2 become the values of a new column named child.
> load("配套数据/family.rda")
> knitr::kable(family, align="c")
| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|:------:|:----------:|:----------:|:-------------:|:-------------:|
| 1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
| 2 | 1996-06-22 | NA | 2 | NA |
| 3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
| 4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
| 5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
>
> family %>%
+ pivot_longer(-family,
+ names_to = c(".value", "child"),
+ names_sep="_",
+ values_drop_na = TRUE)
# A tibble: 9 × 4
family child dob gender
<int> <chr> <date> <int>
1 1 child1 1998-11-26 1
2 1 child2 2000-01-29 2
3 2 child1 1996-06-22 2
4 3 child1 2002-07-11 2
5 3 child2 2004-04-05 2
6 4 child1 2004-10-10 1
7 4 child2 2009-08-27 1
8 5 child1 2000-12-05 2
9 5 child2 2005-02-28 1
Code explanation:
- -family: the columns to reshape are all columns except family.
- names_sep = "_": the names of the reshaped columns are split at the underscore.
- names_to = c(".value", "child"): sets how the two parts of each split column name are used. The first part is dob or gender; the second part is child1 or child2.
- ".value" is a special marker: the first part of each name (dob, gender) becomes the name of a value column in the long table, and the cell values go into that column;
- "child" is the name of a new column that stores the second part of each name, namely child1 and child2.
- values_drop_na = TRUE: rows whose values are all missing (NA) are dropped from the result, which is why family 2 has no child2 row.
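To see the effect of ".value" and values_drop_na without the .rda file, the sketch below rebuilds the first two families by hand (the object name `fam` is a stand-in; the values are copied from the table above) and runs the reshape with and without dropping NA rows.

```r
library(tidyr)

# First two families from the table above, rebuilt by hand
fam <- tibble::tibble(
  family = 1:2,
  dob_child1 = as.Date(c("1998-11-26", "1996-06-22")),
  dob_child2 = as.Date(c("2000-01-29", NA)),
  gender_child1 = c(1L, 2L),
  gender_child2 = c(2L, NA)
)

# Keep every family/child combination, including the all-NA one
kept <- pivot_longer(
  fam, -family,
  names_to = c(".value", "child"),
  names_sep = "_"
)

# Drop the rows in which every value column is NA
dropped <- pivot_longer(
  fam, -family,
  names_to = c(".value", "child"),
  names_sep = "_",
  values_drop_na = TRUE
)
```

Here `kept` contains a child2 row for family 2 filled with NA, while `dropped` omits it.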
Example 3:
Wide table to long table (split the names of the reshaped columns with a regular expression)
Raw data: wide table
Goal: convert it into a long table of the following form
> df <- read.csv("配套数据/参赛队信息.csv")
> df
队员1姓名 队员1专业 队员2姓名 队员2专业 队员3姓名 队员3专业
1 张三 数学 李四 英语 王五 统计学
2 赵六 经济学 钱七 数学 孙八 计算机
>
> df %>%
+ pivot_longer(everything(),
+ names_to=c("队员", ".value"),
+ names_pattern = "(.*\\d)(.*)")
# A tibble: 6 × 3
队员 姓名 专业
<chr> <chr> <chr>
1 队员1 张三 数学
2 队员2 李四 英语
3 队员3 王五 统计学
4 队员1 赵六 经济学
5 队员2 钱七 数学
6 队员3 孙八 计算机
Syntax explanation:
- everything(): selects all columns, that is, every column is reshaped;
- names_pattern = "(.*\\d)(.*)": splits each column name with regular-expression capture groups. \\d matches one digit (0-9), . matches any character except a newline, and * repeats the preceding item zero or more times. The first group therefore captures everything up to and including the last digit (for example 队员1), and the second group captures the rest (for example 姓名).
- names_to = c("队员", ".value"): sets how the two captured parts are used. The first part (队员1, 队员2, 队员3) is stored in a new column named 队员 (team member). The second part is matched with ".value", so 姓名 (name) and 专业 (major) become the names of value columns that hold the corresponding cell contents.
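The way the pattern splits a column name can be checked directly before using it in pivot_longer(). The sketch below uses hypothetical ASCII stand-ins (member1name, member1major, and so on) for the Chinese column names, and base R's regexec() only to inspect the capture groups.

```r
library(tidyr)

pattern <- "(.*\\d)(.*)"

# Hypothetical ASCII stand-ins for the team column names
col_names <- c("member1name", "member1major", "member2name", "member2major")

# Each element holds the full match followed by the two captured groups
parts <- regmatches(col_names, regexec(pattern, col_names))
parts[[1]]

teams <- tibble::tibble(
  member1name = c("Zhang", "Zhao"),
  member1major = c("Math", "Economics"),
  member2name = c("Li", "Qian"),
  member2major = c("English", "Math")
)

# Group 1 goes to the new "member" column; group 2 names the value columns
long <- pivot_longer(
  teams,
  everything(),
  names_to = c("member", ".value"),
  names_pattern = pattern
)
long
```

Inspecting the groups first makes it easier to confirm that each column name splits where intended.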
3- Long table to wide table
Use the pivot_wider() function in the tidyr package to convert a long table into a wide table:
pivot_wider(data, id_cols, names_from, values_from, values_fill, ...)
where:
- data: the data frame to reshape;
- id_cols: the columns that uniquely identify each observation; by default, all columns other than those given in names_from and values_from;
- names_from: the column (or columns) whose values become the new column names;
- values_from: the column (or columns) whose values fill the cells of the new columns;
- values_fill: the value used to fill cells that are missing after the table is widened;
- There are also parameters that help build the new column names: names_prefix, names_sep, names_glue.
Example 1:
One name column and one value column.
Column names: from the Type column
Column values: from the Heads column
In a tidy data frame, a bare column name refers to the entire column, which is why names_from and values_from take unquoted names.
> load("配套数据/animals.rda")
> animals
# A tibble: 228 × 3
Type Year Heads
<chr> <int> <dbl>
1 Sheep 2015 24943.
2 Cattle 1972 2189.
3 Camel 1985 559
4 Camel 1995 368.
5 Camel 1997 355.
6 Goat 1977 4411.
7 Cattle 1979 2477.
8 Cattle 2014 3414.
9 Cattle 1996 3476.
10 Cattle 2017 4388.
# ℹ 218 more rows
# ℹ Use `print(n = ...)` to see more rows
>
> animals %>%
+ pivot_wider(names_from=Type, values_from=Heads, values_fill = 0)
# A tibble: 48 × 6
Year Sheep Cattle Camel Goat Horse
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2015 24943. 3780. 368. 23593. 3295.
2 1972 13716. 2189. 625. 4338. 2239.
3 1985 13249. 2408. 559 4299. 1971
4 1995 0 3317. 368. 8521. 2684.
5 1997 14166. 3613. 355. 10265. 2893.
6 1977 13430. 2388. 609 4411. 2104.
7 1979 14400. 2477. 614. 4715. 2079.
8 2014 23215. 3414. 349. 22009. 0
9 1996 13561. 3476. 358. 9135. 2770.
10 2017 30110. 4388. 434. 27347. 3940.
# ℹ 38 more rows
# ℹ Use `print(n = ...)` to see more rows
In the animals data, the values in the Type column repeat. Each distinct value, that is, each type of animal, becomes a new variable: one new column is created per type. names_from specifies which column of the original data supplies the new column names, and values_from specifies which column supplies the cell contents of the new columns. Both take bare column names, each referring to the entire column. values_fill = 0 fills the combinations that are absent from the original data, such as Sheep in 1995, with 0 instead of NA.
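The effect of values_fill is easiest to see on a tiny table with one missing combination. The `herd` data below is hypothetical (it is not taken from animals.rda); it has no Sheep row for 1995.

```r
library(tidyr)

# Hypothetical long table: there is no Sheep record for 1995
herd <- tibble::tibble(
  Type = c("Sheep", "Cattle", "Cattle"),
  Year = c(2015L, 2015L, 1995L),
  Heads = c(100, 40, 35)
)

# Default: the missing Sheep/1995 cell becomes NA
wide_na <- pivot_wider(herd, names_from = Type, values_from = Heads)

# With values_fill = 0 the same cell becomes 0
wide_zero <- pivot_wider(
  herd,
  names_from = Type,
  values_from = Heads,
  values_fill = 0
)
```

Whether 0 is an appropriate fill depends on the data: it asserts that the count was zero, whereas NA only says the count was not recorded.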
Example 2:
Multiple name columns or multiple value columns. The following example has two value columns, estimate and moe.
> us_rent_income  # data set bundled with tidyr
# A tibble: 104 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
7 05 Arkansas income 23789 165
8 05 Arkansas rent 709 5
9 06 California income 29454 109
10 06 California rent 1358 3
# ℹ 94 more rows
# ℹ Use `print(n = ...)` to see more rows
> us_rent_income%>%
+ pivot_wider(names_from=variable, values_from=c(estimate, moe))
# A tibble: 52 × 6
GEOID NAME estimate_income estimate_rent moe_income moe_rent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Col… 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ℹ 42 more rows
# ℹ Use `print(n = ...)` to see more rows
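By default the new column names are built as value column name, separator, name value (estimate_income, moe_rent, and so on). If a different order is preferred, names_glue offers one way to control it. The sketch below is an illustration rather than part of the original example; the template "{variable}_{.value}" is an assumed choice.

```r
library(tidyr)

# Put the variable first and the value column name second in each new name
wide <- pivot_wider(
  us_rent_income,
  names_from = variable,
  values_from = c(estimate, moe),
  names_glue = "{variable}_{.value}"
)
names(wide)
```

In the template, {variable} is replaced by each value of the variable column and {.value} by the name of each value column.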
Wide and long tables:
When a wide table is converted to a long table, the columns to be reshaped are "integrated" into fewer columns than before, so the table becomes longer. "Integrated" here means that the information in the reshaped columns is combined by creating new columns while keeping the remaining columns. Take male and female as an example: in the wide table each is a column of its own, and converting to a long table gathers them into a single variable named gender. The new column must be given a name, and because that column does not exist yet, the name is passed as a string, "gender". The cells of the gender column cycle through male and female. The cell values under the original male and female columns are likewise collected into one new column, whose name is also given as a string.
When a long table is converted to a wide table, the values in one column of the long table repeat. These repeated values are extracted to become new column names, so names_from specifies which column of the original data supplies them. Because that column already exists, the argument is a bare column name, not a string. The new columns then need cell values, which come from one or more columns of the original long table (chosen according to the problem at hand). These existing columns are given to values_from, again as bare column names without quotation marks.
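The quoted-versus-bare distinction and the inverse relationship between the two functions can be sketched as a round trip. The `scores_wide` table and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per subject
scores_wide <- tibble::tibble(
  student = c("A", "B"),
  math = c(90, 75),
  english = c(82, 88)
)

# Wide to long: the new columns do not exist yet, so their names are strings
scores_long <- pivot_longer(
  scores_wide,
  cols = c(math, english),
  names_to = "subject",
  values_to = "score"
)

# Long to wide: subject and score already exist, so bare names are used
scores_back <- pivot_wider(
  scores_long,
  names_from = subject,
  values_from = score
)

all.equal(scores_wide, scores_back)
```

For this simple table, pivoting back with pivot_wider() recovers the columns and values of the original wide table.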
Sample data source:
References:
- "R Language Programming" (People's Posts and Telecommunications Press, February 2023)
- "R Data Science in Practice: Detailed Explanation of Tools and Case Analysis" (Machinery Industry Press, June 2019)
- "R Language Data Visualization Practice (Micro-Video Full Solution Edition): Big Data Professional Charts from Entry to Mastery" (Electronic Industry Press, February 2022)