---
title: 'Coronavirus: China and Rest of World'
author: "基于Michal Bogacz在kaggle分析的改进,北邮吴俊 版权所有"
output:
html_document:
df_print: paged
toc: yes
toc_float: yes
toc_depth: '4'
pdf_document:
code_folding: hide
css: wjtemplate.css
highlight: tango
number_sections: yes
theme: cosmo
---
```{r include = FALSE}
if(!knitr:::is_html_output())
{
options("width"=56)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=56, indent = 2), tidy = TRUE)
knitr::opts_chunk$set(fig.pos = 'H')
}
```
# 研究背景与问题
本分析的目的是比较中国和国外的新冠疫情变化情况,原作者是 Michal Bogacz,链接:https://www.kaggle.com/michau96/coronavirus-china-and-rest-of-world ,本人基于提供的数据及图表做了进一步分析与改进,分析结论仅做为教学参考用。
# 分析过程
## 数据预处理
```{r echo=TRUE, message=FALSE, warning=FALSE}
# 清空环境,设置工作目录
rm(list=ls())
setwd("G:/R default work directory") # 设置数据集和分析结果所在工作目录
begin<- Sys.time()
options(warn=-1) #消除警告信息
library(tidyverse)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(gridExtra)
library(ggrepel)
library(lubridate)
data <- read.csv("covid_19_data.csv")
data <- data[,c(1:2,4,6:8)]
# %in% 判断前面一个向量内的元素是否在后面一个向量中,返回逻辑判断布尔值
data$isChina <- ifelse(data$Country %in% c("Mainland China", "China"),"China","Not China")
data$Date <- as.Date(data$ObservationDate, format = "%m/%d/%y")
data <- data[,-2] # 剔除第2列
head <- data[sample(1:nrow(data),5), ]
head <- head[order(head$SNo),]
head
```
从输出的数据集前五行看,分析可用到的数据字段有 5 个,分别是: date, number of infected, fatalities and cured in a given region, and whether the region under study belongs to China。该数据集的字段有限,呈现时序变化特征,因此比较适合做时间序列分析,如预测,干预的因果推断等。
## 确认感染人数的变动趋势分析
```{r grss, out.width='80%', fig.align='center', fig.cap="确认感染人数的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Conf <- data %>%
group_by(isChina, Date) %>%
summarise(x = sum(Confirmed))
ggplot(Conf, aes(Date, x, colour = isChina))+
geom_line(size = 2, alpha = 0.5)+
geom_point(size = 2.7)+
# scale_y_continuous(trans="log10")+
labs(x = "", y = "确认感染人数", title = "确认感染人数变动趋势", subtitle = "中国与世界其他国家的对比")+
# geom_text_repel(aes(label = x), nudge_y = 1100, color = "black", size = 3.9)+
scale_x_date(date_labels = "%b %d", date_breaks = "7 days")+
scale_colour_brewer(palette = "Set1")+
theme_fivethirtyeight()+
theme(legend.position="bottom", legend.direction="horizontal", legend.title = element_blank(), axis.text = element_text(size = 14),
legend.text = element_text(size = 13), axis.title = element_text(size = 14), axis.line = element_line(size = 0.4, colour = "grey10"),
plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
图中可以看到,1月22日-2月1日,是疫情第一阶段,增速较快,2月2日-11日增速上升非常快,疫情进入快速蔓延阶段,2月13-2月17是阶段高峰期,2月21日后增速大大减缓。
## 病殁人数的变动趋势分析
```{r bmrs, out.width='80%', fig.align='center', fig.cap="病殁人数的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Dea <- data %>%
group_by(isChina, Date) %>%
summarise(x = sum(Deaths))
ggplot(Dea, aes(Date, x, colour = isChina))+
geom_line(size = 2, alpha = 0.5)+
geom_point(size = 2.7)+
#scale_y_continuous(trans="log10")+
scale_x_date(date_labels = "%b %d", date_breaks = "7 days")+
labs(x = "", y = "Number of deaths", title = "病殁人数", subtitle = "中国与世界其他国家的对比")+
# geom_text_repel(aes(label = x), nudge_y = 40, color = "black", size = 4.1)+
scale_colour_brewer(palette = "Set1")+
theme_fivethirtyeight()+
theme(legend.position="bottom", legend.direction="horizontal", legend.title = element_blank(), axis.text = element_text(size = 14),
legend.text = element_text(size = 13), axis.title = element_text(size = 14), axis.line = element_line(size = 0.4, colour = "grey10"),
plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
图中可以看到,1月份病殁人数增长平稳,进入2月后增速很快,但没有确诊人数增速快,说明病殁人数的增长主要源于确诊人数的增长。
## 治愈人数的变动趋势分析
```{r zyrs, out.width='80%', fig.align='center', fig.cap="治愈人数的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Rec <- data %>%
group_by(isChina, Date) %>%
summarise(x = sum(Recovered))
ggplot(Rec, aes(Date, x, colour = isChina))+
geom_line(size = 2, alpha = 0.5)+
geom_point(size = 2.7)+
#scale_y_continuous(trans="log10")+
scale_x_date(date_labels = "%b %d", date_breaks = "7 days")+
labs(x = "", y = "治愈人数", title = "治愈人数的变动趋势", subtitle = "中国与世界其他国家的对比")+
# geom_text_repel(aes(label = x), nudge_y = 95, color = "black", size = 3.9)+
scale_colour_brewer(palette = "Set1")+
theme_fivethirtyeight()+
theme(legend.position="bottom", legend.direction="horizontal", legend.title = element_blank(), axis.text = element_text(size = 14),
legend.text = element_text(size = 13), axis.title = element_text(size = 14), axis.line = element_line(size = 0.4, colour = "grey10"),
plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
图中显示中国的治愈人数增速高于病殁人数,在2月中旬,每天治愈约1000人,在1月底,中国之外感染的10人已治愈,这种状况也让西方国家误以为新冠病毒只对中国人有严重影响。
## 病殁率的变动趋势分析
```{r bml, out.width='80%', fig.align='center', fig.cap="病殁率的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Tog2 <- cbind(Dea, Conf)
Tog2 <- Tog2[,c(1,2,3,6)]
names(Tog2)[3:4] <- c("Deaths", "Total")
Tog2$Dea2All <- Tog2$Deaths/Tog2$Total
#Tog2_China <- Tog2[1:14,]
ggplot(Tog2[-c(1,42),], aes(Date,Dea2All, colour = isChina))+
geom_line(size = 2.2, alpha = 0.5)+
geom_point(size = 3)+
labs(x = "", y = " 病殁率", title = "病殁率的变动趋势", subtitle = "中国与世界其他国家的对比", colour = "")+
#geom_text_repel(aes(label = paste0(round(100*Dea2All,1), "%")), nudge_y = 0.0013, nudge_x = 0.2, color = "black", size = 3.8)+
scale_y_continuous(labels = scales::percent, limits = c(0,0.036))+
scale_x_date(date_labels = "%b %d", date_breaks = "7 days")+
scale_colour_brewer(palette = "Set1")+
theme_fivethirtyeight()+
annotate("text", x = mean(Tog2$Date)-12, y = 0.0135, label = "中国以外的 \n 第1例死亡", size = 4.8)+
annotate("segment", x = mean(Tog2$Date)-13.5, xend = mean(Tog2$Date)-9.5, y = 0.0109, yend = 0.0062, colour = "deepskyblue4", size=0.7, alpha=0.7, arrow=arrow())+
theme(legend.position="bottom", legend.direction="horizontal", axis.text = element_text(size = 14), axis.title = element_text(size = 14),
legend.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"),
plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
以病殁率(病殁人数/确诊感染人数)来分析,2月20前,中国的病殁率在2%-3%左右波动,2月20日后有所上升。其他国家的病殁率在2月20日后也上升到1%-2%区间。
## 治愈率的变动趋势分析
```{r zyl, out.width='80%', fig.align='center', fig.cap="治愈率的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Tog3 <- cbind(Rec, Conf)
Tog3 <- Tog3[,c(1,2,3,6)]
names(Tog3)[3:4] <- c("Rec", "Total")
Tog3$Rec2All <- Tog3$Rec/Tog3$Total
ggplot(Tog3[-c(1,42),], aes(Date,Rec2All, colour = isChina))+
geom_line(size = 2.2, alpha = 0.5)+
geom_point(size = 3)+
labs(x = "", y = "治愈率", title = "治愈率的变动趋势", subtitle = "中国与世界其他国家的对比", colour = "")+
# geom_text(aes(label = paste0(round(100*Rec2All,1), "%")), nudge_y = 0.0025, nudge_x = 0.2, color = "black", size = 4.1)+
scale_y_continuous(labels = scales::percent, limits = c(0,0.6))+
scale_x_date(date_labels = "%b %d", date_breaks = "7 days")+
scale_colour_brewer(palette = "Set1")+
theme_fivethirtyeight()+
theme(legend.position="bottom", legend.direction="horizontal", axis.text = element_text(size = 14), axis.title = element_text(size = 14), legend.text = element_text(size = 13),
axis.line = element_line(size = 0.4, colour = "grey10"), plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
以治愈率(治愈人数/确诊感染人数)来分析,1月23-1月30日,中国的治愈率先高后低,2月13日开始快速上升。其他国家的治愈率在2月13日后波动下行,说明缺乏有效的治疗药物或手段干预。
## 治愈人数/死亡人数比例的变动趋势分析
```{r zybl, out.width='80%', fig.align='center', fig.cap="治愈/死亡比例的变动趋势", echo=FALSE, message=FALSE, warning=FALSE}
Tog <- cbind(Dea, Rec)
Tog <- Tog[,c(1,2,3,6)]
names(Tog)[3:4] <- c("Deaths", "Recovered")
Tog$Rec2Dea <- Tog$Recovered/Tog$Deaths
Tog <- Tog[month(Tog$Date)==2,]
Tog <- Tog[-c(1,30),]
ggplot(Tog, aes(Date, Rec2Dea, colour = isChina))+
geom_line(size = 2.2, alpha = 0.5)+
geom_point(size = 3.3)+
labs(x = "", y = "治愈人数/死亡人数比例", title = "治愈人数/死亡人数比例的变动趋势", subtitle = "中国与世界其他国家的对比 (自2月1日起)", colour = "Indicator")+
# geom_text_repel(aes(label = round(Rec2Dea,1)), nudge_y = 0.115, color = "black", size = 4.5)+
scale_colour_brewer(palette = "Set1")+
scale_y_continuous(limits = c(0,36))+
scale_x_date(date_labels = "%b %d", date_breaks = "5 days")+
geom_hline(yintercept = 10, linetype = 2, alpha = 0.5)+
annotate("text", x = mean(Tog$Date), y = 11.1, label = "10例 治愈 = 1例死亡", size = 4.1)+
theme_fivethirtyeight()+
theme(legend.position="right", legend.direction="vertical", axis.text = element_text(size = 14), axis.title = element_text(size = 14),
legend.text = element_text(size = 13), axis.line = element_line(size = 0.4, colour = "grey10"),
plot.background = element_rect(fill = "#C8EDAF"), legend.background = element_rect(fill = "#C8EDAF"))
```
最后我们考察下治愈人数与病殁人数的比率变动情况,该比率高于1说明治愈人数多于病殁人数,鉴于发病初始阶段数据量小扰动大,从2月1日开始计算,显示中国的治愈人数与病殁人数的比率稳步上升,中国之外的其他国家该指标呈现倒U型趋势,说明其他国家对新冠疫情的防控更多还是以病人治疗而非疫情的全面防控。
# **分析主要结论**
通过上述分析得到的主要结论如下:
- 在新冠疫情爆发的1月-2月底期间,中国的感染人数高于世界其他地区,同时病殁率也是最高;
- 随着时间的推移,中国的治愈人数呈现非线性增长态势;病殁人数缓慢下降,但不高于3%;
- 中国的治愈人数的增长逼近病殁人数的增长;
- 以治愈率和治愈人数/死亡人数比例考察,中国以外国家在这两个指标上的表现均不乐观,3月份除中国以外的其他国家疫情防控形势较为严峻。
**本次数据分析耗时 `r round(difftime(Sys.time(),begin,units = "mins"),digits = 2)` 分钟**。
Markdown自动生成目录:
toc: true #是否展示目录
toc_float: true #目录是否出现在左侧
toc_depth: 4 #目录能展示的级别