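- Lists
- Creating a list
A list is a "recursive" vector: its elements can themselves be broken down further. Take supermarket goods as an example: a list A can store all the goods, with each element of A representing one product. Each product can in turn be stored as a list B holding its attributes, such as name, price, and production date.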
> goods <- list(name = "Cookie", price = 4, outdate = F)
> goods
$name
[1] "Cookie"
$price
[1] 4
$outdate
[1] FALSE
> typeof(goods$name)
[1] "character"
> typeof(goods$price)
[1] "double"
> typeof(goods$outdate)
[1] "logical"
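It is good practice to name the elements when creating a list. The typeof() function reports the type of each element.
- Accessing list elements
List elements can be accessed in two main ways: by numeric index or by name.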
> goods$name
[1] "Cookie"
> goods[["name"]]
[1] "Cookie"
> goods[[1]]
[1] "Cookie"
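Note that double brackets return the element itself, so its data type is unchanged, whereas single brackets return a new list. Single brackets are therefore used to extract a sub-list of the original list; the selected elements do not have to be contiguous. The names() function returns the element names of a list.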
> names(goods)
[1] "name" "price" "outdate"
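The difference between the two bracket forms can be checked directly. The following is a minimal sketch that rebuilds the `goods` list from above; the names `sub` and `val` are introduced here only for illustration.

```r
goods <- list(name = "Cookie", price = 4, outdate = FALSE)

# Single brackets always return a list; the selection need not be contiguous.
sub <- goods[c("name", "outdate")]
class(sub)   # "list"
names(sub)   # "name" "outdate"

# Double brackets return the element itself, with its own type.
val <- goods[["price"]]
class(val)   # "numeric"
```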
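- Adding and removing list elements
> goods$producer <- "A Company" # add a new named element and initialize it
> goods[["material"]] <- "flour"
> goods[[6]] <- 1 # add an unnamed element by position
> goods
$name
[1] "Cookie"
$price
[1] 4
$outdate
[1] FALSE
$producer
[1] "A Company"
$material
[1] "flour"
[[6]]
[1] 1
> goods$material <- NULL # assigning NULL to an element removes it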
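As a small supplement, several elements can be removed in one step, and an element whose value is NULL can still be stored if it is wrapped in list(). This is only a sketch on a fresh copy of the list; the element name `note` is hypothetical.

```r
goods <- list(name = "Cookie", price = 4, outdate = FALSE)
goods$producer <- "A Company"          # add by name

# Remove several elements at once with single brackets.
goods[c("price", "outdate")] <- NULL
names(goods)                           # "name" "producer"

# To keep an element whose value is NULL, assign list(NULL) instead.
goods["note"] <- list(NULL)
length(goods)                          # 3
is.null(goods$note)                    # TRUE
```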
- Concatenating lists
In R, c() is the usual way to concatenate lists.
> c(list(A = 1, c = "C"), list(new = "NEW"))
$A
[1] 1
$c
[1] "C"
$new
[1] "NEW"
- Converting a list to a vector
> unlist(goods) # convert the list to a vector
name price outdate producer
"Cookie" "4" "FALSE" "A Company" "1"
> ngoods <- unlist(goods)
> names(ngoods)
[1] "name" "price" "outdate" "producer" ""
> names(ngoods) <- NULL # remove the names from the vector
> ngoods
[1] "Cookie" "4" "FALSE" "A Company" "1"
> mgoods <- unlist(goods)
> unname(mgoods) # returns a new vector without names
[1] "Cookie" "4" "FALSE" "A Company" "1"
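The output above consists entirely of strings because unlist() coerces all elements to one common type. A brief sketch of that behaviour, using small hypothetical lists (`mixed`, `nums`, `nested`) that are not part of the original example:

```r
# A character element forces everything to character.
mixed <- list(a = 1, b = TRUE, c = "x")
is.character(unlist(mixed))   # TRUE

# Without character elements, logical values are promoted to numeric.
nums <- list(a = 1, b = TRUE)
unlist(nums)                  # a = 1, b = 1

# Nested lists are flattened, and the names are joined with a dot.
nested <- list(a = 1, b = list(c = 2, d = 3))
names(unlist(nested))         # "a" "b.c" "b.d"
```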
> c(goods, recursive = TRUE) # returns an atomic vector; when concatenating lists, recursive = TRUE flattens the recursive structure
name price outdate producer
"Cookie" "4" "FALSE" "A Company" "1"
- Operations on lists
> temp <- list(1:10, -2:-9)
> lapply(temp, mean) # returns a list
[[1]]
[1] 5.5
[[2]]
[1] -5.5
> temp
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] -2 -3 -4 -5 -6 -7 -8 -9
> sapply(temp, mean) # returns a vector or a matrix
[1] 5.5 -5.5
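Whether sapply() returns a vector or a matrix depends on the length of each individual result. The sketch below illustrates this with the same `temp` list; vapply() is shown as a stricter alternative, and the result names (`res_l`, `res_s`, `res_r`, `res_v`) are chosen only for this example.

```r
temp <- list(1:10, -2:-9)

res_l <- lapply(temp, mean)    # list of length 2
res_s <- sapply(temp, mean)    # numeric vector: 5.5 -5.5

# Each call to range() returns two values, so sapply() builds a 2 x 2 matrix
# with one column per list element.
res_r <- sapply(temp, range)
dim(res_r)                     # 2 2

# vapply() works like sapply() but checks the declared type and length
# of every result.
res_v <- vapply(temp, mean, numeric(1))
```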
- Nested lists
> a1 <- list(name ="Cookie", price = 4.0, outdate = F)
> a2 <- list(name ="Milk", price = 2.0, outdate = T)
> warehouse <- list(a1, a2)
> warehouse
[[1]]
[[1]]$name
[1] "Cookie"
[[1]]$price
[1] 4
[[1]]$outdate
[1] FALSE
[[2]]
[[2]]$name
[1] "Milk"
[[2]]$price
[1] 2
[[2]]$outdate
[1] TRUE
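Elements of a nested list are reached by chaining the access operators. A minimal sketch, assuming the same `warehouse` structure as above (the names `prices` and `expired` are illustrative):

```r
a1 <- list(name = "Cookie", price = 4.0, outdate = FALSE)
a2 <- list(name = "Milk", price = 2.0, outdate = TRUE)
warehouse <- list(a1, a2)

# Chain [[ ]] and $ to reach an inner element.
warehouse[[2]]$name                                 # "Milk"

# Collect one attribute from every product.
prices <- sapply(warehouse, function(g) g$price)    # 4 2

# Keep only the products that are out of date.
expired <- Filter(function(g) g$outdate, warehouse)
length(expired)                                     # 1
```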
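- Data frames
A data frame looks very much like a matrix, with two dimensions (rows and columns), but its columns may have different modes: different columns can hold different basic data types.
- Creating a data frame
> male <- c(124, 88, 200) # it is best to give each vector a descriptive name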
> female <- c(108, 56, 221)
> degree <- c("low", "middle", "high")
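> myopia <- data.frame(degree, male, female) # combine the three vectors into a data frame; a shorter vector is recycled only if the longest length is an exact multiple of its length, otherwise data.frame() raises an error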
> myopia
degree male female
1 low 124 108
2 middle 88 56
3 high 200 221
> str(myopia) # inspect the internal structure of the data frame
'data.frame': 3 obs. of 3 variables:
$ degree: Factor w/ 3 levels "high","low","middle": 2 3 1
$ male : num 124 88 200
$ female: num 108 56 221
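The first line of the output shows that the data frame myopia holds 3 observations of 3 variables. Each variable is a column of the data frame, and each observation is a row. Note that "degree" has been converted from a character vector to a factor; setting the argument stringsAsFactors to FALSE prevents this conversion (since R 4.0.0, FALSE is the default, so the output above reflects an earlier R version). A data frame can also be created with the read.csv function.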
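The effect of stringsAsFactors and the recycling rule of data.frame() can be verified with a short sketch. The data frame `df` and the object `res` are hypothetical and exist only for this illustration.

```r
degree <- c("low", "middle", "high")
male <- c(124, 88, 200)
female <- c(108, 56, 221)

# Keep character columns as character, regardless of the R version.
myopia <- data.frame(degree, male, female, stringsAsFactors = FALSE)
class(myopia$degree)          # "character"

# Recycling works when the longer length is a multiple of the shorter one.
df <- data.frame(id = 1:4, flag = c("a", "b"))
nrow(df)                      # 4

# Otherwise data.frame() stops with an error, captured here with try().
res <- try(data.frame(id = 1:3, flag = c("a", "b")), silent = TRUE)
inherits(res, "try-error")    # TRUE
```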
- Accessing data frame elements
> myopia$degree
[1] low middle high
Levels: high low middle
> myopia[["degree"]]
[1] low middle high
Levels: high low middle
> myopia[["1"]]
NULL
> myopia[[1]]
[1] low middle high
Levels: high low middle
> myopia[1,]
degree male female
1 low 124 108
> myopia[,2]
[1] 124 88 200
> myopia[1,2]
[1] 124
- Extracting a sub data frame
> (sub <- myopia[2:3, 1:2])
degree male
2 middle 88
3 high 200
> class(sub)
[1] "data.frame"
> (sub1 <- myopia[2:3, 2])
[1] 88 200
> class(sub1)
[1] "numeric"
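Tip: wrapping an assignment in parentheses prints the assigned value. Selecting a single column returns a vector rather than a data frame; setting the argument drop to FALSE keeps the result as a data frame. Selecting a single row always returns a data frame.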
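The role of the drop argument can be sketched as follows, rebuilding `myopia` so the example is self-contained (`col_vec`, `col_df`, and `row_df` are illustrative names):

```r
myopia <- data.frame(degree = c("low", "middle", "high"),
                     male = c(124, 88, 200),
                     female = c(108, 56, 221))

col_vec <- myopia[2:3, 2]                 # a single column collapses to a vector
col_df  <- myopia[2:3, 2, drop = FALSE]   # drop = FALSE keeps a data frame
row_df  <- myopia[1, ]                    # a single row stays a data frame

class(col_vec)   # "numeric"
class(col_df)    # "data.frame"
class(row_df)    # "data.frame"
```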
> myopia[1:2] # with a single index, columns are selected
degree male
1 low 124
2 middle 88
3 high 200
> myopia[c("male", "female")] # a sub data frame can also be selected by column names
male female
1 124 108
2 88 56
3 200 221
> myopia[myopia$male > 100,] # filter rows with a logical condition
degree male female
1 low 124 108
3 high 200 221
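- Adding rows and columns to a data frame
> myopia # if the data frame was created without stringsAsFactors = FALSE, newly added values can conflict with the existing factor levels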
degree male female
1 low 124 108
2 middle 88 56
3 high 200 221
> ages <- c(13, 12, 14)
> names <- c("Jack", "Steven", "Marry")
> cbind(myopia, ages)
degree male female ages
1 low 124 108 13
2 middle 88 56 12
3 high 200 221 14
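The conflict with factor levels mentioned above can be reproduced on a small scale. This is a sketch under the assumption that the column is a factor; the data frames `df` and `df2` are hypothetical and use direct assignment rather than the original `myopia` data.

```r
df <- data.frame(degree = c("low", "high"), n = c(1, 2),
                 stringsAsFactors = TRUE)

# Assigning a value that is not an existing level produces NA (with a warning).
df2 <- df
suppressWarnings(df2$degree[2] <- "middle")
is.na(df2$degree[2])          # TRUE

# Extending the levels first makes the assignment valid.
levels(df$degree) <- c(levels(df$degree), "middle")
df$degree[2] <- "middle"
as.character(df$degree)       # "low" "middle"

# New columns can be added with $ as well as with cbind().
df$ages <- c(13, 12)
```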
- Merging data frames
> students
names ages
1 Jack 15
2 Steven 16
> students2
names gender
1 Jack M
2 Steven M
> merge(students, students2) # the two data frames share the column "names"; rows with matching values are combined into a new data frame
names ages gender
1 Jack 15 M
2 Steven 16 M
> students
names ages
1 Jack 15
2 Steven 16
3 Sarah 14
> na <- c("Jack", "Conan", "Gin")
> add <- c("Beijing", "Chongqing", "Shanghai")
> students3 <- data.frame(na, add)
> students3
na add
1 Jack Beijing
2 Conan Chongqing
3 Gin Shanghai
> merge(students, students3, by.x = "names", by.y = "na") # match the "names" column against the "na" column
names ages add
1 Jack 15 Beijing
> merge(students, students3, by.x = "names", by.y = "na", all.x = T)
names ages add
1 Jack 15 Beijing
2 Sarah 14 <NA>
3 Steven 16 <NA>
> merge(students, students3, by.x = "names", by.y = "na", all.y = T)
names ages add
1 Conan <NA> Chongqing
2 Gin <NA> Shanghai
3 Jack 15 Beijing
> merge(students, students3, by.x = "names", by.y = "na", all = T)
names ages add
1 Conan <NA> Chongqing
2 Gin <NA> Shanghai
3 Jack 15 Beijing
4 Sarah 14 <NA>
5 Steven 16 <NA>
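The four calls above correspond to inner, left, right, and full joins. A self-contained sketch that rebuilds both data frames (with ages assumed to be numeric here) and compares the number of rows each variant keeps:

```r
students <- data.frame(names = c("Jack", "Steven", "Sarah"),
                       ages = c(15, 16, 14),
                       stringsAsFactors = FALSE)
students3 <- data.frame(na = c("Jack", "Conan", "Gin"),
                        add = c("Beijing", "Chongqing", "Shanghai"),
                        stringsAsFactors = FALSE)

inner <- merge(students, students3, by.x = "names", by.y = "na")
left  <- merge(students, students3, by.x = "names", by.y = "na", all.x = TRUE)
right <- merge(students, students3, by.x = "names", by.y = "na", all.y = TRUE)
full  <- merge(students, students3, by.x = "names", by.y = "na", all = TRUE)

# Rows kept by each join type.
sapply(list(inner, left, right, full), nrow)   # 1 3 3 5
```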
- Other data frame operations
> tt <- rbind(students, list("Kevin", 30))
> tt$grade <- c(88, 74, 90, 82)
> tt
names ages grade
1 Jack 15 88
2 Steven 16 74
3 Sarah 14 90
4 Kevin 30 82
> apply(tt[,2:3, drop = F], 2, mean)
> (al <- lapply(students, sort)) # sort each column; returns a list
$names
[1] "Jack" "Sarah" "Steven"
$ages
[1] "14" "15" "16"
> (al <- sapply(students, sort)) # returns a matrix, which can be converted to a data frame with as.data.frame()
names ages
[1,] "Jack" "14"
[2,] "Sarah" "15"
[3,] "Steven" "16"
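Sorting every column independently, as above, breaks the link between a name and its age. When whole rows should be reordered by one column, order() is the usual tool. A minimal sketch, assuming numeric ages (`by_age` and `by_age_desc` are illustrative names):

```r
students <- data.frame(names = c("Jack", "Steven", "Sarah"),
                       ages = c(15, 16, 14),
                       stringsAsFactors = FALSE)

# order() returns the row permutation that sorts the chosen column.
by_age <- students[order(students$ages), ]
by_age$names        # "Sarah" "Jack" "Steven"

# Negate a numeric column to sort in decreasing order.
by_age_desc <- students[order(-students$ages), ]
by_age_desc$names   # "Steven" "Jack" "Sarah"
```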
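- Factors
Variables in R generally fall into three kinds: nominal, ordinal, and continuous. Nominal and ordinal variables are represented as factors.
- Creating a factor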
> ssample <- c("BJ", "SH", "CQ", "SH")
> (sf <- factor(ssample)) # factor() essentially re-encodes a vector as a factor; a factor can be seen as a vector that carries extra information
[1] BJ SH CQ SH
Levels: BJ CQ SH
> nsample <- c(2, 3, 3, 5)
> (sf <- factor(nsample))
[1] 2 3 3 5
Levels: 2 3 5
> str(sf) # for character vectors, levels are created in alphabetical order by default; for numeric vectors, levels are created in increasing numeric order
Factor w/ 3 levels "2","3","5": 1 2 2 3
> unclass(sf)
[1] 1 2 2 3
attr(,"levels")
[1] "2" "3" "5"
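For ordinal variables the level order carries meaning, so it is usually given explicitly. The sketch below uses a hypothetical `degree` vector to show an ordered factor, and also how to recover the original numbers from a factor built on numeric data.

```r
degree <- c("low", "middle", "high", "low")

# Specify the level order and mark the factor as ordered.
f <- factor(degree, levels = c("low", "middle", "high"), ordered = TRUE)
as.integer(f)     # 1 2 3 1
f < "high"        # TRUE TRUE FALSE TRUE

# A factor built from numbers stores integer codes, not the numbers themselves.
nf <- factor(c(2, 3, 3, 5))
as.integer(nf)                  # 1 2 2 3 (the codes)
as.numeric(as.character(nf))    # 2 3 3 5 (the original values)
```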
- Adding levels to a factor
> sample <- c(12, 15, 7, 10)
> fsample <- factor(sample, levels = c(7, 10, 12, 15, 100))
> fsample
[1] 12 15 7 10
Levels: 7 10 12 15 100
> fsample[5] <- 100
> fsample
[1] 12 15 7 10 100
Levels: 7 10 12 15 100
> fsample[6] <- 100
> fsample
[1] 12 15 7 10 100 100
Levels: 7 10 12 15 100
> fsample[6] <- 99 # a value that is not an existing level cannot be added to a factor
Warning message:
In `[<-.factor`(`*tmp*`, 6, value = 99) :
invalid factor level, NA generated
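To store a value such as 99, the level has to be registered first. A short sketch of one way to do this, using a fresh factor with the same levels as above:

```r
fsample <- factor(c(12, 15, 7, 10), levels = c(7, 10, 12, 15, 100))

# Register the new level, then the assignment succeeds without a warning.
levels(fsample) <- c(levels(fsample), "99")
fsample[5] <- "99"
as.character(fsample)        # "12" "15" "7" "10" "99"

# droplevels() removes levels that no longer occur (here "100").
levels(droplevels(fsample))  # "7" "10" "12" "15" "99"
```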
- Common functions for factors
> wt <- c(46, 39, 35,42, 43, 43)
> group <- c("A", "B", "C", "A", "B", "C")
> tapply(wt, as.factor(group), mean) # split wt into 3 groups, then compute the mean of each group
A B C
44 41 39
> wt <- c(46, 39, 35,42, 43, 43,46, 39, 35,42, 43, 43)
> diet <- c("A", "B", "C", "A", "B", "C", "A", "B", "C","A", "B", "C")
> gender <- c("M", "M","M","M","M","M","F","F","F","F","F","F" )
> tapply(wt, list(as.factor(diet), as.factor(gender)), mean)
F M
A 44 44
B 41 41
C 39 39
> split(wt, list(diet, gender)) # form the groups
$A.F
[1] 46 42
$B.F
[1] 39 43
$C.F
[1] 35 43
$A.M
[1] 46 42
$B.M
[1] 39 43
$C.M
[1] 35 43
> myopia
degree male female
1 low 124 108
2 middle 88 56
3 high 200 221
> by(myopia, myopia$degree, function(frame) frame[,2]+ frame[,3]) # by() applies a function to each group of rows of a data frame or matrix
myopia$degree: high
[1] 421
------------------------------------------------------------------------------
myopia$degree: low
[1] 232
------------------------------------------------------------------------------
myopia$degree: middle
[1] 144
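The grouping functions in this section are closely related: tapply() can be thought of as split() followed by applying a function to each piece. The sketch below checks this on the first `wt` example; aggregate() is added as an assumption about what readers may find convenient, since it returns the result as a data frame.

```r
wt <- c(46, 39, 35, 42, 43, 43)
group <- c("A", "B", "C", "A", "B", "C")

m1 <- tapply(wt, group, mean)             # A = 44, B = 41, C = 39
m2 <- sapply(split(wt, group), mean)      # same values via split + sapply
all.equal(as.vector(m1), as.vector(m2))   # TRUE

# aggregate() returns the group means as a data frame.
agg <- aggregate(wt ~ group, data = data.frame(wt, group), FUN = mean)
agg$wt                                    # 44 41 39
```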
- Tables