A Short Summary of Data Manipulation and Processing in R

2019-05-12, by 秦城听雪

Entering data with R's built-in editor

elements<- data.frame()
elements<- edit(elements)
print(elements)
str(iris)

Extract the first 5 rows

iris[1:5,]

Extract the first 5 rows of the "Sepal.Length" and "Sepal.Width" columns

iris[1:5,c("Sepal.Length","Sepal.Width")]
# Draw 10 values from 1:6 with replacement, like rolling a die 10 times
sample(1:6,10,replace = TRUE)

Set the random seed

set.seed(20)
index<- sample(1:nrow(iris),5)
index
iris[index,]
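The point of setting a seed is reproducibility. A minimal sketch, assuming the illustrative variable names `first` and `second` (they are not part of the original), shows that resetting the same seed yields the same sample:

```r
# Draw 5 row indices twice, resetting the seed before each draw.
set.seed(20)
first <- sample(1:nrow(iris), 5)

set.seed(20)
second <- sample(1:nrow(iris), 5)

# Same seed, same sequence of random numbers, hence the same sample.
identical(first, second)
```

Without the second call to `set.seed()`, the two draws would generally differ.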

Find duplicated values

duplicated(c(1,2,3,1,2,1,4,3))

Find which rows are duplicates

which(duplicated(iris))

Remove duplicates

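## Method 1: index with a logical vector; FALSE drops the corresponding row.
## The ! operator is logical negation: it turns TRUE into FALSE and FALSE into TRUE,
## so !duplicated(iris) keeps only the rows that are not duplicates.
iris[!duplicated(iris),]
## Method 2: index with negative positions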
index<- which(duplicated(iris))
iris[-index,]
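The two methods agree on iris, but they behave differently when there are no duplicates at all. The sketch below uses a small hypothetical data frame `no_dups` (not in the original) to show the difference, and also checks that the logical-index method matches base R's `unique()`:

```r
# Method 1 gives the same result as unique() on a data frame.
deduped <- iris[!duplicated(iris), ]
identical(deduped, unique(iris))

# Hypothetical data frame with no duplicated rows.
no_dups <- data.frame(a = 1:3)
idx <- which(duplicated(no_dups))   # integer(0): nothing is duplicated

# Negative indexing with an empty index selects nothing, so every row is lost.
rows_negative <- nrow(no_dups[-idx, , drop = FALSE])

# The logical-vector form keeps all rows, as intended.
rows_logical <- nrow(no_dups[!duplicated(no_dups), , drop = FALSE])

c(negative = rows_negative, logical = rows_logical)
```

For that reason the logical-vector form is usually the safer default when the data may contain no duplicates.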
# Remove rows that contain missing values
str(airquality)
complete.cases(airquality)
x<- airquality[complete.cases(airquality),]
str(x)
# Method 2 for removing rows that contain missing values
x<- na.omit(airquality)
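The two approaches keep the same rows. A short sketch, using the illustrative names `by_index` and `by_omit` (assumptions, not from the original), compares them and shows the extra bookkeeping that `na.omit()` attaches:

```r
by_index <- airquality[complete.cases(airquality), ]
by_omit  <- na.omit(airquality)

# Both keep exactly the same rows and values.
same_rows   <- identical(rownames(by_index), rownames(by_omit))
same_values <- identical(by_index$Ozone, by_omit$Ozone)
c(same_rows = same_rows, same_values = same_values)

# na.omit() additionally records which rows were dropped.
dropped <- attr(by_omit, "na.action")
length(dropped) == nrow(airquality) - nrow(by_omit)

# Neither call modifies airquality itself; it still contains NAs.
anyNA(airquality)
```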

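These functions return a new object rather than changing their input. Assign the result to a new variable, and the original data frame stays untouched.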

Column operations on data frames

Compute the sepal length-to-width ratio

y<- iris$Sepal.Length/iris$Sepal.Width
head(y)

Use with() and within() to make the calculation more readable

z<- with(iris, Sepal.Length / Sepal.Width)
# Use identical(x, y) to check whether two objects are exactly the same
identical(y,z)

Use within() to assign columns. Suppose we want to store the computed ratio in the original data frame. With plain assignment this is:

iris$ratio<-  iris$Sepal.Length/iris$Sepal.Width

The same thing can be written with within():

iris<- within(iris,ratio<- Sepal.Length/Sepal.Width) 
head(iris$ratio)
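To make the difference between the two helpers concrete, here is a small sketch on a hypothetical two-row data frame `flowers` (the data frame and its column names are assumptions made for illustration). `with()` returns the value of the expression, while `within()` returns a modified copy of the data frame:

```r
# Hypothetical data, not taken from iris.
flowers <- data.frame(len = c(5.1, 4.9), wid = c(3.5, 3.0))

# with(): evaluate an expression using the columns, get a plain vector back.
ratio_vec <- with(flowers, len / wid)

# within(): evaluate assignments and get a new data frame back.
with_ratio <- within(flowers, ratio <- len / wid)

names(flowers)      # the original still has only len and wid
names(with_ratio)   # the copy has the extra ratio column
identical(with_ratio$ratio, ratio_vec)
```

Because `within()` returns a copy, the result has to be assigned (as in `iris <- within(iris, ...)`) for the new column to persist.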

##########

rm(list = ls())

Grouping data

######################

1. Create equal-width bins with cut()

head(state.x77)

Extract the Frost column

frost<- state.x77[,"Frost"]
head(frost,5)
cut(frost,3,include.lowest = TRUE)

2. Add labels to the bins created by cut()

cut(frost,3,include.lowest = TRUE,labels = c("Low","Med","High"))

3. Count the observations in each bin with table()

x<- cut(frost,3,include.lowest = TRUE,labels = c("Low","Med","High"))
table(x)
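Passing a single number to `cut()` splits the range of the data into intervals of equal width, so the counts per bin can be quite uneven. If bins with roughly equal counts are wanted instead, one option (an alternative not used in the original) is to pass quantile-based break points. A minimal sketch:

```r
frost <- state.x77[, "Frost"]
bin_labels <- c("Low", "Med", "High")

# Equal-width bins: the range of frost is split into 3 intervals of equal length.
equal_width <- cut(frost, 3, include.lowest = TRUE, labels = bin_labels)
table(equal_width)

# Roughly equal-count bins: use the tertiles of the data as break points.
tertiles <- quantile(frost, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(frost, tertiles, include.lowest = TRUE, labels = bin_labels)
table(equal_count)
```

`include.lowest = TRUE` matters in the second call: without it the state with the minimum value would fall outside the first interval and become NA.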

Combining datasets

Using the merge() function

all.states<- as.data.frame(state.x77)
all.states$Name<- rownames(state.x77)
rownames(all.states)<- NULL
str(all.states)

Extract the cold states (more than 150 frost days)

cold.states<- all.states[all.states$Frost>150,c("Name","Frost")]
cold.states

Extract the largest states (area of at least 100,000 square miles)

large.states<- all.states[all.states$Area>=100000,c("Name","Area")]
large.states

Use merge() to find the intersection of the two data frames

merge(cold.states,large.states)

Take the union

merge(cold.states,large.states,all = TRUE)
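Besides the intersection and the union, `merge()` can also keep all rows from only one side through its `all.x` and `all.y` arguments. The sketch below rebuilds the two data frames so it can run on its own; the names `inner`, `left`, `right` and `full` are illustrative:

```r
all.states <- as.data.frame(state.x77)
all.states$Name <- rownames(state.x77)
rownames(all.states) <- NULL
cold.states  <- all.states[all.states$Frost > 150, c("Name", "Frost")]
large.states <- all.states[all.states$Area >= 100000, c("Name", "Area")]

# Naming the key explicitly avoids surprises if more columns are shared later.
inner <- merge(cold.states, large.states, by = "Name")                # intersection
left  <- merge(cold.states, large.states, by = "Name", all.x = TRUE)  # all cold states
right <- merge(cold.states, large.states, by = "Name", all.y = TRUE)  # all large states
full  <- merge(cold.states, large.states, by = "Name", all = TRUE)    # union

c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Rows that have no partner on the other side get NA in the columns that come from that side.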

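Using lookup tables

The match() function returns, for each element of the first vector, the position of its first match in the second vector, or NA if there is no match.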

index<- match(cold.states$Name,large.states$Name)
index

Use na.omit() to drop the NA values from the index vector

large.states[na.omit(index),]

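The %in% operator returns a logical vector that tells us which elements of the left-hand vector have a match in the right-hand vector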

index<- cold.states$Name %in% large.states$Name
index
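The relationship between the two is easier to see on small vectors. The following sketch uses two hypothetical character vectors, `cold` and `large`, made up for illustration:

```r
# Hypothetical lookup data.
cold  <- c("Alaska", "Maine", "Nevada")
large <- c("Texas", "Alaska", "Nevada", "Montana")

pos <- match(cold, large)   # positions in large: 2 NA 3
hit <- cold %in% large      # TRUE FALSE TRUE

# %in% is TRUE exactly where match() found a position.
identical(hit, !is.na(pos))

# match() indexes into the second vector, %in% filters the first one.
large[na.omit(pos)]
cold[hit]
```

In short, `match()` is the tool when the rows of the lookup table are needed, and `%in%` when only a yes/no filter on the first vector is needed.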

#####################

Sorting data

Prepare the data

some.states<- data.frame(Region=state.region,state.x77)

Keep the first ten rows and the first three columns

some.states<- some.states[1:10,1:3]
some.states

Sort in ascending order with sort()

sort(some.states$Population)

Descending order

sort(some.states$Population,decreasing = TRUE)

Get the positions of the sorted elements with order()

order.pop<- order(some.states$Population)
order.pop

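The result tells us that the smallest value sits at position 2 of the original vector, the second smallest at position 8, and so on. Indexing with it returns the sorted values: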

some.states$Population[order.pop]

Sort the data frame in ascending order

some.states[order.pop,]

Descending order

order(some.states$Population)
order(some.states$Population,decreasing = TRUE)

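The result of order() can be used directly to sort the data frame in descending order, without an intermediate variable for the positions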

some.states[order(some.states$Population,decreasing = TRUE),]

Sorting by multiple columns

When several vectors are passed to order(), ties in the first vector are broken by the second vector

index<- with(some.states,order(Region,Population))
some.states[index,]
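A common variation is to sort ascending on one column and descending on another. One way to sketch this, assuming the second key is numeric, is to negate it inside `order()`; the names `idx` and `sorted` are illustrative:

```r
some.states <- data.frame(Region = state.region, state.x77)[1:10, 1:3]

# Region ascending (by factor level), Population descending within each region.
# Negation only works for numeric keys.
idx <- with(some.states, order(Region, -Population))
sorted <- some.states[idx, ]
sorted
```

Note that a factor such as Region is ordered by its levels, not alphabetically, so the regions appear in the order given by `levels(state.region)`.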

#####################################################
########Using the apply() function########

str(Titanic)

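Sum the Titanic dataset over its first dimension (Class). The next two calls do the same for Age (dimension 3) and for Age by Survived (dimensions 3 and 4)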

apply(Titanic,1,sum)
apply(Titanic,3,sum)
apply(Titanic,c(3,4),sum)
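The second argument of `apply()` names the dimension(s) to keep; everything else is collapsed by the function. A small sketch on a hypothetical 2 x 3 matrix `m` makes this easier to verify by hand:

```r
# Hypothetical matrix, filled column by column.
m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

row_totals <- apply(m, 1, sum)   # keep dimension 1: a = 9, b = 12
col_totals <- apply(m, 2, sum)   # keep dimension 2: x = 3, y = 7, z = 11

row_totals
col_totals

# The same idea on Titanic: totals per Class add up to the grand total.
sum(apply(Titanic, 1, sum)) == sum(Titanic)
```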

lapply() and sapply()

Get the class of each column of the iris dataset

lapply(iris,class)

With sapply(), R tries to simplify the result into a vector or a matrix

sapply(iris,class)

sapply(iris,function(x) ifelse(is.numeric(x),mean(x),NA))

Use tapply() to create tabular summaries

tapply(iris$Sepal.Length,iris$Species,mean)
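One way to understand `tapply()` is as "split the vector by a factor, then apply a function to each piece". The following sketch spells that out with `split()` and `sapply()`; the names `by_tapply` and `by_split` are illustrative:

```r
by_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

# The same computation in two explicit steps.
pieces   <- split(iris$Sepal.Length, iris$Species)  # one numeric vector per species
by_split <- sapply(pieces, mean)

all.equal(as.vector(by_tapply), as.vector(by_split))
identical(names(by_tapply), names(by_split))
```

The values and names agree; the only difference is that `tapply()` returns a one-dimensional array while `sapply()` returns a named vector.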

Use tapply() to create multi-dimensional tables

str(mtcars)

Convert the transmission variable am (0 = automatic, 1 = manual) into a factor

cars<- within(mtcars,am<- factor(am,levels = 0:1,labels = c("Automatic","Manual")))
with(cars,tapply(mpg, am, mean))
with(cars,tapply(mpg,list(gear,am),mean))

The reshape2 package converts between long and wide data

install.packages("reshape2")
library("reshape2")
goals<- data.frame(
  Game=c("1st","2nd","3rd","4th"),
  venue=c("Bruges","Ghent","Ghent","Bruges"),
  Granny=c(12,4,5,6),
  Gertrude=c(11,5,6,7)
  )
goals

To go from wide to long, melt the data with melt()

To go from long to wide, use dcast() or acast()

mgoals<- melt(goals,id.vars = c("Game","venue"))
mgoals
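The text mentions dcast() for the reverse direction but does not show it. Here is a minimal sketch of the round trip, assuming reshape2 is installed (the code is skipped otherwise) and using the illustrative name `wide` for the result:

```r
if (requireNamespace("reshape2", quietly = TRUE)) {
  goals <- data.frame(
    Game = c("1st", "2nd", "3rd", "4th"),
    venue = c("Bruges", "Ghent", "Ghent", "Bruges"),
    Granny = c(12, 4, 5, 6),
    Gertrude = c(11, 5, 6, 7)
  )

  # Wide to long: one row per Game, venue and player.
  mgoals <- reshape2::melt(goals, id.vars = c("Game", "venue"))

  # Long to wide: the id columns go on the left of ~, and the column whose
  # values become new column names goes on the right.
  wide <- reshape2::dcast(mgoals, Game + venue ~ variable, value.var = "value")
  print(wide)
}
```

acast() takes the same kind of formula but returns a matrix or array instead of a data frame.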
