Data Processing 2 (in R)

2021-02-16 北欧森林

Finding non-matching rows

library(dplyr)

a <- anti_join(x, y, by = "ID")  # keep only the rows of x that have no matching ID in y
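
To make the behavior concrete, here is a minimal sketch on two toy tables. The tables and their columns are hypothetical, invented for illustration. It uses base R's %in% to select the same rows that anti_join(x, y, by = "ID") would return, so it runs without dplyr.

```r
# Hypothetical toy tables, for illustration only
x <- data.frame(ID = c(1, 2, 3, 4), value = c("a", "b", "c", "d"),
                stringsAsFactors = FALSE)
y <- data.frame(ID = c(2, 4), other = c("p", "q"),
                stringsAsFactors = FALSE)

# Keep the rows of x whose ID does not appear in y
a <- x[!(x$ID %in% y$ID), ]
a$ID  # 1 3
```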

Loading copied content into R

# On macOS, read from pipe("pbpaste") instead of "clipboard"
df1 <- read.table("clipboard", header = TRUE, sep = "\t")

Finding rows where a column contains a given string

test[grep("aa", test$name), ]
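
The snippet above filters rows by the values in the name column. Selecting columns by their names is a different operation. The sketch below shows both on a hypothetical data frame whose column names are made up for illustration.

```r
# Hypothetical data frame, for illustration only
test <- data.frame(name = c("aab", "bcd", "xaa"),
                   aa_score = c(1, 2, 3),
                   other = c(4, 5, 6),
                   stringsAsFactors = FALSE)

# Rows whose `name` value contains "aa"
rows_with_aa <- test[grepl("aa", test$name), ]

# Columns whose column name contains "aa"
cols_with_aa <- test[, grepl("aa", names(test)), drop = FALSE]

rows_with_aa$name    # "aab" "xaa"
names(cols_with_aa)  # "aa_score"
```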

Mind the levels of a factor

b <- factor(1:3, levels = 1:5); b  # levels may include values absent from the data
## [1] 1 2 3
## Levels: 1 2 3 4 5
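
Unused levels matter because functions such as table() still report them. A short sketch of how to see them and, if they are unwanted, drop them:

```r
b <- factor(1:3, levels = 1:5)

# table() counts every declared level, including the unused ones
counts <- table(b)

# droplevels() discards levels that do not occur in the data
b2 <- droplevels(b)

nlevels(b)   # 5
levels(b2)   # "1" "2" "3"
```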

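Changing the order of factor levels (factors in R are either ordered or unordered; by default the levels are sorted alphabetically, in locale-dependent order rather than strict ASCII order)
For an unordered factor: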

# Create a factor whose levels are in the wrong order
sizes <- factor(c("small", "large", "large", "small", "medium")) 
sizes 
#> [1] small large large small medium 
#> Levels: large medium small
# Specify the level order explicitly
sizes <- factor(sizes, levels = c("small", "medium", "large")) 
sizes 
#> [1] small  large  large  small  medium 
#> Levels: small medium large

For an ordered factor:

sizes <- ordered(c("small", "large", "large", "small", "medium")) 
sizes <- ordered(sizes, levels = c("small", "medium", "large")) 
sizes 
#> [1] small large large small medium 
#> Levels: small < medium < large
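
The practical difference is that an ordered factor supports comparisons based on its level order. A small sketch, reusing the sizes example:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"),
                levels = c("small", "medium", "large"), ordered = TRUE)

# Comparison operators are defined for ordered factors
is_below_large <- sizes < "large"

# max() and sort() follow the level order, not alphabetical order
largest <- max(sizes)
sorted <- sort(sizes)

is_below_large         # TRUE FALSE FALSE TRUE TRUE
as.character(largest)  # "large"
```

On an unordered factor the same comparison is not meaningful and R returns NA with a warning.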

Bonus:

# Quickly reverse the level order
sizes <- factor(sizes, levels=rev(levels(sizes)))

source: https://sr-c.github.io/2018/09/16/Changing-the-order-of-levels-of-a-factor/

The difference between row.names and rownames:
There are two functions in the R core library:

row.names: Get and Set Row Names for Data Frames
rownames: Retrieve or set the row names of a matrix-like object.

If you don't want to bother distinguishing the two functions, then it would be logical to just use the generic version row.names() all the time, since it always dispatches the appropriate method. For example, if x is a matrix, then row.names(x) just passes cleanly through to rownames(x) because there is no more specific method for that class of object.
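
A quick check of that claim on a toy matrix and data frame (both objects are made up for illustration):

```r
m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b")))
df <- data.frame(a = 1:2, b = 3:4, row.names = c("r1", "r2"))

rownames(m)    # "r1" "r2"
row.names(m)   # "r1" "r2": the default method falls back to rownames()
rownames(df)   # "r1" "r2"
row.names(df)  # "r1" "r2": the data-frame method
```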

Renaming columns

library(tidyverse)
plyr::rename(d, c("old2"="two", "old3"="three"))

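# Note: plyr::rename and dplyr::rename take their arguments in opposite order.
## plyr::rename: a named character vector, c("old" = "new")
plyr::rename(data, c("old" = "new"))

## dplyr::rename: new = old
dplyr::rename(data, new = old)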

# Method 2
library(reshape) # load the required package
dat <- reshape::rename(dat, c(国家 = "country")) # same c(old = "new") form as plyr
head(dat)

# Method 3: set the column names to x1, x2, ..., x10
cnames <- paste0("x", 1:10)
colnames(dat) <- cnames
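
If you would rather avoid the plyr/dplyr ambiguity altogether, a lookup vector in base R does the same job. This is a minimal sketch; the data frame d and the variable names lookup and hit are hypothetical.

```r
d <- data.frame(old1 = 1:2, old2 = 3:4, old3 = 5:6)

# old name = new name
lookup <- c(old2 = "two", old3 = "three")

# Rename only the columns that appear in the lookup
hit <- names(d) %in% names(lookup)
names(d)[hit] <- unname(lookup[names(d)[hit]])

names(d)  # "old1" "two" "three"
```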

Replacing certain values in a dataset

library(stringr)

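# The pattern to replace is the second argument. "?" is a regex metacharacter,
# so wrap the pattern in fixed() to match it literally.
str_replace_all(a$AFP, fixed("?1250"), "1250")

# The two calls below are equivalent; pattern is the text being replaced.
gsub("?800", "800", a$AFP, fixed = TRUE)
gsub(pattern = "?800", replacement = "800", x = a$AFP, fixed = TRUE)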
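
Because "?" has a special meaning in regular expressions, it is safer to match it literally. A base-R sketch on made-up values (the vector afp is hypothetical) showing two ways to do that:

```r
afp <- c("?1250", "300", "?800")   # hypothetical values containing a stray "?"

# fixed = TRUE treats the pattern as a literal string
cleaned <- gsub("?", "", afp, fixed = TRUE)

# Alternatively, escape the metacharacter in a regular expression
cleaned_regex <- gsub("\\?", "", afp)

as.numeric(cleaned)  # 1250 300 800
```

Note that gsub() returns a new vector, so assign the result back if the column itself should change.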

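Removing highly correlated variables

library(caret)  # provides findCorrelation()

datTrain1 = datTrain[, -c(1, 6)]  # drop columns 1 and 6 before computing correlations
descrCor = cor(datTrain1)
descrCor

highlyCorDescr = findCorrelation(descrCor, cutoff = .75, names = FALSE, verbose = TRUE)
# Guard against an empty result: datTrain1[, -integer(0)] would drop every column
filteredTrain = if (length(highlyCorDescr) > 0) datTrain1[, -highlyCorDescr] else datTrain1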
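
One pitfall with negative indexing of this kind: if the index vector is empty (no column exceeded the cutoff), the subset returns zero columns instead of leaving the data unchanged. A base-R sketch on a toy data frame, with hypothetical names df, none and safe:

```r
df <- data.frame(a = 1:3, b = 4:6)
none <- integer(0)   # stands in for an empty result

# -integer(0) is still integer(0), so this selects no columns at all
ncol(df[, -none])    # 0

# Only drop columns when there is something to drop
safe <- if (length(none) > 0) df[, -none, drop = FALSE] else df
ncol(safe)           # 2
```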

Standardizing the test set (with parameters estimated from the training set):

library(caret)
preProcValues = preProcess(datTrain, method = c("center", "scale"))
trainTransformed = predict(preProcValues, datTrain)
testTransformed = predict(preProcValues,datTest)
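
The key point is that the test set is transformed with the training set's means and standard deviations, not its own. The same idea can be sketched in base R on toy data; the data frames and the names mu and sigma are hypothetical, and this is an illustration of the principle rather than of caret's internals.

```r
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40))
test  <- data.frame(x1 = c(2, 5), x2 = c(15, 25))

# Estimate the parameters on the training set only
mu <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same parameters to both sets
train_scaled <- scale(train, center = mu, scale = sigma)
test_scaled  <- scale(test,  center = mu, scale = sigma)

colMeans(train_scaled)  # approximately 0 for every column
```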

Removing near-zero-variance variables

nzv = nearZeroVar(datTrain)
nzv
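
The snippet above only identifies the offending columns; they still have to be dropped. The base-R sketch below shows the general idea on toy data. The helper is_near_zero_var and its thresholds are assumptions made for illustration: it is a simplified approximation of a frequency-ratio and percent-unique criterion, not caret's implementation.

```r
# Hypothetical helper: flags a column when one value dominates
# and there are few distinct values (thresholds are assumptions)
is_near_zero_var <- function(v, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)   # constant column
  freq_ratio <- counts[[1]] / counts[[2]]
  pct_unique <- 100 * length(counts) / length(v)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

d <- data.frame(
  constant = rep(1, 100),
  rare     = c(rep(0, 98), 1, 1),
  varied   = seq_len(100)
)

flags <- vapply(d, is_near_zero_var, logical(1))
d_kept <- d[, !flags, drop = FALSE]
names(d_kept)  # "varied"
```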

Data types required for x and y in lasso regression

class(x)
class(y)
# [1] "data.frame"
# [1] "numeric"

x <- as.matrix(x)
y <- as.numeric(unlist(y))
class(x)
class(y)
# [1] "matrix" "array" 
# [1] "numeric"
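
One caveat when a function requires a numeric matrix: as.matrix() on a data frame that contains any text column produces a character matrix. A sketch of the problem and one common workaround, model.matrix(), on a hypothetical data frame:

```r
x <- data.frame(age = c(30, 40, 50), group = c("a", "b", "a"),
                stringsAsFactors = FALSE)

# One text column turns the whole matrix into text
is.character(as.matrix(x))  # TRUE

# model.matrix() expands categorical columns into numeric dummy variables;
# [, -1] drops the intercept column
x_num <- model.matrix(~ ., data = x)[, -1]
is.numeric(x_num)           # TRUE
colnames(x_num)             # "age" "groupb"
```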
