R: A Few Notes on split

2021-07-14, 日月其除

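I have been doing a lot of file manipulation lately and keep reaching for a split function. R has three of them and I sometimes mix them up, so this post is here to keep them straight.
1. split()
2. str_split()
3. strsplit()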

split()

Usage
split(x, f, drop = FALSE, ...)
## Default S3 method:
split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
split(x, f, drop = FALSE, ...) <- value
unsplit(value, f, drop = FALSE)


Arguments
x   vector or data frame containing values to be divided into groups.

f   a ‘factor’ in the sense that as.factor(f) defines the grouping, or a list of such factors in which case their interaction is used for the grouping. If x is a data frame, f can also be a formula of the form ~ g to split by the variable g, or more generally of the form ~ g1 + ... + gk to split by the interaction of the variables g1, ..., gk, where these variables are evaluated in the data frame x using the usual non-standard evaluation rules.

drop     logical indicating if levels that do not occur should be dropped (if f is a factor or a list).

value   a list of vectors or data frames compatible with a splitting of x. Recycling applies if the lengths do not match.

sep character string, passed to interaction in the case where f is a list.

lex.order   logical, passed to interaction when f is a list.

... further potential arguments passed to methods.
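
To make the drop and sep arguments above concrete, here is a minimal sketch. The vectors x, g and h are made-up illustration data, not taken from the post.

```r
# Hypothetical data: the factor g has an unused level "c"
x <- c(10, 20, 30, 40)
g <- factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))

# Default drop = FALSE keeps the empty level "c" as a zero-length group
kept <- split(x, g)
names(kept)              # "a" "b" "c"

# drop = TRUE removes levels that do not occur
dropped <- split(x, g, drop = TRUE)
names(dropped)           # "a" "b"

# A list of factors groups by their interaction; sep joins the level names
h <- factor(c("u", "u", "v", "v"))
by_both <- split(x, list(g, h), drop = TRUE, sep = "_")
by_both[["a_v"]]         # 30
```

Without drop = TRUE the last call would also return empty groups for combinations that never occur, such as "c_u".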

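Summary: split(x, f), where x is a vector, list, or data frame and f is a factor or a list of factors.
split() groups the elements of a vector or the rows of a data frame and returns a list.
unsplit() reverses the operation directly.
split() divides x by a factor, not by a delimiter inside the strings, so it cannot be used as below. The single string '_' is treated as one group, and nothing is split: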

> split(c('1_1', '2-2', '3_3'), '_')
$`_`
[1] "1_1" "2-2" "3_3"

Splitting a data frame:

> data = data.frame(v1 = c(1,1,2,2,3,3), v2 = c('a', 'b', 'c', 'd','e','f'))
> data
  v1 v2
1  1  a
2  1  b
3  2  c
4  2  d
5  3  e
6  3  f
> split(data, data$v1) # returns a list, grouped by v1
$`1`
  v1 v2
1  1  a
2  1  b

$`2`
  v1 v2
3  2  c
4  2  d

$`3`
  v1 v2
5  3  e
6  3  f
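
As a companion to the data frame example, the sketch below round-trips through split() and unsplit(). The row order of data is shuffled here (an assumption for illustration) to show that unsplit() puts the rows back in their original positions rather than simply stacking the groups.

```r
# Hypothetical data frame with the groups interleaved
data <- data.frame(v1 = c(1, 2, 1, 3, 2, 3),
                   v2 = c("a", "b", "c", "d", "e", "f"))

groups <- split(data, data$v1)          # list of three data frames
restored <- unsplit(groups, data$v1)    # same f as in the split() call

# The column values come back in the original row order
identical(restored$v2, data$v2)         # TRUE
```

The same f must be passed to both calls, since unsplit() uses it to decide where each row belongs.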

Splitting a vector:

> x = c(rep(1:10, 2))
> f = gl(10,1)
> x
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9
[20] 10
> f
 [1] 1  2  3  4  5  6  7  8  9  10
Levels: 1 2 3 4 5 6 7 8 9 10
> split(x,f)
$`1`
[1] 1 1

$`2`
[1] 2 2

$`3`
[1] 3 3

$`4`
[1] 4 4

$`5`
[1] 5 5

$`6`
[1] 6 6

$`7`
[1] 7 7

$`8`
[1] 8 8

$`9`
[1] 9 9

$`10`
[1] 10 10
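
A common next step after splitting a vector is to summarise each group. The following is a small sketch of that pattern using the same x and f as above; sapply() and sum() are my choice here, and any function of a vector would do.

```r
x <- rep(1:10, 2)
f <- gl(10, 1)     # length 10, recycled along the 20 elements of x

# One sum per group, returned as a named vector
group_sums <- sapply(split(x, f), sum)
group_sums[["3"]]    # 6
group_sums[["10"]]   # 20
```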

str_split()
From the stringr package.
It comes in two forms: str_split() and str_split_fixed().
str_split() with simplify = TRUE returns the same matrix as str_split_fixed() when both are given the same n.

Usage
str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n)

Arguments
string  Input vector. Either a character vector, or something coercible to one.

pattern Pattern to look for.

The default interpretation is a regular expression, as described in stringi::stringi-search-regex. Control options with regex().

Match a fixed string (i.e. by comparing only bytes), using fixed(). This is fast, but approximate. Generally, for matching human text, you'll want coll() which respects character matching rules for the specified locale.

Match character, word, line and sentence boundaries with boundary(). An empty pattern, "", is equivalent to boundary("character").

n   number of pieces to return. Default (Inf) uses all possible split positions.

For str_split_fixed, if n is greater than the number of pieces, the result will be padded with empty strings.

simplify    If FALSE, the default, returns a list of character vectors. If TRUE returns a character matrix.

str_split() is mainly used to split a vector of strings. It returns a list.
str_split_fixed() returns a character matrix.
For example:

> library(stringr)
> str_split(c('1_2','1_1','2_2','3'), '_')
[[1]]
[1] "1" "2"

[[2]]
[1] "1" "1"

[[3]]
[1] "2" "2"

[[4]]
[1] "3"

> str_split_fixed(c('1_2','1_1','2_2','3'), pattern = '_', n =2)
     [,1] [,2]
[1,] "1"  "2" 
[2,] "1"  "1" 
[3,] "2"  "2" 
[4,] "3"  ""

> str_split(c('1_2','1_1','2_2','3'), pattern = '_', n =2, simplify = T)
     [,1] [,2]
[1,] "1"  "2" 
[2,] "1"  "1" 
[3,] "2"  "2" 
[4,] "3"  ""
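
The effect of n and of the pattern helpers described in the Arguments list is easier to see with longer strings. This is a minimal sketch that assumes stringr is installed; ids is a made-up input vector.

```r
library(stringr)  # assumes the stringr package is installed

ids <- c("a_b_c", "d_e", "f")   # hypothetical input

# n = 2 stops after the first split and leaves the rest of the string intact
first_two <- str_split(ids, "_", n = 2)
first_two[[1]]                      # "a"   "b_c"

# simplify = TRUE without n sizes the matrix to the longest element
m <- str_split(ids, "_", simplify = TRUE)
dim(m)                              # 3 3

# fixed() treats the pattern as a literal string rather than a regex
str_split("1.2", fixed("."))[[1]]   # "1" "2"
```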

strsplit()
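Splits each element of a character vector and returns a list.
split is interpreted as a regular expression by default, so a bare "." matches every character. To split on a literal dot, use fixed = TRUE or escape it (for example "[.]"). Delimiters that are not regex metacharacters, such as "_", work without fixed = TRUE.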

Description
Split the elements of a character vector x into substrings according to the matches to substring split within them.

Usage
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

Arguments
x   character vector, each element of which is to be split. Other inputs, including a factor, will give an error.

split   character vector (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.

fixed   logical. If TRUE match split exactly, otherwise use regular expressions. Has priority over perl.

perl    logical. Should Perl-compatible regexps be used?

useBytes    logical. If TRUE the matching is done byte-by-byte rather than character-by-character, and inputs with marked encodings are not converted. This is forced (with a warning) if any input is found which is marked as "bytes" (see Encoding).

Examples:

> strsplit(c('1.2','1.1','2.2','3'), split = '.')
[[1]]
[1] "" "" ""

[[2]]
[1] "" "" ""

[[3]]
[1] "" "" ""

[[4]]
[1] ""

> strsplit(c('1.2','1.1','2.2','3'), split = '.', fixed = T)
[[1]]
[1] "1" "2"

[[2]]
[1] "1" "1"

[[3]]
[1] "2" "2"

[[4]]
[1] "3"

> strsplit("a.b.c", "[.]")
[[1]]
[1] "a" "b" "c"

> strsplit(c('1_2','1_1','2_2','3'), split = '_')
[[1]]
[1] "1" "2"

[[2]]
[1] "1" "1"

[[3]]
[1] "2" "2"

[[4]]
[1] "3"

## Note that final empty strings are not produced:
strsplit(paste(c("", "a", ""), collapse="#"), split="#")[[1]]
# [1] ""  "a"
## and also an empty string is only produced before a definite match:
strsplit("", " ")[[1]]    # character(0)
strsplit(" ", " ")[[1]]   # [1] ""
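
The dot is not the only delimiter that needs care with strsplit(). The sketch below uses "|" (regex alternation) as a second case and then shows one way to pull a single field out of the resulting list. The input vector is made up, and vapply() is just one reasonable choice for the extraction.

```r
x <- c("1|2", "1|1", "2|2", "3")   # hypothetical input using "|" as delimiter

# Unescaped, "|" matches the empty string, so each element is split
# into single characters
strsplit(x, "|")[[1]]               # "1" "|" "2"

# With fixed = TRUE the delimiter is taken literally
parts <- strsplit(x, "|", fixed = TRUE)
parts[[1]]                          # "1" "2"

# First piece of every element, as a plain character vector
first <- vapply(parts, `[`, character(1), 1)
first                               # "1" "1" "2" "3"
```

vapply() with character(1) fails loudly if an element does not yield exactly one string, which is a little safer than sapply() when the input is messy.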
