Golang Notes: string, byte slices, rune
We often need to convert between string, byte slices, and rune. This note gives a brief introduction.
A string is essentially a read-only slice of bytes. Indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.
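As a minimal sketch of the point above, the following program indexes a string and prints the raw byte it gets back. The helper name firstByte is made up for this illustration.

```go
package main

import "fmt"

// firstByte returns the byte at index 0 of s (s must be non-empty).
// Hypothetical helper, used only for this illustration.
func firstByte(s string) byte {
	return s[0]
}

func main() {
	s := "⌘"
	// Indexing gives one byte of the UTF-8 encoding, not the whole character.
	fmt.Printf("%x\n", firstByte(s)) // e2
	fmt.Println(len(s))              // 3

	// s[0] = 'x' // would not compile: strings are read-only
}
```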
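rune is an alias for int32. It represents the Unicode code point of a character and is stored in 4 bytes. Converting a string to []rune therefore stores every character in 4 bytes holding its Unicode value, so iterating over the result yields Unicode values rather than bytes.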
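A small sketch of that conversion, assuming a made-up helper toRunes: the same text has 9 bytes as a string but 3 elements as a rune slice.

```go
package main

import "fmt"

// toRunes converts s into a slice holding one rune (code point) per element.
// Hypothetical helper, used only for this illustration.
func toRunes(s string) []rune {
	return []rune(s)
}

func main() {
	s := "中国话"
	r := toRunes(s)
	fmt.Println(len(s), len(r)) // 9 3
	for i, v := range r {
		// Each index now addresses a whole character.
		fmt.Printf("%d: %c (%U)\n", i, v, v)
	}
}
```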
String is an immutable byte sequence.
Byte slice is a mutable byte sequence.
Rune slice is a re-grouping of a byte slice so that each index is a character.

The builtin package declares rune as follows:

	// rune is an alias for int32 and is equivalent to int32 in all ways. It is
	// used, by convention, to distinguish character values from integer values.
	type rune = int32
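The difference in mutability can be sketched as follows. The two helpers, replaceFirstByte and replaceFirstRune, are illustrative names and not part of any library. Each converts the string to a mutable slice, changes one element, and converts back.

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is b.
// Hypothetical helper, used only for this illustration.
func replaceFirstByte(s string, b byte) string {
	bs := []byte(s) // the conversion copies the bytes; s itself is untouched
	if len(bs) == 0 {
		return s
	}
	bs[0] = b
	return string(bs)
}

// replaceFirstRune returns a copy of s whose first character is r.
// Hypothetical helper, used only for this illustration.
func replaceFirstRune(s string, r rune) string {
	rs := []rune(s)
	if len(rs) == 0 {
		return s
	}
	rs[0] = r
	return string(rs)
}

func main() {
	fmt.Println(replaceFirstByte("hello", 'H'))  // Hello
	fmt.Println(replaceFirstRune("中国话", '外')) // 外国话
}
```

Replacing a single byte is only safe for ASCII text. On multi-byte characters it would leave a broken UTF-8 sequence, which is why the second helper works on runes.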
Below we define placeOfInterest as a raw string. It is enclosed in back quotes, so it can contain only literal text.
package main

import "fmt"

func main() {
	// Raw string literal: back quotes, no escape processing.
	const placeOfInterest = `⌘`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", placeOfInterest)
	fmt.Printf("\n")

	// %+q escapes every non-ASCII character.
	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", placeOfInterest)
	fmt.Printf("\n")

	// An indexed loop walks the string byte by byte.
	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}

	// A range loop decodes one rune per iteration.
	for _, ch := range placeOfInterest {
		fmt.Printf("\nUnicode character: %c", ch)
	}

	fmt.Printf("\nThe length of placeOfInterest: %d", len(placeOfInterest))
	fmt.Printf("\n")

	const Chinese = "中国话"
	fmt.Println(len(Chinese)) // length in bytes
	for index, runeValue := range Chinese {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}
The output is:
plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
Unicode character: ⌘
The length of placeOfInterest: 3
9
U+4E2D '中' starts at byte position 0
U+56FD '国' starts at byte position 3
U+8BDD '话' starts at byte position 6
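From the output above we can see:
- The symbol ⌘ has the Unicode character value U+2318 and is made up of three bytes: e2 8c 98. These bytes are the UTF-8 encoding of the hexadecimal value 2318.
- When a string is traversed with for range, each value obtained is of type rune. The indexed for loop, by contrast, yields the individual bytes.
- Go uses UTF-8: Go source code is defined to be UTF-8 text, and no other representation is allowed. So when we write ⌘ in the code, the editor places the UTF-8 encoding of ⌘ into the source text. When we print the hexadecimal bytes, we are simply dumping the data the editor placed in the file.
- The len function applied to a string returns the number of bytes, not the number of characters.
- The Unicode standard uses the term code point to refer to the item represented by a single value. For example, the symbol ⌘ has the hexadecimal value 2318, and its code point is U+2318.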
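To make the byte count versus character count distinction concrete, here is a short sketch that compares len with utf8.RuneCountInString from the standard unicode/utf8 package. The wrapper byteAndRuneCount is a name invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount reports the byte length and the code point count of s.
// Hypothetical helper, used only for this illustration.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"⌘", "中国话", "Go"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```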
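Because "code point" is a bit of a mouthful, Go introduces a shorter term for the concept: rune. The term appears frequently in libraries and source code, and it means essentially the same thing as "code point". Go defines rune as an alias for int32, which makes it clear when an integer value stands for a code point. For the same reason, what you might think of as a character constant is called a rune constant in Go. The expression '⌘' has type rune and integer value 0x2318.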
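The type and value of a rune constant can be checked directly. In this sketch, describeRune is an illustrative helper. Since rune is an alias, the %T verb reports int32.

```go
package main

import "fmt"

// describeRune formats a rune's Go type and hexadecimal value.
// Hypothetical helper, used only for this illustration.
func describeRune(r rune) string {
	return fmt.Sprintf("%T %#x", r, r)
}

func main() {
	fmt.Println(describeRune('⌘')) // int32 0x2318

	var r rune = '⌘'
	var i int32 = r          // no conversion needed: rune and int32 are the same type
	fmt.Println(i == 0x2318) // true
}
```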
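Note that Unicode is only a character set. It assigns a binary code to each symbol but does not specify how that code should be stored. UTF-8 is the most widely used implementation of Unicode on the internet.
The most distinctive feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
The UTF-8 encoding rules are:
- For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.
- For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0. Each following byte starts with the two bits 10. All remaining bits hold the symbol's Unicode code point.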
(Figure: UTF-8 encoding format)
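The multi-byte rule can be worked through by hand for ⌘ (U+2318), which needs three bytes. The sketch below applies the pattern 1110xxxx 10xxxxxx 10xxxxxx and compares the result with utf8.EncodeRune from the standard library. The function encode3 is a made-up helper. It only handles code points in the three-byte range U+0800 to U+FFFF and skips all validation.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encode3 hand-encodes a code point in the range U+0800..U+FFFF using the
// three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
// Hypothetical helper for illustration; it does no range checking.
func encode3(r rune) []byte {
	return []byte{
		0xE0 | byte(r>>12),          // 1110 + top 4 bits
		0x80 | (byte(r>>6) & 0x3F),  // 10 + middle 6 bits
		0x80 | (byte(r) & 0x3F),     // 10 + low 6 bits
	}
}

func main() {
	r := '⌘' // U+2318 = 0010 0011 0001 1000
	manual := encode3(r)

	// Reference encoding from the standard library.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)

	fmt.Printf("% x\n", manual)               // e2 8c 98
	fmt.Println(bytes.Equal(manual, buf[:n])) // true
}
```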
Summary
- Go source code is always UTF-8.
- A string holds arbitrary bytes.
- A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. String values can contain arbitrary bytes, but when they are constructed from string literals without byte-level escapes, those bytes are (almost always) UTF-8.
- Those sequences represent Unicode code points, called runes.
- No guarantee is made in Go that characters in strings are normalized.
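A brief sketch of the summary points, using utf8.ValidString from the standard library. The wrapper isUTF8 and the sample byte values are chosen only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isUTF8 reports whether s consists entirely of valid UTF-8 sequences.
// Hypothetical helper, used only for this illustration.
func isUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	literal := "中国话"        // plain literal: valid UTF-8
	escaped := "\xbd\xb2\x3d" // byte-level escapes: arbitrary bytes

	fmt.Println(isUTF8(literal)) // true
	fmt.Println(isUTF8(escaped)) // false

	// Ranging over invalid bytes yields U+FFFD, the replacement character.
	for i, r := range escaped {
		fmt.Printf("%d: %U\n", i, r)
	}

	// Strings are not normalized: two spellings of "é" compare as different.
	fmt.Println("\u00e9" == "e\u0301") // false
}
```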
String is a nice way to deal with a short sequence of bytes or characters. Every time you change a string, for example with find-and-replace or concatenation, a new string is created (taking a substring shares the underlying bytes, but any modification allocates a new string). This is very inefficient if the string is huge, such as file content. [see Golang: String]
Byte slice is just like string, but mutable, i.e. you can modify each byte or character. This is very efficient for working with file content, whether a text file, a binary file, or an IO stream from the network. [see Golang: Slice]
Rune slice is like byte slice, except that each index is a character instead of a byte. This is best if you work with text that has lots of non-ASCII characters, such as Chinese text, math formulas ∑, or text with emoji ♥. [see Golang: Rune]
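One case where a rune slice is the natural choice is reversing text, since reversing the bytes of a multi-byte character would corrupt it. The following is a minimal sketch with an illustrative helper, reverseRunes. It works on code points, so sequences built from combining marks would still need extra care.

```go
package main

import "fmt"

// reverseRunes reverses s one code point at a time.
// Hypothetical helper, used only for this illustration.
func reverseRunes(s string) string {
	rs := []rune(s)
	for i, j := 0, len(rs)-1; i < j; i, j = i+1, j-1 {
		rs[i], rs[j] = rs[j], rs[i]
	}
	return string(rs)
}

func main() {
	fmt.Println(reverseRunes("中国话")) // 话国中
	fmt.Println(reverseRunes("a∑b♥"))  // ♥b∑a
}
```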