Golang Notes: string, byte slices, ru

2020-03-07  酱油王0901

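We often run into conversions between string, byte slices, and rune, so here is a brief introduction.

A string is essentially a read-only slice of bytes.

Indexing a string yields its bytes, not its characters: a string is just a bunch of bytes.

rune is an alias for int32. It represents the Unicode code point of a character and is stored in 4 bytes. Converting a string to []rune means every character is stored in 4 bytes holding its Unicode value, so each step of an iteration returns a Unicode value rather than a byte.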

  • A string is an immutable byte sequence.
  • A byte slice is a mutable byte sequence.
  • A rune slice is a regrouping of a byte slice so that each index is a character.
 // rune is an alias for int32 and is equivalent to int32 in all ways. It is
 // used, by convention, to distinguish character values from integer values.
 type rune = int32
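
To make the three views concrete, here is a minimal sketch of converting between string, []byte, and []rune. The helper names describe and replaceFirstRune are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// describe reports the length of s viewed as bytes and as runes.
// Hypothetical helper for this illustration.
func describe(s string) (byteLen, runeLen int) {
	return len([]byte(s)), len([]rune(s))
}

// replaceFirstRune returns a copy of s whose first character is replaced by r.
// Strings are immutable, so the edit is made on a []rune copy.
func replaceFirstRune(s string, r rune) string {
	rs := []rune(s)
	if len(rs) == 0 {
		return s
	}
	rs[0] = r
	return string(rs)
}

func main() {
	s := "中国话"
	b, r := describe(s)
	fmt.Println(b, r)                      // 9 3
	fmt.Println(replaceFirstRune(s, '美')) // 美国话

	bs := []byte("go")
	bs[0] = 'G'             // byte slices are mutable
	fmt.Println(string(bs)) // Go
}
```

Each conversion copies the data, so for large inputs it is usually better to pick one representation and stay in it.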

Below we define placeOfInterest as a raw string. It is enclosed in back quotes, so it can contain only literal text.

package main

import "fmt"

func main() {
    // A raw string literal: back quotes, so no escape sequences are interpreted.
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    // %+q escapes non-ASCII characters as \u sequences.
    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    // Indexing a string yields individual bytes.
    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    // Ranging over a string decodes one rune per iteration.
    for _, ch := range placeOfInterest {
        fmt.Printf("\nUnicode character: %c", ch)
    }
    // len reports the number of bytes, not characters.
    fmt.Printf("\nThe length of placeOfInterest: %d", len(placeOfInterest))
    fmt.Printf("\n")

    const Chinese = "中国话"
    fmt.Println(len(Chinese))
    for index, runeValue := range Chinese {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }
}

The output is:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98
Unicode character: ⌘
The length of placeOfInterest: 3
9
U+4E2D '中' starts at byte position 0
U+56FD '国' starts at byte position 3
U+8BDD '话' starts at byte position 6

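From the output above we can see:

  1. The symbol ⌘ has the Unicode character value U+2318 and is made up of three bytes: e2 8c 98. These bytes are the UTF-8 encoding of the hexadecimal value 2318.
  2. When a string is traversed with for range, each value obtained is of type rune, whereas the indexed for loop outputs the individual bytes.
  3. Go uses UTF-8: Go source code is defined to be UTF-8 text, and no other representation is allowed. This means that when we write ⌘ in the code, the text editor places the UTF-8 encoding of the symbol ⌘ into the source text. So when we print the hexadecimal bytes, we are simply dumping the data the editor placed in the file.
  4. The length that len returns for a string is the number of bytes, not the number of characters.
  5. The Unicode standard uses the term code point to refer to the item represented by a single value. For example, the symbol ⌘, with hexadecimal value 2318, has the code point U+2318.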

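But because "code point" is a bit of a mouthful, Go introduces a shorter term for the concept: rune. The term appears throughout the libraries and source code and means essentially the same thing as code point. Go defines rune as an alias for int32, which makes it clearer when an integer value represents a code point. What you might think of as a character constant is therefore called a rune constant in Go. The expression '⌘' has type rune and integer value 0x2318.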
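
A quick way to check the claim about '⌘' is to print its type and value. This is a small sketch; typeName is a hypothetical helper added for the example.

```go
package main

import "fmt"

// typeName returns the dynamic type of v as fmt's %T verb prints it.
// Hypothetical helper for this illustration.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	r := '⌘' // rune constant; r gets the default type rune

	// Because rune is an alias, the type prints as int32.
	fmt.Println(typeName(r))           // int32
	fmt.Printf("%#x %U %c\n", r, r, r) // 0x2318 U+2318 ⌘

	// No conversion is needed: rune and int32 are the same type.
	var i int32 = r
	fmt.Println(i == 0x2318) // true
}
```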

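Note:
Unicode is only a character set. It assigns each symbol a numeric code but does not specify how that code should be stored. UTF-8 is the most widely used implementation of Unicode on the Internet.
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the byte length varying by symbol.
The UTF-8 encoding rules are:

  1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, UTF-8 encoding is therefore identical to ASCII.
  2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.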
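
The two rules above can be sketched as a hand-written encoder and compared against the standard library. The function encodeUTF8 is a hypothetical illustration: unlike utf8.EncodeRune, it assumes a valid code point and does not reject surrogates, negative values, or values above U+10FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 encodes a code point by hand following the two rules above.
// Illustrative sketch only: it assumes r is a valid Unicode code point.
func encodeUTF8(r rune) []byte {
	switch {
	case r < 0x80: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r < 0x10000: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F,
			0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
}

func main() {
	for _, r := range []rune{'A', 'é', '⌘', '中', '😀'} {
		// Reference encoding from the standard library.
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)

		got := encodeUTF8(r)
		fmt.Printf("%U: % x (matches standard library: %t)\n",
			r, got, string(got) == string(buf[:n]))
	}
}
```

For ⌘ (U+2318) the three-byte branch produces e2 8c 98, the same bytes the earlier program dumped.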


    Figure: UTF-8 encoding format

Summary

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences. Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. String values can contain arbitrary bytes, but when they are constructed from string literals without byte-level escapes, those bytes are (almost always) UTF-8.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.
  • A string is a convenient way to handle a short sequence of bytes or characters. Strings are immutable, so an operation that changes content, such as find-and-replace, creates a new string. This is very inefficient if the string is huge, such as file content. (Taking a substring also yields a new string value, although in the standard implementation it shares the underlying bytes rather than copying them.) [see Golang: String]
  • A byte slice is just like a string, but mutable, i.e. you can modify each byte or character. This is very efficient for working with file content, whether a text file, a binary file, or an IO stream from the network. [see Golang: Slice]
  • A rune slice is like a byte slice, except that each index is a character instead of a byte. This is best if you work with text that has lots of non-ASCII characters, such as Chinese text, math formulas ∑, or text with emoji ♥. [see Golang: Rune]
