10. Character Sets
An NSCharacterSet object represents a set of Unicode characters. NSString and NSScanner objects use NSCharacterSet objects to group characters together for searching operations, so that they can find any of a particular set of characters during a search.
Character Set Basics
A character set object represents a set of Unicode characters. Character sets are represented by instances of a class cluster. The cluster’s two public classes, NSCharacterSet and NSMutableCharacterSet, declare the programmatic interface for immutable and mutable character sets, respectively. An immutable character set is defined when it is created and subsequently cannot be changed. A mutable character set can be changed after it’s created.
A character set object doesn’t perform any tasks; it simply holds a set of character values to limit operations on strings. The NSString and NSScanner classes define methods that take NSCharacterSet objects as arguments to find any of several characters. For example, this code excerpt finds the range of the first uppercase letter in myString:
NSString *myString = @"some text in an NSString...";
NSCharacterSet *characterSet = [NSCharacterSet uppercaseLetterCharacterSet];
NSRange letterRange = [myString rangeOfCharacterFromSet:characterSet];
After this fragment executes, letterRange.location is equal to the index of the first “N” in “NSString”. If the string instead began with an uppercase letter such as “S”, letterRange.location would be 0. If the string contained no uppercase letter at all, letterRange.location would be NSNotFound.
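The search can fail, so the returned range is worth checking before it is used. The following is a minimal sketch of that check; IndexOfFirstUppercaseLetter is a hypothetical helper name introduced here for illustration and is not part of Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Foundation API): returns the index of the first
// uppercase letter in `string`, or NSNotFound when the string has none.
static NSUInteger IndexOfFirstUppercaseLetter(NSString *string) {
    NSCharacterSet *uppercase = [NSCharacterSet uppercaseLetterCharacterSet];
    NSRange range = [string rangeOfCharacterFromSet:uppercase];
    // When nothing matches, rangeOfCharacterFromSet: returns {NSNotFound, 0},
    // so the location can be passed straight through to the caller.
    return range.location;
}
```

A caller would compare the result against NSNotFound before indexing into the string.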
Creating Character Sets
NSCharacterSet defines class methods that return commonly used character sets, such as letters (uppercase or lowercase), decimal digits, whitespace, and so on. These “standard” character sets are always immutable, even if created by sending a message to NSMutableCharacterSet. See Standard Character Sets and Unicode Definitions for more information on standard character sets.
You can use a standard character set as a starting point for building a custom set by making a mutable copy of it and changing that. (You can also start from scratch by creating a mutable character set with alloc and init and adding characters to it.) For example, this fragment creates a character set containing letters, digits, and basic punctuation:
NSMutableCharacterSet *workingSet = [[NSCharacterSet alphanumericCharacterSet] mutableCopy];
[workingSet addCharactersInString:@";:,."];
NSCharacterSet *finalCharacterSet = [workingSet copy];
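One common use for such a custom set is validating input. The sketch below checks that a string contains only characters from an allowed set by searching for any character in the inverted set. Both function names are hypothetical and introduced only for this illustration; the set is rebuilt inside the sketch so that it stands on its own.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rebuilds the letters, digits, and basic punctuation set
// so that this sketch is self-contained.
static NSCharacterSet *MakePunctuatedAlphanumericSet(void) {
    NSMutableCharacterSet *workingSet =
        [[NSCharacterSet alphanumericCharacterSet] mutableCopy];
    [workingSet addCharactersInString:@";:,."];
    return [workingSet copy]; // immutable copy for actual use
}

// Hypothetical helper: YES when every character of `string` is in `allowedSet`.
static BOOL StringContainsOnlyCharactersInSet(NSString *string,
                                              NSCharacterSet *allowedSet) {
    // Any character outside the allowed set is a member of the inverted set.
    NSCharacterSet *disallowedSet = [allowedSet invertedSet];
    return [string rangeOfCharacterFromSet:disallowedSet].location == NSNotFound;
}
```

Because inverting a set has a cost, code that validates many strings would probably invert once and reuse the result.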
To define a custom character set using Unicode code points, use code similar to the following fragment (which creates a character set including the form feed and line separator characters):
UniChar chars[] = {0x000C, 0x2028};
NSString *string = [[NSString alloc] initWithCharacters:chars
length:sizeof(chars) / sizeof(UniChar)];
NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
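To confirm that a set built from code points contains what you expect, you can query it with characterIsMember:. The sketch below wraps the construction in a function with a hypothetical name, MakeFormFeedAndLineSeparatorSet, and uses Foundation's unichar type for the array.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a set holding form feed (U+000C) and
// line separator (U+2028) from their code points.
static NSCharacterSet *MakeFormFeedAndLineSeparatorSet(void) {
    unichar chars[] = {0x000C, 0x2028};
    NSString *string =
        [[NSString alloc] initWithCharacters:chars
                                      length:sizeof(chars) / sizeof(chars[0])];
    return [NSCharacterSet characterSetWithCharactersInString:string];
}

// Hypothetical helper: membership test for a single UTF-16 code unit.
static BOOL SetContainsCodeUnit(NSCharacterSet *set, unichar codeUnit) {
    return [set characterIsMember:codeUnit];
}
```

Note that characterIsMember: takes a single UTF-16 code unit, so it only covers characters in the Basic Multilingual Plane.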
Performance considerations
Because character sets often participate in performance-critical code, you should be aware of the aspects of their use that can affect the performance of your application. Mutable character sets are generally much more expensive than immutable character sets. They consume more memory and are costly to invert (an operation often performed in scanning a string). Because of this, you should follow these guidelines:
- Create as few mutable character sets as possible.
- Cache character sets (in a global dictionary, perhaps) instead of continually recreating them.
- When creating a custom set that doesn’t need to change after creation, make an immutable copy of the final character set for actual use, and dispose of the working mutable character set. Alternatively, create a character set file as described in Creating a character set file and store it in your application’s main bundle.
- Similarly, avoid archiving character set objects; store them in character set files instead. Archiving can result in a character set being duplicated in different archive files, resulting in wasted disk space and duplicates in memory for each separate archive read.
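The caching and immutable-copy guidelines can be combined in one accessor. The following is a sketch, assuming ARC and an Apple platform where dispatch_once is available through Foundation. It uses a lazily created shared instance rather than the global dictionary the guideline mentions, and SharedPunctuatedAlphanumericSet is a hypothetical name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical accessor: builds the custom set once and returns the same
// immutable instance on every later call.
static NSCharacterSet *SharedPunctuatedAlphanumericSet(void) {
    static NSCharacterSet *sharedSet = nil;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        // The mutable working set exists only inside this block.
        NSMutableCharacterSet *workingSet =
            [[NSCharacterSet alphanumericCharacterSet] mutableCopy];
        [workingSet addCharactersInString:@";:,."];
        // Keep an immutable copy for actual use.
        sharedSet = [workingSet copy];
    });
    return sharedSet;
}
```

An application with many custom sets might prefer a dictionary keyed by name, as the guideline suggests; the single accessor is just the smallest form of the same idea.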
Creating a character set file
If your application frequently uses a custom character set, you should save its definition in a resource file and load that instead of explicitly adding individual characters each time you need to create the set. You can save a character set by getting its bitmap representation (an NSData object) and saving that object to a file:
NSData *charSetRep = [finalCharacterSet bitmapRepresentation];
NSURL *dataURL = <#URL for character set#>;
NSError *error;
BOOL result = [charSetRep writeToURL:dataURL options:NSDataWritingAtomic error:&error];
By convention, character set filenames use the extension .bitmap. If you intend for others to use your character set files, you should follow this convention. To read a character set file with a .bitmap extension, simply use the characterSetWithContentsOfFile: method.
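As an illustration of the full round trip, the sketch below saves a set's bitmap representation and loads it back. It reads the file through NSData and characterSetWithBitmapRepresentation:, which is an alternative to characterSetWithContentsOfFile: that works for any path. The two function names are hypothetical, and the file path is whatever the caller supplies.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: writes the bitmap representation of `set` to `path`.
// Returns NO and logs the error if the write fails.
static BOOL SaveCharacterSetToFile(NSCharacterSet *set, NSString *path) {
    NSData *bitmap = [set bitmapRepresentation];
    NSError *error = nil;
    BOOL ok = [bitmap writeToFile:path options:NSDataWritingAtomic error:&error];
    if (!ok) {
        NSLog(@"Could not write character set file: %@", error);
    }
    return ok;
}

// Hypothetical helper: rebuilds a character set from a bitmap file,
// or returns nil when the file cannot be read.
static NSCharacterSet *LoadCharacterSetFromFile(NSString *path) {
    NSData *bitmap = [NSData dataWithContentsOfFile:path];
    if (bitmap == nil) {
        return nil;
    }
    return [NSCharacterSet characterSetWithBitmapRepresentation:bitmap];
}
```

In an application, the saved file would typically be shipped as a bundle resource and loaded once at startup.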
Standard Character Sets and Unicode Definitions
The standard character sets, such as that returned by letterCharacterSet, are formally defined in terms of the normative and informative categories established by the Unicode standard, such as Uppercase Letter, Combining Mark, and so on. The formal definition of a standard character set is in most cases given as one or more of the categories defined in the standard. For example, the set returned by lowercaseLetterCharacterSet includes all characters in the normative category Lowercase Letter, while the set returned by letterCharacterSet includes the characters in all of the Letter categories.
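A quick way to see that these sets follow Unicode categories rather than ASCII ranges is to test a few non-ASCII characters. The sketch below uses two hypothetical helper names. The sample characters are U+00E9 (é), which is in category Lowercase Letter, and U+4E2D (中), which is in category Other Letter.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when `c` is in lowercaseLetterCharacterSet.
static BOOL IsLowercaseLetter(unichar c) {
    return [[NSCharacterSet lowercaseLetterCharacterSet] characterIsMember:c];
}

// Hypothetical helper: YES when `c` is in letterCharacterSet,
// which spans all of the Letter categories.
static BOOL IsLetter(unichar c) {
    return [[NSCharacterSet letterCharacterSet] characterIsMember:c];
}
```

With these helpers, é counts as both a lowercase letter and a letter, while 中 counts as a letter but not as a lowercase letter.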
Note that the definitions of the categories themselves may change with new versions of the Unicode standard. You can download the files that define category membership from http://www.unicode.org/.