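String and StringTable (1): Reading the String Source Code
An earlier series of articles on Java date objects noted that the String class is a textbook example of an immutable implementation. Immutability is what gives String its performance and safety properties. This article walks through the String source code in detail.
1. How immutability is implemented
1.1 Declaration and member variables
String is declared as a final class, and its core char array, value, is also final. (The listings in this article match the JDK 8 source; since JDK 9 the backing array is a byte[].)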
public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {
    /** The value is used for character storage. */
    private final char value[];

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    /** use serialVersionUID from JDK 1.0.2 for interoperability */
    private static final long serialVersionUID = -6849794470754667710L;
}
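A final class cannot be subclassed, and a final field can be assigned only once. As a result, once a constructor has initialized a String, its state cannot be changed. Note that final only fixes the array reference; the array contents stay unchanged because value is private and no String method writes to it after construction.
At its core, String operates on the final char array value. Because the field can be assigned only once, most operations that produce a different string copy the characters into a new char array (ultimately via System.arraycopy) and return a new String.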
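The effect of this design can be observed through the public API alone. The following is a minimal sketch; the class name ImmutabilityDemo and the helper transform are hypothetical names chosen for illustration. It shows that "modifying" operations return new strings and leave the original untouched.

```java
import java.util.Locale;

class ImmutabilityDemo {
    // Applies several "modifying" operations and returns the original plus the results.
    static String[] transform(String original) {
        String upper = original.toUpperCase(Locale.ROOT); // new String
        String joined = original.concat("!");             // new String
        String replaced = original.replace('l', 'L');     // new String
        return new String[] { original, upper, joined, replaced };
    }

    public static void main(String[] args) {
        for (String s : transform("hello")) {
            System.out.println(s);
        }
        // prints:
        // hello
        // HELLO
        // hello!
        // heLLo
    }
}
```

The first element printed is still "hello": none of the calls changed the receiver.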
1.2 Constructors
String has many constructors. The most commonly seen one builds a String from another String:
/**
* Initializes a newly created {@code String} object so that it represents
* the same sequence of characters as the argument; in other words, the
* newly created string is a copy of the argument string. Unless an
* explicit copy of {@code original} is needed, use of this constructor is
* unnecessary since Strings are immutable.
*
* @param original
* A {@code String}
*/
public String(String original) {
    this.value = original.value;
    this.hash = original.hash;
}
A String can of course also be built from a char array:
/**
* Allocates a new {@code String} so that it represents the sequence of
* characters currently contained in the character array argument. The
* contents of the character array are copied; subsequent modification of
* the character array does not affect the newly created string.
*
* @param value
* The initial value of the string
*/
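public String(char value[]) {
    this.value = Arrays.copyOf(value, value.length);
}
Note, however, that these two constructors differ fundamentally. String(String original) creates a new String object, but its value field still references the original char array. String(char value[]), by contrast, calls Arrays.copyOf (which uses System.arraycopy internally) to allocate a brand-new char array on the heap. We can verify the difference with reflection (on JDK 8, where value is a char[]):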
// requires: import java.lang.reflect.Field;
public static void main(String[] args) {
    String a = "12345";
    String b = "12345";
    String c = new String("12345");
    String d = new String(b.toCharArray());
    try {
        Field charField = String.class.getDeclaredField("value");
        charField.setAccessible(true);
        char[] objects = (char[]) charField.get(a);
        System.out.println(objects.length);
        // Overwrite the first character of the backing array behind a
        objects[0] = '0';
    } catch (NoSuchFieldException | IllegalAccessException e) {
        e.printStackTrace();
    }
    System.out.println("a is : {" + a + "}");
    System.out.println("b is : {" + b + "}");
    System.out.println("c is : {" + c + "}");
    System.out.println("d is : {" + d + "}");
}
The code above prints:
5
a is : {02345}
b is : {02345}
c is : {02345}
d is : {12345}
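As the output shows, after reflection is used to change the first character of a to 0, only d is unchanged among a, b, c and d; the other three change together.
This is because Java keeps string literals in a string constant pool, the StringTable, which a later article will cover. What matters here is that a and b both refer to the same pooled String. From its constructor we can see that the value field of c still points to the char array used by a, whereas d was built on a fresh copy of the array.
Apart from public String(String original), essentially all the other public constructors of String that take character data produce a new char array by copying.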
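The copying behaviour of String(char[]) can also be checked without reflection, which matters on newer JDKs where the backing field is a byte[] and reflective access to it is restricted by default. The following is a minimal sketch; ConstructorCopyDemo and fromChars are hypothetical names used only for illustration.

```java
class ConstructorCopyDemo {
    // Builds a String from src, then mutates src to show the String kept its own copy.
    static String fromChars(char[] src) {
        String s = new String(src);
        src[0] = '0'; // changes the caller's array only
        return s;
    }

    public static void main(String[] args) {
        char[] src = { '1', '2', '3', '4', '5' };
        String d = fromChars(src);
        System.out.println(d);               // prints 12345
        System.out.println(new String(src)); // prints 02345

        String a = "12345";
        String c = new String(a);
        System.out.println(c == a);      // prints false: a distinct String object
        System.out.println(c.equals(a)); // prints true: same character sequence
    }
}
```

Whether c shares its backing array with a is an implementation detail that this sketch cannot observe; it only shows that the two are distinct objects with equal content.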
1.3 get methods
The get-style methods include getBytes, whose source is as follows:
/**
* Encodes this {@code String} into a sequence of bytes using the
* platform's default charset, storing the result into a new byte array.
*
* <p> The behavior of this method when this string cannot be encoded in
* the default charset is unspecified. The {@link
* java.nio.charset.CharsetEncoder} class should be used when more control
* over the encoding process is required.
*
* @return The resultant byte array
*
* @since JDK1.1
*/
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
As shown, getBytes delegates to StringCoding to encode the string. StringCoding.encode in turn ends up calling:
// Trim the given byte array to the given length
//
private static byte[] safeTrim(byte[] ba, int len, Charset cs, boolean isTrusted) {
    if (len == ba.length && (isTrusted || System.getSecurityManager() == null))
        return ba;
    else
        return Arrays.copyOf(ba, len);
}
Once again it comes down to Arrays.copyOf. Either way, getBytes hands back a new byte array (not a new String) rather than the internal array of the string.
Similar get methods include:
public byte[] getBytes(String charsetName)
public byte[] getBytes(Charset charset)
public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin)
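These methods likewise copy the data out: the getBytes overloads return a new byte array, and getChars copies the requested characters into the caller-supplied dst array with System.arraycopy. None of them exposes the internal char array.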
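Because every one of these methods returns or fills a copy, callers can scribble over the result without affecting the string. The following is a minimal sketch of that property; DefensiveCopyDemo and mutateCopies are hypothetical names, and UTF-8 is chosen explicitly only to avoid depending on the platform default charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefensiveCopyDemo {
    // Overwrites every array obtained from s, then returns s for inspection.
    static String mutateCopies(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        Arrays.fill(bytes, (byte) 'x'); // touches the copy only

        char[] dst = new char[s.length()];
        s.getChars(0, s.length(), dst, 0);
        Arrays.fill(dst, 'x');          // touches the caller's buffer only

        char[] chars = s.toCharArray();
        Arrays.fill(chars, 'x');        // touches the copy only
        return s;
    }

    public static void main(String[] args) {
        System.out.println(mutateCopies("12345")); // prints 12345
    }
}
```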
2. serialPersistentFields and serialization
There is one very unusual spot in the String source:
/**
* Class String is special cased within the Serialization Stream Protocol.
*
* A String instance is written into an ObjectOutputStream according to
* <a href="{@docRoot}/../platform/serialization/spec/output.html">
* Object Serialization Specification, Section 6.2, "Stream Elements"</a>
*/
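private static final ObjectStreamField[] serialPersistentFields =
    new ObjectStreamField[0];
This declares a private static final ObjectStreamField array, yet oddly its length is 0. The documentation shows that this is a convention tied to implementing java.io.Serializable.
Normally, a class that implements java.io.Serializable can be serialized. During deserialization, serialVersionUID acts as a compatibility check: if the values do not match, deserialization fails. By default, every field that is neither static nor transient is serialized. As we saw when studying Date, its fastTime field is marked transient and is therefore not written by default serialization. So what is the purpose of private static final ObjectStreamField[] serialPersistentFields?
private static final ObjectStreamField[] serialPersistentFields =
    new ObjectStreamField[0];
When a field declared like this appears in a class that implements Serializable, default serialization writes only the fields listed in the array. It is typically used like this: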
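class List implements Serializable {
    // Must be private static final; with any other modifiers the serialization runtime ignores it
    private static final ObjectStreamField[] serialPersistentFields = { new ObjectStreamField("myField", List.class) };
    ...
}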
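In String the array has length 0, which means none of the fields of String are written through the default field-serialization mechanism. As the source comment says, String is special-cased in the serialization stream protocol: the character content is still written, but in a dedicated string format rather than field by field.
This works as a kind of protection mechanism, and it is a fairly obscure detail.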
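To see the effect of serialPersistentFields in practice, the following minimal sketch round-trips a small class through ObjectOutputStream and ObjectInputStream. SerialFieldsDemo, Account, and the field names are hypothetical and exist only for illustration; they are not part of the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

class SerialFieldsDemo {
    // Hypothetical class: only "name" is declared as a serializable field.
    static class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        private static final ObjectStreamField[] serialPersistentFields = {
            new ObjectStreamField("name", String.class)
        };
        String name;
        String secret; // not listed above, so default serialization skips it

        Account(String name, String secret) {
            this.name = name;
            this.secret = secret;
        }
    }

    // Serializes o to a byte array and reads it back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Account copy = (Account) roundTrip(new Account("alice", "s3cret"));
        System.out.println(copy.name);          // prints alice
        System.out.println(copy.secret);        // prints null
        System.out.println(roundTrip("12345")); // prints 12345
    }
}
```

The last line shows that a String still survives serialization even though its own serialPersistentFields array is empty, which is consistent with it being handled specially by the stream protocol.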
Reference:
Code Correctness: Incorrect serialPersistentFields Modifier
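3. The CaseInsensitiveComparator inner class
This section is really about the comparability of String. Anyone who reads the JDK source regularly knows that inner classes deserve close attention, and String is no exception.
3.1 The CaseInsensitiveComparator inner class
This inner class of String implements the Comparator interface: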
/**
* A Comparator that orders {@code String} objects as by
* {@code compareToIgnoreCase}. This comparator is serializable.
* <p>
* Note that this Comparator does <em>not</em> take locale into account,
* and will result in an unsatisfactory ordering for certain locales.
* The java.text package provides <em>Collators</em> to allow
* locale-sensitive ordering.
*
* @see java.text.Collator#compare(String, String)
* @since 1.2
*/
public static final Comparator<String> CASE_INSENSITIVE_ORDER
                                     = new CaseInsensitiveComparator();
private static class CaseInsensitiveComparator
        implements Comparator<String>, java.io.Serializable {
    // use serialVersionUID from JDK 1.2.2 for interoperability
    private static final long serialVersionUID = 8575799808933029326L;

    public int compare(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 != c2) {
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) {
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) {
                        // No overflow because of numeric promotion
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }

    /** Replaces the de-serialized object. */
    private Object readResolve() { return CASE_INSENSITIVE_ORDER; }
}
From the source we can see that this comparator implements case-insensitive string comparison. But String also has a comparison method of its own.
3.2 compareTo
/**
* Compares two strings lexicographically.
* The comparison is based on the Unicode value of each character in
* the strings. The character sequence represented by this
* {@code String} object is compared lexicographically to the
* character sequence represented by the argument string. The result is
* a negative integer if this {@code String} object
* lexicographically precedes the argument string. The result is a
* positive integer if this {@code String} object lexicographically
* follows the argument string. The result is zero if the strings
* are equal; {@code compareTo} returns {@code 0} exactly when
* the {@link #equals(Object)} method would return {@code true}.
* <p>
* This is the definition of lexicographic ordering. If two strings are
* different, then either they have different characters at some index
* that is a valid index for both strings, or their lengths are different,
* or both. If they have different characters at one or more index
* positions, let <i>k</i> be the smallest such index; then the string
* whose character at position <i>k</i> has the smaller value, as
* determined by using the < operator, lexicographically precedes the
* other string. In this case, {@code compareTo} returns the
* difference of the two character values at position {@code k} in
* the two string -- that is, the value:
* <blockquote><pre>
* this.charAt(k)-anotherString.charAt(k)
* </pre></blockquote>
* If there is no index position at which they differ, then the shorter
* string lexicographically precedes the longer string. In this case,
* {@code compareTo} returns the difference of the lengths of the
* strings -- that is, the value:
* <blockquote><pre>
* this.length()-anotherString.length()
* </pre></blockquote>
*
* @param anotherString the {@code String} to be compared.
* @return the value {@code 0} if the argument string is equal to
* this string; a value less than {@code 0} if this string
* is lexicographically less than the string argument; and a
* value greater than {@code 0} if this string is
* lexicographically greater than the string argument.
*/
public int compareTo(String anotherString) {
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;

    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}
As we know, when Arrays.sort or Collections.sort is called without an explicit Comparator, the elements are ordered with compareTo(T o). This is what makes String comparable.
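The two return rules described in the Javadoc (character difference at the first mismatch, otherwise length difference) can be checked with a few calls. The following is a small sketch; CompareToDemo and cmp are hypothetical names for illustration.

```java
class CompareToDemo {
    // Thin wrapper so the result can be inspected by callers.
    static int cmp(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(cmp("abc", "abd"));     // prints -1: 'c' - 'd'
        System.out.println(cmp("abc", "abcde"));   // prints -2: length 3 - length 5
        System.out.println(cmp("b", "a"));         // prints 1:  'b' - 'a'
        System.out.println(cmp("Apple", "apple")); // prints -32: 'A' - 'a'
        System.out.println("Apple".compareToIgnoreCase("apple")); // prints 0
    }
}
```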
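But why does String need to provide two comparison methods?
3.3 Summary of String comparability
String offers two ways to compare:
public int compareToIgnoreCase(String str) {
    return CASE_INSENSITIVE_ORDER.compare(this, str);
}
and the method required by the Comparable interface: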
public int compareTo(String anotherString)
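A class can implement Comparable only once, so supplying additional orderings requires Comparator. A Comparator is a non-intrusive, external implementation, and we can define as many of them as we have sorting rules. String uses an inner class to provide the ordering most likely to be needed, case-insensitive comparison, and exposes it to callers through compareToIgnoreCase and the public constant CASE_INSENSITIVE_ORDER.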
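The difference between the natural ordering and the case-insensitive comparator is easiest to see when sorting. The following is a minimal sketch; SortDemo and sortedWith are hypothetical names, and the sample words are arbitrary.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortDemo {
    // Sorts a copy of the fixed sample words with the given comparator.
    static String sortedWith(Comparator<String> order) {
        String[] words = { "banana", "apple", "Cherry" };
        Arrays.sort(words, order);
        return Arrays.toString(words);
    }

    public static void main(String[] args) {
        // Natural ordering uses compareTo: uppercase 'C' (67) sorts before 'a' (97)
        System.out.println(sortedWith(Comparator.naturalOrder()));
        // prints [Cherry, apple, banana]

        // The comparator exposed by String ignores case
        System.out.println(sortedWith(String.CASE_INSENSITIVE_ORDER));
        // prints [apple, banana, Cherry]
    }
}
```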
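4. hashCode and equals
hashCode and equals are a very important pair of methods: together they determine whether an object behaves correctly as a key in a HashMap.
4.1 The hashCode method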
/**
* Returns a hash code for this string. The hash code for a
* {@code String} object is computed as
* <blockquote><pre>
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* </pre></blockquote>
* using {@code int} arithmetic, where {@code s[i]} is the
* <i>i</i>th character of the string, {@code n} is the length of
* the string, and {@code ^} indicates exponentiation.
* (The hash value of the empty string is zero.)
*
* @return a hash code value for this object.
*/
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
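String overrides hashCode. It uses 31 as the multiplier and folds in the UTF-16 value of each char in turn (not just ASCII codes), relying on natural int overflow in place of an explicit modulus. The result is also cached in the hash field, so it is computed at most once for a non-empty string whose hash is non-zero. The formula is: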
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
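Effective Java explains that 31 was chosen because it is an odd prime. If the multiplier were even and the multiplication overflowed, information would be lost, because multiplying by 2 is equivalent to a shift. The advantage of using a prime is less clear, but it is traditional to use one when computing hash values.
31 also has a nice property: the multiplication can be replaced by a shift and a subtraction for better performance:
31*i==(i<<5)-i
Modern JVMs can perform this optimization automatically.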
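Both claims are easy to check by hand. The following sketch recomputes the polynomial with the same loop as the source and compares it with String.hashCode(); HashDemo, manualHash and times31 are hypothetical names for illustration. For "abc" the arithmetic is (97*31 + 98)*31 + 99 = 96354.

```java
class HashDemo {
    // Same recurrence as String.hashCode(): h = 31 * h + s[i], with int overflow.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Shift-and-subtract form of multiplying by 31.
    static int times31(int i) {
        return (i << 5) - i;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("abc"));                     // prints 96354
        System.out.println(manualHash("abc") == "abc".hashCode()); // prints true
        System.out.println(times31(7) == 31 * 7);                  // prints true
    }
}
```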
4.2 The equals method
/**
* Compares this string to the specified object. The result is {@code
* true} if and only if the argument is not {@code null} and is a {@code
* String} object that represents the same sequence of characters as this
* object.
*
* @param anObject
* The object to compare this {@code String} against
*
* @return {@code true} if the given object represents a {@code String}
* equivalent to this string, {@code false} otherwise
*
* @see #compareTo(String)
* @see #equalsIgnoreCase(String)
*/
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
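The source is shown above. The method first checks whether the two references are identical and returns true immediately if so. Otherwise it checks whether the argument is an instance of String, casts it, compares the lengths, and then compares the contents character by character. If everything matches it returns true; otherwise it returns false.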
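Because equals compares content and hashCode is derived from the same content, two distinct String objects with equal characters behave as the same HashMap key. The following is a minimal sketch; EqualsDemo, lookup, and the sample key are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class EqualsDemo {
    // Looks up key in a map that was populated using the literal "answer".
    static Integer lookup(String key) {
        Map<String, Integer> map = new HashMap<>();
        map.put("answer", 42);
        return map.get(key);
    }

    public static void main(String[] args) {
        String a = "answer";
        String b = new String("answer"); // a different object with equal content

        System.out.println(a == b);                       // prints false
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
        System.out.println(lookup(b));                    // prints 42
    }
}
```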
5. The intern method
intern is a special method in String: in this source it is the only method with a native implementation.
/**
* Returns a canonical representation for the string object.
* <p>
* A pool of strings, initially empty, is maintained privately by the
* class {@code String}.
* <p>
* When the intern method is invoked, if the pool already contains a
* string equal to this {@code String} object as determined by
* the {@link #equals(Object)} method, then the string from the pool is
* returned. Otherwise, this {@code String} object is added to the
* pool and a reference to this {@code String} object is returned.
* <p>
* It follows that for any two strings {@code s} and {@code t},
* {@code s.intern() == t.intern()} is {@code true}
* if and only if {@code s.equals(t)} is {@code true}.
* <p>
* All literal strings and string-valued constant expressions are
* interned. String literals are defined in section 3.10.5 of the
* <cite>The Java™ Language Specification</cite>.
*
* @return a string that has the same contents as this string, but is
* guaranteed to be from a pool of unique strings.
*/
public native String intern();
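As the comment explains, if the pool already contains an equal string, intern returns a reference to the pooled string; if not, the string is added to the StringTable and a reference to it is returned. The StringTable constant pool will be covered in a later article.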
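The guarantee stated in the Javadoc, that s.intern() == t.intern() exactly when s.equals(t), can be observed directly. The following is a small sketch; InternDemo and sameAfterIntern are hypothetical names for illustration.

```java
class InternDemo {
    // True when both strings resolve to the same pooled instance.
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String literal = "hello";           // literals are interned automatically
        String heap = new String("hello");  // a separate object on the heap
        String built = new StringBuilder("hel").append("lo").toString();

        System.out.println(heap == literal);           // prints false
        System.out.println(heap.intern() == literal);  // prints true
        System.out.println(built == literal);          // prints false
        System.out.println(built.intern() == literal); // prints true
        System.out.println(sameAfterIntern("a", "b")); // prints false
    }
}
```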