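How String Stores Characters Internally, and Unicode

Posted by 永不放棄EX on 2017-04-28

This article examines the String class at the source-code level and analyzes how Java stores a String internally.

Private fields of the String class

In the JDK source quoted here (the layout used through JDK 8), the String class stores its text as a char[]: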

/** The value is used for character storage. */
private final char value[];
/** Cache the hash code for the string */
private int hash; // Default to 0

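The String constructor

To see what the char[] named value actually holds, we need a representative constructor of the String class. The one below is a good choice. Its most important parameter is int[] codePoints; the other two parameters only specify the starting offset and the number of code points to take.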

    public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= codePoints.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > codePoints.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }

        final int end = offset + count;

        // Pass 1: Compute precise size of char[]
        int n = count;
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                continue;
            else if (Character.isValidCodePoint(c))
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    }
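To see the effect of this constructor from the outside, here is a minimal sketch that calls it with one BMP code point and one supplementary code point. The class name CodePointConstructorDemo and the helper fromCodePoints are made up for illustration; only the String constructor, length() and codePointCount() come from the JDK.

```java
public class CodePointConstructorDemo {
    // Hypothetical helper: builds a String from every code point in the array
    // by calling the String(int[], int, int) constructor discussed above.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // U+0041 ('A') is inside the BMP; U+1F600 is outside it.
        String s = fromCodePoints(0x41, 0x1F600);

        // One char for 'A' plus two chars for U+1F600.
        System.out.println(s.length());                      // 3
        // The number of code points is still two.
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```

The gap between the two numbers is exactly what Pass 1 of the constructor computes: it starts from count and adds one for every code point that is valid but not in the BMP.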

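What is the codePoints parameter?

To understand the constructor above, you first need to know what a code point is. A code point is the number Unicode assigns to a single character, in the range U+0000 to U+10FFFF. Not every code point fits in 16 bits (a Java char is 16 bits), so the codePoints parameter is an int array, and the constructor has to convert each value before storing it in char value[].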
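A small sketch of why an int is needed: it asks the JDK how many chars a code point occupies and shows what a plain cast to char would do. The class name CodePointWidthDemo and the helper charsNeeded are illustrative names, not part of the JDK; Character.charCount is the real library call.

```java
public class CodePointWidthDemo {
    // Hypothetical helper: number of 16-bit chars needed for one code point (1 or 2).
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int cjk = 0x4E2D;    // U+4E2D fits in 16 bits
        int emoji = 0x1F600; // U+1F600 needs 17 bits

        System.out.println(charsNeeded(cjk));   // 1
        System.out.println(charsNeeded(emoji)); // 2

        // A plain cast keeps only the low 16 bits, so the original value is lost.
        System.out.println(Integer.toHexString((char) emoji)); // f600
    }
}
```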

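Unicode:
Unicode (known in Chinese as 統一碼, 萬國碼, or 單一碼) is an industry standard in computing that covers a character set, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings: it assigns every character of every language a single, unique code, so that text can be converted and processed across languages and platforms. The cited source (Wikipedia) gives 1990 as the start of development and 1994 as the official release.
Put simply, Unicode is a character encoding that can be used worldwide and that assigns each character a unique number, written as U+ followed by a group of hexadecimal digits.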

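BMP (Basic Multilingual Plane):
For this article it is enough to know that the BMP is a range of characters, U+0000 to U+FFFF. A character inside the BMP can be written with four hexadecimal digits (16 bits); a character outside it needs more than four.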
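The BMP test can be sketched as a single shift: a value is in the BMP when nothing remains above the low 16 bits. The class BmpCheckDemo and the method fitsInBmp are illustrative names; the sketch compares the hand-written check with the JDK's Character.isBmpCodePoint rather than claiming to reproduce its source.

```java
public class BmpCheckDemo {
    // Hypothetical hand-written check: true when the value fits in 16 bits,
    // i.e. lies in U+0000..U+FFFF.
    static boolean fitsInBmp(int codePoint) {
        return (codePoint >>> 16) == 0;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0xFFFF, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            // Print our answer next to the JDK's answer for the same code point.
            System.out.printf("U+%04X ours=%b jdk=%b%n",
                    cp, fitsInBmp(cp), Character.isBmpCodePoint(cp));
        }
    }
}
```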

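Converting codePoints to char[]:
The main conversion is shown below. The key step is testing whether codePoints[i] lies in the BMP. If it does, it fits in a single char; otherwise it takes two chars, which are written by Character.toSurrogates(c, v, j++).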

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    /*********************************************************/
    // Character.toSurrogates(c, v, j++) is defined in Character as:
    static void toSurrogates(int codePoint, char[] dst, int index) {
        // We write elements "backwards" to guarantee all-or-nothing
        dst[index+1] = lowSurrogate(codePoint);
        dst[index] = highSurrogate(codePoint);
    }
    /*********************************************************/

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)

High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.)
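The arithmetic behind a surrogate pair can be sketched from the standard UTF-16 rule: subtract 0x10000, put the top 10 bits on 0xD800 and the bottom 10 bits on 0xDC00. The class SurrogateMathDemo and the methods high and low are illustrative names and are not copied from the JDK source; the sketch checks them against the public Character.highSurrogate and Character.lowSurrogate methods.

```java
public class SurrogateMathDemo {
    // Hypothetical helper: leading surrogate, from the top 10 bits of (cp - 0x10000).
    static char high(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >>> 10));
    }

    // Hypothetical helper: trailing surrogate, from the bottom 10 bits of (cp - 0x10000).
    static char low(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // a supplementary code point

        System.out.println(Integer.toHexString(high(cp))); // d83d
        System.out.println(Integer.toHexString(low(cp)));  // de00

        // The JDK's public methods give the same pair.
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true

        // Combining the pair recovers the original code point.
        System.out.println(Character.toCodePoint(high(cp), low(cp)) == cp); // true
    }
}
```

Both helpers assume the argument is a valid supplementary code point (U+10000 to U+10FFFF); they do no validation of their own.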

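Summary

Internally, String stores a Unicode string as a char array. Because a char is 16 bits, the array is effectively UTF-16 encoded. However, one char does not always hold one character: a character outside the BMP is stored as two chars.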
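One practical consequence is that indexing by char and indexing by character are not the same thing. The sketch below walks a string one code point at a time; the class CodePointIterationDemo and the method toCodePoints are illustrative names, while codePointAt, codePointCount and Character.charCount are existing JDK methods.

```java
import java.util.Arrays;

public class CodePointIterationDemo {
    // Hypothetical helper: decodes a String back into its code points.
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        for (int i = 0, k = 0; i < s.length(); k++) {
            int cp = s.codePointAt(i);       // reads one or two chars starting at i
            out[k] = cp;
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00"; // 'A' followed by U+1F600 as a surrogate pair

        System.out.println(s.length());                           // 3
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Arrays.toString(toCodePoints(s)));     // [65, 128512]
    }
}
```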
