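How String Stores Its Characters Internally, and Unicode

Posted by 永不放棄EX on 2017-04-28

Original URL: https://blog.csdn.net/u013564276/article/details/70896912

This article looks at the String class from the source code and analyzes how Java stores a String internally.

Private fields of the String class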

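Inside the String class, the characters of a string are stored in a char[]. This is the layout in the JDK source quoted in this article; from JDK 9 on, the field is a byte[] accompanied by a coder flag.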

    /** The value is used for character storage. */
    private final char value[];
    /** Cache the hash code for the string */
    private int hash; // Default to 0

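Constructors of the String class

To find out what the char[] named value actually holds, we need a representative constructor of the String class. One such constructor is shown below. Its most important parameter is int[] codePoints; the other two parameters only specify the starting offset and the number of elements to take.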

    public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= codePoints.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > codePoints.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }

        final int end = offset + count;

        // Pass 1: Compute precise size of char[]
        int n = count;
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                continue;
            else if (Character.isValidCodePoint(c))
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    }
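To see the constructor from the caller's side, here is a minimal sketch. The class name CodePointConstructorDemo and the helper fromCodePoints are made up for this illustration; only the String constructor itself comes from the JDK.

```java
// Sketch: exercising String(int[] codePoints, int offset, int count).
// CodePointConstructorDemo and fromCodePoints are illustrative names only.
class CodePointConstructorDemo {

    // Builds a String from the whole array via the constructor shown above.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // U+0041 and U+4E2D are BMP code points; U+1F600 is not.
        int[] cps = {0x41, 0x1F600, 0x4E2D};

        String all = fromCodePoints(cps);
        System.out.println(all.length());                        // 4
        System.out.println(all.codePointCount(0, all.length())); // 3

        // offset = 1, count = 2 selects U+1F600 and U+4E2D only.
        String tail = new String(cps, 1, 2);
        System.out.println(tail.length());                       // 3

        // A value above U+10FFFF is not a valid code point and is rejected.
        try {
            fromCodePoints(0x110000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Three code points produce four chars because Pass 1 adds one extra slot for every code point outside the BMP.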

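What is the codePoints parameter?

To understand the constructor above, you first need to know what a code point is. A code point is the number that Unicode assigns to one complete character. Because not every code point fits in 16 bits (a Java char is 16 bits wide), the codePoints parameter is an int array, and the constructor has to convert each element before storing it in char value[].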
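A small sketch of why a single char is not always enough. The class name CodePointWidthDemo and the helper charsNeeded are assumptions made for this example; Character.charCount and Character.toChars are standard JDK methods.

```java
// Sketch: how many 16-bit chars one code point needs.
// CodePointWidthDemo and charsNeeded are illustrative names only.
class CodePointWidthDemo {

    // Returns 1 for a BMP code point and 2 for a supplementary one.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int bmp = 0x4E2D;            // fits in 16 bits
        int supplementary = 0x1F600; // needs 17 bits

        System.out.println(charsNeeded(bmp));           // 1
        System.out.println(charsNeeded(supplementary)); // 2

        // A plain narrowing cast silently drops the high bits.
        char truncated = (char) supplementary;
        System.out.println(Integer.toHexString(truncated)); // f600

        // Character.toChars performs the correct conversion instead.
        char[] units = Character.toChars(supplementary);
        System.out.println(units.length);               // 2
    }
}
```

The truncated value f600 is a different character altogether, which is why the constructor cannot simply cast every int to char.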

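Unicode:
Unicode (known in Chinese as 統一碼, 萬國碼, or 單一碼) is an industry standard in computer science that covers a character set, encoding schemes, and more. It was created to overcome the limitations of traditional character encodings: it assigns every character of every language a single, unique code, so that text can be converted and processed across languages and platforms. The first version of the standard, Unicode 1.0, was published in 1991 (Wikipedia).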
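Put simply, Unicode is a character encoding that can be used worldwide. It assigns each character a unique number, written as U+ followed by a group of hexadecimal digits.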

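BMP (Basic Multilingual Plane):
All you need to know is that the BMP is a range of code points, U+0000 to U+FFFF. A character in the BMP can be written with four hexadecimal digits (16 bits), whereas a character outside the BMP needs more than four hexadecimal digits.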
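The BMP boundary can be checked directly with the Character class. In the sketch below, BmpRangeDemo and the plane helper are hypothetical names introduced for illustration; the Character methods and the MAX_CODE_POINT constant are part of the JDK.

```java
// Sketch: telling BMP code points apart from supplementary ones.
// BmpRangeDemo and plane are illustrative names only.
class BmpRangeDemo {

    // Plane number of a code point: everything above the low 16 bits.
    // Plane 0 is the BMP; planes 1 to 16 hold supplementary characters.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(Integer.toHexString(codePoint));
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(Character.isBmpCodePoint(0xFFFF));            // true
        System.out.println(Character.isBmpCodePoint(0x10000));           // false
        System.out.println(Character.isSupplementaryCodePoint(0x10000)); // true

        System.out.println(plane(0x4E2D));  // 0
        System.out.println(plane(0x1F600)); // 1

        // The largest code point Unicode defines.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```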

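Converting codePoints to char[]:
The core of the conversion is shown below. The key step is checking whether codePoints[i] lies in the BMP. If it does, it fits in one char; otherwise it takes two chars, which Character.toSurrogates(c, v, j++) writes. The extra j++ in that call, together with the j++ in the loop header, advances j by two for such a code point.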

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];

        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;

    // Character.toSurrogates, the helper called in the loop above:
    static void toSurrogates(int codePoint, char[] dst, int index) {
        // We write elements "backwards" to guarantee all-or-nothing
        dst[index+1] = lowSurrogate(codePoint);
        dst[index] = highSurrogate(codePoint);
    }
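The arithmetic behind the two surrogate values can be worked through by hand. The sketch below is not the JDK source: SurrogateMathDemo, high, and low are names invented here, written from the UTF-16 rules and compared against the public Character.highSurrogate and Character.lowSurrogate methods. It assumes the argument is a supplementary code point.

```java
// Sketch: splitting a supplementary code point into a surrogate pair.
// SurrogateMathDemo, high and low are illustrative names only.
class SurrogateMathDemo {

    // Subtract 0x10000 to get a 20-bit value; the top 10 bits go into the
    // high surrogate, starting at 0xD800.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >>> 10));
    }

    // The bottom 10 bits go into the low surrogate, starting at 0xDC00.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600;

        // 0x1F600 - 0x10000 = 0xF600; top 10 bits = 0x3D, bottom 10 bits = 0x200.
        System.out.printf("%04X %04X%n", (int) high(cp), (int) low(cp)); // D83D DE00

        // The JDK's own methods give the same pair.
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
    }
}
```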

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)

High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.)
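Going the other way, a high and a low surrogate can be recombined into the original code point. The sketch below uses an invented class SurrogateDecodeDemo with a hand-written combine helper, and checks it against the JDK's Character.toCodePoint.

```java
// Sketch: recognizing a surrogate pair and decoding it to a code point.
// SurrogateDecodeDemo and combine are illustrative names only.
class SurrogateDecodeDemo {

    // Reverses the split: 10 bits from each surrogate, plus the 0x10000 offset.
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = (char) 0xD83D; // inside D800..DBFF
        char low = (char) 0xDE00;  // inside DC00..DFFF

        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        System.out.println(Integer.toHexString(combine(high, low)));               // 1f600
        System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // 1f600
    }
}
```

Because the two surrogate ranges do not overlap with each other or with ordinary characters, a reader of the char[] can always tell whether a given char stands alone or belongs to a pair.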

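Summary

Internally, String stores a Unicode string as a char array. Because a char is 16 bits wide, the contents can be described as UTF-16 encoded. However, one char does not always hold one character: a character outside the BMP is stored as two chars, a surrogate pair.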
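The practical consequence shows up in everyday String methods, which count chars rather than characters. A minimal sketch, with CharVsCodePointDemo and codePointsOf as names invented for this example:

```java
// Sketch: char-based versus code-point-based views of the same String.
// CharVsCodePointDemo and codePointsOf are illustrative names only.
class CharVsCodePointDemo {

    // Returns the code points of s, one int per character.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, built without non-ASCII source text.
        String s = "A" + new String(Character.toChars(0x1F600));

        System.out.println(s.length());                      // 3 (chars)
        System.out.println(s.codePointCount(0, s.length())); // 2 (characters)

        // charAt sees only half of the pair: a lone high surrogate.
        System.out.println(Integer.toHexString(s.charAt(1)));      // d83d
        // codePointAt reads the whole pair starting at that index.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600

        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp); // U+0041, then U+1F600
        }
    }
}
```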
