New APIs in ES6: String (Part 2)

This article introduces several functions that ES6 added for working with Unicode strings.

String.prototype.codePointAt
Function type:
```
(index?: number)=> number|undefined
```
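codePointAt is a prototype method. Given an index argument, it returns the code point of the character at that position in the string. It recognizes the 4-byte code points of UTF-16 (those encoded as surrogate pairs), so it covers a wider range than the prototype method charCodeAt, which only recognizes 2-byte characters in the Basic Multilingual Plane (BMP). In addition, when index is out of range, codePointAt returns undefined, while charCodeAt returns NaN.
Apart from these two differences, codePointAt and charCodeAt behave essentially the same:
- index defaults to 0 in both
- for characters in the BMP, both return the same result.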
```
const str = 'abc'; // the character 'a' is in the BMP
console.log(str.codePointAt(0)); // 97
// index defaults to 0
console.log(str.codePointAt()); // 97
// an out-of-range index returns undefined
console.log(str.codePointAt(5)); // undefined
console.log(str.charCodeAt(0)); // 97
// index defaults to 0
console.log(str.charCodeAt()); // 97
// an out-of-range index returns NaN
console.log(str.charCodeAt(5)); // NaN
```

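When a character lies in a supplementary plane, codePointAt recognizes it correctly and returns its code point. charCodeAt cannot; it only returns the value of the 2-byte code unit at the given position.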

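For example, the supplementary-plane treble clef character 𝄞 is represented by two 2-byte code units, 0xd834 and 0xdd1e (a surrogate pair). Calling charCodeAt on 𝄞 only yields the code unit at each position.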

```
const str = '\ud834\udd1e'; // supplementary-plane character: treble clef 𝄞
console.log(str.charCodeAt(0).toString(16)); // d834
console.log(str.charCodeAt(1).toString(16)); // dd1e
```

With codePointAt, we get the code point of 𝄞, 0x1d11e.

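```
console.log(str.codePointAt(0).toString(16)); // 1d11e
// At index 1, '\udd1e' is a low (trailing) surrogate, which cannot start a surrogate pair,
// so it is treated as a lone 2-byte code unit: the result is the value of '\udd1e',
// not the code point of '\ud834\udd1e'.
console.log(str.codePointAt(1).toString(16)); // dd1e
```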
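
Because an index that lands on the second half of a surrogate pair returns only that half, walking a string by code point has to skip two code units after each supplementary-plane character. The following is a minimal sketch of that idea; `codePointsOf` is a hypothetical helper name introduced here for illustration, not part of the language.

```javascript
// Hypothetical helper: collect the code points of a string by walking its indexes.
function codePointsOf(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // A code point above 0xFFFF occupies two UTF-16 code units, so skip both.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

const hexPoints = codePointsOf('a\ud834\udd1eb').map((cp) => cp.toString(16));
console.log(hexPoints); // [ '61', '1d11e', '62' ]
```

Stepping by one index at a time instead would also report 0xdd1e as if it were a separate character.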

String.fromCodePoint

Function type:
```
(...codePoints: number[])=> string
```

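The static function fromCodePoint returns the string corresponding to the Unicode code points passed in. Compared with fromCharCode, it accepts supplementary-plane code points directly. Taking the treble clef 𝄞 again: with fromCodePoint you can pass the code point 0x1d11e directly, whereas fromCharCode requires the two code units 0xd834 and 0xdd1e.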

```
console.log(String.fromCodePoint(0x1d11e)); // 𝄞
console.log(String.fromCodePoint(0xd834, 0xdd1e)); // 𝄞
console.log(String.fromCharCode(0x1d11e)); // 턞 (wrong character: the value is truncated to 16 bits, 0xd11e)
console.log(String.fromCharCode(0xd834, 0xdd1e)); // 𝄞
```
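
To make the relationship between 0x1d11e and the pair 0xd834, 0xdd1e concrete, here is a small sketch of the standard UTF-16 surrogate arithmetic. `toSurrogatePair` is a hypothetical helper name used only for this illustration, and it assumes its argument is a code point above 0xFFFF.

```javascript
// Hypothetical helper: split a supplementary-plane code point into its
// UTF-16 surrogate pair. Assumes 0x10000 <= codePoint <= 0x10FFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1d11e);
console.log(high.toString(16), low.toString(16)); // d834 dd1e
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1d11e)); // true
```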

For BMP characters, fromCodePoint and fromCharCode give the same result.

```
console.log(String.fromCodePoint(97)); // 'a'
console.log(String.fromCodePoint(97, 98)); // 'ab'
console.log(String.fromCodePoint()); // ''
console.log(String.fromCharCode(97)); // 'a'
console.log(String.fromCharCode(97, 98)); // 'ab'
console.log(String.fromCharCode()); // ''
```
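
One further difference worth a quick sketch is how the two functions treat values outside their range: fromCodePoint throws a RangeError for anything that is not a valid code point, while fromCharCode silently keeps only the low 16 bits. `throwsRangeError` below is a hypothetical helper written just for this demonstration.

```javascript
// Hypothetical helper: report whether calling fn throws a RangeError.
function throwsRangeError(fn) {
  try {
    fn();
    return false;
  } catch (err) {
    return err instanceof RangeError;
  }
}

console.log(throwsRangeError(() => String.fromCodePoint(0x110000))); // true
console.log(throwsRangeError(() => String.fromCodePoint(-1))); // true
console.log(throwsRangeError(() => String.fromCharCode(0x110000))); // false
// fromCharCode keeps only the low 16 bits: 0x10061 becomes 0x0061
console.log(String.fromCharCode(0x10061) === 'a'); // true
```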

String.prototype.normalize
Function type:
```
(form?:'NFC'|'NFD'|'NFKC'|'NFKD')=>string
```
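The prototype method normalize takes a parameter form that specifies the Unicode normalization form (NFC, NFD, NFKC, or NFKD). form defaults to 'NFC' (Normalization Form Canonical Composition: decompose by canonical equivalence, then recompose by canonical equivalence). The method returns the normalized string.
Unicode provides two ways to represent composite characters (letters carrying diacritics such as tone or accent marks): as a single precomposed code point, or as the base letter followed by a combining mark, using two code points. For example, ń is a composite character that can be written either as the single code point 0x0144 or as the two code points 0x006e and 0x0301.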
```
const str1 = '\u0144'; //ń
const str2 = '\u006e\u0301'; //ń
console.log({
    str1,
    str2,
});//{ str1: 'ń', str2: 'ń' }
```
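These two representations are identical both visually and semantically; they are canonically equivalent. At the code level, however, they differ: str1 is one code point and str2 is two, which can easily cause problems.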
```
console.log(str1.length, str2.length);//1 2
console.log(str1 === str2);//false
```
normalize exists to solve exactly this problem: once both strings have been normalized, the mismatch disappears.
```
let str1 = '\u0144'; //ń
let str2 = '\u006e\u0301'; //ń
// normalize both strings
str1 = str1.normalize();
str2 = str2.normalize();
console.log({
    str1,
    str2,
}); //{ str1: 'ń', str2: 'ń' }

console.log(str1.length, str2.length); //1 1
console.log(str1 === str2); //true
```
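
The other forms accepted by normalize can be sketched in the same way. The example below wraps the comparison in a hypothetical helper, `canonicallyEqual` (a name introduced here, not a built-in), and shows that NFD goes in the opposite direction from NFC. It also shows one thing the compatibility forms do that the canonical forms do not, using the ligature U+FB01 as an assumed sample character.

```javascript
// Hypothetical helper: compare two strings by canonical equivalence.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(canonicallyEqual('\u0144', '\u006e\u0301')); // true

// NFD decomposes into base letter + combining mark; NFC recomposes.
console.log('\u0144'.normalize('NFD').length); // 2
console.log('\u006e\u0301'.normalize('NFC').length); // 1

// NFKC/NFKD additionally replace compatibility characters,
// e.g. the ligature U+FB01 becomes the two letters "fi".
console.log('\ufb01'.normalize('NFKC')); // fi
console.log('\ufb01'.normalize('NFC') === '\ufb01'); // true
```

Which form to choose depends on the use case; for plain equality checks between user-visible strings, normalizing both sides to the same canonical form is usually enough.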
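New Unicode escape syntax
Previously, a Unicode character could be written as \u followed by the code point as four hex digits (\uXXXX). ES6 adds another form: \u{code point}.
The difference is easy to guess: \u{code point} supports 4-byte supplementary-plane code points, whereas \uXXXX only supports 2-byte BMP code points.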
```
// for 2-byte BMP code points, the two forms are equivalent
const str1 = '\u{0144}';
const str2 = '\u0144';
console.log(str1 === str2); // true
// treble clef
const str3 = '\u{1d11e}';
// wrong: this is parsed as two characters, \u1d11 followed by e
const str4 = '\u1d11e';
console.log(str4, str3 === str4); // ᴑe false
```
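
As a short sketch tying this back to the earlier sections: a character written with \u{...} is still stored as a surrogate pair, so length counts two code units, while spreading the string (which iterates by code point) counts one. The variable name `clef` is just an illustrative choice.

```javascript
const clef = '\u{1d11e}';
// The new escape produces the same string as the surrogate-pair escape.
console.log(clef === '\ud834\udd1e'); // true
// length counts UTF-16 code units, not characters.
console.log(clef.length); // 2
// The string iterator walks by code point.
console.log([...clef].length); // 1
console.log(clef.codePointAt(0) === 0x1d11e); // true
```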
Unicode can certainly be a headache. If you are not very familiar with it, leave a comment below and I will post a detailed article on Unicode and JS.
