Analysis and Solution of the Chinese String Comparison Problem in Java (repost)

You Jing (yousp@yeah)

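Java's problems with Chinese text have a long history. Not long ago I needed to sort Chinese strings by comparing them, and found that even after encoding the strings as GBK or GB2312, String.compareTo still did not give the correct result. So, with some skepticism, I looked into how the String class implements it. (The author used JDK 1.3.1.)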

Below is the source code of compareTo in String.java. Please note the comments:

public class String
{
    …
    public int compareTo(String anotherString) {
        int len1 = count;
        int len2 = anotherString.count;
        // n is the smaller of the two string lengths
        int n = Math.min(len1, len2);
        // get the character arrays
        char v1[] = value;
        char v2[] = anotherString.value;
        // get the offsets
        /** The offset is the first index of the storage that is used. */
        // offset is the first index in use
        int i = offset;
        int j = anotherString.offset;

        // If i == j:
        // this may be the case where both strings are taken from the same memory...
        // A
        // B s1 |
        // C
        // D s2
        // E |
        // F |
        // G
        // possibly in this situation i == j
        if (i == j) {
            int k = i;
            int lim = n + i;
            while (k < lim)
            {
                char c1 = v1[k];
                char c2 = v2[k];
                if (c1 != c2) // at the first unequal character, return c1 - c2
                    return c1 - c2;
                k++;
            }
        } else {
            while (n-- != 0) // until the remaining count reaches 0
            {
                char c1 = v1[i++]; // fetch one character from each string
                char c2 = v2[j++];
                if (c1 != c2) { // found a difference: return c1 - c2 at once
                    return c1 - c2;
                }
            }
        }
        // Reached only when the first n characters gave no result,
        // i.e. they were all equal
        return len1 - len2;
    }
    …
}//end of class String
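As a quick check of the behavior this source describes, here is a minimal sketch. The class name CompareToDemo and its helper compare are illustrative only and do not come from the article. It uses Unicode escapes so that it does not depend on the source-file encoding, and it shows that the result is simply the difference between the first pair of differing char values, or the length difference when one string is a prefix of the other.

```java
// Minimal demo of String.compareTo semantics (illustrative names).
class CompareToDemo {
    // Hypothetical helper: simply delegates to String.compareTo.
    static int compare(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        String wo = "\u6211"; // "wo" (I/me), char value 25105
        String ai = "\u7231"; // "ai" (love), char value 29233
        // Difference of the char values: 25105 - 29233
        System.out.println(compare(wo, ai));      // prints -4128
        // One string is a prefix of the other: length difference
        System.out.println(compare("ab", "abc")); // prints -1
    }
}
```

The negative first result means compareTo places "wo" before "ai", the opposite of the order expected for Chinese text.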

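Why does Java's compareTo go wrong when comparing Chinese characters? Analyzing the compareTo source shows that the key point is that the JDK implementation compares char values directly: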

char c1 = v1[k];

char c2 = v2[k];

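However, even when the text is GB2312-encoded, the char value Java holds for a Chinese character does not follow the GB2312 order. Java treats each Chinese character as a single (two-byte) char, and casting that char to int does not yield the Chinese internal code that carries the encoding order. The following test data shows what is going on: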

Character   char value   byte[] value   Value combined from byte[]
我          25105        [-50:-46]      -5046
爱          29233        [-80:-82]      -8082
北          21271        [-79:-79]      -7979
京          20140        [-66:-87]      -6687
天          22825        [-52:-20]      -5220
安          23433        [-80:-78]      -8078
门          38376        [-61:-59]      -6159
A           65           [65]           65
B           66           [66]           66
C           67           [67]           67
D           68           [68]           68
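The rows of this table can be reproduced with a short sketch like the one below. It assumes that the running JVM provides the GBK charset, and that the last column is built as value = value * 100 + byte over Java's signed byte values, which matches the getCharCode function shown later in the article. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Reproduces table rows: char value, signed GBK bytes, combined value.
class GbkTableDemo {
    // Assumption: the GBK charset is available in this JVM.
    static final Charset GBK = Charset.forName("GBK");

    // Returns the signed GBK bytes of a single character.
    static byte[] gbkBytes(char c) {
        return String.valueOf(c).getBytes(GBK);
    }

    // Combines the signed bytes as value = value * 100 + byte.
    static int combined(char c) {
        int value = 0;
        for (byte b : gbkBytes(c)) {
            value = value * 100 + b;
        }
        return value;
    }

    public static void main(String[] args) {
        // U+6211 "wo", U+7231 "ai", U+5317 "bei", and the letter A
        char[] samples = {'\u6211', '\u7231', '\u5317', 'A'};
        for (char c : samples) {
            System.out.println((int) c + " " + Arrays.toString(gbkBytes(c))
                    + " " + combined(c));
        }
        // First line printed: 25105 [-50, -46] -5046
    }
}
```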

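In Chinese order, 我 (wo) should come after 爱 (ai), so in theory the char value of 我 should be greater than that of 爱. In fact it is smaller (25105 versus 29233). The reason is that a Java char always holds a Unicode (UTF-16) code unit, whatever encoding the text came from, and Unicode arranges Chinese characters by radical and stroke count. GB2312, by contrast, orders its commonly used (level-1) characters by pinyin, and GBK keeps that arrangement, so the order defined by the internal code is lost in the char value. When a character is split into its two GBK bytes as a byte[], however, the values still follow the order that GBK defines. This gives the idea for a solution: instead of using the char value, encode each character to GBK, split it into its 2 bytes, and compute the internal code from those bytes.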
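One way to sketch "computing the internal code from the bytes" is shown below. This is an illustrative variant, not the article's own function (which follows further down and combines the signed bytes with a factor of 100). Here each byte is masked to its unsigned value and the two bytes are joined as hi * 256 + lo, so the result is the familiar GBK code such as 0xB0AE. The names GbkCodeDemo and gbkCode are assumptions, and the sketch relies on the GBK charset being available in the JVM.

```java
import java.nio.charset.Charset;

// Computes a non-negative GBK code for one character (illustrative sketch).
class GbkCodeDemo {
    // Assumption: the GBK charset is available in this JVM.
    static final Charset GBK = Charset.forName("GBK");

    // One-byte characters give 0..127; two-byte characters give hi * 256 + lo.
    static int gbkCode(char c) {
        byte[] bytes = String.valueOf(c).getBytes(GBK);
        int code = 0;
        for (byte b : bytes) {
            code = (code << 8) | (b & 0xFF); // mask to read the byte as unsigned
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(gbkCode('\u7231'))); // "ai": prints b0ae
        System.out.println(Integer.toHexString(gbkCode('\u6211'))); // "wo": prints ced2
        System.out.println(gbkCode('\u7231') < gbkCode('\u6211'));  // prints true
    }
}
```

With unsigned bytes, ASCII letters sort before Chinese characters and trailing bytes below 0x80 do not disturb the order. Both differ from the signed-byte arithmetic, so treat this as a design alternative rather than a drop-in equivalent.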

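So I wrote the following functions myself, and they basically solve the Chinese comparison problem.

There are three functions. You can put them in any class and use them as helper functions (private helpers).

- public int compare(String s1, String s2): mainly does the preparatory encoding work before the comparison. It is essentially an outer wrapper.

- public int chineseCompareTo(String s1, String s2): the main body of the Chinese string comparison. It implements the basic comparison logic, which is the same as the logic in the JDK's compareTo, and its interface is similar.

- public static int getCharCode(String s): converts a character given as a string into an int code without losing its ordering information. The input is normally a single character such as "我" or "A". If a longer string is passed in, the function returns the value of the first character.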

private static String __ENCODE__ = "GBK"; // must be GBK

private static String __SERVER_ENCODE__ = "GB2312"; // default encoding on the server

/*
 * Compare two strings
 */
public int compare(String s1, String s2)
{
    String m_s1 = null, m_s2 = null;
    try
    {
        // first convert both strings to GBK
        m_s1 = new String(s1.getBytes(__SERVER_ENCODE__), __ENCODE__);
        m_s2 = new String(s2.getBytes(__SERVER_ENCODE__), __ENCODE__);
    }
    catch (Exception ex)
    {
        // fall back to the default comparison if the conversion fails
        return s1.compareTo(s2);
    }
    int res = chineseCompareTo(m_s1, m_s2);
    System.out.println("Compare: " + s1 + " | " + s2 + " ==== Result: " + res);
    return res;
}

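// Get the code value of one Chinese character or letter
public static int getCharCode(String s)
{
    if (s == null || s.equals("")) return -1; // guard clause
    byte[] b;
    try
    {
        // encode only the first character, explicitly as GBK, so the
        // result does not depend on the platform default encoding
        b = s.substring(0, 1).getBytes(__ENCODE__);
    }
    catch (java.io.UnsupportedEncodingException ex)
    {
        return -1;
    }
    int value = 0;
    // combine the bytes of that character (at most two in GBK)
    for (int i = 0; i < b.length && i < 2; i++)
    {
        value = value * 100 + b[i];
    }
    return value;
}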

// Compare two strings
public int chineseCompareTo(String s1, String s2)
{
    int len1 = s1.length();
    int len2 = s2.length();
    int n = Math.min(len1, len2);
    for (int i = 0; i < n; i++)
    {
        int s1_code = getCharCode(s1.charAt(i) + "");
        int s2_code = getCharCode(s2.charAt(i) + "");
        if (s1_code != s2_code) return s1_code - s2_code;
    }
    return len1 - len2;
}
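To see the functions at work, here is a self-contained usage sketch that sorts the seven characters from the test table. It re-implements the same byte-combining and comparison logic under illustrative names (ChineseSortDemo, charCode, chineseCompare, sortedJoined), and assumes Java 8 or later and a JVM that provides the GBK charset. It is a sketch of the approach, not a general-purpose collation routine.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Sorts strings with the article's byte-based comparison (illustrative names).
class ChineseSortDemo {
    // Assumption: the GBK charset is available in this JVM.
    static final Charset GBK = Charset.forName("GBK");

    // Same idea as getCharCode: combine the signed GBK bytes of one character.
    static int charCode(char c) {
        int value = 0;
        for (byte b : String.valueOf(c).getBytes(GBK)) {
            value = value * 100 + b;
        }
        return value;
    }

    // Same logic as chineseCompareTo: first differing code, else length.
    static int chineseCompare(String s1, String s2) {
        int n = Math.min(s1.length(), s2.length());
        for (int i = 0; i < n; i++) {
            int c1 = charCode(s1.charAt(i));
            int c2 = charCode(s2.charAt(i));
            if (c1 != c2) return c1 - c2;
        }
        return s1.length() - s2.length();
    }

    // Sorts a copy of the input and joins the result into one string.
    static String sortedJoined(String[] input) {
        String[] copy = input.clone();
        Arrays.sort(copy, ChineseSortDemo::chineseCompare);
        return String.join("", copy);
    }

    public static void main(String[] args) {
        // wo, ai, bei, jing, tian, an, men (the characters of the test table)
        String[] words = {"\u6211", "\u7231", "\u5317", "\u4eac",
                          "\u5929", "\u5b89", "\u95e8"};
        // Expected pinyin order after sorting: ai, an, bei, jing, men, tian, wo
        System.out.println(sortedJoined(words));
    }
}
```

Because the signed bytes of Chinese characters are negative while ASCII bytes are positive, this scheme places ASCII letters after Chinese characters. GBK characters whose second byte is below 0x80 may also fall out of order, so the approach is best suited to text within GB2312.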

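As this shows, dissecting the system source code gives us a chance to glimpse how the system works inside, even while we are still puzzled. What I find hard to understand, though, is that some of Java's internal classes are written in a very poor style and also have some problems. That may just be my personal impression. I have learned something along the way and wanted to share it. Corrections for anything I have missed or got wrong are welcome.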
