ANSI (character set), Unicode (character set), and UTF-8 (an encoding of the Unicode character set)

Why Unicode?

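A single-byte ANSI code page uses one byte (8 bits) per character.

One byte cannot cover the characters of many of the world's languages, and the goal was to treat every character equally, so more bits were needed.

Unicode was originally designed around two bytes (16 bits) per character, on the assumption that 65,536 values would cover every character of every country. That turned out to be too few: Unicode now extends to 0x10FFFF, so a single character can need more than 16 bits (UTF-16 covers the extra range with surrogate pairs).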

How does C++ represent Unicode?

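char is at least 8 bits (exactly 8 on mainstream platforms) and can hold one single-byte ANSI character.

wchar_t is 16 bits on Windows and can hold one UTF-16 code unit; on most Linux and Unix toolchains it is 32 bits.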
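The widths of these types can be checked at compile time rather than assumed. The following is a minimal sketch; the helper name `bits_in` is hypothetical and not part of any library.

```cpp
#include <climits>
#include <cstddef>

// Number of bits occupied by one object of type T on this platform.
// bits_in is an illustrative helper, not a standard function.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}

// char is the smallest addressable unit and is at least 8 bits wide.
static_assert(bits_in<char>() >= 8, "char must have at least 8 bits");

// wchar_t is never narrower than char; its exact width is platform-specific.
static_assert(bits_in<wchar_t>() >= bits_in<char>(),
              "wchar_t is at least as wide as char");
```

Code that depends on `wchar_t` being exactly 16 bits could add its own `static_assert(bits_in<wchar_t>() == 16, ...)` so that a port to another platform fails loudly at compile time.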

What is Unicode?

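Unicode is a unified character set that aims to cover the scripts and symbols of every language in the world. Version 1.0 was published in 1991. At its core, the standard assigns each character a number (its code point).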

Range: Unicode code points run from 0x0000 to 0x10FFFF and are organized into 17 groups called planes. Each plane has 65,536 code points, for a total of 1,114,112.
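The plane arithmetic above can be written out directly. This is a small sketch; `plane_of` and the constant names are made up for illustration.

```cpp
#include <cstdint>

// Highest valid Unicode code point.
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;
// Number of code points in one plane (65,536).
constexpr std::uint32_t kPlaneSize = 0x10000;

// Returns the plane index (0..16) of a code point, or -1 if it is out of range.
constexpr int plane_of(std::uint32_t cp) {
    return cp <= kMaxCodePoint ? static_cast<int>(cp / kPlaneSize) : -1;
}

// Total number of code points: 17 planes * 65,536 = 1,114,112.
constexpr std::uint32_t kTotalCodePoints = kMaxCodePoint + 1;
static_assert(kTotalCodePoints == 17 * kPlaneSize,
              "17 planes of 65,536 code points each");
```

For example, U+4E2D (中) falls in plane 0, the Basic Multilingual Plane, while U+1F600 falls in plane 1.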

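From Unicode code point to in-memory representation?

Unicode only assigns each character a number. It does not say how many bytes that number occupies in memory, or whether the byte count is fixed or variable. As a result, Unicode is commonly stored in memory in one of two ways: variable-width encodings and fixed-width encodings.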

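Fixed-width vs. variable-width encodings (UTF-8)?

(1) Variable-width encoding: a character takes one or more bytes. UTF-8 uses 1 to 4 bytes per character and is stored in char / std::string. It suits network transmission because it works like a compression scheme: ASCII characters take one byte, while higher code points take more.

(2) 16-bit encoding: UTF-16, stored in wchar_t / std::wstring on Windows. Characters in the Basic Multilingual Plane take exactly two bytes, which is why UTF-16 is often treated as fixed-width, but characters above 0xFFFF take four bytes (a surrogate pair). Only UTF-32 is truly fixed-width.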
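To make the two encodings concrete, here is a sketch that encodes a single code point as UTF-8 and as UTF-16. The function names `encode_utf8` and `encode_utf16` are hypothetical, and `char16_t` / `std::u16string` stand in for the 16-bit `wchar_t` / `std::wstring` used on Windows so that the code behaves the same on every platform.

```cpp
#include <cstdint>
#include <string>

// True for values that are not Unicode scalar values:
// above 0x10FFFF, or inside the surrogate range 0xD800..0xDFFF.
static bool is_invalid_scalar(std::uint32_t cp) {
    return cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for invalid input.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (is_invalid_scalar(cp)) return out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 (one or two 16-bit code units).
// Returns an empty string for invalid input.
std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (is_invalid_scalar(cp)) return out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

With this sketch, U+0041 (A) takes one byte in UTF-8 and U+4E2D (中) takes three, while both take a single 16-bit unit in UTF-16. U+1F600 takes four bytes in UTF-8 and two units in UTF-16.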

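CString supports both in-memory representations

CString is built on the intermediate type TCHAR, which the compiler options resolve to either char or wchar_t.

_T("a string of the intermediate type"); whether the generated code uses wchar_t or char depends on whether the project's character set option is Unicode or multi-byte, so code should write every string literal with the intermediate form _T("").

The L"" prefix creates a wide (wchar_t) string literal. A char string literal needs only the double quotes "".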

typedef ATL::CStringT< TCHAR, StrTraitMFC_DLL< TCHAR > > CString;

(1) Character set option set to Unicode: CString <- TCHAR <- WCHAR <- wchar_t <- wstring.c_str()

(2) Character set option set to multi-byte: CString <- TCHAR <- char <- string.c_str()
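The way one source file can resolve to either character type is plain preprocessor selection. The sketch below imitates the mechanism with hypothetical names (`MY_UNICODE`, `MyTChar`, `MY_T`, `MyTString`); the real TCHAR and _T come from the Windows headers and are not reproduced here.

```cpp
#include <string>

// Hypothetical stand-ins for TCHAR and _T(); the real ones live in <tchar.h>.
// Compile with -DMY_UNICODE (or define it above) to select the wide variant.
#ifdef MY_UNICODE
typedef wchar_t MyTChar;
#define MY_T(x) L##x   // paste the L prefix onto the literal
#else
typedef char MyTChar;
#define MY_T(x) x      // leave the literal as a narrow string
#endif

// A string type that follows the build option, similar in spirit to CString.
typedef std::basic_string<MyTChar> MyTString;

// Code written against the neutral names compiles unchanged in both modes.
inline MyTString greeting() {
    return MY_T("hello");
}
```

The point of the design is that only the macro definition changes between builds; callers of `greeting()` never name `char` or `wchar_t` directly.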

Converting between CString and a UTF-8 string under the two build options:

(1) Unicode: CString stores wchar_t

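string to CString:

CString().Format(_T("%s"), string.c_str()); // Garbled: the UTF-8 bytes are never decoded. The char buffer is read as if it were already wchar_t data, so raw bytes (rather than whole characters) end up as UTF-16 code units.

The correct approach is to convert with the system function MultiByteToWideChar (passing CP_UTF8 for UTF-8 input), which turns each multi-byte character into UTF-16 code units stored in wchar_t.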
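The difference between widening bytes and decoding characters can be shown without any Windows API. The sketch below is portable and only illustrates the idea; it is not how MultiByteToWideChar is implemented, and the names `widen_bytes` and `utf8_to_utf16` are hypothetical. It replaces malformed input with U+FFFD and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Naive conversion: every byte becomes its own 16-bit unit.
// Multi-byte UTF-8 characters turn into several unrelated units (mojibake).
std::u16string widen_bytes(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out += static_cast<char16_t>(b);
    return out;
}

// Decoding conversion: each UTF-8 sequence is read as one code point
// and re-encoded as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        bool ok = true;
        if (lead < 0x80)               { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; extra = 3; }
        else                           { ok = false; }

        // The sequence must fit in the input and use 10xxxxxx continuation bytes.
        if (ok && i + extra >= in.size()) ok = false;
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += char16_t(0xFFFD);  // replacement character
            ++i;                      // resynchronize on the next byte
            continue;
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out += char16_t(cp);
        } else {                      // needs a surrogate pair
            cp -= 0x10000;
            out += char16_t(0xD800 | (cp >> 10));
            out += char16_t(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

For the three UTF-8 bytes E4 B8 AD (中), `widen_bytes` yields three units (0xE4, 0xB8, 0xAD), while `utf8_to_utf16` yields the single unit 0x4E2D.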

CString to string:

This direction needs a conversion as well: WideCharToMultiByte.

(2) Multi-byte: CString stores char, so its encoding and storage are essentially the same as string's
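The reverse direction follows the same pattern: pair up surrogates to recover each code point, then write it out as UTF-8. This is again a portable sketch of the idea, not the implementation of WideCharToMultiByte; `append_utf8` and `utf16_to_utf8` are hypothetical names, and unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point (assumed <= 0x10FFFF and not a surrogate) as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 to UTF-8.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: valid only when a low surrogate follows.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                std::uint32_t low = in[i + 1];
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Note that the output length is not a simple multiple of the input length: a BMP character such as 中 grows from one 16-bit unit to three bytes, while an ASCII character shrinks from two bytes to one.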

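CString cs(L"中國"); // multi-byte build: the wide literal is converted to char on construction

std::cout<<cs.GetString()<<std::endl;//中國

std::string s(cs.GetString()); // read-only access, so GetString() rather than GetBuffer()

std::cout<<s.c_str()<<std::endl;//中國

CString cs_from_str(s.c_str());

std::cout<<cs_from_str.GetString()<<std::endl;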

The code above does not compile in a Unicode build: there is no implicit conversion between wchar_t strings and char strings.

Windows conversion functions:

// Conversion between CString and string, via std::wstring (UTF-16) and
// std::string (ANSI code page).
// Pass CP_UTF8 instead of CP_ACP to convert to and from UTF-8.
#include <windows.h>
#include <cstring>
#include <string>

std::string UnicodeToANSI( const std::wstring& str )
{
    // Wide char to multi-byte. The first call only reports the required
    // size in bytes, including the terminating null.
    int iTextLen = ::WideCharToMultiByte( CP_ACP, 0, str.c_str(), -1, NULL, 0, NULL, NULL );
    if ( iTextLen <= 0 )
        return std::string();

    char* pElementText = new char[iTextLen + 1];
    memset( pElementText, 0, sizeof( char ) * ( iTextLen + 1 ) );
    ::WideCharToMultiByte( CP_ACP, 0, str.c_str(), -1, pElementText, iTextLen, NULL, NULL );

    std::string strText( pElementText );
    delete[] pElementText;
    return strText;
}

std::wstring ANSIToUnicode( const std::string& str )
{
    // Multi-byte to wide char. The first call only reports the required
    // size in wchar_t units, including the terminating null.
    int unicodeLen = ::MultiByteToWideChar( CP_ACP, 0, str.c_str(), -1, NULL, 0 );
    if ( unicodeLen <= 0 )
        return std::wstring();

    wchar_t* pUnicode = new wchar_t[unicodeLen + 1];
    memset( pUnicode, 0, ( unicodeLen + 1 ) * sizeof( wchar_t ) );
    ::MultiByteToWideChar( CP_ACP, 0, str.c_str(), -1, pUnicode, unicodeLen );

    std::wstring rt( pUnicode );
    delete[] pUnicode; // allocated with new[], so released with delete[]
    return rt;
}
