CaryHsu - 學無止盡: SQL Server - Storing and Handling Unicode Characters

Saturday, April 21, 2012

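Every country has its own language and script, so text used to be stored in a region-specific character encoding: Big5 for Traditional Chinese, GB for Simplified Chinese, and so on. When a single system had to handle several languages, or rarely used characters, the result was often garbled text and other processing problems. To unify these encodings, the Unicode Standard was developed in step with the ISO/IEC 10646 international standard, together with the encoding forms UTF-8, UTF-16, and UTF-32. In Chinese, Unicode is commonly called 萬國碼.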

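In a typical web application, the front-end page has to declare its character encoding; otherwise the system default applies. In SQL Server, storing and processing Unicode characters requires the following two conditions to be met.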

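Requirements:
1. The column must support Unicode: use the nchar, nvarchar, or ntext data type.
2. Unicode string constants in a query need the N prefix. See the following document:

INF: Unicode String Constants in SQL Server Require N Prefix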
http://support.microsoft.com/?id=239530
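
The effect of leaving out the N prefix can be imitated outside SQL Server. Without the prefix, a string constant is converted to the database's non-Unicode code page, and characters that the code page cannot represent are lost. The Python sketch below only mimics that conversion. It assumes cp950 (the Big5 code page) as a stand-in for the database code page, and `to_code_page` is a hypothetical helper, not a SQL Server API.

```python
def to_code_page(text, code_page="cp950"):
    """Round-trip text through a legacy code page.

    Characters the code page cannot represent come back as '?',
    which is an assumed stand-in for what happens to a string
    constant written without the N prefix.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


common = "SQL 中文"       # every character exists in Big5
rare = "\U00020000"       # U+20000, a CJK Extension B ideograph outside Big5

print(to_code_page(common))  # survives the round trip unchanged
print(to_code_page(rare))    # comes back as a question mark
```

Once the character has been replaced, the query condition no longer contains the character the user typed, so it cannot match the stored row.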

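I expect most readers already know the above. Recently, though, I ran into an unusual case. A user had entered a rarely used character. Because the column was nvarchar, it was stored correctly, but any query whose condition included that character returned the wrong number of rows. It turned out that the character belongs to the Hong Kong Supplementary Character Set (HKSCS), so a specific collation has to be specified when handling it, as described below.

The good news is that SQL Server 2012 introduces collations whose names end in SC specifically for supplementary characters. With an ordinary collation, LEN reports a length of 2 for a single supplementary character, and functions such as SUBSTRING can return broken results when extracting characters. Supplementary Character (SC) collations exist mainly to solve this problem, so if you have not upgraded yet, now is a good time!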
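
The length and substring problem comes from counting 16-bit code units instead of characters. The Python sketch below reproduces that kind of counting on UTF-16 data. It illustrates the mechanism only and is not SQL Server code; `naive_len` and `naive_substring` are hypothetical names, and U+20000 is an arbitrary supplementary character chosen for the example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def naive_len(text):
    """Length in 16-bit code units, the way a non-SC collation counts."""
    return len(utf16_units(text))


def naive_substring(text, start, length):
    """Slice by code units (1-based start), which can split a surrogate pair."""
    units = utf16_units(text)[start - 1:start - 1 + length]
    data = b"".join(unit.to_bytes(2, "little") for unit in units)
    # A lone surrogate is not valid UTF-16, so it is replaced when decoding.
    return data.decode("utf-16-le", errors="replace")


text = "A\U00020000B"                  # three characters, one of them supplementary
print(len(text))                       # 3 characters
print(naive_len(text))                 # 4 code units
print(naive_substring(text, 2, 2) == "\U00020000")  # True: both halves taken
print(naive_substring(text, 2, 1) == "\U00020000")  # False: the pair was cut in half
```

An SC collation counts the pair as one character, which corresponds to the first `len(text)` result above.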

Additional notes:
SQL Server provides data types such as nchar and nvarchar to store Unicode data. These data types encode text in a format called UTF-16. The Unicode Consortium allocates each character a unique codepoint, which is a value in the range 0x0000 to 0x10FFFF. The most frequently used characters have codepoint values that will fit into a 16-bit word in memory and on disk, but characters with codepoint values larger than 0xFFFF require two consecutive 16-bit words. These characters are called supplementary characters, and the two consecutive 16-bit words are called surrogate pairs.
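
How a codepoint above 0xFFFF becomes two 16-bit words follows the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into a high and a low half. A minimal Python sketch of that calculation, where `to_surrogate_pair` is a name chosen for this example:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary codepoint into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))  # 0xd840 0xdc00

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U00020000".encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```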

If you use supplementary characters:

Supplementary characters can be used in ordering and comparison operations in collation versions 90 or greater.
All _100 level collations support linguistic sorting with supplementary characters.
Supplementary characters are not supported for use in metadata, such as in names of database objects.
Introduced in SQL Server 2012, a new family of supplementary character (SC) collations can be used with the data types nchar, nvarchar and sql_variant. For example: Latin1_General_100_CI_AS_SC, or if using a Japanese collation, Japanese_Bushu_Kakusu_100_CI_AS_SC.

The SC flag can be applied to:

Version 90 Windows collations
Version 100 Windows collations

The SC flag cannot be applied to:

Version 80 non-versioned Windows collations
The BIN or BIN2 binary collations
The SQL* collation
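
The two lists above can be restated as a small name check. This Python sketch is only a heuristic built on collation naming conventions (a `_90_` or `_100_` version segment, an `SQL_` prefix, a `_BIN` or `_BIN2` suffix). `may_take_sc_flag` is a hypothetical helper that covers just the cases listed here; it does not query SQL Server and says nothing about collation versions the lists do not mention.

```python
import re


def may_take_sc_flag(collation_name):
    """Guess from the name whether the SC flag can be applied (heuristic only)."""
    name = collation_name.upper()
    if name.startswith("SQL_"):
        return False                       # SQL* collations
    if name.endswith("_BIN") or name.endswith("_BIN2"):
        return False                       # binary collations
    # Version 90 and 100 Windows collations carry the version in the name;
    # version 80 collations have no version segment at all.
    return re.search(r"_(90|100)_", name) is not None


for name in (
    "Latin1_General_100_CI_AS",
    "Chinese_Taiwan_Stroke_90_CI_AS",
    "Latin1_General_CI_AS",
    "Latin1_General_100_BIN2",
    "SQL_Latin1_General_CP1_CI_AS",
):
    print(name, may_take_sc_flag(name))
```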

References:
==========
Hong Kong Supplementary Character Set, radicals with 5 strokes
http://code.web.idv.hk/h2u/h2u_05.php
INF: Storing UTF-8 Data in SQL Server
http://support.microsoft.com/?id=232580
Translate to and from UCS-2 or UTF-8 as appropriate within the application. Sample code for this type of conversion is located at the Unicode Consortium's site:
http://www.unicode.org/Public/PROGRAMS/CVTUTF
Collation (Transact-SQL)
http://msdn.microsoft.com/en-us/library/ff848763(v=sql.105).aspx
CAST and CONVERT (Transact-SQL)
http://technet.microsoft.com/zh-tw/library/ms187928.aspx
Collation and Unicode Support
http://msdn.microsoft.com/zh-tw/library/ms143726.aspx
nchar and nvarchar (Transact-SQL)
http://msdn.microsoft.com/en-us/library/ms186939.aspx
Surrogates and Supplementary Characters
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx

Keywords: SQL Server 2012, Unicode, UTF-8, UTF-16, Supplementary Character, supplementary character set, nchar, nvarchar

8 comments:

Unknown, April 23, 2012, 9:05 PM
A question: if the database was not set up to support the extended character set from the start, can the characters still be stored?
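caryhsu, April 24, 2012, 8:45 PM
As long as you use nchar, nvarchar, or ntext together with the N prefix, the characters will always be stored. The problem only shows up when values are compared in a query.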
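Anonymous, June 15, 2012, 11:47 AM
We are migrating a system from DB2 to SQL Server and have run into a question about Unicode settings. We have previously ported it to Oracle, MySQL, and PostgreSQL, where Unicode only requires choosing UTF-8 at CREATE DATABASE time. I asked one of your colleagues, who gave me this URL, but it does not solve my problem. I want the system to be transparent to the database, with Unicode configured at the DBMS level or the database level rather than the column level; otherwise we have to modify SQL statements for each database. I am not familiar with SQL Server. Could you suggest a solution?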
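shchen, June 21, 2013, 5:49 PM
I have run into this problem too. On SQL Server 2005, with the n-prefixed data types, I can store Simplified Chinese characters, but queries that compare against them cannot find the rows.
What do the ordinary collation and the special collation mentioned above refer to? Can they solve the problem on SQL Server 2005, or is upgrading to SQL Server 2012 with the supplementary character collations the only fix?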
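Anonymous, April 2, 2014, 9:19 PM
A question about Supplementary Character (SC) collations:
can they be installed on SQL Server 2012 Express? I am having trouble in this area.
Also, where can I download them? Please let me know. Thanks!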
Anonymous, April 3, 2014, 1:37 PM
I have resolved the problem. Thanks!
