mysql实现原理 - 字符集和排序规则
介绍
字符集
在计算机中字符是以二进制的格式存储的,每个字符在字符编码系统中都有一个唯一的二进制代码,这个过程确保了字符在存储和传输时不会丢失或混淆。将一个字符映射成一个二进制数据的过程就称为字符编码,而将一个二进制数据映射到一个字符的过程就称为字符解码。而字符集就是一套字符和二进制数据映射关系的集合。
一些重要的字符集
ASCII 字符集
共收录128个字符,包括空格、标点符号、数字、大小写字母和一些不可见字符。由于总共才128个字符,所以可以使用1个字节来进行编码。

ISO 8859-1 字符集
共收录256个字符,是在 ASCII 字符集的基础上又扩充了128个西欧常用字符(包括德法两国的字母),也可以使用1个字节来进行编码。这个字符集也有一个别名 latin1 。
GB2312 字符集
收录了汉字以及拉丁字母、希腊字母、日文平假名及片假名字母、俄语西里尔字母。其中收录汉字6763个,其他文字符号682个。
GBK 字符集
GBK 字符集只是在收录字符范围上对 GB2312 字符集作了扩充,编码方式上兼容 GB2312 。
utf8 字符集
收录地球上能想到的所有字符,而且还在不断扩充。这种字符集兼容 ASCII 字符集,采用变长编码方式,编码一个字符需要使用1~4个字节。
实准确的说,utf8只是Unicode字符集的一种编码方案,Unicode字符集可以采用utf8、utf16、utf32这几种编码方案,utf8使用1~4个字节编码一个字符,utf16使用2个或4个字节编码一个字符,utf32使用4个字节编码一个字符
比较规则
比较规则就是我们查询数据库中的数据,数据以什么样的顺序返回,不同的比较规则,返回的顺序也是区别很大的,可能直接影响到我们对数据的使用
比如:A的ascii码是65,B的ascii码是66,a的ascii码是97,如果以二进制比较规则,则是直接比较字符的ascii码,结果就是 A < B < a ,但是如果二进制比较基础上再加上一个不区分大小写,则又是另外一个顺序了A = a < B 。
所以了解一下数据库的字符集和比较规则对我们正确使用数据库是很重要的,接下来我们展开看看。
数据库中的字符集和比较规则
说明
- 演示使用的数据库版本为[8.0.13]
查看
查看字符集
show character set ;
或
show charset ;
| 字符集 | 描述 | 默认比较规则 | 最大长度(字节) | 
|---|---|---|---|
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 | 
| ascii | US ASCII | ascii_general_ci | 1 | 
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 | 
| binary | Binary pseudo charset | binary | 1 | 
| cp1250 | Windows Central European | cp1250_general_ci | 1 | 
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 | 
| cp1256 | Windows Arabic | cp1256_general_ci | 1 | 
| cp1257 | Windows Baltic | cp1257_general_ci | 1 | 
| cp850 | DOS West European | cp850_general_ci | 1 | 
| cp852 | DOS Central European | cp852_general_ci | 1 | 
| cp866 | DOS Russian | cp866_general_ci | 1 | 
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 | 
| dec8 | DEC West European | dec8_swedish_ci | 1 | 
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 | 
| euckr | EUC-KR Korean | euckr_korean_ci | 2 | 
| gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | 
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 | 
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 | 
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 | 
| greek | ISO 8859-7 Greek | greek_general_ci | 1 | 
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 | 
| hp8 | HP West European | hp8_english_ci | 1 | 
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 | 
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 | 
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 | 
| latin1 | cp1252 West European | latin1_swedish_ci | 1 | 
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 | 
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 | 
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 | 
| macce | Mac Central European | macce_general_ci | 1 | 
| macroman | Mac West European | macroman_general_ci | 1 | 
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 | 
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 | 
| tis620 | TIS620 Thai | tis620_thai_ci | 1 | 
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 | 
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 | 
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 | 
| utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 | 
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 | 
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 | 
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 | 
当前版本支持的字符集为41个。
utf8与utf8mb4
正常来说,utf8字符集表示一个字符需要1~4个字节,在 MySQL 5.5.3 及之前的版本中,utf8 字符集实际上是一个1~3的 UTF-8 变种。支持常规字符,并不包括所有 Unicode 字符,特别是那些位于扩展区(如表情符号、一些罕见汉字等)的字符。后来在MySQL 5.5.3 之后,utf8mb4 被引入以支持完整的四字节 UTF-8 编码。MySQL 官方也推荐在需要存储全 Unicode 字符集时使用 utf8mb4。考虑到兼容性,utf8字符也保留了下来。
查看比较规则
-- 查看utf8mb4支持的比较规则
show collation like "utf8mb4%";
| 比较规则 | 字符集 | ID | 
|---|---|---|
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | 
| utf8mb4_0900_as_ci | utf8mb4 | 305 | 
| utf8mb4_0900_as_cs | utf8mb4 | 278 | 
| utf8mb4_0900_bin | utf8mb4 | 309 | 
| utf8mb4_bin | utf8mb4 | 46 | 
| utf8mb4_croatian_ci | utf8mb4 | 245 | 
| utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | 
| utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | 
| utf8mb4_czech_ci | utf8mb4 | 234 | 
| utf8mb4_danish_ci | utf8mb4 | 235 | 
| utf8mb4_da_0900_ai_ci | utf8mb4 | 267 | 
| utf8mb4_da_0900_as_cs | utf8mb4 | 290 | 
| utf8mb4_de_pb_0900_ai_ci | utf8mb4 | 256 | 
| utf8mb4_de_pb_0900_as_cs | utf8mb4 | 279 | 
| utf8mb4_eo_0900_ai_ci | utf8mb4 | 273 | 
| utf8mb4_eo_0900_as_cs | utf8mb4 | 296 | 
| utf8mb4_esperanto_ci | utf8mb4 | 241 | 
| utf8mb4_estonian_ci | utf8mb4 | 230 | 
| utf8mb4_es_0900_ai_ci | utf8mb4 | 263 | 
| utf8mb4_es_0900_as_cs | utf8mb4 | 286 | 
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 | 
| utf8mb4_es_trad_0900_as_cs | utf8mb4 | 293 | 
| utf8mb4_et_0900_ai_ci | utf8mb4 | 262 | 
| utf8mb4_et_0900_as_cs | utf8mb4 | 285 | 
| utf8mb4_general_ci | utf8mb4 | 45 | 
| utf8mb4_german2_ci | utf8mb4 | 244 | 
| utf8mb4_hr_0900_ai_ci | utf8mb4 | 275 | 
| utf8mb4_hr_0900_as_cs | utf8mb4 | 298 | 
| utf8mb4_hungarian_ci | utf8mb4 | 242 | 
| utf8mb4_hu_0900_ai_ci | utf8mb4 | 274 | 
| utf8mb4_hu_0900_as_cs | utf8mb4 | 297 | 
| utf8mb4_icelandic_ci | utf8mb4 | 225 | 
| utf8mb4_is_0900_ai_ci | utf8mb4 | 257 | 
| utf8mb4_is_0900_as_cs | utf8mb4 | 280 | 
| utf8mb4_ja_0900_as_cs | utf8mb4 | 303 | 
| utf8mb4_ja_0900_as_cs_ks | utf8mb4 | 304 | 
| utf8mb4_latvian_ci | utf8mb4 | 226 | 
| utf8mb4_la_0900_ai_ci | utf8mb4 | 271 | 
| utf8mb4_la_0900_as_cs | utf8mb4 | 294 | 
| utf8mb4_lithuanian_ci | utf8mb4 | 236 | 
| utf8mb4_lt_0900_ai_ci | utf8mb4 | 268 | 
| utf8mb4_lt_0900_as_cs | utf8mb4 | 291 | 
| utf8mb4_lv_0900_ai_ci | utf8mb4 | 258 | 
| utf8mb4_lv_0900_as_cs | utf8mb4 | 281 | 
| utf8mb4_persian_ci | utf8mb4 | 240 | 
| utf8mb4_pl_0900_ai_ci | utf8mb4 | 261 | 
| utf8mb4_pl_0900_as_cs | utf8mb4 | 284 | 
| utf8mb4_polish_ci | utf8mb4 | 229 | 
| utf8mb4_romanian_ci | utf8mb4 | 227 | 
| utf8mb4_roman_ci | utf8mb4 | 239 | 
| utf8mb4_ro_0900_ai_ci | utf8mb4 | 259 | 
| utf8mb4_ro_0900_as_cs | utf8mb4 | 282 | 
| utf8mb4_ru_0900_ai_ci | utf8mb4 | 306 | 
| utf8mb4_ru_0900_as_cs | utf8mb4 | 307 | 
| utf8mb4_sinhala_ci | utf8mb4 | 243 | 
| utf8mb4_sk_0900_ai_ci | utf8mb4 | 269 | 
| utf8mb4_sk_0900_as_cs | utf8mb4 | 292 | 
| utf8mb4_slovak_ci | utf8mb4 | 237 | 
| utf8mb4_slovenian_ci | utf8mb4 | 228 | 
| utf8mb4_sl_0900_ai_ci | utf8mb4 | 260 | 
| utf8mb4_sl_0900_as_cs | utf8mb4 | 283 | 
| utf8mb4_spanish2_ci | utf8mb4 | 238 | 
| utf8mb4_spanish_ci | utf8mb4 | 231 | 
| utf8mb4_sv_0900_ai_ci | utf8mb4 | 264 | 
| utf8mb4_sv_0900_as_cs | utf8mb4 | 287 | 
| utf8mb4_swedish_ci | utf8mb4 | 232 | 
| utf8mb4_tr_0900_ai_ci | utf8mb4 | 265 | 
| utf8mb4_tr_0900_as_cs | utf8mb4 | 288 | 
| utf8mb4_turkish_ci | utf8mb4 | 233 | 
| utf8mb4_unicode_520_ci | utf8mb4 | 246 | 
| utf8mb4_unicode_ci | utf8mb4 | 224 | 
| utf8mb4_vietnamese_ci | utf8mb4 | 247 | 
| utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | 
| utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | 
这些比较规则的命名还挺有规律的:
- 比较规则名称以与其关联的字符集的名称开头
- 后边紧跟着该比较规则主要作用于哪种语言,比如 utf8_polish_ci 表示以波兰语的规则比较,
 utf8_spanish_ci 是以西班牙语的规则比较, utf8_general_ci 是一种通用的比较规则
- 名称后缀意味着该比较规则是否区分语言中的重音、大小写啥的,具体可以用的值如下:
| 后缀 | 英文释义 | 描述 | 
|---|---|---|
| _ai | accent insensitive | 不区分重音 | 
| _as | accent sensitive | 区分重音 | 
| _ci | case insensitive | 不区分大小写 | 
| _cs | case sensitive | 区分大小写 | 
| _bin | binary | 以二进制方式比较 | 
比如 utf8_general_ci 这个比较规则是以 ci 结尾的,说明不区分大小写
字符集和比较规则级别
MySQL 有4个级别的字符集和比较规则,分别是:
- 服务器级别
- 数据库级别
- 表级别
- 列级别
服务器级别
character_set_server 服务器级别的字符集
collation_server 服务器级别的比较规则
查看
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb4_0900_ai_ci |
+------------------+--------------------+
1 row in set (0.02 sec)
可以看到这台mysql服务器默认的字符集为utf8mb4,默认的比较规则为utf8mb4_0900_ai_ci。
修改
- 通过配置文件修改
[server]
character_set_server=gbk
collation_server=gbk_chinese_ci
再次查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | gbk   |
+----------------------+-------+
1 row in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.02 sec)
已经修改成我们的预期值了。
数据库级别
创建数据库时,可以指定字符集和比较规则,如果不指定,则继承服务器级别字符集和比较规则
语法:
-- 创建数据库
CREATE DATABASE 数据库名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [[DEFAULT] COLLATE 比较规则名称];
-- 修改数据库
ALTER DATABASE 数据库名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [[DEFAULT] COLLATE 比较规则名称];
示例:
- 创建数据库时,不指定字符集和比较规则
-- 不指定字符集和比较规则
mysql> create database db1;
Query OK, 1 row affected (0.01 sec)
mysql> use db1;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation_database';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.01 sec)
mysql> 
- 创建数据库时,指定字符集和比较规则
-- 指定字符集和比较规则
mysql> create database db2 character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 1 row affected (0.01 sec)
mysql> use db2
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+---------+
| Variable_name          | Value   |
+------------------------+---------+
| character_set_database | utf8mb4 |
+------------------------+---------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+--------------------+
| Variable_name      | Value              |
+--------------------+--------------------+
| collation_database | utf8mb4_general_ci |
+--------------------+--------------------+
1 row in set (0.00 sec)
mysql> 
- 修改数据库字符集和比较规则
mysql> alter database db2 character set utf8mb3 collate utf8mb3_general_ci;
Query OK, 1 row affected, 2 warnings (0.01 sec)
mysql> use db2;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+---------+
| Variable_name          | Value   |
+------------------------+---------+
| character_set_database | utf8mb3 |
+------------------------+---------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+--------------------+
| Variable_name      | Value              |
+--------------------+--------------------+
| collation_database | utf8mb3_general_ci |
+--------------------+--------------------+
1 row in set (0.00 sec)
- 如果字符集和比较规则不匹配,会报错
mysql> alter database db2 character set utf8mb3 collate utf8mb4_general_ci;
ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3'
mysql> 
表级别
创建数据表时,可以指定字符集和比较规则,如果不指定,则继承数据库级别字符集和比较规则
语法:
CREATE TABLE 表名 (列的信息)
 [[DEFAULT] CHARACTER SET 字符集名称]
 [COLLATE 比较规则名称]]
ALTER TABLE 表名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [COLLATE 比较规则名称]
示例
- 创建数据表时,不指定字符集和比较规则
mysql> create table t1 (col VARCHAR(10));
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't1'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+-----------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME |
+--------------+------------+-----------------+--------------------+
| db1          | t1         | gbk_chinese_ci  | gbk                |
+--------------+------------+-----------------+--------------------+
1 row in set (0.00 sec)
- 创建数据表时,指定字符集和比较规则
mysql> create table t2 (col VARCHAR(10)) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.02 sec)
mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't2'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t2         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)
- 修改字符集和比较规则
mysql> alter table t2  character set utf8mb3 collate utf8mb3_general_ci;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 2
mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't2'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t2         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)
- 如果字符集和比较规则不匹配,会报错
mysql> alter table t2  character set utf8mb3 collate utf8mb4_general_ci;
ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3'
列级别
对于存储字符串的列,
同一个表中的不同的列也可以有不同的字符集和比较规则。我们在创建和修改列定义的时候可以指定该列的字符集和比较规则。
语法:
CREATE TABLE 表名(
 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称],
 其他列...
);
ALTER TABLE 表名 MODIFY 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称];
示例:
- 同一个表中的不同的列指定不同的字符集和比较规则
mysql> create table t3 (
    -> c1 varchar(100) character set utf8mb3 collate utf8mb3_general_ci,
    -> c2 varchar(100) character set gbk collate gbk_chinese_ci,
    -> c3 varchar(100)
    -> ) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
mysql> show create table t3\G;
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) CHARACTER SET gbk COLLATE gbk_chinese_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.00 sec)
可以看到 c1和 c2 都是自定义字符集和比较规则,c3则是继承了表t3的字符集和比较规则
- 修改列的字符集和比较规则
-- 修改c2列的字符集和比较规则,使之继承表t3的字符集和比较规则
mysql> alter table t3 modify column `c2` varchar(100);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> show create table t3\G;
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.00 sec)
补充说明
- 在转换列的字符集时需要注意,如果转换前列中存储的数据不能用转换后的字符集进行表示会发生错误。比方说原先列使用的字符集是utf8,列中存储了一些汉字,现在把列的字符集转换为ascii的话就会出错,因为ascii字符集并不能表示汉字字符。
例如:
-- t3.c1的当前字符集为utf8mb3,比较规则为utf8mb3_general_ci
mysql> insert into t3(c1) values("列1");
Query OK, 1 row affected (0.00 sec)
-- 修改t3.c1列的字符集会报错
mysql> alter table db1.t3 modify column c1 varchar(100) CHARACTER SET `ascii` COLLATE ascii_general_ci;
ERROR 1366 (HY000): Incorrect string value: '\xE5\x88\x971' for column 'c1' at row 1
- 如果只修改字符集,则比较规则会自动修改为字符集支持的默认比较规则
-- 查看表t3的当前字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)
-- 修改表t3的字符集
mysql> alter table t3 CHARACTER SET utf8mb3;
Query OK, 0 rows affected, 1 warning (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 1
-- 查看表t3的修改后的字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)
比较规则确实自动随着字符集的修改而发生了修改
- 如果只修改比较规则,则字符集也会自动修改为跟比较规则对应的字符集
-- 查看表t3的当前字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)
-- 修改表t3的比较规则
mysql> alter table t3 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
-- 查看表t3的修改后的字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)
- 我们假设一种级别的父子级关系服务器级别>数据库级别>表级别>列级别,父级的字符集和比较规则发生变化,不影响子级及以后级别的字符集和比较规则,或者说子级的字符集和比较规则在创建时就已经指定好了,不会受其他级别字符集和比较规则修改的影响
-- 查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | gbk   |
+----------------------+-------+
1 row in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.00 sec)
-- 创建数据库,继承服务器级别的字符集和比较规则
mysql> create database db5;
Query OK, 1 row affected (0.01 sec)
-- 查看数据库的字符集和比较规则
mysql> use db5;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+----------------+
| Variable_name      | Value          |
+--------------------+----------------+
| collation_database | gbk_chinese_ci |
+--------------------+----------------+
1 row in set (0.01 sec)
------------------------
-- 修改服务器字符集和比较规则配置 并重启服务器
[server]
character_set_server=utf8mb4
collation_server=utf8mb4_general_ci
------------------------
-- 查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
1 row in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb4_general_ci |
+------------------+--------------------+
1 row in set (0.00 sec)
-- 查看数据库的字符集和比较规则
mysql> use db5;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.01 sec)
mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+----------------+
| Variable_name      | Value          |
+--------------------+----------------+
| collation_database | gbk_chinese_ci |
+--------------------+----------------+
1 row in set (0.00 sec)
字符集的应用
上边我们介绍的是在数据存储方面的编码和解码,接下来我们介绍在数据传输方便的编码和解码。
准备
- mysql服务器启动配置设置
[server]
character_set_server=utf8mb4
- 常用的mysql客户端连接到mysql服务器后,都会设置客户端字符集,以便数据可以正确传输,但是这样也给我们查看底层的系统变量造成了干扰,为了避免这个情况,这里采用php连接mysql服务器
关键系统变量
客户端连接mysql服务器,查询获取返回数据,会有多次的字符集编码转换,这个过程中会用到以下三个系统变量:
- character_set_client: 服务器解码请求时使用的字符集
- character_set_connection: 服务器处理请求时会把请求字符串从 character_set_client转为character_set_connection
- character_set_results: 服务器向客户端返回数据时使用的字符集
查看这几个变量的默认值
<?php
// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";
$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_client';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;
$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_connection';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;
$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_results';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;
?>
查询结果:
连接成功
character_set_client : utf8mb4
character_set_connection : utf8mb4
character_set_results : utf8mb4
可以看到,这几个变量的默认值继承了mysql服务器级别的字符集
客户端连接成功后,可以指定字符集来修改这三个系统变量的值
$conn->set_charset("gbk");
查询结果:
连接成功
character_set_client : gbk
character_set_connection : gbk
character_set_results : gbk
也可以通过客户端单个设置系统变量的值
这是设置是非持久的
$conn->set_charset("gbk");
$ret = $conn->execute_query("SET character_set_client = utf8;");
查询结果:
连接成功
character_set_client : utf8mb3
character_set_connection : gbk
character_set_results : gbk
编码和解码使用不同字符集的后果
正确的查询:
<?php
// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";
$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());
查询结果:
连接成功
Array
(
    [0] => Array
        (
            [0] => 列1
            [1] => 
            [2] => 
        )
)
- 客户端单独设置了character_set_client,之后:
// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";
// 设置 character_set_client
$ret = $conn->execute_query("SET character_set_client = gbk");
// 查询
$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());
?>
查询结果:
连接成功
Array
(
)
结果查询为空,说明了mysql服务端通过gbk解码请求中的%列%时,中文字符不能正确解析出来,导致查询不到正确的结果
- 客户端设置了跟服务端不兼容的字符集后
<?php
// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";
// 设置字符集为gbk
$conn->set_charset("gbk");
$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());
查询报错:
连接成功
PHP Fatal error:  Uncaught mysqli_sql_exception: Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (gbk_chinese_ci,COERCIBLE) for operation 'like' in /Users/xxx/code/php/test/test.php:18
Stack trace:
#0 /Users/xxx/code/php/test/test.php(18): mysqli->execute_query('select * from d...')
#1 {main}
  thrown in /Users/xxx/code/php/test/test.php on line 18
还有很多种情况,这里就不一一列举,但是字符集不一致导致的编码和解码出错的情况,导致的后果还是很严重的,我们需要特别关注
参考下面这张图,我们能更清晰的感觉到一次request,底层需要做很多次的字符串编码解码和字符集的转换,有一个环节出问题,都可能会返回一个错误的结果。
 
 这样绕来绕去看着就很晕,为了简单起见,我们通常都把 character_set_client 、character_set_connection、character_set_results 这三个系统变量设置成和客户端使用的字符集一致的情况,这样减少了很多无谓的字符集转换
 为了方便设置,mysql提供了一条非常简单的语句:
SET NAMES 字符集名;
这样一次性就可以把这三条设置成统一的字符集
比较规则的应用
比较规则的作用通常体现比较字符串大小的表达式以及对某个字符串列进行排序中,所以有时候也称为排序规则,影响的是结果集的顺序
示例:
首先看一下db1.t3.c1列的排序规则:
CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
c1列目前的排序规则是 utf8mb3_general_ci,不区分大小写
 插入几条数据:
mysql> insert into db1.t3(c1) values('A'),('B'),('a'),('b');
Query OK, 4 rows affected (0.04 sec)
Records: 4  Duplicates: 0  Warnings: 0
查询一下看看返回结果:
mysql> select * from db1.t3 order by c1 asc;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | NULL | NULL |
| A    | NULL | NULL |
| a    | NULL | NULL |
| B    | NULL | NULL |
| b    | NULL | NULL |
| 列1  | NULL | NULL |
+------+------+------+
6 rows in set (0.00 sec)
我们可以看到 第二行到第五行,是大小写交叉返回的,顺序是26个英文字母的先后顺序,跟我们预期的是一样的
接下来我们修改一下排序规则
utf8mb3_bin 排序规则是直接比较字符编码,是区分大小写的
mysql> alter table db1.t3 modify column c1 varchar(200) collate utf8mb3_bin;
Query OK, 0 rows affected, 1 warning (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 1
再次执行同样的查询语句:
mysql> select * from db1.t3 order by c1 asc;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | NULL | NULL |
| A    | NULL | NULL |
| B    | NULL | NULL |
| a    | NULL | NULL |
| b    | NULL | NULL |
| 列1  | NULL | NULL |
+------+------+------+
6 rows in set (0.00 sec)
第二行到第五行的返回排序果然发生了调整,跟我们预期的也是一直的。
总结
对字符集和排序规则的深入了解,有利于我们正确使用mysql,以及针对相关问题进行快速定位。
本篇时间和篇幅都有限,先整理到这里,后续进行了更深入的整理,会通过版本迭代式地补充到这里。
