MySQL -- utf8 - utf8mb4
通过一个
emoji
表情引发的报错来认识下MySQL
中独一无二的utf8mb4
1.发现问题
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
at com.example.demo.tst.Three.main(Three.java:51)
碰到这种四个字节(\xF0\x9F\x98\xAB
)的报错,说明字符集有问题。
2. 分析问题
第一篇文章中,我们就说过mysql中的字符集转换是怎么玩的,那就一个一个来看下哪里的问题。
表结构:
mysql> show create table account;
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table
|
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| account | CREATE TABLE `account` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT '自增id',
`name` varchar(100) CHARACTER SET utf8mb4 DEFAULT NULL,
`balance` int(11) DEFAULT NULL COMMENT '余额',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
- 先看库中字段的字符集
mysql> show full columns from account;
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| id | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | 自增id |
| name | varchar(100) | utf8_general_ci | YES | | NULL | | select,insert,update,references | |
| balance | int(11) | NULL | YES | | NULL | | select,insert,update,references | 余额 |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
3 rows in set (0.01 sec)
上面报错的字段是name
,这里其比较规则为utf8_general_ci
,我们知道这是utf8
的比较规则,MySQL
中的utf8
字符集实际是utf8mb3
,最大只能放三个字节。
那就改呗,四个字节的utf8
-- utf8mb4
MySQL
版本5.1.13
以后开始支持utf8mb4
3.解决问题
- 检查列字符集:
mysql> alter table account modify name varchar(100) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 1 row affected (0.76 sec)
Records: 1 Duplicates: 0 Warnings: 0
- 检查
MySQL
服务端那三个字符集
mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)
啥?居然都是utf8
,没的说,改:
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
再查看下:
mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)
嗯,改回来了~
- 检查客户端发送字符集(这步有点坑~)
上面两步都能理解,这一步可能有点不明白为啥还要再来一个字符集?
其实,一个字符从出生(发送)到安全落库(落盘),经历的字符集可能有点多,但其第一个字符集就是这个所谓的客户端发送字符集。
虽然要经过这么多的字符集转换,但并不会有那么多的转换,因为实际情况下会将其中几个字符集都设为一致~
我们是通过Java
代码连接MySQL
,这里可以在Connector/J
中指定(省略其他配置):
jdbc:mysql://localhost:3306/asia?characterEncoding=utf8mb4&useUnicode=true
不过,你会发现这样启动会报错告诉你不支持这么写。。
这里先贴下官方说明:
Setting the Character Encoding
The character encoding between client and server is automatically detected upon connection (provided that the Connector/J connection properties
characterEncoding
andconnectionCollation
are not set). You specify the encoding on the server using the system variablecharacter_set_server
(for more information, see Server Character Set and Collation). The driver automatically uses the encoding specified by the server. For example, to use the 4-byte UTF-8 character set with Connector/J, configure the MySQL server withcharacter_set_server=utf8mb4
, and leavecharacterEncoding
andconnectionCollation
out of the Connector/J connection string. Connector/J will then autodetect the UTF-8 setting.
MySQL
告诉我们如若要使用4个字节的utf8
,那你需要做两件事:
-
MySQL
服务端配置character_set_server=utf8mb4
- 不要设置
characterEncoding
andconnectionCollation
属性
网上还有一种说法:
Connector/J did not support utf8mb4 for servers 5.5.2 and newer.
Connector/J now auto-detects servers configured with character_set_server=utf8mb4 or treats the Java encoding utf-8 passed usingcharacterEncoding=... as utf8mb4 in the SET NAMES= calls it makes when establishing the connection. (Bug #54175)
这个意思是Connector/J
并不支持utf8mb4
,不过有两种方式等价:
-
MySQL
服务器中配置加上character_set_server=utf8mb4
-
Connector/J
中加上characterEncoding=utf8
本文选择相信官方文档,那个Bug #54175
贴出来也是求同存异, 也可能是我理解错了[狗头]
我这里就直接改库中字符集了:
mysql> set character_set_server=utf8mb4;
Query OK, 0 rows affected (0.00 sec)
嗯,上面吹了这么多牛逼,结果发现还是不行:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
at com.example.demo.tst.Three.main(Three.java:51)
原因很简单,需要重启,不过重启读的是配置文件中的配置,本次会话手动设置的字符集将全部失效。
所以建议写在MySQL
配置文件中 my.ini
(windows
系统)
综上的配置如下(省略其他配置):
[client]
# 服务端的三个字符集,效果等同 set NAMES utf8mb4
default-character-set=utf8mb4
[mysqld]
# 指定服务器系统字符集
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
这里的比较规则是
ci
(默认是这种比较规则),是不区分大小写的,如果需要可以改成utf8mb4_bin
再把服务重启一哈:
C:\Windows\system32>net stop mysql
MySQL 服务正在停止..
MySQL 服务已成功停止。
C:\Windows\system32>
C:\Windows\system32>net start mysql
MySQL 服务正在启动 ..
MySQL 服务已经启动成功。
代码再执行就执行成功了,查询看看:
mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
| 1 | 😫 | 10 |
+----+------+---------+
1 row in set (0.00 sec)
这个表情真丑。。
这里还可以玩一个东西,我们把服务端的三个字符集改成utf8
,再查询看看:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql>
mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
| 1 | ? | 10 |
+----+------+---------+
1 row in set (0.00 sec)
还可以在这个时候来把列name
的字符集改回utf8
试试:
mysql> alter table account modify name varchar(100) character set utf8 collate utf8_general_ci;
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 2
挺好玩吧~~~
番外
- 本地开发环境测试的时候发现,当配置文件中加了这个配置
character-set-server=utf8mb4
,那么无论其他的字符集你怎么玩,连接怎么设置,你的emoji
表情都能正常入库,而生产环境不行。
推论:
character-set-server
这个字符集的作用可能不止于指定发送字符集为utf8mb4
, 甚至猜测不能指定发送的字符集
如果有大佬知道的,还望留言告知,万谢!
- 鉴于以上情况,测试环境
characterEncoding=utf8
此配置基本无用. - 对于生产环境不太好修改库设置的,可以在数据源中增加
initSql
,以druid
配置为例:
dataSource = new DruidDataSource();
// 省略其他配置
List<String> sqlList = new ArrayList<>();
sqlList.add("SET NAMES utf8mb4");
dataSource.setConnectionInitSqls(sqlList);
每次连接都会去设置字符集为uft8mb4
,生产环境可行。
- 最后捋一捋思路
- 库中字段字符集为
utf8mb4
- 连接的字符集要为
utf8mb4
, 就是连接过程中的那三个, 通过SET NAMES UTF8MB4
指定,这一步可以走库配置也可以走数据源设置,具体见上文 - 发送字符集要为
utf8mb4
,也就是这个配置character-set-server=utf8mb4
- 库中字段字符集为
按照这个思路,BMP
中的字符保管你全部都能存。
BMP
是啥? 我下篇文章就写写Unicode
吧,等我~