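Some background first: I have been working on an automated deployment scheme for MariaDB Galera Cluster on Docker (MGC for short below; I may write that scheme up separately when I have time). All debugging is done, and I wanted to deploy an MGC cluster in production as a slave of the existing master, to use for canary testing first.
The production master runs MySQL 5.5, while the new MGC uses MariaDB 10.3.12, the latest stable release at the time. After building the cluster and importing a full SQL dump taken from the master, I ran CHANGE MASTER to set up replication and got the following error: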
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: x.x.x.x
Master_User: rpl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.001411
Read_Master_Log_Pos: 360808406
Relay_Log_File: mysql-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: mysql-bin.001411
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 360808406
Relay_Log_Space: 256
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1595
Last_IO_Error: Relay log write failure: could not queue event from master
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 15410
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0
1 row in set (0.00 sec)
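When checking several nodes, it can help to pull the relevant fields out of this output programmatically instead of reading it by eye. The following is a minimal sketch, assuming the `\G` output has been captured as plain text; the function names `parse_slave_status` and `io_thread_problem` are made up for this illustration and are not part of any MariaDB tooling.

```python
def parse_slave_status(text: str) -> dict:
    """Turn the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        # Skip the "*** 1. row ***" separator and lines without a field name.
        if line.lstrip().startswith("*") or ":" not in line:
            continue
        # Split on the first colon only: error messages contain colons too.
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def io_thread_problem(status: dict):
    """Return 'errno: message' if the I/O thread is not running, else None."""
    if status.get("Slave_IO_Running") == "Yes":
        return None
    errno = status.get("Last_IO_Errno", "?")
    message = status.get("Last_IO_Error", "")
    return f"{errno}: {message}"


sample = """\
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Errno: 1595
Last_IO_Error: Relay log write failure: could not queue event from master
"""
print(io_thread_problem(parse_slave_status(sample)))
```

For the output above, this reports the I/O thread as stopped with error 1595 while the SQL thread is still running, which is exactly the combination to look for here.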
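Everything I found online attributed this error to a full disk, but disk space and permissions were fine on my side, so I went on to check the MGC node's log: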
2019-01-28 10:47:26 11 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001411' at position 512281317, relay log './mysql-relay-bin.000003' position: 151473377
2019-01-28 10:47:26 10 [Note] Slave I/O thread: connected to master 'rpl@x.x.x.x:3306',replication started in log 'mysql-bin.001413' at position 412624945
2019-01-28 10:47:26 10 [Warning] Slave I/O: Notifying master by SET @master_binlog_checksum= @@global.binlog_checksum failed with error: Unknown system variable 'binlog_checksum', Internal MariaDB error code: 1193
2019-01-28 10:47:26 10 [ERROR] Slave I/O: Replication event checksum verification failed while reading from network, Internal MariaDB error code: 1743
2019-01-28 10:47:26 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
2019-01-28 10:47:26 10 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.001413', position 412624945
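The order of the warning and error codes in this log is what points to the root cause, so it can be useful to extract them in sequence. Below is a rough sketch that assumes the log format shown above (a bracketed level, then a message optionally ending in `Internal MariaDB error code: N`); the helper name `error_codes` is hypothetical, and other MariaDB versions may format log lines differently.

```python
import re

# Matches a bracketed level, the message, and an optional trailing error code.
LOG_LINE = re.compile(
    r"\[(?P<level>Note|Warning|ERROR)\] (?P<message>.*?)"
    r"(?:, Internal MariaDB error code: (?P<code>\d+))?$"
)


def error_codes(log_text: str) -> list:
    """Return (level, code) pairs for Warning/ERROR lines, in log order."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None or match.group("level") == "Note":
            continue
        code = match.group("code")
        found.append((match.group("level"), int(code) if code else None))
    return found


sample_log = """\
2019-01-28 10:47:26 10 [Warning] Slave I/O: Notifying master by SET @master_binlog_checksum= @@global.binlog_checksum failed with error: Unknown system variable 'binlog_checksum', Internal MariaDB error code: 1193
2019-01-28 10:47:26 10 [ERROR] Slave I/O: Replication event checksum verification failed while reading from network, Internal MariaDB error code: 1743
2019-01-28 10:47:26 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
2019-01-28 10:47:26 10 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.001413', position 412624945
"""
print(error_codes(sample_log))
```

Read in order, the 1193 warning (the master does not know `binlog_checksum`) comes before the 1743 checksum failure, and the 1595 relay log error that SHOW SLAVE STATUS reports is only the last link in that chain.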
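The key clue is in the log: the slave's attempt to negotiate binlog_checksum with the master failed, and a checksum verification error followed, so the failure is evidently checksum-related. From what I found, MariaDB enables slave_sql_verify_checksum by default, whereas MySQL 5.5 predates binlog checksums altogether (which is why the master answers with Unknown system variable 'binlog_checksum'). With that version mismatch between master and slave, checksum verification fails.
The fix is simple: add slave_sql_verify_checksum=0 to the [mysqld] section of the configuration on every MGC node to turn the feature off. I have not yet looked into whether this has any negative side effects.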
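Because the setting has to be present on every node, and a misspelled section name is silently ignored by the server, a quick check of each node's config text can save a restart cycle. This is only a sketch: `checksum_verification_disabled` is a hypothetical helper, it looks only at the `[mysqld]` section (the server also reads other groups, which are not covered here), and it assumes values carry no trailing inline comments.

```python
import configparser


def checksum_verification_disabled(cnf_text: str) -> bool:
    """Return True if [mysqld] sets slave_sql_verify_checksum to an off value."""
    # Drop !include / !includedir directives, which configparser cannot parse.
    lines = [ln for ln in cnf_text.splitlines() if not ln.lstrip().startswith("!")]
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare options such as "skip-name-resolve"
        strict=False,         # tolerate repeated sections and options
        interpolation=None,   # treat "%" literally
    )
    parser.read_string("\n".join(lines))
    if not parser.has_section("mysqld"):
        return False
    for key, value in parser.items("mysqld"):
        # Option names may be written with dashes or underscores.
        if key.replace("-", "_") == "slave_sql_verify_checksum":
            return (value or "").strip().lower() in {"0", "off", "false"}
    return False


good = "[mysqld]\nslave_sql_verify_checksum=0\n"
typo = "[msyqld]\nslave_sql_verify_checksum=0\n"
print(checksum_verification_disabled(good), checksum_verification_disabled(typo))
```

The second sample shows the failure mode worth guarding against: the option is there, but under a section name the server never reads.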