从关系型的数据库(Mysql)升级到列式管理的联机分析型数据库(ClickHouse),这不亚于是小米加步枪升级为加特林机关枪的性能提升了,查询能力等确实是大大的提升了,这出现了一个问题我们之前存储在Mysql里的历史数据怎么往ClickHouse里面迁移呢,总不能不要吧那对业务来说他们可是不关心你底层数据怎么存储的,他们只希望数据准确,齐全,那下面我们就来说说都有哪些方式可以实现补全缺失的历史数据
环境:ClickHouse:21.8.12.1,Python:3.6.8

1、Mysql Engine

ClickHouse本身支持通过选择Engine 来远程连接Mysql来访问Mysql的数据

1.1 语法

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster](name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],...) ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password'[, replace_query, 'on_duplicate_clause'])SETTINGS[ connection_pool_size=16, ][ connection_max_tries=3, ][ connection_wait_timeout=5, ][ connection_auto_close=true, ][ connect_timeout=10, ][ read_write_timeout=300 ];

1.2 参数详解

host:port— MySQL 服务器地址。database— 远程数据库名称。table— 远程表名。user— MySQL 用户。password- 用户密码。replace_query— 将INSERT INTO查询转换为REPLACE INTO. 如果replace_query=1,则查询被替换。on_duplicate_clause—ON DUPLICATE KEY on_duplicate_clause添加到INSERT查询的表达式。示例:INSERT INTO t (c1,c2) VALUES ('a', 2) ON DUPLICATE KEY UPDATE c2 = c2 + 1,哪里on_duplicate_clause是UPDATE c2 = c2 + 1。请参阅MySQL 文档以查找on_duplicate_clause可以与该ON DUPLICATE KEY子句一起使用的内容。要指定on_duplicate_clause您需要传递0给replace_query参数。如果同时传递replace_query = 1和on_duplicate_clause,ClickHouse 会生成异常。WHERE诸如此类的简单子句=, !=, >, >=, <, <=在 MySQL 服务器上执行。

1.3 示列

1.3.1 创建Mysql Engine 表

CREATE TABLE db_adp.mysql_table(`float_nullable` Nullable(Float32),`int_id` Int32)ENGINE = MySQL('localhost:3306', 'db_adp', 'mysql_table', 'reader', '123456')

1.3.2 写入历史数据到结果表中

INSERT INTO db_adp.sink_table SELECT `float_nullable`,`int_id`,FROMdb_adp.mysql_table

1.4 补充

更多有关于Mysql Engine 可参考官网:MySQL | ClickHouse Docs

2、Mysql Function

Mysql Function 的功能跟Engine 选择 Mysql 不太一样,ClickHouse 把桥接的方式封装成了一个Function方式,不需要再单独建立一张外表就能像使用Sum(),Max() 等函数一样直接使用

2.1、语法

mysql('host:port', 'database', 'table', 'user', 'password'[, replace_query, 'on_duplicate_clause'])

2.2、 参数详解

host:port— MySQL 服务器地址。database— 远程数据库名称。table— 远程表名。user— MySQL 用户。password- 用户密码。replace_query— 将INSERT INTO查询转换为REPLACE INTO. 可能的值:0- 查询执行为INSERT INTO.1- 查询执行为REPLACE INTO.on_duplicate_clause—ON DUPLICATE KEY on_duplicate_clause添加到INSERT查询的表达式。只能指定 with replace_query = 0(如果同时传递replace_query = 1and on_duplicate_clause,ClickHouse 会产生异常)。例子:INSERT INTO t (c1,c2) VALUES ('a', 2) ON DUPLICATE KEY UPDATE c2 = c2 + 1;on_duplicate_clause这是UPDATE c2 = c2 + 1。请参阅 MySQL 文档以查找on_duplicate_clause可以与该ON DUPLICATE KEY子句一起使用的内容。当前在 MySQL 服务器上执行的简单WHERE子句,例如。=, !=, >, >=, <, <=其余条件和LIMIT采样约束只有在对 MySQL 的查询完成后才会在 ClickHouse 中执行。支持必须由 列出的多个副本|。例如:SELECT name FROM mysql(`mysql{1|2|3}:3306`, 'mysql_database', 'mysql_table', 'user', 'password');或者SELECT name FROM mysql(`mysql1:3306|mysql2:3306|mysql3:3306`, 'mysql_database', 'mysql_table', 'user', 'password');

2.3 示例

2.3.1 写入结果表中

INSERT INTO db_adp.sink_table SELECT `float_nullable`,`int_id`,FROMmysql('localhost:3306', 'db_adp', 'mysql_table', 'reader', '123456')

2.4、补充

更多有关于Mysql Function 可参考官网:mysql | ClickHouse Docs

3、Altinity/clickhouse-mysql-data-reader

该工具是Altinity 开源的一个可以通过增量/全量将Mysql 的数据导入到ClickHouse 中的实用python工具,下面我们来实践一下(python 版本至少3.4+)

3.1 PyPi安装

3.1.1 Mysql 存储库

sudo yum install -y https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
sudo yum install -y epel-release

3.1.2 ClickHouse-Client

curl -s https://packagecloud.io/install/repositories/altinity/clickhouse/script.rpm.sh | sudo bash
sudo yum install -y clickhouse-client

3.1.3 依赖包

sudo yum install -y mysql-community-devel
sudo yum install -y mariadb-devel
sudo yum install -y gcc
sudo yum install -y python34-devel python34-pip

3.1.4初始化Data Reader

sudo pip3 install clickhouse-mysql

3.1.5检验是否安装成功

[user@localhost ~]$ which clickhouse-mysql/usr/bin/clickhouse-mysql
/usr/bin/clickhouse-mysql --install

3.1.6 设置Mysql 账号权限

CREATE USER 'reader'@'%' IDENTIFIED BY '123456';CREATE USER 'reader'@'127.0.0.1' IDENTIFIED BY '123456';CREATE USER 'reader'@'localhost' IDENTIFIED BY '123456';GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'%';GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'127.0.0.1';GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER ON *.* TO 'reader'@'localhost';FLUSH PRIVILEGES;
[mysqld]# mandatoryserver-id= 1log_bin= /var/lib/mysql/bin.logbinlog-format= row # very important if you want to receive write, update and delete row events# optionalexpire_logs_days = 30max_binlog_size= 768M# setup listen addressbind-address = 0.0.0.0

3.2 建对应的ClickHouse表

3.2.1 查看Mysql 表结构

show create table db_adp_rt.dws_game_product_online_h;

CREATE TABLE `dws_game_product_online_h` (`statdate` bigint(11) NOT NULL COMMENT '统计时间,精确到小时,2020081316',`game_id_jf` varchar(32) COLLATE utf8mb4_bin NOT NULL COMMENT '经分游戏ID',`game_id_o2` int(11) NOT NULL COMMENT 'O2游戏ID',`data_type` int(11) NOT NULL COMMENT '统计口径字段 1 - 染色,2 - 渠道,3 - 染色且渠道',`game_name` varchar(64) COLLATE utf8mb4_bin NOT NULL COMMENT '游戏名称',`agent_id` int(11) NOT NULL COMMENT '代投方ID, 0 - 未知',`agent_name` varchar(32) COLLATE utf8mb4_bin NOT NULL COMMENT '代投方名称',`media_type` int(11) NOT NULL DEFAULT '9' COMMENT '媒体类型',`media_resource_type` int(11) NOT NULL DEFAULT '20' COMMENT '媒体内容类型',`med_platform_id` int(11) NOT NULL DEFAULT '20' COMMENT 'o2的媒体类型;--通过media_id关联',`media_id` int(10) NOT NULL COMMENT '媒体ID',`media_name` varchar(64) COLLATE utf8mb4_bin NOT NULL DEFAULT '' COMMENT '媒体名称',`plat_id` int(11) NOT NULL COMMENT '平台: 0 - IOS, 1 - 安卓, 3 - PC, 4 - 网页, 255 - 全平台, -1 - 未知',`plat_name` varchar(32) COLLATE utf8mb4_bin NOT NULL COMMENT '平台名称',`impression_num` bigint(15) DEFAULT '0' COMMENT '曝光PV',`click_num` bigint(15) DEFAULT '0' COMMENT '点击PV',`api_impression_num` bigint(15) DEFAULT '0' COMMENT 'api拉取的曝光PV',`api_click_num` bigint(15) DEFAULT '0' COMMENT 'api拉取的点击PV',`activation_num` bigint(15) DEFAULT '0' COMMENT '激活数PV',`activation_user_num` bigint(15) DEFAULT '0' COMMENT '激活用户数UV,按照deviceid_md5去重',`register_num` bigint(15) DEFAULT '0' COMMENT '新进PV',`register_user_num` bigint(15) DEFAULT '0' COMMENT '新进UV(按照openid去重)',`role_creation_num` bigint(15) DEFAULT '0' COMMENT '创角PV',`role_creation_user_num` bigint(15) DEFAULT '0' COMMENT '创角UV(按照openid去重)',`backflow_30day_num` bigint(15) DEFAULT '0' COMMENT '月回流PV',`backflow_30day_user_num` bigint(15) DEFAULT '0' COMMENT '月回流UV',`register_pay_num_today` bigint(15) DEFAULT '0' COMMENT '今日新进用户付费PV',`register_pay_user_num_today` bigint(15) DEFAULT '0' COMMENT '今日新进用户付费UV',`register_pay_amount_today` double DEFAULT '0' COMMENT '今日新进用户付费金额,单位:分',`register_pay_num_30day` bigint(15) DEFAULT '0' COMMENT '过去30天的新进用户在当天的付费PV',`register_pay_user_num_30day` bigint(15) DEFAULT '0' COMMENT '过去30天的新进用户在当天的付费UV(按照openid去重)',`register_pay_amount_30day` double DEFAULT '0' COMMENT '过去30天的新进用户在当天的付费金额,单位:分',`backflow_pay_num_today` bigint(15) DEFAULT '0' COMMENT '今日月回流用户付费PV',`backflow_pay_user_num_today` bigint(15) DEFAULT '0' COMMENT '今日月回流用户付费UV(按照openid去重)',`backflow_pay_amount_today` double DEFAULT '0' COMMENT '今日月回流用户付费金额,单位:分',`backflow_pay_num_30day` bigint(15) DEFAULT '0' COMMENT '过去30天的月回流用户在当天的付费PV',`backflow_pay_user_num_30day` bigint(15) DEFAULT '0' COMMENT '过去30天的月回流用户在当天的付费UV(按照openid去重)',`backflow_pay_amount_30day` double DEFAULT '0' COMMENT '过去30天的月回流用户在当天的付费金额,单位:分',`site_set_id` bigint(15) DEFAULT NULL COMMENT '广告版位id,腾讯广告中的site_id,头条广告中的CSITE',`billing_type` int(10) DEFAULT NULL COMMENT '出价结算类型:1:CPD,2:CPM,3:CPC,4:CPA',`update_time` varchar(100) COLLATE utf8mb4_bin DEFAULT NULL COMMENT '更新时间',UNIQUE KEY `statis_hour` (`statdate`,`game_id_o2`,`data_type`,`media_id`,`plat_id`,`site_set_id`,`billing_type`),KEY `idx_gameid_hour` (`game_id_o2`,`statdate`)) ENGINE=TokuDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin COMMENT='实时归因三口径迭代_产品(染渠+染色+渠道)'

3.2.2 转换为ClickHouse 表

MySQLClickHouse
UNSIGNED TINYINTUInt8
TINYINTInt8
UNSIGNED SMALLINTUInt16
SMALLINTInt16
UNSIGNED INT, UNSIGNED MEDIUMINTUInt32
INT, MEDIUMINTInt32
UNSIGNED BIGINTUInt64
BIGINTInt64
FLOATFloat32
DOUBLEFloat64
DATEDate
DATETIME, TIMESTAMPDateTime
BINARYFixedString

根据上面的数据类型映射关系进行转换

CREATE TABLE db_adp.dws_game_product_online_h(`statdate` Int32 COMMENT '统计时间,精确到小时,2020081316',`game_id_jf` String COMMENT '经分游戏ID',`game_id_o2` Int32 COMMENT 'O2游戏ID',`data_type` UInt8 COMMENT '统计口径字段 0 - 统一,1 - 染色,2 - 渠道,3 - 染色且渠道',`game_name` String COMMENT '游戏名称',`agent_id` Int32 COMMENT '代投方ID, 0 - 未知',`agent_name` String COMMENT '代投方名称',`media_type` UInt16 DEFAULT 9 COMMENT '媒体类型',`med_platform_id` UInt16 DEFAULT 20 COMMENT 'o2的媒体类型;--通过media_id关联',`media_id` Int32 COMMENT '媒体ID',`media_name` String COMMENT '媒体名称',`plat_id` UInt8 COMMENT '平台: 0 - IOS, 1 - 安卓, 3 - PC, 4 - 网页, 255 - 全平台, -1 - 未知',`plat_name` String COMMENT '平台名称',`cost` Float64 DEFAULT 0. COMMENT 'RTB 消耗单位:分',`impression_num` UInt64 DEFAULT 0 COMMENT '曝光PV',`click_num` UInt64 DEFAULT 0 COMMENT '点击PV',`activation_num` UInt32 DEFAULT 0 COMMENT '激活数PV',`activation_user_num` UInt32 DEFAULT 0 COMMENT '激活用户数UV,按照deviceid_md5去重',`register_num` UInt32 DEFAULT 0 COMMENT '新进PV',`register_user_num` UInt32 DEFAULT 0 COMMENT '新进UV(按照openid去重)',`role_creation_num` UInt32 DEFAULT 0 COMMENT '创角PV',`role_creation_user_num` UInt32 DEFAULT 0 COMMENT '创角UV(按照openid去重)',`backflow_30day_num` UInt32 DEFAULT 0 COMMENT '月回流PV',`backflow_30day_user_num` UInt32 DEFAULT 0 COMMENT '月回流UV',`register_pay_num_today` UInt32 DEFAULT 0 COMMENT '今日新进用户付费PV',`register_pay_user_num_today` UInt32 DEFAULT 0 COMMENT '今日新进用户付费UV',`register_pay_amount_today` Float64 DEFAULT 0. COMMENT '今日新进用户付费金额,单位:分',`register_pay_num_30day` UInt32 DEFAULT 0 COMMENT '过去30天的新进用户在当天的付费PV',`register_pay_user_num_30day` UInt32 DEFAULT 0 COMMENT '过去30天的新进用户在当天的付费UV(按照openid去重)',`register_pay_amount_30day` Float64 DEFAULT 0. COMMENT '过去30天的新进用户在当天的付费金额,单位:分',`backflow_pay_num_today` UInt32 DEFAULT 0 COMMENT '今日月回流用户付费PV',`backflow_pay_user_num_today` UInt32 DEFAULT 0 COMMENT '今日月回流用户付费UV(按照openid去重)',`backflow_pay_amount_today` Float64 DEFAULT 0. COMMENT '今日月回流用户付费金额,单位:分',`backflow_pay_num_30day` UInt32 DEFAULT 0 COMMENT '过去30天的月回流用户在当天的付费PV',`backflow_pay_user_num_30day` UInt32 DEFAULT 0 COMMENT '过去30天的月回流用户在当天的付费UV(按照openid去重)',`backflow_pay_amount_30day` Float64 DEFAULT 0. COMMENT '过去30天的月回流用户在当天的付费金额,单位:分',`update_time` DateTime COMMENT '更新时间')ENGINE = TinyLog

3.3 数据导入

clickhouse-mysql \ --src-host=127.0.0.1 \ --src-user=reader \ --src-password=123456 \ --src-table=db_adp.dws_game_product_online_h --table-migrate \ --dst-host=127.0.0.1 \ --dst-table=db_adp.dws_game_product_online_h \ --csvpool

3.3.1 更多的参数可参考下面的解释

usage: clickhouse-mysql [-h] [--config-file CONFIG_FILE] [--log-file LOG_FILE][--log-level LOG_LEVEL] [--nice-pause NICE_PAUSE][--dry] [--daemon] [--pid-file PID_FILE][--binlog-position-file BINLOG_POSITION_FILE][--mempool][--mempool-max-events-num MEMPOOL_MAX_EVENTS_NUM][--mempool-max-rows-num MEMPOOL_MAX_ROWS_NUM][--mempool-max-flush-interval MEMPOOL_MAX_FLUSH_INTERVAL][--csvpool][--csvpool-file-path-prefix CSVPOOL_FILE_PATH_PREFIX][--csvpool-keep-files] [--create-table-sql-template][--create-table-sql] [--with-create-database][--create-table-json-template] [--migrate-table][--pump-data] [--install][--src-server-id SRC_SERVER_ID] [--src-host SRC_HOST][--src-port SRC_PORT] [--src-user SRC_USER][--src-password SRC_PASSWORD][--src-schemas SRC_SCHEMAS] [--src-tables SRC_TABLES][--src-tables-where-clauses SRC_TABLES_WHERE_CLAUSES][--src-tables-prefixes SRC_TABLES_PREFIXES][--src-wait] [--src-resume][--src-binlog-file SRC_BINLOG_FILE][--src-binlog-position SRC_BINLOG_POSITION][--src-file SRC_FILE] [--dst-file DST_FILE][--dst-host DST_HOST] [--dst-port DST_PORT][--dst-user DST_USER] [--dst-password DST_PASSWORD][--dst-schema DST_SCHEMA] [--dst-distribute][--dst-cluster DST_CLUSTER] [--dst-table DST_TABLE][--dst-table-prefix DST_TABLE_PREFIX][--dst-create-table][--column-default-value [COLUMN_DEFAULT_VALUE [COLUMN_DEFAULT_VALUE ...]]][--column-skip [COLUMN_SKIP [COLUMN_SKIP ...]]][--ch-converter-file CH_CONVERTER_FILE][--ch-converter-class CH_CONVERTER_CLASS]ClickHouse data readeroptional arguments:-h, --helpshow this help message and exit--config-file CONFIG_FILEPath to config file. Default - not specified--log-file LOG_FILE Path to log file. Default - not specified--log-level LOG_LEVELLog Level. Default - NOTSET--nice-pause NICE_PAUSEMake specified (in sec) pause between attempts to readbinlog stream--dry Dry mode - do not do anything that can harm. Usefulfor debugging.--daemonDaemon mode - go to background.--pid-file PID_FILE Pid file to be used by the app in daemon mode--binlog-position-file BINLOG_POSITION_FILEFile to write binlog position to during bin logreading and to read position from on start--mempool Cache data in mem.--mempool-max-events-num MEMPOOL_MAX_EVENTS_NUMMax events number to pool - triggering pool flush--mempool-max-rows-num MEMPOOL_MAX_ROWS_NUMMax rows number to pool - triggering pool flush--mempool-max-flush-interval MEMPOOL_MAX_FLUSH_INTERVALMax seconds number between pool flushes--csvpool Cache data in CSV pool files on disk. Requires memorypooling, thus enables --mempool even if it is notexplicitly specified--csvpool-file-path-prefix CSVPOOL_FILE_PATH_PREFIXFile path prefix to CSV pool files--csvpool-keep-filesKeep CSV pool files. Useful for debugging--create-table-sql-templatePrepare CREATE TABLE SQL template(s).--create-table-sqlPrepare CREATE TABLE SQL statement(s).--with-create-databasePrepend each CREATE TABLE SQL statement(s) with CREATEDATABASE statement--create-table-json-templatePrepare CREATE TABLE template(s) as JSON. Useful forIPC--migrate-table Migrate table(s). Copy existing data from MySQLtable(s) with SELECT statement. Binlog is not readduring this procedure - just copy data from the srctable(s). IMPORTANT!. Target table has to be createdin ClickHouse or it has to be created with --dst-create-table and possibly with --with-create-databaseoptions. See --create-table-sql-template and --create-table-sql options for additional info.--pump-data Pump data from MySQL binlog into ClickHouse. Copy rowsfrom binlog until the end of binlog reached. When endof binlog reached, process ends. Use in combinationwith --src-wait in case would like to continue andwait for new rows after end of binlog reached--install Install service file(s)--src-server-id SRC_SERVER_IDSet server_id to be used when reading date from MySQLsrc. Ex.: 1--src-host SRC_HOST Host to be used when reading from src. Ex.: 127.0.0.1--src-port SRC_PORT Port to be used when reading from src. Ex.: 3306--src-user SRC_USER Username to be used when reading from src. Ex.: root--src-password SRC_PASSWORDPassword to be used when reading from src. Ex.: qwerty--src-schemas SRC_SCHEMASComma-separated list of databases (a.k.a schemas) tobe used when reading from src. Ex.: db1,db2,db3--src-tables SRC_TABLESComma-separated list of tables to be used when readingfrom src. Ex.: table1,table2,table3Ex.:db1.table1,db2.table2,db3.table3Ex.:table1,db2.table2,table3--src-tables-where-clauses SRC_TABLES_WHERE_CLAUSESComma-separated list of WHERE clauses for tables to bemigrated. Ex.: db1.t1="a=1 and b=2",db2.t2="c=3 andk=4". Accepts both (comma-separated) clause (usefulfor short clauses) or file where clause is located(useful for long clauses)--src-tables-prefixes SRC_TABLES_PREFIXESComma-separated list of table prefixes to be used whenreading from src.Useful when we need to processunknown-in-advance tables, say day-named log tables,as log_2017_12_27Ex.: mylog_,anotherlog_,extralog_3--src-waitWait indefinitely for new records to come.--src-resumeResume reading from previous position. Previousposition is read from `binlog-position-file`--src-binlog-file SRC_BINLOG_FILEBinlog file to be used to read from src. Related to`binlog-position-file`. Ex.: mysql-bin.000024--src-binlog-position SRC_BINLOG_POSITIONBinlog position to be used when reading from src.Related to `binlog-position-file`. Ex.: 5703--src-file SRC_FILE Source file to read data from. CSV--dst-file DST_FILE Target file to be used when writing data. CSV--dst-host DST_HOST Host to be used when writing to dst. Ex.: 127.0.0.1--dst-port DST_PORT Port to be used when writing to dst. Ex.: 9000--dst-user DST_USER Username to be used when writing to dst. Ex: default--dst-password DST_PASSWORDPassword to be used when writing to dst. Ex.: qwerty--dst-schema DST_SCHEMADatabase (a.k.a schema) to be used to create tables inClickHouse. It overwrites source database(s) name(s),so tables in ClickHouse would be located indifferently named db than in MySQL. Ex.: db1--dst-distributeWhether to add distribute table--dst-cluster DST_CLUSTERCluster to be used when writing to dst. Ex.: cluster1--dst-table DST_TABLETable to be used when writing to dst. Ex.: table1--dst-table-prefix DST_TABLE_PREFIXPrefix to be used when creating dst table. Ex.:copy_table_--dst-create-tablePrepare and run CREATE TABLE SQL statement(s).--column-default-value [COLUMN_DEFAULT_VALUE [COLUMN_DEFAULT_VALUE ...]]Set of key=value pairs for columns default values.Ex.: date_1=2000-01-01 timestamp_1=2002-01-01\01:02:03--column-skip [COLUMN_SKIP [COLUMN_SKIP ...]]Set of column names to skip. Ex.: column1 column2--ch-converter-file CH_CONVERTER_FILEFilename where to search for CH converter class--ch-converter-class CH_CONVERTER_CLASSConverter class name in --ch-converter-file file

3.4、补充

官网上说使用pypy 性能上能提升10倍,大家也可以尝试下:Performance

4、CSV及TSV方式

像传统的将Source数据先生成一个CSV/TSV文件,再将数据文件导入到结果表中的方式,这里不过多赘叙,大家可参看一面的链接

Upload a CSV File
Insert Local Files

5、对比

操作难易数据大小(GB)时长(s)

支持增量

Mysql Engine比较易3.6457.261
Mysql Function3.6439.634
Altinity首次需安装,以后使用易3.6424.2