Using DataX
2025-08-04 · Updated 2026-03-03 · ~3 min · 567 words
Data Components

Table of Contents

Usage
Scenario 1: syncing a large table with no integer column
Problems with the clickhousereader plugin
Error
Cause
Fix
Test run

Usage

The basic workflow is:

  1. Write a job configuration file
  2. Launch the job

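The two steps above can be sketched in Python: build the job JSON programmatically, write it to disk, and hand it to the DataX launcher. The plugin parameters here are empty placeholders, not a runnable job config:

```python
import json

# Step 1: write the job configuration file.
# Reader/writer parameters are placeholders -- fill in a real plugin config.
job = {
    "job": {
        "setting": {"speed": {"channel": 1}},
        "content": [{
            "reader": {"name": "clickhousereader", "parameter": {}},
            "writer": {"name": "mysqlwriter", "parameter": {}},
        }],
    }
}
with open("job.json", "w") as f:
    json.dump(job, f, indent=2, ensure_ascii=False)

# Step 2: launch the job with the DataX launcher script, e.g.:
#   python $DATAX_HOME/bin/datax.py job.json
```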
Notes:

  1. For the reader, set splitPk to an integer column (only integer types are supported) whose values are evenly distributed; combined with the channel setting, this speeds up reads
  2. If splitPk is a non-integer column, the job syncs single-threaded even when multiple channels are configured
  3. DataX's Table Mode and QuerySQL Mode are mutually exclusive: once querySql is used, splitPk, table, column, where and similar parameters are ignored entirely
  4. For large tables with no integer column to split on: option 1 is to partition the data manually and run several DataX jobs in parallel; option 2 is to add an integer column to the table so that splitPk can take effect
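Why an integer, evenly distributed splitPk matters can be seen from a simplified sketch of the splitting idea (an illustration, not DataX's actual algorithm): the job cuts [min(splitPk), max(splitPk)] into contiguous ranges and gives each channel its own WHERE slice.

```python
def split_ranges(min_pk: int, max_pk: int, channels: int):
    """Cut the closed interval [min_pk, max_pk] into `channels`
    contiguous sub-ranges (simplified view of splitPk splitting)."""
    total = max_pk - min_pk + 1
    step, rem = divmod(total, channels)
    ranges, lo = [], min_pk
    for i in range(channels):
        hi = lo + step + (1 if i < rem else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each range becomes one reader task, e.g.:
#   SELECT ... WHERE split_pk BETWEEN lo AND hi
for lo, hi in split_ranges(0, 999, 4):
    print(f"WHERE split_pk BETWEEN {lo} AND {hi}")
```

If the column's values cluster in a few ranges (skewed data), some channels receive most of the rows and the parallelism is wasted, which is why an evenly distributed column is recommended.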

Scenario 1: syncing a large table with no integer column

Syncing ClickHouse to MySQL:

  1. Source table

```sql
CREATE TABLE test.measurements
(
    `city` String,
    `temperature` Float64
)
ENGINE = MergeTree
ORDER BY city
SETTINGS index_granularity = 8192;

-- 1 billion rows, 3.72 GB on disk
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS total_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE database = 'test' AND table = 'measurements' AND active
GROUP BY table;

SELECT count(1) FROM measurements;
```
  2. Add an evenly distributed integer column to the source table

The measurements table has no integer column, so add one: a hash of the city column.

```sql
-- Add a materialized column
ALTER TABLE test.measurements
    ADD COLUMN `city_hash` UInt64 MATERIALIZED cityHash64(city);

-- A ClickHouse materialized column is computed automatically on write, but by
-- default only for newly inserted rows. To backfill all historical data, run:
ALTER TABLE test.measurements MATERIALIZE COLUMN city_hash;

-- Table size after adding the column:
-- 1 billion rows, 3.75 GB, i.e. about 30 MB of extra storage
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS total_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE database = 'test' AND table = 'measurements' AND active
GROUP BY table;

-- Since city_hash will be used to split the data, add an index on it
ALTER TABLE test.measurements
    ADD INDEX idx_city_hash city_hash TYPE minmax GRANULARITY 1;
ALTER TABLE test.measurements MATERIALIZE INDEX idx_city_hash;

-- Check whether the mutation has finished (is_done = 1 means done)
SELECT
    database, table, mutation_id, command, create_time,
    is_done, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE table = 'measurements' AND database = 'test'
ORDER BY create_time DESC
LIMIT 10;

-- Table size after adding the index:
-- 1 billion rows, 3.76 GiB, i.e. about 110 MB of extra storage
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS total_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE database = 'test' AND table = 'measurements' AND active
GROUP BY table;
```
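The even-spread property the hash column relies on can be sanity-checked outside ClickHouse. The sketch below uses Python's blake2b as a stand-in 64-bit hash (its values differ from cityHash64, but the distribution behavior this technique depends on is the same):

```python
import hashlib

def hash64(s: str) -> int:
    """Deterministic 64-bit hash of a string (stand-in for cityHash64)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

# Map 10,000 distinct city names onto 8 channel-sized hash ranges.
cities = [f"city_{i}" for i in range(10_000)]
buckets = [0] * 8
for c in cities:
    buckets[hash64(c) * 8 // 2**64] += 1

print(buckets)  # each bucket lands close to 10000 / 8 = 1250
```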
  3. Configure the integer column as splitPk in DataX

    Channel sizing: the node has 9 GB RAM and 4 cores, and the default DataX JVM max heap is 1 GB.

    The channel count is usually recommended to be 1-2x the number of CPU cores; here it is set to 8.

    Each channel's buffer defaults to about 8 MB, so total buffer memory ≈ 8 × 8 MB = 64 MB.
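This sizing is simple arithmetic and easy to double-check (the 8 MB per-channel buffer is the commonly cited DataX default; the exact value is governed by the core configuration):

```python
cpu_cores = 4
jvm_heap_mb = 1024                 # default DataX max heap (-Xmx1g)

channels = 2 * cpu_cores           # rule of thumb: 1-2x the core count
buffer_mb = channels * 8           # ~8 MB buffer per channel (assumed default)

print(channels, buffer_mb)         # 8 channels, 64 MB of channel buffers
assert buffer_mb < jvm_heap_mb     # comfortably within the 1 GB heap
```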

    With channel set to 8 but no splitPk specified, the job still runs single-threaded:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 8 },
      "errorLimit": { "record": 0, "percentage": 0.02 }
    },
    "content": [
      {
        "reader": {
          "name": "clickhousereader",
          "parameter": {
            "username": "default",
            "password": "default",
            "column": ["city", "temperature"],
            "connection": [
              {
                "table": ["measurements"],
                "jdbcUrl": ["jdbc:clickhouse://hadoop104:8123/test"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "writeMode": "insert",
            "username": "root",
            "password": "aaaaaa",
            "column": ["city", "temperature"],
            "session": ["set session sql_mode='ANSI'"],
            "preSql": [""],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf8&serverTimezone=Asia/Shanghai",
                "table": ["measurements"]
              }
            ]
          }
        }
      }
    ]
  }
}
```

    The result confirms a single-threaded job: syncing 1 billion rows from ClickHouse to MySQL took 41 minutes, with the data consistent on both sides.

```bash
2025-08-05 11:41:48.009 [job-0] INFO JobContainer -
任务启动时刻 : 2025-08-05 11:00:47
任务结束时刻 : 2025-08-05 11:41:48
任务总计耗时 : 2460s
任务平均流量 : 4.93MB/s
记录写入速度 : 406504rec/s
读出记录总数 : 1000000000
读写失败总数 : 0
```
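The logged figures are self-consistent, a quick sanity check worth running on any job summary:

```python
rows = 1_000_000_000     # 读出记录总数 (total records read)
seconds = 2460           # 任务总计耗时 (total elapsed time)

print(rows // seconds)   # 406504 rec/s -- matches the logged write speed
```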

    With channel set to 8 and splitPk specified, the job runs multi-threaded:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 8 },
      "errorLimit": { "record": 0, "percentage": 0.02 }
    },
    "content": [
      {
        "reader": {
          "name": "clickhousereader",
          "parameter": {
            "username": "default",
            "password": "default",
            "column": ["city", "temperature"],
            "connection": [
              {
                "table": ["measurements_backup"],
                "jdbcUrl": ["jdbc:clickhouse://hadoop104:8123/test"]
              }
            ],
            "splitPk": "city_hash"
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "writeMode": "insert",
            "username": "root",
            "password": "aaaaaa",
            "column": ["city", "temperature"],
            "session": ["set session sql_mode='ANSI'"],
            "preSql": [""],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf8&serverTimezone=Asia/Shanghai",
                "table": ["measurements"]
              }
            ]
          }
        }
      }
    ]
  }
}
```

Problems with the clickhousereader plugin

Error


```bash
java.lang.ExceptionInInitializerError: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_212]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_212]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_212]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_212]
	at java.lang.Class.newInstance(Class.java:442) ~[na:1.8.0_212]
	at com.alibaba.datax.core.util.container.LoadUtil.loadJobPlugin(LoadUtil.java:98) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.job.JobContainer.initJobReader(JobContainer.java:661) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.job.JobContainer.init(JobContainer.java:303) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:113) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.Engine.start(Engine.java:86) [datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.Engine.entry(Engine.java:168) [datax-core-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.core.Engine.main(Engine.java:201) [datax-core-0.0.1-SNAPSHOT.jar:na]
Caused by: java.util.MissingResourceException: Can't find bundle for base name com.alibaba.datax.plugin.reader.clickhousereader.LocalStrings, locale zh_CN
	at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581) ~[na:1.8.0_212]
	at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1396) ~[na:1.8.0_212]
	at java.util.ResourceBundle.getBundle(ResourceBundle.java:1091) ~[na:1.8.0_212]
	at com.alibaba.datax.common.util.MessageSource.loadResourceBundle(MessageSource.java:92) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.common.util.MessageSource.loadResourceBundle(MessageSource.java:37) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.common.util.MessageSource.loadResourceBundle(MessageSource.java:46) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
	at com.alibaba.datax.plugin.reader.clickhousereader.ClickhouseReader$Job.<clinit>(ClickhouseReader.java:30) ~[clickhousereader-0.0.1-SNAPSHOT.jar:na]
	... 12 common frames omitted
2025-08-04 17:50:42.629 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2025-08-04 17:50:42.630 [job-0] ERROR Engine -
```

Cause

The clickhousereader plugin jar was packaged without the LocalStrings resource bundle for locale zh_CN, so the plugin fails to load.

Fix

Download the DataX source code and rebuild the clickhousereader module.


Because clickhousereader depends on other DataX modules such as datax-core and datax-transformer, first run mvn install to put those dependencies into the local repository; only then can clickhousereader be packaged successfully.


Upload the rebuilt jar to the server, replacing the old clickhousereader jar.

clickhousereader-0.0.1-SNAPSHOT.jar.back is a backup of the old jar; clickhousereader-0.0.1-SNAPSHOT.jar is the newly built one.

```bash
[atguigu@hadoop102 ~]$ cd /home/atguigu/test/datax/plugin/reader/clickhousereader/
[atguigu@hadoop102 clickhousereader]$ pwd
/home/atguigu/test/datax/plugin/reader/clickhousereader
[atguigu@hadoop102 clickhousereader]$ ll
total 40
-rw-rw-r--. 1 atguigu atguigu 18285 Aug  4 17:40 clickhousereader-0.0.1-SNAPSHOT.jar
-rw-r--r--. 1 atguigu atguigu  5659 Aug 21  2023 clickhousereader-0.0.1-SNAPSHOT.jar.back
drwxrwxr-x. 2 atguigu atguigu  4096 Aug  4 16:48 libs
-rw-r--r--. 1 atguigu atguigu   334 Aug 21  2023 plugin_job_template.json
-rw-r--r--. 1 atguigu atguigu   241 Aug 21  2023 plugin.json
```

Test run

Run a job that reads from ClickHouse and writes to MySQL.

Job configuration:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 1 },
      "errorLimit": { "record": 0, "percentage": 0.02 }
    },
    "content": [
      {
        "reader": {
          "name": "clickhousereader",
          "parameter": {
            "username": "default",
            "password": "default",
            "column": ["city", "temperature"],
            "connection": [
              {
                "table": ["measurements"],
                "jdbcUrl": ["jdbc:clickhouse://hadoop104:8123/test"]
              }
            ],
            "where": "city = 'Brisbane'"
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "writeMode": "insert",
            "username": "root",
            "password": "aaaaaa",
            "column": ["city", "temperature"],
            "session": ["set session sql_mode='ANSI'"],
            "preSql": [""],
            "connection": [
              {
                "jdbcUrl": "jdbc:mysql://hadoop102:3306/test?useUnicode=true&characterEncoding=utf8&serverTimezone=Asia/Shanghai",
                "table": ["measurements"]
              }
            ]
          }
        }
      }
    ]
  }
}
```

Running the job:

(screenshot: DataX job run output)

Author: hedeoer

Copyright: unless otherwise stated, all posts on this blog are released under the BY-NC-SA license. Please credit the source when reposting!