开源的组件版本
单台配置,128G,32C,磁盘20T
选择的aliyun的云服务EMR,单台年费大概在5W左右 [[集群规模及部门介绍#^a46fb1|hadoop集群规模]]
plaintext自动换行:关放大阅读展开代码- 确定主维表和相关维表,主维表一般与事实表直接相关,其他的围绕业务过程的就可选为相关维表;维度字段一般需要多个维表相会关联,或者维度属性拆解,维度属性合并等 - 维度需要尽可能丰富 - 维度属性应该包含属性的描述
维度建模注意的问题
collect_set()或者collect_listsql自动换行:关放大阅读展开代码create temporary table dim_user_zip ( id string comment '用户id', login_name string comment '用户名称', nick_name string comment '用户昵称', name string comment '用户姓名', phone_num string comment '手机号码', email string comment '邮箱', user_level string comment '用户等级', birthday string comment '生日', gender string comment '性别', create_time string comment '创建时间', operate_time string comment '操作时间', start_date string comment '开始日期', -- end_date string comment '结束日期', -- 生命周期结束 dt string )
项目中用到的维度拉链方式:
sql自动换行:关放大阅读展开代码--首日数据加载: insert overwrite table dim_user_zip partition (dt='9999-12-31') select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, date_format(nvl(data.operate_time,data.create_time),'yyyy-MM-dd'), '9999-12-31' from ods_user_info_inc where dt='2020-06-14' and type='bootstrap-insert'; set hive.exec.dynamic.partition.mode=nonstrict; with old as ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt='9999-12-31' ),new as ( --一个用户一天可能更新多次,此时只需取最新状态数据即可 select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, '2020-06-15' start_date, '9999-12-31' end_date from ( select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ods_user_info_inc where dt = '2020-06-15' ) t1 where rn=1 ), full_user as ( select old.id old_id, old.login_name old_login_name, old.nick_name old_nick_name, old.name old_name, old.phone_num old_phone_num, old.email old_email, old.user_level old_user_level, old.birthday old_birthday, old.gender old_gender, old.create_time old_create_time, old.operate_time old_operate_time, old.start_date old_start_date, old.end_date old_end_date, new.id new_id, new.login_name new_login_name, new.nick_name new_nick_name, new.name new_name, new.phone_num new_phone_num, new.email new_email, new.user_level new_user_level, new.birthday new_birthday, new.gender new_gender, new.create_time new_create_time, new.operate_time new_operate_time, new.start_date new_start_date, new.end_date new_end_date from old full join new on old.id = new.id ) insert overwrite table dim_user_zip partition (dt) --查询最新数据 select nvl(new_id,old_id) id, if( new_id is not null, new_login_name,old_login_name ), if( new_id is not null, new_nick_name,old_nick_name ), if( new_id is not null, new_name,old_name ), if( new_id is not null, new_phone_num,old_phone_num ), if( new_id is not null, new_email,old_email ), if( new_id is not null, new_user_level,old_user_level ), if( new_id is not null, new_birthday,old_birthday ), if( new_id is not null, new_gender,old_gender ), if( new_id is not null, new_create_time,old_create_time ), if( new_id is not null, new_operate_time,old_operate_time ), if( new_id is not null, new_start_date,old_start_date ), if( new_id is not null, new_end_date,old_end_date ), if( new_id is not null, new_end_date,old_end_date ) from full_user union all select old_id, old_login_name, old_nick_name, old_name, old_phone_num, old_email, old_user_level, old_birthday, old_gender, old_create_time, old_operate_time, old_start_date, cast(date_sub('2020-06-15',1) as string) , cast(date_sub('2020-06-15',1) as string) from full_user where new_id is not null and old_id is not null; with old as ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt='9999-12-31' ),new as ( --一个用户一天可能更新多次,此时只需取最新状态数据即可 select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, '2020-06-15' start_date, '9999-12-31' end_date from ( select data.id, data.login_name, data.nick_name, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ods_user_info_inc where dt = '2020-06-15' ) t1 where rn=1 ),full_user as ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date, row_number() over (partition by id order by start_date desc) rn from ( select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from old union all select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from new ) t1 ) insert overwrite table dim_user_zip partition (dt) --查询最新数据 select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date, end_date dt from full_user where rn = 1 union all --失效数据 select id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, cast(date_sub('2020-06-15',1) as string), cast(date_sub('2020-06-15',1) as string) from full_user where rn != 1
拉链表的退链处理 数据错误时间 e_t
数仓数据域
按照划分的数据域,和之前业务调研,确定维度,事实表,事实表度量,建立矩阵
200张事实表??
6张维度表
确定数仓架构和架构选型
业务数据:
plaintext自动换行:关放大阅读展开代码1. mysql -> maxwell -> kafka -> flume -> hdfs -> datahouse 2. mysql -> datax -> hdfs
日志数据:
plaintext自动换行:关放大阅读展开代码1. springboot -> nginx -> 日志服务器 -> flume -> kakfa -> flume -> hdfs
主要思路:
数据调研
划分数据域
建立业务总线矩阵
确定数仓架构和选型
维度建模
sql编写,测试,上线
| 指标名 | 数据域 | 解释 |
|---|---|---|
| 各个引流渠道的跳出率 | 流量域 | 只有一个页面的会话 / 该渠道的总会话数 |
| 路径分析 | 流量域 | 页面跳转统计 |
| 注册漏斗分析 | 流量域 | 启动app,浏览页面,注册结果,提交订单,支付订单 |
| 关键词搜索频次分布 | 流量域 | 用户输入关键词搜索后返回结果统计,用于调整用户搜索体验 |
| 商品复购率 | 商品域 | 重复购买次数 / 总购买次数 |
| 最近1,7,30日的下单人数,订单数 | 商品域 | |
| 各品类购车车存量top3 | 商品域 | |
| 商品收藏量top3 | 商品域 | |
| 下单成功的订单中,下单到支付的平均时间间隔 | 交易域 | |
| 线上会员订单业务类型分布 | 交易域 | 各个注册来源的会员在支付时支付方式统计,比如支付包,微信,收银机等 |
| 各省份GMV(交易总额) | 交易域 | |
| 客单价 | 交易域 | 下单总额 / 下单人数 |
| 购买频次分布 | 交易域 | 当日支付1次的人数,支付2次的人数... |
| 会员客单价 | 交易域 | 会员的总支付金额 / 总会员人数 |
| 会员消费频次 | 交易域 | |
| 会员复购分析 | 交易域 | |
| 消费时段偏好 | 交易域 | |
| 各个优惠券使用情况,比如领取,下单,支付 | 工具域 | |
| 7日新增和活跃统计 | 用户域 | |
| 用户的留存分析 | 用户域 | |
| 用户流失统计 | 用户域 | |
| 注册用户LTV价值分析 | 用户域 | 注册用户随着时间流逝,每天支付总额分析 |
| 用户漏斗分析 | 用户域 | |
| 新增下单用户统计 | 用户域 | |
| 会员身份分布 | 用户域 | |
| 会员来源统计 | 用户域 | |
自己写SQL维护,并结合后端工程师的springboot项目,提供置空率,指标波动等告警
本文作者:hedeoer
本文链接:
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!