[Hive] An Efficient Way to Implement an Incremental Table
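A template for a Hive "ia" table: the most concise algorithm for updating historical data and merging in new data.
Other approaches exist as well, such as brute-force deduplication with row_number().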
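For comparison, the row_number() approach can be sketched as follows. This is only an illustration under stated assumptions: SQLite (via Python's standard `sqlite3` module, version 3.25 or later for window functions) stands in for Hive, and the tables `old_t` and `new_t` with columns `id`, `val` and `ss_dt` are made up for the example. The idea is to stack the old and new rows, rank the rows of each key by partition date, and keep only the newest one.

```python
import sqlite3

# SQLite stands in for Hive; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_t (id INTEGER, val TEXT, ss_dt TEXT);
    CREATE TABLE new_t (id INTEGER, val TEXT, ss_dt TEXT);
    INSERT INTO old_t VALUES
        (1,'old','20251013'),(2,'old','20251013'),(3,'old','20251013'),
        (4,'old','20251013'),(5,'old','20251013');
    INSERT INTO new_t VALUES
        (4,'new','20251014'),(5,'new','20251014'),
        (6,'new','20251014'),(7,'new','20251014');
""")

DEDUP_SQL = """
    SELECT id, val
    FROM (
        SELECT id, val,
               -- rank the rows of each key, newest partition first
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY ss_dt DESC) AS rn
        FROM (
            SELECT id, val, ss_dt FROM old_t
            UNION ALL
            SELECT id, val, ss_dt FROM new_t
        ) u
    ) ranked
    WHERE rn = 1
"""

latest = sorted(conn.execute(DEDUP_SQL).fetchall())
for row in latest:
    print(row)
```

Every key ends up with exactly one row, but the whole stacked data set has to be grouped and sorted by key, which is presumably why the post calls this approach brute force.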
Demonstration of the most concise algorithm
Old data: 1, 2, 3, 4, 5
New data: 4, 5, 6, 7
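Code block 1 (before the UNION): the old data is the left table and is LEFT JOINed to the new table. A row that finds a match has been updated. Because updated and non-updated rows are mixed together in this result, CASE WHEN decides, column by column, which value to keep.
Code block 2 (after the UNION): the new data is the left table and is LEFT JOINed to the old table. Rows that overlap with the old data are dropped, so only the purely new rows remain.
With the sample data, block 1 yields 1, 2, 3 unchanged plus the updated 4 and 5, and block 2 yields 6 and 7.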
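The two blocks can be tried out on the sample data with a small sketch. SQLite (Python's standard `sqlite3` module) stands in for Hive here, and the table names `old_t` and `new_t` and the `val` column are assumptions made for the illustration. The SQL uses only LEFT JOIN, CASE WHEN and UNION ALL, so the same shape should carry over to Hive.

```python
import sqlite3

# SQLite stands in for Hive; `old_t`, `new_t` and `val` are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_t (id INTEGER, val TEXT);
    CREATE TABLE new_t (id INTEGER, val TEXT);
    INSERT INTO old_t VALUES (1,'old'),(2,'old'),(3,'old'),(4,'old'),(5,'old');
    INSERT INTO new_t VALUES (4,'new'),(5,'new'),(6,'new'),(7,'new');
""")

MERGE_SQL = """
    -- Block 1: every old row, overwritten by the new row when the key matches
    SELECT o.id AS id,
           CASE WHEN n.val IS NOT NULL THEN n.val ELSE o.val END AS val
    FROM old_t o
    LEFT JOIN new_t n ON o.id = n.id
    UNION ALL
    -- Block 2: new rows whose key does not exist in the old data
    SELECT n.id, n.val
    FROM new_t n
    LEFT JOIN old_t o ON n.id = o.id
    WHERE o.id IS NULL
"""

merged = sorted(conn.execute(MERGE_SQL).fetchall())
for row in merged:
    print(row)
```

Keys 1 to 3 keep their old values, keys 4 and 5 take the new values, and keys 6 and 7 are appended, so each key appears exactly once without any deduplication step.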
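For example, suppose data is loaded into Hive daily:
The ods layer stores the latest daily data; assume today's partition is 20251014.
The dwd layer stores the historical data, and every partition holds a complete snapshot; its latest partition is 20251013.
The parameters passed are therefore: 20251014 20251013 20251014 20251014
Meaning: take yesterday's complete partition from dwd, combine it with the day's new data, and produce a new version of the full data set.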
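For a daily schedule, the four values could be derived from the run date as in the sketch below. The helper `partition_params` is hypothetical, and the mapping of positions to names (destDay, startDayP1, startDay, endDay, in that order) is an assumption based on the order of the SET statements in the script.

```python
from datetime import date, timedelta


def partition_params(run_day: date) -> dict:
    """Build the four partition parameters for a single-day run (hypothetical helper)."""
    fmt = "%Y%m%d"
    today = run_day.strftime(fmt)
    previous = (run_day - timedelta(days=1)).strftime(fmt)
    return {
        "destDay": today,        # partition to write
        "startDayP1": previous,  # last complete dwd partition
        "startDay": today,       # first ods partition to merge
        "endDay": today,         # last ods partition to merge
    }


params = partition_params(date(2025, 10, 14))
print(" ".join(params.values()))  # prints "20251014 20251013 20251014 20251014"
```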
SET destDay;    -- target partition parameter
SET startDayP1; -- parameter for the day before the start partition
SET startDay;   -- start partition parameter
SET endDay;     -- end partition parameter

SELECT
    -- Historical part: check whether each row has been updated; if so, take the updated values
    (CASE WHEN t2.org_id IS NOT NULL THEN t2.org_id ELSE t1.org_id END) AS org_id,
    (CASE WHEN t2.trace_no IS NOT NULL THEN t2.trace_no ELSE t1.trace_no END) AS trace_no,
    (CASE WHEN t2.bc_id IS NOT NULL THEN t2.bc_id ELSE t1.bc_id END) AS bc_id,
    (CASE WHEN t2.creation_date IS NOT NULL THEN t2.creation_date ELSE t1.creation_date END) AS creation_date,
    (CASE WHEN t2.created_by IS NOT NULL THEN t2.created_by ELSE t1.created_by END) AS created_by,
    (CASE WHEN t2.last_update_date IS NOT NULL THEN t2.last_update_date ELSE t1.last_update_date END) AS last_update_date,
    (CASE WHEN t2.last_update_by IS NOT NULL THEN t2.last_update_by ELSE t1.last_update_by END) AS last_update_by,
    (CASE WHEN t2.lot_code IS NOT NULL THEN t2.lot_code ELSE t1.lot_code END) AS lot_code,
    (CASE WHEN t2.scan_code IS NOT NULL THEN t2.scan_code ELSE t1.scan_code END) AS scan_code,
    (CASE WHEN t2.card_no IS NOT NULL THEN t2.card_no ELSE t1.card_no END) AS card_no,
    (CASE WHEN t2.card_ser IS NOT NULL THEN t2.card_ser ELSE t1.card_ser END) AS card_ser,
    (CASE WHEN t2.prod_line_id IS NOT NULL THEN t2.prod_line_id ELSE t1.prod_line_id END) AS prod_line_id,
    (CASE WHEN t2.card_qty IS NOT NULL THEN t2.card_qty ELSE t1.card_qty END) AS card_qty,
    (CASE WHEN t2.card_id IS NOT NULL THEN t2.card_id ELSE t1.card_id END) AS card_id,
    (CASE WHEN t2.item_no IS NOT NULL THEN t2.item_no ELSE t1.item_no END) AS item_no,
    (CASE WHEN t2.is_chong IS NOT NULL THEN t2.is_chong ELSE t1.is_chong END) AS is_chong,
    (CASE WHEN t2.real_code IS NOT NULL THEN t2.real_code ELSE t1.real_code END) AS real_code,
    current_timestamp AS sys_update_time
FROM dwd.mesbh_mes_reck_t_a t1 -- take the day-before-yesterday's data
LEFT JOIN ods.mesbh_mes_reck_t_i t2 -- take yesterday's data
    ON t2.ss_dt = '${startDay}'
    AND t1.BC_ID = t2.BC_ID
    AND t1.trace_no = t2.trace_no
WHERE t1.ss_dt = '${startDayP1}' -- keep one complete copy of the old data; rows that match in both tables have been updated
UNION ALL
-- New-data part: directly append the newly arrived data
SELECT
    t1.org_id,
    t1.trace_no,
    t1.bc_id,
    t1.creation_date,
    t1.created_by,
    t1.last_update_date,
    t1.last_update_by,
    t1.lot_code,
    t1.scan_code,
    t1.card_no,
    t1.card_ser,
    t1.prod_line_id,
    t1.card_qty,
    t1.card_id,
    t1.item_no,
    t1.is_chong,
    t1.real_code,
    current_timestamp AS sys_update_time
FROM ods.mesbh_mes_reck_t_i t1
LEFT JOIN dwd.mesbh_mes_reck_t_a t2
    ON t2.ss_dt = '${startDayP1}'
    AND t1.BC_ID = t2.BC_ID
    AND t1.trace_no = t2.trace_no
WHERE t1.ss_dt = '${startDay}' -- keep only the data from the newest partition
    AND t2.org_id IS NULL -- drop rows in the new partition that overlap with the old table; keep only the new rows
    AND t2.trace_no IS NULL
;
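One behaviour of the per-column CASE WHEN in the first block may be worth checking against your own data: if an updated row legitimately sets a column to NULL, the per-column test falls back to the old value. The sketch below contrasts it with a variant that tests the join key instead, so a matched row is taken as a whole. SQLite (Python's standard `sqlite3` module) stands in for Hive, and the tables and columns are made up for the illustration; whether the fallback is desirable depends on what a NULL means in the source system.

```python
import sqlite3

# SQLite stands in for Hive; tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_t (id INTEGER, status TEXT, remark TEXT);
    CREATE TABLE new_t (id INTEGER, status TEXT, remark TEXT);
    INSERT INTO old_t VALUES (1, 'open', 'check later');
    INSERT INTO new_t VALUES (1, 'closed', NULL);  -- the update clears `remark`
""")

# Per-column test, as in the script above: a NULL in the new row keeps the old value.
column_level = conn.execute("""
    SELECT o.id,
           CASE WHEN n.status IS NOT NULL THEN n.status ELSE o.status END,
           CASE WHEN n.remark IS NOT NULL THEN n.remark ELSE o.remark END
    FROM old_t o
    LEFT JOIN new_t n ON o.id = n.id
""").fetchall()

# Key test: when the join matched, take every column from the new row.
row_level = conn.execute("""
    SELECT o.id,
           CASE WHEN n.id IS NOT NULL THEN n.status ELSE o.status END,
           CASE WHEN n.id IS NOT NULL THEN n.remark ELSE o.remark END
    FROM old_t o
    LEFT JOIN new_t n ON o.id = n.id
""").fetchall()

print(column_level)  # [(1, 'closed', 'check later')]
print(row_level)     # [(1, 'closed', None)]
```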