Hive small files problem
Take the following table as an example:
CREATE TABLE `saylo.t_saylo_user_feature`(
  `user_id`    string,
  `session_id` string,
  `value`      string)
PARTITIONED BY (`app_id` string, `datetime` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'hive.merge.size.per.task'='256000000',
  'hive.merge.smallfiles.avgsize'='16000000',
  'hive.merge.sparkfiles'='true');
1. SERDE
ORC is recommended: as a compressed columnar format it gives better read and write performance than row-oriented text formats.
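As a side note, the explicit SERDE/INPUTFORMAT/OUTPUTFORMAT triple in the DDL above is what the STORED AS ORC shorthand expands to; a minimal equivalent sketch (the orc.compress property is an illustrative addition, not part of the original DDL):

CREATE TABLE `saylo.t_saylo_user_feature`(`user_id` string, `session_id` string, `value` string)
PARTITIONED BY (`app_id` string, `datetime` timestamp)
STORED AS ORC                           -- expands to OrcSerde + OrcInputFormat/OrcOutputFormat
TBLPROPERTIES ('orc.compress'='ZLIB');  -- illustrative: ORC file-level compression codec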
2. File size
'hive.merge.size.per.task'='256000000': target size of each merged file in bytes (~256 MB); together with the data volume it determines the final file count.
'hive.merge.smallfiles.avgsize'='16000000': threshold in bytes (~16 MB); when the average size of a job's output files falls below it, a merge job is triggered.
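Note that the hive.merge.* keys are Hive configuration settings rather than intrinsic table properties; in stock Hive they are normally set per session before the write (whether a given platform also honors them from TBLPROPERTIES is deployment-specific). A minimal sketch using the values from the table above (the hive.merge.mapfiles line is an added assumption covering map-only jobs):

SET hive.merge.sparkfiles=true;              -- merge small files after Hive-on-Spark jobs
SET hive.merge.mapfiles=true;                -- assumption: also merge after map-only jobs
SET hive.merge.size.per.task=256000000;      -- target size of each merged file (~256 MB)
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when avg output file size < ~16 MB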
3. File merging
INSERT OVERWRITE TABLE t_saylo_user_feature PARTITION(app_id='30005', datetime='2025-07-09 00:00:00')
SELECT user_id, session_id, value FROM t_saylo_user_feature
WHERE app_id='30005' AND datetime='2025-07-09 00:00:00';
or
ALTER TABLE t_saylo_user_feature_test PARTITION(app_id='30005', datetime='2025-07-03 20:00:00') CONCATENATE;
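Both approaches compact a partition in place: the INSERT OVERWRITE rewrite obeys the hive.merge.* settings above, while CONCATENATE merges ORC files at the stripe level without rewriting rows, so it is cheaper but only applies to ORC (and RCFile) tables. To steer the output file count of the rewrite directly, a hedged sketch using DISTRIBUTE BY (the bucket count of 4 is an illustrative assumption):

INSERT OVERWRITE TABLE t_saylo_user_feature PARTITION(app_id='30005', datetime='2025-07-09 00:00:00')
SELECT user_id, session_id, value FROM t_saylo_user_feature
WHERE app_id='30005' AND datetime='2025-07-09 00:00:00'
DISTRIBUTE BY pmod(hash(user_id), 4);  -- 4 distinct keys => roughly 4 output files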