Common Commands for Data Analysts
Advanced SQL Techniques in Practice
Expanding horizontally: one row into many (the EXPLODE function)
Core syntax
lateral view explode(split(column, delimiter)) tableAlias as columnAlias
- split(field, delimiter): cuts a string into an array on the given delimiter
- explode(): turns each array element into its own row (the key function for the expansion)
- lateral view: joins the exploded rows back to the originating row (effectively a Cartesian product between each source row and its exploded values)
Execution flow
source table → SPLIT the string → array → EXPLODE → virtual table → LATERAL VIEW join
Worked example: user interest-tag analysis
Source table user_tags:
user_id tags
U001 sports,music
U002 movie,book
SQL:
SELECT user_id, single_tag
FROM user_tags
LATERAL VIEW EXPLODE(SPLIT(tags, ',')) tmp AS single_tag;
Result:
user_id single_tag
U001 sports
U001 music
U002 movie
U002 book
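For intuition, the same split-and-expand step can be sketched outside Hive with plain awk (purely illustrative; the input rows mirror the user_tags table above, with user_id and tags separated by a space):

```shell
# awk analogue of SPLIT + EXPLODE: emit one (user_id, tag) row per element
printf 'U001 sports,music\nU002 movie,book\n' |
awk '{
  n = split($2, tags, ",")      # split the tag string into an array
  for (i = 1; i <= n; i++)      # one output row per array element
    print $1, tags[i]
}'
# → U001 sports
#   U001 music
#   U002 movie
#   U002 book
```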
Use case
Breaking comma-separated tag strings into a normalized, one-tag-per-row tag system
Aggregating vertically: many rows into one (the COLLECT_SET function)
Core syntax
SELECT uid, sort_array(collect_set(phone)) AS phone_list
FROM test_table1
GROUP BY uid;
- collect_set(field): aggregates distinct values into an array (unlike collect_list, which keeps duplicates)
- sort_array(): sorts the resulting array in ascending order
- Usage notes: must be used with GROUP BY; returns an array type, whose elements can be accessed by index (e.g. [0])
Worked example: merging a user's multiple phone numbers
Source table user_phones:
uid phone
U001 1380000111
U001 1390000222
U002 1350000333
U001 1380000111 (duplicate row)
SQL:
SELECT uid, sort_array(collect_set(phone)) AS distinct_phones
FROM user_phones
GROUP BY uid;
Result:
uid distinct_phones
U001 ["1380000111","1390000222"] (deduplicated and sorted automatically)
U002 ["1350000333"]
Production tips
- Take the first number from the aggregate: distinct_phones[0] AS primary_phone
- Count the numbers: size(distinct_phones) AS phone_count
Combined row/column reshaping: order-data processing
Sample data orders
order_id user_id items
O1001 U001 iPhone14,AirPods,Charger
O1002 U002 iPad,ApplePencil
Requirement: count how many distinct item types each user bought
Step 1: explode items into one row per item
WITH exploded_data AS (
  SELECT order_id, user_id, single_item
  FROM orders
  LATERAL VIEW EXPLODE(SPLIT(items, ',')) tmp AS single_item
)
Step 2: aggregate back per user
SELECT user_id,
       collect_set(single_item) AS bought_items,
       size(collect_set(single_item)) AS item_type_count
FROM exploded_data
GROUP BY user_id;
Final result
user_id bought_items item_type_count
U001 ["AirPods","Charger","iPhone14"] 3
U002 ["ApplePencil","iPad"] 2
Pitfall guide
- NULL handling: use LATERAL VIEW OUTER EXPLODE(...) to keep rows whose array is NULL or empty
- Multiple-explode conflicts: give each explode its own lateral view: LATERAL VIEW EXPLODE(arr1) v1 AS e1 LATERAL VIEW EXPLODE(arr2) v2 AS e2
- Data-skew mitigation: pre-filter oversized arrays, e.g. WHERE size(arr) < 100
Applicable scenarios
User profiling, order-behavior statistics, and other workloads that need row/column reshaping
Reshaping data with PIVOT and UNPIVOT
-- Pivot sales rows into one column per region
SELECT *
FROM (
  SELECT product_id, region, sales_amount
  FROM sales_data
)
PIVOT (
  SUM(sales_amount) FOR region IN ('North' AS North, 'South' AS South, 'East' AS East, 'West' AS West)
);
-- Unpivot: turn the crosstab back into rows
SELECT product_id, region, sales
FROM (
  SELECT product_id, North, South, East, West
  FROM pivoted_sales
)
UNPIVOT (
  sales FOR region IN (North, South, East, West)
);
Advanced ROW_NUMBER() usage
-- Top 3 products by sales within each region
SELECT region, product_id, sales_amount, row_num
FROM (
  SELECT region, product_id, sales_amount,
         ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales_amount DESC) AS row_num
  FROM sales_data
) ranked
WHERE row_num <= 3;
Writing SQL efficiently in Sublime Text
Plugin setup:
- SQLTools plugin: autocompletes table and column names
- DBeaver integration: run SQL directly from Sublime
- Alignment plugin: auto-formats SQL indentation
Keyboard shortcuts:
- Ctrl+/: toggle line comment
- Ctrl+Shift+↑/↓: move the current line up/down
- Ctrl+D: add the next occurrence of the selection to a multi-cursor selection
Code snippet example:
{
  "scope": "source.sql",
  "completion": "SELECT",
  "trigger": "ssel",
  "contents": "SELECT ${1:column}\nFROM ${2:table}\nWHERE ${3:condition};"
}
Shell Scripting for Automation
Batch-generating SQL statements
# Generate one query per product ID in the list
products=("P100" "P205" "P309")
for product in "${products[@]}"
do
  echo "SELECT * FROM inventory WHERE product_id='$product';" >> batch_query.sql
done
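A quick way to confirm what the loop produces (writing into a scratch directory so nothing is clobbered; the bash array is replaced with a plain word list here to stay POSIX-sh friendly):

```shell
# Generate the batch file in a temp directory and verify its contents
tmpdir=$(mktemp -d)
for product in P100 P205 P309; do
  echo "SELECT * FROM inventory WHERE product_id='$product';" >> "$tmpdir/batch_query.sql"
done
wc -l < "$tmpdir/batch_query.sql"   # one statement per product
rm -r "$tmpdir"
```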
crontab scheduled-job examples
# Run the data sync at 01:00 every day
00 1 * * * /usr/bin/bash /scripts/sync_sales_data.sh
# Generate the weekly report at 05:00 every Monday
0 5 * * 1 /usr/bin/sqlplus user/pass@db @/scripts/weekly_report.sql
The text-processing trio in practice
Advanced grep searches:
# Find lines containing ERROR and show 5 lines of context before and after
grep -A 5 -B 5 'ERROR' system.log
# List CSV files under data/ that contain lines starting with 2023
grep -rl '^2023' data/ --include='*.csv'
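The context flags are easy to verify on throwaway input (log contents invented for the demo, piped via stdin instead of system.log):

```shell
# Pull the ERROR line plus one line of context on each side
printf 'boot ok\ncache warm\nERROR disk full\nretrying\nshutdown\n' |
  grep -A 1 -B 1 'ERROR'
# → cache warm
#   ERROR disk full
#   retrying
```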
sed stream editing in practice:
# Bulk-rewrite dates (YYYY-MM-DD -> MM/DD/YYYY)
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/g' dates.csv
# Delete comment lines from a SQL file in place
sed -i '/^--/d' query.sql
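The date rewrite is worth checking on a single line from stdin before pointing it at a real file:

```shell
# YYYY-MM-DD -> MM/DD/YYYY: capture groups reordered as \2/\3/\1
echo "order placed on 2023-05-01" |
  sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/g'
# → order placed on 05/01/2023
```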
awk, the data-processing workhorse:
# Total sales per region (assuming region in column 2, amount in column 3)
awk -F ',' '{sales[$2] += $3} END {for (reg in sales) print reg, sales[reg]}' sales.csv
# Pretty-print product info as aligned columns
awk 'BEGIN {FS=","; printf "%-10s %-15s %-10s\n", "ID", "Name", "Price"}
     {printf "%-10s %-15s $%-10.2f\n", $1, $2, $3}' products.csv
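The per-region sum can be tried on inline sample rows (region in column 2, amount in column 3, as the script assumes; sort only makes the unordered awk output deterministic):

```shell
# Sum sales per region over three sample CSV rows
printf 'O1,North,100\nO2,South,50\nO3,North,25\n' |
  awk -F ',' '{sales[$2] += $3} END {for (reg in sales) print reg, sales[reg]}' |
  sort
# → North 125
#   South 50
```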
A complete shell workflow
#!/bin/bash
# File: monthly_report.sh
REPORT_DATE=$(date +%Y-%m-%d)
SQL_TEMPLATE="/templates/sales_report.sql"
OUTPUT_DIR="/reports"

# Fill today's date into the SQL template
sed "s/REPORT_DATE/$REPORT_DATE/" "$SQL_TEMPLATE" > current_report.sql

# Run the SQL and export the result, dropping the header row
mysql -u report_user -p"$DB_PASS" < current_report.sql | grep -v '^ID' > "$OUTPUT_DIR/sales_${REPORT_DATE}.csv"

# Send a notification email with the report attached
echo "The monthly sales report has been generated; please see the attachment." | mailx -a "$OUTPUT_DIR/sales_${REPORT_DATE}.csv" -s "Sales report ($REPORT_DATE)" team@company.com
An efficient data-analysis workflow
Efficiency comparison:
Task | Manual | Automated script | Speed-up |
---|---|---|---|
Data extraction | 30 min | ~0 min | ∞ |
Report generation | 2 h | 5 min | 24x |
Data cleaning | 45 min | 10 s | 270x |
Workflow diagram: