Hive - Count && Sum: Usage and Performance Comparison

吕衡

2023-12-01

1. Introduction

When counting in Hive, Count and Sum are the two functions most often used. Below is a look at how each one is used.

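count returns the number of valid rows: count(*) counts every row, while count(col) counts only the rows where col is not NULL.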

select count(*) from table

select count(col) from table
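The difference between the two forms only shows up when the column contains NULL. A minimal sketch, assuming a hypothetical inline dataset `t` built with a CTE (it is not a table from this article):

```sql
-- Hypothetical three-row dataset; the second row holds NULL in col.
WITH t AS (
  SELECT 'a' AS col
  UNION ALL
  SELECT CAST(NULL AS STRING) AS col
  UNION ALL
  SELECT 'b' AS col
)
SELECT
  count(*)   AS all_rows,      -- counts every row: 3
  count(col) AS non_null_rows  -- skips the NULL row: 2
FROM t;
```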

The previous article, 使用 Hive 统计 UV,PV (Computing UV and PV with Hive), introduced the related usage:

A.distinct

Deduplicate uid to get the UV

select count(distinct click.uid) as send_uv from table

B.case

Count the clicked samples, that is, the rows where label='1'

select count(case when label='1' then click.uid else NULL end) from table

C.distinct + case

Because count does not count NULL, combining distinct with case gives the UV of the clicked samples

select count(distinct case when label='1' then click.uid else NULL end) as click_uv from table
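The three patterns A, B, and C can be compared side by side on a small dataset. This is only a sketch: the CTE `clicks` and its five rows are made up for illustration, and a plain `uid` column stands in for `click.uid`.

```sql
-- Hypothetical log: u1 clicks three times, u2 never clicks, u3 clicks once.
WITH clicks AS (
  SELECT 'u1' AS uid, '1' AS label
  UNION ALL SELECT 'u1' AS uid, '1' AS label
  UNION ALL SELECT 'u1' AS uid, '1' AS label
  UNION ALL SELECT 'u2' AS uid, '0' AS label
  UNION ALL SELECT 'u3' AS uid, '1' AS label
)
SELECT
  -- A. distinct: unique users overall (u1, u2, u3) = 3
  count(DISTINCT uid) AS send_uv,
  -- B. case: rows with label='1'; the other rows become NULL and are skipped = 4
  count(CASE WHEN label = '1' THEN uid ELSE NULL END) AS click_pv,
  -- C. distinct + case: unique users with label='1' (u1, u3) = 2
  count(DISTINCT CASE WHEN label = '1' THEN uid ELSE NULL END) AS click_uv
FROM clicks;
```

The `ELSE NULL` branch is what makes B and C work: count never sees the non-matching rows.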

The sum function adds up the values of a column and, like count, ignores NULL values

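sum(1) gives the same result as count(1) or count(*) whenever there is at least one row; on an empty input, sum(1) returns NULL while count returns 0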

select sum(1) from table

select sum(cost) from table

select sum(case when cost == 'none' then 0 else 100 end) from table
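The NULL handling of sum, and the one case where sum(1) and count(1) disagree, can be checked with a small sketch. The CTE `t` and its integer `cost` column are assumptions made for this example, not the article's table.

```sql
-- Hypothetical dataset: three rows, one of them with a NULL cost.
WITH t AS (
  SELECT 10 AS cost
  UNION ALL
  SELECT CAST(NULL AS INT) AS cost
  UNION ALL
  SELECT 5 AS cost
)
SELECT
  sum(cost) AS total_cost,     -- the NULL row is ignored: 15
  sum(1)    AS rows_by_sum,    -- adds 1 per row: 3
  count(1)  AS rows_by_count,  -- counts every row: 3
  sum(CASE WHEN cost IS NULL THEN 0 ELSE 100 END) AS weighted  -- 100 + 0 + 100 = 200
FROM t;

-- Same aggregates over an empty input: the filter matches no rows.
WITH t AS (
  SELECT 10 AS cost
  UNION ALL
  SELECT 5 AS cost
)
SELECT
  sum(1)   AS rows_by_sum,    -- NULL, because there is nothing to add
  count(1) AS rows_by_count   -- 0
FROM t
WHERE cost > 100;
```

If a downstream job expects a number, wrapping the sum as `coalesce(sum(1), 0)` is one way to keep the two forms interchangeable.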

count(1), count(*), and sum(1) are all commonly used to get a total row count. Here is how their efficiency compares:

Overall, sum(1) is slightly faster than count(1), and count(*) is the slowest.