Redshift: count distinct customers over window partitions

纪翰

2023-03-14

Question:

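Redshift does not support DISTINCT aggregates in its window functions. The AWS documentation for COUNT states this, and DISTINCT is not supported for any of the window functions.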

My use case: count customers over varying time intervals and traffic channels

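I am looking for monthly and year-to-date (YTD) unique customer counts for the current year, split by traffic channel as well as totaled across all channels. Since a customer can visit more than once, I need to count only distinct customers, so the Redshift window aggregates won't help.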

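I can count distinct customers using count(distinct customer_id)...group by, but that gives me only one of the four results I need.
I do not want to get into the habit of running a full query for each desired count, stacked together with a bunch of union all. I hope that is not the only solution.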
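For illustration, a minimal sketch of the single-grain GROUP BY query described above, using the table and column names from this question. It returns only the per-channel, per-month count; each of the other three grains would need its own query with a different GROUP BY.

```sql
/* one grain only: distinct customers per month and channel */
select order_month
       , traffic_channel
       , count(distinct customer_id) as customers_by_channel_and_month

from orders_traffic_channels

where to_char(order_month, 'YYYY') = '2017'

group by order_month, traffic_channel
;
```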

This is what I would write in Oracle, which accepts DISTINCT in windowed aggregates (PostgreSQL, like Redshift, rejects it):

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* orders_traffic_channels (otc) is a table of dated transactions with customer, channel, and month of order */

where to_char(order_month, 'YYYY') = '2017'

How do I solve this in Redshift?

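The result needs to work on a Redshift cluster. Furthermore, this is a simplified problem: the actual desired result also has product category and customer type, which multiplies the number of partitions needed. A stack of union all rollups is therefore not a good solution.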

Answer:

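A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you, Mark D. Adams. Strangely, I could find very little else about it anywhere on the web, so I am sharing my (tested) solution.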

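The key insight is that dense_rank(), ordered by the item in question, gives identical items the same rank, so the highest rank is also the count of unique items. Simply swapping the following in for each partition I want turns into a horrible mess, though: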

dense_rank() over(partition by order_month, traffic_channel order by customer_id)
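To make the ranking behavior concrete, here is a small sketch on inline sample data. The sample_orders CTE and its rows are hypothetical and exist only for illustration; the column names follow the question.

```sql
/* hypothetical sample: customer 101 orders twice in the same month and channel */
with sample_orders as (
       select date '2017-01-01' as order_month, 'email' as traffic_channel, 101 as customer_id
       union all select date '2017-01-01', 'email', 101
       union all select date '2017-01-01', 'email', 102
       union all select date '2017-01-01', 'search', 101
)
select order_month
       , traffic_channel
       , customer_id
       , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk

from sample_orders

order by traffic_channel, customer_id
;
```

In the email partition both rows for customer 101 get rank 1 and customer 102 gets rank 2, so the highest rank (2) equals the number of distinct customers. In the search partition the only rank is 1.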

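Since you need the highest rank, you have to wrap everything in a subquery and select the max value of each ranking. It is important to match each partition in the outer query to the corresponding partition in the subquery.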

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) ranked /* alias added for portability; some engines require one on a subquery in FROM */

order by order_month, traffic_channel
;

Notes

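The partitions of max() and dense_rank() must match.
dense_rank() will rank null values (all at the same rank, which is the max). If you do not want nulls counted, you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.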
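A sketch of the null-excluding variant from the last note, shown for a single grain to keep it short. It assumes the default ascending sort puts nulls last, so non-null customers keep ranks 1..n and the null row's rank is blanked out by the case expression. The subquery alias ranked is a name introduced here for illustration.

```sql
/* distinct non-null customers per month and channel */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month

from (
       select order_month
              , traffic_channel
              /* null customer_id yields a null rank, which max() ignores */
              , case when customer_id is not null
                     then dense_rank() over(partition by order_month, traffic_channel order by customer_id)
                end  tc_mth_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) ranked

order by order_month, traffic_channel
;
```

A partition that contains only null customers returns null rather than 0 here; wrapping the max() in coalesce(..., 0) would cover that case if it matters.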
