当前位置: 首页 > 面试题库 >

连续重复/重复的有序计数

赫连华皓
2023-03-14
问题内容

我非常怀疑我是否以最有效的方式执行此操作,这就是为什么我plpgsql在此处进行了标记。对于 一千个测量系统,* 我需要在 20亿行
上运行。
*

您拥有一些测量系统,当它们失去连接时,它们通常会报告以前的值,并且有时会(但有时很长)会突然断开连接。您需要汇总,但是这样做时,您需要查看其重复的时间,并根据该信息进行各种筛选。假设您正在测量汽车的mpg,但它在20
mpg的情况下停留一个小时,然后移动到20.1,依此类推。您将需要评估卡住时的准确性。您还可以放置一些替代规则,以查找汽车何时在高速公路上,并使用窗口功能来生成汽车的“状态”并进行分组。无需再费周折:

--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
    (
    select 
    system_measured, time_of_measurement, measurement, 
    case when 
     measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc) 
     then 1 else 0 end as repeat
    FROM
    (
    SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
    UNION
    SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
    UNION
    SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
    ) as data
) as data;

--unfortunately you can't have window functions within window functions, so I had to break it down into subquery
--what we need is something to partion on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
    (
    select 
    *,
    sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
    from cumulative_repeat_calculator_data
    order by system_measured, time_of_measurement
) as data;

--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example  
select *, 
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement

那么,为了在一个巨大的表上运行它,您将采取什么不同的操作,或者您将使用哪些替代工具?我正在考虑使用plpgsql,因为我怀疑这需要在数据库中完成,也可以在数据插入过程中完成,尽管我通常在加载数据后使用它。有什么方法可以一目了然地实现,而无需诉诸子查询?

我已经测试了一种 替代方法
,但是它仍然依赖于子查询,并且我认为这更快。对于该方法,您将使用start_timestamp,end_timestamp,system创建一个“开始和停止”表。然后,您加入更大的表,如果时间戳介于两者之间,则将其分类为处于该状态,这实际上是的替代方案cumlative_sum_of_nonrepeats_by_system。但是,当您执行此操作时,成千上万个设备和成千上万个“事件”将以1
= 1的比例加入。您认为这是更好的方法吗?


问题答案:

测试用例

首先,一种更有用的方式来呈现您的数据-甚至更好的方式是,在
sqlfiddle中
使用:

CREATE TEMP TABLE data(
   system_measured int
 , time_of_measurement int
 , measurement int
);

INSERT INTO data VALUES
 (1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);

简化查询

由于尚不清楚,因此我仅假设以上给出的内容。
接下来,我简化了查询,以得出:

WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
   , y AS (
   SELECT *, sum(step) OVER(PARTITION BY system_measured
                            ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
                             ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM   y
ORDER  BY system_measured, time_of_measurement;

现在,虽然它是所有好的和有光泽使用纯SQL,这将是 很大 一个PLPGSQL功能更快,因为它可以在一个表扫描,其中该查询至少需要三次扫描做到这一点。

使用plpgsql函数更快:

CREATE OR REPLACE FUNCTION x.f_repeat_ct()
  RETURNS TABLE (
    system_measured int
  , time_of_measurement int
  , measurement int, repeat_ct int
  )  LANGUAGE plpgsql AS
$func$
DECLARE
   r    data;     -- table name serves as record type
   r0   data;
BEGIN

-- SET LOCAL work_mem = '1000 MB';  -- uncomment an adapt if needed, see below!

repeat_ct := 0;   -- init

FOR r IN
   SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
   IF  r.system_measured = r0.system_measured
       AND r.measurement = r0.measurement THEN
      repeat_ct := repeat_ct + 1;   -- start new array
   ELSE
      repeat_ct := 0;               -- start new count
   END IF;

   RETURN QUERY SELECT r.*, repeat_ct;

   r0 := r;                         -- remember last row
END LOOP;

END
$func$;

称呼:

SELECT * FROM x.f_repeat_ct();

确保始终使用这种plpgsql函数对表名进行表限定,因为我们使用与输出参数相同的名称,如果不限定则使用优先级。

数十亿行

如果您 有数十亿
,则可能需要拆分此操作。我在这里引用手册:

注意:如上所述,当前的实现RETURN NEXTRETURN QUERY 存储整个结果集,然后从函数返回。这意味着,如果PL /
pgSQL函数产生非常大的结果集,则性能可能会很差:将数据写入磁盘以避免内存耗尽,但是在生成整个结果集之前,函数本身不会返回。PL /
pgSQL的未来版本可能允许用户定义不具有此限制的集合返回函数。当前,开始将数据写入磁盘的位置由work_mem 配置变量控制。有足够内存在内存中存储更大结果集的管理员应考虑增加此参数。

考虑一次为一个系统计算行,或设置足够高的值work_mem来应付负载。请遵循报价中提供的链接,以获取更多有关work_mem的信息。

一种方法是在函数中为work_memwith设置一个很高的值SET LOCAL,这仅对当前事务有效。我在函数中添加了注释行。不要 设置非常高的全球范围内,因为这可能核弹攻击你的服务器。阅读手册。



 类似资料:
  • 我知道这是一个复杂的查询,因此我有一个单独的列,但该列包含几乎99%的null值(这是非常无效的),因此我想知道是否可以通过查询来实现这一点。 我有这样的桌子: 现在的查询是,如果对于3个连续的产品ID,对应的值是0(按照TransactionId ASC的顺序),那么对于,它将被计为1,即。 我们如何查询这个? 一个可选的简短问题:p 如果我有很多空值的列,可以吗?

  • 我是ES6新手,所以我学习Javascript语句。 在测试async/await时,我发现了一些奇怪的行为。 我这样写代码, 输出在这里, 我定义async来运行,并等待每一行同步工作。 预期产量在这里, 谢谢。

  • 问题内容: 我想删除彼此跟随的重复项,但不要删除整个数组中的重复项。另外,我想保持顺序不变。 所以如果输入是输出 我找到了一种使用方法,但我正在寻找更快的NumPy解决方案。 问题答案: 一般的想法是用来查找数组中两个连续元素之间的差异。然后我们只索引那些给出差异元素的索引。但由于长度 是1那么短的索引之前,我们需要对涉及diff阵列的开始。 说明:

  • 问题内容: 我正在致力于Twitter数据标准化。Twitter用户经常使用诸如此类的术语来强调“爱”一词。我希望通过替换重复的字符,直到获得合适的有意义的词,将这样的重复字符变成一个适当的英语单词(我知道通过这种机制我无法区分善与恶)。 我的策略是 识别此类重复字符串的存在。我会寻找两个以上相同的字符,因为可能没有一个英文单词包含两个以上的重复字符。 在诸如Wordnet之类的词典中搜索此类单词

  • 问题内容: 在numpy数组中找到最大连续重复nan的最佳方法是什么? 例子: 输入1: 输出1: 输入2: 输出2: 问题答案: 这是一种方法- 这是一个改进的版本- 基准测试-

  • 表名称包含一列 目标是计算重复次数以得到以下结果 如何做到这一点?