Notes of “Quotient Cube: How to Summarize the Semantics of a Data Cube”

符献

2023-12-01

1. Terminology

Roll up: aggregate upward, moving to a more general cell

Drill down: refine downward, moving to a more specific cell

w.r.t.: with respect to

lattice: A lattice is a partially ordered set (L, ≼) such that every pair of elements in L has a least upper bound (lub) and a greatest lower bound (glb).
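A minimal sketch of the order on cube cells, assuming cells are tuples whose wildcard is the string "all" and that (all, ..., all) is the bottom element (this matches Section 4, where the search starts there and climbs to upper bounds). The names precedes, glb, and lub are illustrative only.

```python
ALL = "all"  # wildcard marker, as in the cells written in these notes

def precedes(c, d):
    """c ≼ d: c is at least as general as d (d rolls up to c)."""
    return all(x == ALL or x == y for x, y in zip(c, d))

def glb(c, d):
    """Greatest lower bound: keep a value only where both cells agree."""
    return tuple(x if x == y else ALL for x, y in zip(c, d))

def lub(c, d):
    """Least upper bound, or None when the cells conflict on a dimension."""
    out = []
    for x, y in zip(c, d):
        if x == ALL:
            out.append(y)
        elif y == ALL or x == y:
            out.append(x)
        else:
            return None  # a full lattice would need an artificial top here
    return tuple(out)

print(glb(("a", ALL), (ALL, "b")))  # ('all', 'all')
print(lub(("a", ALL), (ALL, "b")))  # ('a', 'b')
```

Returning None for conflicting cells is a simplification: for every pair to have a lub, the set of cells has to be completed with an extra top element.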

2. Cube Lattice and Partition

2.1 Convex Partitions

A convex partition retains semantics:

If c₁ rolls up to c₂, c₂ rolls up to c₃, and c₁, c₃ ∈ CLS, then c₂ ∈ CLS, where CLS is a class of the partition.

Convexity means "holes" cannot exist in classes.
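The "no holes" condition can be checked by brute force on small examples. The sketch below assumes cells are tuples with the wildcard "all" and that c ≼ d means c is the more general cell; between and is_convex are hypothetical helper names.

```python
from itertools import product

ALL = "all"

def precedes(c, d):
    """c ≼ d: c is at least as general as d."""
    return all(x == ALL or x == y for x, y in zip(c, d))

def between(c, d):
    """All cells x with c ≼ x ≼ d (assumes c ≼ d)."""
    choices = [(y,) if x == y else (ALL, y) for x, y in zip(c, d)]
    return [tuple(p) for p in product(*choices)]

def is_convex(cls):
    """A class is convex if every cell between two members is a member."""
    cls = set(cls)
    for c in cls:
        for d in cls:
            if precedes(c, d) and any(x not in cls for x in between(c, d)):
                return False
    return True

print(is_convex({(ALL, ALL), ("a", ALL), (ALL, "b"), ("a", "b")}))  # True
print(is_convex({(ALL, ALL), ("a", "b")}))                          # False
```

In the second call, (a, all) and (all, b) lie between the two members but are missing, which is exactly a hole.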

2.2 Proposition 1 [Count and sum]

The equivalence relation defined solely on the basis of equality of count values is always convex. Suppose the domain of the measure attribute contains only non-negative (or only non-positive) values. Then the equivalence relation defined solely on the basis of equality of sum values is convex.

The proposition follows from the observation that whenever there is a cell c'' in between c and c', i.e., c ≼ c'' ≼ c', c'' must contain all tuples that c' has and only tuples that c has. For COUNT, and for SUM on a measure of one sign, equal values at c and c' therefore force the same value at c'', so no hole can form.
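The sign condition on SUM matters. The following sketch uses a hypothetical three-tuple base table with mixed-sign measures to show a chain c ≼ c'' ≼ c' where the two ends have the same sum but the middle cell does not, while COUNT stays ordered along the chain.

```python
ALL = "all"

# Hypothetical base table: (dimension values, measure), with mixed signs.
table = [(("a", "b1"), 1), (("a", "b2"), 5), (("a2", "b1"), -5)]

def measures(cell):
    """Measures of the base tuples that roll up to `cell`."""
    return [m for dims, m in table
            if all(x == ALL or x == y for x, y in zip(cell, dims))]

chain = [(ALL, ALL), ("a", ALL), ("a", "b1")]  # c ≼ c'' ≼ c'
sums = [sum(measures(c)) for c in chain]
counts = [len(measures(c)) for c in chain]

print(sums)    # [1, 6, 1]: equal ends, different middle, so SUM has a hole
print(counts)  # [3, 2, 1]: COUNT can only shrink along the chain
```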

We say an equivalence class is connected if its local internal structure is a connected DAG. DAG is short for Directed Acyclic Graph.

Specifically, say that two cells c and c' are cover equivalent, c ≡_cov c', provided the set of tuples contained in those cells is the same.

2.3 Lemma 1 [Cover Partition]

Let P_cov be the partition associated with the cover equivalence relation ≡_cov. Then P_cov is necessarily convex.

For a cell c, a tuple t in the base table is in c's cover if t can be rolled up to c.

All cells having the same cover are in a class.

1. Cover Partitions are convex.
2. Cover partitions are connected.

If cells c1 and c2 have the same cover, there must be some common ancestor c3 of c1 and c2 such that c3 has the same cover.
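A brute-force sketch of the cover partition, assuming a small hypothetical base table of dimension values only (measures play no role in cover equivalence). It enumerates every cell with a non-empty cover and groups cells by cover; cells_of and cover_partition are illustrative names.

```python
from collections import defaultdict
from itertools import product

ALL = "all"

def cells_of(row):
    """Every cell that the base tuple `row` rolls up to."""
    return product(*[(ALL, v) for v in row])

def cover_partition(table):
    """Map each distinct cover (set of row indices) to its class of cells."""
    cover = defaultdict(set)
    for i, row in enumerate(table):
        for cell in cells_of(row):
            cover[cell].add(i)
    classes = defaultdict(set)
    for cell, rows in cover.items():
        classes[frozenset(rows)].add(cell)
    return dict(classes)

table = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]  # hypothetical data
classes = cover_partition(table)
print(len(classes))             # 6 classes for 8 non-empty cells
print(classes[frozenset({2})])  # the class holding (a2, all) and (a2, b1)
```

This is exponential in the number of dimensions and is meant only to make the definition concrete.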

2.4 Definition 2 [Connected Partitions]

Cells c1 and c2 are connected if a series of rollup/drilldown operations starting from c1 can reach c2.

Intuitively, each class of a partition should be connected.
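Connectivity of a class can be tested with a breadth-first search over single rollup/drilldown steps. This sketch assumes that every intermediate cell must itself belong to the class (the "local internal structure" reading used earlier in these notes); one_step and is_connected are hypothetical names.

```python
from collections import deque

ALL = "all"

def one_step(c, d):
    """True if d is reached from c by one rollup or one drilldown."""
    diff = [(x, y) for x, y in zip(c, d) if x != y]
    return len(diff) == 1 and ALL in diff[0]

def is_connected(cls):
    """Breadth-first search that only walks through cells of the class."""
    cls = list(cls)
    if not cls:
        return True
    seen = {cls[0]}
    queue = deque([cls[0]])
    while queue:
        c = queue.popleft()
        for d in cls:
            if d not in seen and one_step(c, d):
                seen.add(d)
                queue.append(d)
    return len(seen) == len(cls)

print(is_connected({("a", ALL), (ALL, "b")}))              # False
print(is_connected({("a", ALL), (ALL, "b"), ("a", "b")}))  # True
```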

2.5 Cover Partitions & Aggregates

All cells in a class of the cover partition carry the same aggregate value w.r.t. any aggregate function. The converse fails: cells in a class of MIN() may have different covers.

For COUNT() and SUM() over positive measures, cover equivalence coincides with aggregate equivalence.
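A small illustration with a hypothetical two-tuple base table: three cells share the same MIN value yet have three different covers, while two cells with the same cover necessarily agree on every aggregate.

```python
ALL = "all"

# Hypothetical base table: (dimension values, measure).
table = [(("a1", "b1"), 3), (("a1", "b2"), 3)]

def cover(cell):
    """Indices of the base tuples that roll up to `cell`."""
    return frozenset(
        i for i, (dims, _) in enumerate(table)
        if all(x == ALL or x == y for x, y in zip(cell, dims))
    )

def aggregate(f, cell):
    """Apply f to the measures of the tuples in the cover of `cell`."""
    return f([table[i][1] for i in sorted(cover(cell))])

cells = [("a1", ALL), ("a1", "b1"), ("a1", "b2")]
print([aggregate(min, c) for c in cells])  # [3, 3, 3]
print(len({cover(c) for c in cells}))      # 3 distinct covers

# Same cover, hence the same value for any aggregate function:
print(cover((ALL, "b1")) == cover(("a1", "b1")))  # True
```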

3. Partitions Preserving Semantics

3.1 Congruence

We say that ≡ is a congruence provided for every c, c', d, d' ∈ L, whenever we have c ≡ c', d ≡ d', and c ≼ d, we have c' ≼ d'.

3.2 Weak Congruence

Let (L, ≼) be any cube lattice and ≡ any equivalence relation on its cells. We say that ≡ is a weak congruence provided for every c, c', d, d' ∈ L, whenever we have c ≡ c', d ≡ d', c ≼ d, and d' ≼ c', we also have c ≡ d.

3.3 Weak Congruence = Convex

Convex ⇔ no "hole" in the class ⇔ weak congruence

They preserve the rollup/drilldown semantics.

Quotient cube lattice is the lattice of convex classes.

3.4 Monotone Aggregate Functions

1. Monotone functions: f is monotone if one of the following holds for all S ⊆ T
a) S ⊆ T -> f(S) ≥ f(T)
b) S ⊆ T -> f(S) ≤ f(T)
c) Examples: MIN(), MAX(), COUNT(), PSUM(), ...
2. If the aggregate function f is monotone, then ≡_f is the unique coarsest partition
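The two monotonicity cases can be probed by brute force over all nested pairs of non-empty sub-multisets of a small sample. This is only a sanity check on one sample, not a proof; direction is a hypothetical helper, and SUM over positive values is presumably what PSUM() refers to.

```python
from itertools import combinations
from statistics import mean

def nonempty_subsets(values):
    """All non-empty sub-multisets of `values`, as tuples."""
    for r in range(1, len(values) + 1):
        yield from combinations(values, r)

def direction(f, values):
    """Return 'a' if S ⊆ T -> f(S) ≥ f(T), 'b' if f(S) ≤ f(T), else None."""
    pairs = [(f(s), f(t))
             for t in nonempty_subsets(values)
             for s in nonempty_subsets(t)]
    if all(fs >= ft for fs, ft in pairs):
        return "a"
    if all(fs <= ft for fs, ft in pairs):
        return "b"
    return None

sample = [3, 1, 4, 2]
print(direction(min, sample))   # a
print(direction(max, sample))   # b
print(direction(len, sample))   # b
print(direction(sum, sample))   # b (all values positive)
print(direction(mean, sample))  # None: AVG is not monotone
```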

3.5 Non-monotone Functions

1. Bad news: ≡_f may or may not be convex (i.e., a weak congruence).
2. Good news: the cover partition is convex and always yields a quotient cube w.r.t. any aggregate function

4. Algorithms

4.1 Depth-First Search Algorithm

Input: base table B, monotone aggregate function f;

Output: Quotient cube

Method:

Step 1: let b = (all, ..., all); call DFS(b, B, 0);

Step 2: merge those temp classes sharing some common upper bounds; if C₁ and C₂ share the same upper bound c, then merge them;

//e.g. if MIN((a, b)) = MIN((a, all)) = MIN((all, b)); the temp classes of the two cells (a, all) and (all, b) should be merged, since they share the upper bound (a, b)

Step 3: output the classes, and their bounds, but only output true lower bounds, by removing lower bounds that have descendants in the merged class;

//e.g. when we process DFS on (all, b, all), it may in turn call a DFS on (all, b, c) and then form a temp class C1 = {(all, b, c), (d, b, c)}. Later, when the search branches to (all, all, c), it may form another temp class {(all, all, c),(d, b, c)}. The two classes share a common upper bound and so are merged. In the merged class, (all, b, c) is not a lower bound anymore and hence should be removed.

Function DFS(c, B_c, k)

//c is a cell and B_c is the corresponding partition of the base table;

Step 1: Compute the aggregate of cell c by one scan of B_c; in the same scan, collect dimension-value statistics info for CountSort;

//similarly to the BUC algorithm;

Step 2: compute the set of upper bounds UB(c) of the class of c, by "jumping" to the appropriate upper bounds depending on the aggregate function used;

//see text for details

Step 3: Record a temp class with lower bound c and upper bound(s) in UB(c);

Step 4: for each upper bound d in UB(c), do

[4.1] if there is some j<k s.t. c[j]=all and d[j]!=all, then such a bound has been examined before; do nothing;

//e.g. suppose when searching (all, all, c, all), we find that (a, all, c, d) is an upper bound. Then it must have been explored in the (a, all, all, all) branch.

[4.2] else for each k<j≤n, s.t. d[j] = all do

For each value x in dimension j of base table

Let d[j] = x ; form B_d;

If B_d is not empty, call DFS(d, B_d, j);

Step 5: Return
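The steps above can be sketched for the special case of cover equivalence, where the "jump" in Step 2 is simple: fix every dimension on which all tuples of B_c agree, which gives a single upper bound. This is an illustrative sketch under that assumption, not the paper's implementation: Step 1 (aggregate computation and CountSort) is omitted, the base table holds dimension values only, and all function names are hypothetical. Keying temp classes by upper bound performs the merge of Step 2 of the main method.

```python
from collections import defaultdict

ALL = "all"

def precedes(c, d):
    """c ≼ d: c is at least as general as d."""
    return all(x == ALL or x == y for x, y in zip(c, d))

def upper_bound(rows):
    """Jump for cover equivalence: fix each dimension where all rows agree."""
    return tuple(col[0] if len(set(col)) == 1 else ALL for col in zip(*rows))

def quotient_cube_cover(table):
    n = len(table[0])
    temp = defaultdict(set)  # upper bound -> lower bounds of temp classes

    def dfs(c, rows, k):
        d = upper_bound(rows)  # DFS Step 2
        temp[d].add(c)         # DFS Step 3
        # Step 4.1: some dimension before k jumped from all to a value,
        # so this bound was reached in an earlier branch.
        if any(c[j] == ALL and d[j] != ALL for j in range(k - 1)):
            return
        # Step 4.2: dimensions k+1..n (1-indexed) that are still all in d.
        for j in range(k, n):
            if d[j] != ALL:
                continue
            # Only values present in rows are tried, so B_d is never empty.
            for x in sorted({r[j] for r in rows}):
                child = d[:j] + (x,) + d[j + 1:]
                dfs(child, [r for r in rows if r[j] == x], j + 1)

    dfs((ALL,) * n, list(table), 0)
    # Main Step 3: keep only true lower bounds of each merged class.
    return {
        ub: {l for l in lows
             if not any(m != l and precedes(m, l) for m in lows)}
        for ub, lows in temp.items()
    }

table = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]  # hypothetical data
cube = quotient_cube_cover(table)
print(len(cube))            # 6 classes
print(cube[("a1", "b2")])   # {('all', 'b2')}
```

In this run the class with upper bound (a1, b2) is first recorded with lower bound (a1, b2) and later with (all, b2); only the latter survives as a true lower bound.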

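4.2 Depth-First Search Algorithm (Restatement)

Input: base table B; monotone aggregate function f

Output: quotient cube

Method:

1. Let b = (all, all, ..., all); call DFS(b, B, 0);
2. Merge the temp classes that have the same upper bound; if C₁ and C₂ have the same upper bound, merge them;

//e.g. if MIN((a, b)) = MIN((a, all)) = MIN((all, b)), the temp classes of the two cells (a, all) and (all, b) should be merged, because they have the same upper bound (a, b)

3. Output all classes and their bounds, but output only true lower bounds, by removing the lower bounds that have descendants in the merged class.

//e.g. when we call DFS on (all, b, all), it may in turn call DFS on (all, b, c) and form a temp class C1 = {(all, b, c), (d, b, c)}. Later, when the search reaches the branch (all, all, c), it may form another temp class {(all, all, c), (d, b, c)}. The two temp classes have the same upper bound, so they are merged. In the merged class, (all, b, c) is no longer a lower bound and must be removed.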

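Function DFS(c, B_c, k)

//c is a cell and B_c is the corresponding part of the base table

1. Compute the aggregate value of cell c by one scan of B_c; in the same scan, count the occurrences of each dimension value.
2. Compute the set of upper bounds UB(c) of the class of c by jumping to the appropriate upper bounds, which depend on the aggregate function used.
3. Record a temp class with lower bound c and the upper bounds in UB(c);
4. For each upper bound d in UB(c), do the following
a) If there is some j < k s.t. c[j] = all and d[j] != all, then this upper bound has already been examined; do nothing.

//e.g. suppose that when searching (all, all, c, all), we find that (a, all, c, d) is an upper bound. Then it must already have been explored in the (a, all, all, all) branch.

b) Otherwise, for each k < j ≤ n s.t. d[j] = all, do the following

For each value x in dimension j of the base table, let d[j] = x; form B_d;

If B_d is not empty, call DFS(d, B_d, j);

5. Return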

4.3 Lemma 3 [Correctness of Class-Merge]: Class-Merge Algorithm

Input: Base table B, non-monotone aggregate function f;

Output: Quotient cube w.r.t. a maximal convex partition;

Method:

1. Obtain the quotient cube Q of B w.r.t. the cover partition using Algorithm DFS.
2. Group classes at each level of Q by aggregate value, using hashing.
3. Process lattice Q by level, bottom up;
4. For each unprocessed class C at the current level {

For each parent class P of C at the next higher level with the same measure value as C {

If descendant(P) ∩ ancestor(C) ⊆ C, then add P, as well as all children of P in C's measure group, to C; in this case, mark all the latter children "processed", and replace C, P, and the above children by the new merged class;}}