使用saiku3.9的过程中遇到了不少坑
saiku自带了两个例子,在查询schema的时候,指标和维度无法展示。通过查看日志文件/home/saiku/software/saiku-server/tomcat/logs/catalina.out发现抛异常如下
java.lang.NoClassDefFoundError: mondrian/olap/LevelType
saiku不支持Impala数据源,如果使用Impala作为数据源,需要我们自己上传Impala相关的驱动,但是在将Impala相关的驱动jar上传到/home/saiku/software/saiku-server/tomcat/webapps/saiku/WEB-INF/lib之后,异常如下
Caused by: java.lang.StackOverflowError
at org.apache.catalina.startup.ContextConfig.populateSCIsForCacheEntry(ContextConfig.java:2269)
at org.apache.catalina.startup.ContextConfig.populateSCIsForCacheEntry(ContextConfig.java:2269)
at org.apache.catalina.startup.ContextConfig.populateSCIsForCacheEntry(ContextConfig.java:2269)
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
...
xmlParserAPIs.jar,\
xom-*.jar,\
bcprov*.jar
使用hive作为数据源时,查询时抛出异常如下
ERROR [SecurityAwareConnectionManager] Error connecting: hive_ds
mondrian.olap.MondrianException: Mondrian Error:Internal error: while quoting identifier
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveDatabaseMetaData.getIdentifierQuoteString
(HiveDatabaseMetaData.java:342)
ERROR [SecurityAwareConnectionManager] Error connecting: hive_ds
mondrian.olap.MondrianException: Mondrian Error:Internal error: while detecting isReadOnly
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveDatabaseMetaData.isReadOnly
(HiveDatabaseMetaData.java:762)
public boolean isReadOnly() throws SQLException {
return false;
}
public String getIdentifierQuoteString() throws SQLException {
return " ";
}
saiku使用mondrian作为olap计算引擎,使用之前需要学习如何编写schema,schema用mdx语言编写,saiku3.90使用mondiran4,官方文档:https://mondrian.pentaho.com/head/documentation/schema.php
主要是理解cube、维度、层次、指标、事实几个概念。下面使用排队业务建了一个模型,统计每个维度下面有多少条排队数据
<Schema name="queue" metamodelVersion="4.0">
<PhysicalSchema>
<Table name="queue_detail">
<Key>
<Column name='serialid'/>
</Key>
</Table>
<Table name="dim_date">
<Key>
<Column name='day_id'/>
</Key>
</Table>
<Table name="dim_shop_list">
<Key>
<Column name='shop_id'/>
</Key>
</Table>
<Table name="queue_state">
<Key>
<Column name="state_id"/>
</Key>
</Table>
</PhysicalSchema>
<Cube name="queue">
<Dimensions>
<Dimension name="state" table="queue_state" key="state_id">
<Attributes>
<Attribute name="state_id" keyColumn="state_id"/>
</Attributes>
</Dimension>
<Dimension name="shop" table="dim_shop_list" key="shop_id">
<Attributes>
<Attribute name="shop_id" keyColumn="shop_id"/>
</Attributes>
</Dimension>
<Dimension name="time" table="dim_date" key="day">
<Attributes>
<Attribute name="year" keyColumn="year"/>
<Attribute name="quarter">
<Key>
<Column name="year"/>
<Column name="quarter"/>
</Key>
<name>
<Column name="quarter"/>
</name>
</Attribute>
<Attribute name="month">
<Key>
<Column name="year"/>
<Column name="month"/>
</Key>
<name>
<Column name="month"/>
</name>
</Attribute>
<Attribute name="day" keyColumn="day_id"/>
</Attributes>
<Hierarchies>
<Hierarchy name="yearly" hasAll="false">
<Level attribute="year"/>
<Level attribute="quarter"/>
<Level attribute="month"/>
</Hierarchy>
</Hierarchies>
</Dimension>
</Dimensions>
<MeasureGroups>
<MeasureGroup name="queue" table="queue_detail">
<Measures>
<Measure name="queue" column="serialid" aggregator="count" formatString="#,###"/>
</Measures>
<DimensionLinks>
<ForeignKeyLink dimension="time" foreignKeyColumn="date_key"/>
<ForeignKeyLink dimension="shop" foreignKeyColumn="shopid"/>
<ForeignKeyLink dimension="state" foreignKeyColumn="state_id"/>
</DimensionLinks>
</MeasureGroup>
</MeasureGroups>
</Cube>
</Schema>
首先定义四个物理表,排队事实表queue_detail,时间维度表dim_date,门店维度表dim_shop_list,排队状态维度表queue_state
cube下面包含维度和指标,维度下面又包含层次
通过Dimension指定维度,通过Hierarchy指定层次,通过Measures指定指标,指标又通过DimensionLinks与维度关联
上传schema
创建数据源
查看指标
以上都好了之后,就可以通过多个维度对指标进行查询
实际查询中发现,当同时使用三个维度进行指标计算时,查询时间要达到十几秒,远远超出我们直接执行sql花费的时间。超出的时间到底花费在哪里呢?接下来通过源码一探究竟。
@POST
@Consumes({"application/json" })
@Path("/execute")
public QueryResult execute(ThinQuery tq) {
try {
if (thinQueryService.isMdxDrillthrough(tq)) {
Long start = (new Date()).getTime();
ResultSet rs = thinQueryService.drillthrough(tq);
QueryResult rsc = RestUtil.convert(rs);
rsc.setQuery(tq);
Long runtime = (new Date()).getTime()- start;
rsc.setRuntime(runtime.intValue());
return rsc;
}
QueryResult qr = RestUtil.convert(thinQueryService.execute(tq));
ThinQuery tqAfter = thinQueryService.getContext(tq.getName()).getOlapQuery();
qr.setQuery(tqAfter);
return qr;
}
catch (Exception e) {
log.error("Cannot execute query (" + tq + ")",e);
String error = ExceptionUtils.getRootCauseMessage(e);
return new QueryResult(error);
}
}
try {
String mdx = query.getParameterResolvedMdx();
log.info(runId + "\tType:" + query.getType() + ":\n" + mdx);
CellSet cs = stmt.executeOlapQuery(mdx);
queryContext.store(ObjectKey.RESULT, cs);
if (query != null) {
queryContext.store(ObjectKey.QUERY, query);
}
return cs;
} finally {
stmt.close();
queryContext.remove(ObjectKey.STATEMENT);
}
其中花费大部分时间的操作是stmt.executeOlapQuery(mdx),这个是执行mdx查询,这个查询的计算是mondrian做的,也就是说saiku的核心计算引擎是依赖的mondrian,saiku本质上是一个前端的展示工具。
void execute() throws OlapException {
this.result = this.olap4jStatement.olap4jConnection.getMondrianConnection().execute(this);
Axis[] axes = this.result.getAxes();
QueryAxis[] queryAxes = this.result.getQuery().getAxes();
assert axes.length == queryAxes.length;
Axis axis;
for(int i = 0; i < axes.length; ++i) {
axis = axes[i];
QueryAxis queryAxis = queryAxes[i];
this.axisList.add(new MondrianOlap4jCellSetAxis(this, queryAxis, (RolapAxis)axis));
}
QueryAxis queryAxis = this.result.getQuery().getSlicerAxis();
axis = this.result.getSlicerAxis();
if(queryAxis == null) {
queryAxis = new QueryAxis(false, (Exp)null, StandardAxisOrdinal.SLICER, SubtotalVisibility.Undefined);
}
this.filterAxis = new MondrianOlap4jCellSetAxis(this, queryAxis, (RolapAxis)axis);
}
其中
this.result = this.olap4jStatement.olap4jConnection.getMondrianConnection().execute(this);
这里进行sql查询并获取返回结果,这一部分的耗时与直接使用Impala的耗时大致相同
然后获取到结果之后进行的各种矩阵计算,递归计算等消耗了不少时间,代码看的有点费劲
以上参考的资料有