Parsing ORC and Parquet Data with Apache Druid

司寇书
2023-12-01

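Apache Druid can batch-ingest data from the local file system or from HDFS. The latest release at the time of writing (0.18) can also parse ORC and Parquet data directly, but this feature needs a small amount of configuration before it can be used.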

What the official documentation says

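Apache Druid bundles all of the core extensions (see the appendix to this article). To load the ones you need, add their names to druid.extensions.loadList in common.runtime.properties. For example, to load the postgresql-metadata-storage and druid-hdfs-storage extensions, use this configuration: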

druid.extensions.loadList=["postgresql-metadata-storage", "druid-hdfs-storage"]

So, to have Druid parse ORC and Parquet data, the configuration needs to look like this:

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches","druid-orc-extensions","druid-parquet-extensions"]
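Since a missing entry in the list only shows up once ingestion fails, it can help to check the file before restarting. The following is a minimal Python sketch, not part of Druid itself: it assumes the loadList value is written as a JSON array on a single line (as in the examples above), and the helper names `read_load_list` and `missing_extensions` are hypothetical.

```python
import json
import re

# Extensions this article needs for ORC and Parquet parsing.
REQUIRED = ["druid-orc-extensions", "druid-parquet-extensions"]


def read_load_list(properties_text):
    """Return the names in druid.extensions.loadList, or [] if the key is unset.

    Assumption: the value is a JSON array on one line. If the key appears
    more than once, the last occurrence wins, as with Java properties files.
    """
    load_list = []
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        match = re.match(r"druid\.extensions\.loadList\s*=\s*(.*)$", line)
        if match:
            load_list = json.loads(match.group(1))
    return load_list


def missing_extensions(properties_text, required=REQUIRED):
    """List the required extensions that are absent from the loadList."""
    loaded = read_load_list(properties_text)
    return [name for name in required if name not in loaded]


# Example: a properties file that has not yet been updated.
sample = 'druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service"]\n'
print(missing_extensions(sample))
```

In practice the text would come from reading common.runtime.properties from wherever your deployment keeps it; an empty result means both extensions are already listed.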

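Once the configuration is in place, restart the cluster and the feature is ready to use. Note that, according to the extension table in the appendix, druid-parquet-extensions requires druid-avro-extensions to be loaded, so you may need to add that to the list as well.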
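With the extensions loaded, a batch ingestion task can name ORC or Parquet as its input format. The sketch below builds such a task spec in Python and prints it as JSON. It reflects my understanding of the native batch (`index_parallel`) spec layout, so check the field names against the documentation for your Druid version. The datasource name, HDFS path, and column names are made up for illustration, and `build_ingestion_spec` is a hypothetical helper.

```python
import json


def build_ingestion_spec(datasource, hdfs_paths, file_format,
                         timestamp_column, dimensions):
    """Build a native batch ingestion spec that reads ORC or Parquet from HDFS."""
    if file_format not in ("orc", "parquet"):
        raise ValueError("file_format must be 'orc' or 'parquet'")
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "auto"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Reading from HDFS relies on druid-hdfs-storage being loaded.
                "inputSource": {"type": "hdfs", "paths": hdfs_paths},
                # "orc" and "parquet" rely on the matching extensions being loaded.
                "inputFormat": {"type": file_format},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


# Illustrative values only; none of these names come from a real cluster.
spec = build_ingestion_spec(
    datasource="events",
    hdfs_paths="hdfs://namenode:8020/data/events/*.orc",
    file_format="orc",
    timestamp_column="ts",
    dimensions=["user", "country"],
)
print(json.dumps(spec, indent=2))
```

The printed JSON could then be submitted as an ingestion task, for example by pasting it into the task submission dialog of the Druid web console.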


Appendix

The core extensions bundled with Apache Druid are:

Name | Description
--- | ---
druid-avro-extensions | Support for data in Apache Avro data format.
druid-azure-extensions | Microsoft Azure deep storage.
druid-basic-security | Support for Basic HTTP authentication and role-based access control.
druid-bloom-filter | Support for providing Bloom filters in Druid queries.
druid-datasketches | Support for approximate counts and set operations with Apache DataSketches.
druid-google-extensions | Google Cloud Storage deep storage.
druid-hdfs-storage | HDFS deep storage.
druid-histogram | Approximate histograms and quantiles aggregator. Deprecated; use the DataSketches quantiles aggregator from the druid-datasketches extension instead.
druid-kafka-extraction-namespace | Apache Kafka-based namespaced lookup. Requires the namespace lookup extension.
druid-kafka-indexing-service | Supervised exactly-once Apache Kafka ingestion for the indexing service.
druid-kinesis-indexing-service | Supervised exactly-once Kinesis ingestion for the indexing service.
druid-kerberos | Kerberos authentication for Druid processes.
druid-lookups-cached-global | A module for lookups providing JVM-global eager caching for lookups. It provides JDBC and URI implementations for fetching lookup data.
druid-lookups-cached-single | Per-lookup caching module to support use cases where a lookup needs to be isolated from the global pool of lookups.
druid-orc-extensions | Support for data in Apache ORC data format.
druid-parquet-extensions | Support for data in Apache Parquet data format. Requires druid-avro-extensions to be loaded.
druid-protobuf-extensions | Support for data in Protobuf data format.
druid-ranger-security | Support for access control through Apache Ranger.
druid-s3-extensions | Interfacing with data in AWS S3, and using S3 as deep storage.
druid-ec2-extensions | Interfacing with AWS EC2 for autoscaling middle managers (undocumented).
druid-stats | Statistics-related module including variance and standard deviation.
mysql-metadata-storage | MySQL metadata store.
postgresql-metadata-storage | PostgreSQL metadata store.
simple-client-sslcontext | Simple SSLContext provider module to be used by Druid's internal HttpClient when talking to other Druid processes over HTTPS.
druid-pac4j | OpenID Connect authentication for Druid processes.

In addition, there are extensions contributed by the third-party community:

Name | Description
--- | ---
aliyun-oss-extensions | Aliyun OSS deep storage.
ambari-metrics-emitter | Ambari Metrics emitter.
druid-cassandra-storage | Apache Cassandra deep storage.
druid-cloudfiles-extensions | Rackspace Cloudfiles deep storage and firehose.
druid-distinctcount | DistinctCount aggregator.
druid-redis-cache | A cache implementation for Druid based on Redis.
druid-time-min-max | Min/Max aggregator for timestamp.
sqlserver-metadata-storage | Microsoft SQLServer metadata store.
graphite-emitter | Graphite metrics emitter.
statsd-emitter | StatsD metrics emitter.
kafka-emitter | Kafka metrics emitter.
druid-thrift-extensions | Support for Thrift ingestion.
druid-opentsdb-emitter | OpenTSDB metrics emitter.
materialized-view-selection, materialized-view-maintenance | Materialized view.
druid-moving-average-query | Support for Moving Average and other Aggregate Window Functions in Druid queries.
druid-influxdb-emitter | InfluxDB metrics emitter.
druid-momentsketch | Support for approximate quantile queries using the momentsketch library.
druid-tdigestsketch | Support for approximate sketch aggregators based on T-Digest.
gce-extensions | GCE extensions.