Question:

How to filter files in a Databricks Auto Loader stream

曾阳飙

2023-03-14

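I want to set up an S3 stream using Databricks Auto Loader. I have managed to set up the stream, but my S3 bucket contains different types of JSON files. I want to filter them out, preferably in the stream itself rather than with a filter operation.

According to the documentation, I should be able to filter using a glob pattern. However, I can't get this to work: it loads everything anyway.

This is what I have: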

from pyspark.sql import functions as F

df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaInference.sampleSize.numFiles", 1000)
  .option("cloudFiles.schemaLocation", "dbfs:/auto-loader/schemas/")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  # .option("cloudFiles.schemaHints", schemaHints)
  # .load("s3://<BUCKET>/qualifier/**/*_INPUT")
  .load("s3://<BUCKET>/qualifier")
  .withColumn("filePath", F.input_file_name())
  .withColumn("date_ingested", F.current_timestamp())
)

My files have keys structured as qualifier/version/YYYY-MM/DD/

This seems to load everything: .load("s3://

Is my glob pattern incorrect, or am I missing something?

1 answer

拓拔泉

2023-03-14

From the documentation, it looks like you can combine load with the pathGlobFilter option to get what you need. Have you tried something like this?

from pyspark.sql import functions as F

df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaInference.sampleSize.numFiles", 1000)
  .option("cloudFiles.schemaLocation", "dbfs:/auto-loader/schemas/")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  # Only pick up files whose name ends in _INPUT.json
  .option("pathGlobFilter", "*_INPUT.json")
  .load("s3://<BUCKET>/qualifier/")
  .withColumn("filePath", F.input_file_name())
  .withColumn("date_ingested", F.current_timestamp())
)
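
If it helps to sanity-check the pattern before starting the stream, here is a rough, Spark-free sketch. It assumes that pathGlobFilter is matched against the file name only (the last path component), which is how I understand Hadoop's GlobFilter to behave, and it uses Python's fnmatch as a stand-in for Hadoop glob syntax, which is similar but not identical. The object keys and the filter_by_name helper are made up for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical object keys following the qualifier/version/YYYY-MM/DD/ layout.
keys = [
    "qualifier/v1/2023-03/14/orders_INPUT.json",
    "qualifier/v1/2023-03/14/orders_OUTPUT.json",
    "qualifier/v2/2023-03/15/users_INPUT.json",
    "qualifier/v2/2023-03/15/users_META.json",
]


def filter_by_name(keys, pattern):
    """Keep keys whose final path component matches the glob pattern."""
    return [k for k in keys if fnmatchcase(PurePosixPath(k).name, pattern)]


# Pattern from the answer: includes the .json extension.
with_extension = filter_by_name(keys, "*_INPUT.json")

# Pattern without the extension, like the commented-out load path in the question.
without_extension = filter_by_name(keys, "*_INPUT")

print(with_extension)
print(without_extension)
```

Under these assumptions, only the two _INPUT.json keys survive the first pattern, and the second pattern matches nothing because every file name ends in .json. So if you filter on the name, keep the extension in the pattern.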
