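Question:

If I create a DataFrame from a partitioned Hive table, how many partitions will be created?

彭正谊

2023-03-14

I have a partitioned Hive table. If I create a Spark DataFrame from this table, how many DataFrame partitions will be created?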

1 answer

苗盛

2023-03-14

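It does not depend on the Hive table's partitioning; it depends on the Spark version you are using:

For Spark < 2.0

***Using an RDD and then creating a DataFrame***
If you create an RDD, you can pass the number of partitions explicitly:
val rdd = sc.textFile("filepath", 4)
In the example above it is 4. (For textFile this argument is a minimum number of partitions, so Spark may create more.)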


***Creating the DataFrame directly***
The number of partitions depends on the Hadoop configuration (min / max split size).

You can use the Hadoop configuration options:
mapred.min.split.size
mapred.max.split.size

as well as the HDFS block size to control partition size for filesystem-based formats.
// Placeholders: replace ??? with the desired split sizes in bytes
val minSplit: Int = ???
val maxSplit: Int = ???
sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
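To make the interaction between these settings concrete, here is a minimal sketch in plain Scala (no Spark needed). It only reproduces the arithmetic that Hadoop's new-API FileInputFormat uses to pick a split size, max(minSize, min(maxSize, blockSize)). The file size and block size are made-up values, the helper names are hypothetical, and real split counts can differ slightly because Hadoop lets the last split run a little over. The old mapred API that some Spark read paths use computes a goal size differently, so treat this as an illustration rather than a prediction.

```scala
object SplitSizeSketch {
  // Same arithmetic as Hadoop's FileInputFormat.computeSplitSize (new API).
  def splitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))

  // Approximate split count for one splittable file (ceiling division).
  // Ignores the small overrun Hadoop allows on the final split.
  def approxSplits(fileSize: Long, size: Long): Long =
    (fileSize + size - 1) / size

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val blockSize = 128 * mb  // hypothetical HDFS block size
    val fileSize = 1024 * mb  // hypothetical 1 GiB input file

    // No limits set: the block size decides.
    println(approxSplits(fileSize, splitSize(1L, Long.MaxValue, blockSize))) // 8

    // Max split size below the block size: more, smaller partitions.
    println(approxSplits(fileSize, splitSize(1L, 64 * mb, blockSize))) // 16

    // Min split size above the block size: fewer, larger partitions.
    println(approxSplits(fileSize, splitSize(256 * mb, Long.MaxValue, blockSize))) // 4
  }
}
```

In other words, lowering the max split size increases the partition count, and raising the min split size decreases it.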

For Spark >= 2.0

***Using an RDD and then creating a DataFrame***
Same as described above for Spark < 2.0.


***Creating the DataFrame directly***

You can use the spark.sql.files.maxPartitionBytes configuration (a size in bytes):
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
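As a rough illustration of what this setting does, the plain-Scala sketch below mirrors the split-size arithmetic of the Spark 2.x file-source scan as I understand it: the effective split size is min(maxPartitionBytes, max(openCostInBytes, bytesPerCore)), where bytesPerCore is the total input size (plus a per-file open cost) divided by the default parallelism. The 128 MB and 4 MB defaults correspond to spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes; the input sizes, the parallelism values, and the helper name are assumptions for the example, and other Spark versions may compute this differently.

```scala
object MaxSplitBytesSketch {
  val Mb: Long = 1024L * 1024L

  // Sketch of the effective split size for a file-based scan (Spark 2.x style).
  def maxSplitBytes(
      fileSizes: Seq[Long],
      parallelism: Int,
      maxPartitionBytes: Long = 128 * Mb, // spark.sql.files.maxPartitionBytes default
      openCostInBytes: Long = 4 * Mb      // spark.sql.files.openCostInBytes default
  ): Long = {
    // Each file is padded with the estimated cost of opening it.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical large input: ten 1 GiB files, default parallelism 8.
    val bigInput = Seq.fill(10)(1024 * Mb)
    println(maxSplitBytes(bigInput, 8) / Mb)                               // 128
    println(maxSplitBytes(bigInput, 8, maxPartitionBytes = 32 * Mb) / Mb)  // 32

    // Hypothetical small input: one 8 MiB file, default parallelism 2.
    println(maxSplitBytes(Seq(8 * Mb), 2) / Mb)                            // 6
  }
}
```

For large inputs the configured limit dominates, so the partition count is roughly the total size divided by maxPartitionBytes. For small inputs the split size can end up below the limit, which is why small files may still be read as more than one partition.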

Also keep in mind:

Datasets created from an RDD inherit the number of partitions from their parent.
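The inheritance can be checked directly. The following is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master; the object name, app name, and column name are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object InheritPartitionsSketch {
  // Returns (partitions of the RDD, partitions of the DataFrame built from it).
  def partitionCounts(): (Int, Int) = {
    val spark = SparkSession.builder()
      .appName("inherit-partitions-sketch") // hypothetical app name
      .master("local[2]")                   // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    try {
      // Parent RDD with an explicit number of partitions.
      val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
      // The DataFrame created from the RDD keeps the parent's partitioning.
      val df = rdd.toDF("value")
      (rdd.getNumPartitions, df.rdd.getNumPartitions)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (rddParts, dfParts) = partitionCounts()
    println(s"RDD partitions: $rddParts, DataFrame partitions: $dfParts")
  }
}
```

The same check works for a DataFrame read from a Hive table: with Hive support enabled on the session, something like spark.table("some_db.some_table").rdd.getNumPartitions (the table name is a placeholder) reports how many partitions Spark actually created.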
