Question:

Reading a CSV in PySpark with the correct data types

轩辕季同

2023-03-14

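When I try to import a local CSV with Spark, every column is read as a string by default. However, my columns contain only integer and timestamp types. More specifically, the CSV looks like this: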

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I found code in this question that should work, but when I execute it, all entries come back as NULL.

I use the following to create a custom schema:

from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType, TimestampType

customSchema = StructType([
        StructField("Customer", IntegerType(), True),
        StructField("TransDate", TimestampType(), True),
        StructField("Quantity", IntegerType(), True),
        StructField("Cost", IntegerType(), True),
        StructField("TransKey", IntegerType(), True)])

Then I read the CSV with the following command:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|Transkey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a key step? I suspect the date column is the root of the problem. Note: I am running this in Google Colab.

2 answers

楚弘益

2023-03-14

You can pass an option ('dateFormat', 'd.M.y') to the DataFrameReader to parse dates in a specific format.

df = spark.read.format("csv").option("header","true").option("dateFormat","d.M.y").schema(my_schema).load("path_to_csv")
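
To make the one-liner above more concrete, here is a minimal sketch of what my_schema could look like for the CSV in the question. It rests on a few assumptions that are not in the original answer: all seven columns are declared in file order, TransDate is declared as a DATE (the dateFormat option applies to date columns, while timestamp columns are controlled by timestampFormat), and the helper name read_transactions is hypothetical. The schema is passed as a DDL-formatted string, which DataFrameReader.schema() accepts in place of a StructType.

```python
# Column names and Spark SQL types for all seven CSV columns, in file order.
# Declaring TransDate as DATE is an assumption; the question used a timestamp.
COLUMNS = [
    ("Customer", "INT"),
    ("TransDate", "DATE"),
    ("Quantity", "INT"),
    ("PurchAmount", "DOUBLE"),
    ("Cost", "INT"),
    ("TransID", "INT"),
    ("TransKey", "INT"),
]

# DataFrameReader.schema() accepts a DDL-formatted string as well as a StructType.
SCHEMA_DDL = ", ".join(f"{name} {dtype}" for name, dtype in COLUMNS)


def read_transactions(spark, path):
    """Hypothetical helper: read the CSV with an explicit schema and a day-first date pattern."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("dateFormat", "dd.MM.yyyy")  # matches values such as 15.11.2005
        .schema(SCHEMA_DDL)
        .load(path)
    )
```

With an existing SparkSession named spark, this would be called as read_transactions(spark, "myData.csv"). Whether the missing columns in the question's schema were what produced the NULL rows is not confirmed anywhere in this thread, so treat the full schema as a precaution rather than a diagnosis.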

References

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

东方骏

2023-03-14

Here you go!

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading above file to dataframe:
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

Your date will be loaded as a string column, but when you cast it to a date type, values in this date format come back as null.

df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to convert the date format from dd.mm.yyyy to yyyy-mm-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))
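
The conversion inside the UDF can be checked in plain Python before it is wrapped for Spark. The sketch below uses a named function instead of the lambda. The name change_dateformat and the None guard are additions for illustration and are not part of the original answer; the guard is there because strptime raises a TypeError when it is given None, which is how a missing CSV value would reach the function.

```python
from datetime import datetime


def change_dateformat(value):
    """Convert a dd.mm.yyyy string to yyyy-mm-dd; pass missing values through."""
    if value is None:  # assumption: missing dates should stay missing
        return None
    return datetime.strptime(value, "%d.%m.%Y").strftime("%Y-%m-%d")


print(change_dateformat("15.11.2005"))  # prints 2005-11-15
```

Passing change_dateformat to udf() in place of the lambda should behave the same for well-formed dates.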

Now call this function on your DataFrame column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

Here is the schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)
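
As an alternative to the UDF approach above, Spark's built-in to_date function accepts a pattern and should give the same TransDate column without a Python round trip per row. This is a sketch rather than part of the original answer: the helper name parse_trans_date and its default arguments are hypothetical, and it assumes a Spark version where to_date takes a format argument (2.2 or later).

```python
def parse_trans_date(df, column="TransDate", pattern="dd.MM.yyyy"):
    """Hypothetical helper: parse a day-first date string column into a DateType column."""
    # Imported inside the function so that defining the helper does not
    # require pyspark to be importable until it is actually called.
    from pyspark.sql.functions import col, to_date

    # to_date yields null for strings that do not match the pattern
    # (under the default, non-ANSI settings).
    return df.withColumn(column, to_date(col(column), pattern))
```

With the df loaded earlier, newdf = parse_trans_date(df) would replace the withColumn call that uses change_dateformat_func.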

Let me know if this works for you.
