pandas read_csv dtype leading zeros

东方栋

2023-03-14

Question:

I am reading a station code CSV file from NOAA, which looks like this:

"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"

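The first two columns contain weather station codes, which sometimes have leading zeros. When pandas imports them without a dtype specified, they become integers. That is not a big deal, since I could loop over the DataFrame index and replace them with something like "%06d" % i, as they are always six digits, but you know... that's the lazy way.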
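For reference, the padding workaround described above can be sketched without an explicit loop. This is only an illustration: the two-column inline sample stands in for the real file, and `str.zfill` is used here instead of the `"%06d" % i` formatting mentioned in the question.

```python
import io
import pandas as pd

# Two rows modelled on the NOAA sample above, trimmed to two columns.
csv_text = '"USAF","WBAN"\n"006852","99999"\n"007005","99999"\n'

# Without a dtype, pandas infers integers and the leading zeros are lost.
df = pd.read_csv(io.StringIO(csv_text))

# Restore the fixed six-character width after the fact.
df["USAF"] = df["USAF"].astype(str).str.zfill(6)
print(df["USAF"].tolist())
```

This only works because the codes are known to be exactly six characters wide; it cannot recover zeros for columns of variable width.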

The CSV is fetched with the following code:

file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()

That all works fine, but when I try to read the file with this:

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})

or

import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})

I get a nasty error message:

  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
    ret = self._engine.read(nrows)
  File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
  File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
  File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
  File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
  File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
  File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
  File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood

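This is a fairly large CSV (31,000 rows), so maybe that has something to do with it?

Answer:

This is a problem with pandas dtype guessing.

pandas sees digits and guesses that you want the column to be numeric.

To keep pandas from second-guessing your intent, set the dtype you want explicitly: object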

pd.read_csv('filename.csv', dtype={'leading_zero_column_name': object})

will do the trick.
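A small self-contained check of that behaviour, using a trimmed inline copy of the sample rows from the question rather than the real file (the three-column subset is an assumption made to keep the example short):

```python
import io
import pandas as pd

csv_text = (
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# Default behaviour: the codes are inferred as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit object dtype the codes stay as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": object, "WBAN": object})

print(inferred["USAF"].tolist())  # integers, leading zeros dropped
print(as_text["USAF"].tolist())   # strings, leading zeros kept
```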

Update, since it may help others:

To read every column as str, you can do this (per the comments):

pd.read_csv('sample.csv', dtype = str)
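One thing to keep in mind with `dtype=str`: empty fields, such as the blank STATE values in the NOAA sample, are still read as NaN by default. The sketch below shows this on an inline sample (the column subset is chosen for illustration) and uses `keep_default_na=False` as one way to keep them as empty strings.

```python
import io
import pandas as pd

csv_text = '"USAF","CTRY","STATE"\n"006852","SW",""\n"007005","",""\n'

# Every column is read as text, but empty fields become NaN.
all_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Turning off the default NA markers keeps empty fields as ''.
no_na = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(all_str["STATE"].isna().all())
print(no_na["STATE"].tolist())
```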

To read most columns, or only selected ones, as str, you can do the following:

# list of column names that need to be read as strings
lst_str_cols = ['prefix', 'serial']
# use dictionary comprehension to make dict of dtypes
dict_dtypes = {x : 'str'  for x in lst_str_cols}
# use dict on dtypes
pd.read_csv('sample.csv', dtype=dict_dtypes)
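When "most" columns should be text, listing them by hand gets tedious. One possible approach, sketched here under the assumption that reading the header twice is acceptable, is to read only the header row with `nrows=0` and build the dict from every column except a short exclusion list. The `numeric_cols` set and the inline sample are hypothetical.

```python
import io
import pandas as pd

csv_text = '"USAF","WBAN","LAT","LON"\n"006852","99999","+46817","+010350"\n'

# Read only the header row to learn the column names.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)

# Hypothetical choice: every column except these is read as text.
numeric_cols = {"LAT", "LON"}
dict_dtypes = {col: str for col in header.columns if col not in numeric_cols}

# Columns missing from the dict are left to pandas' usual inference.
df = pd.read_csv(io.StringIO(csv_text), dtype=dict_dtypes)
print(df.dtypes)
```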
