Question:

Read a CSV from Google Cloud Storage into a pandas DataFrame

龙才俊

2023-03-14

I am trying to read a CSV file that currently sits in a Google Cloud Storage bucket into a pandas DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows the following error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I cannot find any solution that does not involve Google Datalab.

李胡媚

2023-03-14

As of pandas 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket, like this:

df = pd.read_csv('gs://bucket/your_path.csv')

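read_csv then uses the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).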
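A minimal sketch of the direct approach with an early dependency check. The helper name read_gcs_csv and its token parameter are my own and hypothetical. It assumes pandas 1.2 or later, where read_csv accepts storage_options and forwards it to gcsfs:

```python
import importlib.util
from typing import Optional

import pandas as pd


def read_gcs_csv(uri: str, token: Optional[str] = None) -> pd.DataFrame:
    """Read a CSV from a gs:// URI (hypothetical helper, not part of pandas)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    # pandas delegates gs:// URIs to gcsfs, so fail early with a clear message.
    if importlib.util.find_spec("gcsfs") is None:
        raise ImportError("gcsfs is required for gs:// paths: pip install gcsfs")
    # Assumption: pandas >= 1.2, where storage_options is passed on to gcsfs.
    storage_options = {"token": token} if token else None
    return pd.read_csv(uri, storage_options=storage_options)
```

Usage would look like df = read_gcs_csv('gs://bucket/your_path.csv'). Pass the path to a service account JSON file as token if the default credentials are not enough.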

For completeness, I am leaving three other options: homemade code, gcsfs, and Dask. I cover them below.

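I wrote some convenience functions for reading from Google Storage. To make them more readable, I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.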

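It works equally well on public and private datasets, assuming you are authorized. With this approach you do not need to download the data to a local drive first.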

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
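The byte-string variant has to be decoded before pandas can parse it. Here is a small sketch of that step. The helper name bytestring_to_dataframe is hypothetical, the sample bytes stand in for whatever get_bytestring would return, and UTF-8 is an assumption about the file's encoding:

```python
from io import StringIO

import pandas as pd


def bytestring_to_dataframe(raw: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode a CSV byte-string and parse it with pandas (hypothetical helper)."""
    # Decode first, then wrap in StringIO so read_csv sees a text file object.
    return pd.read_csv(StringIO(raw.decode(encoding)))


# Stand-in for: raw = get_bytestring('my-project', 'my-bucket', 'my-path')
sample = b"a,b\n1,2\n3,4\n"
df = bytestring_to_dataframe(sample)
```

If you do not need the intermediate bytes, get_byte_fileobj is simpler, since read_csv accepts the BytesIO object directly.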

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
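If you have several files under one prefix, the same file-system object can list them too. Below is a rough sketch, assuming fs is a gcsfs.GCSFileSystem (or any object with the same glob and open methods). The function name read_csvs is my own:

```python
import pandas as pd


def read_csvs(fs, pattern: str) -> pd.DataFrame:
    """Read every CSV matching `pattern` into one DataFrame (hypothetical helper)."""
    # glob returns the matching object paths; sort them for a stable row order.
    paths = sorted(fs.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no objects match {pattern!r}")
    frames = []
    for path in paths:
        with fs.open(path, "rb") as f:
            frames.append(pd.read_csv(f))
    # Stack the per-file frames and renumber the index from zero.
    return pd.concat(frames, ignore_index=True)
```

With the fs from above, this would be called as read_csvs(fs, 'bucket/path/*.csv'). Everything is loaded into memory, so for really large inputs the Dask option is the better fit.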

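Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, which makes it easy for newcomers to use.

Dask has its own read_csv.

How to use it: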

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
