Reading a CSV from Google Cloud Storage into a pandas DataFrame

Date: 2023-11-07


Problem description

I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas DataFrame.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    from io import BytesIO

    from google.cloud import storage

    # Note: the client, bucket and blob created here are never used below;
    # pd.read_csv is given the gs:// path directly.
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('createbucket123')
    blob = bucket.blob('my.csv')
    path = "gs://createbucket123/my.csv"
    df = pd.read_csv(path)
                  

It shows this error message:

    FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
                  

What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.

Recommended answer

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide the path to the bucket like this:

    df = pd.read_csv('gs://bucket/your_path.csv')
                  

read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
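If the bucket is private, newer pandas versions (1.2 and later) can forward credentials to gcsfs through the storage_options argument. A minimal sketch, where the bucket path and the service-account key path are placeholders:

    import pandas as pd

    # Requires: pip install gcsfs
    # storage_options is forwarded to gcsfs; 'token' may be the path to a
    # service-account JSON key file (both paths below are placeholders).
    df = pd.read_csv(
        'gs://bucket/your_path.csv',
        storage_options={'token': '/path/to/service-account.json'},
    )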

I leave three other options for the sake of completeness.

• Hand-written code
• gcsfs
• dask

I will cover them below.

I have written some convenience functions to read from Google Storage. To make it more readable, I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally on public and private data sets, assuming you are authorised. In this approach you don't need to download the data to your local drive first.
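One common way to handle the authorisation is to point GOOGLE_APPLICATION_CREDENTIALS at a service-account key file before the client is created; the google-cloud-storage client picks it up automatically. A minimal sketch (the key path is a placeholder):

    import os

    # storage.Client() reads this variable when it is created without
    # explicit credentials; the path below is a placeholder.
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'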

Usage:

    fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(fileobj)
                  

Code:

    from io import BytesIO
    from typing import Optional

    from google.cloud import storage
    from google.oauth2 import service_account


    def get_byte_fileobj(project: str,
                         bucket: str,
                         path: str,
                         service_account_credentials_path: Optional[str] = None) -> BytesIO:
        """
        Retrieve data from a given blob on Google Storage and pass it as a file object.
        :param project: name of the project
        :param bucket: name of the bucket
        :param path: path within the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: file object (BytesIO)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        byte_stream = BytesIO()
        blob.download_to_file(byte_stream)
        byte_stream.seek(0)
        return byte_stream


    def get_bytestring(project: str,
                       bucket: str,
                       path: str,
                       service_account_credentials_path: Optional[str] = None) -> bytes:
        """
        Retrieve data from a given blob on Google Storage and pass it as a byte-string.
        :param project: name of the project
        :param bucket: name of the bucket
        :param path: path within the bucket
        :param service_account_credentials_path: path to credentials.
               TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
        :return: byte-string (needs to be decoded)
        """
        blob = _get_blob(bucket, path, project, service_account_credentials_path)
        return blob.download_as_string()


    def _get_blob(bucket_name, path, project, service_account_credentials_path):
        # Use explicit service-account credentials if a key path was given,
        # otherwise fall back to the default environment credentials.
        credentials = service_account.Credentials.from_service_account_file(
            service_account_credentials_path) if service_account_credentials_path else None
        storage_client = storage.Client(project=project, credentials=credentials)
        bucket = storage_client.get_bucket(bucket_name)
        return bucket.blob(path)
                  
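For the byte-string variant, decode the bytes and wrap them in a StringIO before handing them to pandas. A minimal sketch, reusing the hypothetical names from the usage example above:

    from io import StringIO

    import pandas as pd

    # Download the blob as bytes, decode it, and let pandas parse the text.
    csv_bytes = get_bytestring('my-project', 'my-bucket', 'my-path')
    df = pd.read_csv(StringIO(csv_bytes.decode('utf-8')))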

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

Usage:

    import gcsfs
    import pandas as pd

    fs = gcsfs.GCSFileSystem(project='my-project')
    with fs.open('bucket/path.csv') as f:
        df = pd.read_csv(f)
                  
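If you are not relying on default application credentials, GCSFileSystem also accepts a token argument. A minimal sketch, with a placeholder key-file path:

    import gcsfs
    import pandas as pd

    # 'token' may be the path to a service-account JSON key file;
    # gcsfs also accepts special values such as 'anon' for public buckets.
    fs = gcsfs.GCSFileSystem(project='my-project',
                             token='/path/to/service-account.json')
    with fs.open('bucket/path.csv') as f:
        df = pd.read_csv(f)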

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy for newcomers to use.

Dask provides its own read_csv.

Usage:

    import dask.dataframe as dd

    df = dd.read_csv('gs://bucket/data.csv')
    df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

    # df is now a Dask dataframe, ready for distributed processing.
    # If you want the pandas version, simply:
    df_pd = df.compute()
                  
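Because Dask is lazy, you can chain selections and aggregations before calling compute(), so only the final result is materialised. A small sketch, with a hypothetical column name:

    import dask.dataframe as dd

    df = dd.read_csv('gs://bucket/path/*.csv')

    # Nothing is read yet; this only builds a task graph.
    # 'value' is a hypothetical column name, used for illustration.
    result = df[df['value'] > 0]['value'].mean()

    # compute() triggers the actual (possibly parallel) read and reduction.
    print(result.compute())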


