白越
2023-12-01

by Bharath Raj

How to Upload Large Files to Google Colab and Remote Jupyter Notebooks

If you haven’t heard about it, Google Colab is a platform that is widely used for testing out ML prototypes on its free K80 GPU. If you have heard about it, chances are that you gave it a shot. But you might have become exasperated by the complexity involved in transferring large datasets.

This blog compiles some of the methods that I’ve found useful for transferring large files between your local system and Google Colab. I’ve also included additional methods that can be useful for transferring smaller files with less effort. Some of the methods can be extended to other remote Jupyter notebook services, like Paperspace Gradient.

Transferring Large Files

The most efficient method to transfer large files is to use a cloud storage system such as Dropbox or Google Drive.

1. Dropbox

Dropbox offers up to 2GB of free storage space per account. This sets an upper limit on the amount of data that you can transfer at any one time. Transferring via Dropbox is relatively easy. You can also follow the same steps for other notebook services, such as Paperspace Gradient.

Step 1: Archive and Upload

Uploading a large number of images (or files) individually will take a very long time, since Dropbox (or Google Drive) has to individually assign IDs and attributes to every image. Therefore, I recommend that you archive your dataset first.

One possible method of archiving is to convert the folder containing your dataset into a ‘.tar’ file. The code snippet below shows how to convert a folder named “Dataset” in the home directory to a “dataset.tar” file, from your Linux terminal.

tar -cvf dataset.tar ~/Dataset
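If you would rather build the archive from Python, for example on a machine without the tar command, the standard library's tarfile module can do the same job. The sketch below is only an illustration: the function name archive_folder and the example paths are assumptions, not part of the original workflow.

```python
import os
import tarfile


def archive_folder(folder, archive_path):
    """Pack `folder` into an uncompressed .tar file and return the entry count."""
    folder = os.path.expanduser(folder)
    # arcname keeps only the folder's own name inside the archive, so that
    # extraction recreates "Dataset/..." instead of the full home path.
    arcname = os.path.basename(os.path.normpath(folder))
    with tarfile.open(archive_path, "w") as tar:
        tar.add(folder, arcname=arcname)
    # Re-open the archive to count what was actually written.
    with tarfile.open(archive_path, "r") as tar:
        return len(tar.getnames())
```

Calling archive_folder("~/Dataset", "dataset.tar") would mirror the command above, with the difference that the entries inside the archive start at Dataset/ rather than at the home directory path.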

Alternatively, you could use WinRAR or 7-Zip, whichever is more convenient for you. Upload the archived dataset to Dropbox.

Step 2: Clone the Repository

Open Google Colab and start a new notebook.

Clone this GitHub repository. I’ve modified the original code so that it can add the Dropbox access token from the notebook. Execute the following commands one by one.

!git clone https://github.com/thatbrguy/Dropbox-Uploader.git
%cd Dropbox-Uploader
!chmod +x dropbox_uploader.sh

Step 3: Create an Access Token

Execute the following command to see the initial setup instructions.

!bash dropbox_uploader.sh

It will display instructions on how to obtain the access token, and will ask you to execute the following command. Replace INPUT_YOUR_ACCESS_TOKEN_HERE with your access token, then execute:

!echo "INPUT_YOUR_ACCESS_TOKEN_HERE" > token.txt
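One caveat with the echo command is that the token stays visible in the notebook cell. A possible alternative, sketched below under the assumption that the script only needs the token as the content of token.txt (as the echo command suggests), is to write the file from Python. The helper name save_token is hypothetical.

```python
import os


def save_token(token, path="token.txt"):
    """Write the access token to `path`, followed by a newline as echo would."""
    with open(path, "w") as f:
        f.write(token.strip() + "\n")
    # Restrict the file to the current user (has limited effect on Windows).
    os.chmod(path, 0o600)
    return path


# In a notebook, the token can be typed without being displayed:
#   from getpass import getpass
#   save_token(getpass("Dropbox access token: "))
```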

Execute !bash dropbox_uploader.sh again to link your Dropbox account to Google Colab. Now you can download and upload files from the notebook.

Step 4: Transfer Contents

Download to Colab from Dropbox:

Execute the following command. The argument is the name of the file on Dropbox.

!bash dropbox_uploader.sh download YOUR_FILE.tar
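Once the archive is on the remote machine, it still has to be unpacked before you can use the dataset. Here is a minimal sketch using Python's tarfile module. The function name extract_archive and the destination folder are illustrative, and the path check is an added precaution against entries that would land outside the destination, not something the original workflow requires.

```python
import os
import tarfile


def extract_archive(archive_path, dest="."):
    """Extract a tar archive into `dest` and return the extracted entry names."""
    dest_root = os.path.realpath(dest)
    # Mode "r" also opens compressed archives (.tar.gz, .tar.bz2) transparently.
    with tarfile.open(archive_path, "r") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would be written outside the destination.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

For the download above, extract_archive("YOUR_FILE.tar") would unpack the dataset into the current working directory.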

Upload to Dropbox from Colab:

Execute the following command. The first argument (result_on_colab.txt) is the name of the file you want to upload. The second argument (dropbox.txt) is the name you want to save the file as on Dropbox.

!bash dropbox_uploader.sh upload result_on_colab.txt dropbox.txt

2. Google Drive

Google Drive offers up to 15GB of free storage for every Google account. This sets an upper limit on the amount of data that you can transfer at any one time. You can expand this limit by upgrading to a paid storage plan. Colab simplifies the authentication process for Google Drive.

That being said, I’ve also included the necessary modifications you can perform, so that you can access Google Drive from other Python notebook services as well.

Step 1: Archive and Upload

Just as with Dropbox, uploading a large number of images (or files) individually will take a very long time, since Google Drive has to individually assign IDs and attributes to every image. So I recommend that you archive your dataset first.

One possible method of archiving is to convert the folder containing your dataset into a ‘.tar’ file. The code snippet below shows how to convert a folder named “Dataset” in the home directory to a “dataset.tar” file, from your Linux terminal.

tar -cvf dataset.tar ~/Dataset

And again, you can use WinRAR or 7-Zip if you prefer. Upload the archived dataset to Google Drive.

Step 2: Install dependencies

Open Google Colab and start a new notebook. Install PyDrive using the following command:

!pip install PyDrive

Import the necessary libraries and methods. The last two imports (from google.colab and oauth2client) are only required for Google Colab; do not import them if you’re not using Colab.

import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

Step 3: Authorize Google SDK

For Google Colab:

Now, you have to authorize Google SDK to access Google Drive from Colab. First, execute the following commands:

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

You will get a prompt containing a link. Follow the link to obtain the key, then copy and paste it into the input box and press Enter.

For other Jupyter notebook services (e.g., Paperspace Gradient):

Some of the following steps are taken from PyDrive’s quickstart guide.

Go to the APIs Console and create your own project. Then, search for ‘Google Drive API’, select the entry, and click ‘Enable’. Select ‘Credentials’ from the left menu, click ‘Create Credentials’, and select ‘OAuth client ID’. You should then see a menu for configuring the client ID.

Set “Application Type” to “Other”. Give an appropriate name and click “Save”.

Download the OAuth 2.0 client ID you just created. Rename it to client_secrets.json.

Upload this JSON file to your notebook. You can do this by clicking the “Upload” button on the homepage of the notebook. (Note: Do not use this button to upload your dataset, as it will be extremely time consuming.)
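Before authenticating, it can save a confusing error to confirm that the file landed in the notebook's working directory under the right name. The check below is a rough sketch: the helper name check_client_secrets is hypothetical, and it assumes the usual layout of a downloaded OAuth client file, with a top-level "installed" (or "web") section holding client_id and client_secret.

```python
import json
import os


def check_client_secrets(path="client_secrets.json"):
    """Return the client type ("installed" or "web") if the file looks usable."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found in {os.getcwd()}")
    with open(path) as f:
        config = json.load(f)
    # Look for the section that holds the client credentials.
    for client_type in ("installed", "web"):
        section = config.get(client_type)
        if section and "client_id" in section and "client_secret" in section:
            return client_type
    raise ValueError(f"{path} does not look like an OAuth client secrets file")
```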

Now, execute the following commands:

gauth = GoogleAuth()
gauth.CommandLineAuth()
drive = GoogleDrive(gauth)

The rest of the procedure is similar to that of Google Colab.

Step 4: Obtain your File’s ID

Enable link sharing for the file you want to transfer. Copy the link. You may get a link such as this:

https://drive.google.com/open?id=YOUR_FILE_ID

Copy only the YOUR_FILE_ID part of the above link, that is, the value that follows id=.
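If you prefer not to copy the ID by hand, it can be pulled out of the link with Python's standard library. This is a small sketch with a hypothetical helper name, drive_file_id, and it only handles links in the open?id= form shown above; share links in other formats are not covered.

```python
from urllib.parse import parse_qs, urlparse


def drive_file_id(share_link):
    """Return the value of the `id` query parameter from a Drive share link."""
    query = parse_qs(urlparse(share_link).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"No id parameter found in: {share_link}")
    return ids[0]


print(drive_file_id("https://drive.google.com/open?id=YOUR_FILE_ID"))  # YOUR_FILE_ID
```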

Step 5: Transfer contents

Download to Colab from Google Drive:

Execute the following commands. Here, YOUR_FILE_ID is obtained in the previous step, and DOWNLOAD.tar is the name (or path) you want to save the file as.

download = drive.CreateFile({'id': 'YOUR_FILE_ID'})
download.GetContentFile('DOWNLOAD.tar')

Upload to Google Drive from Colab:

Execute the following commands. Here, FILE_ON_COLAB.txt is the name (or path) of the file on Colab, and DRIVE.txt is the name (or path) you want to save the file as on Google Drive.

upload = drive.CreateFile({'title': 'DRIVE.txt'})
upload.SetContentFile('FILE_ON_COLAB.txt')
upload.Upload()

Transferring Smaller Files

Occasionally, you may want to pass just one CSV file and don’t want to go through this entire hassle. No worries. There are much simpler methods for that.

1. Google Colab files module

Google Colab has a built-in files module, with which you can upload or download files. You can import it by executing the following:

from google.colab import files

To Upload:

Use the following command to upload files to Google Colab:

files.upload()

You will be presented with a GUI with which you can select the files you want to upload. This method is not recommended for large files, as it is very slow.
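The call to files.upload() returns a dictionary that maps each uploaded file name to its content as bytes, so a small CSV can be parsed straight from the return value. The sketch below shows one way to do that; the helper name rows_from_upload and the file name data.csv are hypothetical, and UTF-8 is an assumed encoding.

```python
import csv
import io


def rows_from_upload(uploaded, filename, encoding="utf-8"):
    """Parse one uploaded CSV (stored as bytes) into a list of rows."""
    text = uploaded[filename].decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# In Colab (not runnable elsewhere), usage would look like:
#   from google.colab import files
#   uploaded = files.upload()
#   rows = rows_from_upload(uploaded, "data.csv")  # hypothetical file name
```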

To Download:

Use the following command to download a file from Google Colab:

files.download('example.txt')

This feature works best in Google Chrome. In my experience, it only worked once on Firefox, out of about 10 tries.

2. GitHub

This is a “hack-ish” way to transfer files. You can create a GitHub repository with the small files that you want to transfer.

Once you create the repository, you can just clone it in Google Colab. You can then push your changes to the remote repository and pull the updates onto your local system.

But do note that GitHub limits files added through the browser to 25MB and blocks pushes of files larger than 100MB, and it recommends keeping each repository under 1GB.
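To avoid a rejected upload, you can scan the folder for oversized files before committing. The following is a rough sketch: the function name files_over_limit is hypothetical, and the 25MB default simply reuses the per-file figure mentioned above, so adjust it to whichever limit applies to you.

```python
import os


def files_over_limit(root, limit_mb=25):
    """Return (path, size_in_mb) for files under `root` larger than `limit_mb`."""
    limit_bytes = limit_mb * 1024 * 1024
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                oversized.append((path, size / (1024 * 1024)))
    return sorted(oversized)
```

An empty list from files_over_limit(".") means that no file in the current folder exceeds the chosen threshold.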

Thank you for reading this article! Leave some claps if you found it interesting! If you have any questions, you could hit me up on social media or send me an email (bharathrajn98[at]gmail[dot]com).

Translated from: https://www.freecodecamp.org/news/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa/
