How to classify photos in 600 classes using nine million Open Images

by Aleksey Bilogur

If you’re looking to build an image classifier but need training data, look no further than Google Open Images.

This massive image dataset contains over 30 million images and 15 million bounding boxes. That’s 18 terabytes of image data!

Plus, Open Images is much more open and accessible than certain other image datasets at this scale. For example, ImageNet has restrictive licensing.

However, it’s not easy for developers on single machines to sift through that much data. You need to download and process multiple metadata files, and roll your own storage space (or apply for access to a Google Cloud bucket).

On the other hand, there aren’t many custom image training sets in the wild, because frankly they’re a pain to create and share.

In this article, we’ll build and distribute a simple end-to-end machine learning pipeline using Open Images.

We’ll see how to create your own dataset around any of the 600 labels included in the Open Images bounding box data.

We’ll show off our handiwork by building “open sandwiches”. These are simple, reproducible image classifiers built to answer an age-old question: is a hamburger a sandwich?

Want to see the code? You can follow along in the repository on GitHub.

Downloading the data

We need to download the relevant data before we can do anything with it.

This is the core challenge when working with Google Open Images (or any external dataset really). There is no easy way to download a subset of the data. We need to write a script that does so for us.

I’ve written a Python script that searches the metadata in the Open Images data set for keywords that you specify. It finds the original URLs of the corresponding images (on Flickr), then downloads them to disk.

It’s a testament to the power of Python that you can do all of this in just 50 lines of code:

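The script itself lives in the GitHub repository linked above. A minimal sketch of its approach might look like the following; the metadata filenames and column names follow the Open Images V4 release layout, but treat them, along with the hard-coded paths, as assumptions rather than the script verbatim.

import os
import sys

import pandas as pd
import requests


def download_images_for_labels(labels):
    # Map human-readable label names (e.g. "Sandwiches") to Open Images
    # label IDs. This CSV has no header row.
    classes = pd.read_csv('class-descriptions-boxable.csv',
                          names=['LabelID', 'LabelName'])
    label_ids = classes[classes['LabelName'].isin(labels)]['LabelID']

    # Find the IDs of images that have at least one bounding box
    # carrying one of those labels.
    boxes = pd.read_csv('train-annotations-bbox.csv')
    image_ids = boxes[boxes['LabelName'].isin(label_ids)]['ImageID'].unique()

    # Look up the original Flickr URLs for those images.
    meta = pd.read_csv('train-images-boxable-with-rotation.csv')
    matches = meta[meta['ImageID'].isin(image_ids)]

    # Download each image to disk, skipping any that have gone missing.
    os.makedirs('images', exist_ok=True)
    for _, row in matches.iterrows():
        resp = requests.get(row['OriginalURL'], timeout=10)
        if resp.status_code == 200:
            with open(f"images/{row['ImageID']}.jpg", 'wb') as f:
                f.write(resp.content)


if __name__ == '__main__':
    download_images_for_labels(sys.argv[1:])
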
This script enables you to download the subset of raw images which have bounding box information for any subset of categories of your choice:

$ git clone https://github.com/quiltdata/open-images.git
$ cd open-images/
$ conda env create -f environment.yml
$ source activate quilt-open-images-dev
$ cd src/openimager/
$ python openimager.py "Sandwiches" "Hamburgers"

Categories are organized in a hierarchical way.

For example, sandwich and hamburger are both sub-labels of food (but hamburger is not a sub-label of sandwich — hmm).

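If you want to inspect the ontology programmatically, Open Images publishes the boxable-label hierarchy as a nested JSON file. Here is a small sketch of walking it; the bbox_labels_600_hierarchy.json filename and the LabelName/Subcategory keys are assumptions about that file's layout.

import json


def sublabels(node, label_id):
    """Recursively find `label_id` in the tree and yield the IDs of
    its direct sub-labels."""
    if node.get('LabelName') == label_id:
        for child in node.get('Subcategory', []):
            yield child['LabelName']
    else:
        for child in node.get('Subcategory', []):
            yield from sublabels(child, label_id)


with open('bbox_labels_600_hierarchy.json') as f:
    hierarchy = json.load(f)

# e.g. list everything directly underneath "Food" (look up its label
# ID in class-descriptions-boxable.csv; '/m/02wbm' is assumed here)
print(list(sublabels(hierarchy, '/m/02wbm')))
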
We can visualize the ontology as a radial tree using Vega:

You can view an interactive annotated version of this chart (and download the code behind it) here.

Not all categories in Open Images have bounding box data associated with them.

But this script will allow you to download any subset of the 600 labels that do. Here’s a taste of what’s possible:

football, toy, bird, cat, vase, hair dryer, kangaroo, knife, briefcase, pencil case, tennis ball, nail, high heels, sushi, skyscraper, tree, truck, violin, wine, wheel, whale, pizza cutter, bread, helicopter, lemon, dog, elephant, shark, flower, furniture, airplane, spoon, bench, swan, peanut, camera, flute, helmet, pomegranate, crown

For the purposes of this article, we’ll limit ourselves to just two: hamburger and sandwich.

Clean it, crop it

Once we’ve run the script and downloaded the images locally, we can inspect them with matplotlib to see what we’ve got:

import matplotlib.pyplot as plt
from matplotlib.image import imread
%matplotlib inline
import os

fig, axarr = plt.subplots(1, 5, figsize=(24, 4))
for i, img in enumerate(os.listdir('../data/images/')[:5]):
    axarr[i].imshow(imread('../data/images/' + img))

These images are not easy ones to train on. They have all of the issues associated with building a dataset using an external source from the public Internet.

Just this small sample here demonstrates the different sizes, orientations, and occlusions possible in our target classes.

In one case, we didn’t even succeed in downloading the actual image. Instead, we got a placeholder telling us that the image we wanted has since been deleted!

Downloading this data nets us a few thousand sample images like these. The next step is to take advantage of the bounding box information to clip our images down to just the sandwich-y, hamburger-y parts.

Here’s another image array, this time with bounding boxes included, to demonstrate what this entails:

This annotated Jupyter notebook in the demo GitHub repository does this work.

I will omit showing that code here because it is slightly complicated, especially since we also need to (1) refactor our image metadata to match the cropped image outputs and (2) extract the images that have since been deleted. Definitely check out the notebook if you wish to see the code.

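That said, the heart of the operation is a single crop per bounding box. A minimal sketch (Open Images box coordinates are normalized to [0, 1], so they need to be scaled by the image's pixel dimensions first):

from PIL import Image


def crop_to_box(image_path, xmin, xmax, ymin, ymax):
    """Crop one image down to one (normalized) Open Images bounding box."""
    img = Image.open(image_path)
    w, h = img.size
    # PIL's crop takes (left, upper, right, lower) in pixels.
    return img.crop((xmin * w, ymin * h, xmax * w, ymax * h))
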
After running the notebook code, we will have an images_cropped folder on disk containing all of the cropped images.

Building the model

Once we have downloaded the data, and cropped and cleaned it, we’re ready to train the model.

We will train a convolutional neural network (or ‘CNN’) on the data.

CNNs are a special type of neural network which build progressively higher-level features out of groups of pixels commonly found in the images.

How an image scores on these various features is then weighted to generate a final classification result.

This architecture works extremely well because it takes advantage of locality. This is because any one pixel is likely to have far more in common with pixels nearby than those far away.

CNNs also have other attractive properties, like noise tolerance and scale invariance (to an extent). These further improve their classification properties.

If you’re unfamiliar with CNNs, I recommend skimming Brandon Rohrer’s excellent “How convolutional neural networks work” to learn more about them.

We will train a very simple convolutional neural network and see how even that gets decent results on our problem. I use Keras to define and train the model.

We start by laying out the images in a certain directory structure:

images_cropped/
    sandwich/
        some_image.jpg
        some_other_image.jpg
        ...
    hamburger/
        yet_another_image.jpg
        ...

We then point Keras at this folder using the following code:

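The snippet below is a sketch of that code, not the notebook verbatim: the target size, batch size, and augmentation parameters are assumptions, and it presumes a held-out copy of the same folder structure (images_cropped_validation/, a hypothetical path) for validation.

from keras.preprocessing.image import ImageDataGenerator

# Augment training images on the fly: rescale pixel values, then apply
# random shears, zooms, and horizontal flips.
train_datagen = ImageDataGenerator(
    rescale=1 / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
# Validation images are only rescaled, never distorted.
test_datagen = ImageDataGenerator(rescale=1 / 255)

train_generator = train_datagen.flow_from_directory(
    '../data/images_cropped/',
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary'
)
validation_generator = test_datagen.flow_from_directory(
    '../data/images_cropped_validation/',  # hypothetical hold-out folder
    target_size=(128, 128),
    batch_size=16,
    class_mode='binary',
    shuffle=False  # keep ordering stable so labels align with predictions
)
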
Keras will inspect the input folders, and determine there are two classes in our classification problem. It will assign class names based on the subfolder names, and create “image generators” that serve images out of those folders.

But we don’t just return the images themselves. Instead, we return randomly subsampled, skewed, and zoomed selections from the images (via train_datagen.flow_from_directory).

This is an example of data augmentation in action.

Data augmentation is the practice of feeding an image classifier randomly cropped and distorted versions of an input dataset. This helps us overcome the small size of our dataset. We can train our model on a single image multiple times. Each time we use a slightly different segment of the image preprocessed in a slightly different way.

With our data input defined, the next step is defining the model itself:

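A sketch consistent with the description that follows; the filter counts, kernel sizes, and optimizer are assumptions, not the article's exact values.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Three convolutional blocks build progressively higher-level features.
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    # One densely connected post-processing layer, heavily regularized.
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    # Binary output: sandwich or hamburger.
    Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
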
This is a simple convolutional neural network model. It contains just three convolutional layers, followed by a single densely connected post-processing layer just before the output layer, and strong regularization in the form of a dropout layer and relu activation.

These things all work together to make it more difficult for this model to overfit. This is important, given the small size of our input dataset.

Finally, the last step is actually fitting the model.

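A sketch of that fitting call, reusing the generators from the earlier snippet; restore_best_weights is assumed here as the mechanism for recovering the best model, not necessarily what the original code used.

import math
from keras.callbacks import EarlyStopping

batch_size = 16
model.fit_generator(
    train_generator,
    steps_per_epoch=math.ceil(train_generator.samples / batch_size),
    epochs=50,
    validation_data=validation_generator,
    validation_steps=math.ceil(validation_generator.samples / batch_size),
    callbacks=[
        # Stop early, keeping the best weights, if validation loss
        # fails to improve for four consecutive epochs.
        EarlyStopping(monitor='val_loss', patience=4,
                      restore_best_weights=True)
    ]
)
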
This code selects an epoch step size determined by our image sample size and chosen batch size (16). Then it trains on that data for 50 epochs.

Training is likely to be suspended early by the EarlyStopping callback. This returns the best-performing model ahead of the 50-epoch limit if it does not see improvement in the validation score in the previous four epochs.

We selected such a large patience value because there is a significant amount of variability in model validation loss.

This simple training regimen results in a model with about 75% accuracy:

              precision    recall  f1-score   support

           0       0.90      0.59      0.71      1399
           1       0.64      0.92      0.75      1109

   micro avg       0.73      0.73      0.73      2508
   macro avg       0.77      0.75      0.73      2508
weighted avg       0.78      0.73      0.73      2508
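
A report like this comes from scikit-learn's classification_report. A sketch, reusing the non-shuffled validation generator from earlier so that predictions line up with the true labels:

from sklearn.metrics import classification_report

# Because validation_generator was created with shuffle=False, its
# .classes attribute lines up with the prediction order.
y_prob = model.predict_generator(validation_generator,
                                 steps=len(validation_generator))
y_pred = (y_prob > 0.5).astype(int).ravel()
print(classification_report(validation_generator.classes, y_pred))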

It’s interesting to note that our model is under-confident when classifying hamburgers (class 0), but over-confident when classifying sandwiches (class 1).

90% of images classified as hamburgers are actually hamburgers. But only 59% of all actual hamburgers are classified correctly.

On the other hand, just 64% of images classified as sandwiches are actually sandwiches. But 92% of sandwiches are classified correctly.

These results are in line with the 80% accuracy Francois Chollet got by applying a very similar model to a similarly-sized subset of the classic Cats versus Dogs dataset.

The difference is probably mainly due to increased level of occlusion and noise in the Google Open Images V4 dataset.

The dataset also includes illustrations as well as photographic images. These sometimes take large artistic liberties, making classification more difficult. You may choose to remove these when building a model yourself.

This performance can be further improved using transfer learning techniques. To learn more, check out Keras author Francois Chollet’s blog post “Building powerful image classification models using very little data”.

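As one illustration of the idea (a sketch, not the article's code): freeze a convolutional base pretrained on ImageNet and train only a small classifier head on our sandwich data.

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Reuse VGG16's ImageNet features; train just the new head.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(128, 128, 3))
base.trainable = False  # freeze the pretrained convolutional base

transfer_model = Sequential([
    base,
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
transfer_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                       metrics=['accuracy'])
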
Distributing the model

Now that we’ve built a custom dataset and trained a model, it’d be a shame if we didn’t share it.

Machine Learning projects should be reproducible. I outline the following strategy in a previous article, “Reproduce a machine learning model build in four lines of code”.

  • Separate dependencies into data, code, and environment components.

  • Data dependencies: version control (1) the model definition and (2) the training data. Save these to versioned blob storage, e.g. Amazon S3 with Quilt T4.

  • Code dependencies: version control the code used to train the model (use git).

  • Environment dependencies: version control the environment used to train the model. In a production environment this should probably be a Dockerfile, but you can use pip or conda locally.

  • To provide someone with a retrainable copy of the model, give them the corresponding {data, code, environment} tuple.

Following these principles makes getting everything you need to train your own copy of this model fit into a handful of lines of code:

git clone https://github.com/quiltdata/open-images.git
conda env create -f open-images/environment.yml
source activate quilt-open-images-dev
python -c "import t4; t4.Package.install('quilt/open_images', dest='open-images/', registry='s3://quilt-example')"

To learn more about {data, code, environment}, see the GitHub repository and/or the corresponding article.

Conclusion

In this article we demonstrated an end-to-end image classification machine learning pipeline. We covered everything from downloading/transforming a dataset to training a model. We then distributed it in a way that lets anyone else rebuild it themselves later.

Because custom datasets are difficult to generate and distribute, over time there has emerged a cabal of example datasets which get used everywhere. This is not because they’re actually that good (they’re not). Instead, it’s because they’re easy.

For example, Google’s recently released Machine Learning Crash Course makes heavy use of the California Housing Dataset. That data is now almost two decades old!

Consider instead exploring new horizons: using real images from the living Internet, with interesting categorical breakdowns. It’s easier than you think!

Translated from: https://www.freecodecamp.org/news/how-to-classify-photos-in-600-classes-using-nine-million-open-images-65847da1a319/
