Teaching a machine to recognize indecent content wasn’t difficult in retrospect, but it sure was tough the first time through.
Here are some lessons learned, and some tips and tricks I uncovered while building an NSFW model.
Though there are lots of ways this could have been implemented, the hope of this post is to provide a friendly narrative so that others can understand what this process can look like.
If you’re new to ML, this will inspire you to train a model. If you’re familiar with it, I’d love to hear how you would have gone about building this model and ask you to share your code.
Fortunately, a really cool set of scraping scripts was released for an NSFW dataset. The code is simple and already comes with labeled data categories. This means that just accepting this data scraper’s defaults will give us 5 categories pulled from hundreds of subreddits.
The instructions are quite simple: you can just run the 6 friendly scripts. Pay attention to them, as you may decide to change things up.
If you have more subreddits that you’d like to add, you should edit the source URLs before running step 1.
E.g., if you were to add a new source of neutral examples, you’d add it to the subreddit list in nsfw_data_scraper/scripts/source_urls/neutral.txt.
Reddit is a great source of content from around the web, since most subreddits are lightly policed by humans to stay on target for that subreddit.
The data we got from the NSFW data scraper is already labeled! But expect some errors, especially since Reddit isn’t perfectly curated.
Duplication is also quite common, but fixable without slow human comparison.
The first thing I like to run is duplicate-file-finder, which is the fastest exact-match file finder and deleter I’ve found. It’s written in Python.
Qarj/duplicate-file-finder: Find duplicate files. (github.com)
I can generally get a majority of duplicates knocked out with this command.
python dff.py --path train/path --delete
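To make it clearer what an exact-match pass is doing, here is a minimal sketch of the general idea: hash every file's bytes and group the files whose hashes collide. This is my own illustration, not the actual implementation of dff.py; the function names `file_digest` and `find_exact_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by content hash; return groups of 2+ files."""
    groups = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[file_digest(path)].append(path)
    # Only hashes shared by more than one file are duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Each returned group holds byte-identical files, so keeping the first path and deleting the rest would be the equivalent of a delete pass.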
Now, this doesn’t catch images that are “essentially” the same. For that, I advocate using a MacPaw tool called “Gemini 2”.
While this looks super simple, don’t forget to dig into the automatic duplicates, and select ALL the duplicates until your Gemini screen declares “Nothing Remaining”.
It’s safe to say this can take an extreme amount of time if you have a huge dataset. Personally, I ran it on each classification before I ran it on the parent folder in order to keep reasonable runtimes.
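If you don't have Gemini 2, one common stand-in for "essentially the same" detection is a perceptual hash. The sketch below uses a simple average hash, which is a different technique from whatever Gemini 2 does internally. It assumes each image has already been loaded and shrunk to a small grayscale grid (say 8x8) by an image library of your choice; the function names and the `max_distance` threshold are hypothetical.

```python
def average_hash(gray):
    """gray: 2D list of grayscale values for a small thumbnail.

    Returns a tuple of bits: 1 where a pixel is brighter than the mean.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(gray_a, gray_b, max_distance=5):
    """Treat two thumbnails as near-duplicates if their hashes are close."""
    return hamming(average_hash(gray_a), average_hash(gray_b)) <= max_distance
```

Small brightness or compression changes tend to leave the hash almost unchanged, which is why this can catch re-encoded copies that an exact byte match misses.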
I’ve looked at TensorFlow, PyTorch, and raw Python as ways to build a machine learning model from scratch. But I’m not looking to discover something new; I want to effectively do something pre-existing. So I went pragmatic.
I found Keras to be the most practical API for writing a simple model. Even TensorFlow agrees and is currently working to be more Keras-like. Also, with only one graphics card, I’m going to grab a popular pre-existing model and its weights, and simply train on top of it with some transfer learning.
After a little research, I chose Inception v3 with ImageNet weights. To me, that’s like going to the pre-existing ML store and buying the Aston Martin. We’ll just shave off the top layer so we can adapt that model to our needs.
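from keras.applications.inception_v3 import InceptionV3

# Load Inception v3 with ImageNet weights, without its top classification layer
conv_base = InceptionV3(
    weights='imagenet',
    include_top=False,
    input_shape=(height, width, num_channels)
)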
With the model in place, I added 3 more layers: a hidden layer of 256 neurons, followed by a hidden layer of 128 neurons, followed by a final layer of 5 neurons. The last one performs the ultimate classification into the five final classes, moderated by softmax.
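from keras import initializers, regularizers
from keras.layers import Dense, Dropout

# x is assumed to hold the pooled and flattened output of conv_base (that step is not shown here)
# Add a 256-unit hidden layer with He initialization and L2 regularization
x = Dense(256, activation='relu', kernel_initializer=initializers.he_normal(seed=None), kernel_regularizer=regularizers.l2(.0005))(x)
x = Dropout(0.5)(x)
# Add a 128-unit hidden layer
x = Dense(128, activation='relu', kernel_initializer=initializers.he_normal(seed=None))(x)
x = Dropout(0.25)(x)
# Add the 5-unit softmax output layer, one unit per category
predictions = Dense(5, kernel_initializer="glorot_uniform", activation='softmax')(x)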
Visually this code turns into this:
Some of the above might seem odd. After all, it’s not every day you say “glorot_uniform”. But strange words aside, my new hidden layers are being regularized to prevent overfitting.
I’m using dropout, which will randomly remove neural pathways so no one feature dominates the model.
Additionally, I’ve added L2 regularization to the first of the new layers as well.
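If those two regularizers feel abstract, here is a tiny plain-Python sketch of what they compute. It is only an illustration of the math, not Keras internals: `l2_penalty` adds a cost proportional to the squared weights (using the same .0005 factor as above), and `dropout` zeroes activations at random during training while scaling the survivors so the expected total stays the same. Both function names are hypothetical.

```python
import random


def l2_penalty(weights, factor=0.0005):
    """Extra loss term: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: drop each value with probability `rate`,
    and scale the kept values by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
penalty = l2_penalty([1.0, 2.0, 3.0])       # 0.0005 * (1 + 4 + 9)
dropped = dropout([1.0] * 8, rate=0.5, rng=rng)  # each entry is 0.0 or 2.0
```

Large weights make the penalty grow quickly, which nudges the layer toward smaller, more evenly spread weights.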
Now that the model is done, I augmented my dataset with some generated agitation. I rotated, shifted, cropped, sheared, zoomed, flipped, and channel shifted my training images. This helps ensure the model learns to see through common noise.
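To give a feel for what augmentation does to a single image, here is a deliberately simplified sketch covering just two of those transforms, a horizontal flip and a channel shift, on an image stored as nested lists of (r, g, b) tuples. In practice you would lean on an image library or your framework's augmentation utilities; the function names and the `max_shift` value here are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right; image is a list of rows of (r, g, b)."""
    return [list(reversed(row)) for row in image]


def clamp(value):
    """Keep a channel value inside the valid 0..255 range."""
    return max(0, min(255, value))


def channel_shift(image, shift):
    """Add the same offset to every channel of every pixel."""
    return [[tuple(clamp(c + shift) for c in px) for px in row] for row in image]


def augment(image, rng, max_shift=20):
    """Return a randomly flipped and brightness-shifted copy of the image."""
    out = horizontal_flip(image) if rng.random() < 0.5 else image
    return channel_shift(out, rng.randint(-max_shift, max_shift))
```

Calling `augment` on the same image several times yields slightly different training examples that all keep the original label.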
All the above systems are meant to prevent overfitting the model on the training data. Even if it is a ton of data, I want to keep the model as generalizable to new data as possible.
After running this for a long time, I got around 87% accuracy on the model! That’s a pretty good version one! Let’s make it great.
Once the new layers are trained up, you can unlock some deeper layers in your Inception model for retraining. The following code unlocks everything from the layer conv2d_56 onward.
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'conv2d_56':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
I ran the model for a long time with these newly unlocked layers, and once I added exponential decay (via a scheduled learning rate), the model converged at 91% accuracy on my test data!
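For anyone curious what an exponentially decaying learning rate looks like, here is a minimal sketch. The starting rate, decay rate, and decay interval are placeholder values I picked for illustration, not the settings used for this model, and `exponential_decay` is a hypothetical helper name.

```python
def exponential_decay(initial_lr, decay_rate, decay_epochs):
    """Build a schedule: lr = initial_lr * decay_rate ** (epoch / decay_epochs)."""
    def schedule(epoch, lr=None):
        # `lr` (the current rate) is accepted but unused, so the function
        # can be called with either one or two arguments.
        return initial_lr * decay_rate ** (epoch / decay_epochs)
    return schedule


# Placeholder values: halve the learning rate every 10 epochs.
schedule = exponential_decay(initial_lr=0.001, decay_rate=0.5, decay_epochs=10)
rates = [schedule(epoch) for epoch in (0, 10, 20)]
```

A function shaped like `schedule` is the kind of thing a learning-rate scheduler callback expects: given the epoch number, return the rate to use.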
With 300,000 images, finding mistakes in the training data by hand was impossible. But with a model at only 9% error, I could break down the errors by category, and then I only had to look at around 5,400 images at a time (9% of 300,000 is about 27,000 errors, or roughly 5,400 per category across 5 categories)! Essentially, I could use the model to help me find misclassifications and clean the dataset!
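Using the model as a data-cleaning assistant can be sketched like this: collect every image where the model disagrees with the scraped label, grouped by the label it came with, and review only those. This is an illustration under my own assumptions; `suspects_by_category` and the sample paths and category names are hypothetical.

```python
from collections import defaultdict


def suspects_by_category(samples):
    """samples: iterable of (path, labeled_category, predicted_category).

    Returns {labeled_category: [(path, predicted_category), ...]} for
    every sample where the model disagrees with the dataset label.
    """
    suspects = defaultdict(list)
    for path, labeled, predicted in samples:
        if labeled != predicted:
            suspects[labeled].append((path, predicted))
    return dict(suspects)


# Hypothetical predictions over a few files.
samples = [
    ("img_001.jpg", "neutral", "neutral"),
    ("img_002.jpg", "neutral", "category_b"),
    ("img_003.jpg", "category_b", "category_b"),
    ("img_004.jpg", "category_b", "neutral"),
]
review_queue = suspects_by_category(samples)
```

Each disagreement is either a model mistake or a labeling mistake, and a human only needs to look at the review queue to tell which.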
Technically, this would find false negatives only. It does nothing for bias in the false positives, but for something that detects NSFW content, I imagine recall is more important than precision.
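As a quick refresher on the two terms, here is a small sketch that computes both for one "positive" class. Precision asks how many flagged images were really positive; recall asks how many truly positive images got flagged. The labels below are made up for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(labels, predictions, positive):
    """Return (precision, recall) for the class named `positive`."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: the model flags one safe image but misses nothing.
labels = ["nsfw", "nsfw", "safe", "safe", "nsfw"]
predictions = ["nsfw", "nsfw", "nsfw", "safe", "nsfw"]
precision, recall = precision_recall(labels, predictions, positive="nsfw")
```

In this toy case recall is perfect while precision takes a hit, which is the trade-off I'd rather make for a content filter.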
Even if you have a lot of test data, it’s usually pulled from the same well. The best test is to make it easy for others to use and check your model. This works best in open source and simple demos. I released http://nsfwjs.com which helped the community identify bias, and the community did just that!
The community found two interesting indicators of bias fairly quickly. The fun one was that Jeff Goldblum kept getting miscategorized, and the not-so-fun one was that the model was overly sensitive to images of women.
Once you start getting into hundreds of thousands of images, it’s hard for one person (like moi) to identify where an issue might be. Even if I looked through a thousand images in detail for bias, I wouldn’t have even scratched the surface of the dataset as a whole.
That’s why it’s important to speak up. Misclassifying Jeff Goldblum is an entertaining data point, but identifying, documenting, and filing a ticket with examples does something powerful and good. I was able to get to work on fixing the bias.
With new images, improved training, and better validation I was able to retrain the model over a few weeks and attain a much better outcome. The resulting model was far more accurate in the wild. Well, unless you laughed as hard as I did about the Jeff Goldblum issue.
If I could manufacture one flaw… I’d keep Jeff. But alas, we have hit 93% accuracy!
It might have taken a lot of time, but it wasn’t hard, and it was fun to build a model. I suggest you grab the source code and try it for yourself! I’ll probably even attempt to retrain the model with other frameworks for comparison.
Show me what you’ve got. Contribute, or star/watch the repo if you’d like to see progress: https://github.com/GantMan/nsfw_model
Gant Laborde is Chief Technology Strategist at Infinite Red, a published author, adjunct professor, worldwide public speaker, and mad scientist in training. Clap/follow/tweet or visit him at a conference.