A Deep Learning Approach to Identify Not Suitable for Work Images

Daniel Bicho, Artur Ferreira, Nuno Datia

Abstract


Web Archiving (WA) deals with the preservation of portions of the World Wide Web (WWW), keeping them available for future access. Arquivo.pt is a WA initiative holding a large amount of content, including image files. However, some of these images contain nudity or pornography, which can be offensive to users and are thus Not Suitable For Work (NSFW). This work proposes a methodology to classify the NSFW images available at Arquivo.pt using deep neural networks. A large dataset of images is built from Arquivo.pt data, and two pre-trained neural network models, namely ResNet and SqueezeNet, are evaluated and improved for the NSFW classification task using this dataset. Before fine-tuning, these models achieved accuracies of 93% and 72%, respectively. After a fine-tuning stage, the accuracies improved to 94% and 89%, respectively. The proposed solution is integrated into the Arquivo.pt Image Search System, enabling the filtering of problematic NSFW images. At the time of this writing, the proposed solution is in production at https://arquivo.pt/images.jsp
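
The filtering step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each image search result carries a classifier score in [0, 1] and that results at or above a cut-off are hidden. The names `ImageResult`, `nsfw_score`, and `filter_safe`, the example URLs, and the 0.5 threshold are all hypothetical and are not taken from the Arquivo.pt system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageResult:
    url: str
    nsfw_score: float  # hypothetical classifier output in [0, 1]


def filter_safe(results: Iterable[ImageResult],
                threshold: float = 0.5) -> List[ImageResult]:
    """Keep only results whose NSFW score is strictly below the threshold.

    The default threshold of 0.5 is an assumption made for illustration.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return [r for r in results if r.nsfw_score < threshold]


# Illustrative data only; the URLs and scores are made up.
results = [
    ImageResult("https://example.org/a.jpg", 0.02),
    ImageResult("https://example.org/b.jpg", 0.97),
    ImageResult("https://example.org/c.jpg", 0.50),
]
safe = filter_safe(results)
print([r.url for r in safe])
```

A lower threshold hides more borderline images at the cost of hiding some safe ones, so in practice the cut-off would be tuned on labelled data.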


Keywords


Deep Learning; Deep Neural Networks; Image Classification; Not Suitable for Work Images; ResNet; SqueezeNet; Redis Message Queue


DOI: http://dx.doi.org/10.34629/ipl.isel.i-ETC.80


Copyright (c) 2020 Artur Ferreira, Daniel Bicho

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.