A Deep Learning Approach to Identify Not Suitable for Work Images
Abstract
Web Archiving (WA) deals with the preservation of portions of the World Wide Web (WWW), keeping them available for future access. Arquivo.pt is a WA initiative holding a huge amount of content, including image files.
However, some of these images contain nudity and pornography, which can be offensive to users and are therefore Not Suitable For Work (NSFW). This work proposes a methodology to classify NSFW images available at Arquivo.pt using deep neural networks. A large dataset of images is built from Arquivo.pt data, and two pre-trained neural network models, ResNet and SqueezeNet, are evaluated on this dataset and improved for the NSFW classification task.
The initial evaluation of these models reported accuracies of 93% and 72%, respectively. After a fine-tuning stage, the accuracies improved to 94% and 89%, respectively.
The proposed solution is integrated into the Arquivo.pt Image Search System, enabling the filtering of problematic NSFW images. At the time of this writing, the proposed solution is in production at https://arquivo.pt/images.jsp
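As an illustration of the two steps the abstract mentions (measuring accuracy and filtering search results), the following is a minimal sketch in Python. It is not the code used at Arquivo.pt. It assumes the classifier returns an NSFW score in [0, 1] for each image, and the names `ImageResult`, `filter_safe`, `accuracy` and the 0.5 threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical cut-off: images scoring at or above this value are treated as NSFW.
NSFW_THRESHOLD = 0.5


@dataclass
class ImageResult:
    url: str
    nsfw_score: float  # assumed classifier output in [0, 1]


def filter_safe(results, threshold=NSFW_THRESHOLD):
    """Return only the results whose NSFW score is below the threshold."""
    return [r for r in results if r.nsfw_score < threshold]


def accuracy(labels, scores, threshold=NSFW_THRESHOLD):
    """Fraction of images whose thresholded score matches the true label.

    labels: 1 for NSFW, 0 for safe. scores: classifier outputs in [0, 1].
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not labels:
        raise ValueError("at least one labelled image is required")
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


if __name__ == "__main__":
    hits = [ImageResult("a.jpg", 0.1), ImageResult("b.jpg", 0.9)]
    print([r.url for r in filter_safe(hits)])
    print(accuracy([0, 1, 1, 0], [0.1, 0.9, 0.4, 0.2]))
```

In a deployment, the threshold would trade off hiding safe images against letting NSFW images through, so it would normally be tuned on a held-out labelled set rather than fixed at 0.5.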
DOI: http://dx.doi.org/10.34629/ipl.isel.i-ETC.80
Copyright (c) 2020 Artur Ferreira, Daniel Bicho
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.