
New access services in HbbTV based on a deep learning approach for media content analysis

Published online by Cambridge University Press:  04 December 2019

Silvia Uribe*
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Alberto Belmonte
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Francisco Moreno
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Álvaro Llorente
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Juan Pedro López
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Federico Álvarez
Affiliation:
Grupo de Aplicación de Telecomunicaciones Visuales, ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
*Author for correspondence: Silvia Uribe, E-mail: [email protected]

Abstract

Universal access to audiovisual content on equal terms is key to the full inclusion of people with disabilities in the activities of daily life. This challenge for the current Information Society has been identified but not yet met efficiently, because current access solutions rely mainly on the traditional television standard and on other non-automated, high-cost approaches. The arrival of new technologies in the hybrid television environment, together with the application of artificial intelligence techniques to the content, will enable the deployment of innovative solutions that enhance the user experience for all. This paper presents a set of image enhancement tools that combine deep learning and computer vision algorithms. These tools provide automatic descriptive information about the media content, based on face detection for magnification and character identification. This information is then fused to provide a customizable description of the visual information, with the aim of improving the accessibility of the content through an efficient, reduced-cost solution for all.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2019

