Machine Learning and Neural Networks Tools to Address Noisy Data Issues

Authors

  • Maria Teresa Artese, IMATI - CNR (National Research Council), Milan, Italy
  • Isabella Gagliardi, IMATI - CNR (National Research Council), Milan, Italy

DOI:

https://doi.org/10.55630/dipp.2021.11.8

Keywords:

Digital Library, Unsupervised Tools, Noisy Data, Tags, Content Based Retrieval

Abstract

In this paper, we present tools for addressing noisy keywords in digital libraries. Two tasks, language detection and misspelling detection and correction, are addressed using both machine learning and deep learning techniques. To train and validate the models, several datasets were used, created, or scraped. Encouraging preliminary results are presented and discussed.
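
As a rough illustration of the two tasks named in the abstract, the sketch below identifies the language of a keyword by character-trigram overlap and corrects a misspelled keyword by fuzzy matching against a vocabulary. It is a toy stand-in, not the models described in the paper: the sample sentences, the vocabulary, the similarity cutoff, and the function names `detect_language` and `correct` are all assumptions made for this example.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy training text per language; NOT the datasets used in the paper.
SAMPLES = {
    "en": "the library holds a collection of old maps and photographs of the city",
    "it": "la biblioteca conserva una raccolta di antiche mappe e fotografie della città",
}

def trigrams(text):
    """Count character trigrams, padding so word boundaries are captured."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

# One trigram profile per language, built from the toy samples above.
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def detect_language(keyword):
    """Return the language whose trigram profile overlaps most with the keyword."""
    grams = trigrams(keyword)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

# Hypothetical vocabulary of known-good keywords (an assumption for this sketch).
VOCAB = ["library", "collection", "photograph", "manuscript"]

def correct(keyword, cutoff=0.8):
    """Return the closest vocabulary entry, or the keyword unchanged if none is close."""
    matches = get_close_matches(keyword.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else keyword
```

With realistic data, the trigram profiles would be built from a much larger corpus per language, and the vocabulary would come from the library's own controlled keyword list; both choices here are placeholders.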

Published

2021-09-10

How to Cite

Artese, M. T., & Gagliardi, I. (2021). Machine Learning and Neural Networks Tools to Address Noisy Data Issues. Digital Presentation and Preservation of Cultural and Scientific Heritage, 11, 89–98. https://doi.org/10.55630/dipp.2021.11.8