Machine Learning and Neural Networks Tools to Address Noisy Data Issues


  • Maria Teresa Artese IMATI - CNR (National Research Council), Milan, Italy
  • Isabella Gagliardi IMATI - CNR (National Research Council), Milan, Italy



Digital Library, Unsupervised Tools, Noisy Data, Tags, Content Based Retrieval


In this paper, we present tools for addressing noisy keywords in digital libraries. Two tasks, language detection and misspelling detection and correction, are addressed using both machine learning and deep learning techniques. The models were trained and validated on several datasets: some already existing, some newly created, and some scraped. Encouraging preliminary results are presented and discussed.
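
To make the two tasks concrete, the following is a minimal sketch of each, and it is not the method used in the paper. Language detection is approximated by character-trigram overlap with tiny hand-written sample texts, and misspelling correction by a similarity lookup against a known vocabulary using Python's standard `difflib`. The names `SAMPLES`, `detect_language`, `correct_keyword`, and the `cutoff` value are assumptions introduced only for illustration; the paper itself relies on trained machine learning and deep learning models.

```python
from collections import Counter
from difflib import get_close_matches

# Tiny illustrative samples (an assumption); a real system would be
# trained on large corpora rather than on one sentence per language.
SAMPLES = {
    "en": "the library holds many old books and pictures of the city",
    "it": "la biblioteca conserva molti libri antichi e immagini della città",
}


def trigrams(text):
    """Count the character trigrams of a lower-cased, space-padded string."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built from the sample texts.
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect_language(keyword):
    """Return the language whose profile shares the most trigrams with the keyword."""
    grams = trigrams(keyword)

    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))


def correct_keyword(keyword, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the keyword unchanged if none is close enough."""
    matches = get_close_matches(keyword.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else keyword


if __name__ == "__main__":
    vocabulary = ["library", "museum", "archive"]
    print(detect_language("biblioteca"))
    print(correct_keyword("librery", vocabulary))
```

A lookup of this kind only corrects keywords that are close to a known vocabulary entry, which is one reason learned models are attractive for noisy, multilingual tags.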






How to Cite

Artese, M. T., & Gagliardi, I. (2021). Machine Learning and Neural Networks Tools to Address Noisy Data Issues. Digital Presentation and Preservation of Cultural and Scientific Heritage, 11, 89–98.
