Machine Learning and Neural Networks Tools to Address Noisy Data Issues
DOI:
https://doi.org/10.55630/dipp.2021.11.8
Keywords:
Digital Library, Unsupervised Tools, Noisy Data, Tags, Content Based Retrieval
Abstract
In this paper, we present tools for addressing noisy keyword issues in digital libraries. Two tasks, language detection and misspelling detection and correction, are addressed using both machine learning and deep learning techniques. To train and validate the models, several datasets were used, created, or scraped. Encouraging preliminary results are presented and discussed.