Software Library for Authorship Identification
DOI:
https://doi.org/10.55630/dipp.2015.5.8
Keywords:
text authorship identification, compression algorithms, normalized compression distance, n-grams, natural frequency zoned word distribution
Abstract
The aim of this paper is to review some methods for text authorship attribution and to discuss the development of a software library with tools for automatic authorship attribution. The presentation is focused on an analysis of two groups of tools oriented to: (1) methods for extraction of features and (2) methods for computing the distance between character strings based on data compression algorithms.
References
Adair D. (1944). The Authorship of the Disputed Federalist Papers. The William and Mary Quarterly ser. 3, vol. 1, no. 2: 97-122.
Cavnar W., Trenkle J. (1994). N-gram-based text categorization. In Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval SDAIR-94, 161–175.
Chen Z., Huang L., Yang W., Meng P., Miao H. (2012). More than Word Frequencies: Authorship Attribution via Natural Frequency Zoned Word Distribution Analysis. Cornell University Library.
Cilibrasi R., Vitanyi P. M. B. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545.
Diederich J. (2003). Authorship Attribution with Support Vector Machines. Applied Intelligence 19, 109-123.
Google Code Jam: https://code.google.com/codejam
Hantova C. (2015). Authorship attribution. MSc Thesis, Sofia University, Faculty of Mathematics and Informatics.
Ivanov I. (2013). Automatic authorship attribution using compression methods. MSc Thesis, Sofia University, Faculty of Mathematics and Informatics.
Luyckx K. (2011). Scalability Issues in Authorship Attribution. Vubpress.
Mosteller F., Wallace D. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.
Peng F., Schuurmans D., Wang S. (2004). Augmenting naive Bayes classifiers with statistical language models. Information Retrieval Journal, 7(1), 317-345.
Stamatatos E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556.