Bilingual Corpus – Digital Repository for Preservation of Language Heritage
Natural Language, Bilingual Corpus, Parallel Corpus, Aligned Corpus, Annotation
Abstract
The article briefly reviews the bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data that play an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.
International Conference on Digital Presentation and Preservation of Cultural and Scientific Heritage