What is ParCoLab?

ParCoLab is a 32,000,000-word parallel corpus containing original texts and their translations in six European languages: Serbian, French, English, Spanish, Occitan, and Poitevin-Saintongeais. Each of the six languages is represented both as a source language and as a target language in the corpus. The project manager is Dejan Stosic, Full Professor at Toulouse Jean Jaurès University and member of the CLLE research unit (UMR 5263 of CNRS). ParCoLab is the result of numerous collaborations between linguists and NLP scientists.

The corpus texts are aligned at the paragraph and sentence level, which means that each paragraph and each sentence of an original text is linked with its translation in one or more of the other languages of the corpus. A great advantage of our parallel corpus lies in the high quality of the alignments, which are validated manually. Although at the moment it contains mostly literary texts, ParCoLab is becoming a more diversified resource as other text genres are included (web content, movie subtitles, technical documentation, etc.).

The value and originality of the corpus lie in the richness and quality of its content, but also in its structure and annotation principles, which follow the current standards for corpus compilation and distribution (XML format based on the TEI P5 recommendations).
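To give a rough idea of what sentence-level alignment in a TEI-style XML file can look like, here is a minimal sketch in Python. The sample fragment, its identifiers (`fr.1`, `en.1`, ...) and the helper `aligned_pairs` are invented for illustration and are not taken from ParCoLab; the actual markup of the corpus may differ. The sketch only assumes the general TEI P5 mechanism of `<s>` elements carrying an `xml:id` and `<link>` elements whose `target` attribute lists the aligned segments.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: ids and sentences are invented for
# illustration and do not reproduce the actual ParCoLab encoding.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:lang="fr">
        <p><s xml:id="fr.1">Bonjour.</s> <s xml:id="fr.2">Comment allez-vous ?</s></p>
      </div>
      <div xml:lang="en">
        <p><s xml:id="en.1">Hello.</s> <s xml:id="en.2">How are you?</s></p>
      </div>
      <linkGrp type="alignment">
        <link target="#fr.1 #en.1"/>
        <link target="#fr.2 #en.2"/>
      </linkGrp>
    </body>
  </text>
</TEI>"""

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def aligned_pairs(xml_text):
    """Return one tuple of sentence texts per <link> element."""
    root = ET.fromstring(xml_text)
    # Index every sentence by its xml:id.
    sentences = {
        s.get(XML_ID): "".join(s.itertext()).strip()
        for s in root.iter(TEI + "s")
    }
    pairs = []
    for link in root.iter(TEI + "link"):
        # Each target is a space-separated list of "#id" pointers.
        ids = [t.lstrip("#") for t in link.get("target", "").split()]
        pairs.append(tuple(sentences[i] for i in ids))
    return pairs


for pair in aligned_pairs(SAMPLE):
    print(" | ".join(pair))
```

Because a link can point to more than two segments, the same approach would extend to a sentence aligned with translations in several languages at once.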

ParCoLab can be consulted online for free. A search engine lets you look up any expression you wish to examine and then extract the results, together with the corresponding passages in one or two other languages.

The corpus is used principally in scientific research and in language teaching. ParCoLab is a significant source of data for linguistic studies of Serbian, French, English, and Spanish, as well as for research in comparative linguistics and linguistic typology. It can also be used in language teaching and learning, translation studies, and translator training. Applications in lexicography, teaching material creation, and computer-assisted translation can be developed as well. Finally, several resources for NLP applications based on ParCoLab are under construction.

Since one of the main principles of our project is constant improvement, we continue to work hard on advancing the corpus from the technical, qualitative, and quantitative points of view.
