This page describes the technological properties of ParCoLab.
The technologies that we are using
The corpus data are stored in an XML format based on the TEI P5 Guidelines. The files contain the following standardized metadata: title, subtitle, author, translator, editor, publication place, publication date, creation date, source, text language, language of the original text, domain, genre, word count, and creation type (original or translation). The corpus does not yet include any linguistic annotations, but our current short-term efforts are focused on enriching ParCoLab with morphosyntactic and syntactic annotation.
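To make the metadata description more concrete, here is a minimal sketch of how a few of these fields could be read from a TEI-style header with Python's standard library. The sample document, its element layout, and the `read_metadata` helper are illustrative assumptions; the actual ParCoLab schema is not reproduced on this page and may place these fields differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified sample; not an actual ParCoLab file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example Title</title>
        <author>Example Author</author>
      </titleStmt>
      <publicationStmt>
        <pubPlace>Example City</pubPlace>
        <date when="1900">1900</date>
      </publicationStmt>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="fr">French</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text><body><div><p><s>Une phrase.</s></p></div></body></text>
</TEI>"""


def read_metadata(xml_text):
    """Collect a few header fields into a dictionary (missing fields become None)."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        node = root.find(path, TEI_NS)
        return node.text if node is not None else None

    language = root.find(".//tei:langUsage/tei:language", TEI_NS)
    return {
        "title": text_of(".//tei:titleStmt/tei:title"),
        "author": text_of(".//tei:titleStmt/tei:author"),
        "pub_place": text_of(".//tei:publicationStmt/tei:pubPlace"),
        "pub_date": text_of(".//tei:publicationStmt/tei:date"),
        "language": language.get("ident") if language is not None else None,
    }


metadata = read_metadata(SAMPLE)
print(metadata)
```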
The alignment of the original texts with their translations is performed using an algorithm integrated in ParCoLab; no external resources are used for this task. The algorithm proceeds top-down, creating one-to-one alignments, first at chapter level (<div> elements), then at paragraph level (<p> elements), and finally at sentence level (<s> elements). Errors are signalled by the tool and corrected manually, which ensures the reliability of the corpus alignments.
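The top-down procedure can be illustrated with a short sketch. It is not the ParCoLab algorithm itself, whose details are not given here. It assumes that units are paired by position and that a mismatch in the number of units at any level is what gets reported for manual correction. The nested-list representation (document, chapters, paragraphs, sentences) and the `align` function are hypothetical.

```python
# Level names follow the elements mentioned above: chapter, paragraph, sentence.
LEVELS = ("div", "p", "s")


def align(source, target, level=0, path=()):
    """Pair units one-to-one by position, descending level by level.

    Returns (pairs, errors). Each pair is (path, source_sentence, target_sentence).
    Each error is (level_name, parent_path, source_count, target_count) and marks
    a place where the counts differ, so a human would have to fix the alignment.
    """
    pairs, errors = [], []
    if len(source) != len(target):
        # Assumption: a count mismatch is reported instead of guessed at.
        errors.append((LEVELS[level], path, len(source), len(target)))
        return pairs, errors
    for index, (src, tgt) in enumerate(zip(source, target)):
        here = path + (index,)
        if level == len(LEVELS) - 1:
            pairs.append((here, src, tgt))
        else:
            sub_pairs, sub_errors = align(src, tgt, level + 1, here)
            pairs.extend(sub_pairs)
            errors.extend(sub_errors)
    return pairs, errors


# One chapter with two paragraphs; the second paragraph has a sentence mismatch.
original = [[["A.", "B."], ["C."]]]
translation = [[["A'.", "B'."], ["C'.", "D'."]]]

pairs, errors = align(original, translation)
print(pairs)
print(errors)
```

In this example the first paragraph yields two sentence pairs, while the second paragraph is reported as an error because it has one sentence in the original and two in the translation.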
Queries are carried out via the Elasticsearch search engine, which is well adapted to querying data in NoSQL databases. Even with a minimal query form, a wide range of searches is possible. The search engine allows for queries containing one or several words, and wildcards can be used to replace one or more characters in a query, for example to find words that start or end with a certain string of characters. Regular expressions and boolean operators can also be used (for more details, please see the corresponding help pages). The search engine also uses the structure of the XML document and can target different elements of metadata.
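As an illustration of the kind of request involved, the following sketch builds an Elasticsearch query body that combines a wildcard search with a metadata filter. The `wildcard`, `term` and `bool` clauses are part of the standard Elasticsearch Query DSL, but the field names `text` and `language` and the `build_query` helper are assumptions; the actual ParCoLab index mapping and the way the query form is translated into requests are not documented here.

```python
import json


def build_query(pattern, language=None, field="text"):
    """Build a query body: wildcard match on `field`, optional language filter.

    `field` and the "language" metadata field are hypothetical names.
    """
    wildcard_clause = {"wildcard": {field: {"value": pattern}}}
    query = {"bool": {"must": [wildcard_clause]}}
    if language is not None:
        # Restrict hits to documents whose metadata language matches exactly.
        query["bool"]["filter"] = [{"term": {"language": language}}]
    return {"query": query}


# Words starting with "trad", restricted to French texts.
body = build_query("trad*", language="fr")
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the body of a search request by any Elasticsearch client. A regular-expression search would use a `regexp` clause in place of `wildcard`.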
As for web technology, ParCoLab uses an HTML5 query interface based on responsive web design. This means that the site adapts in real time to the type of device on which it is used. The site can be browsed on computers, tablets and smartphones with the same ease of use and with no need for user intervention. The application is compatible with the following web browsers: Google Chrome version 35 and later, Mozilla Firefox version 32 and later, Apple Safari version 7.0 and later, and Microsoft Internet Explorer version 10 and later. The site exists in three languages: English, French and Serbian.
The corpus is enriched manually and consists mainly of literary texts written in the three languages of the corpus (Serbian, French and English), together with their translations. For more details on the texts, please see the page Content.
The scientific papers published in this project are provided on the page Publications.