Technologies

Technology

This page describes the technological properties of ParCoLab.

The technologies that we are using

ParCoLab relies on current database management and web application technologies. It is based on Apache CouchDB, a NoSQL database management system. This technology, which is well suited to managing large quantities of data, stores the data as collections of JSON documents. It leaves a certain degree of freedom in the development of the corpus, because the data do not have to be organized in strictly structured, interconnected tables. As a result, additional levels of annotation can be added later to meet future needs. CouchDB implements web protocols and principles, notably an HTTP interface, which facilitate the execution of queries and make it easy to develop collaborative resources. The efficiency of the tool also stems from the use of Node.js, a JavaScript runtime platform dedicated to web applications, which must handle a high number of simultaneous queries and maintain solid performance even with a large number of users.
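
The flexibility of a document store can be illustrated with a minimal Python sketch. It shows how a schema-free JSON document might receive an annotation layer after the fact, without any change to a table structure. The document layout and the field names (`lang`, `text`, `annotations`, `pos`) are hypothetical and are not taken from ParCoLab's actual data model.

```python
import json

# Hypothetical sentence document; the field names are illustrative only.
sentence_doc = {
    "_id": "sentence-0001",
    "lang": "fr",
    "text": "Le chat dort.",
}


def add_annotation_layer(doc, layer_name, values):
    """Return a copy of doc with an extra annotation layer attached."""
    enriched = dict(doc)
    # Copy the existing layers (if any) so the input document is not mutated.
    layers = dict(enriched.get("annotations", {}))
    layers[layer_name] = values
    enriched["annotations"] = layers
    return enriched


# A part-of-speech layer is added later; the original fields are untouched.
annotated = add_annotation_layer(
    sentence_doc, "pos", ["DET", "NOUN", "VERB", "PUNCT"]
)
print(json.dumps(annotated, ensure_ascii=False, indent=2))
```

Because each document carries its own structure, documents with and without the new layer can coexist in the same collection.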

The corpus data are stored in an XML format based on the TEI P5 Guidelines. The files contain the following standardized metadata: title, subtitle, author, translator, editor, publication place, publication date, creation date, source, text language, language of the original text, domain, genre, word count, and creation type (original or translation). The corpus does not yet include any linguistic annotations, but our short-term efforts are focused on enriching ParCoLab with morphosyntactic and syntactic annotation.
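
As an illustration of how such metadata can be read, the following Python sketch parses a simplified TEI-style header with the standard library. The sample fragment is an assumption made for this example: it is not a complete, valid TEI P5 header and does not reproduce ParCoLab's actual files, and the function name `read_metadata` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, hypothetical header fragment (not a complete TEI header).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample title</title>
        <author>Sample Author</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Sample Publisher</publisher>
        <pubPlace>Sample City</pubPlace>
        <date>2000</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_metadata(xml_text):
    """Extract a few metadata fields; missing elements yield None."""
    root = ET.fromstring(xml_text)
    paths = {
        "title": ".//tei:titleStmt/tei:title",
        "author": ".//tei:titleStmt/tei:author",
        "publication_place": ".//tei:publicationStmt/tei:pubPlace",
        "publication_date": ".//tei:publicationStmt/tei:date",
        "translator": ".//tei:titleStmt/tei:editor[@role='translator']",
    }
    result = {}
    for key, path in paths.items():
        node = root.find(path, TEI_NS)
        has_text = node is not None and node.text
        result[key] = node.text.strip() if has_text else None
    return result


metadata = read_metadata(SAMPLE)
print(metadata)
```

Encoding the translator as an `<editor>` element with a `role` attribute is one possible convention, chosen here only for the sake of the example.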

The alignment of the original texts with their translations is performed using an algorithm integrated in ParCoLab; no external resources are used for this task. The algorithm proceeds top-down, creating one-to-one alignments, first at chapter level (<div> elements), then at paragraph level (<p> elements), and finally at sentence level (<s> elements). Errors are signalled by the tool and corrected manually, which ensures the reliability of the corpus alignments.
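
The top-down procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not ParCoLab's actual algorithm: texts are represented as nested lists (chapters, paragraphs, sentences), units are paired by position, and a mismatch in the number of units at any level is reported as an error to be corrected manually. The function name `align` and the error format are hypothetical.

```python
def align(source, target, levels=("div", "p", "s"), path=()):
    """Pair units by position, level by level; report count mismatches.

    Returns (pairs, errors). Each pair is (path, source_unit, target_unit);
    each error is (path, level, source_count, target_count).
    """
    pairs, errors = [], []
    level = levels[0]
    if len(source) != len(target):
        # No one-to-one alignment is possible here: flag it for manual review
        # and do not descend any further into this part of the text.
        errors.append((path, level, len(source), len(target)))
        return pairs, errors
    for i, (src, tgt) in enumerate(zip(source, target)):
        here = path + (i,)
        if len(levels) == 1:
            # Lowest level (sentences): record the aligned pair.
            pairs.append((here, src, tgt))
        else:
            sub_pairs, sub_errors = align(src, tgt, levels[1:], here)
            pairs.extend(sub_pairs)
            errors.extend(sub_errors)
    return pairs, errors


# One chapter with two paragraphs; the second paragraph differs in length.
original = [[["A.", "B."], ["C."]]]
translation = [[["a.", "b."], ["c.", "d."]]]

pairs, errors = align(original, translation)
print(pairs)
print(errors)
```

In this example the first paragraph aligns sentence by sentence, while the second is reported as an error at sentence level because the original has one sentence and the translation has two.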

Queries are carried out via the Elasticsearch search engine, which is well suited to querying data in NoSQL databases. Even with a minimal query form, a wide range of searches is possible. The search engine accepts queries containing one or several words. Wildcards can replace one or more characters in a query, which makes it possible to look for words that start or end with a given string of characters. Regular expressions and Boolean operators can also be used (for more details, please see the corresponding help pages). The search engine also uses the structure of the XML document and can target different metadata elements.
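
As a rough illustration, the Python sketch below builds request bodies in the Elasticsearch Query DSL for a wildcard search restricted by a metadata field. The field names (`text`, `lang`) and the helper functions are hypothetical; the way ParCoLab's query form is actually translated into Elasticsearch requests is not documented here.

```python
import json


def wildcard_query(field, pattern):
    """Match terms in `field` against a pattern using * and ? wildcards."""
    return {"wildcard": {field: {"value": pattern}}}


def regexp_query(field, pattern):
    """Match terms in `field` against a regular expression."""
    return {"regexp": {field: {"value": pattern}}}


def all_of(*clauses):
    """Boolean AND: every clause must match."""
    return {"bool": {"must": list(clauses)}}


# Words starting with "trad" in French texts (hypothetical field names).
body = {
    "query": all_of(
        wildcard_query("text", "trad*"),
        {"term": {"lang": "fr"}},
    )
}
print(json.dumps(body, indent=2))
```

The resulting dictionary would be serialized to JSON and sent to the search endpoint of an index; a regular-expression clause built with `regexp_query` could be combined with the others in the same way.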

As for web technology, ParCoLab uses an HTML5 query interface based on responsive web design. This means that the site adapts in real time to the type of device on which it is used. The site can be browsed on computers, tablets and smartphones with the same ease of use and without any user intervention. The application is compatible with the following web browsers: Google Chrome version 35 and later, Mozilla Firefox version 32 and later, Apple Safari version 7.0 and later, and Microsoft Internet Explorer version 10 and later. The site is available in three languages: English, French and Serbian.

The corpus is enriched manually and consists mainly of literary texts written in the three languages of the corpus (Serbian, French and English), together with their translations. For more details on the texts, please see the page Content.

The scientific papers published in this project are provided on the page Publications.
