Resources

This page collects the linguistic resources developed within the ParCoLab project.

ParCoTrain-Synt – Syntactic analysis of Serbian

ParCoTrain-Synt is a training and evaluation corpus for POS-tagging, fine-grained morphosyntactic annotation, lemmatisation and parsing of Serbian. The corpus contains 81 000 tokens, manually annotated at all levels. The source texts are contemporary Serbian novels from the second half of the 20th century.

For each token, the corpus indicates the lemma, POS-tag, detailed morphosyntactic description, morphosyntactic traits important for parsing, syntactic governor and syntactic function. The syntactic annotation follows a dependency-based approach.
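The per-token annotation described above can be pictured as one record per token. The following is a minimal sketch only: this page does not describe the file layout of the corpus, so the tab-separated column order, the tag values, the dependency labels and the sample sentence are all assumptions made for illustration, and the names Token, parse_sentence and dependents are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One annotated token. The field set mirrors the description above;
    the column order and all tag values are assumptions."""
    index: int    # position in the sentence, starting at 1
    form: str     # the token as it appears in the text
    lemma: str
    pos: str      # POS-tag
    msd: str      # detailed morphosyntactic description
    feats: str    # morphosyntactic traits important for parsing
    head: int     # index of the syntactic governor, 0 for the root
    deprel: str   # syntactic function


def parse_sentence(block: str) -> List[Token]:
    """Parse one sentence given as tab-separated lines (assumed layout)."""
    tokens = []
    for line in block.strip().splitlines():
        idx, form, lemma, pos, msd, feats, head, deprel = line.split("\t")
        tokens.append(Token(int(idx), form, lemma, pos, msd, feats, int(head), deprel))
    return tokens


def dependents(tokens: List[Token], head_index: int) -> List[Token]:
    """Return the tokens governed by the token at head_index."""
    return [t for t in tokens if t.head == head_index]


# Invented sample sentence ("Marko reads a book."); tags are illustrative only.
SAMPLE = "\n".join([
    "1\tMarko\tMarko\tN\tNp\tcase=nom|num=sg|gen=m\t2\tsubj",
    "2\tčita\tčitati\tV\tVm\tform=pres|pers=3|num=sg\t0\troot",
    "3\tknjigu\tknjiga\tN\tNc\tcase=acc|num=sg|gen=f\t2\tobj",
    "4\t.\t.\tPUNCT\tPUNCT\t_\t2\tpunct",
])

sentence = parse_sentence(SAMPLE)
for tok in dependents(sentence, 2):
    print(tok.form, tok.deprel)
```

Storing the governor as an index into the sentence keeps each token self-contained, which is the usual choice for dependency-annotated data; the actual files should be checked against the corpus documentation before reusing this layout.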

This resource was developed by Aleksandra Miletic, Dejan Stosic and Cécile Fabre (CLLE, CNRS & University of Toulouse).

Licence
Some rights are reserved. ParCoTrain-Synt is distributed under a Creative Commons BY-NC-SA 3.0 licence. Please read the licence carefully before using the corpus in your work.

Contact
Aleksandra Miletic, aleksandra.miletic@univ-tlse2.fr

Download
Corpus
PDF documentation (forthcoming)

ParCoJour – an MSD-tagged, lemmatized and parsed Serbian news corpus

Description

ParCoJour is a Serbian news corpus containing 34 000 tokens. It comprises 37 articles from one daily (Danas) and one weekly (NIN) newspaper, published between 2003 and 2017. For each token, the corpus indicates the lemma, POS-tag, detailed morphosyntactic traits important for parsing, syntactic governor and syntactic function. The linguistic annotation follows the guidelines of the ParCoTrain-Synt corpus.

Download
ParCoJour_v0.1

Licence
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Contact
Dusica Terzic, dusica.terzic@fil.bg.ac.rs

References
Terzic, Dusica. (2019). Parsing des textes journalistiques en serbe par le logiciel Talismane. Proceedings of TALN-RECITAL 2019, pp. 591-604. Toulouse, France. [PDF]

ParCoLab – files available for download

Description

Part of ParCoLab’s content is free of copyright and available for download. The currently available portion of the corpus contains 588 000 tokens in total (63 000 in Serbian, 260 000 in French, and 265 000 in English). The table below lists the included texts and their size in tokens.

Source	Type	Serbian (tokens)	French (tokens)	English (tokens)	Total (tokens)
French Embassy in Canada	Web content (short texts)	–	28 297	28 288	56 585
TV series Bref	Subtitles (spoken language)	13 305	15 168	–	28 473
Web magazine Pescanik	Web content (socio-political articles)	31 151	–	34 275	65 426
JRC-Acquis	Legislation (legislative texts from EU)	–	195 095	181 290	376 385
TED talks	Subtitles (short talks on various subjects)	18 933	21 105	21 410	61 448
Total		63 389	259 665	265 263	588 317
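
The totals in the table can be re-derived from the individual rows. The sketch below copies the token counts from the table and sums them per language; treating a dash as zero tokens is an assumption, and the names ROWS and column_totals are introduced here for illustration.

```python
# Token counts per source, in the order (Serbian, French, English).
# A dash in the table is treated as 0 (assumption).
ROWS = {
    "French Embassy in Canada": (0, 28297, 28288),
    "TV series Bref": (13305, 15168, 0),
    "Web magazine Pescanik": (31151, 0, 34275),
    "JRC-Acquis": (0, 195095, 181290),
    "TED talks": (18933, 21105, 21410),
}


def column_totals(rows):
    """Sum the counts column by column across all sources."""
    return tuple(sum(column) for column in zip(*rows.values()))


serbian, french, english = column_totals(ROWS)
print(serbian, french, english, serbian + french + english)
# prints: 63389 259665 265263 588317
```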

Contact person: Aleksandra Miletic (CLLE-ERSS), aleksandra.miletic@univ-tlse2.fr

Licence: Some rights are reserved. ParCoLab is distributed under a Creative Commons BY-NC-SA 3.0 licence.

Download

zip archive with XML files

ParCoTrain – morphosyntactic analysis and lemmatisation of Serbian

Description

ParCoTrain is a training and evaluation corpus for tools that automatically identify parts of speech and lemmatise Serbian. The lemmatised part of the corpus contains 95 585 manually annotated tokens, while the POS-annotated part contains 153 625 tokens in total, of which 95 585 were annotated manually and 57 977 were annotated automatically and then manually checked and corrected. The corpus is based on the text of 3 contemporary Serbian novels from the second half of the 20th century.

The POS annotation comprises the main category and the subcategory; for adjectives and adverbs, the degree of comparison is also given. A detailed overview of the tags used in the annotation is given in the PDF documentation, which can be downloaded from the links below.

This resource was developed by Aleksandra Miletić (CLLE-ERSS research team, University of Toulouse – Jean Jaurès), Antonio Balvet (STL research team, University of Lille 3) and Dejan Stošić (CLLE-ERSS research team, University of Toulouse – Jean Jaurès) as part of the ParCoLab project.

Contact: Aleksandra Miletić (CLLE-ERSS), aleksandra.miletic@univ-tlse2.fr

Licence: Some rights are reserved. ParCoTrain is distributed under a Creative Commons BY-NC-SA 3.0 licence (http://creativecommons.org/licenses/by-nc-sa/3.0/deed.fr). Please read it carefully.

Files available for download:
Training and evaluation corpus
Documentation in English
Documentation in French

References:

Balvet, A., Stosic, D., & Miletic, A. (2014). TALC-sef, Un corpus étiqueté de traductions littéraires en serbe, anglais et français. In SHS Web of Conferences (Vol. 8, pp. 2551-2563). EDP Sciences. [PDF] [BibTex]

Balvet, A., Stosic, D., & Miletic, A. (2014, May). TALC-Sef a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French. In LREC 2014. [PDF] [BibTex]

Miletic, A. (2013). Annotation semi-automatique en parties du discours d’un corpus littéraire serbe. Mémoire de Master. Université Charles de Gaulle Lille 3, France.

wikimorph-sr – a lexicon for POS-tagging and parsing of Serbian

Description

wikimorph-sr is a morphosyntactic lexicon for Serbian that can be used for POS-tagging, parsing and lemmatisation. It was mainly extracted from the Serbo-Croatian edition of Wiktionary (sh.wiktionary.org).

The lexicon contains 1 222 486 distinct wordforms corresponding to 117 445 distinct lemmas, for a total of 3 061 616 unique (wordform, lemma, morphosyntactic description) triples. Each morphosyntactic description contains the POS, a subcategory and a set of relevant morphosyntactic traits: case, number and gender for nouns, adjectives and pronouns; verb form, person, gender and number for verbs; degree of comparison for adjectives and adverbs. More details are available in the PDF documentation of the lexicon.
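
Because one wordform can correspond to several (lemma, morphosyntactic description) pairs, a lexicon of this kind is naturally indexed by wordform. The sketch below is illustrative only: the tab-separated entry layout, the tag strings and the sample entries are assumptions (the real format is described in the PDF documentation), and build_index is a hypothetical helper name. Only the three counts at the end are taken from the figures above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Analysis = Tuple[str, str]  # (lemma, morphosyntactic description)


def build_index(lines: Iterable[str]) -> Dict[str, List[Analysis]]:
    """Group lexicon entries by wordform.

    Assumes one entry per line in the form wordform<TAB>lemma<TAB>msd;
    this layout is an assumption, not the documented file format.
    """
    index: Dict[str, List[Analysis]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        wordform, lemma, msd = line.split("\t")
        index[wordform].append((lemma, msd))
    return dict(index)


# Invented sample entries; the tag strings are illustrative only.
SAMPLE = [
    "knjige\tknjiga\tN|gen|sg|f",
    "knjige\tknjiga\tN|nom|pl|f",
    "knjige\tknjiga\tN|acc|pl|f",
    "čita\tčitati\tV|pres|3|sg",
]

index = build_index(SAMPLE)
print(len(index["knjige"]))  # number of analyses for one ambiguous wordform

# Average number of analyses per wordform, from the counts given above.
average_analyses = 3_061_616 / 1_222_486
print(round(average_analyses, 2))
```

With the published counts, each wordform has roughly 2.5 analyses on average, which is why a tagger or parser using the lexicon still needs context to choose among them.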

This resource was developed as part of the ParCoLab project by Aleksandra Miletic (UMR 5263 CLLE-ERSS, CNRS & Université Toulouse – Jean Jaurès, France).

Contact
Aleksandra Miletic, aleksandra.miletic@univ-tlse2.fr

Licence
Some rights are reserved. wikimorph-sr is distributed under a Creative Commons BY-SA 3.0 licence.

Downloads

Lexicon
PDF documentation in English

References

Miletic, Aleksandra. (2017). Building a morphosyntactic lexicon for Serbian from Wiktionary. Actes de la 6e édition des Journées d’étude toulousaines (JéTou2017). Toulouse, France.

Acknowledgements

Many thanks to Franck Sajous (UMR 5263 CLLE, CNRS & Université de Toulouse – Jean Jaurès) for sharing his experience of working with Wiktionary.