ParCoTrain-Synt – Syntactic analysis of Serbian

ParCoTrain-Synt is a training and evaluation corpus for the POS-tagging, fine-grained morphosyntactic annotation, lemmatisation and parsing of Serbian. The corpus contains 81 000 tokens, manually annotated at all levels. The source texts are contemporary Serbian novels from the second half of the 20th century.

For each token, the corpus indicates the lemma, the POS-tag, a detailed morphosyntactic description, the morphosyntactic traits relevant for parsing, the syntactic governor and the syntactic function. The syntactic annotation follows a dependency-based approach.
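To make the per-token annotation concrete, the sketch below models one token with the fields listed above and retrieves the dependents of a given governor. It is only an illustration: the tab-separated 8-column layout, the field names, and the tags in the sample sentence are assumptions made for this example and are not taken from the corpus files or their documentation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # position of the token in the sentence (1-based)
    form: str       # wordform as it appears in the text
    lemma: str
    pos: str        # main POS-tag
    msd: str        # detailed morphosyntactic description
    feats: str      # morphosyntactic traits relevant for parsing
    head: int       # index of the syntactic governor (0 = root)
    function: str   # syntactic function


def parse_token(line: str) -> Token:
    """Parse one tab-separated line (assumed 8-column layout) into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError(f"expected 8 columns, got {len(cols)}")
    index, form, lemma, pos, msd, feats, head, function = cols
    return Token(int(index), form, lemma, pos, msd, feats, int(head), function)


def dependents(sentence, governor_index):
    """Return the tokens whose syntactic governor is governor_index."""
    return [t for t in sentence if t.head == governor_index]


# Invented sample sentence; the tags are hypothetical, not the corpus tagset.
sample = [
    "1\tOna\ton\tPRON\tPRON-pers\tCase=Nom\t2\tsubj",
    "2\tčita\tčitati\tVERB\tVERB-main\tPerson=3\t0\troot",
    "3\tknjigu\tknjiga\tNOUN\tNOUN-common\tCase=Acc\t2\tobj",
]
sentence = [parse_token(line) for line in sample]
print([t.form for t in dependents(sentence, 2)])  # ['Ona', 'knjigu']
```

The actual column order and tagset should be taken from the corpus files and their documentation once available.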

This resource was developed by Aleksandra Miletic, Dejan Stosic and Cécile Fabre (CLLE, CNRS & University of Toulouse).

Licence
Some rights are reserved. ParCoTrain-Synt is distributed under a Creative Commons BY-NC-SA 3.0 licence. Please read the licence carefully before using the corpus in your work.

Contact
Aleksandra Miletic, aleksandra.miletic@univ-tlse2.fr

Download
Corpus
PDF documentation (forthcoming)

ParCoLab – files available for download

Description

Part of ParCoLab’s content is free of copyright and available for download. The portion of the corpus that is currently available contains 588 000 tokens in total (63 000 in Serbian, 260 000 in French, and 265 000 in English). The texts included and their sizes in tokens are listed below.

 

Tokens per language:

Source                   | Type                                        | Serbian | French  | English | Total
French Embassy in Canada | Web content (short texts)                   | -       | 28 297  | 28 288  | 56 585
TV series Bref           | Subtitles (spoken language)                 | 13 305  | 15 168  | -       | 28 473
Web magazine Pescanik    | Web content (socio-political articles)      | 31 151  | -       | 34 275  | 65 426
JRC-Acquis               | Legislation (legislative texts from EU)     | -       | 195 095 | 181 290 | 376 385
TED talks                | Subtitles (short talks on various subjects) | 18 933  | 21 105  | 21 410  | 61 448
Total                    |                                             | 63 389  | 259 665 | 265 263 | 588 317
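The totals can be checked against the per-source counts with a few lines of Python. In this sketch, the assignment of each count to a language is inferred from the per-language totals (it is the only assignment under which the columns add up), and the variable and function names are made up for the example.

```python
# Token counts per source as (Serbian, French, English);
# None marks a language that is not present for that source.
counts = {
    "French Embassy in Canada": (None, 28297, 28288),
    "TV series Bref": (13305, 15168, None),
    "Web magazine Pescanik": (31151, None, 34275),
    "JRC-Acquis": (None, 195095, 181290),
    "TED talks": (18933, 21105, 21410),
}


def column_totals(table):
    """Sum the counts per language, treating missing languages as 0."""
    totals = [0, 0, 0]
    for row in table.values():
        for i, n in enumerate(row):
            totals[i] += n or 0
    return totals


totals = column_totals(counts)
print(totals)       # [63389, 259665, 265263]
print(sum(totals))  # 588317
```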

 

Contact person: Aleksandra Miletic (CLLE-ERSS), aleksandra.miletic@univ-tlse2.fr

Licence: Some rights are reserved. ParCoLab is distributed under a Creative Commons BY-NC-SA 3.0 licence.

 

ParCoTrain – POS-tagging and lemmatisation of Serbian

Description

ParCoTrain is a training and test corpus for the POS-tagging and lemmatisation of Serbian. The lemmatised section of the corpus contains 95 585 tokens, whereas the POS-tagged section contains 153 625 tokens (95 585 of which are annotated manually, with the remaining 57 977 annotated automatically and validated manually). The source texts are contemporary Serbian novels from the second half of the 20th century.

The POS-tagging gives the main POS and the subcategory. It also indicates the degree of comparison for adjectives and adverbs. A detailed description of the tagset used in the corpus can be found in the PDF documentation downloadable from this page.

This resource was developed as part of the ParCoLab project by Aleksandra Miletic (CLLE-ERSS, Université Toulouse – Jean Jaurès), Antonio Balvet (STL, Université Lille 3) and Dejan Stosic (CLLE-ERSS, Université Toulouse – Jean Jaurès).

Contact person: Aleksandra Miletic (CLLE-ERSS), aleksandra.miletic@univ-tlse2.fr

Licence: Some rights are reserved. ParCoTrain is distributed under a Creative Commons BY-NC-SA 3.0 licence.

References

Balvet, A., Stosic, D., & Miletic, A. (2014, May). TALC-Sef, a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French. In LREC 2014.

Miletic, A. (2013). Annotation semi-automatique en parties du discours d’un corpus littéraire serbe. Mémoire de Master. Université Charles de Gaulle Lille 3, France.

 

wikimorph-sr – a lexicon for POS-tagging and parsing of Serbian

 

Description

wikimorph-sr is a morphosyntactic lexicon for Serbian that can be used for POS-tagging, parsing and lemmatisation. It was mainly extracted from the Serbo-Croatian edition of Wiktionary (sh.wiktionary.org).

The lexicon contains 1 222 486 different wordforms corresponding to 117 445 different lemmas, for a total of 3 061 616 unique (wordform, lemma, morphosyntactic description) triples. Each morphosyntactic description contains a POS indication, a subcategory and a set of relevant morphosyntactic traits: case, number and gender for nouns, adjectives and pronouns; verb form, person, gender and number for verbs; degree of comparison for adjectives and adverbs. More details are available in the PDF documentation of the lexicon.
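The relation between the three counts above (wordforms, lemmas, triples) can be illustrated with a small lookup structure. The sketch below assumes a tab-separated file with one (wordform, lemma, morphosyntactic description) triple per line; this layout, the function name and the tags in the sample are hypothetical, and the real format is described in the PDF documentation of the lexicon.

```python
import io
from collections import defaultdict


def load_lexicon(handle):
    """Map each wordform to the set of its (lemma, msd) analyses.

    Assumes one tab-separated triple per line: wordform, lemma, msd.
    """
    lexicon = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        wordform, lemma, msd = line.split("\t")
        lexicon[wordform].add((lemma, msd))
    return lexicon


# Invented sample entries; the tags are not the wikimorph-sr tagset.
sample = io.StringIO(
    "knjige\tknjiga\tNOUN;Gen;Sg;Fem\n"
    "knjige\tknjiga\tNOUN;Nom;Pl;Fem\n"
    "knjigu\tknjiga\tNOUN;Acc;Sg;Fem\n"
)
lexicon = load_lexicon(sample)

n_wordforms = len(lexicon)
n_lemmas = len({lemma for analyses in lexicon.values() for lemma, _ in analyses})
n_triples = sum(len(analyses) for analyses in lexicon.values())
print(n_wordforms, n_lemmas, n_triples)  # 2 1 3
```

A single wordform such as "knjige" can carry several analyses, which is why the number of triples exceeds the number of wordforms.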

This resource was developed as part of the ParCoLab project by Aleksandra Miletic (UMR 5263 CLLE-ERSS, CNRS & Université Toulouse – Jean Jaurès, France).

Licence
Some rights are reserved. wikimorph-sr is distributed under a Creative Commons BY-SA 3.0 licence.

Downloads

Lexicon
PDF documentation in English

References

Miletic, Aleksandra. (2017). Building a morphosyntactic lexicon for Serbian from Wiktionary. Actes de la 6e édition des Journées d’étude toulousaines (JéTou2017). Toulouse, France.

Acknowledgements

Many thanks to Franck Sajous (UMR 5263 CLLE, CNRS & Université de Toulouse – Jean Jaurès) for sharing his experience of working with Wiktionary.