LatinCy Loves Data, Part 1: UD Latin Treebanks

The first in a five-part series highlighting key datasets behind LatinCy, starting with the six Universal Dependencies Latin treebanks and the LASLA corpus.
Tags: latincy, latin, treebanks, universal-dependencies, lasla
Author: Patrick J. Burns
Published: February 9, 2026

Last week I saw an email from NYU Libraries announcing its annual Love Data Week, with the theme “Where’s the data?”. As it happens, especially with the LatinCy Annual Meeting coming up at the end of the month, I have been immersed in Latin datasets as I prepare the next release of the LatinCy pipelines. Putting these two things together, I thought it would be good to highlight five key datasets, one a day, for a kind of LatinCy Loves Data Week.

When it comes to training spaCy pipelines, there is an obvious place to start: the Universal Dependencies (UD) treebanks. Back in 2019, when I first started thinking about developing a Latin spaCy model, my entry point was (and in some ways still is) this spaCy project.

UD is, in its own words, “a framework for consistent annotation of grammar…across different human languages,” with origins in treebanking projects at Stanford NLP and Google. (A full history can be found here.) Basically, UD supports sentence- and token-level annotations of lemmas, part-of-speech and morphological tags, dependency relationships, and other useful linguistic data. At present there are six Latin treebanks at UD, each covering a different subset of the language, whether based on period, author, or genre.

So, UD Perseus covers classical literary Latin, while the Late Latin Charter Treebank (LLCT) covers medieval charters. PROIEL also covers some classical literary Latin but also has significant coverage of biblical texts. Index Thomisticus Treebank (ITTB) covers the works of Thomas Aquinas. UDante covers the Latin works of Dante. A sixth treebank, CIRCSE, originates from a collaboration between the LiLa: Linking Latin project and the LASLA laboratory.

Table 1: UD Latin treebanks summary
Treebank Sentences Tokens
Perseus 2,273 29,858
PROIEL 18,132 217,722
ITTB 26,977 458,152
LLCT 9,023 243,681
UDante 1,721 55,531
CIRCSE 1,664 26,563

LatinCy tries to leverage as many UD annotations as possible to improve performance on tasks like lemmatization, POS tagging, and dependency parsing, among others. The advantage of working with multiple treebanks is one of numbers. The classifiers beneath the pipeline components benefit from seeing more features in as many contexts as possible. A fairly common word like dico only appears in the Perseus treebank 95 times; in all six together, we have 7,137 instances on which to base future predictions. That represents an enormous increase in the number of contexts!
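
The pooling effect can be sketched in a few lines of Python. This is a minimal illustration and not the LatinCy training code: the two tiny "treebanks" are invented, abbreviated stand-ins for real CoNLL-U files, and count_lemmas is a hypothetical helper name.

```python
from collections import Counter

def count_lemmas(conllu_text):
    """Count lemmas (third column) in a CoNLL-U string, skipping comments and blank lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        if not cols[0].isdigit():
            continue
        counts[cols[2]] += 1
    return counts

# Invented toy data; real treebank rows have ten tab-separated columns
TOY_A = "# text = dicit\n1\tdicit\tdico\tVERB\n"
TOY_B = "# text = dixit dicens\n1\tdixit\tdico\tVERB\n2\tdicens\tdico\tVERB\n"

per_treebank = {"toy_a": count_lemmas(TOY_A), "toy_b": count_lemmas(TOY_B)}

# Pool the per-treebank counts into a single Counter
pooled = sum(per_treebank.values(), Counter())

print(per_treebank["toy_a"]["dico"])  # occurrences in one treebank
print(pooled["dico"])                 # occurrences across all treebanks
```

Run over the real files, the same kind of tally is what separates a per-treebank count from a pooled count for a lemma like dico.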

The difficulty with using all six treebanks is that they all have slightly different annotation strategies and implementations. As such, the treebanks need to be harmonized. So, if one treebank tags sum as VERB and another tags it as AUX, it becomes that much harder for a classifier to assign future tags correctly and consistently. Some treebanks capitalize the first word of sentences, some do not. Some end sentences with punctuation, some do not. Some use consonantal u, others use v. I have been chipping away at these problems for years with each LatinCy release; a previous example of the harmonization code can be found here (the process is also described in the “4.1 Preprocessing” section of the LatinCy preprint). Federica Gamba and Daniel Zeman have also been making inroads in harmonizing the Latin treebanks with respect to both morphological and syntactic annotations. This is ongoing and important work if we want to train Latin language models on sufficient amounts of quality data.
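
To make the kinds of mismatch listed above concrete, here is a minimal sketch of what token-level harmonization might look like. It is not the LatinCy harmonization code: the function names are hypothetical, and the target conventions (lowercase forms, u for both u and v, AUX for sum, no sentence-final punctuation) are assumptions chosen only for illustration.

```python
def harmonize_token(form, lemma, upos):
    """Apply illustrative normalizations to one token (assumed target conventions)."""
    # Assumption: lowercase everything and write both u and v as "u"
    form = form.lower().replace("v", "u")
    lemma = lemma.lower().replace("v", "u")
    # Assumption: tag every form of "sum" as AUX, whatever the source treebank used
    if lemma == "sum":
        upos = "AUX"
    return form, lemma, upos

def harmonize_sentence(tokens):
    """Harmonize a list of (form, lemma, upos) tuples and drop sentence-final punctuation."""
    out = [harmonize_token(*token) for token in tokens]
    if out and out[-1][2] == "PUNCT":
        out = out[:-1]
    return out

# Invented example sentence with a capital, a consonantal v, sum as VERB, and a final period
sentence = [
    ("Vir", "vir", "NOUN"),
    ("bonus", "bonus", "ADJ"),
    ("est", "sum", "VERB"),
    (".", ".", "PUNCT"),
]
print(harmonize_sentence(sentence))
```

In a real treebank, dropping or adding a token also means renumbering token ids and dependency heads, which this sketch leaves out.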

We should also mention here the CIRCSE/LASLA: LASLA Corpus. The LASLA files are not, strictly speaking, treebanks, as they do not have dependency data. At the same time, they have other key annotations (e.g. lemma, POS, morph), and a lot of them. This collection covers around 20 authors and consists of more than 1.8 million annotated tokens in all, an amazing data source for these NLP features.

For the data-format curious, here is the basic structure of a UD treebank, using a sentence from CIRCSE’s Hercules Furens:

# sent_id = Latin_SenecaYounger_HercF_poetry-282
# text = certus inclusos tenet locus nocentes
# speaker = Amphitryon
1   certus  certus  ADJ C1  Case=Nom|Degree=Pos|Gender=Masc|InflClass=IndEurO|Number=Sing   4   amod    _   LiLaflcat=n6
2   inclusos    includo VERB    B3  Aspect=Perf|Case=Acc|Degree=Pos|Gender=Masc|InflClass=LatX|InflClass[nominal]=IndEurO|Number=Plur|VerbForm=Part|Voice=Pass  3   advcl:pred  _   LiLaflcat=v3
3   tenet   teneo   VERB    B2  Aspect=Imp|InflClass=LatE|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act   0   root    _   LiLaflcat=v2
4   locus   locus   NOUN    A2  Case=Nom|Gender=Masc|InflClass=IndEurO|Number=Sing  3   nsubj   _   LiLaflcat=n2
5   nocentes    nocens  ADJ C5  Case=Acc|Degree=Pos|Gender=Masc|InflClass=IndEurI|Number=Plur   3   obj _   LiLaflcat=n7

You can see the metadata (a collection-specific identifier for the sentence, the text of the sentence, and the speaker of the line in the tragedy) at the top of the record, marked with the # character. The text of the sentence is then reproduced one token per line with the following annotations: running per-sentence token index, word form, lemma, universal part-of-speech tag, project-specific POS tag, morphological features, three columns of dependency information (head, relation, and enhanced dependencies), and finally a catch-all miscellaneous column. Each treebank includes slightly different metadata fields and presents the token-level annotation with its own system, but the basic CoNLL-U structure is shared between all of the Latin UD treebanks.
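
For readers who want to see that layout as a data structure, here is a minimal standard-library sketch that reads one record into a metadata dictionary and a list of token dictionaries. The field names follow the ten CoNLL-U columns; parse_feats and parse_sentence are hypothetical helper names, and the sketch assumes tab-separated columns, as in the actual UD files.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_feats(value):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))

def parse_sentence(block):
    """Split one CoNLL-U sentence block into (metadata, tokens)."""
    metadata, tokens = {}, []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            # Comment lines look like "# key = value"
            key, _, value = line[1:].partition("=")
            metadata[key.strip()] = value.strip()
        else:
            token = dict(zip(FIELDS, line.split("\t")))
            token["feats"] = parse_feats(token["feats"])
            tokens.append(token)
    return metadata, tokens

# Two token lines from the record above, with columns joined by tabs
record = "\n".join([
    "# sent_id = Latin_SenecaYounger_HercF_poetry-282",
    "# text = certus inclusos tenet locus nocentes",
    "\t".join(["3", "tenet", "teneo", "VERB", "B2",
               "Aspect=Imp|InflClass=LatE|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
               "0", "root", "_", "LiLaflcat=v2"]),
    "\t".join(["4", "locus", "locus", "NOUN", "A2",
               "Case=Nom|Gender=Masc|InflClass=IndEurO|Number=Sing",
               "3", "nsubj", "_", "LiLaflcat=n2"]),
])

metadata, tokens = parse_sentence(record)
print(metadata["sent_id"])
for tok in tokens:
    print(tok["id"], tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

A head of 0 marks the root of the sentence (here tenet), while locus points back to token 3 as its subject.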

One way to get started with UD treebanks is by downloading them from GitHub and processing them with the CoNLL-U Parser. That said, both the Latin UD files and the LASLA files (the latter via Zenodo) are supported by the just-released LatinCy Readers project.

Check back here for another LatinCy Loves Data post tomorrow on the word vectors used throughout the LatinCy project.

Further Reading

Burns, Patrick J. 2023. “LatinCy: Synthetic Trained Pipelines for Latin NLP.” May 7, 2023. https://arxiv.org/abs/2305.04365v1.
Cecchini, Flavio Massimiliano, Marco Passarotti, Paola Marongiu, and Daniel Zeman. 2018. “Challenges in Converting the Index Thomisticus Treebank into Universal Dependencies.” In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), edited by Marie-Catherine de Marneffe, Teresa Lynn, and Sebastian Schuster, 27–36. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6004.
Cecchini, Flavio M., Rachele Sprugnoli, Giovanni Moretti, and Marco Passarotti. 2020. “UDante: First Steps Towards the Universal Dependencies Treebank of Dante’s Latin Works.” In Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020, edited by Felice Dell’Orletta, Johanna Monti, and Fabio Tamburini, 99–105. Accademia University Press. https://doi.org/10.4000/books.aaccademia.8653.
Denooz, Joseph. 1978. L’ordinateur et le latin: Techniques et méthodes. Université de Liège: L.A.S.L.A. http://orbi.ulg.ac.be/handle/2268/1059.
Eckhoff, Hanne, Kristin Bech, Gerlof Bouma, Kristine Eide, Dag Haug, Odd Einar Haugen, and Marius Jøhndal. 2018. “The PROIEL Treebank Family: A Standard for Early Attestations of Indo-European Languages.” Language Resources and Evaluation 52 (1): 29–65. https://doi.org/10.1007/s10579-017-9388-5.
Gamba, Federica, and Daniel Zeman. 2023a. “Universalising Latin Universal Dependencies: A Harmonisation of Latin Treebanks in UD.” In Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023), edited by Loïc Grobol and Francis Tyers, 7–16. Washington, D.C.: Association for Computational Linguistics. https://aclanthology.org/2023.udw-1.2.
———. 2023b. “Latin Morphology Through the Centuries: Ensuring Consistency for Better Language Processing.” In Proceedings of the Ancient Language Processing Workshop, edited by Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, and Marco C. Passarotti, 59–67. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria. https://aclanthology.org/2023.alp-1.7/.
Mambrini, Francesco, and Marco Passarotti. 2019. “Linked Open Treebanks: Interlinking Syntactically Annotated Corpora in the LiLa Knowledge Base of Linguistic Resources for Latin.” In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 74–81. https://aclanthology.org/W19-7808.pdf.
Marneffe, Marie-Catherine de, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. “Universal Dependencies.” Computational Linguistics 47 (2): 255–308. https://doi.org/10.1162/coli_a_00402.