Datasets

We have released the following linguistic datasets so far. All of them except MECORE-EN annotate fine-grained semantic properties of lexical items in a specific empirical domain, based on original cross-linguistic elicitation data. MECORE-EN instead records naturally occurring examples of clausal embedding in English, annotated with their syntactic specifications. See the associated publications for details.

Name          Empirical Domain             Sample Languages  Data Source          Publication
MultiCoS      connectives                  24 languages      Elicitation          LREC 2026
MECORE-EN     clause-embedding predicates  English           Web-crawled corpora  SCiL 2025
LiSU-Modals   modal auxiliaries            24 languages      Elicitation          Linguistic Variation 2024
MECORE-XLing  clause-embedding predicates  14 languages      Elicitation          SigTyp 2023