Datasets

We have published the following linguistic datasets so far. All datasets except MECORE-EN annotate semantic properties of lexical items in a specific category, based on original elicited judgments. MECORE-EN records naturally occurring examples of clausal embedding in English, annotated with their syntactic specifications. Please refer to the associated publications for details.

Name          Empirical Domain             Sample Languages  Data Source          Publication
MultiCoS      connectives                  24 languages      Elicitation          LREC 2026
MECORE-EN     clause-embedding predicates  English           Web-crawled corpora  SCiL 2025
LiSU-Modals   modal auxiliaries            24 languages      Elicitation          Linguistic Variation 2024
MECORE-XLing  clause-embedding predicates  14 languages      Elicitation          SigTyp 2023