Download CEFR readability datasets
As part of our pilot project for the European Language Grid, EDIA has developed several datasets that can be used to train AI models for CEFR readability classification. These datasets consist of texts from various sources, each labelled with its CEFR readability level.
Please fill in the form below to get access to the datasets. The datasets are available for non-commercial, academic purposes only (CC-BY-NC).
Citing
When citing these resources in your research, please use:
Breuker, M. (2023). CEFR Labelling and Assessment Services. In: Rehm, G. (ed.) European Language Grid. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-17258-8_16
Readability API
Using these datasets, we have built several CEFR text classification models, which are available through our Readability API. For more information, see our developer documentation.