Diachronic corpus documentation

13/03/2018

Introduction

The corpus contains 86 Spanish texts provided by the Biblioteca Virtual Miguel de Cervantes, printed between 1482 and 1647; it covers a representative variety of authors and genres (such as prose, theatre, and verse). It is one of the few collections of historical Spanish distributed under an open license, and it is available at the IMPACT website.
The BVC section of the impact-es diachronic corpus of historical Spanish compiles 86 books containing approximately 2 million words. About 27% of the words, which provide representative coverage of the most frequent word forms, have been annotated with their lemma, part of speech, and modern equivalent, following the Text Encoding Initiative guidelines. We describe how this type of annotation can be exploited to provide linguistically enhanced search over historical documents. The advanced search supports queries whose terms can combine surface forms, lemmata, parts of speech, and modern forms of historical variants.
The morphological categories considered are abbreviation, adjective, adverb, conjunction, determiner, interjection, noun, proper noun, numeral, preposition, pronoun, relative pronoun, and verb. The annotation process was assisted by the CoBaLT tool, which supports complex annotations.

The Query Language and Interface

The search engine interface is available at [link], where multiple query terms can be specified. Each term can be preceded by a prefix:
  • If no prefix is added, the term denotes a diachronic form (verbatim text).
  • The prefix modern# denotes a modern form.
  • The prefix lemma# is followed by a lemma.
  • The prefix pos# denotes a part-of-speech tag.
Multi-term queries can mix different prefixes and use the rich query language provided by Lucene, the open-source Java library for information retrieval. Words or text segments matching the query are highlighted and presented in their context (as a snippet).
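The prefix convention above can be illustrated with a minimal Python sketch. It is not the search engine's actual implementation: the function names and the "form" label for unprefixed terms are hypothetical, and the sketch only splits a whitespace-separated query into (field, value) pairs, ignoring any Lucene operators.

```python
# Prefixes named in the documentation. "form" is a hypothetical label
# used here for unprefixed (verbatim, diachronic) terms.
PREFIXES = ("modern", "lemma", "pos")


def parse_term(term):
    """Split one query term into a (field, value) pair."""
    prefix, sep, value = term.partition("#")
    if sep and prefix in PREFIXES:
        return prefix, value
    # No recognised prefix: treat the whole term as a verbatim form.
    return "form", term


def parse_query(query):
    """Parse a whitespace-separated query into (field, value) pairs."""
    return [parse_term(term) for term in query.split()]


print(parse_query("lemma#haber modern#de pos#verb"))
print(parse_query("celebrada"))
```

A term with an unknown prefix (for instance foo#bar) falls back to the verbatim field in this sketch; how the real interface treats such input is not stated in this documentation.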
For example, the word form celebrada generates 5 entries:
  • lemma#celebrar
  • pos#verb
  • modern#celebrada
  • lemma#celebrado
  • pos#adj
The word form yerro generates 7 entries:
  • lemma#yerro
  • pos#n
  • modern#yerro
  • lemma#hierro
  • modern#hierro
  • lemma#errar
  • pos#verb
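One way to read the two lists above is that each word form carries several analyses (lemma, part of speech, modern form), and that every distinct prefixed value becomes one entry. The Python sketch below works under that assumption. The grouping of values into analyses is a guess that happens to reproduce the listed entries, not something stated in the source, and the function name is hypothetical.

```python
def index_entries(analyses):
    """Expand (lemma, pos, modern) analyses into distinct prefixed entries.

    Entries are kept in first-seen order and duplicates are dropped.
    """
    entries = []
    for lemma, pos, modern in analyses:
        for entry in (f"lemma#{lemma}", f"pos#{pos}", f"modern#{modern}"):
            if entry not in entries:
                entries.append(entry)
    return entries


# Assumed analyses (illustrative grouping of the values listed above).
CELEBRADA = [
    ("celebrar", "verb", "celebrada"),
    ("celebrado", "adj", "celebrada"),
]
YERRO = [
    ("yerro", "n", "yerro"),
    ("hierro", "n", "hierro"),
    ("errar", "verb", "yerro"),
]

print(len(index_entries(CELEBRADA)), index_entries(CELEBRADA))
print(len(index_entries(YERRO)), index_entries(YERRO))
```

Under these assumed analyses, celebrada yields 5 entries and yerro yields 7, because shared values such as pos#n or modern#yerro are counted once.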

The following figure shows the results for the query lemma#haber modern#de pos#verb:

[Figure: corpus-diacronico-resultados]
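To make the example query concrete, the following Python sketch matches prefixed terms against a short annotated token sequence. Several things here are assumptions: the source does not say whether the three terms are matched as consecutive tokens or independently, so the sketch picks the consecutive reading; the sample tokens and their annotations (including the "prep" tag) are invented for illustration; and each token is given a single analysis for simplicity.

```python
def token_matches(token, term):
    """Check one annotated token against one (possibly prefixed) term."""
    prefix, sep, value = term.partition("#")
    if sep and prefix in ("modern", "lemma", "pos"):
        return token.get(prefix) == value
    # Unprefixed terms are compared with the verbatim form.
    return token.get("form") == term


def find_sequences(tokens, query):
    """Return the text of every run of consecutive tokens matching the query."""
    terms = query.split()
    hits = []
    for start in range(len(tokens) - len(terms) + 1):
        window = tokens[start:start + len(terms)]
        if all(token_matches(tok, term) for tok, term in zip(window, terms)):
            hits.append(" ".join(tok["form"] for tok in window))
    return hits


# Hypothetical sample tokens, not taken from the corpus.
SAMPLE = [
    {"form": "ha", "lemma": "haber", "pos": "verb", "modern": "ha"},
    {"form": "de", "lemma": "de", "pos": "prep", "modern": "de"},
    {"form": "ser", "lemma": "ser", "pos": "verb", "modern": "ser"},
    {"form": "yerro", "lemma": "yerro", "pos": "n", "modern": "yerro"},
]

print(find_sequences(SAMPLE, "lemma#haber modern#de pos#verb"))
```

In this sketch the query selects any form of haber, followed by the modern form de, followed by any verb, which is the kind of pattern the combined prefixes are meant to express.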

How to reference the corpus

  • Carrasco, R. C., Martínez-Sempere, I., Mollá-Gandía, E., Sánchez-Martínez, F., Candela, G. and Escobar, P. (2015). Linguistically-Enhanced Search over an Open Diachronic Corpus. In Hanbury, A., Kazai, G., Rauber, A. and Fuhr, N. (Eds.), Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science (Vol. 9022, pp. 801-804). Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_89
  • Sánchez-Martínez, F., Martínez-Sempere, I., Ivars-Ribes, X. and Carrasco, R. C. (2013). An open diachronic corpus of historical Spanish. Language Resources and Evaluation, 47, 1327-1342. https://doi.org/10.1007/s10579-013-9239