Phonetic Corpus

Phonetic Corpus of Estonian Spontaneous Speech

The aim of the corpus is to compile a large collection of high-quality recordings of spontaneous Estonian speech and to segment them phonetically at several levels. The project started in autumn 2006.

Structure of the corpus

  • The corpus will total approximately 20 hours of speech from 20 male and 20 female speakers with different dialectal and social backgrounds. The speakers come from different age groups (12 young speakers, 8 young middle-aged speakers, 8 middle-aged speakers and 12 elderly speakers). They are invited to participate in person. To keep the situation spontaneous, dialogues are recorded between speakers who already know each other. Both speakers take an active part in the dialogue, although one of them leads the discourse.
  • Every participant fills in a questionnaire about his or her background. To keep the participants anonymous, the speakers are coded in the corpus. A speaker who participates in several recordings keeps the same code name.
  • Both monologues and dialogues are recorded. Most of the recordings are made in a recording studio, and some during fieldwork. The signal of each speaker is recorded on a separate channel. For the studio recordings large-diaphragm microphones are used; the speakers sit about 3 meters apart to minimize the effect of overlapping speech. For the fieldwork recordings headset microphones are used. Recordings are saved in uncompressed PCM WAV format. Background information about the recordings is collected in a text file.
  • Segmentation and annotation files are saved as Praat TextGrid files and have the same file names as the recordings they annotate.
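
Because each TextGrid file shares its name with the recording it annotates, recordings and annotations can be matched by file name alone. The following is a minimal sketch of that idea, not the project's own tooling: the function names, the directory layout and the `.wav` / `.TextGrid` extensions are assumptions made for illustration. It also reads the channel count, since each speaker is recorded on a separate channel.

```python
import wave
from pathlib import Path


def pair_recordings(directory):
    """Match each .wav recording with the .TextGrid file that shares its stem.

    Returns (paired, unannotated): a list of (wav_path, textgrid_path) tuples
    and a list of recordings for which no TextGrid file was found.
    """
    directory = Path(directory)
    # Index the annotation files by file name without extension.
    textgrids = {p.stem: p for p in directory.glob("*.TextGrid")}
    paired, unannotated = [], []
    for wav_path in sorted(directory.glob("*.wav")):
        grid = textgrids.get(wav_path.stem)
        if grid is None:
            unannotated.append(wav_path)
        else:
            paired.append((wav_path, grid))
    return paired, unannotated


def channel_count(wav_path):
    """Return the number of channels of an uncompressed PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnchannels()
```

Under these assumptions, a two-speaker dialogue would be expected to report two channels, and any recording listed as unannotated has not yet been segmented.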

Segmentation and annotation

Segmentation and annotation are done with the Praat program. Recordings are segmented manually at several levels (an automatic segmentation program is also being developed and tested).
The following tiers are used:

  • phonetic and linguistic tiers: words (in orthographic spelling), speech sounds (transcribed in SAMPA adjusted for Estonian), sound structures (CV structures), syllables (short – long, open – closed), feet, utterances;
  • dialogue units: turns and pauses;
  • fillers;
  • changes in voice quality (e.g. creaky or breathy voice, whisper);
  • paralinguistic phenomena (e.g. expiration and inspiration (including speaking during inspiration), sighing, yawning, sneezing, coughing etc.);
  • emotional states (e.g. laughter, weeping, whimpering);
  • other tiers (e.g. smacking of the lips or tongue).
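
To illustrate how parallel tiers relate to each other, the sketch below models a tier as a list of labelled time intervals and collects the intervals of one tier that fall within a segment of another. It is only an assumption about how such data might be handled after export from Praat. The tier names, time values and labels are invented for the example, and the sound labels are placeholders rather than the corpus's actual SAMPA transcription.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    label: str


def overlapping(tier, start, end):
    """Return the non-empty intervals of `tier` that overlap [start, end)."""
    return [iv for iv in tier
            if iv.label and iv.start < end and iv.end > start]


# Hypothetical annotation of one word on three tiers.
tiers = {
    "words": [Interval(0.00, 0.42, "kala")],
    "sounds": [
        Interval(0.00, 0.08, "k"),
        Interval(0.08, 0.20, "a"),
        Interval(0.20, 0.30, "l"),
        Interval(0.30, 0.42, "a"),
    ],
    "cv": [
        Interval(0.00, 0.08, "C"),
        Interval(0.08, 0.20, "V"),
        Interval(0.20, 0.30, "C"),
        Interval(0.30, 0.42, "V"),
    ],
}

# Collect the sound and CV labels that belong to the first word.
word = tiers["words"][0]
sounds = [iv.label for iv in overlapping(tiers["sounds"], word.start, word.end)]
cv_pattern = "".join(
    iv.label for iv in overlapping(tiers["cv"], word.start, word.end)
)
```

Keeping every tier on the same time axis is what makes this kind of cross-tier lookup possible: a word, its sounds and its CV structure are linked only by their start and end times.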

Using the corpus

Currently the web-based search engine lets you search by the orthographic form or the phonetic transcription of a word-level segment.
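
The kind of query described above can be sketched as a filter over word-level segments. This is not the search engine's implementation, which is not documented here; the `WordSegment` record, its field names and the sample data are all hypothetical, and the transcription strings are placeholders.

```python
from typing import NamedTuple


class WordSegment(NamedTuple):
    recording: str      # hypothetical recording name
    start: float        # seconds
    end: float          # seconds
    orthography: str    # word in orthographic spelling
    transcription: str  # placeholder phonetic transcription


def search(segments, query, field="orthography"):
    """Return the word segments whose chosen field equals `query` exactly."""
    if field not in ("orthography", "transcription"):
        raise ValueError("field must be 'orthography' or 'transcription'")
    return [seg for seg in segments if getattr(seg, field) == query]


# Invented sample data for illustration only.
segments = [
    WordSegment("rec01", 0.00, 0.42, "kala", "kala"),
    WordSegment("rec01", 0.55, 0.90, "maja", "maja"),
    WordSegment("rec02", 1.10, 1.48, "kala", "kal"),
]

by_spelling = search(segments, "kala")
by_sound = search(segments, "kal", field="transcription")
```

Searching by transcription rather than spelling finds the pronunciation variants of a word, which in spontaneous speech often differ from its written form.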

If the web-based search options do not fit your needs or you have other questions related to the phonetic corpus of Estonian spontaneous speech, please write to the corpus administrator Pärtel Lippus: partel.lippus [ät]

or via ordinary mail:

Pärtel Lippus
Department of Estonian and Finno-Ugric Linguistics
University of Tartu
Ülikooli 18
Tartu 50090

The project is funded by: EKKTT

Last edited: 2012-09-28 16:49:38