Segmentation principles

The choice of segmentation levels was guided by the annotation guide by Mietta Lennes and Sanna Ahjoniemi.

Phonetic and linguistic levels

1)    words are written in their orthographic form. If the speaker breaks off in the middle of a word, the word ends with a hyphen (e.g. sinna > sin-). Compound words are marked with "+" (e.g. kauba+maja). Silent pauses are marked with "#". Filled pauses and other non-lexical items start with "." (period): .sisse (inhale), .välja (exhale), .naer (laughter), and non-lexical words such as .ee, .aa, etc. Non-linguistic data will later be copied to a separate segmentation level.
2)    sounds are labeled in SAMPA (Speech Assessment Methods Phonetic Alphabet) transcription (see the pages on SAMPA in general and on the use of SAMPA in the Phonetic Corpus). Only lexical words are segmented.
3)    sound structure (CV) – the sound-level segmentation is copied, with vowels labeled "V" and consonants labeled "C".
4)    syllables – each syllable is labeled with its type, LL (short, open), PL (long, open) or PK (long, closed), preceded by its position counted from the root (in compound words the count restarts at the root of each component). E.g. kau|ba|ma|ja – 1PL|2LL|1LL|2LL.

5)    phrases – at the moment this level is segmented by a Praat script that finds all portions of speech between pauses.
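
The marking conventions in the list above can be sketched in a few lines of Python. This is only an illustration, not the corpus tooling: the function names are hypothetical, the combined "kau|ba+ma|ja" notation is an assumption made to show where the syllable count restarts, and the phrase grouping assumes that only silent pauses ("#") delimit phrases, which the Praat script may define differently.

```python
from typing import List

PAUSE = "#"  # silent pause label on the word level


def classify_word_label(label: str) -> str:
    """Classify one word-level label by the marking conventions above."""
    if label == PAUSE:
        return "silent_pause"
    if label.startswith("."):
        return "non_lexical"      # filled pause, breathing, laughter, ...
    if label.endswith("-"):
        return "truncated_word"   # speaker broke off mid-word, e.g. "sin-"
    if "+" in label:
        return "compound_word"    # e.g. "kauba+maja"
    return "word"


def number_syllables(syllabified: str, types: List[str]) -> List[str]:
    """Prefix each syllable type with its position, restarting at every '+'.

    `syllabified` uses '|' between syllables and '+' between compound parts
    (an assumed notation for this sketch); `types` holds one of LL, PL, PK
    per syllable, in order.
    """
    labels = []
    type_iter = iter(types)
    for part in syllabified.split("+"):
        for position, _syllable in enumerate(part.split("|"), start=1):
            labels.append(f"{position}{next(type_iter)}")
    return labels


def group_phrases(labels: List[str]) -> List[List[str]]:
    """Split a label sequence into stretches of speech between silent pauses."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for label in labels:
        if label == PAUSE:
            if current:           # close the phrase that the pause ends
                phrases.append(current)
            current = []
        else:
            current.append(label)
    if current:                   # trailing speech without a final pause
        phrases.append(current)
    return phrases
```

For instance, number_syllables("kau|ba+ma|ja", ["PL", "LL", "LL", "LL"]) reproduces the labels 1PL, 2LL, 1LL, 2LL from the example in item 4.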

Voice quality

Voice quality is annotated with the following labels: creaky – .?, breathy – .Hv, whisper – .0, falsetto – .F.

Paralinguistic levels

Breathing – .sisse, .välja; sighing – .ohe; yawning – .haigutus; sneezing – .aevastus; coughing – .köha or .köhatus; swallowing – .neelatus, etc.


Emotions: laughing – .naer or .naerdes; crying – .nutt or .nuttes; whimpering – .nuuksatus, etc.
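
When working with the annotations, the dot-prefixed labels for voice quality and paralinguistic events can be collected in a lookup table. The sketch below is an assumption about how one might do this in Python, not part of the corpus tools; the names VOICE_QUALITY, PARALINGUISTIC and describe_label are hypothetical. Because the lists end with "etc.", the tables are not exhaustive, so unknown labels return None instead of raising an error.

```python
from typing import Optional, Tuple

# Labels copied from the lists in this document; the inventory is open-ended.
VOICE_QUALITY = {
    ".?": "creaky",
    ".Hv": "breathy",
    ".0": "whisper",
    ".F": "falsetto",
}

PARALINGUISTIC = {
    ".sisse": "breathing (inhale)",
    ".välja": "breathing (exhale)",
    ".ohe": "sighing",
    ".haigutus": "yawning",
    ".aevastus": "sneezing",
    ".köha": "coughing",
    ".köhatus": "coughing",
    ".neelatus": "swallowing",
    ".naer": "laughing",
    ".naerdes": "laughing",
    ".nutt": "crying",
    ".nuttes": "crying",
    ".nuuksatus": "whimpering",
}


def describe_label(label: str) -> Optional[Tuple[str, str]]:
    """Return (level, English gloss) for a known label, or None otherwise."""
    if label in VOICE_QUALITY:
        return ("voice quality", VOICE_QUALITY[label])
    if label in PARALINGUISTIC:
        return ("paralinguistic", PARALINGUISTIC[label])
    return None  # label not listed in this document, e.g. ".ee"
```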

