Data processing
Our spoken resource projects are primarily concerned with continuous speech, including adult conversational speech, adult interview speech, child repetitive speech, and child narrative speech. Long stretches of speech are segmented into interpause units (IPUs) according to disjuncture cues such as pauses and paralinguistic sounds (e.g., breathing, inhalation, and laughter). The ILAS phone aligner is then applied to automatically derive phoneme and syllable boundaries from the transcript text. Signal-aligned word boundaries are obtained by combining the signal-based syllable boundaries with the text-based word segmentation results.

On the text side, transcripts of the IPUs are processed by the CKIP automatic word segmentation and POS tagging system. The word segmentation results are then integrated with the phone boundaries generated by the ILAS phone aligner to locate word boundaries in the speech signal.

Subsequently, the boundaries of paralinguistic sounds are examined manually to eliminate misalignments that arise when voiced paralinguistic sounds are confused with adjacent speech sounds. After this correction and a second round of forced alignment, the resulting word boundaries are verified manually by professional labelers, who also correct word segmentation errors caused by unknown words, fragmentary utterances, and disfluencies. Homographs, which are very common in Chinese, are disambiguated where necessary by referring to the Chinese Spoken Wordlist, which contains manually corrected phonetic transcriptions. This step is essential, because phonetic transcriptions (including tone labels) incorrectly converted from Chinese characters would directly lead to incorrect phone alignment. Once manual editing is complete, a final round of forced alignment produces the multilayer linguistic annotation, which comprises five levels: IPU, word, POS, syllable, and phoneme.
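The integration of signal-based syllable boundaries with text-based word segmentation can be sketched as follows. This is a minimal illustration, not the actual ILAS or CKIP interface: the data shapes (time spans as second pairs, words paired with their syllable counts) and the function name are assumptions made for the example.

```python
# Hedged sketch: deriving word boundaries in the speech signal by
# combining syllable time spans (from the phone aligner) with the
# word segmentation of the transcript (from the word segmenter).
# Data formats are illustrative only.

def word_boundaries(syllable_spans, segmented_words):
    """syllable_spans: list of (start_sec, end_sec), one per syllable,
    in transcript order.
    segmented_words: list of (word, n_syllables) in the same order.
    Returns a list of (word, start_sec, end_sec)."""
    boundaries = []
    i = 0  # index into syllable_spans
    for word, n_syl in segmented_words:
        span = syllable_spans[i:i + n_syl]
        # A word starts where its first syllable starts and ends
        # where its last syllable ends.
        boundaries.append((word, span[0][0], span[-1][1]))
        i += n_syl
    assert i == len(syllable_spans), "segmentation must cover all syllables"
    return boundaries
```

The sketch also shows why incorrect word segmentation (e.g., from unknown words or disfluencies) propagates directly into wrong word boundaries: a wrong syllable count shifts every subsequent span.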
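The five annotation levels (IPU, word, POS, syllable, phoneme) nest naturally, and one way to picture the final multilayer annotation is as a hierarchy of records. The class and field names below are illustrative assumptions, not the corpus's actual storage format.

```python
# Hedged sketch of the multilayer annotation hierarchy: an IPU contains
# words; each word carries its POS tag and syllables; each syllable
# contains phonemes, which carry the time stamps from forced alignment.
from dataclasses import dataclass


@dataclass
class Phone:
    label: str
    start: float  # seconds
    end: float


@dataclass
class Syllable:
    label: str
    phones: list  # list[Phone]

    @property
    def start(self):
        # Syllable timing is inherited from its first and last phones.
        return self.phones[0].start

    @property
    def end(self):
        return self.phones[-1].end


@dataclass
class Word:
    text: str
    pos: str          # POS tag from the tagging system
    syllables: list   # list[Syllable]


@dataclass
class IPU:
    words: list  # list[Word]
```

Because syllable (and hence word and IPU) times are derived from the phone level, correcting a phone boundary during manual verification automatically updates every level above it.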