
The 43-hour Sinica Taiwan Mandarin Conversational Corpus (TMC Corpus) consists of 30 free conversations between strangers (MCDC8 and MCDC22) and 29 topic-specific and 26 Map Task conversations (MTCC and MMTC) between people acquainted with each other, with average lengths of one hour, 20 minutes, and 10 minutes, respectively. The TMC Corpus has a balanced design with respect to scenario and conversation partner familiarity. Ninety-eight female and 72 male speakers aged 16 to 63 years were recorded. Twenty-six speakers took part in all three sub-corpus projects.

Conversations were recorded in quiet rooms at Academia Sinica using a SONY TCD-D10 Pro II DAT digital recorder and an Audio-Technica ATM 33a microphone at a sampling rate of 48 kHz, with each speaker on a separate channel. The speech content was orthographically transcribed in traditional Chinese characters. Particles, discourse markers, fillers, word fragments, and paralinguistic sounds, all of which occur frequently in Chinese conversation, are annotated in the transcripts.

Only MCDC8 has been manually checked for Pinyin and POS. The rest of the corpora provided in this system were processed automatically and should be used with caution. The corpus statistics are summarized as follows.


IPU: 81,237
Word
  Lexical words: 397,693 (15,105)
  1-syllabic words: 224,343 (1,580)
  2-syllabic words: 153,240 (9,705)
  3-syllabic words: 17,322 (2,942)
  Others: 2,788 (878)
  Discourse-related items: 175,318 (2,419)
  Discourse particles: 29,421 (36)
  Discourse markers: 12,164 (16)
  Fillers: 16,721 (34)
POS
  Verbs: 98,090 (6,261, 16)
  Adverbs: 80,190 (657, 64)
  Nouns: 75,559 (8,210, 7)
  Pronouns: 39,453 (50, 1)
  Determinatives: 24,865 (526, 5)
  Preposition: 14,464 (100, 1)
  Conjunctions: 17,950 (94, 4)
  Structural particles DE: 16,342 (5, 1)
  Classifiers: 12,969 (165, 1)
  Particles: 3,802 (22, 1)
  Adjectives: 813 (193, 1)
  Interjection: 8 (4, 1)
  Copula: 13,141 (3, 1)
  Foreign words: 1,470 (473, 1)
Character: 594,238 (2,952)
Syllable
  Tone-distinctive: 1,086
  No tone distinction: 403
Phoneme: 1,429,518
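
The TMC Corpus figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the corpus tooling: the variable names are hypothetical, and it assumes that the first figure in each row is a running count and the figure in parentheses is a count of distinct forms. Under that reading, the four syllable-length rows add up exactly to the "Lexical words" row, and the rounded average conversation lengths give roughly 44 hours, in line with the stated 43.

```python
# Figures copied from the TMC Corpus statistics above.
# Assumption: each pair is (running count, figure in parentheses).
lexical_words = (397_693, 15_105)
by_syllable_length = {
    "1-syllabic": (224_343, 1_580),
    "2-syllabic": (153_240, 9_705),
    "3-syllabic": (17_322, 2_942),
    "others": (2_788, 878),
}

# Sum both figures over the four syllable-length rows.
count_sum = sum(count for count, _ in by_syllable_length.values())
paren_sum = sum(paren for _, paren in by_syllable_length.values())
rows_match_total = (count_sum, paren_sum) == lexical_words

# Rough duration estimate from the rounded averages in the description:
# (number of conversations, average length in minutes).
conversations = {
    "free": (30, 60),
    "topic_specific": (29, 20),
    "map_task": (26, 10),
}
total_minutes = sum(n * minutes for n, minutes in conversations.values())
estimated_hours = total_minutes / 60

print(rows_match_total, total_minutes, estimated_hours)
```

Because the average lengths are rounded, the duration estimate is only approximate.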

The Sinica Sociophonetic Corpus was funded by the National Digital Archives Project. The purpose of this corpus project was to document and archive the contemporary use of spoken Taiwan Mandarin. Recording was conducted in twelve regions distributed across northern, central, and southern Taiwan: Yilan County, Taoyuan County, Hsinchu County, Taichung City, Nantou County, Yunlin County, Chiayi City, Changhua County, Tainan City, Kaohsiung City, Kaohsiung County, and Taipei City. A total of 1,402 interviews, mainly with individuals aged 20 to 40 years, were recorded in public places, e.g., parks, post offices, or banks, where we assumed we were most likely to find local people. The interviews were recorded using a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone, and digitized at a sampling rate of 44.1 kHz with 16-bit quantization.

Twenty-five questions in three categories were put to the interviewees, covering their language use, socioeconomic background, and use of the internet. Concerning language use, the questions on dialect exposure asked specifically how the spoken dialects, mainly Southern Min and Hakka, are used within the family, e.g., with parents and siblings. The questions on language ability asked how many languages the interviewees can speak and how proficient they are in each. Concerning socioeconomic background, data on age, gender, salary level, education level, and childhood residence were sought.

Individual interviews lasted from three to eight minutes. All interviews were conducted in Taiwan Mandarin. Only the speech produced by the interviewees was transcribed and processed; it was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The corpus statistics are summarized as follows.


IPU: 124,916
Word
  Lexical words: 284,196 (7,085)
  1-syllabic words: 133,354 (1,007)
  2-syllabic words: 129,060 (4,218)
  3-syllabic words: 20,348 (1,585)
  Others: 1,434 (275)
  Discourse-related items: 122,634 (718)
  Discourse particles: 28,928 (33)
  Discourse markers: 3,993 (12)
  Fillers: 28,826 (21)
POS
  Verbs: 58,894 (2,135, 16)
  Adverbs: 42,750 (367, 6)
  Nouns: 88,146 (4,423, 7)
  Pronouns: 10,020 (33, 1)
  Determinatives: 18,700 (362, 5)
  Preposition: 11,499 (66, 1)
  Conjunctions: 9,817 (64, 4)
  Structural particles DE: 6,655 (7, 1)
  Classifiers: 5,917 (76, 1)
  Particles: 6,324 (19, 1)
  Adjectives: 683 (102, 1)
  Interjection: 3 (2, 1)
  Copula: 10,275 (4, 1)
  Foreign words: 2,579 (235, 1)
Character: 458,320 (2,006)
Syllable
  Tone-distinctive: 929
  No tone distinction: 375
Phoneme: 1,102,753
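
For a sense of scale, the digitization settings quoted above (44.1 kHz, 16-bit) imply a fixed data rate for uncompressed audio. The sketch below is illustrative only: the function name is hypothetical, and it assumes single-channel linear PCM with no file header, which the description does not state for this corpus.

```python
def pcm_bytes(duration_seconds: float, sample_rate_hz: int = 44_100,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of raw linear PCM audio (no container header).

    Assumption: mono, uncompressed audio; the defaults mirror the
    44.1 kHz / 16-bit settings given in the corpus description.
    """
    bytes_per_sample = bits_per_sample // 8
    return int(duration_seconds * sample_rate_hz * bytes_per_sample * channels)


# Interviews are reported to last between three and eight minutes.
shortest = pcm_bytes(3 * 60)
longest = pcm_bytes(8 * 60)
print(f"{shortest / 1e6:.1f} MB to {longest / 1e6:.1f} MB per interview")
```

Under these assumptions a single interview occupies roughly 16 to 42 MB before any compression.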

The Sinica Child Speech Corpus was funded by the National Science Council and the Children’s Hearing Foundation. It contains repetitive and narrative speech data produced by seventy-nine preschool children with normal hearing (NH) aged 2;11 to 6;3 (median 5;0) and forty-five children with hearing impairment (HI) aged 3;3 to 12;5 (median 5;9). Among the HI children, thirty wore traditional hearing aids (with mild to profound degrees of hearing loss), and fifteen were fitted with a cochlear implant (with severe to profound degrees of hearing loss).

The HI children were recorded during their regular auditory-verbal therapy (AVT) sessions using the video equipment built into the sound-proof classrooms of the Children’s Hearing Foundation. Adobe Audition 1.0 was used to convert the video files into 44.1 kHz, 16-bit single-channel sound files. The NH children were recorded either in sound-proof studios at Academia Sinica or in quiet classrooms at their kindergarten, using a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone. These data were digitized at a sampling rate of 44.1 kHz with 16-bit quantization.

For the narrative speech data, the children were asked to tell the story of The Hare and the Tortoise with the help of picture cards that were presented to them in a fixed order. The speech content was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The corpus statistics are summarized as follows.


                            HI                  NH
IPU                         2,208               2,727
Word
  Lexical words             5,193 (503)         6,436 (559)
  1-syllabic words          3,002 (181)         3,863 (205)
  2-syllabic words          2,123 (276)         2,484 (305)
  3-syllabic words          61 (40)             86 (46)
  Others                    7 (6)               3 (3)
  Discourse-related items   2,778 (42)          3,695 (51)
  Discourse particles       75 (14)             52 (16)
  Discourse markers         21 (4)              53 (8)
  Fillers                   56 (8)              214 (11)
POS
  Verbs                     1,612 (252, 16)     1,857 (287, 16)
  Adverbs                   1,038 (71, 6)       1,388 (75, 6)
  Nouns                     1,209 (135, 7)      1,254 (148, 7)
  Pronouns                  334 (10, 1)         650 (11, 1)
  Determinatives            251 (25, 5)         344 (28, 5)
  Preposition               155 (15, 1)         241 (20, 1)
  Conjunctions              79 (15, 4)          92 (15, 4)
  Structural particles DE   131 (2, 1)          107 (2, 1)
  Classifiers               163 (7, 1)          239 (8, 1)
  Particles                 170 (10, 1)         183 (10, 1)
  Adjectives                2 (1, 1)            2 (1, 1)
  Interjection              0                   0
  Copula                    49 (1, 1)           72 (2, 1)
Character                   7,467 (378)         9,102 (408)
Syllable
  Tone-distinctive          311                 349
  No tone distinction       215                 236
Phoneme                     17,046              21,022
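
The children's ages in the description above use the years;months notation common in child language research, so 2;11 means two years and eleven months. When working with such values it is often convenient to convert them to months. The helper below is a minimal sketch; its name is hypothetical and it is not part of the corpus distribution.

```python
def age_to_months(age: str) -> int:
    """Convert an age in 'years;months' notation (e.g. '2;11') to months."""
    years_text, months_text = age.split(";")
    years, months = int(years_text), int(months_text)
    if years < 0 or not 0 <= months < 12:
        raise ValueError(f"not a valid years;months age: {age!r}")
    return years * 12 + months


# Age ranges and medians quoted in the corpus description.
nh_ages = {"min": "2;11", "max": "6;3", "median": "5;0"}
hi_ages = {"min": "3;3", "max": "12;5", "median": "5;9"}

nh_months = {key: age_to_months(value) for key, value in nh_ages.items()}
hi_months = {key: age_to_months(value) for key, value in hi_ages.items()}
print(nh_months)
print(hi_months)
```

Expressed in months, the NH group spans 35 to 75 months (median 60) and the HI group spans 39 to 149 months (median 69).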