The 43-hour Sinica Taiwan Mandarin Conversational Corpus (TMC Corpus) consists of 30 free conversations between strangers (MCDC8 and MCDC22) and 29 topic-specific and 26 Map Task conversations (MTCC and MMTC) between people acquainted with each other, with average lengths of one hour, 20 minutes, and 10 minutes, respectively. The TMC Corpus has a balanced design of scenarios and conversation partner familiarity. Ninety-eight female and 72 male speakers aged 16 to 63 years were recorded. Twenty-six speakers took part in all three sub-corpus projects. Conversations were recorded in quiet rooms at Academia Sinica with a SONY TCD-D10 Pro II DAT digital recorder and an Audio-Technica ATM 33a microphone at a sampling rate of 48 kHz, with each speaker on a separate channel. The speech content was orthographically transcribed in traditional Chinese characters. Particles, discourse markers, fillers, word fragments, and paralinguistic sounds that often occur in Chinese conversation are annotated in the transcripts. Only MCDC8 has been manually checked for Pinyin and POS. The remaining sub-corpora provided in this system were processed automatically, so please use them with caution. The corpus statistics are summarized as follows.


IPU: 81,237
Word
  Lexical words: 397,693 (15,105)
    1-syllabic words: 224,343 (1,580)
    2-syllabic words: 153,240 (9,705)
    3-syllabic words: 17,322 (2,942)
    Others: 2,788 (878)
  Discourse-related items: 175,318 (2,419)
    Discourse particles: 29,421 (36)
    Discourse markers: 12,164 (16)
    Fillers: 16,721 (34)
POS
  Verbs: 98,090 (6,261, 16)
  Adverbs: 80,190 (657, 64)
  Nouns: 75,559 (8,210, 7)
  Pronouns: 39,453 (50, 1)
  Determinatives: 24,865 (526, 5)
  Preposition: 14,464 (100, 1)
  Conjunctions: 17,950 (94, 4)
  Structural particles DE: 16,342 (5, 1)
  Classifiers: 12,969 (165, 1)
  Particles: 3,802 (22, 1)
  Adjectives: 813 (193, 1)
  Interjection: 8 (4, 1)
  Copula: 13,141 (3, 1)
  Foreign words: 1,470 (473, 1)
Character: 594,238 (2,952)
Syllable
  Tone-distinctive: 1,086
  No tone distinction: 403
Phoneme: 1,429,518
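
As a quick consistency check on the statistics above, the four syllable-count rows can be summed and compared with the lexical-word row. The following is only an illustrative sketch: the variable and function names are made up here, and treating the parenthesized figure as a second count that should also add up is an assumption, since the table does not label it.

```python
# Figures copied from the TMC statistics above.
# Each entry is (first figure, figure in parentheses); what the second
# figure measures is not stated in the table, so it is only summed here.
lexical_total = (397_693, 15_105)
by_syllable_count = {
    "1-syllabic": (224_343, 1_580),
    "2-syllabic": (153_240, 9_705),
    "3-syllabic": (17_322, 2_942),
    "others": (2_788, 878),
}


def column_sums(rows):
    """Sum the first and the parenthesized figure over all rows."""
    first = sum(a for a, _ in rows.values())
    second = sum(b for _, b in rows.values())
    return first, second


sums = column_sums(by_syllable_count)
print(sums, sums == lexical_total)
```

Both columns add up to the lexical-word row, so the syllable-count rows appear to partition the lexical words. The same check can be repeated for the other corpora by swapping in their figures.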

The Sinica Sociophonetic Corpus was funded by the National Digital Archives Project. The purpose of this corpus project was to document and archive the contemporary use of spoken Taiwan Mandarin. Recordings were made in twelve regions across northern, central, and southern Taiwan: Yilan County, Taoyuan County, Hsinchu County, Taichung City, Nantou County, Yunlin County, Chiayi City, Changhua County, Tainan City, Kaohsiung City, Kaohsiung County, and Taipei City. A total of 1,402 interviews, mainly with individuals aged 20 to 40 years, were recorded in public places such as parks, post offices, and banks, where we assumed we were most likely to find local people. The interviews were recorded with a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone, and digitized at a sampling rate of 44.1 kHz with 16-bit quantization. The speech content of the interviewees was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The interviewees were asked twenty-five questions in three categories: language use, socioeconomic background, and use of the internet. Concerning language use, the questions on dialect exposure asked how the spoken dialects, mainly Southern Min and Hakka, are used within the family, e.g., with parents and siblings. The questions on language ability asked how many languages the interviewees can speak and how well they speak them. Concerning socioeconomic background, data on age, gender, salary level, education level, and childhood residence were collected. Individual interviews lasted from three to eight minutes. All interviews were conducted in Taiwan Mandarin. Only the speech produced by the interviewees was transcribed and processed. The corpus statistics are summarized as follows.


IPU: 124,916
Word
  Lexical words: 284,196 (7,085)
    1-syllabic words: 133,354 (1,007)
    2-syllabic words: 129,060 (4,218)
    3-syllabic words: 20,348 (1,585)
    Others: 1,434 (275)
  Discourse-related items: 122,634 (718)
    Discourse particles: 28,928 (33)
    Discourse markers: 3,993 (12)
    Fillers: 28,826 (21)
POS
  Verbs: 58,894 (2,135, 16)
  Adverbs: 42,750 (367, 6)
  Nouns: 88,146 (4,423, 7)
  Pronouns: 10,020 (33, 1)
  Determinatives: 18,700 (362, 5)
  Preposition: 11,499 (66, 1)
  Conjunctions: 9,817 (64, 4)
  Structural particles DE: 6,655 (7, 1)
  Classifiers: 5,917 (76, 1)
  Particles: 6,324 (19, 1)
  Adjectives: 683 (102, 1)
  Interjection: 3 (2, 1)
  Copula: 10,275 (4, 1)
  Foreign words: 2,579 (235, 1)
Character: 458,320 (2,006)
Syllable
  Tone-distinctive: 929
  No tone distinction: 375
Phoneme: 1,102,753

The Sinica Child Speech Corpus was funded by the National Science Council and the Children’s Hearing Foundation. It contains repetitive and narrative speech data produced by seventy-nine preschool children with normal hearing (NH) aged 2;11~6;3 (median 5;0) and forty-five children with hearing impairment (HI) aged 3;3~12;5 (median 5;9). Among the HI children, thirty wore traditional hearing aids (with mild to profound degrees of hearing loss), and fifteen were fitted with a cochlear implant (with severe to profound degrees of hearing loss). The HI children were recorded during their regular auditory-verbal therapy (AVT) sessions using the video equipment built into the sound-proof classrooms of the Children’s Hearing Foundation. Adobe Audition 1.0 was used to convert the video files into 44.1 kHz, 16-bit single-channel sound files. The NH children were recorded either in sound-proof studios at Academia Sinica or in quiet classrooms at their kindergarten with a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone. These data were digitized at a sampling rate of 44.1 kHz with 16-bit quantization. For narrative speech data collection, the children were asked to tell the story of The Hare and the Tortoise, assisted by picture cards that were presented to them in a fixed order. The speech content was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The corpus statistics are summarized as follows.


                            HI                NH
IPU                         2,208             2,727
Word
  Lexical words             5,193 (503)       6,436 (559)
    1-syllabic words        3,002 (181)       3,863 (205)
    2-syllabic words        2,123 (276)       2,484 (305)
    3-syllabic words        61 (40)           86 (46)
    Others                  7 (6)             3 (3)
  Discourse-related items   2,778 (42)        3,695 (51)
    Discourse particles     75 (14)           52 (16)
    Discourse markers       21 (4)            53 (8)
    Fillers                 56 (8)            214 (11)
POS
  Verbs                     1,612 (252, 16)   1,857 (287, 16)
  Adverbs                   1,038 (71, 6)     1,388 (75, 6)
  Nouns                     1,209 (135, 7)    1,254 (148, 7)
  Pronouns                  334 (10, 1)       650 (11, 1)
  Determinatives            251 (25, 5)       344 (28, 5)
  Preposition               155 (15, 1)       241 (20, 1)
  Conjunctions              79 (15, 4)        92 (15, 4)
  Structural particles DE   131 (2, 1)        107 (2, 1)
  Classifiers               163 (7, 1)        239 (8, 1)
  Particles                 170 (10, 1)       183 (10, 1)
  Adjectives                2 (1, 1)          2 (1, 1)
  Interjection              0                 0
  Copula                    49 (1, 1)         72 (2, 1)
Character                   7,467 (378)       9,102 (408)
Syllable
  Tone-distinctive          311               349
  No tone distinction       215               236
Phoneme                     17,046            21,022

The Sinica Phonological Development Corpus contains speech recordings of 798 preschool children from Taipei City and New Taipei City in Taiwan (see Table 1). The recording project was approved in 2017 by the Institutional Review Board on Humanities and Social Science Research at Academia Sinica (AS-IRB-HS07-107079). None of the children had any known diagnoses related to language, hearing, or cognitive development. All children passed a pure-tone audiometric hearing test using a GSI 18 Screening Audiometer at 1, 2, and 4 kHz at 20 dB in both ears.

For data collection, CapiAssess/AssessingSpeech was installed on a MacBook Air Pro Retina 13.3 laptop and used with a Sony ECM MS907 microphone. A picture-naming task was conducted to record the Sinica Child Balanced Wordlist (see Table 2). The wordlist consists of 70 child-friendly multisyllabic words and short phrases and was designed with two balance criteria. First, all onsets eligible for composing Chinese syllables appear in both the first and second syllable positions. Second, with the exception of the neutral tone, all 2x2 combinations of tones in disyllabic words are represented in the wordlist.

To create child-friendly sentences and short discourse content for future continuous speech recording, the wordlist includes a variety of semantic fields familiar to children, such as animals, food, transportation, body parts, movement, objects, games, locations, and natural phenomena. Each child recorded all 70 items (148 syllables), giving a total of 55,860 words and 118,104 syllables across the 798 children. The recordings were digitized at a sampling rate of 16 kHz. The data were automatically processed using the ILAS phone aligner, with manual verification of syllable boundaries.


Table 1. Subjects

AGE 3~3.5 3.5~4 4~4.5 4.5~5 5~5.5 5.5~6 6~6.5 6.5~7 Total
Male 10 40 64 55 65 64 79 22 399
Female 21 52 58 64 53 62 60 29 399
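
The subject counts in Table 1 and the corpus totals quoted in the text can be cross-checked with a few lines of arithmetic. This is a minimal sketch with hypothetical variable names. The per-child figures are taken from the description above (70 wordlist items, 148 syllables per child).

```python
# Age bins and counts copied from Table 1.
age_bins = ["3~3.5", "3.5~4", "4~4.5", "4.5~5",
            "5~5.5", "5.5~6", "6~6.5", "6.5~7"]
male = [10, 40, 64, 55, 65, 64, 79, 22]
female = [21, 52, 58, 64, 53, 62, 60, 29]

WORDS_PER_CHILD = 70       # items in the Sinica Child Balanced Wordlist
SYLLABLES_PER_CHILD = 148  # syllables recorded per child, as stated in the text

# Every age bin should have one count per gender.
assert len(age_bins) == len(male) == len(female)

children = sum(male) + sum(female)
total_words = children * WORDS_PER_CHILD
total_syllables = children * SYLLABLES_PER_CHILD

print(sum(male), sum(female), children, total_words, total_syllables)
```

Run as written, this prints `399 399 798 55860 118104`, which agrees with the row totals in Table 1 and with the word and syllable totals given in the text.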

Table 2. Sinica Child Balanced Wordlist

Word/Phrase Pinyin IPA
cloud báiyún /pai yn/
crayon cǎisèbǐ /tsʰai sə pi/
strawberry cǎoméi /tsʰao mei/
teacup chábēi /tʂʰa pei/
have a meal chīfàn /tʂʰɯ fan/
ugly duckling chǒuxiǎoyā /tʂʰou ɕjao ja/
window chuānghù /tʂʰwaŋ xu/
put on clothes chuānyīfú /tʂʰwan i fu/
kitchen chúfáng /tʂʰu faŋ/
blow bubbles chuīpàopào /tʂʰwei pʰao pʰao/
hedgehog cìwèi /tsʰɨ wei/
Monopoly dàfùwēng /ta fu wəŋ/
cake dàngāo /tan kao/
TV diànshì /tjen ʂɯ/
cliff duànyá /twan jai/
ears ěrduo /ɚ two/
airplane fēijī /fei tɕi/
turn off light guāndēng /kwan təŋ/
juice guǒzhī /kwo tʂɯ/
garden huāyuán /xwa yen/
train huǒchē /hwo tʂʰə/
building blocks jīmù /tɕi mu/
living room kètīng /kʰə tʰiŋ/
dinosaur kǒnglóng /kʰoŋ loŋ/
chopsticks kuàizi /kʰwai tsɨ/
tiger lǎohǔ /lao hu/
eagle lǎoyīng /lao iŋ/
get wet in rain línyǔ /lin y/
tire lúntāi /lun tʰai/
go shopping mǎicài /mai tsʰai/
mango mángguǒ /maŋ kwo/
steamed bread mántóu /man tʰou/
bee mìfēng /mi fəŋ/
hen mǔjī /mu tɕi/
button niǔkòu /njou kʰou/
milk niúnǎi /njou nai/
steak niúpái /njou pʰai/
crab pángxiè /pʰaŋ ɕje/
dish pánzi /pʰan tsɨ/
climb mountain páshān /pʰa ʂan/
fountain pēnshuǐchí /pʰən ʂwei tʂʰɯ/
apple píngguǒ /pʰiŋ kwo/
jigsaw pīntú /pʰin tʰu/
leather shoes píxié /pʰi ɕje/
grapes pútáo /pʰu tʰao/
car qìchē /tɕʰi tʂʰə/
ride horse qímǎ /tɕʰi ma/
hot dog règǒu /ʐə kou/
sweep sǎodì /sao ti/
birthday shēngrì /ʂəŋ ʐɯ/
clock shízhōng /ʂɯ tʂoŋ/
sushi shòusī /ʂou sɨ/
sleep shuìjiào /ʂwei tɕjao/
speak shuōhuà /ʂwo hwa/
swan tiāné /tʰjen ə/
donut tiántiánquān /tʰjen tʰjen tɕʰyen/
rabbit tùzi /tʰu tsɨ/
toy wánjù /wan tɕy/
thermometer wēndùjì /wən tu tɕi/
turtle wūguī /u kwei/
write xiězì /ɕje tsɨ/
straw xīguǎn /ɕi kwan/
school xuéxiào /ɕye ɕjao/
teeth yáchǐ /ja tʂʰɯ/
swim yóuyǒng /jou joŋ/
moon yuèliàng /ye ljaŋ/
spider zhīzhū /tʂɯ tʂu/
walk zǒulù /tsou lu/
mouth zuǐbā /tswei pa/
football zúqiú /tsu tɕʰjou/
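
The tone-balance criterion described for the wordlist can be inspected directly from the Pinyin column. The sketch below is one possible approach, not the procedure used to build the wordlist. It reads tones off the Pinyin diacritics via Unicode decomposition, and tallies tone pairs for a small hand-copied sample of disyllabic entries. The function name, the sample, and the rule of skipping words with an unmarked (neutral-tone) syllable are assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# Combining diacritics that mark the four lexical tones in Pinyin:
# macron (tone 1), acute (tone 2), caron (tone 3), grave (tone 4).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def tone_marks(pinyin):
    """Return the tones of the tone-marked syllables, in order.

    Neutral-tone syllables carry no diacritic and are therefore absent
    from the result.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return tuple(TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS)


# (Pinyin, syllable count) for a few entries from Table 2; the syllable
# counts follow the space-separated IPA column.
sample = [("báiyún", 2), ("cǎoméi", 2), ("chábēi", 2), ("chīfàn", 2),
          ("kuàizi", 2), ("ěrduo", 2), ("qìchē", 2)]

pairs = Counter()
for pinyin, n_syllables in sample:
    tones = tone_marks(pinyin)
    # Keep only disyllabic words in which both syllables carry a tone mark.
    if n_syllables == 2 and len(tones) == 2:
        pairs[tones] += 1

for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Because each toned Pinyin syllable carries exactly one tone diacritic, the tone sequence can be recovered without segmenting the Pinyin string into syllables. Extending `sample` to all disyllabic entries in Table 2 would show which tone pairs the wordlist covers.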