The 43-hour Sinica Taiwan Mandarin Conversational Corpus (TMC Corpus) consists of 30 free conversations between strangers (MCDC8 and MCDC22), and 29 topic-specific conversations and 26 Map Task conversations (MTCC and MMTC, respectively) between people acquainted with each other. The free, topic-specific, and Map Task conversations average one hour, 20 minutes, and 10 minutes in length, respectively. The TMC Corpus has a balanced design of scenarios and conversation partner familiarity. Ninety-eight female and 72 male speakers aged 16 to 63 years were recorded. Twenty-six speakers took part in all three sub-corpus projects. Conversations were recorded in quiet rooms at Academia Sinica with the SONY TCD-D10 Pro II DAT recorder and the Audio-Technica ATM 33a microphone at a sampling rate of 48 kHz, with each speaker on a separate channel. The speech content was orthographically transcribed in traditional Chinese characters. Particles, discourse markers, fillers, word fragments, and paralinguistic sounds, which occur often in Chinese conversation, are annotated in the transcripts. Only MCDC8 has been manually checked for Pinyin and POS. The remaining corpora provided in this system were processed automatically, so please use them with caution. The corpus statistics are summarized as follows.
IPU | 81,237 |
Word | Lexical words: 397,693 (15,105) |
Word | 1-syllabic words: 224,343 (1,580) |
Word | 2-syllabic words: 153,240 (9,705) |
Word | 3-syllabic words: 17,322 (2,942) |
Word | Others: 2,788 (878) |
Word | Discourse-related items: 175,318 (2,419) |
Word | Discourse particles: 29,421 (36) |
Word | Discourse markers: 12,164 (16) |
Word | Fillers: 16,721 (34) |
POS | Verbs: 98,090 (6,261, 16) |
POS | Adverbs: 80,190 (657,64) |
POS | Nouns: 75,559 (8,210, 7) |
POS | Pronouns: 39,453 (50, 1) |
POS | Determinatives: 24,865 (526, 5) |
POS | Preposition: 14,464 (100, 1) |
POS | Conjunctions: 17,950 (94, 4) |
POS | Structural particles DE: 16,342 (5, 1) |
POS | Classifiers: 12,969 (165, 1) |
POS | Particles: 3,802 (22, 1) |
POS | Adjectives: 813 (193, 1) |
POS | Interjection: 8 (4, 1) |
POS | Copula: 13,141 (3, 1) |
POS | Foreign words: 1,470 (473, 1) |
Character | 594,238 (2,952) |
Syllable | Tone-distinctive: 1,086 |
Syllable | No tone distinction: 403 |
Phoneme | 1,429,518 |
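As a quick sanity check on the table above, the four syllable-length classes should add up to the lexical-word total. The sketch below tests this for both the first figure and the parenthesized figure in each cell. The helper `parts_match_total` is a hypothetical name introduced here for illustration and is not part of any corpus tooling; the numbers are copied from the table.

```python
def parts_match_total(total, parts):
    """Return True if the per-class figures add up to the reported total."""
    return sum(parts) == total


# Figures copied from the TMC Corpus table: (first figure, parenthesized figure).
lexical_words = (397_693, 15_105)
by_syllable_count = {
    "1-syllabic": (224_343, 1_580),
    "2-syllabic": (153_240, 9_705),
    "3-syllabic": (17_322, 2_942),
    "others": (2_788, 878),
}

# Check the first figure and the parenthesized figure separately.
first_ok = parts_match_total(
    lexical_words[0], [pair[0] for pair in by_syllable_count.values()]
)
second_ok = parts_match_total(
    lexical_words[1], [pair[1] for pair in by_syllable_count.values()]
)
print(first_ok, second_ok)
```

The same check can be run on any other table that reports a total together with its breakdown.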
The Sinica Sociophonetic Corpus was funded by the National Digital Archives Project. The purpose of this corpus project was to document and archive the contemporary use of spoken Taiwan Mandarin. Recording was conducted in twelve regions distributed across northern, central, and southern Taiwan: Yilan County, Taoyuan County, Hsinchu County, Taichung City, Nantou County, Yunlin County, Chiayi City, Changhua County, Tainan City, Kaohsiung City, Kaohsiung County, and Taipei City. A total of 1,402 interviews, mainly with individuals aged 20 to 40 years, were recorded in public places such as parks, post offices, and banks, where we assumed we were most likely to find local people. The interviews were recorded with the Sony Hi-MD MZ-RH1 digital recorder and the Sony ECM MS907 microphone, and digitized at a sampling rate of 44.1 kHz with 16-bit quantization. The interviewees were asked twenty-five questions in three categories: language use, socioeconomic background, and internet use. Concerning language use, the questions on dialect exposure asked specifically how the spoken dialects, mainly Southern Min and Hakka, are used within the family, e.g., with parents and siblings. The questions on language ability asked how many languages the interviewees can speak and how proficient they are in each. Concerning socioeconomic background, data on age, gender, salary level, education level, and childhood residence were sought. Individual interviews lasted from three to eight minutes. All interviews were conducted in Taiwan Mandarin. Only the speech produced by the interviewees was transcribed and processed; it was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The corpus statistics are summarized as follows.
IPU | 124,916 |
Word | Lexical words: 284,196 (7,085) |
Word | 1-syllabic words: 133,354 (1,007) |
Word | 2-syllabic words: 129,060 (4,218) |
Word | 3-syllabic words: 20,348 (1,585) |
Word | Others: 1,434 (275) |
Word | Discourse-related items: 122,634 (718) |
Word | Discourse particles: 28,928 (33) |
Word | Discourse markers: 3,993 (12) |
Word | Fillers: 28,826 (21) |
POS | Verbs: 58,894 (2,135, 16) |
POS | Adverbs: 42,750 (367, 6) |
POS | Nouns: 88,146 (4,423, 7) |
POS | Pronouns: 10,020 (33, 1) |
POS | Determinatives: 18,700 (362, 5) |
POS | Preposition: 11,499 (66, 1) |
POS | Conjunctions: 9,817 (64, 4) |
POS | Structural particles DE: 6,655 (7, 1) |
POS | Classifiers: 5,917 (76, 1) |
POS | Particles: 6,324 (19, 1) |
POS | Adjectives: 683 (102, 1) |
POS | Interjection: 3 (2, 1) |
POS | Copula: 10,275 (4, 1) |
POS | Foreign words: 2,579 (235, 1) |
Character | 458,320 (2,006) |
Syllable | Tone-distinctive: 929 |
Syllable | No tone distinction: 375 |
Phoneme | 1,102,753 |
The Sinica Child Speech Corpus was funded by the National Science Council and the Children’s Hearing Foundation. It contains repetitive and narrative speech data produced by seventy-nine preschool children with normal hearing (NH) aged 2;11~6;3 (median 5;0) and forty-five children with hearing impairment (HI) aged 3;3~12;5 (median 5;9). Among the HI children, thirty wore traditional hearing aids (with mild to profound degrees of hearing loss), and fifteen were fitted with a cochlear implant (with severe to profound degrees of hearing loss). The HI children were recorded during their regular auditory-verbal therapy (AVT) sessions using the video equipment built into the sound-proof classrooms of the Children’s Hearing Foundation. Adobe Audition 1.0 was used to convert the video files into 44.1 kHz, 16-bit single-channel sound files. The NH children were recorded either in sound-proof studios at Academia Sinica or in quiet classrooms at their kindergartens, using the Sony Hi-MD MZ-RH1 digital recorder and the Sony ECM MS907 microphone. These data were digitized at a sampling rate of 44.1 kHz with 16-bit quantization. For the narrative speech data, the children were asked to tell the story of The Hare and the Tortoise with the aid of picture cards presented to them in a fixed order. The speech content was orthographically transcribed in traditional Chinese characters with annotations of paralinguistic sounds and pauses. The corpus statistics are summarized as follows.
Category | Item | HI | NH |
IPU | | 2,208 | 2,727 |
Word | Lexical words | 5,193 (503) | 6,436 (559) |
Word | 1-syllabic words | 3,002 (181) | 3,863 (205) |
Word | 2-syllabic words | 2,123 (276) | 2,484 (305) |
Word | 3-syllabic words | 61 (40) | 86 (46) |
Word | Others | 7 (6) | 3 (3) |
Word | Discourse-related items | 2,778 (42) | 3,695 (51) |
Word | Discourse particles | 75 (14) | 52 (16) |
Word | Discourse markers | 21 (4) | 53 (8) |
Word | Fillers | 56 (8) | 214 (11) |
POS | Verbs | 1,612 (252, 16) | 1,857 (287, 16) |
POS | Adverbs | 1,038 (71, 6) | 1,388 (75, 6) |
POS | Nouns | 1,209 (135, 7) | 1,254 (148, 7) |
POS | Pronouns | 334 (10, 1) | 650 (11, 1) |
POS | Determinatives | 251 (25, 5) | 344 (28, 5) |
POS | Preposition | 155 (15, 1) | 241 (20, 1) |
POS | Conjunctions | 79 (15, 4) | 92 (15, 4) |
POS | Structural particles DE | 131 (2, 1) | 107 (2, 1) |
POS | Classifiers | 163 (7, 1) | 239 (8, 1) |
POS | Particles | 170 (10, 1) | 183 (10, 1) |
POS | Adjectives | 2 (1, 1) | 2 (1, 1) |
POS | Interjection | 0 | 0 |
POS | Copula | 49 (1, 1) | 72 (2, 1) |
Character | | 7,467 (378) | 9,102 (408) |
Syllable | Tone-distinctive | 311 | 349 |
Syllable | No tone distinction | 215 | 236 |
Phoneme | | 17,046 | 21,022 |
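Because the HI and NH groups differ in size (2,208 versus 2,727 IPUs), raw counts in the table above are not directly comparable. One simple option, sketched below, is to normalize a count by the number of IPUs in the same group. This is only an illustration of the arithmetic, not an analysis reported for the corpus; the dictionary layout and the helper `per_ipu` are assumptions made for this example.

```python
# Counts copied from the HI/NH table (first figure in each cell).
counts = {
    "HI": {"ipu": 2_208, "lexical_words": 5_193, "fillers": 56},
    "NH": {"ipu": 2_727, "lexical_words": 6_436, "fillers": 214},
}


def per_ipu(group, item):
    """Return the count of `item` divided by the number of IPUs in `group`."""
    return counts[group][item] / counts[group]["ipu"]


for group in counts:
    words_rate = per_ipu(group, "lexical_words")
    filler_rate = per_ipu(group, "fillers")
    print(f"{group}: {words_rate:.2f} lexical words/IPU, {filler_rate:.3f} fillers/IPU")
```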
The Sinica Phonological Development Corpus contains speech recordings of 798 preschool children from Taipei City and New Taipei City in Taiwan (see Table 1). The recording project was approved in 2017 by the Institutional Review Board on Humanities and Social Science Research at Academia Sinica (AS-IRB-HS07-107079). None of the children had any known diagnoses related to language, hearing, or cognitive development. All children passed a pure-tone audiometric hearing test using a GSI 18 Screening Audiometer at 1, 2, and 4 kHz at 20 dB in both ears.
For data collection, CapiAssess/AssessingSpeech was installed on a MacBook Air Pro Retina 13.3 laptop, with a Sony ECM MS907 microphone. A picture-naming task was conducted to record the Sinica Child Balanced Wordlist (see Table 2). The wordlist consists of 70 child-friendly multisyllabic words and short phrases and was designed with two balance criteria. First, all onsets eligible for composing Chinese syllables appear in both the first and second syllable positions. Second, with the exception of the neutral tone, all pairwise combinations of the four lexical tones in disyllabic words are represented in the wordlist.
To create child-friendly sentences and short discourse content for future continuous speech recording, the wordlist includes a variety of semantic fields familiar to children, such as animals, food, transportation, body parts, movement, objects, games, locations, and natural phenomena. Each child recorded the 70 words (148 syllables), resulting in a total of 55,860 words and 118,104 syllables across the 798 children. The recordings were digitized at a sampling rate of 16 kHz. The data were automatically processed using the ILAS phone aligner, with manual verification of syllable boundaries.
Table 1. Subjects
AGE | 3~3.5 | 3.5~4 | 4~4.5 | 4.5~5 | 5~5.5 | 5.5~6 | 6~6.5 | 6.5~7 | Total |
Male | 10 | 40 | 64 | 55 | 65 | 64 | 79 | 22 | 399 |
Female | 21 | 52 | 58 | 64 | 53 | 62 | 60 | 29 | 399 |
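The corpus size follows directly from the subject counts in Table 1 and the wordlist size. The short sketch below reproduces that arithmetic, with the per-age-group counts copied from Table 1 and the 70 words and 148 syllables per child taken from the description above; the variable names are chosen for this example only.

```python
# Children per age group, copied from Table 1 (3~3.5 up to 6.5~7).
male = [10, 40, 64, 55, 65, 64, 79, 22]
female = [21, 52, 58, 64, 53, 62, 60, 29]

WORDS_PER_CHILD = 70       # size of the Sinica Child Balanced Wordlist
SYLLABLES_PER_CHILD = 148  # syllables recorded per child

children = sum(male) + sum(female)
total_words = children * WORDS_PER_CHILD
total_syllables = children * SYLLABLES_PER_CHILD

print(children, total_words, total_syllables)
```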
Table 2. Sinica Child Balanced Wordlist
Word/Phrase | Pinyin | IPA |
cloud | báiyún | /pai yn/ |
crayon | cǎisèbǐ | /tsʰai sə pi/ |
strawberry | cǎoméi | /tsʰao mei/ |
teacup | chábēi | /tʂʰa pei/ |
have a meal | chīfàn | /tʂʰɯ fan/ |
ugly duckling | chǒuxiǎoyā | /tʂʰou ɕjao ja/ |
window | chuānghù | /tʂʰwaŋ xu/ |
put on clothes | chuānyīfú | /tʂʰwan i fu/ |
kitchen | chúfáng | /tʂʰu faŋ/ |
blow bubbles | chuīpàopào | /tʂʰwei pʰao pʰao/ |
hedgehog | cìwèi | /tsʰɨ wei/ |
Monopoly | dàfùwēng | /ta fu wəŋ/ |
cake | dàngāo | /tan kao/ |
TV | diànshì | /tjen ʂɯ/ |
cliff | duànyá | /twan jai/ |
ears | ěrduo | /ɚ two/ |
airplane | fēijī | /fei tɕi/ |
turn off light | guāndēng | /kwan təŋ/ |
juice | guǒzhī | /kwo tʂɯ/ |
garden | huāyuán | /xwa yen/ |
train | huǒchē | /hwo tʂʰə/ |
building blocks | jīmù | /tɕi mu/ |
living room | kètīng | /kʰə tʰiŋ/ |
dinosaur | kǒnglóng | /kʰoŋ loŋ/ |
chopsticks | kuàizi | /kʰwai tsɨ/ |
tiger | lǎohǔ | /lao hu/ |
eagle | lǎoyīng | /lao iŋ/ |
get wet in rain | línyǔ | /lin y/ |
tire | lúntāi | /lun tʰai/ |
go shopping | mǎicài | /mai tsʰai/ |
mango | mángguǒ | /maŋ kwo/ |
steamed bread | mántóu | /man tʰou/ |
bee | mìfēng | /mi fəŋ/ |
hen | mǔjī | /mu tɕi/ |
button | niǔkòu | /njou kʰou/ |
milk | niúnǎi | /njou nai/ |
steak | niúpái | /njou pʰai/ |
crab | pángxiè | /pʰaŋ ɕje/ |
dish | pánzi | /pʰan tsɨ/ |
climb mountain | páshān | /pʰa ʂan/ |
fountain | pēnshuǐchí | /pʰən ʂwei tʂʰɯ/ |
apple | píngguǒ | /pʰiŋ kwo/ |
jigsaw | pīntú | /pʰin tʰu/ |
leather shoes | píxié | /pʰi ɕje/ |
grapes | pútáo | /pʰu tʰao/ |
car | qìchē | /tɕʰi tʂʰə/ |
ride horse | qímǎ | /tɕʰi ma/ |
hot dog | règǒu | /ʐə kou/ |
sweep | sǎodì | /sao ti/ |
birthday | shēngrì | /ʂəŋ ʐɯ/ |
clock | shízhōng | /ʂɯ tʂoŋ/ |
sushi | shòusī | /ʂou sɨ/ |
sleep | shuìjiào | /ʂwei tɕjao/ |
speak | shuōhuà | /ʂwo hwa/ |
swan | tiāné | /tʰjen ə/ |
donut | tiántiánquān | /tʰjen tʰjen tɕʰyen/ |
rabbit | tùzi | /tʰu tsɨ/ |
toy | wánjù | /wan tɕy/ |
thermometer | wēndùjì | /wən tu tɕi/ |
turtle | wūguī | /u kwei/ |
write | xiězì | /ɕje tsɨ/ |
straw | xīguǎn | /ɕi kwan/ |
school | xuéxiào | /ɕye ɕjao/ |
teeth | yáchǐ | /ja tʂʰɯ/ |
swim | yóuyǒng | /jou joŋ/ |
moon | yuèliàng | /ye ljaŋ/ |
spider | zhīzhū | /tʂɯ tʂu/ |
walk | zǒulù | /tsou lu/ |
mouth | zuǐbā | /tswei pa/ |
football | zúqiú | /tsu tɕʰjou/ |
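The tone-balance criterion of the wordlist can be spot-checked from the Pinyin column alone. The sketch below reads the tone of each syllable from its diacritic and reports which of the 4 x 4 tone pairs are not attested in a list of words. It rests on a simplifying assumption: any word carrying exactly two tone marks is treated as a fully toned disyllabic word, so a trisyllabic word with one neutral-tone syllable would be miscounted. The function names are hypothetical, and the sample contains one Table 2 word per tone pair.

```python
import unicodedata
from itertools import product

# Combining diacritics that mark the four lexical tones in Pinyin.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def tone_sequence(pinyin):
    """Return the tones marked in a Pinyin string, in order of appearance."""
    # NFD splits each accented vowel into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    return tuple(TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS)


def missing_tone_pairs(words):
    """Return the tone pairs (out of 4 x 4) not attested among the words."""
    attested = {seq for seq in map(tone_sequence, words) if len(seq) == 2}
    return set(product(range(1, 5), repeat=2)) - attested


# One word from Table 2 for each tone pair, from 1-1 through 4-4.
sample = [
    "fēijī", "pīntú", "xīguǎn", "chīfàn",
    "chábēi", "báiyún", "qímǎ", "wánjù",
    "mǔjī", "cǎoméi", "lǎohǔ", "zǒulù",
    "qìchē", "duànyá", "règǒu", "cìwèi",
]

print(sorted(missing_tone_pairs(sample)))
```

Neutral-tone syllables carry no diacritic, so words such as ěrduo yield a single tone and are skipped by the pair check.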