
The 43-hour Sinica Taiwan Mandarin Conversational Corpus (TMC Corpus) consists of 30 free conversations between strangers (MCDC8 and MCDC22) and 29 topic-specific and 26 Map Task conversations (MTCC and MMTC) between people acquainted with each other. The free, topic-specific, and Map Task conversations last on average one hour, 20 minutes, and 10 minutes, respectively. The TMC Corpus is balanced in its design for scenario and for conversation partner familiarity. Ninety-eight female and 72 male speakers aged 16 to 63 years were recorded; 26 of them took part in all three sub-corpus projects. Conversations were recorded in quiet rooms at Academia Sinica with a SONY TCD-D10 Pro II DAT digital recorder and an Audio-Technica ATM 33a microphone at a sampling rate of 48 kHz, with each speaker on a separate channel. The speech content was orthographically transcribed in traditional Chinese characters. Particles, discourse markers, fillers, word fragments, and paralinguistic sounds, all of which occur often in Chinese conversation, are annotated in the transcripts. Only MCDC8 has been manually checked for Pinyin and POS. The other sub-corpora provided in this system were processed automatically, so please use them with caution. The corpus statistics are summarized as follows.

IPU	81,237
Word	Lexical words: 397,693 (15,105)
	1-syllabic words: 224,343 (1,580)
	2-syllabic words: 153,240 (9,705)
	3-syllabic words: 17,322 (2,942)
	Others: 2,788 (878)
	Discourse-related items: 175,318 (2,419)
	Discourse particles: 29,421 (36)
	Discourse markers: 12,164 (16)
	Fillers: 16,721 (34)
POS	Verbs: 98,090 (6,261, 16)
	Adverbs: 80,190 (657,64)
	Nouns: 75,559 (8,210, 7)
	Pronouns: 39,453 (50, 1)
	Determinatives: 24,865 (526, 5)
	Preposition: 14,464 (100, 1)
	Conjunctions: 17,950 (94, 4)
	Structural particles DE: 16,342 (5, 1)
	Classifiers: 12,969 (165, 1)
	Particles: 3,802 (22, 1)
	Adjectives: 813 (193, 1)
	Interjection: 8 (4, 1)
	Copula: 13,141 (3, 1)
	Foreign words: 1,470 (473, 1)
Character	594,238 (2,952)
Syllable	Tone-distinctive 1,086
Syllable	No tone distinction 403
Phoneme	1,429,518
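
As a quick sanity check on the table above, the four word-length rows should add up to the "Lexical words" row. The sketch below assumes that the first figure in each row is a token count and the parenthesized figure a type count; the table does not label the two figures, so this reading is an assumption, as are all variable and function names.

```python
# Word-length rows copied from the TMC table above.
# Assumption: each pair is (tokens, types); the table does not label them.
tmc_by_length = {
    "1-syllabic": (224_343, 1_580),
    "2-syllabic": (153_240, 9_705),
    "3-syllabic": (17_322, 2_942),
    "others": (2_788, 878),
}
tmc_lexical_total = (397_693, 15_105)


def column_sums(rows):
    """Sum the first and the second figure of every row separately."""
    first = sum(a for a, _ in rows.values())
    second = sum(b for _, b in rows.values())
    return first, second


# True when the breakdown is consistent with the "Lexical words" row.
print(column_sums(tmc_by_length) == tmc_lexical_total)

# Share of each word-length class among all lexical-word tokens.
for name, (tokens, _) in tmc_by_length.items():
    print(f"{name}: {tokens / tmc_lexical_total[0]:.1%}")
```

The same check can be applied to the word-length rows of the other corpora on this page by swapping in their figures.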

The Sinica Sociophonetic Corpus was funded by the National Digital Archives Project. The project aimed to document and archive the contemporary use of spoken Taiwan Mandarin. Recordings were made in twelve regions across northern, central, and southern Taiwan: Yilan County, Taoyuan County, Hsinchu County, Taichung City, Nantou County, Yunlin County, Chiayi City, Changhua County, Tainan City, Kaohsiung City, Kaohsiung County, and Taipei City. A total of 1,402 interviews, mainly with individuals aged 20 to 40 years, were recorded in public places such as parks, post offices, and banks, where we assumed we were most likely to find local people. The interviews were recorded with a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone and digitized at a sampling rate of 44.1 kHz with 16-bit quantization. The interviewees’ speech was orthographically transcribed in traditional Chinese characters, with annotations of paralinguistic sounds and pauses. The interviewees were asked twenty-five questions in three categories: language use, socioeconomic background, and internet use. On language use, the questions about dialect exposure asked how the spoken dialects, mainly Southern Min and Hakka, are used within the family, e.g., with parents and siblings, and the questions about language ability asked how many languages the interviewees can speak and how proficient they are in each. On socioeconomic background, data on age, gender, salary level, education level, and childhood residence were collected. Individual interviews lasted three to eight minutes. All interviews were conducted in Taiwan Mandarin, and only the speech produced by the interviewees was transcribed and processed. The corpus statistics are summarized as follows.

IPU	124,916
Word	Lexical words: 284,196 (7,085)
	1-syllabic words: 133,354 (1,007)
	2-syllabic words: 129,060 (4,218)
	3-syllabic words: 20,348 (1,585)
	Others: 1,434 (275)
	Discourse-related items: 122,634 (718)
	Discourse particles: 28,928 (33)
	Discourse markers: 3,993 (12)
	Fillers: 28,826 (21)
POS	Verbs: 58,894 (2,135, 16)
	Adverbs: 42,750 (367, 6)
	Nouns: 88,146 (4,423, 7)
	Pronouns: 10,020 (33, 1)
	Determinatives: 18,700 (362, 5)
	Preposition: 11,499 (66, 1)
	Conjunctions: 9,817 (64, 4)
	Structural particles DE: 6,655 (7, 1)
	Classifiers: 5,917 (76, 1)
	Particles: 6,324 (19, 1)
	Adjectives: 683 (102, 1)
	Interjection: 3 (2, 1)
	Copula: 10,275 (4, 1)
	Foreign words: 2,579 (235, 1)
Character	458,320 (2,006)
Syllable	Tone-distinctive 929
Syllable	No tone distinction 375
Phoneme	1,102,753

The Sinica Child Speech Corpus was funded by the National Science Council and the Children’s Hearing Foundation. It contains repetitive and narrative speech data produced by seventy-nine preschool children with normal hearing (NH) aged 2;11~6;3 (median 5;0) and forty-five children with hearing impairment (HI) aged 3;3~12;5 (median 5;9). Among the HI children, thirty wore traditional hearing aids (with mild to profound degrees of hearing loss), and fifteen were fitted with a cochlear implant (with severe to profound degrees of hearing loss). The HI children were recorded during their regular auditory-verbal therapy (AVT) sessions using the video equipment built into the sound-proof classrooms of the Children’s Hearing Foundation. Adobe Audition 1.0 was used to convert the video files into 44.1 kHz, 16-bit single-channel sound files. The NH children were recorded either in sound-proof studios at Academia Sinica or in quiet classrooms at their kindergarten, with a Sony Hi-MD MZ-RH1 digital recorder and a Sony ECM MS907 microphone. These recordings were digitized at a sampling rate of 44.1 kHz with 16-bit quantization. For the narrative speech data, the children were asked to tell “The Hare and the Tortoise” and “Little Bear Brings an Apple” with the help of picture cards that were presented to them in a fixed order. The speech content was orthographically transcribed in traditional Chinese characters, with annotations of paralinguistic sounds and pauses. The corpus statistics of “The Hare and the Tortoise” are summarized as follows.

		HI	NH
IPU		2,208	2,727
Word	Lexical words	5,193(503)	6,436(559)
	1-syllabic words	3,002(181)	3,863(205)
	2-syllabic words	2,123(276)	2,484(305)
	3-syllabic words	61(40)	86(46)
	Others	7(6)	3(3)
	Discourse-related items	2,778(42)	3,695(51)
	Discourse particles	75(14)	52(16)
	Discourse markers	21(4)	53(8)
	Fillers	56(8)	214(11)
POS	Verbs	1,612(252,16)	1,857(287,16)
	Adverbs	1,038(71,6)	1,388(75,6)
	Nouns	1,209(135,7)	1,254(148,7)
	Pronouns	334(10,1)	650(11,1)
	Determinatives	251(25,5)	344(28,5)
	Preposition	155(15,1)	241(20,1)
	Conjunctions	79(15,4)	92(15,4)
	Structural particles DE	131(2,1)	107(2,1)
	Classifiers	163(7,1)	239(8,1)
	Particles	170(10,1)	183(10,1)
	Adjectives	2(1,1)	2(1,1)
	Interjection	0	0
	Copula	49(1,1)	72(2,1)
Character		7,467(378)	9,102(408)
Syllable	Tone-distinctive	311	349
Syllable	No tone distinction	215	236
Phoneme		17,046	21,022
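
Because the HI and NH groups differ in size, the raw counts in the table above are easier to compare once they are normalized. The sketch below divides selected counts by the number of IPUs in each group. It is purely descriptive, the choice of rows is arbitrary, and the function name is hypothetical. IPU is read here as inter-pausal unit, which the text does not spell out.

```python
# Figures copied from the "The Hare and the Tortoise" table above.
stats = {
    "HI": {"ipu": 2_208, "lexical_words": 5_193, "fillers": 56, "phonemes": 17_046},
    "NH": {"ipu": 2_727, "lexical_words": 6_436, "fillers": 214, "phonemes": 21_022},
}


def per_ipu(group, key):
    """Average count of `key` per IPU for one group of children."""
    return stats[group][key] / stats[group]["ipu"]


for group in stats:
    words = per_ipu(group, "lexical_words")
    fillers = per_ipu(group, "fillers")
    phonemes = per_ipu(group, "phonemes")
    print(f"{group}: {words:.2f} words, {fillers:.3f} fillers, "
          f"{phonemes:.2f} phonemes per IPU")
```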

The Sinica Phonological Development Corpus contains speech recordings of 798 preschool children from Taipei City and New Taipei City in Taiwan (see Table 1). The recording project was approved in 2017 by the Institutional Review Board on Humanities and Social Science Research at Academia Sinica (AS-IRB-HS07-107079). None of the children had any known diagnoses related to language, hearing, or cognitive development. All children passed a pure-tone audiometric hearing test using a GSI 18 Screening Audiometer at 1, 2, and 4 kHz at 20 dB in both ears.

For data collection, CapiAssess/AssessingSpeech was installed on a MacBook Air Pro Retina 13.3 laptop, with a Sony ECM MS907 microphone. A picture-naming task was used to record the Sinica Child Balanced Wordlist (see Table 2). The wordlist consists of 70 child-friendly multisyllabic words and short phrases and was designed to meet two balance criteria. First, every onset that can begin a Chinese syllable appears in both the first and the second syllable position. Second, apart from the neutral tone, every combination of two tones in disyllabic words is represented in the wordlist.

So that child-friendly sentences and short discourse content can be built from it for future continuous speech recording, the wordlist covers a variety of semantic fields familiar to children, such as animals, food, transportation, body parts, movement, objects, games, locations, and natural phenomena. Each child produced the 70 wordlist items, or 148 syllables, giving a total of 55,860 words and 118,104 syllables across the 798 children. The recordings were digitized at a sampling rate of 16 kHz. The data were processed automatically with the ILAS phone aligner, and the syllable boundaries were verified manually.

Table 1. Subjects

AGE	3~3.5	3.5~4	4~4.5	4.5~5	5~5.5	5.5~6	6~6.5	6.5~7	Total
Male	10	40	64	55	65	64	79	22	399
Female	21	52	58	64	53	62	60	29	399
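
The row totals in Table 1 and the corpus size quoted in the text can be reproduced with a few lines of arithmetic. This is only a consistency sketch; the variable names are illustrative, and the per-child figures (70 wordlist items, 148 syllables) are taken from the description above.

```python
# Counts per age band, copied from Table 1 (3~3.5 through 6.5~7).
male = [10, 40, 64, 55, 65, 64, 79, 22]
female = [21, 52, 58, 64, 53, 62, 60, 29]

children = sum(male) + sum(female)

WORDS_PER_CHILD = 70       # items in the Sinica Child Balanced Wordlist
SYLLABLES_PER_CHILD = 148  # syllables per child, as stated in the text

total_words = children * WORDS_PER_CHILD
total_syllables = children * SYLLABLES_PER_CHILD

print(sum(male), sum(female), children)  # 399 399 798
print(total_words, total_syllables)      # 55860 118104
```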

Table 2. Sinica Child Balanced Wordlist

Word/Phrase	Pinyin	IPA
cloud	báiyún	/pai yn/
crayon	cǎisèbǐ	/tsʰai sə pi/
strawberry	cǎoméi	/tsʰao mei/
teacup	chábēi	/tʂʰa pei/
have a meal	chīfàn	/tʂʰɯ fan/
ugly duckling	chǒuxiǎoyā	/tʂʰou ɕjao ja/
window	chuānghù	/tʂʰwaŋ xu/
put on clothes	chuānyīfú	/tʂʰwan i fu/
kitchen	chúfáng	/tʂʰu faŋ/
blow bubbles	chuīpàopào	/tʂʰwei pʰao pʰao/
hedgehog	cìwèi	/tsʰɨ wei/
Monopoly	dàfùwēng	/ta fu wəŋ/
cake	dàngāo	/tan kao/
TV	diànshì	/tjen ʂɯ/
cliff	duànyá	/twan jai/
ears	ěrduo	/ɚ two/
airplane	fēijī	/fei tɕi/
turn off light	guāndēng	/kwan təŋ/
juice	guǒzhī	/kwo tʂɯ/
garden	huāyuán	/xwa yen/
train	huǒchē	/hwo tʂʰə/
building blocks	jīmù	/tɕi mu/
living room	kètīng	/kʰə tʰiŋ/
dinosaur	kǒnglóng	/kʰoŋ loŋ/
chopsticks	kuàizi	/kʰwai tsɨ/
tiger	lǎohǔ	/lao hu/
eagle	lǎoyīng	/lao iŋ/
get wet in rain	línyǔ	/lin y/
tire	lúntāi	/lun tʰai/
go shopping	mǎicài	/mai tsʰai/
mango	mángguǒ	/maŋ kwo/
steamed bread	mántóu	/man tʰou/
bee	mìfēng	/mi fəŋ/
hen	mǔjī	/mu tɕi/
button	niǔkòu	/njou kʰou/
milk	niúnǎi	/njou nai/
steak	niúpái	/njou pʰai/
crab	pángxiè	/pʰaŋ ɕje/
dish	pánzi	/pʰan tsɨ/
climb mountain	páshān	/pʰa ʂan/
fountain	pēnshuǐchí	/pʰən ʂwei tʂʰɯ/
apple	píngguǒ	/pʰiŋ kwo/
jigsaw	pīntú	/pʰin tʰu/
leather shoes	píxié	/pʰi ɕje/
grapes	pútáo	/pʰu tʰao/
car	qìchē	/tɕʰi tʂʰə/
ride horse	qímǎ	/tɕʰi ma/
hot dog	règǒu	/ʐə kou/
sweep	sǎodì	/sao ti/
birthday	shēngrì	/ʂəŋ ʐɯ/
clock	shízhōng	/ʂɯ tʂoŋ/
sushi	shòusī	/ʂou sɨ/
sleep	shuìjiào	/ʂwei tɕjao/
speak	shuōhuà	/ʂwo hwa/
swan	tiāné	/tʰjen ə/
donut	tiántiánquān	/tʰjen tʰjen tɕʰyen/
rabbit	tùzi	/tʰu tsɨ/
toy	wánjù	/wan tɕy/
thermometer	wēndùjì	/wən tu tɕi/
turtle	wūguī	/u kwei/
write	xiězì	/ɕje tsɨ/
straw	xīguǎn	/ɕi kwan/
school	xuéxiào	/ɕye ɕjao/
teeth	yáchǐ	/ja tʂʰɯ/
swim	yóuyǒng	/jou joŋ/
moon	yuèliàng	/ye ljaŋ/
spider	zhīzhū	/tʂɯ tʂu/
walk	zǒulù	/tsou lu/
mouth	zuǐbā	/tswei pa/
football	zúqiú	/tsu tɕʰjou/
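
The tone-balance criterion of the wordlist can be spot-checked from the Pinyin column of Table 2. The sketch below is one possible way to do so and is not the procedure used to build the wordlist. It reads tones off the Pinyin diacritics after Unicode decomposition and lists the tone pairs that no disyllabic item covers. The function names are hypothetical, and the sample holds only a subset of Table 2. An item with three syllables, one of them neutral, would be counted as a pair; no such item occurs in the sample.

```python
import unicodedata
from itertools import product

# Combining marks that carry Pinyin tones after NFD decomposition:
# macron = tone 1, acute = tone 2, caron = tone 3, grave = tone 4.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def tone_sequence(pinyin):
    """Return the tones of all tone-marked syllables, in order.

    Neutral-tone syllables carry no diacritic and are skipped.
    """
    decomposed = unicodedata.normalize("NFD", pinyin)
    return tuple(TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS)


def missing_tone_pairs(words):
    """Return the (tone1, tone2) pairs not covered by any two-tone item."""
    seen = {tone_sequence(word) for word in words}
    all_pairs = set(product(range(1, 5), repeat=2))
    return sorted(all_pairs - seen)


# A subset of the Pinyin forms in Table 2, plus one neutral-tone item.
sample = [
    "fēijī", "huāyuán", "xīguǎn", "chīfàn",
    "chábēi", "báiyún", "línyǔ", "wánjù",
    "guǒzhī", "cǎoméi", "lǎohǔ", "sǎodì",
    "dàngāo", "duànyá", "règǒu", "cìwèi",
    "kuàizi",
]

print(tone_sequence("báiyún"))     # (2, 2)
print(missing_tone_pairs(sample))  # []
```

For this sample the list of missing pairs is empty: the sixteen disyllabic items cover each combination of the four lexical tones once, and the neutral-tone item is ignored.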