Keywords: speech recognition and audio indexing; speaker systems; improvement
Abstract: Speaker diarization is the problem of determining “who spoke when” in an audio recording when the number and identities of the speakers are unknown. Motivated by applications in automatic speech recognition and audio indexing, speaker diarization has been studied extensively over the past decade, and there is currently a wide variety of approaches, including both top-down and bottom-up unsupervised clustering methods. The contributions of this thesis are to provide a unified analysis of the current state of the art, to understand where and why mistakes occur, and to identify directions for improvement. In the first part of the thesis, we analyze the behavior of six state-of-the-art diarization systems, all evaluated on the National Institute of Standards and Technology (NIST) Rich Transcription 2009 evaluation dataset. While performance is typically assessed in terms of a single number, the diarization error rate (DER), we further characterize the errors based on speech segment durations and their proximity to speaker change points.