Frisian Audio Mining Enterprise

Summary

In this project we will make accessible 2600 hours of radio broadcasts from the Omrop Fryslân (Frisian Broadcast). The broadcasts contain spoken Frisian and Dutch and cover the period 1950–2000. We will use speech technology for spoken document retrieval (speech-to-text conversion) and for speaker tracking (speaker diarization and recognition). This will allow us to locate broadcasts addressing specific topics, and to locate specific speakers in the audio signal. To guarantee relevance in retrieval, the project will also develop an enriched Frisian lexicon and a semantic search engine for searching the broadcasts in Frisian and Dutch. The non-academic project partners regard making this data accessible as opening up a rich source of Frisian cultural heritage. The project carries out innovative research, since it will investigate the efficiency and performance of: (1) automatic speech recognition of Frisian and Dutch, using either two separate recognizers or a hybrid one; (2) the integration of speaker diarization and speaker recognition applied to a large longitudinal data set; (3) a flexible semantic search interface targeted at various user groups. All three topics require efficient processing because of the sheer volume of the data.

Key words: audio mining; big data; semantic searching; cultural heritage; Frisian; radio broadcasts; language variation; language domains; spoken document retrieval

Details

Project number

314-99-119

Main applicant

Prof. dr. ir. D.A. van Leeuwen

Affiliated with

Radboud Universiteit Nijmegen, Faculteit der Letteren, Taalwetenschap

Team members

Dr. J.E. Dijkstra, Dr. E. Yilmaz

Duration

01/07/2015 to 30/06/2018