Information Retrieval from Unsegmented Broadcast News Audio

Abstract : This paper describes a system for retrieving relevant portions of broadcast news shows starting with only the audio data. A novel method of automatically detecting and removing commercials is presented and shown to increase the performance of the system while also reducing the computational effort required. A sophisticated large vocabulary speech recogniser which produces high-quality transcriptions of the audio and a window-based retrieval system with post-retrieval merging are also described. Results are presented using the 1999 TREC-8 Spoken Document Retrieval data for the task where no story boundaries are known. Experiments investigating the effectiveness of all aspects of the system are described, and the relative benefits of automatically eliminating commercials, enforcing broadcast structure during retrieval, using relevance feedback, changing retrieval parameters and merging during post-processing are shown. An Average Precision of 46.8%, when duplicates are scored as irrelevant, is shown to be achievable using this system, with the corresponding word error rate of the recogniser being 20.5%.
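The scoring condition named in the abstract, Average Precision with duplicates scored as irrelevant, can be illustrated with a minimal sketch. It assumes each retrieved item has already been mapped to a story identifier, so that a second hit on the same story counts as a duplicate. The function name and inputs are hypothetical; this is not the official evaluation tool and not the authors' code.

```python
def average_precision_dups_irrelevant(ranked_story_ids, relevant_ids):
    """Average precision where repeated hits on a story count as irrelevant.

    ranked_story_ids: story identifiers in ranked order (assumed input format).
    relevant_ids: set of story identifiers judged relevant for the query.
    """
    if not relevant_ids:
        return 0.0
    seen = set()          # stories already returned higher in the ranking
    hits = 0              # relevant, non-duplicate items found so far
    precision_sum = 0.0
    for rank, story_id in enumerate(ranked_story_ids, start=1):
        is_duplicate = story_id in seen
        seen.add(story_id)
        # A duplicate still occupies a rank, which lowers precision
        # for every relevant story found after it.
        if not is_duplicate and story_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)
```

In this sketch a ranking that returns the same relevant story twice is penalised relative to one that returns it once, which is why merging overlapping windows after retrieval can improve the score.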

Contributor : Pierre Jourlin
Submitted on : Monday, July 8, 2019 - 2:19:08 PM
Last modification on : Wednesday, June 16, 2021 - 6:14:01 PM


  • HAL Id : hal-02171698, version 1



Sue E Johnson, Pierre Jourlin, Karen Jones, Philip C Woodland. Information Retrieval from Unsegmented Broadcast News Audio. International Journal of Speech Technology, Springer Verlag, 2001, 4, pp.251 - 268. ⟨hal-02171698⟩


