- Research Article
- Open Access
Efficient Advertisement Discovery for Audio Podcast Content Using Candidate Segmentation
© M. N. Nguyen et al. 2010
- Received: 20 November 2009
- Accepted: 29 June 2010
- Published: 14 July 2010
Audio podcasting is now widely used by online sites such as newspapers, web portals, and journals to deliver audio content to users through download or subscription. A podcast story typically runs from 1 to 30 minutes, and multiple audio advertisements (ads), each 5 to 30 seconds long, are often inserted and repeated at different locations within it. Automatic detection of these inserted ads is challenging because of the complexity of the search algorithms. Based on knowledge of the typical structure of podcast content, this paper proposes a novel, efficient advertisement discovery approach for large audio podcasting collections. The proposed approach offers a significant improvement in search speed with sufficient accuracy. The acceleration comes from candidate segmentation and a sampling technique, which reduce both the search area and the number of matching frames. The approach has been tested on a variety of podcast content collected from the MIT Technology Review, Scientific American, and Singapore Podcast websites. Experimental results show that the proposed algorithm achieves a detection rate of 97.5% with significant computation savings compared to existing state-of-the-art methods.
- Search Area
- False Acceptance Rate
- False Rejection Rate
- Search Speed
- Candidate Segment
Podcasting is already an important Internet application with millions of subscribers and is still growing rapidly towards a projected audience of 56 million by the year 2010. This growth has produced massive collections of audio podcast content on the Internet. Most podcast content is audio in the form of news feeds, interview transcripts, entertainment, and radio shows. Multiple advertisements (ads) are often inserted into one podcast story at different locations to advertise for publishers or sponsor companies. The increasing amount of audio podcasting creates the need for algorithms and systems that let users organize, store, personalize, and especially search audio podcast collections.
Recently, audio advertisement discovery and detection systems have attracted the attention of researchers because of their important role in many practical applications demanded by both publishers and listeners. From the viewpoint of end users, who have few options to choose which ads they listen to on podcast channels, automatic filtering of repeated and boring ads provides a better experience and higher efficiency. On the other hand, podcast advertisers and publishers are seeking cost-effective ways to replace outdated advertisements in rebroadcasts with new ones, either for business purposes or to target an individual listener's interests and preferences. This ad replacement increases the value to both distributor and listener. When podcast material is redistributed on individual request, the original advertisements can be removed and replaced with new ads that are more tightly targeted to listeners. Moreover, for marketing analysts and advertising planners, automatic monitoring of ads can help to collect informative statistics on the distributed advertisements. However, information about the originally distributed podcast ads and their insertion points is rarely available. This creates the need to efficiently and accurately discover and segment advertising material out of podcast content.
In this paper, an efficient advertisement discovery approach for audio podcasting is proposed to automatically discover and locate repeated advertisements in huge audio podcasting collections. Compared to existing work, the proposed approach makes the following major contributions. We first present typical audio podcasting structures and common temporal podcast distributions as well as podcast advertisement characteristics. Based on this knowledge, a Repeated Ad Database is built on the fly, and the characteristics of ads selected from this database are used to train a fuzzy neural network classifier, which serves as a candidate segmenter that preprocesses the audio collection and selects candidate segments for searching. This approach not only removes a significant portion of the search area but also limits the search to a shorter buffer length.
Second, we propose a multistage approach that greatly accelerates the search by cascading two detectors and combining detection-based and discovery-based techniques. In the first stage, by making use of knowledge about the typical structure of podcast content, candidate segmentation using a fuzzy neural network quickly narrows down the search areas. In the second stage, a sampling technique is employed to discover new and unknown advertisements. Finally, we analyse the computation savings mathematically and present in detail the tradeoff between search speed and detection accuracy. Compared to existing state-of-the-art methods and to simple brute force, our proposed approach improves search speed by 20 times and 100 times, respectively, while maintaining a high detection rate of 97.5% and obtaining the lowest false detection rate.
The rest of the paper is organized as follows. Related work is reviewed in Section 2, and Section 3 presents typical audio podcasting structures together with common temporal distributions of podcast content. The proposed approach and detailed computation analyses are described in Section 4. Finally, experimental results are shown in Section 5, followed by the conclusions in Section 6.
Advertisement detection and discovery for large audio collections are crucial problems that are closely related to audio processing techniques such as segmentation, indexing, and retrieval. Existing advertisement extraction approaches can be classified into two categories. The first, the detection-based approach, makes use of given clues or particular features of certain classes of advertisement, such as black/silent frames and differences in audio volume, to detect and locate advertisements [2–4]. The second, the discovery-based approach, makes use of repetition detection methods to discover new and unknown ads from an existing collection and keeps a database of all these advertisements for matching and detecting their recurrences [5–8]. The first approach performs feature extraction and matching for advertisement detection. The second must first discover ads from an existing collection and maintain a database of all known advertisements. Depending on the needs of a particular application, either approach can be employed. However, both face the same challenges: accurate, robust detection and a computationally efficient implementation for the massive amounts of audio/video data to be processed.
An interesting model used to detect a given known set of advertisements is proposed in . In this method, given a library of advertisements, a fingerprint is calculated for a sequence of frames based on Color Coherence Vectors. The fingerprints are then compared to detect recurrences of known advertisements. The authors pointed out that any fingerprint that is tolerant of channel deformations, has low dimension, and discriminates well between different advertisements will suffice. The main advantage of this approach is its high detection efficiency, which allows it to be applied to real-time advertisement recognition. Another example of the techniques in this category can be found in , in which the detection-based technique is also used as a filtering method to accelerate advertisement recognition . In addition, techniques based on segmentation and classification, which extract acoustic features and use classifier models such as SVM or HMM to classify each clip as advertisement or program, have been proposed to provide a general solution for advertisement classification .
The techniques in the first category have the advantage of efficient and fast detection owing to their low-computation search algorithms; however, they suffer from the following drawbacks. The clues used to detect advertisements are not always reliable for general applications. Moreover, advertisements are becoming extremely similar to normal program content, so particular features may not properly distinguish advertisements from other program content. When black/silent frames are not used at the beginning and end of an advertisement break, this type of technique fails. Finally, a general threshold that suits different broadcast channels and programs is very difficult to find; the detection-based approach is therefore very sensitive to the broadcast source.
On the other hand, the techniques in the second category can overcome the above problems by using a repetition detection approach, which makes no assumption about the location or nature of the advertisement and is therefore universal and robust. However, the major problem of these techniques when applied to large collections is the high computation required by the matching strategy. While detection-based techniques can be performed by constantly matching the fingerprints of a known library of advertisements against every incoming signal, discovering an unknown advertisement is more complex. In , the authors reported an efficient short video repeat identification method based on a similarity fingerprint matrix together with locality-sensitive hashing, which is commonly used to reduce search complexity. This approach pays a price in computation cost, as it performs an exhaustive search over the whole data stream. To avoid the exhaustive search, a sampled search algorithm was introduced in . Rather than checking every single frame, the authors proposed operating only on a small sample set of the stream. However, their approach uses fixed write and check rates, which may affect detection accuracy if any failure happens in sampled frame extraction or matching. Moreover, their search is applied to the whole stream, which requires a great deal of computation when dealing with a large collection.
Besides the computational cost, another issue that discovery-based techniques need to handle is how to manage the search buffer effectively when dealing with a huge collection. In , the authors introduced the ARGOS system for repeated object extraction and showed that it is not necessary to buffer large sections of the stream. Although the detection process of the ARGOS system, which makes use of a library of detected repeated objects, significantly increases search efficiency, its discovery process still suffers from an exhaustive search over the entire collection.
The systems proposed in [6–8] have in common with our work that they do not rely on particular features of certain classes of advertisement, for example, that advertisements are often followed by two blank/silent frames. The main difference between our work and many previous approaches is that they focused on very long unstructured streaming sequences, while our work is applied to multiple short audio podcast items and makes use of knowledge about their typical structures to quickly narrow down the search areas. Another difference is that, while other approaches either employed audio as a preprocessing step for video detection or subsequently used video features to verify the extracted objects, our approach considers solely audio features for audio podcast advertisement extraction.
Audio podcast content can be in various compressed audio formats such as MP3 and Advanced Audio Coding (AAC). Each content item is usually short, less than 30 minutes long. Multiple audio ads are often inserted into one podcast story at different locations, each ranging from 5–10 seconds to 20–30 seconds.
In general, sign_on and sign_off advertisements are located at fixed positions at the start and the end of the podcast items, respectively. To some extent, this knowledge could make the detection process easier. However, multiple service ads can be inserted anywhere along the podcast story. Therefore, in the discovery process, we make no assumption about the location or the nature of the advertisement. The location information is only used to classify and identify advertisements after they are extracted by our proposed search algorithm.
The most challenging tasks in automatically discovering repeated advertisements are managing buffering efficiently and handling the complexity of the search strategy when dealing with a huge collection. Unlike audio detection techniques that seek recurrences of known advertisements, discovery of unknown repeated ads is more complex. Detecting known advertisements only requires comparing each incoming audio frame against a library of known advertisements, which is usually small compared to the huge collection being searched. Therefore, the complexity of the detection problem is linear in the size of the collection. In the discovery problem, on the other hand, we must first find out what the repeated ads are and then detect all of their recurrences. Thus the computation required for discovering unknown advertisements grows exponentially with the search length as well as the size of the collection.
From (1), two factors determine the complexity of the search algorithm: the search length L of the collection and the number of matching frames N_B of the Buffer window. An exhaustive search over every frame of the whole collection is not necessary. The efficiency of the search can be improved by reducing the number of frames to be matched or by shortening the search areas.
In this work, we propose an approach, namely, Sampled and Skipping Search (SSS), to discover and identify unknown advertisements in large collections of audio podcasts by making use of specific knowledge of the typical podcast structure, which consists of multiple short files. The proposed approach combines an efficient sampling technique with candidate segmentation to offer a significantly faster search with sufficient accuracy by reducing both the search length and the number of matching frames.
4.1. Candidate Segmentation
In this section, making use of audio podcast knowledge, a fuzzy neural classifier is employed to quickly narrow the search area of the huge collection. The input signals of podcast files are labeled either as candidate segments, which have advertisement characteristics such as music, high-peak signals, or silence breaks, or as noncandidate segments, which correspond mostly to "pure talk". In other words, a candidate segment is an audio segment with a higher probability of containing an advertisement, while a noncandidate segment has a low probability of containing one or contains none.
4.1.1. Feature Extraction and Analysis
In audio podcasting, the signal mainly comes from speech, music, and the environment. Therefore, given the audio podcast files, we first uniformly segment them into nonoverlapping 1-s clips. Then 8 features, both temporal and spectral, are extracted to represent each clip and to capture the structure of different advertisements as candidate segments. They are energy-entropy block (EEB), short-time energy (STE), low-STE ratio (LSTER), short-time zero-crossing rate (ZCR), high-ZCR ratio (HZCRR), Spectral Roll-Off point (SRP), Spectral Centroid (SC), and Spectral Flux (SF). The detailed descriptions are given as follows.
Energy-entropy block is calculated as the standard deviation of the energy entropy over a 1-s clip. While STE and EEB provide a convenient representation of the signal's amplitude evolution over time, it is found that there are more quiet frames in pure speech (nonadvertisement) segments . Therefore, LSTER, the ratio of the number of "low-energy" frames, whose STE values are less than 0.5 of the mean value, to the total number of frames within a 1-s clip, can be used to detect nonadvertisement segments.
where N is the number of frames in a 1-s clip.
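As a minimal sketch of the temporal features named above, the following Python code computes STE, ZCR, LSTER, and HZCRR for one clip. The 0.5 factor for LSTER comes from the text; the frame length, the function names, and the 1.5 factor for HZCRR are assumptions made for illustration and are not values stated in this paper.

```python
import numpy as np

def frame_signal(clip, frame_len):
    """Split a 1-D clip into non-overlapping frames, dropping any remainder."""
    n_frames = len(clip) // frame_len
    return np.asarray(clip[: n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def lster(ste):
    """Share of frames whose STE is below 0.5 of the clip's mean STE."""
    return float(np.mean(ste < 0.5 * np.mean(ste)))

def hzcrr(zcr, factor=1.5):
    """Share of frames whose ZCR exceeds `factor` times the clip's mean ZCR.

    The default factor is an assumption, not a value from the paper.
    """
    return float(np.mean(zcr > factor * np.mean(zcr)))
```

A clip dominated by quiet frames yields a high LSTER, which is the property the text uses to flag nonadvertisement (pure speech) segments.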
where the roll-off point, fro, is the largest value of frequency fq for which the above equation is satisfied.
where X(n) represents the magnitude of the frames in a 1-s clip and f(n) represents the center frequency of that frame.
Spectral Flux (SF) is defined as the spectral correlation between two adjacent frames in a 1-s clip to capture the local spectral change . It is effective in distinguishing certain audio types of environmental sound.
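The roll-off point and the spectral centroid described above can be sketched in Python as follows. This is an illustration under stated assumptions: the 0.85 roll-off fraction, the per-bin FFT formulation, and all function names are hypothetical choices, since the paper's own equations and constants are not reproduced here.

```python
import numpy as np

def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies in Hz, magnitudes) of one frame via a real FFT."""
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, mags

def spectral_rolloff(freqs, mags, fraction=0.85):
    """Largest frequency whose cumulative magnitude stays within `fraction` of the total.

    The default fraction is an assumption, not a value from the paper.
    """
    cumulative = np.cumsum(mags)
    within = np.nonzero(cumulative <= fraction * cumulative[-1])[0]
    return float(freqs[within[-1]]) if within.size else float(freqs[0])

def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency; 0.0 for an all-zero spectrum."""
    total = np.sum(mags)
    return float(np.sum(freqs * mags) / total) if total > 0 else 0.0
```

Music and jingles tend to spread energy over higher frequencies than plain talk, so both values are plausible cues for separating candidate from noncandidate clips.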
4.1.2. Candidate Selection by FCMAC-BYY Classification
In this section, the FCMAC-BYY network  is employed to classify the input audio podcast signal into one of two classes (candidate or noncandidate) due to its fast learning and simple computation capabilities. First proposed by Albus , the original CMAC has been widely applied in many areas such as robotic control, signal processing, and pattern recognition. Recently, Bayesian Ying-Yang (BYY) learning and fuzzy logic have been successfully integrated into CMAC to propose the FCMAC-BYY network, which has shown advantages in classification and regression problems .
The input to FCMAC-BYY is a nonfuzzy data vector corresponding to a measure of the input parameter represented in the respective dimension. The fuzzification layer maps input patterns into the fuzzy clusters c in the association layer through BYY learning. Thereafter, the association layer associates the fuzzy rules with the memory cell and tries to imitate a human cerebellum. The logical AND operation is carried out in this layer to ensure that a cell is activated only when all the inputs associated with it are fired. The association layer is then mapped to the postassociation layer where the logical OR operation will fire those cells whose connected inputs are activated. For the output layer, the defuzzification centre of area (COA)  method is used to compute the output of the structure.
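The layer-by-layer flow described above (fuzzification, a logical AND per rule, and centre-of-area defuzzification) can be illustrated with a generic fuzzy inference sketch. This is not the FCMAC-BYY network itself: the Gaussian membership functions, the use of the minimum as the fuzzy AND, the omission of the OR layer, and every name below are assumptions chosen to keep the example small.

```python
import numpy as np

def rule_firing(x, centers, widths):
    """Firing strength per rule: fuzzy AND (minimum) over Gaussian memberships.

    x: (d,) input vector; centers, widths: (rules, d) cluster parameters.
    """
    memberships = np.exp(-0.5 * ((np.asarray(x, dtype=float) - centers) / widths) ** 2)
    return memberships.min(axis=1)

def defuzzify(firing, rule_outputs):
    """Centre-of-area style output: firing-weighted mean of the rule outputs."""
    total = firing.sum()
    return float(np.dot(firing, rule_outputs) / total) if total > 0 else 0.0
```

With one rule centred on advertisement-like inputs (output 1) and one on talk-like inputs (output 0), an input midway between the two centres produces an output of 0.5, and the output moves towards 1 or 0 as the input approaches either centre.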
In contrast to the conventional clustering algorithms, which are "one-way" learning, BYY harmonizes the training input and the solution/clusters by considering not only forward mapping from the input data into the clusters, but also the backward path from the obtained clusters to the input data. With the introduction of the BYY learning algorithm, FCMAC-BYY has higher generalization ability because the fuzzy rule sets are systematically optimized by BYY. Recently, incremental learning with sliding window has been introduced into FCMAC-BYY to dynamically construct fuzzy rule sets for time series applications . In this respect, the FCMAC-BYY model becomes an ideal classifier for our candidate segmentation, which requires fast learning and adaptive and dynamic real-time training since the entire data set no longer needs to be obtained during the prediction process.
In this research, the FCMAC-BYY classifier is trained on the 8 features that represent each 1-s clip of the advertisement templates selected from the Repeated Ad Database. One output is used to differentiate between advertisement and nonadvertisement clips: advertisement clips are denoted by output "1" and nonadvertisement clips by output "0". We define the candidate ratio (CR) as the ratio of the length of the candidate segments to the total length of the podcast files. The classification threshold of the trained FCMAC-BYY is then adjusted to minimize this ratio while keeping the rate at which advertisement clips are misclassified as noncandidate segments at zero percent.
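The threshold adjustment described above can be sketched as a small selection rule: on labelled training clips, the highest threshold that still keeps every advertisement clip in the candidate class gives the smallest candidate ratio with zero missed ads. The function names and the convention that a higher score means more advertisement-like are assumptions for illustration; the classifier that produces the scores is left out.

```python
import numpy as np

def pick_threshold(scores, is_ad):
    """Highest threshold that still labels every known ad clip as candidate.

    scores: classifier output per 1-s clip (higher means more ad-like).
    is_ad:  boolean ground truth per clip from the training templates.
    """
    scores = np.asarray(scores, dtype=float)
    is_ad = np.asarray(is_ad, dtype=bool)
    if not is_ad.any():
        raise ValueError("need at least one advertisement clip")
    return float(scores[is_ad].min())

def candidate_ratio(scores, threshold):
    """Fraction of equal-length clips labelled candidate (score >= threshold)."""
    return float(np.mean(np.asarray(scores, dtype=float) >= threshold))
```

Because all clips have the same 1-s length, the fraction of candidate clips equals the ratio of candidate length to total length, which is how CR is defined in the text.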
4.2. Repeated Advertisement Detection
4.2.1. Audio Fingerprint Generator
where E(r, b) is the energy of band b of frame r, and f(r, b) is the b-th bit of the fingerprint f(r) of frame r.
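One common way to turn band energies E(r, b) into fingerprint bits f(r, b) is to take the sign of the energy difference across neighbouring bands and consecutive frames. The sketch below uses that scheme as an assumption; the paper's own fingerprint equation is not reproduced here and may differ. The function names are hypothetical.

```python
import numpy as np

def fingerprint_bits(band_energy):
    """Derive fingerprint bits from a (frames x bands) energy matrix.

    Bit (r, b) is 1 when the energy difference between bands b and b+1
    increases from frame r-1 to frame r. Output shape: (frames-1, bands-1),
    where output row i corresponds to input frame i+1.
    """
    energy = np.asarray(band_energy, dtype=float)
    across_bands = energy[:, :-1] - energy[:, 1:]
    across_time = across_bands[1:] - across_bands[:-1]
    return (across_time > 0).astype(np.uint8)

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two equally shaped fingerprints."""
    return float(np.mean(np.asarray(fp_a) != np.asarray(fp_b)))
```

Two frames are then treated as a match when their bit error rate falls below a small threshold, which tolerates the noise and decoding effects mentioned later in the text.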
4.2.2. Detection Process
Once an advertisement is found by the discovery process and its locations and boundaries have been identified, the clip can be added to the Repeated Ad Database as a known advertisement. These detected ads are then compared with new frames entering the Buffer window to detect their repetitions. If a match is found, its location is marked as an identified segment. Identified segments do not have to be searched again by the discovery process. Therefore, the search length of the Buffer window can be further reduced by skipping all identified segments.
The advantage of this approach is that after the second occurrence, each repeated advertisement will be added into the Repeated Ad Database and its recurrence will be quickly detected by the detection process. Every time a repetition of ads is found, we can shorten the length of the Buffer window that remains to be searched by the discovery process. As long as the Repeated Ad Database is smaller than the buffer, our approach improves search speed. In addition, the Repeated Ad Database of found ads can be sorted by the frequency of repetition, so that most common ads are checked first.
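The detection-and-skip step can be sketched as follows. Known ads are checked in order of repetition frequency, matched spans are marked as identified, and only the remaining spans are handed to the discovery process. The data layout, the function names, and the use of exact fingerprint equality are simplifying assumptions; a practical matcher would tolerate a small bit error rate.

```python
def find_known_ads(buffer_fp, ad_db):
    """Return sorted (start, end) spans of the buffer that match a known ad.

    buffer_fp: list of per-frame fingerprints.
    ad_db:     dict name -> {"fp": list of fingerprints, "count": int}.
    Ads with the highest repetition count are checked first, and the
    count of an ad is incremented for every occurrence found.
    """
    identified = []
    by_frequency = sorted(ad_db.items(), key=lambda kv: kv[1]["count"], reverse=True)
    for _name, ad in by_frequency:
        fp, n = ad["fp"], len(ad["fp"])
        start = 0
        while n and start + n <= len(buffer_fp):
            if buffer_fp[start:start + n] == fp:
                identified.append((start, start + n))
                ad["count"] += 1
                start += n
            else:
                start += 1
    return sorted(identified)

def remaining_ranges(buffer_len, identified):
    """Spans of the buffer still left for the discovery process."""
    gaps, cursor = [], 0
    for start, end in sorted(identified):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < buffer_len:
        gaps.append((cursor, buffer_len))
    return gaps
```

Each identified span shortens the part of the Buffer window that the more expensive discovery search still has to cover.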
4.2.3. Discovery Process
where L_R is the length of the advertisement, and f(r1) and f(r2) are the fingerprints at frames r1 and r2 of the advertisement and its repetition, respectively. Note that the relative offsets, k, need not be exactly equal, because of noise or decoding effects. As long as the relative offsets are within a small threshold of each other, the frames still match.
Once a match is found, we would expect that a repeat has occurred. However, the boundaries of the two repetitions have not been identified. Since the match corresponds to a repetition of the advertisement at the location of the sampled frame found, we are able to identify the starting and ending positions of the repeat by tracing the fingerprint of frames backwards and forwards, respectively.
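The boundary tracing described above can be sketched as growing a matched frame pair backwards and forwards until the two copies stop agreeing. The integer fingerprint representation, the `max_bits` tolerance, and the function names are assumptions for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")

def trace_repeat(fp, i, j, max_bits=2):
    """Grow a match at frames i < j into the full repeated span.

    fp: list of integer per-frame fingerprints for the whole buffer.
    Returns (start_i, start_j, length) of the two aligned copies.
    Frames count as equal when at most `max_bits` bits differ.
    """
    if not (0 <= i < j < len(fp)) or hamming(fp[i], fp[j]) > max_bits:
        raise ValueError("i and j must index a matching pair with i < j")
    back = 0
    # Walk backwards until the copies disagree or the first copy hits the buffer start.
    while i - back - 1 >= 0 and hamming(fp[i - back - 1], fp[j - back - 1]) <= max_bits:
        back += 1
    fwd = 0
    # Walk forwards until the copies disagree or the second copy hits the buffer end.
    while j + fwd + 1 < len(fp) and hamming(fp[i + fwd + 1], fp[j + fwd + 1]) <= max_bits:
        fwd += 1
    length = min(back + fwd + 1, j - i)  # cap so the two copies cannot overlap
    return i - back, j - back, length
```

Only the sampled frame needs to be found by the search itself; the extension step then recovers the start and end of the repeat at a cost proportional to the advertisement length.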
4.3. Analysis Computation Saving
Compared to (1), the proposed SSS approach reduces the computation by a significant factor on average. The two factors that determine the complexity of the proposed SSS approach are therefore the sampling rate S and the candidate ratio CR. Detailed analyses of the effect of these two factors, as well as the tradeoff between search speed and detection accuracy, are given in Section 5.
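To make the role of the two factors concrete, the following toy cost model counts frame comparisons. It assumes that the brute-force cost is proportional to the search length times the number of matching frames, that candidate segmentation scales the search length by the candidate ratio, and that sampling keeps one frame out of every `sample_step`. These are illustrative assumptions with hypothetical names, not the paper's equations.

```python
def brute_force_comparisons(search_len, buffer_frames):
    """Frame comparisons when every frame is matched against the whole buffer."""
    return search_len * buffer_frames

def sss_comparisons(search_len, buffer_frames, sample_step, candidate_ratio):
    """Comparisons under the toy model: keep only candidate segments in the
    search area and match one frame out of every `sample_step`."""
    reduced_len = search_len * candidate_ratio
    sampled_frames = buffer_frames / sample_step
    return reduced_len * sampled_frames

def speedup(sample_step, candidate_ratio):
    """Ratio of brute-force to reduced cost; independent of the lengths."""
    return sample_step / candidate_ratio
```

Under this model the saving grows as the sampling becomes sparser and as the candidate ratio shrinks, which is the tradeoff against detection accuracy that Section 5 examines.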
Table: Summary of audio podcast collection (columns: number of files, total length in minutes, number of advertisements; first row: MIT Technology Review).
We first show the effect of the proposed candidate segmentation on shortening the search length. This experiment is conducted in two steps as follows. In the initialization step, the Repeated Ad Database is built from the first 300 minutes of the collection using the proposed approach without candidate segmentation. The learning (training) data set is then selected from these detected advertisements to train our FCMAC-BYY classifier. The rest of the collection data is used in the testing step. During this step, new advertisements discovered by the discovery process will be added into the Repeated Ad Database and be used to update the learning data set subsequently.
Figure 8 shows that, while the advertisement clips of the MIT Technology Review podcasts can be separated easily with low error values, the advertisement and nonadvertisement clips of the Scientific American and Singapore Podcast podcasts overlap significantly. These results can be explained by the nature of the three podcasting websites, as shown in Figure 8.
Table: Candidate segmentation percentage (column: total time in seconds).
In the second experiment, we study the relationship between detection accuracy and search time for different sampling rates S. The proposed approach is compared to the Sampled Search , which applies the search to the whole collection. In this experiment, the buffer size of the proposed SSS approach is set to the length of 30 files, which is about 2 hours.
Table: Comparison results of the proposed SSS approach (columns: average detection rate, computation time in minutes).
The false detections fall only on service ads, because all repeated sequences located between the beginning and the end of a file are considered service advertisements. Note that the location information is only used to classify advertisements after they are identified by our SSS approach. On average, the proposed SSS approach improves search speed by 100 times compared to a simple search method and by 20 times compared to , while maintaining a high detection rate of 97.5% and obtaining the lowest false detection rate.
This paper introduces a novel Sampled and Skipping Search approach to efficiently discover and detect unknown podcast advertisements in large podcasting collections. Based on knowledge of the typical structure of podcast content, the proposed approach employs candidate segmentation and sampling techniques to accelerate the search by reducing both the search areas and the number of matching frames. Compared to existing state-of-the-art techniques, the proposed approach greatly improves search speed and saves significant computation while maintaining sufficient detection accuracy. We show the effect of the sampling rate and the buffer size on the tradeoff between detection accuracy and search speed. We also present the typical audio podcast structures and point out that buffering a large number of files from the collection is not necessary for improving the detection rate. Finally, detailed experimental analyses conducted on a variety of podcast content collected from various websites are reported.
This work is supported by the Agency for Science, Technology, and Research (A*Star), Singapore, under the Grant no. 062 130 0058.
- Lienhart R, Kuhmuench C, Effelsberg W: On the detection and recognition of television commercials. Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS '97), June 1997, 509-516.
- Ruinskiy D, Lavner Y: An effective algorithm for automatic detection and exact demarcation of breath sounds in speech and song signals. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(3):838-850. 10.1109/TASL.2006.889750
- Kashino K, Kurozumi T, Murase H: A quick search method for audio and video signals based on histogram pruning. IEEE Transactions on Multimedia 2003, 5(3):348-357. 10.1109/TMM.2003.813281
- Foote J, Cooper M: Audio retrieval by rhythmic similarity. Proceedings of the International Conference on Music Information Retrieval, 2002, Paris, France.
- Herley C: Accurate repeat finding and object skipping using fingerprints. In Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, Singapore. ACM; 656-665.
- Yang X-F, Tian Q, Xue P: Efficient short video repeat identification with application to news video structure analysis. IEEE Transactions on Multimedia 2007, 9(3):600-609. 10.1109/TMM.2006.889352
- Herley C: ARGOS: automatically extracting repeating objects from multimedia streams. IEEE Transactions on Multimedia 2006, 8(1):115-129. 10.1109/TMM.2005.861286
- Ling-Yu D, Jinqiao W, Yantao Z, Jesse SJ, Hanqing L, Changsheng X: Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis. In Proceedings of the 14th Annual ACM International Conference on Multimedia (MM '06), 2006, Santa Barbara, Calif, USA. ACM; 201-210.
- Li Y, Dorai C: Instructional video content analysis using audio information. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(6):2264-2274. 10.1109/TASL.2006.872602
- Lie L, Hong-Jiang Z, Hao J: Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing 2002, 10(7):504-516. 10.1109/TSA.2002.804546
- McAdams S: Perspectives on the contribution of timbre to musical structure. Computer Music Journal 1999, 23(3):85-102. 10.1162/014892699559797
- Nguyen MN, Shi D, Quek C: FCMAC-BYY: fuzzy CMAC using Bayesian Ying-Yang learning. IEEE Transactions on Systems, Man, and Cybernetics B 2006, 36(5):1180-1190. 10.1109/TSMCB.2006.874691
- Albus JS: A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Transactions of the ASME, Dynamic Systems, Measurement and Control 1975, 97(3):220-227. 10.1115/1.3426922
- Nguyen MN, Shi D, Quek C: A nature inspired Ying-Yang approach for intelligent decision support in bank solvency analysis. Expert Systems with Applications 2008, 34(4):2576-2587. 10.1016/j.eswa.2007.04.020
- Lee ES, Zhu Q: Fuzzy and Evidence Reasoning. Physica; 1995.
- Shi D, Nguyen MN, Zhou S, Yin G: Fuzzy CMAC with incremental Bayesian Ying-Yang learning and dynamic rule construction. IEEE Transactions on Systems, Man, and Cybernetics B 2009, 40: 548-552.
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.