
End-to-end training of acoustic scene classification using distributed sound-to-light conversion devices: verification through simulation experiments

Abstract

We propose a framework for classifying acoustic scenes utilizing distributed sound sensor devices capable of sound-to-light conversion, which we term Blinkies. These Blinkies can convert acoustic signals into varying intensities of light via an inbuilt light-emitting diode. By using Blinkies, we can aggregate the spatial acoustic information across a wide region by recording the fluctuating light intensities of numerous Blinkies distributed throughout the region. Nonetheless, the transmitted signal is subject to the bandwidth limitation imposed by the frame rate of the video camera, typically capped at 30 frames per second. Our objective is to refine the process of transforming sound into light for the purpose of acoustic scene classification within these bandwidth confines. While traveling through the air, a light signal is affected by inherent physical limitations such as the attenuation of light and interference from noise. To account for these factors, we have integrated these physical constraints into differentiable physical layers. This approach enables us to jointly train a pair of deep neural networks for the conversion of sound to light and for the classification of acoustic scenes. Our simulation studies, which employed the SINS database for acoustic scene classification, demonstrated that our proposed framework outperforms the previous one that utilized Blinkies. These findings emphasize the effectiveness of Blinkies in the field of acoustic scene classification.

1 Introduction

There has been a recent surge in the pursuit of acoustic scene analysis, with numerous workshops and challenges taking place [1, 2]. The purpose of acoustic scene analysis is to recognize actions, for instance, “cooking,” “cleaning with a vacuum,” or “viewing television,” or to comprehend the ongoing scenario, such as “traveling in a bus,” “being in a park,” or “socializing with others,” on the basis of auditory data [3]. Specifically, determining the best-fitting acoustic scene from a predefined set of such classes is called acoustic scene classification (ASC). To effectively analyze acoustic scenes, both spatial and spectral information are critical. In more concrete terms, the differentiation between an actual sound and a reproduced sound, which is challenging with spectral information alone, can be realized by also utilizing spatial information.

The acquisition of spatial information requires simultaneously using multiple microphones, in other words, a distributed microphone array [4, 5]. A fundamental strategy for integrating spatial information into acoustic scene analysis is through the localization of sound sources. Nevertheless, there are technical problems associated with real-time acoustic sensing using a distributed microphone array. To acquire spatial information from a distributed microphone array, the acoustic signals obtained from each microphone must be synchronized in time with high accuracy. For temporal synchronization, each microphone is generally connected to a multichannel analog-to-digital converter via wiring, but this involves a significant cost in setting up the recording environment. For this reason, many studies have been conducted to build distributed microphone arrays using wireless microphones, i.e., wireless acoustic sensor networks. To synchronize the asynchronous multichannel signals obtained from these wireless microphones, research on blind synchronization of sound is ongoing [6]. However, in the wireless transmission of audio signals, there is not only the issue of synchronization but also the problem of channel capacity. For instance, if uncompressed audio signals of 384 kbit/s are transmitted from each of eight wireless microphones, the required bandwidth would be roughly 3 Mbit/s at all times. As more microphones are needed to sense a wider space, the required bandwidth for communication increases even further.

To address these challenges, we previously engineered a sound-to-light conversion device named a Blinky shown in Fig. 1  [7,8,9,10]. In an earlier framework using Blinkies, a Blinky recorded an acoustic signal through a microphone and evaluated its power in-device. In line with the acoustic power, the Blinky modulated the brightness of an integrated light-emitting diode (LED). Finally, a video camera was employed to concurrently record LED brightness from multiple Blinkies spread over a wide region. Aggregating the Blinky signals from the recorded video, the fusion center, which is a high-performance server, performs acoustic scene analysis by integrating and analyzing acoustic information. If a video camera is already installed for purposes such as surveillance, it becomes possible to easily start acoustic sensing by simply distributing Blinkies.

Fig. 1  Picture of Blinky. Blinky transmits sound information as the intensity of an onboard LED

Despite the potential of power-based sound-to-light conversion in the realm of acoustic scene analysis, its optimality remains undetermined. Light signals emitted from Blinkies are affected by light attenuation and noise during the light-signal transmission through air. In addition, the captured signals will face strong bandwidth limitations due to the cameras’ limited frame rate, generally 30 frames per second (fps). Consequently, it becomes imperative to account for these physical constraints in the design of the sound-to-light conversion process. Furthermore, vital acoustic features vary depending on the scene and situation that we want to analyze. Thus, it is crucial to optimize the sound-to-light conversion process in a manner that enables the extraction of the most appropriate and useful features for the problem to be solved.

Given this situation, our objective is to learn the optimal sound-to-light conversion process in Blinkies as a substitute for transmitting sound power information. To realize this objective, in this study, we develop an end-to-end training framework of both edge devices (Blinkies) and a fusion center with sound-to-light conversion for ASC. The proposed framework enables us to train two types of deep neural network (DNN) with an end-to-end approach: an encoder in each edge device that transforms a sound signal measured via a microphone into a signal to be transmitted by an LED, and a classifier in the fusion center that estimates the acoustic scene using captured LED intensities. In the proposed framework, the light-signal propagation in air and camera responses are modeled as differentiable physical layers. These physical layers allow us to obtain appropriate signal transformations in Blinkies by using a data-driven approach while considering the physical constraints and the accuracy of ASC.

We conducted a simulation experiment of acoustic scene classification using the SINS database [11] to evaluate the efficiency of the proposed framework. The experimental findings illustrate that the proposed framework allows us to achieve a superior classification accuracy compared with a previous framework with Blinkies.

This paper is partially based on a conference paper [12] in which we proposed the end-to-end training framework for acoustic scene classification with Blinkies. The contribution of this paper is that we provide results of simulations using a larger-scale dataset than that used in our previous work. We also provide results of analysis from new viewpoints to confirm the properties of the framework.

The rest of this paper is organized as follows. In Sect. 2, we review related work on acoustic scene classification. In Sect. 3, the sound-to-light conversion device Blinky and acoustic sensing using Blinkies are summarized. We propose a novel ASC framework with Blinkies in Sect. 4. Simulation experiments and their results are analyzed in Sect. 5. In Sect. 6, we discuss our findings and the limitations of our framework. In Sect. 7, we conclude this paper.

2 Related work

For ASC using acoustic information over a wide area, we can consider using distributed edge devices having microphones for sensing and extracting helpful features for ASC. Then, we can use a fusion center to integrate the features and perform ASC. In this section, we briefly summarize related work on ASC. After that, we review ASC with a distributed microphone array.

2.1 Acoustic scene classification

ASC is a task that involves the determination of the best-fitting acoustic scene from a predefined set of classes based on an input sound. An acoustic scene represents the location, situation, and surrounding human activities where the sound was recorded.

Since the inaugural Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge in 2013, ASC has been a recurring task in these annual competitions, highlighting its significance as a fundamental task in environmental sound analysis. Historically, classifiers such as Gaussian mixture models, hidden Markov models, and support vector machines were employed for ASC. Regarding acoustic features, time-frequency representations such as spectrograms, mel-band spectrograms, and mel-frequency cepstral coefficients (MFCCs) are commonly used. In recent times, deep neural networks (DNNs) have become the mainstream classifier. Accordingly, there are also methods that directly use time-domain signals (as explored in [13]) and techniques that employ convolutional neural networks (CNNs) for feature extraction (as discussed in [14]).

Toward the implementation of ASC in portable and compact devices, low-complexity ASC systems have recently been studied. The DCASE 2023 Challenge, Task 1, requires participants to achieve high ASC accuracy under the constraints of a parameter size of 128 KB and a computational complexity not exceeding 30 million multiply-accumulate operations (MMACs). This task is designed with specific hardware in mind, namely, Cortex-M4 devices. The top three teams of the challenge [15,16,17] reduced the computational complexity by employing DNNs with small numbers of parameters, reducing the parameter bit depth to 8-bit integers with quantization-aware training [18], and applying knowledge distillation [19].

Furthermore, research efforts have been directed towards utilizing microphone arrays to capture spatial information of sounds, aiming to achieve a more accurate ASC [20,21,22]. However, these methods face challenges in environments where the desired scene classification spans a wide area and the microphone array is far from the sound source. To address these challenges, some researchers focus on using a distributed microphone array for ASC.

2.2 ASC with distributed microphone array

Spatial information of sound that can be extracted by a distributed microphone array is useful for DCASE tasks. This is because acoustic scenes can be characterized by a combination of sound events, and certain sound events are more likely to occur in specific locations. Illustrative examples of these sound events include domestic sounds such as doorbells, kitchen appliances, and television audio. For this reason, various methods using distributed microphone arrays have been proposed [4, 5, 23,24,25,26,27,28].

Assuming that all microphones are synchronized at least at a short-time frame level, Phan et al. [28] and Kürby et al. [5] showed that a late fusion technique for acoustic event detection, which identifies acoustic events individually for each microphone and then amalgamates these findings on a frame-by-frame basis, outperforms methods using single-channel signals in terms of accuracy.

Recent investigations by Kawamura et al. [27] reinforce the efficacy of spatial information in ASC. Assuming that multiple subarrays, each containing multiple microphones, are distributed and synchronized, they showed that using the generalized cross-correlation phase transform (GCC-PHAT) computed between channels within the same subarray or across different subarrays, together with spectral features such as the log-mel spectrogram, successfully enhanced the ASC performance.

In real-world scenarios, the synchronization of distributed microphone arrays is difficult to achieve owing to sampling rate or recording start time mismatches. Consequently, the spatial cepstrum [26] has been suggested as a spatial feature that is robust to the lack of synchronization among edge devices. Nevertheless, there exist hurdles in the transmission of features extracted at each microphone, particularly given the constraint of limited wireless bandwidth.

3 Sound-to-light conversion device Blinky

The use of Blinkies enables us to avoid complicated processing, such as synchronization, in signal acquisition using a distributed microphone array. In this section, we summarize an acoustic sensing procedure with Blinkies and the aim of this study.

3.1 Acoustic sensing with Blinkies

In this paper, we assume that M Blinkies placed at fixed locations record acoustic signals and that a video camera, also at a fixed position, captures their LED intensities. Here, we assume that their spatial positions are given. Acoustic sensing with Blinkies consists of three parts: sound-to-light conversion in each Blinky, signal transmission by light, and capturing the LED light of Blinkies by the video camera (see Fig. 2).

Fig. 2  Process of acoustic sensing with Blinkies

Fig. 3  Proposed end-to-end acoustic scene analysis framework

3.1.1 Sound-to-light conversion

Let n, \(F_{s, n}\), x[n], and B be the discrete-time index, the sampling frequency corresponding to n, the microphone signal, and the audio buffer size, respectively. From the signal x[n], the sound power measurement u[n] is computed as

$$\begin{aligned} u[n] = \left\{ \begin{array}{ll} \frac{1}{B} \sum _{i=1}^{B} x[n-B+i]^2 & n \bmod B = 0\\ u[n-1] & \textrm{otherwise}\\ \end{array}\right. . \end{aligned}$$
(1)

To efficiently encode sound power measurements u[n] as LED intensities, we map u[n] using a nonlinear function \(\varphi (\cdot )\). The function \(\varphi (\cdot )\) was designed such that it maximizes the entropy of \(\varphi (u[n])\) to distribute LED intensities uniformly, to avoid unnecessarily allocating a broad range of LED intensities to rare, extremely loud sounds, and to prevent errors caused by ambient light and quantization [10]. Then, the actual emitted light intensity I(t) at the continuous time t is given by

$$\begin{aligned} I(t) = \varphi (u[\lfloor t F_{s, n} \rfloor ]), \end{aligned}$$
(2)

where \(\lfloor \cdot \rfloor\) indicates the floor function.
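As a concrete illustration of Eqs. (1) and (2), the following Python sketch computes the block-averaged power u[n] and maps it to an LED intensity. The compressive mapping `phi` below is only a placeholder for the entropy-maximizing function of [10], which is not reproduced here, and the constant `k` is an assumed value.

```python
import numpy as np

def sound_power(x, B):
    """Block-averaged sound power u[n] of Eq. (1): updated every B samples, held constant in between."""
    x = np.asarray(x, dtype=float)
    u = np.zeros(len(x))
    for n in range(len(x)):
        if n >= B and n % B == 0:
            u[n] = np.mean(x[n - B + 1:n + 1] ** 2)  # mean power of the latest B samples
        elif n > 0:
            u[n] = u[n - 1]                          # hold the previous value between updates
    return u

def phi(u, k=100.0):
    """Placeholder compressive mapping to [0, 1] for u in [0, 1]; a stand-in for the entropy-maximizing phi of [10]."""
    return np.log1p(k * u) / np.log1p(k)

def emitted_intensity(t, u, fs):
    """Eq. (2): LED intensity I(t) at continuous time t (in seconds), for u sampled at fs."""
    return phi(u[int(np.floor(t * fs))])
```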

3.1.2 Signal transmission by light

After the sound-to-light conversion, LED light from Blinkies propagates in air, and a video camera captures it. The LED light intensity at the camera is affected by attenuation a depending on the angle and distance between each LED and the video camera. In addition to this attenuation, ambient light is added to the light intensity as a positive bias b. For these reasons, the radiant power density at an imaging sensor on the camera, i.e., the irradiance E(t), is calculated using the attenuation a, bias b, and noise \(\epsilon\) as

$$\begin{aligned} E(t) = a I(t) + b + \epsilon . \end{aligned}$$
(3)

3.1.3 Capturing light of Blinkies by video camera

An imaging sensor on the camera captures the irradiance E, and the camera then encodes it as a video file. The irradiance E is integrated over a time interval determined by the frame rate \(F_{s, m}\) of the camera. This process can be written as

$$\begin{aligned} X[m] = \int _{(m-1) / F_{s, m}}^{m / F_{s, m}} E(t) dt, \end{aligned}$$
(4)

where m is the discrete-time index for video frames and X[m] is the energy density. Finally, the captured pixel value p is given by

$$\begin{aligned} p[m] = f(X[m]), \end{aligned}$$
(5)

where f is a function combining the sensor saturation \(s(\cdot ) = \max (0, \min (1, \cdot ))\) and the camera response function (CRF). The CRF represents the processing in each camera that makes the final image appear better. One of the typical CRFs is the gamma correction. It converts the sensor output \(v[m] = s(X[m])\) so that \(p[m] = (v[m])^{1/\gamma }\) with \(\gamma = 2.2\). Because industrial cameras usually provide raw video frames that directly store the sensor output v[m], the nonlinear transform by the CRF can be avoided and we can assume \(p[m] = v[m]\).
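A discrete approximation of Eqs. (4) and (5) may look as follows; the rectangular approximation of the integral over each frame is our own discretization choice, while the saturation s(·) and the gamma-type CRF follow the definitions above (raw frames skip the CRF).

```python
import numpy as np

def capture_frames(E, fs_audio, fs_video=30.0, gamma=2.2, apply_crf=False):
    """Approximate Eqs. (4) and (5): integrate the irradiance E (sampled at fs_audio)
    over each video frame, saturate to [0, 1], and optionally apply gamma correction."""
    spf = int(round(fs_audio / fs_video))   # samples per video frame
    dt = 1.0 / fs_audio
    n_frames = len(E) // spf
    p = np.empty(n_frames)
    for m in range(n_frames):
        X = np.sum(E[m * spf:(m + 1) * spf]) * dt       # Eq. (4): energy density of frame m
        v = np.clip(X, 0.0, 1.0)                        # sensor saturation s(.)
        p[m] = v ** (1.0 / gamma) if apply_crf else v   # Eq. (5); raw frames give p[m] = v[m]
    return p
```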

3.2 Scenario

Because of the propagation in Eq. (3) and the camera response in Eqs. (4) and (5), the captured pixel value p differs from the actual sound power u measured by Blinkies. Furthermore, important acoustic features for acoustic scene analysis vary in accordance with the scene labels we want to attach to sounds or the ambient sound type and volume. Therefore, the sound-to-light conversion based on the sound power in the previous framework with Blinkies might not be optimal for transmitting sound information by light or for acoustic scene analysis.

To overcome these issues with a data-driven approach, we propose an end-to-end training framework for acoustic scene analysis with Blinkies in the next section.

4 Proposed framework

Figure 3 shows the proposed end-to-end acoustic scene analysis framework. In the proposed framework, we have two DNNs: an encoding network that converts recorded signals into signals that can be effectively transmitted and are appropriate for scene analysis, and a scene analysis network that performs scene analysis. To train these DNNs in an end-to-end manner, we model the light propagation between Blinkies and a camera, and camera responses as differentiable physical layers.

4.1 Differentiable physical layers

Differentiable physical layers are differentiable models of physical phenomena that can be incorporated into DNNs. They enable DNNs to consider physical phenomena.

4.1.1 Light propagation layer

A light propagation layer is a model of the signal transmission between a Blinky and a camera (see Sect. 3.1.2). Since Eq. (3) is differentiable, we calculate the following equation in this layer:

$$\begin{aligned} y[n] = a x[n] + b + \epsilon , \end{aligned}$$
(6)

where x[n] and y[n] are 1D signals input to this layer and output from this layer, respectively. We assume that attenuation a is inversely proportional to the square of the distance between a Blinky and a camera, and \(\epsilon\) follows a normal distribution. b can be calculated from the pixel value p when the corresponding LED is not lit.
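A possible PyTorch sketch of the light propagation layer in Eq. (6), under the assumptions stated above (attenuation inversely proportional to the squared distance, Gaussian noise, constant bias); the reference distance and the noise standard deviation are illustrative values that we introduce, not parameters specified in the paper.

```python
import torch
import torch.nn as nn

class LightPropagation(nn.Module):
    """Differentiable model of Eq. (6): y = a * x + b + eps."""

    def __init__(self, distance, bias=0.0, noise_std=0.01, ref_distance=1.0):
        super().__init__()
        self.a = (ref_distance / distance) ** 2  # attenuation ~ inverse squared distance
        self.b = bias                            # estimated from pixel values when the LED is off
        self.noise_std = noise_std               # assumed noise level

    def forward(self, x):
        eps = self.noise_std * torch.randn_like(x)  # Gaussian noise added on every forward pass
        return self.a * x + self.b + eps
```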

4.1.2 Camera response layer

A camera response layer is a model of the integration in Eq. (4) on a camera sensor (see Sect. 3.1.3). This integration can be interpreted as a sampling operation with low-pass filtering. For this reason, the camera response layer resamples an input signal x[n] at the camera frame rate \(F_{s, m}\) using

$$\begin{aligned} y[m] = \textrm{resample}(x[n]), \end{aligned}$$
(7)

where \(\mathrm {resample(\cdot )}\) indicates the resample operation. Since most cameras have a frame rate of \(30~\textrm{fps}\), we set \(F_{s, m}\) to \(30~\textrm{Hz}\) in this work. Note that the nonlinear transform by a CRF can be avoided by using raw video frames. Hence, we do not consider CRFs in the camera response layer.
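One way to realize this layer is with an off-the-shelf differentiable resampler, which performs the low-pass filtering and sampling described above; the paper does not prescribe a particular implementation, so the use of torchaudio's resampler here is only an example.

```python
import torch.nn as nn
import torchaudio.functional as AF

class CameraResponse(nn.Module):
    """Differentiable model of Eq. (7): low-pass filter and resample to the camera frame rate."""

    def __init__(self, fs_in, fs_video=30):
        super().__init__()
        self.fs_in, self.fs_video = int(fs_in), int(fs_video)

    def forward(self, x):
        # x: (batch, channels, time) sampled at fs_in -> (batch, channels, time') at fs_video
        return AF.resample(x, self.fs_in, self.fs_video)
```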

4.2 Network architecture

As shown in Fig. 3, there are two subnetworks in the proposed framework: an encoding network that transforms a sound signal into a signal transmitted by LED light and a scene analysis network that performs acoustic scene analysis. For the training of these networks, the light propagation and camera response layers are located between these two networks.

For the encoding network, we employed a 1D convolutional neural network (CNN) [29]; note that we did not consider the hardware limitations of Blinkies in this paper. A 1D CNN is also adopted in Wave-U-Net, which transforms acoustic signals into other signals, and its effectiveness has been confirmed [30]. In the encoding network, we downsampled microphone signals using six 1D strided convolution layers with a kernel size of 3, a stride of 2, and a padding of 1. In addition, two 1D convolution layers with a kernel size of 3 and a padding of 1 were inserted before each strided convolutional layer.

We adopted a simple VGG-like architecture with 1D convolution layers for the scene analysis network [31]. Similarly to the encoding network, downsampling layers in this network are replaced with 1D strided convolution layers with a kernel size of 3, a stride of 2, and a padding of 1. The depth of the network is 4, and the resulting feature map is transformed by a global average pooling layer into a 1D vector. The vector is fed into a linear layer to obtain the final scene analysis results.
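The two subnetworks described above could be sketched as follows; the channel widths, ReLU activations, the sigmoid output of the encoder, and the exact per-stage composition of the VGG-like classifier are not specified in the paper and are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    """Two plain conv layers followed by a strided conv that halves the time resolution."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(out_ch, out_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    )

class EncodingNetwork(nn.Module):
    """Sound-to-light encoder: six strided-conv stages (total downsampling factor of 64)."""

    def __init__(self, channels=(16, 16, 32, 32, 64, 64)):
        super().__init__()
        blocks, in_ch = [], 1
        for ch in channels:
            blocks.append(down_block(in_ch, ch))
            in_ch = ch
        self.body = nn.Sequential(*blocks)
        self.head = nn.Conv1d(in_ch, 1, kernel_size=1)  # one LED-intensity signal per Blinky
        self.out = nn.Sigmoid()                         # keep intensities in [0, 1] (assumption)

    def forward(self, x):  # x: (batch, 1, time)
        return self.out(self.head(self.body(x)))

class SceneAnalysisNetwork(nn.Module):
    """VGG-like 1D classifier of depth 4 with global average pooling and a linear head."""

    def __init__(self, n_blinkies=7, n_classes=10, channels=(64, 128, 256, 512)):
        super().__init__()
        blocks, in_ch = [], n_blinkies
        for ch in channels:
            blocks.append(down_block(in_ch, ch))
            in_ch = ch
        self.body = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool1d(1)  # global average pooling over time
        self.fc = nn.Linear(in_ch, n_classes)

    def forward(self, x):  # x: (batch, n_blinkies, video_frames)
        h = self.pool(self.body(x)).squeeze(-1)
        return self.fc(h)
```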

Considering that the signals captured by a camera have only a limited number of samples in the time direction (e.g., 300 points for a 10-s audio clip in our experiments), we considered the CNN architecture to be sufficient for the task, although transformer-based networks and recurrent neural networks are effective for handling long time-series signals.

5 Simulations

We evaluated the effectiveness of the proposed framework by acoustic scene classification experiments using the SINS database  [11]. Ideally, a dataset for acoustic scene classification using Blinkies should be constructed. However, for comparative evaluation, it is necessary to collect a large amount of sensor data from both Blinkies and distributed microphone arrays using the same sound source signals, which requires considerable effort. For this reason, in this study, publicly available datasets were used to conduct comparative evaluation experiments simulating acoustic sensing with Blinkies.

5.1 Simulation conditions

The SINS database contains a continuous recording of one person living in a vacation home for one week. Figure 4 shows the room layout of the vacation home and the arrangement of the 13 sensor nodes used to record acoustic signals. Each sensor node had a 4-channel linear microphone array with an inter-microphone distance of 5 cm, where the microphones in a single sensor node were synchronized. The sampling for each audio channel was conducted sequentially at a rate of 16 kHz with a bit depth of 12. A recorded signal was stored with an internal counter value of an analog-to-digital converter. The value was reset every second by a GPS/clock module. In total, 16 different human activities (i.e., acoustic scenes) were annotated in five different rooms.

Fig. 4  Arrangement of microphone arrays. Blue circles indicate the nodes used in our simulation. We added blue circles and the camera to the original figure in [11]

In this simulation, we regard each sensor node in the SINS database as a Blinky and assume that light signals from Blinkies are captured by an omnidirectional (360-degree) camera located and fixed at the center of the living room, as shown in Fig. 4. Under this assumption, we focus on the living room and use the observations from nodes 1 to 8, except for node 5, because the camera might not be able to capture sensor nodes in the other rooms. We exclude node 5 because the node was crushed when the SINS database was constructed. By using the observations from the seven nodes, we classified 10 acoustic scenes (“absence,” “call,” “cooking,” “dishwashing,” “eating,” “others,” “vacuum cleaning,” “visit,” “watching TV,” and “working”). We synchronized the acoustic signals obtained from all sensor nodes and segmented the signals into sound clips by using a program in the official GitHub repository (see Notes). The length and sampling frequency of the sound clips were unified to \(10~\textrm{s}\) and \(16~\textrm{kHz}\), respectively. Each sound clip was labeled with one acoustic scene. The number of audio clips utilized for our experiments was 51,831, amounting to approximately 143.9 h of signals.

The dataset was divided into three subsets: training, validation, and testing. Since the SINS database comprises continuous recordings, it is inappropriate to randomly assign 10-s sound clips, segmented from these recordings, to the three subsets, as very similar clips could end up in different subsets. To prevent such inappropriate data partitioning, we divided the data based on a session. A session is a continuous acoustic scene (a time interval with the same label) in a recording. We randomly assigned each session to one of the training, validation, and test subsets with probabilities of 0.6, 0.2, and 0.2, respectively. After this assignment, sessions were segmented into 10-s sound clips.
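A minimal sketch of this session-based partitioning, assuming `sessions` is a list of (label, signal) pairs already extracted from the continuous recordings (for brevity, one array per session; in practice each session holds the signals of all used nodes):

```python
import random

def split_sessions(sessions, fs=16000, clip_sec=10, seed=0):
    """Assign whole sessions to train/val/test with probabilities 0.6/0.2/0.2, then cut 10-s clips."""
    rng = random.Random(seed)
    subsets = {"train": [], "val": [], "test": []}
    clip_len = fs * clip_sec
    for label, signal in sessions:  # one session = one continuous acoustic scene
        subset = rng.choices(["train", "val", "test"], weights=[0.6, 0.2, 0.2])[0]
        for start in range(0, len(signal) - clip_len + 1, clip_len):
            subsets[subset].append((label, signal[start:start + clip_len]))
    return subsets
```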

We prepared seven encoding networks and fed clips recorded by nodes 1, 2, 3, 4, 6, 7, and 8 into the networks. Signals transformed by the networks and propagated through the differentiable physical layers were concatenated and fed into the scene analysis network, where we assumed the relative distances between the camera and nodes 1, 2, 3, 4, 6, 7, and 8 to be 1.25, 1.13, 1, 1.62, 0.66, 0.66, and 1.16, respectively, normalized by the distance between the camera and node 3. These networks were trained for 200 epochs using the training subset and the well-known cross-entropy loss. Here, the Adam optimizer [32] was utilized for optimization, where the parameters in Adam were set as \(\alpha =0.0005, \beta _1=0.9\), and \(\beta _2=0.999\). The learning rate \(\alpha\) was multiplied by 1/10 when the number of epochs reached 100 and 150. The method by He et al. [33] was used for initializing the networks. The performance of ASC was evaluated in terms of the micro- and macro-averages of accuracy. The micro-average of accuracy \(A_{\textrm{micro}}\) is defined as

$$\begin{aligned} A_{\textrm{micro}} = \frac{\sum _{{i=1}}^{N_{\textrm{class}}} C_{i}}{\sum _{{i=1}}^{N_{\textrm{class}}} T_{i}}, \end{aligned}$$
(8)

where \(C_i\) and \(T_i\) indicate the number of correct predictions and the total number of samples for the ith class, respectively. Moreover, the macro-average of accuracy \(A_{\textrm{macro}}\) is obtained by calculating

$$\begin{aligned} A_{\textrm{macro}} = \frac{1}{N_{\textrm{class}}} \sum \limits _{{i=1}}^{N_{\textrm{class}}} \frac{C_{i}}{T_{i}}. \end{aligned}$$
(9)

The validation subset was used to check for overfitting of the networks, and we selected the model parameters that gave the highest accuracy on the validation subset during training as the final model. We trained our framework 10 times and evaluated the mean and standard deviation of the accuracy.
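Eqs. (8) and (9) reduce to the following computation on per-class counts; a small NumPy sketch assuming integer class labels from 0 to N_class − 1.

```python
import numpy as np

def micro_macro_accuracy(y_true, y_pred, n_classes=10):
    """Micro-average (Eq. 8) and macro-average (Eq. 9) of classification accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correct = np.array([np.sum((y_true == c) & (y_pred == c)) for c in range(n_classes)])  # C_i
    total = np.array([np.sum(y_true == c) for c in range(n_classes)])                      # T_i
    micro = correct.sum() / total.sum()
    macro = np.mean(correct / np.maximum(total, 1))  # guard against classes absent from y_true
    return micro, macro
```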

5.2 Results

5.2.1 Comparison with conventional frameworks

We compared the following five frameworks:

  (a) The CNN of Inoue et al. [34] (hereafter Inoue’s CNN) with log-mel spectrogram calculation and lossless transmission (Raw signal/Log-mel energy + Inoue’s CNN in Table 1),

  (b) VGG 2D with log-mel spectrogram calculation and lossless transmission (Raw signal/Log-mel energy + VGG 2D in Table 1),

  (c) VGG 1D without any preprocessing and with lossless transmission (Raw signal/VGG 1D in Table 1),

  (d) Blinky’s power calculation in Sect. 3.1.1 + physical layers + VGG 1D (Power/VGG 1D in Table 1),

  (e) The proposed end-to-end framework, i.e., the CNN-based encoding network + physical layers + VGG 1D (CNN/VGG 1D in Table 1).

Note that frameworks (a), (b), and (c) require a wide bandwidth for transmitting raw signals, but frameworks (d) and (e) do not.

Table 1 shows the classification accuracy for the test subset of the five frameworks. From the table, we can confirm that the proposed framework can achieve a higher accuracy than a non-end-to-end framework considering the same physical phenomena (i.e., Power/VGG 1D). In addition, the accuracy of the proposed method was comparable to that of Raw/VGG 1D and Raw/Log-mel energy + VGG 2D, whereas the use of the typical DNN-based approach for acoustic scene classification, i.e., Raw/Log-mel energy + Inoue’s CNN, provided the highest accuracy among the five frameworks. Note that all of Raw/VGG 1D, Raw/Log-mel energy + VGG 2D, and Raw/Log-mel energy + Inoue’s CNN assume an unrealistic situation, where distributed microphone arrays are synchronized and high-capacity lossless transmission is possible. Hence, this result suggests the suitability of the proposed end-to-end framework with Blinkies for acoustic scene analysis in practical situations.

Table 1 Transmission bandwidth and micro-/macro-averages of accuracy for each framework. Raw/Log-mel energy + Inoue’s CNN, Raw/Log-mel energy + VGG 2D, and Raw/VGG 1D require a wide bandwidth for transmitting raw signals, but conventional and proposed frameworks do not

More detailed classification results are shown in Fig. 5 as confusion matrices, where each element represents the number of sound clips whose true label is shown on the vertical axis and the predicted label is shown on the horizontal axis. By comparing Fig. 5c and d, we can see that the number of misclassifications increased using Power/VGG 1D due to the signal propagation process described in Sect. 3.1, i.e., the sound-to-light conversion, the light-signal transmission, and the capture of the light-signal by the camera. The proposed end-to-end framework can prevent this performance degradation, as shown in Fig. 5e. This figure illustrates that all methods, particularly the proposed framework, frequently misclassified sound clips from the working class as those of the absence class. The data in the working class primarily consists of sounds such as typing on a keyboard or clicking a mouse, lacks other distinctive sounds, and contains many silent intervals. On the other hand, the data in the absence class is mainly composed of silent intervals during which the person is away from the vacation home. This is believed to be the reason for the frequent misclassification of sound clips from the working class as those of the absence class.

Fig. 5  Confusion matrices for acoustic scene classification. Class labels are (1) absence, (2) calling, (3) cooking, (4) dishwashing, (5) eating, (6) other, (7) vacuum cleaner, (8) visit, (9) watching TV, and (10) working

Figure 6 shows examples of feature maps, i.e., outputs from the camera response layers, obtained by Power/VGG 1D and CNN/VGG 1D, where Fig. 6a shows feature maps for a sound clip labeled “vacuum cleaner” and Fig. 6b shows those for a sound clip labeled “watching TV.” As shown in this figure, the feature maps obtained by the proposed framework were different from the sound power obtained by the conventional framework. In the case of the vacuum cleaner, its sound was heard continuously from approximately 3 s onward. In the conventional framework, the signal at each node took larger values than before from that point in time. The proposed framework showed similar trends for nodes 1, 2, 7, and 8; however, for the other nodes, no significant difference from the previous signal values was observed. In the case of watching TV, there were intermittent footsteps and artificial sound effects from the television occurring approximately from 4 to 7 s. Within the proposed framework, node 4 appears to emit a strong signal in response to the footsteps, whereas node 7 seems to respond to the sound effects. These observations suggest that the proposed framework trains the encoders in such a way that separate nodes respond to distinctive sounds that are meaningful for acoustic scene classification.

Fig. 6  Examples of feature maps (outputs from camera response layer) for \(10~\textrm{s}\) sound clips. The top of each subfigure shows the feature map of Power/VGG 1D, and the bottom of each subfigure shows the feature map of CNN/VGG 1D. The horizontal axis shows the time t with a video frame rate of \(30~\textrm{Hz}\)

5.2.2 Robustness against occlusion

When sensing is performed using a camera, signal loss due to objects or people passing between the camera and a Blinky, that is, occlusion, becomes a problem. We conducted an experiment to evaluate whether the proposed framework is effective even in situations where occlusion occurs. In the experiment, after sound-to-light conversion by the encoding network, occlusion was independently generated for each node with a 20% probability. The occlusion duration was set from 0.5 to 5 s within a 10-s sound clip, determined by a uniformly distributed random number. The signal value in the interval where occlusion occurred was set to a constant value, which was randomly determined according to a normal distribution with a mean of 0.5 and a standard deviation of 0.25. Other experimental conditions were the same as those in the previous section. We also considered occlusion during both training and testing of the DNNs. Training DNNs with occlusion was conducted by applying the simulated occlusion process to the output signals of the camera response layer. In this training, the other training conditions were the same as those without considering occlusion (see Sect. 5.1).
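The occlusion simulation can be sketched as the following augmentation applied independently to each node's captured signal; the occlusion probability, duration range, and the normal distribution of the constant value follow the text, while the uniform choice of the start time and the clipping of the value to [0, 1] are our assumptions.

```python
import numpy as np

def simulate_occlusion(p, fs_video=30, prob=0.2, dur_range=(0.5, 5.0), rng=None):
    """Randomly replace an interval of a captured node signal p with a constant value."""
    rng = rng or np.random.default_rng()
    p = p.copy()
    if rng.random() < prob:                               # occlusion occurs with 20% probability
        dur = rng.uniform(*dur_range)                     # occlusion duration in seconds
        length = int(dur * fs_video)
        start = int(rng.integers(0, max(1, len(p) - length)))
        value = float(np.clip(rng.normal(0.5, 0.25), 0.0, 1.0))  # constant occluded value (clipping is an assumption)
        p[start:start + length] = value
    return p
```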

Under occlusion, we evaluated the acoustic scene classification accuracy of the conventional framework using Blinkies, Power/VGG 1D, and the proposed framework, CNN/VGG 1D, as shown in Table 2. Here, “occlusion during training” indicates whether occlusion was considered during the training of the DNNs. From Table 2, it can be seen that the proposed framework performs acoustic scene classification with higher accuracy than the conventional framework, regardless of whether occlusion was considered during training. By comparing Tables 1 and 2, we find that the conventional framework trained without considering occlusion shows an accuracy decrease of about 7 points due to occlusion at evaluation time, which is smaller than the approximately 11-point decrease of the proposed framework. The reason for the smaller decrease in accuracy of Power/VGG 1D is that it performs the same sound-to-light conversion at each node, resulting in similar information being transmitted from multiple nodes, as shown in Fig. 6. Although the proposed framework trained without considering occlusion shows a larger decrease in accuracy at evaluation time, this decrease can be mitigated by training with occlusion.

Table 2 Acoustic scene classification accuracy under occlusion conditions for conventional and proposed frameworks. Accuracy degradation owing to occlusion can be mitigated by training with occlusion

5.2.3 Classification accuracy vs number of Blinkies

Finally, we investigated the impact of the number of Blinkies on scene classification accuracy. In this experiment, we varied the number of Blinkies to 1, 3, 5, and 7, and conducted training and evaluation under each condition. To capture the information of sounds occurring in various locations, we selected Blinkies that were placed far from each other. Specifically, the following conditions were considered:

  • Node 3

  • Nodes 1, 3, and 8

  • Nodes 1, 3, 4, 6, and 8

  • Nodes 1, 2, 3, 4, 6, 7, and 8

Other experimental conditions were the same as those in Sect. 5.2.1.

The results of the evaluation are shown in Fig. 7. From the figure, it can be seen that in the conventional framework, the classification accuracy increases as the number of Blinkies used increases. It is also observed that the gain in accuracy becomes smaller when more than three Blinkies are used.

It is also apparent that increasing the number of Blinkies to three is effective in the proposed framework. However, with more than three Blinkies, the accuracy is almost equivalent to that with three Blinkies, and it was confirmed that using five Blinkies slightly improves the accuracy compared with using all seven Blinkies. The five- and seven-Blinky configurations differ in nodes 2 and 7. Since multiple Blinkies, nodes 1, 3, 6, and 8, were located around nodes 2 and 7, it is plausible that they were capable of capturing sounds occurring in the center of the living room or around the desk. Because the encoding networks successfully encoded this information, the incorporation of nodes 2 and 7 may not have provided novel information. Consequently, it can be said that the proposed framework not only performs acoustic scene classification with higher accuracy than the conventional framework but also conducts it effectively with fewer Blinkies. However, using a larger number of Blinkies seems useful for ensuring robustness against occlusions, as shown in Sect. 5.2.2.

Fig. 7  Effect of number of Blinkies on acoustic scene classification accuracy

6 Discussion

The simulation experiments in Sect. 5 show that the proposed framework achieves our objective in this study, namely, learning an optimal sound-to-light conversion process in Blinkies that provides appropriate and useful features for ASC even under the physical constraints of light-signal attenuation, noise, occlusion, and strong bandwidth limitations.

However, to implement the encoder in a Blinky and utilize the proposed ASC framework in a real environment, there are still challenges that need to be addressed.

6.1 Computational resource constraints

A Blinky is a low-cost, small device with limited computational resources. In particular, its memory consists of 520 kB of SRAM and 4 MB of flash memory [10]. In this study, we focused on the question of whether an optimal sound-to-light conversion can be achieved by training the encoder in an end-to-end manner; the computational resources required for implementing the encoder were not particularly considered. Therefore, to implement the encoder in a Blinky, it is necessary to use a low-complexity encoder with a small number of parameters. For its development, techniques such as quantization-aware training and knowledge distillation mentioned in Sect. 2.1 are considered effective.

6.2 Physical model identification

The propagation process of a light signal between a Blinky and a camera changes depending on the environment in which the devices are installed. That is, hyperparameters such as a and b in Sect. 4.1.1 should be set according to the installation environment. For the identification of these hyperparameters, the calibration techniques discussed in [35] can be used. Alternatively, treating the physical model as a black box and using black-box optimization techniques for end-to-end learning is also a possible direction.

6.3 Online learning of the encoder in real environments

To implement environment-specific encoders in Blinkies, besides the method of identifying the physical model and training the encoder in simulation before implementing it in a Blinky, another approach is to install an untrained encoder in the Blinky and train it online. For training the encoder, the gradient of the loss with respect to the encoder’s parameters is required. To realize online learning, it is necessary to backpropagate the loss calculated on the fusion center to Blinkies. If this backpropagation can be achieved, acoustic sensing using Blinkies will have greater convenience.

To analyze these practical challenges of the framework using Blinkies towards real-world implementation, it is essential to have a dataset that includes audio signals captured with actual Blinkies and the corresponding light signals from the Blinkies. Our next step towards implementation in real-world environments would be to extract acoustic signals recorded by Blinkies, store them on an external computer, and then use them to train an encoder network that can operate within the constraints of Blinky’s computational resources. This will help clarify the performance of the framework in situations closer to real-world conditions and allow the trained encoding network to be implemented on a Blinky. Through this process, we can identify its utility and challenges, enabling a more concrete evaluation of the necessity and feasibility of physical model identification and online learning.

Additionally, when using Blinkies in real-world environments, the emitted light might be bothersome. This issue can be resolved by using infrared LEDs or other invisible light sources for communication since our proposed framework uses only the intensity information of LEDs.

7 Conclusion

In this paper, we proposed an end-to-end acoustic scene analysis framework considering the physical signal propagation process between Blinkies and a camera. In the proposed framework, the use of differentiable physical layers that model physical phenomena as differentiable equations enables us to consider physical constraints in DNNs. As a result, we can train DNNs with an end-to-end approach and obtain appropriate signal transformations in a data-driven manner. Experimental results on the SINS database showed that the proposed framework provided a higher accuracy than the previous framework with Blinkies. The accuracy of the proposed method was comparable to that of a framework that does not consider physical constraints.

This result indicates that the end-to-end training enables a more effective encoding in sound-to-light conversion and estimation of the acoustic scene, even with the limitation of camera frame bandwidth. In our future work, we will consider more practical conditions, e.g., the occlusion of Blinky signals and the hardware limitation of Blinkies. We will also collect data using Blinkies in real environments and conduct experiments for acoustic scene analysis.

Availability of data and materials

The SINS database used in this study is freely available in https://github.com/KULeuvenADVISE/SINS_database, under the terms of the MIT license.

Notes

  1. https://github.com/KULeuvenADVISE/SINS_database

Abbreviations

ASC: Acoustic scene classification
CNN: Convolutional neural network
CRF: Camera response function
DCASE: Detection and classification of acoustic scenes and events
DNN: Deep neural network
fps: Frames per second
LED: Light-emitting diode
MFCC: Mel-frequency cepstral coefficient

References

  1. A. Temko, R. Malkin, C. Zieger, D. Macho, C. Nadeu, M. Omologo, in Multimodal Technologies for Perception of Humans (Springer Berlin Heidelberg, Berlin, Heidelberg, 2007), pp. 311–322. https://doi.org/10.1007/978-3-540-69568-4_29

  2. D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, M.D. Plumbley, in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Detection and classification of acoustic scenes and events: an IEEE AASP challenge (New Paltz, 2013), https://doi.org/10.1109/WASPAA.2013.6701819

  3. D. Barchiesi, D. Giannoulis, D. Stowell, M.D. Plumbley, Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Proc Mag. 32(3), 16–34 (2015). https://doi.org/10.1109/MSP.2014.2326181


  4. P. Giannoulis, A. Brutti, M. Matassoni, A. Abad, A. Katsamanis, M. Matos, G. Potamianos, P. Maragos, in Proceedings of European Signal Processing Conference, Multi-room speech activity detection using a distributed microphone network in domestic environments (Nice, 2015), https://doi.org/10.1109/EUSIPCO.2015.7362588

  5. J. Kürby, R. Grzeszick, A. Plinge, G.A. Fink, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop, Bag-of-features acoustic event detection for sensor networks (Budapest, 2016)

  6. D. Cherkassky, S. Gannot, Blind synchronization in wireless acoustic sensor networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 651–661 (2017). https://doi.org/10.1109/TASLP.2017.2655259


  7. R. Scheibler, D. Horiike, N. Ono, in Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Blinkies: sound-to-light conversion sensors and their application to speech enhancement and sound source localization (Honolulu, 2018), https://doi.org/10.23919/APSIPA.2018.8659793

  8. R. Scheibler, N. Ono, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Multi-modal blind source separation with microphones and blinkies (Brighton, 2019), https://doi.org/10.1109/ICASSP.2019.8682594

  9. D. Horiike, R. Scheibler, Y. Wakabayashi, N. Ono, in Proceedings of IEEE 21st International Workshop on Multimedia Signal Processing, Blink-former: light-aided beamforming for multiple targets enhancement (Kuala Lumpur, 2019), https://doi.org/10.1109/MMSP.2019.8901799

  10. R. Scheibler, N. Ono, Blinkies: open source sound-to-light conversion sensors for large-scale acoustic sensing and applications. IEEE Access. 67603–67616 (2020). https://doi.org/10.1109/ACCESS.2020.2985281

  11. G. Dekkers, S. Lauwereins, B. Thoen, M.W. Adhana, H. Brouckxon, T. van Waterschoot, B. Vanrumste, M. Verhelst, P. Karsmakers, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, The SINS database for detection of daily activities in a home environment using an acoustic sensor network (Munich, 2017)

  12. Y. Kinoshita, N. Ono, in Proceedings of European Signal Processing Conference, End-to-end training for acoustic scene analysis with distributed sound-to-light conversion devices (Online, 2021)

  13. S. Mishima, Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, Investigations on raw features in deep neural network for indoor-environmental sound classification. INTER-NOISE and NOISE-CON Congress and Conference Proceedings 255(4), 3250–3257 (2017)


  14. Y. Tokozume, T. Harada, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Learning environmental sounds with end-to-end convolutional neural network (New Orleans, 2017). https://doi.org/10.1109/ICASSP.2017.7952651

  15. Y. Cai, M. Lin, C. Zhu, S. Li, X. Shao, Device simulation and time-frequency separable convolution for acoustic scene classification. Tech. rep., DCASE2023 Challenge (2023)

  16. F. Schmid, T. Morocutti, S. Masoudian, K. Koutini, G. Widmer, Efficient acoustic scene classification with cp-mobile. Tech. rep., DCASE2023 Challenge (2023)

  17. J. Tan, Y. Li, Low-complexity acoustic scene classification using blueprint separable convolution and knowledge distillation. Tech. rep., DCASE2023 Challenge (2023)

  18. B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Quantization and training of neural networks for efficient integer-arithmetic-only inference (2018), pp. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286

  19. G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network (2015), arXiv:1503.02531

  20. M.C. Green, D. Murphy, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, Acoustic scene classification using spatial features (Munich, 2017)

  21. S.K. Zieliński, H. Lee, in Proceedings of 2018 Federated Conference on Computer Science and Information Systems, Feature extraction of binaural recordings for acoustic scene classification (Poznań, 2018)

  22. B. Ding, T. Zhang, G. Liu, L. Kong, Y. Geng, Late fusion for acoustic scene classification using swarm intelligence. Appl Acoust 192, 108698 (2022). https://doi.org/10.1016/j.apacoust.2022.108698


  23. Y. Kaneko, T. Yamada, S. Makino, Monitoring of domestic activities using multiple beamformers and attention mechanism. J. Signal Process. 25(6), 239–243 (2021). https://doi.org/10.2299/jsp.25.239


  24. K. Imoto, N. Ono, in Proceedings of 25th European Signal Processing Conference, Acoustic scene classification based on generative model of acoustic spatial words for distributed microphone array (Kos island, 2017), https://doi.org/10.23919/EUSIPCO.2017.8081616

  25. K. Imoto, in Proceedings of European Signal Processing Conference, Acoustic scene classification using multichannel observation with partially missing channels (Online, 2021)

  26. K. Imoto, N. Ono, Spatial cepstrum as a spatial feature using a distributed microphone array for acoustic scene analysis. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1335–1343 (2017). https://doi.org/10.1109/TASLP.2017.2690559


  27. T. Kawamura, Y. Kinoshita, N. Ono, R. Scheibler, in Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, Effectiveness of inter- and intra-subarray spatial features for acoustic scene classification (Rhodes Island, 2023), https://doi.org/10.1109/ICASSP49357.2023.10096935

  28. H. Phan, M. Maass, L. Hertel, R. Mazur, A. Mertins, in Proceedings of 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, A multi-channel fusion framework for audio event detection (New Paltz, 2015), https://doi.org/10.1109/WASPAA.2015.7336889

  29. O. Ronneberger, P. Fischer, T. Brox, in Medical Image Computing and Computer-Assisted Intervention, U-net: convolutional networks for biomedical image segmentation, LNCS, vol. 9351 (Springer, 2015), pp. 234–241

  30. D. Stoller, S. Ewert, S. Dixon, in Proceedings of International Society for Music Information Retrieval Conference, Wave-U-net: a multi-scale neural network for end-to-end audio source separation (2018), arXiv:1806.03185

  31. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition (2014), arXiv:1409.1556

  32. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv:1412.6980

  33. K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of IEEE International Conference on Computer Vision, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification (Santiago, 2015), https://doi.org/10.1109/ICCV.2015.123

  34. T. Inoue, P. Vinayavekhin, S. Wang, D. Wood, N. Greco, R. Tachibana, Domestic activities classification based on CNN using shuffling and mixing data augmentation. Tech. rep., DCASE2018 Challenge (2018)

  35. K. Nishida, N. Ueno, Y. Kinoshita, N. Ono, in Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Estimation of transfer coefficients and signals of sound-to-light conversion device blinky under saturation (Chiang Mai, 2022), https://doi.org/10.23919/APSIPAASC55919.2022.9980090


Acknowledgements

This work was supported by JST SICORP Grant Number JPMJSC2306 and JSPS KAKENHI Grant Number JP22K17915.

Funding

Funded by JST SICORP Grant Number JPMJSC2306 and JSPS KAKENHI Grant Number JP22K17915.

Author information

Contributions

YK designed and formalized the study and conducted the experiments. NO supervised the study. All authors read and approved the manuscript.

Corresponding author

Correspondence to Yuma Kinoshita.

Ethics declarations

Competing interests

NO is a guest editor of the Collection "Advanced Signal Processing and Machine Learning for Acoustic Scene Analysis and Signal Enhancement" in EURASIP Journal on Audio, Speech, and Music Processing.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


Cite this article

Kinoshita, Y., Ono, N. End-to-end training of acoustic scene classification using distributed sound-to-light conversion devices: verification through simulation experiments. J AUDIO SPEECH MUSIC PROC. 2024, 46 (2024). https://doi.org/10.1186/s13636-024-00369-z
