Transformer-based autoencoder with ID constraint for unsupervised anomalous sound detection

Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detecting anomalies from sound. The self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the difference in the distribution for the same machine type and enhance the ability of the model in distinguishing anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that only appear for a short time. Experiments performed on DCASE 2020 Challenge Task2 development dataset demonstrate the effectiveness and superiority of our proposed method.


Introduction
Anomalous sound detection (ASD) aims to detect anomalies from acoustic signals. Since anomalous sounds can indicate system error or malicious activities, ASD has received much attention [1][2][3][4][5] and has been widely used in various applications, such as road surveillance [6,7], animal disease detection [8], and industrial equipment predictive maintenance [9]. Recently, ASD has also been used to monitor the abnormality of industrial machinery equipment, such as anomaly detection for surface-mounted device machines [10,11], and in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge Task2 from 2020 to 2023 [12][13][14][15], to reduce the loss caused by machine damage and the cost of manual inspection.
Supervised learning based methods usually train a binary classifier to detect the anomaly [7,16]. However, it is hard to collect enough anomalous data for supervised learning, as actual anomalous sounds rarely occur in real scenarios. In addition, the high diversity of the anomalies can reduce the robustness of supervised methods. Therefore, unsupervised methods are often employed to detect unknown anomalous sounds without using anomalous sound samples.
In unsupervised ASD, one approach is to employ the autoencoder (AE) to learn the distributions of sound signals and perform anomaly detection. Conventional AE-based approaches adopt an autoencoder to reconstruct multiple frames of a spectrogram to learn the distribution of normal sounds, and then the reconstruction error is used to obtain the anomaly score for anomaly detection [10,12,17,18,19]. However, the conventional AE-based methods do not work well for non-stationary ASD [20], as non-stationary normal sounds (e.g., sound signals of valves) can easily have larger reconstruction errors than abnormal sounds, thus deteriorating the detection performance. In [20], an interpolation deep neural network (IDNN) method is proposed, which masks the center frame of the input and only uses the reconstruction error of the masked center frame to improve non-stationary sound reconstruction, without considering the edge frames. The method in [21] adopts a similar strategy to IDNN: it applies a local area mask on the input and employs an attentive neural process (ANP) [22] for the reconstruction of the masked input.
Instead of reconstructing the spectrogram feature, the method in [23] mixes multiple features as the input and adopts a fully connected U-Net for the mixed feature reconstruction. To utilize the intra-frame statistics of the sound signal, a novel group masked autoencoder for distribution estimation (Group MADE) is proposed for unsupervised ASD [24,25], which estimates the density of an audio time series and achieves better performance. However, the distributions of normal audio clips from different machines are different even for the same sound class. This difference can be even greater than that between normal and anomalous sound, which makes it harder to distinguish normal and anomalous sounds for these purely AE-based methods, as the learned feature from these normal sounds may also fit the anomalous sounds [26].
Machine identity (ID) has been used as an additional condition for encoding in the latent feature space of the AE, in order to allow the decoder to provide different reconstructions for each machine [27,28]. However, the encoder is unable to learn the difference in distributions for different machines, and as a result, the anomalous sound may be well reconstructed. For this reason, it could still be difficult to distinguish normal and anomalous sound. In addition, the abovementioned AE-based methods often use an averaged anomaly score for detection, which does not take into account that anomalies may be short-lived. This results in low anomaly scores for anomalous events that appear only for a short time, which makes detection even more challenging for the AE-based methods.
Therefore, instead of reconstructing normal sounds to learn the feature representation, self-supervised methods have been presented to learn the feature representation by utilizing the difference in distributions among different machines [29][30][31][32][33][34][35][36]. The study in [29] uses machine type and machine ID in addition to the machine condition (normal/abnormal) as training labels for self-supervised classification. The flow-based self-supervised method [37] adopts normalizing flow (NF) [38,39] models, such as generative flow (Glow) [40] and masked autoregressive flow (MAF) [41], to obtain the likelihood estimation for anomaly detection. In this method, an auxiliary task is introduced to distinguish the sound data of one machine ID (i.e., target data) from the sound data of other machine IDs with the same machine type (i.e., outlier data). However, although the self-supervised learning-based methods can achieve better performance than the AE-based methods, they are not always stable and could perform differently even for machines of the same type.
In this paper, we present an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. Our method includes two stages, namely, spectrogram reconstruction and anomaly detection. First, an IDC-TransAE is introduced to reconstruct the spectrogram of normal sounds, where Transformer [42] is employed to build the AE architecture, and a simple ID classifier is incorporated into the AE. Specifically, the Transformer captures the time-dependent information of the sound signal, and the classifier utilizes machine ID to constrain the latent space of the AE, so that our proposed IDC-TransAE can learn different distributions of normal machines, even with the same type. In the proposed IDC-TransAE architecture, instead of using the positional encoding (PE) for Transformer to provide additional temporal information, a linear phase embedding (LPE) method is proposed to represent the temporal information of the sound signal by using its phase information, which can further enhance the classification performance of the proposed IDC-TransAE. In addition, center frame prediction (CFP) is also employed in our IDC-TransAE to improve the ASD ability for non-stationary signals (e.g., Valve). Then, the reconstruction error from the trained IDC-TransAE can be used to calculate the anomaly score to detect the anomaly. Here, we introduce a weighted anomaly score computation method via global weighted ranking pooling (GWRP) [43], which can highlight the anomaly scores for the anomalous events that only appear for a short time. Finally, we combine the classification anomaly score and the weighted reconstruction anomaly score into the final anomaly score, to obtain more stable and consistent detection performance.
The innovations and contributions of this paper for unsupervised anomalous sound detection can be summarized as follows:

Preliminary
The AE-based methods are widely used for unsupervised ASD [10,12,17,18]. An AE model is trained with normal sounds to learn their feature distribution. It implicitly assumes that it can reconstruct normal sounds better than anomalous sounds, so that anomalous sounds often have larger reconstruction errors than normal sounds. The reconstruction error is then used for deriving the anomaly scores for anomaly detection. Figure 1 shows the AE architecture for unsupervised ASD.
Regarding model training, multiple frames of a spectrogram are usually used as the input, and the same number of frames are generated as the output. Suppose $X \in \mathbb{R}^{N \times M}$ is the log-Mel spectrogram of the sound signal, where N is the number of frames and M is the feature dimension of each frame of X. The loss for the AE model training (Eq. (1)) is the reconstruction error between X and its reconstruction $D(E(X))$, where E(·) and D(·) are the encoder and the decoder of the AE, respectively.
Then, the trained AE model can be used to detect the anomaly. Let Y be the test audio clip, split into I segments, where $Y_i \in \mathbb{R}^{N \times M}$ is the ith segment and also the ith input of the model. The reconstruction error $e_i$ for $Y_i$ is computed between $Y_i$ and its reconstruction $D(E(Y_i))$ (Eq. (2)), and the errors of all segments form the sequence $e = [e_1, \dots, e_I]$. The anomaly score of the clip is the mean reconstruction error

$$A(e)_{\mathrm{mean}} = \frac{1}{I} \sum_{i=1}^{I} e_i, \qquad (3)$$

and the clip is judged anomalous when its anomaly score exceeds θ (Eq. (4)), where θ is a pre-defined threshold value to determine whether an audio clip is anomalous.
However, for non-stationary sounds, the AE-based methods tend to give large reconstruction errors for both normal and abnormal sounds; this is because the edge frames of non-stationary sound are hard to reconstruct. In [20], IDNN is proposed for non-stationary sound ASD, which removes the center frame of the multiple frames as the input, and predicts the removed frame as the output, as shown in Fig. 2. The input frames of IDNN can be expressed as $[x_1, \dots, x_{\frac{N+1}{2}-1}, x_{\frac{N+1}{2}+1}, \dots, x_N]$ (Eq. (5)), where $x_{\frac{N+1}{2}}$ is the removed center frame of the original input frames. Unlike conventional AE-based methods, the reconstruction error $e_i$ of the ith input is only calculated on the center frame.
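The generic AE detection procedure described above (segmenting a clip, computing one reconstruction error per segment, averaging, and thresholding) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `reconstruct` is a hypothetical stand-in for the trained $D(E(\cdot))$, the mean squared error is an assumed choice of reconstruction error, and the segment stride of one frame is also an assumption.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def split_segments(clip: Sequence[Frame], n: int) -> List[List[Frame]]:
    """Split a clip (a list of frames) into overlapping n-frame segments (stride 1, assumed)."""
    return [list(clip[i:i + n]) for i in range(len(clip) - n + 1)]


def segment_error(segment: List[Frame],
                  reconstruct: Callable[[List[Frame]], List[Frame]]) -> float:
    """Mean squared error between a segment and its reconstruction (assumed error measure)."""
    recon = reconstruct(segment)
    diffs = [(a - b) ** 2 for fa, fb in zip(segment, recon) for a, b in zip(fa, fb)]
    return sum(diffs) / len(diffs)


def mean_anomaly_score(errors: Sequence[float]) -> float:
    """A(e)_mean: the average of the per-segment reconstruction errors."""
    return sum(errors) / len(errors)


def is_anomalous(score: float, theta: float) -> bool:
    """Flag the clip as anomalous when its score exceeds the threshold theta."""
    return score > theta


# Toy usage with a stand-in "model" that always reconstructs zeros.
clip = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
zero_model = lambda seg: [[0.0] * len(f) for f in seg]
errors = [segment_error(s, zero_model) for s in split_segments(clip, n=2)]
score = mean_anomaly_score(errors)
```

In the toy clip, only one frame deviates from the stand-in reconstruction, so averaging over all segments dilutes its contribution to the clip-level score.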
However, the training procedure does not involve anomalous sound. As a result, the AE-based methods could be limited in the scenario where the learned feature also fits the anomalous sound [2]. In this case, the anomalous sound could be well reconstructed with a smaller reconstruction error than that of the normal sounds of different machines, even of the same type. For example, the anomalous sounds from one machine may be similar to the normal sounds of another machine, due to different usage of different machines. An AE trained with these different machines of the same machine type can then reconstruct the anomalous sounds well, and thus it may not be able to detect these anomalous sounds.
In addition, for anomalous events that only appear for a short time in audio clips, the anomaly score calculated by mean reconstruction error is often too small, making it difficult to detect the anomaly.

Proposed method
This section presents our IDC-TransAE with weighted anomaly score computation for unsupervised ASD. We introduce IDC-TransAE to reconstruct the spectrogram of normal sounds to learn their distributions, and apply GWRP for weighted anomaly score computation to perform anomaly detection.

ID constraint Transformer autoencoder
We utilize Transformer to exploit temporal information for better reconstruction of normal sounds, where only the encoder layer of Transformer is employed to build the encoder and decoder of our IDC-TransAE architecture. In addition, machine ID is adopted to constrain the latent space of the AE by introducing a simple ID classifier to learn different representations for different normal sounds. The framework of the proposed IDC-TransAE is illustrated in Fig. 3.

Center frame prediction
For better reconstruction of the spectrogram of normal non-stationary sound, following IDNN, we introduce a center frame prediction (CFP) method by removing the center frame of the input frames and predicting the removed frame. After removing the center frame $x_{\frac{N+1}{2}}$, the input frames can be expressed as

$$X = [x_1, \dots, x_{\frac{N+1}{2}-1}, x_{\frac{N+1}{2}+1}, \dots, x_N]^{T}, \qquad (6)$$

where T denotes the matrix transposition operation. Unlike IDNN, the predicted center frame obtained by CFP is the average pooling of the decoder output in frames, which is then processed by a linear layer, as shown in Fig. 3.
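The center-frame removal used by CFP can be illustrated with a small sketch. The function name `remove_center_frame` and the list-based frame representation are hypothetical choices for illustration; only the indexing of the center frame $x_{\frac{N+1}{2}}$ follows the text.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def remove_center_frame(frames: Sequence[Frame]) -> Tuple[List[Frame], Frame]:
    """Return (context frames, center frame) for an odd-length window of N frames."""
    n = len(frames)
    if n % 2 == 0:
        raise ValueError("N must be odd so that a unique center frame exists")
    c = (n + 1) // 2 - 1  # zero-based index of x_{(N+1)/2}
    context = list(frames[:c]) + list(frames[c + 1:])
    return context, frames[c]


# N = 5 frames with M = 1 feature each; the third frame is the prediction target.
window = [[1.0], [2.0], [3.0], [4.0], [5.0]]
context, center = remove_center_frame(window)
```

The context frames form the model input, while the removed center frame serves as the target of the reconstruction loss.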

Linear phase embedding
To represent the appropriate positional relationship of the sound signal, we propose a linear phase embedding (LPE) method for IDC-TransAE, to replace positional encoding (PE), which is often used in Transformer to provide additional position information via a sinusoid function [42] but is not strongly correlated with the sound signal. In contrast, LPE in the proposed method preserves the signal's temporal information by linearly embedding the phase angles of the signal to the same dimensions as the input X. The phase angle is obtained via the short-time Fourier transform (STFT).
Assume that the phase angles corresponding to X are $\Phi$, with the center frame removed, and that $F(\cdot)$ is the linear embedding function, including two linear layers with batch normalization. The output $\hat{X}$ of the decoder $D(\cdot)$ can be obtained as

$$\hat{X} = D(E(X + F(\Phi))). \qquad (7)$$

Then, the average pooling of $\hat{X}$ is used to predict the center frame as $W_o\,\mathrm{AvgPool}(\hat{X}) + b_o$, and the reconstruction loss $L_r$ for the center frame (Eq. (8)) is computed between this prediction and the removed center frame $x_{\frac{N+1}{2}}$, where $W_o$ and $b_o$ are the learnable parameters of the last output linear layer. The LPE module helps preserve the temporal information of the signal to enhance the ability of the model for anomalous sound detection.
Fig. 2 The architecture of IDNN, which uses the reconstruction error of the center frame as the anomaly score

ID classifier
We observed that the performance of the trained AE model on different machines with the same type could be quite different. The potential cause is the difference in distributions of normal machine sound when individual machines have different usages. However, the trained model only learns how to reconstruct the general distribution of the sounds of different normal machines.
To enable the model to learn different representations for different machine sounds even with the same type, we introduce an ID classifier C(·) with machine ID information to constrain the latent feature z of the AE. The structure of the ID classifier C(·) is given in Fig. 3, which consists of a max pooling layer, two linear layers with a ReLU [45] function, and a softmax activation function.
Here, the latent feature z is the output of the encoder of the AE and the input of the classifier, defined as $z = E(X + F(\Phi))$, where $\Phi$ denotes the phase angles corresponding to X. The output of the ID classifier $\hat{l} = C(z) \in \mathbb{R}^{K}$ is the probability indicating normal/anomalous sound corresponding to the machine ID, and K is the number of machines with the same type. Then, the classification error of C(·) can be obtained via a cross-entropy loss function [46]

$$L_c = \mathrm{CrossEntropy}(\hat{l}, l), \qquad (9)$$

where $l \in \mathbb{R}^{K}$ is the one-hot vector of the machine ID label of the sound signal.
Therefore, the proposed IDC-TransAE can be jointly trained by minimizing the center frame reconstruction error and the machine ID classification error with the joint loss function $L_{total}$ (Eq. (10)), where α ∈ [0, 1) is a hyper-parameter. The magnitude of α denotes the extent to which the machine ID classifier restricts z. By jointly training the AE with the ID classifier, we can improve anomaly detection performance.
Fig. 3 The architecture of the proposed IDC-TransAE for normal sound reconstruction. X and the phase angles $\Phi$ are the inputs to the model, which are obtained from the sound signal by removing the center frame. The final predicted center frame is obtained by average pooling of the output of the decoder $\hat{X}$ in frames and a linear layer, and $\hat{l}$ is the predicted machine ID probability of the sound signal, which is obtained by max pooling of the output of the encoder z in frames and two linear layers with softmax. IDC-TransAE is optimized by the combination of reconstruction error and classification error

Weighted anomaly score computation
For anomaly detection, the mean anomaly score $A(e)_{\mathrm{mean}}$ in Eq. (3) usually underestimates the anomaly scores of anomalous audio clips when the anomalous events only appear for a short time. One solution is to use the maximal reconstruction error as the anomaly score, i.e., the max anomaly score $A(e)_{\max} = \max(e)$, to highlight the anomalies of these audio clips. However, it is not robust to use the maximum value of e as the anomaly score of the whole audio clip, as it may overestimate the anomaly scores of some normal audio clips.
To improve the reliability of the calculated anomaly score, we employ the global weighted rank pooling (GWRP) method to obtain a weighted anomaly score. GWRP is a generalization of max and mean pooling, which can highlight the anomaly score by assigning different weights to the elements of the reconstruction error sequence e. Let $\hat{e} = \{\hat{e}_1, \dots, \hat{e}_I\}$ be e sorted in descending order; the GWRP anomaly score can then be calculated as

$$A(\hat{e})_{\mathrm{gwrp}} = \frac{1}{Z(r)} \sum_{i=1}^{I} r^{i-1} \hat{e}_i, \qquad (11)$$

where $0 \le r \le 1$ is a hyper-parameter and $Z(r) = \sum_{i=1}^{I} r^{i-1}$ is a normalization term. When r = 0, $A(\hat{e})_{\mathrm{gwrp}}$ degenerates to $A(e)_{\max}$, and when r = 1, $A(\hat{e})_{\mathrm{gwrp}}$ becomes $A(e)_{\mathrm{mean}}$. GWRP assigns larger weights to the segments with larger reconstruction errors, which are more likely to be anomalous, and lower weights to the remaining segments, to generate high anomaly scores for anomalous events of short duration. In addition, the classification error is combined with the reconstruction error to calculate the anomaly score, to allow the anomaly score to increase if the ID classifier misclassifies the machine ID. Finally, the weighted anomaly score is calculated by combining $A(\hat{e})_{\mathrm{gwrp}}$ with the classification error (Eq. (12)), where β ∈ [0, 1] is a parameter weighting the impact of a false prediction by the ID classifier on the anomaly score. For clarity, the proposed IDC-TransAE with weighted anomaly score computation is denoted as IDC-TransAE-W in the following section.
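The GWRP score defined by the weights $r^{i-1}$ and the normalization term Z(r) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `gwrp_score` and the example error values are hypothetical, and the combination with the classification error via β is not included.

```python
from typing import Sequence


def gwrp_score(errors: Sequence[float], r: float) -> float:
    """Global weighted rank pooling of per-segment reconstruction errors."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    ranked = sorted(errors, reverse=True)           # e-hat: errors in descending order
    weights = [r ** i for i in range(len(ranked))]  # r^(i-1) for i = 1..I; 0 ** 0 == 1 in Python
    z = sum(weights)                                # normalization term Z(r)
    return sum(w * e for w, e in zip(weights, ranked)) / z


# Hypothetical error sequence: one short-lived spike among low errors.
errors = [0.1, 0.1, 0.9, 0.1]
score_max = gwrp_score(errors, 0.0)   # equals max(errors)
score_mean = gwrp_score(errors, 1.0)  # equals the mean of errors
score_mid = gwrp_score(errors, 0.5)   # lies between the mean and the max
```

With an intermediate r, the spike still dominates the score, but the lower-ranked segments temper it, which is the compromise between the max and mean scores described above.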

Experimental setup

Dataset
We evaluate our method on the DCASE 2020 Challenge Task2 [12] dataset, which comprises parts of the MIMII [47] and ToyADMOS [48] datasets, including the normal/anomalous operating sounds of six types of real/toy machines. The MIMII dataset includes four types of machines (i.e., Fan, Pump, Slider, and Valve), with four different machines for each machine type. The ToyADMOS dataset consists of two types of machines (i.e., ToyCar and ToyConveyor), with four and three different machines for each type, respectively. Each recording is a single-channel audio clip of 10 s with a 16-kHz sampling rate that includes both a target machine's operating sound and environmental noise. Following [12], the training set only includes normal sounds, with around 6000 items for each machine type, and the test set consists of both normal and anomalous sounds, including about 500 to 1000 items for normal and anomalous sounds in each machine type.

Performance metrics
Following [12,20,29,37,49], we employ the area under the receiver operating characteristic (ROC) curve (AUC) and the partial-AUC (pAUC) as the performance metrics, where the pAUC is calculated as the AUC over a low false-positive-rate (FPR) range [0, p] and p = 0.1 as in [12]. Higher AUC indicates better model performance. pAUC reflects the reliability of the ASD system based on practical requirements. It is important to increase pAUC to avoid the ASD system predicting false alerts frequently [12]. In addition, the minimum AUC (mAUC) is adopted to represent the worst detection performance achieved among individual machines of the same machine type, following [37].
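As a rough illustration of the AUC and mAUC metrics, the sketch below computes AUC through its rank interpretation (the probability that an anomalous clip receives a higher score than a normal clip) and takes the minimum over machine IDs. The function names, machine ID labels, and score values are hypothetical; pAUC, which restricts the FPR range to [0, p], is not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple


def auc(normal_scores: Sequence[float], anomalous_scores: Sequence[float]) -> float:
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalous_scores) * len(normal_scores))


def min_auc(per_machine: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """mAUC: the worst AUC among the individual machines of one machine type."""
    return min(auc(normal, anomalous) for normal, anomalous in per_machine.values())


# Hypothetical anomaly scores (normal, anomalous) for two machine IDs of one type.
scores = {
    "id_00": ([0.1, 0.2, 0.3], [0.8, 0.9]),  # perfectly separated
    "id_02": ([0.1, 0.5, 0.6], [0.4, 0.7]),  # partially overlapping
}
worst = min_auc(scores)
```

A type-level average AUC would hide the weaker machine in this example, which is why the worst-case value is reported separately.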

Implementation details
The implementation details of IDC-TransAE can be seen in Table 1. We use the log-Mel spectrogram and phase angle of the sound signal as the input of our IDC-TransAE. The frame size is 1024 with 50% overlap, i.e., the number of FFT bins (n_FFT) is 1024, and the hop length is 512. The number of Mel filter banks (n_Mels) is set as 128. The number of frames (i.e., N) is 5. The dimension of phase angles is 513, which is embedded to a 128-dimensional vector by the linear function F(·). Here, F(·) consists of two linear layers with batch normalization. The encoder and decoder of IDC-TransAE include two layers each. The classifier includes a max pooling layer, two linear layers with a ReLU, and a softmax activation function. The hyper-parameter α of the joint loss function is empirically set as 0.3.
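The feature dimensions quoted above can be checked with a few lines of arithmetic. The constant names are hypothetical, and the number of frames per clip assumes that the signal is padded so that frames are centered, which is a common default in audio libraries but is not stated in the text.

```python
SAMPLE_RATE = 16_000  # Hz, from the dataset description
CLIP_SECONDS = 10     # each recording is 10 s long
N_FFT = 1024          # frame size
HOP_LENGTH = 512      # 50% overlap
N_MELS = 128          # Mel filter banks
N_FRAMES = 5          # N, frames per model input

# One-sided STFT: n_fft // 2 + 1 frequency bins, i.e., the phase-angle dimension.
phase_dim = N_FFT // 2 + 1

# STFT frames per clip, ASSUMING centered (padded) framing; without padding
# the count would be slightly smaller.
n_samples = SAMPLE_RATE * CLIP_SECONDS
frames_per_clip = 1 + n_samples // HOP_LENGTH

# Context frames fed to the encoder once the center frame is removed.
context_frames = N_FRAMES - 1
```

The value of `phase_dim` matches the 513-dimensional phase angle mentioned above, which F(·) then maps to the 128 dimensions of the log-Mel input.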
The Adam optimizer [50] is used to optimize our model with a learning rate of 0.0001. For each machine type, our model is trained for 300 epochs, and the batch size is set as 2000. In the joint training stage, we found that the classification loss converges much faster than the reconstruction loss, so we adopt a training strategy to avoid overfitting of the classifier: the classifier is trained every 10 epochs (i.e., using the L_total loss), and the remaining epochs are used for the autoencoder only (i.e., using the L_r loss). In weighted anomaly score computation, r and β are empirically selected, and the values of r and β are provided in Table 1.
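The alternating training strategy can be sketched as a simple epoch schedule. The function name `loss_for_epoch` is hypothetical, and reading "every 10 epochs" as epochs 10, 20, 30, and so on is an assumption, since the exact offset is not given in the text.

```python
def loss_for_epoch(epoch: int, classifier_period: int = 10) -> str:
    """Return which loss drives the update in a given (1-based) epoch."""
    # ASSUMPTION: "every 10 epochs" is read as epochs 10, 20, 30, ...
    return "L_total" if epoch % classifier_period == 0 else "L_r"


# 300 epochs per machine type, as stated in the implementation details.
schedule = [loss_for_epoch(e) for e in range(1, 301)]
joint_epochs = schedule.count("L_total")
```

Under this reading, the classifier is updated in only one tenth of the epochs, which limits how far it can run ahead of the slower-converging reconstruction loss.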
Table 2 shows the comparison results in terms of AUC and pAUC.Here, IDC-TransAE-W and IDC-TransAE-mean represent IDC-TransAE with weighted anomaly score computation and mean anomaly score computation, respectively.In addition, the proposed methods without using ID information are evaluated, i.e., TransAE-W and TransAE-mean.
As shown in Table 2, the methods with ID information (denoted as w/ ID) give better detection performance than the methods without ID information (denoted as w/o ID), except IDCAE. The proposed IDC-TransAE-W performs the best in terms of average AUC and pAUC. Amongst the methods without using ID information, our TransAE-W also achieves the best overall performance. Especially, both TransAE-W and IDC-TransAE-W can substantially improve the performance on the
Note that r in the weighted anomaly score computation can be adjusted according to the time length of the anomalous event. For example, when r = 1, $A(\hat{e})_{\mathrm{gwrp}} = A(e)_{\mathrm{mean}}$, so the mean anomaly score is a special case of the weighted anomaly score, which makes the weighted score more widely applicable than the mean anomaly score. The influence of r will be discussed in Section 4.4.

Detection stability
To demonstrate the effectiveness of our method for more stable detection, another experiment is conducted to show the worst detection performance on individual machines of the same type, where the self-supervised based methods (i.e., MobileNetV2 and Glow_Aff ) and the typical AE-based method (i.e., IDNN) are employed for comparison.The results in terms of mAUC are given in Table 3.
As can be seen from Tables 2 and 3, the self-supervised methods, i.e., MobileNetV2 and Glow_Aff, can achieve significant improvements in average AUC and pAUC, as compared to the AE-based method IDNN. However, they perform dramatically differently even for machines of the same type, e.g., MobileNetV2 has much smaller mAUC than AUC on Fan, Pump, ToyCar, and ToyConveyor. The results demonstrate the instability of the self-supervised methods.
Especially, the average mAUC (i.e., 59.73%) of MobileNetV2 is lower than that of IDNN (i.e., 64.46%), which indicates that the self-supervised classification method (i.e., MobileNetV2) indeed easily fails on some individual machines and lacks performance consistency. In contrast, the AE-based method IDNN can provide a relatively stable detection performance. Although the flow-based self-supervised method (Glow_Aff) can improve detection stability to some extent compared to the AE-based method, our proposed method can achieve the best average mAUC performance and obtain more stable performance for some machine types, i.e., Slider, Valve, and ToyCar.
Although Glow_Aff has a higher mAUC on Pump than our proposed method, the model needs to be trained for each individual machine, which could be limiting in real-world applications. In contrast, our proposed method only needs to train one model for each machine type.

Generalization to anomaly
To demonstrate the proposed IDC-TransAE can mitigate the generalization of AE for anomalous sound and improve its detection performance, experiments are conducted to compare it with the typical AE-based method (i.e., IDNN).
First, we show the histograms of anomaly score distribution on Slider, Valve, and ToyCar using IDNN and our proposed IDC-TransAE.For a fair comparison, our method (i.e., IDC-TransAE-mean) also adopts mean anomaly score computation as IDNN, and the results are provided in Fig. 4. Here, the anomaly score is on the horizontal axis of the histogram, which is normalized to facilitate comparison.The vertical axis represents the number of audio samples corresponding to the anomaly score distribution on the histogram.
From Fig. 4, we can see that, for IDNN, the anomaly score distribution of the anomalous sound tends to be similar to that of the normal sound, especially on ToyCar, as shown in Fig. 4a. It shows that most anomalous sounds have a small anomaly score similar to normal sounds. This indicates that the AE-based method (i.e., IDNN) is able to generalize the representation to anomalous sound, which reduces its ability to distinguish between normal and abnormal sound. In contrast, our proposed method can give higher anomaly scores for the anomalous sound, and provide better detection ability than the AE-based method, as shown in the histograms in Fig. 4b, which demonstrates the effectiveness of our proposed IDC-TransAE architecture.
To further demonstrate that our proposed IDC-TransAE can mitigate its generalization to the anomaly, we perform another experiment for non-stationary anomalous sound detection (i.e., sound of Valve) as compared with IDNN, where the log-Mel spectrogram reconstruction of normal and anomalous sound is illustrated in Fig. 5. From left to right, Fig. 5 shows the original log-Mel spectrograms, the reconstructed log-Mel spectrograms, and the absolute values of their difference. Comparing the red box areas illustrated in Fig. 5a and b, the proposed IDC-TransAE can provide better normal sound reconstruction, as it can achieve smaller reconstruction error for normal sound than the typical AE-based method (i.e., IDNN). This can be clearly observed by comparing the absolute difference between the original and the reconstructed log-Mel spectrograms in the red box areas in Fig. 5a and b. For the anomalous sound reconstruction, in contrast, our proposed method can give larger reconstruction error than the typical AE-based method, which means that our method has a better ability to highlight the anomalies when reconstructing the anomalous sound. This can be observed from the comparison between the red box areas in Fig. 5c and d, where the absolute difference shown in Fig. 5d is much more pronounced than that in Fig. 5c. The results further demonstrate that our proposed IDC-TransAE can mitigate the generalization problem of the AE-based method and has a better ability in anomaly detection.
Note that the log-Mel spectrogram of the anomalous sound also shows that the anomalies may appear for a short time in the sound, as illustrated in Fig. 5. In this case, the mean anomaly score computation method will give low anomaly scores for the anomalous events that only appear for a short time.

Ablation studies
To show the effectiveness of different parts of our proposed IDC-TransAE-W, ablation studies are conducted, where AUC and pAUC are used as performance metrics. The results are given in Table 4.
To further demonstrate the effectiveness of the IDC module, we compare the performance of IDC-TransAE-W and TransAE/LPE/CFP-W in terms of AUC and pAUC on four different machines of the machine type Fan. The result is illustrated in Fig. 6. From Fig. 6, we can see that IDC-TransAE-W can significantly improve the performance on ID_02, ID_04, and ID_06, as compared with TransAE/LPE/CFP-W. This means the IDC method can better distinguish the anomalous sound for different machines with the same type. The results in Table 4 and Fig. 6 verify the effectiveness of different modules of our proposed method. To further illustrate the effectiveness of each module, we give the visualization analysis for each module in the following Section 4.4.

Visualization analysis
In this section, visualization analysis is provided for better understanding of the experimental results in the ablation studies. Specifically, the effectiveness of the CFP, LPE, and IDC modules in our proposed IDC-TransAE method is further evaluated. Besides, the influence of the parameter in the GWRP operation of anomaly score calculation is also explored in this section.

Effectiveness of CFP
To show how the CFP operation affects the anomaly detection for non-stationary sound signals, we compare the histograms of anomaly score distribution between TransAE/PE-W and TransAE/PE/CFP-W on Valve. The result is given in Fig. 7. As in Fig. 4, the anomaly score is normalized to facilitate comparison. By comparing Fig. 7a and b, we can see that the distribution of normal sound samples covers a smaller range of anomaly scores when adopting the CFP module (i.e., TransAE/PE/CFP-W), as illustrated in Fig. 7b. It verifies that the CFP operation can improve the reconstruction of non-stationary signals as described in [20]. Therefore, it can improve the performance of our proposed method for anomaly detection of non-stationary sound signals.

Visualization of linear phase embedding
To show why the LPE module can enhance the ability of the model for anomalous sound detection, we visualize the encoding result of five consecutive input sound signals of TransAE/LPE/CFP, and compare it with the encoding result of positional encoding for TransAE/PE/CFP, as illustrated in Fig. 8. Here, f_1 to f_5 are the encoding visualizations corresponding to the five consecutive input sound signals, respectively, where each input includes four frames.
From Fig. 8a, we can see that the positional encoding visualization of each input is the same because the positional encoding operation adopts the same cosine representation for signal encoding. In contrast, by linearly encoding the phase information of the signal, our proposed LPE can preserve the signal's own temporal information and give different encoding representations for each different input signal, as indicated in the red box in Fig. 8b. Therefore, our proposed method can learn better latent features with unique characteristics from each signal, and enhance the ability of the model for anomalous sound detection.
Fig. 6 Performance illustration for 4 different machines with the same type, i.e., Fan

Validation of IDC module
We show the t-distributed stochastic neighbor embedding (t-SNE) cluster visualization of the latent features to validate the IDC module further. The experiment is conducted on the test dataset of the machine type ToyCar, where the proposed method IDC-TransAE without using ID information (i.e., TransAE/LPE/CFP) is employed for comparison. The result is illustrated in Fig. 9. As observed from Fig. 9a, the latent features of normal and anomalous sound samples from different machines overlap with each other when using the method without the IDC module (i.e., TransAE/LPE/CFP). In addition, the latent features of the normal sound samples of one machine may be closer to those of the anomalous samples from other machines than to those of the normal samples from the same machine, as illustrated in Fig. 9a. As a result, the latent features of some anomalous samples from one machine lie on the manifold of the normal samples from another machine. These anomalous sounds will therefore be well reconstructed, making it hard to distinguish the anomalies and reducing the detection performance. By introducing the IDC module to constrain the latent feature, the proposed method can reduce the generalization of the AE to anomalous sound and further improve its distinguishing ability, so that the normal and anomalous latent features are well separated, as illustrated in Fig. 9b.

Influence of parameter r for anomaly detection
As mentioned in Section 3.2, we introduce the weighted anomaly score computation to highlight the anomalous events that only appear for a short time.The parameter r in Eq. (11) will decide the way for anomaly score computation, i.e., weighted anomaly score computation will degenerate to max anomaly score computation when r = 0 , and it will become mean anomaly score compu- tation when r = 1 .Therefore, we also carry out another experiment to show the impact of parameter r on the performance of our proposed IDC-TransAE for anomaly detection.Here, different values of r from 0 ≤ r ≤ 1 with an interval of 0.05 are selected to evaluate the performance of our proposed method in terms AUC and pAUC on all six machine types.The result is shown in Fig. 10.
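The effect of sweeping r over the grid described above can be illustrated on synthetic data. This sketch does not reproduce the AUC and pAUC experiment; it only shows, for two hypothetical error sequences, how the score gap between a clip with one short spike and a flat normal clip changes with r. All values and names are assumptions for illustration.

```python
from typing import List, Sequence


def gwrp_score(errors: Sequence[float], r: float) -> float:
    """GWRP score with weights r^(i-1) over errors sorted in descending order."""
    ranked = sorted(errors, reverse=True)
    weights = [r ** i for i in range(len(ranked))]
    return sum(w * e for w, e in zip(weights, ranked)) / sum(weights)


# Grid of r values: 0.00, 0.05, ..., 1.00.
r_grid: List[float] = [round(0.05 * k, 2) for k in range(21)]

# Hypothetical per-segment errors: a normal clip and an anomalous clip with one short spike.
normal = [0.10] * 20
anomalous = [0.10] * 19 + [0.60]

# Score gap between the anomalous and the normal clip for each r.
gaps = [gwrp_score(anomalous, r) - gwrp_score(normal, r) for r in r_grid]
```

In this toy case the gap shrinks steadily as r approaches 1, reflecting how averaging dilutes a short-lived anomaly. On real data, very small r also amplifies isolated large errors in normal clips, which is the trade-off that motivates tuning r per machine type.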
From Fig. 10, we can see that the mean score computation (i.e., r = 1) achieves the best performance for the machine types Fan, Pump, ToyCar, and ToyConveyor. However, it obtains the worst performance for the machine types Slider and Valve, where the anomalous sound often occurs only for a short time. Although max anomaly score computation (r = 0) performs better than mean anomaly score computation on Slider and Valve, the weighted anomaly score computation provides the best performance for these two machine types. In particular, it significantly improves the pAUC over both the mean and max anomaly score computation on Slider and Valve. The result verifies the effectiveness of weighted anomaly score computation for anomalous sounds that appear only for a short time. In addition, the value of r can be adjusted according to the machine type, which makes the method more widely applicable than mean or max anomaly score computation.
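As an illustration of how r might be chosen per machine type, the following sketch sweeps r from 0 to 1 in steps of 0.05 and records the AUC at each value. It is not the evaluation code used in the experiments: the pooling function assumes the usual global weighted ranking pooling form, the AUC is computed by direct pairwise comparison, and the helper names and toy error sequences are hypothetical. The toy clips mimic the case where the anomaly occupies a single frame.

```python
def gwrp(errors, r):
    """Assumed rank-weighted pooling: the k-th largest error gets weight r**(k-1)."""
    ranked = sorted(errors, reverse=True)
    weights = [r ** k for k in range(len(ranked))]
    return sum(w * e for w, e in zip(weights, ranked)) / sum(weights)


def auc(normal_scores, anomalous_scores):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))


def sweep_r(normal_clips, anomalous_clips, step=0.05):
    """Return a dict mapping each r on the grid 0, step, ..., 1 to its AUC."""
    grid = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    results = {}
    for r in grid:
        normal = [gwrp(clip, r) for clip in normal_clips]
        anomalous = [gwrp(clip, r) for clip in anomalous_clips]
        results[r] = auc(normal, anomalous)
    return results


# Hypothetical frame-level errors: normal clips are uniformly moderate,
# anomalous clips are low except for one short, high-error frame.
normal_clips = [[0.30, 0.32, 0.31, 0.29], [0.40, 0.38, 0.41, 0.39]]
anomalous_clips = [[0.10, 0.10, 0.90, 0.10], [0.20, 0.95, 0.20, 0.20]]

results = sweep_r(normal_clips, anomalous_clips)
best_r = max(results, key=results.get)  # first grid value reaching the top AUC
```

On this toy data, mean pooling (r = 1) ranks most anomalous clips below the normal ones, while small values of r separate them. This only illustrates the mechanism on constructed data and says nothing about the values reported in Fig. 10.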

Conclusions
In this paper, we have presented an IDC-TransAE architecture with weighted anomaly score computation for unsupervised ASD, where an ID classifier was introduced to mitigate the generalization of AE for anomalous sound and enhance the distinguishing ability for different machines of the same type. In addition, center frame prediction was utilized to improve the reconstruction of the non-stationary sound signal, and a linear phase embedding strategy was applied to preserve the signal's temporal information and further improve its distinguishing ability for anomalous sound detection. Moreover, a weighted anomaly score computation method was introduced to highlight the anomaly scores for anomalous events that only appear for a short time. The experiments demonstrate the effectiveness and superiority of our proposed method, as compared with the baseline methods.

1. We analyze the generalization problem of AE for ASD, point out the main reason for this problem, and propose a solution, i.e., IDC-TransAE, to mitigate the generalization of AE and improve the detection performance. To the best of our knowledge, this is the first work to clearly point out the main reason for the generalization problem of AE for ASD.
2. We propose an ID constraint (IDC) classifier to learn different audio feature distributions from the same machine type, which can enhance the distinguishing ability for anomaly detection.
3. We design a linear phase embedding (LPE) to replace the traditional positional encoding (PE), so that the temporal information inherent in machine sounds is preserved through the phase of the sounds.
4. In the anomaly score calculation, we introduce the global weighted ranking pooling (GWRP) to highlight the anomaly score of sounds with short-time non-stationary anomalies, which obtains a more stable and consistent detection performance.
5. Experimental results verify that the proposed IDC-TransAE method can mitigate the generalization problem of AE for ASD. Ablation studies and visualizations further verify the effectiveness of the design of the ID constraint, LPE, and GWRP for ASD.

Our study employs the DCASE 2020 Challenge Task2 dataset to address AE's generalization problem in ASD. The DCASE 2022 and 2023 datasets are excluded because they are tailored for domain-shift and first-shot scenarios, which are beyond the scope of this paper.
is the corresponding output frames, and $\|\cdot\|_F$ denotes the Frobenius norm. It results in a reconstruction error sequence $e = \{e_i\}_{i=1}^{I}$ for $Y$, and the mean reconstruction error of $e$ can be used as the anomaly score

$$A(e)_{mean} = \frac{1}{I}\sum_{i=1}^{I} e_i$$

Here, $A(e)_{mean}$ represents the anomalous degree of the audio clip. Whether the clip is normal or anomalous is determined by $H(e, \theta)$ [44]:

Fig. 1 Typical architecture of AE for unsupervised ASD, which uses the reconstruction error between the input and output as the anomaly score

and $T$ denotes transposition. The loss function of IDNN is formulated as where $x_{\frac{N+1}{2}}$

Fig. 4 The histograms of anomaly scores distribution on Slider, Valve, and ToyCar using IDNN and the proposed IDC-TransAE-mean, where blue and orange indicate the anomaly score distribution of normal and anomalous sound, respectively

Fig. 7 The comparison between TransAE/PE-W and TransAE/PE/CFP-W on histograms of anomaly scores distribution of Valve

Fig. 9 Visualization of latent features on the test dataset for the machine type ToyCar using TransAE/LPE/CFP and IDC-TransAE. Different colors represent different machine IDs. The "•" and "×" denote normal and anomalous samples, respectively

Table 1 Implementation details for all machine types

Table 2 Performance comparison in terms of AUC (%) and pAUC (%) for different types of machines

Table 3 Performance comparison in terms of mAUC (%) among the individual machines of the same type

To show the effectiveness of LPE, we compare the performance of TransAE/PE/CFP-W and TransAE/LPE/CFP-W. The result shows that TransAE/LPE/CFP-W can improve the detection performance on Fan, Slider, Valve, and ToyConveyor and achieve better average AUC

Table 4 Validation of different modules of IDC-TransAE