The OMR system for Kunqu Opera (KOMR), shown in Figure 6, is an offline optical recognition system comprising six independent stages: (1) image preprocessing, (2) document image segmentation, (3) feature extraction, (4) musical symbol recognition, (5) musical semantics, and (6) MIDI representation.
Because of the maturity of image digitizing technology [29], paper-based Kunqu Opera musical scores can easily be converted into digital images. Therefore, this paper does not discuss the image acquisition stage. The system processes a gray-level bitmap from a scanner or reads directly from image files; the input bitmap is generally 300 dpi with 256 gray levels.
3.1 Image preprocessing
Preprocessing may involve any of the standard image-processing operations, including noise removal, blurring, deskewing, contrast adjustment, sharpening, binarization, and morphology. Several operations may be necessary to prepare a raw input image for recognition, such as the selection of an area of interest, the elimination of non-musical elements, image binarization, and the correction of image rotation.
Many documents, particularly historical documents such as Kunqu Opera GCN scores, rely on careful preprocessing to ensure good overall system performance and, in some cases, to significantly improve recognition performance. In this work, we briefly touch upon the basic preprocessing operations applied to Kunqu Opera GCN scores: binarization, noise removal, skew correction, and the selection of an area of interest. Image binarization uses Otsu's algorithm [30], noise removal is performed with basic morphological operations [31], and image rotation is corrected using the least-squares criterion [32]. The area of interest in a Kunqu score is surrounded by a frame of wide lines (see Figure 3). The frame is a connected component containing the longest line in the score, so we use the Hough transform [33] to locate the longest line and then delete the connected component containing it, leaving the area of interest. Because these are common image-processing operations, we do not describe them in detail here.
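Otsu's method [30] chooses the binarization threshold that maximizes the between-class variance of the gray-level histogram. The following is only an illustrative sketch, operating on a flat list of 8-bit gray values; the function names and the dark-ink foreground convention are our assumptions, not details of the KOMR implementation:

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold maximizing between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_bg = 0.0          # running sum of gray values in the background class
    weight_bg = 0         # running count of background-class pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    # In a scanned score the ink (foreground) is dark, so dark pixels map to 1.
    return [1 if p <= threshold else 0 for p in pixels]
```

On a strongly bimodal histogram (ink versus paper), the chosen threshold falls between the two modes, which is the behavior the preprocessing stage relies on.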
3.2 Document segmentation and analysis of GCN scores
Document segmentation is a key phase in the KOMR system. With the symbols in a GCN score document having been classified, this stage first segments the document into two sections, one containing note symbols and the other non-note symbols. Non-note elements, such as the title, key signature, qupai, lyrics, noise, and the border lines of the textural framework, are then identified and removed.
Because music is a time-based art, the arrangement of the notes is one of its most important factors. Therefore, obtaining the arrangement of the notes in a GCN score is a prerequisite for document segmentation. Consistent with the writing style of the GCN score, the arrangement of the notes can be organized based on high-level field knowledge of Kunqu Opera scores.
Several document image segmentation methods have been proposed, with the best known being X-Y projection [34], the run-length smoothing algorithm (RLSA) [35], component grouping [36], scale-space analysis [37], Voronoi tessellation [38], and the Hough transform [39]. Among these, X-Y projection, RLSA, scale-space analysis, and the Hough transform are suitable for segmenting handwritten documents such as GCN scores [40].
A preliminary result of GCN score segmentation was presented in [41]. A self-adaptive RLSA was used to segment the image according to an X-axis function (denoted by PF(x)) indicating the number of foreground pixels in each column of the image. From this X-projection, the algorithm computes the number of flex points, i.e., the points that satisfy PF(x − 1) < PF(x) and PF(x) > PF(x + 1), or PF(x − 1) > PF(x) and PF(x) < PF(x + 1). Next, the algorithm iteratively smoothes the function and analyzes each smoothed function until the number of flex points in two successive functions is equal. Finally, the image is segmented into several sub-images based on the X-axis values of the flex points in the function. To extract notes from the image, all connected components are identified using a conventional connected-component labeling algorithm, and the minimum bounding box of each connected component is computed. The algorithm then matches each connected component to its corresponding sub-image. According to the experimental results in [41], the rate of correct line segmentation is 98.9% and the loss rate of notes is 2.7%; however, the total error rate over all notes and lyrics is almost 22%.
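The flex-point criterion on the X-projection can be sketched as follows. This is a minimal illustration with a simple 3-tap mean smoother standing in for the iterative smoothing of [41]; the function names are ours:

```python
def flex_points(pf):
    """Count local extrema (flex points) of the column-projection profile PF:
    PF(x-1) < PF(x) > PF(x+1)  or  PF(x-1) > PF(x) < PF(x+1)."""
    count = 0
    for x in range(1, len(pf) - 1):
        if (pf[x-1] < pf[x] > pf[x+1]) or (pf[x-1] > pf[x] < pf[x+1]):
            count += 1
    return count

def smooth(pf):
    """One smoothing pass (3-tap mean), as a stand-in for one iteration
    of the self-adaptive RLSA's smoothing step."""
    out = list(pf)
    for x in range(1, len(pf) - 1):
        out[x] = (pf[x-1] + pf[x] + pf[x+1]) / 3.0
    return out
```

Smoothing merges nearby extrema, so the flex-point count decreases monotonically over iterations; the algorithm of [41] stops when two successive passes agree.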
3.3 Symbol feature extraction
Selecting suitable features for pattern classes is a critical process in an OMR system. Feature representation is also a crucial step, because good feature data can effectively enhance the symbol recognition rate. The goal of feature extraction is to characterize a symbol to be recognized by measurements whose values are highly similar for symbols in the same category but different for symbols in different categories. The feature data must also be invariant under relevant transformations, such as translation, rotation, and scaling.
Popular feature extraction methods for OCR and Chinese character recognition include the peripheral feature [42], the cellular feature [43], and others [44]. Because symbols are written in different sizes in a GCN score, four types of structural features, based on [42, 43], have been used in this exploratory work to construct a simple and intuitive approach; these are suited to feature extraction of GCN score symbols and allow the recognition rates of KOMR to be compared. An n × m feature matrix is used for each symbol so that symbols of different sizes yield feature data of the same dimensions; it is obtained by segmenting the symbol with a grid [45]. In Figure 7, a sample symbol of size H × W is shown in subgraph (a), and an n × m grid is shown in subgraph (b). In this example, $h_0 = 0$, $h_n = H$, $h_j = \lfloor H/n \rfloor \times j$ for $0 < j < n$, and $w_0 = 0$, $w_m = W$, $w_i = \lfloor W/m \rfloor \times i$ for $0 < i < m$, and the four features of each symbol are given by the following equations:
$$T_1=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{x=w_{i-1}}^{w_i}\sum_{y=h_{j-1}}^{h_j}f(x,y),\quad 1\le i\le m,\ 1\le j\le n$$
(1)
$$T_2=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{x=w_{i-1}}^{w_i}f(x,h_j),\quad 1\le i\le m,\ 1\le j\le n$$
(2)
$$T_3=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{y=h_{j-1}}^{h_j}f(w_i,y),\quad 1\le i\le m,\ 1\le j\le n$$
(3)
$$T_4=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\begin{cases}0, & \text{for } 1<i<m \text{ and } 1<j<n\\ x, & \text{if } \sum_{k=0}^{x}f(k,h_j)=0 \text{ and } f(x+1,h_j)=1, \text{ for } i=1\\ W-x, & \text{if } \sum_{k=x}^{w_m}f(k,h_j)=0 \text{ and } f(x-1,h_j)=1, \text{ for } i=m\\ x, & \text{if } \sum_{k=0}^{x}f(w_i,k)=0 \text{ and } f(w_i,x+1)=1, \text{ for } j=1\\ H-x, & \text{if } \sum_{k=x}^{h_n}f(w_i,k)=0 \text{ and } f(w_i,x-1)=1, \text{ for } j=n\end{cases}$$
(4)
where f(x, y) is the value of the pixel at coordinates (x, y) in the image: f(x, y) = 1 indicates a foreground pixel and f(x, y) = 0 a background pixel. The T_{1} feature is the number of foreground pixels in each cell of the grid (a cellular feature), T_{2} is the number of foreground pixels falling on the horizontal grid line h_{j} of each cell, T_{3} is the number of foreground pixels falling on the vertical grid line w_{i} of each cell, and T_{4} measures, for the border cells, the number of background pixels from the edge of the grid to the first foreground pixel. If the edge pixel itself is a foreground pixel, the corresponding t_{i,j} = 0; and if there is no foreground pixel in the image, then t_{i,j} = 0 for 1 ≤ i ≤ m and 1 ≤ j ≤ n.
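As an illustration, the cellular feature T_{1} of Equation 1 can be computed as below. This sketch uses half-open cell ranges so that grid-line pixels are not counted twice, a slight simplification of the inclusive bounds in Equation 1; the representation of the image as a list of 0/1 rows is our assumption:

```python
def cellular_feature(img, n, m):
    """T1: count foreground pixels in each cell of an n x m grid.

    img is a binary image as a list of rows: img[y][x] in {0, 1},
    with height H = len(img) and width W = len(img[0]).
    Returns an m x n matrix t with t[i-1][j-1] = pixel count of cell (i, j).
    """
    H, W = len(img), len(img[0])
    # Grid lines: h_0 = 0, h_n = H, h_j = floor(H/n) * j (and similarly for w).
    h = [H if j == n else (H // n) * j for j in range(n + 1)]
    w = [W if i == m else (W // m) * i for i in range(m + 1)]
    t = [[0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t[i-1][j-1] = sum(img[y][x]
                              for y in range(h[j-1], h[j])
                              for x in range(w[i-1], w[i]))
    return t
```

Because each cell count is divided out of a fixed grid, symbols of different sizes produce feature matrices of identical dimensions, which is the point of the gridding step.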
3.4 Musical symbol recognition
Computer recognition of handwritten characters has been intensely researched for many years. Optical character recognition (OCR) is an active field, particularly for handwritten documents in scripts such as Roman [45], Arabic [46], Chinese [47], and Indian [48].
Several Chinese character recognition methods have been proposed, with the best known being the transformation-invariant matching algorithm [49], adaptive confidence transform-based classifier combination [50], probabilistic neural networks [51], radical decomposition [52], statistical character structure modeling [53], Markov random fields [54], and affine sparse matrix factorization [55].
The basic symbols for musical pitch (see Figure 2) in a Kunqu Opera GCN score are Chinese characters, but other musical pitch symbols and all rhythm symbols are specialized symbols for GCN. Thus, the methods of Kunqu Opera GCN score recognition refer to the techniques of both OMR and OCR. In this paper, the following three approaches to musical symbol recognition are compared.
3.4.1 K-nearest neighbor
The K-nearest neighbor (KNN) classifier is one of the simplest machine learning algorithms. It classifies a sample based on the closest training examples in the feature space and is a type of instance-based, or lazy, learning in which the function is only approximated locally and all computation is deferred until classification [56]. The neighbors are selected from a set of samples for which the correct classification is known; this set can be considered the training set for the algorithm, though no explicit training step is required.
In the experiment described in this paper, we choose half of the test musical symbols as training samples; for computational convenience, we set K = 1. The feature data of each class are obtained by averaging the feature data of all its samples. Although the distance function can also be learned during training, here the distance functions corresponding to the four features above are given by the following equations:
$$f(S,T)_1=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } s_{i,j}=t_{i,j}, \text{ otherwise } \rho=0$$
(5)
$$f(S,T)_2=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}-t_{i,j}\right|}{m\times n}$$
(6)
$$f(S,T)_3=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}-t_{i,j}\right|}{m\times n}$$
(7)
$$f(S,T)_4=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } \alpha<\frac{t_{i,j}}{s_{i,j}}<\beta, \text{ otherwise } \rho=0$$
(8)
where f(S, T)_{1–4} is the distance function for the corresponding feature 1–4, s_{i,j} is an element of the feature matrix of the prototype class (S), and t_{i,j} is an element of the feature matrix of the test sample (T). In Equation 5, if s_{i,j} = t_{i,j}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, then ρ = 1; otherwise, ρ = 0, so f(S, T)_1 counts the number of unequal elements in the feature matrices S and T. In Equations 6 and 7, f(S, T) is the sum of the absolute differences between all corresponding elements of S and T. Equation 8 counts the number of non-approximate elements in S and T, using the parameters α and β as empirical values. In this work, α = 0.9 and β = 1.1.
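A minimal sketch of the 1-NN classification step using the feature-1 distance of Equation 5; the dictionary-of-prototypes layout and the function names are our assumptions for illustration:

```python
def dist_feature1(S, T):
    """Equation 5: fraction of grid cells where prototype S and sample T differ."""
    m, n = len(S), len(S[0])
    equal = sum(1 for i in range(m) for j in range(n) if S[i][j] == T[i][j])
    return (m * n - equal) / (m * n)

def knn_classify(prototypes, T):
    """1-NN (K = 1): return the label of the prototype class closest to T.

    prototypes maps a class label to that class's averaged feature matrix,
    mirroring the per-class averaging described above.
    """
    return min(prototypes, key=lambda label: dist_feature1(prototypes[label], T))
```

With K = 1 and one averaged feature matrix per class, classification reduces to a single nearest-prototype lookup, which is why no explicit training step is needed.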
3.4.2 Bayesian decision theory
Bayesian decision theory underlies many statistics-based methods. Classifiers based on it are simple probabilistic classifiers that apply Bayes' rule with conditional independence assumptions, providing a straightforward approach to discriminative classification learning [57].
In this work, the conditional probabilities P(T|S_{k})_{r}, 1 ≤ r ≤ 4, for each of the four features of the Bayesian decision theory classifier are calculated as follows:
$$P(T|S_k)_1=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } s_{i,j}^{k}=t_{i,j}, \text{ otherwise } \rho=0$$
(9)
$$P(T|S_k)_2=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}^{k}-t_{i,j}\right|}{m\times n}$$
(10)
$$P(T|S_k)_3=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}^{k}-t_{i,j}\right|}{m\times n}$$
(11)
$$P(T|S_k)_4=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } \alpha<\frac{t_{i,j}}{s_{i,j}^{k}}<\beta, \text{ otherwise } \rho=0$$
(12)
where P(T|S_{k})_{1–4} is the conditional probability for the corresponding feature 1–4, S = {S_{1}, …, S_{k}, …, S_{c}} is the set of prototype classes, $s_{i,j}^{k}$ is an element of the feature matrix of the k-th class S_{k} in the set of prototype classes, and t_{i,j} is an element of the feature matrix of the test sample (T). In Equation 9, if $s_{i,j}^{k}$ = t_{i,j}, then ρ = 1; otherwise, ρ = 0. In Equation 12, α and β are again empirical values, and in this work, we set α = 0.9 and β = 1.1.
In the experiment, the prior probabilities P(S_{k}), 1 ≤ k ≤ c, of the different symbols S_{k} are not equal; all prior probabilities come from statistics of the training dataset. For example, the beat symbol ‘’ has the prior probability 0.354511. If P(S_{k}|T) = max_{j} P(S_{j}|T), then T is classified as S_{k}. Bayes' rule,
$$P(S_k|T)=\frac{P(T|S_k)_r\,P(S_k)}{\sum_{i=1}^{c}P(T|S_i)_r\,P(S_i)},\quad 1\le r\le 4,$$
gives four expressions for estimating P(S_{k}|T), one for each of the four features.
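The posterior computation can be sketched as follows, given per-class likelihoods P(T|S_k)_r and priors P(S_k). This is a generic Bayes-rule illustration under one feature r, not the KOMR code; the dictionary layout and names are our assumptions:

```python
def bayes_posterior(likelihoods, priors):
    """Posterior P(S_k|T) via Bayes' rule from per-class likelihoods
    P(T|S_k)_r and priors P(S_k), keyed by class label."""
    evidence = sum(likelihoods[k] * priors[k] for k in priors)
    return {k: likelihoods[k] * priors[k] / evidence for k in priors}

def bayes_classify(likelihoods, priors):
    """Classify T as the class S_k maximizing the posterior P(S_k|T)."""
    post = bayes_posterior(likelihoods, priors)
    return max(post, key=post.get)
```

Note that the evidence term in the denominator is the same for every class, so for classification alone it could be dropped; it is kept here so the posteriors sum to 1.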
3.4.3 Genetic algorithm
In the field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. GAs generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. In GAs, the search space parameters are encoded as strings, and a collection of strings constitutes a population. The processes of selection, crossover, and mutation continue for a fixed number of generations or until some condition is satisfied. GAs have been applied in such fields as image processing, neural networks, and machine learning [58].
In this work, the key GA parameter values are as follows:

- Selection-reproduction rate: p_{s} = 0.5
- Crossover rate: p_{c} = 0.6
- Mutation rate: p_{m} = 0.05
- Population class: C = 12
- Number of individuals in the population: N_{p} = 200
- Maximum iteration number: G = 300
An individual's fitness value is determined from the following fitness function:
$$F(I_u)=\sum_{v=1}^{R}f\left(e\left(h_{u,v},s_{u,v}\right)\right)=\sum_{v=1}^{R}f(\rho)=\sum_{v=1}^{R}\frac{\rho}{m\times n},\quad \rho=1 \text{ if } h_{u,v}=s_{u,v}, \text{ otherwise } \rho=0$$
(13)
where I_{u} is the u-th individual, F(I_{u}) is the fitness value of I_{u}, R is the number of gene bits in I_{u}, h_{u,v} is the v-th gene of I_{u}, s_{u,v} is the corresponding element of the feature matrix of the prototype classes, the function e() computes the comparison between h_{u,v} and s_{u,v}, and m and n are the width and length, respectively, of a symbol's grid.
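The fitness function of Equation 13 can be sketched as below, with an individual's genes and the prototype feature matrix both flattened to sequences of length R = m × n (an assumption made here for illustration; the names are ours):

```python
def fitness(individual, prototype, m, n):
    """Equation 13: sum rho / (m * n) over the R gene bits, where
    rho = 1 when a gene matches the corresponding prototype element."""
    assert len(individual) == len(prototype)  # both of length R = m * n
    return sum((1 if h == s else 0) / (m * n)
               for h, s in zip(individual, prototype))
```

With R = m × n the fitness lies in [0, 1], reaching 1 only when every gene matches the prototype, which is what drives selection toward the correct symbol class.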
3.5 Semantic analysis and MIDI representation
After all the stages of the OMR system are complete (see Figure 6), the recognized symbols can be used to write the score in different data formats, such as MIDI, Nyquist, MusicXML, WEDELMUSIC, MPEG-SMR, the notation interchange file format (NIFF), and the standard music description language (SMDL). Although various representation formats are available, no standard exists for symbolic music representation on computers. However, MIDI is the most popular digital music format in modern China, analogous to the status of the GCN format in ancient China.
This work selected MIDI for music representation because the musical melody in a Kunqu Opera GCN score provides monophonic information. Thus, the symbols recognized from the GCN score can be represented by an array melody[L][2], with L representing the number of notes in the GCN score, the first dimension of the array representing the pitch of each note, and the second dimension representing its duration.
Finally, the note information in melody[L][2] can be transformed into a MIDI file using an associated coding language, such as Visual C++, and the MIDI file of the Kunqu Opera GCN score can then be disseminated globally via the Internet.
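Converting melody[L][2] into MIDI messages amounts to emitting a note-on/note-off pair per note, with the note-off delta time derived from the note's duration. A minimal sketch that produces event tuples rather than a binary MIDI file; the tick resolution and the assumption that durations are expressed in beats are ours:

```python
def melody_to_events(melody, ticks_per_beat=480):
    """Convert melody rows of (MIDI pitch, duration in beats) into a flat
    list of (delta_ticks, event_type, pitch) MIDI-style messages.

    Because a GCN melody is monophonic, each note's note_off simply
    precedes the next note's note_on; no overlapping voices arise.
    """
    events = []
    for pitch, beats in melody:
        events.append((0, 'note_on', pitch))                        # sound the note
        events.append((int(beats * ticks_per_beat), 'note_off', pitch))  # release after its duration
    return events
```

Writing these events into an actual Standard MIDI File then only requires serializing the deltas as variable-length quantities, which any MIDI library or a short routine in the chosen coding language can do.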