The OMR system for Kunqu Opera (KOMR), shown in Figure 6, is an offline optical recognition system comprising six independent stages: (1) image preprocessing, (2) document image segmentation, (3) feature extraction, (4) musical symbol recognition, (5) musical semantics, and (6) MIDI representation.
Because of the maturity of image digitizing technology [29], paper-based Kunqu Opera musical scores can easily be converted into digital images. Therefore, this paper does not discuss the image acquisition stage. The system processes a gray-level bitmap from a scanner or reads directly from image files; the input bitmap is generally 300 dpi with 256 gray levels.
3.1 Image preprocessing
Preprocessing may involve any of the standard image-processing operations, including noise removal, blurring, deskewing, contrast adjustment, sharpening, binarization, and morphology. Several operations may be necessary to prepare a raw input image for recognition, such as the selection of an area of interest, the elimination of non-musical elements, image binarization, and the correction of image rotation.
Many documents, particularly historical documents such as Kunqu Opera GCN scores, rely on careful preprocessing to ensure good overall system performance and, in some cases, to significantly improve recognition performance. In this work, we briefly touch upon the basic preprocessing operations applied to Kunqu Opera GCN scores: binarization, noise removal, skew correction, and the selection of an area of interest. Image binarization uses Otsu's algorithm [30], noise removal is performed with basic morphological operations [31], and image rotation is corrected using the least-squares criterion [32]. The area of interest in a Kunqu score is surrounded by a frame of wide lines (see Figure 3). The frame is a connected component containing the longest line in the score, so we use the Hough transform [33] to locate the longest line and then delete the connected component containing it, leaving the area of interest. Because these are common image-processing operations, we do not describe them in detail here.
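Otsu's method [30] chooses the binarization threshold that maximizes the between-class variance of the gray-level histogram. The following is only an illustrative sketch, operating on a flat list of 8-bit gray values; the function names and the dark-ink foreground convention are our assumptions, not details of the KOMR implementation:

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold maximizing between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_bg = 0.0          # running sum of gray values in the background class
    weight_bg = 0         # running count of background-class pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    # In a scanned score the ink (foreground) is dark, so dark pixels map to 1.
    return [1 if p <= threshold else 0 for p in pixels]
```

On a strongly bimodal histogram (ink versus paper), the chosen threshold falls between the two modes, which is the behavior the preprocessing stage relies on.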
3.2 Document segmentation and analysis of GCN scores
Document segmentation is a key phase in the KOMR system. With the symbols in a GCN score document having been classified, this stage first segments the document into two sections, one containing note symbols and the other non-note symbols. Non-note elements, such as the title, key signature, qupai, lyrics, noise, and the border lines of the textural framework, are then identified and removed.
Because music is a time-based art, the arrangement of the notes is one of its most important factors. Therefore, obtaining the arrangement of the notes in a GCN score is a prerequisite for document segmentation. Consistent with the writing style of the GCN score, the arrangement of the notes can be organized based on high-level field knowledge of Kunqu Opera scores.
Several document image segmentation methods have been proposed, with the best known being X-Y projection [34], the run-length smoothing algorithm (RLSA) [35], component grouping [36], scale-space analysis [37], Voronoi tessellation [38], and the Hough transform [39]. Among these, X-Y projection, RLSA, scale-space analysis, and the Hough transform are suitable for segmenting handwritten documents such as GCN scores [40].
A preliminary result of GCN score segmentation was presented in [41]. A self-adaptive RLSA was used to segment the image according to an X-axis function (denoted by PF(x)) indicating the number of foreground pixels in each column of the image. From this X-projection, the algorithm computes the number of flex points, i.e., the points that satisfy PF(x − 1) < PF(x) and PF(x) > PF(x + 1), or PF(x − 1) > PF(x) and PF(x) < PF(x + 1). Next, the algorithm iteratively smoothes the function and analyzes each smoothed function until the number of flex points in two successive functions is equal. Finally, the image is segmented into several sub-images based on the X-axis values of the flex points in the function. To extract notes from the image, all connected components are identified using a conventional connected-component labeling algorithm, and the minimum bounding box of each connected component is computed. The algorithm then matches each connected component to its corresponding sub-image. According to the experimental results in [41], the rate of correct line segmentation is 98.9% and the loss rate of notes is 2.7%; however, the total error rate over all notes and lyrics is almost 22%.
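The flex-point criterion on the X-projection can be sketched as follows. This is a minimal illustration with a simple 3-tap mean smoother standing in for the iterative smoothing of [41]; the function names are ours:

```python
def flex_points(pf):
    """Count local extrema (flex points) of the column-projection profile PF:
    PF(x-1) < PF(x) > PF(x+1)  or  PF(x-1) > PF(x) < PF(x+1)."""
    count = 0
    for x in range(1, len(pf) - 1):
        if (pf[x-1] < pf[x] > pf[x+1]) or (pf[x-1] > pf[x] < pf[x+1]):
            count += 1
    return count

def smooth(pf):
    """One smoothing pass (3-tap mean), as a stand-in for one iteration
    of the self-adaptive RLSA's smoothing step."""
    out = list(pf)
    for x in range(1, len(pf) - 1):
        out[x] = (pf[x-1] + pf[x] + pf[x+1]) / 3.0
    return out
```

Smoothing merges nearby extrema, so the flex-point count decreases monotonically over iterations; the algorithm of [41] stops when two successive passes agree.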
3.3 Symbol feature extraction
Selecting suitable features for pattern classes is a critical process in an OMR system. Feature representation is also a crucial step, because good feature data can effectively enhance the symbol recognition rate. The goal of feature extraction is to characterize a symbol to be recognized by measurements whose values are highly similar for symbols in the same category but different for symbols in different categories. The feature data must also be invariant under relevant transformations, such as translation, rotation, and scaling.
Popular feature extraction methods for OCR and Chinese character recognition include the peripheral feature [42], the cellular feature [43], and others [44]. Because symbols are written in different sizes in a GCN score, four types of structural features, based on [42, 43], have been used in this exploratory work to construct a simple and intuitive approach; these are suited to feature extraction of GCN score symbols and allow the recognition rates of KOMR to be compared. An n × m feature matrix is used for each symbol so that symbols of different sizes yield feature data of the same dimensions; it is obtained by segmenting the symbol with a grid [45]. In Figure 7, a sample symbol of size H × W is shown in subgraph (a), and an n × m grid is shown in subgraph (b). In this example, $h_0 = 0$, $h_n = H$, $h_j = \lfloor H/n \rfloor \times j$ for $0 < j < n$, and $w_0 = 0$, $w_m = W$, $w_i = \lfloor W/m \rfloor \times i$ for $0 < i < m$, and the four features of each symbol are given by the following equations:
$$T_1=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{x=w_{i-1}}^{w_i}\sum_{y=h_{j-1}}^{h_j}f(x,y),\quad 1\le i\le m,\ 1\le j\le n$$
(1)
$$T_2=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{x=w_{i-1}}^{w_i}f(x,h_j),\quad 1\le i\le m,\ 1\le j\le n$$
(2)
$$T_3=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\sum_{y=h_{j-1}}^{h_j}f(w_i,y),\quad 1\le i\le m,\ 1\le j\le n$$
(3)
$$T_4=\begin{bmatrix}\dots&\dots&\dots\\ \dots&t_{i,j}&\dots\\ \dots&\dots&\dots\end{bmatrix},\quad t_{i,j}=\begin{cases}0, & \text{for } 1<i<m \text{ and } 1<j<n\\ x, & \text{if } \sum_{k=0}^{x}f(k,h_j)=0 \text{ and } f(x+1,h_j)=1, \text{ for } i=1\\ W-x, & \text{if } \sum_{k=x}^{w_m}f(k,h_j)=0 \text{ and } f(x-1,h_j)=1, \text{ for } i=m\\ x, & \text{if } \sum_{k=0}^{x}f(w_i,k)=0 \text{ and } f(w_i,x+1)=1, \text{ for } j=1\\ H-x, & \text{if } \sum_{k=x}^{h_n}f(w_i,k)=0 \text{ and } f(w_i,x-1)=1, \text{ for } j=n\end{cases}$$
(4)
where f(x, y) is the value of the pixel at coordinates (x, y) in the image: f(x, y) = 1 indicates a foreground pixel and f(x, y) = 0 a background pixel. The T_{1} feature is the number of foreground pixels in each cell of the grid (a cellular feature), T_{2} is the number of foreground pixels falling on the horizontal grid line h_{j} of each cell, T_{3} is the number of foreground pixels falling on the vertical grid line w_{i} of each cell, and T_{4} measures, for the border cells, the number of background pixels from the edge of the grid to the first foreground pixel. If the edge pixel itself is a foreground pixel, the corresponding t_{i,j} = 0; and if there is no foreground pixel in the image, then t_{i,j} = 0 for 1 ≤ i ≤ m and 1 ≤ j ≤ n.
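As an illustration, the cellular feature T_{1} of Equation 1 can be computed as below. This sketch uses half-open cell ranges so that grid-line pixels are not counted twice, a slight simplification of the inclusive bounds in Equation 1; the representation of the image as a list of 0/1 rows is our assumption:

```python
def cellular_feature(img, n, m):
    """T1: count foreground pixels in each cell of an n x m grid.

    img is a binary image as a list of rows: img[y][x] in {0, 1},
    with height H = len(img) and width W = len(img[0]).
    Returns an m x n matrix t with t[i-1][j-1] = pixel count of cell (i, j).
    """
    H, W = len(img), len(img[0])
    # Grid lines: h_0 = 0, h_n = H, h_j = floor(H/n) * j (and similarly for w).
    h = [H if j == n else (H // n) * j for j in range(n + 1)]
    w = [W if i == m else (W // m) * i for i in range(m + 1)]
    t = [[0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t[i-1][j-1] = sum(img[y][x]
                              for y in range(h[j-1], h[j])
                              for x in range(w[i-1], w[i]))
    return t
```

Because each cell count is divided out of a fixed grid, symbols of different sizes produce feature matrices of identical dimensions, which is the point of the gridding step.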
3.4 Musical symbol recognition
Computer recognition of handwritten characters has been intensely researched for many years. Optical character recognition (OCR) is an active field, particularly for handwritten documents in scripts such as Roman [45], Arabic [46], Chinese [47], and Indian [48].
Several Chinese character recognition methods have been proposed, with the best known being the transformation-invariant matching algorithm [49], adaptive confidence transform-based classifier combination [50], probabilistic neural networks [51], radical decomposition [52], statistical character structure modeling [53], Markov random fields [54], and affine sparse matrix factorization [55].
The basic symbols for musical pitch (see Figure 2) in a Kunqu Opera GCN score are Chinese characters, but other musical pitch symbols and all rhythm symbols are specialized symbols for GCN. Thus, the methods of Kunqu Opera GCN score recognition refer to the techniques of both OMR and OCR. In this paper, the following three approaches to musical symbol recognition are compared.
3.4.1 K-nearest neighbor
The K-nearest neighbor (KNN) classifier is one of the simplest machine learning algorithms. It classifies a sample based on the closest training examples in the feature space and is a type of instance-based, or lazy, learning in which the function is only approximated locally and all computation is deferred until classification [56]. The neighbors are selected from a set of samples for which the correct classification is known; this set can be considered the training set for the algorithm, though no explicit training step is required.
In the experiment described in this paper, we choose half of the test musical symbols as training samples; for computational convenience, we set K = 1. The feature data of each class are obtained by averaging the feature data of all its samples. Although the distance function can also be learned during training, here the distance functions corresponding to the four features above are given by the following equations:
$$f(S,T)_1=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } s_{i,j}=t_{i,j}, \text{ otherwise } \rho=0$$
(5)
$$f(S,T)_2=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}-t_{i,j}\right|}{m\times n}$$
(6)
$$f(S,T)_3=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}-t_{i,j}\right|}{m\times n}$$
(7)
$$f(S,T)_4=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } \alpha<\frac{t_{i,j}}{s_{i,j}}<\beta, \text{ otherwise } \rho=0$$
(8)
where f(S, T)_{1–4} is the distance function for the corresponding feature 1–4, s_{i,j} is an element of the feature matrix of the prototype class (S), and t_{i,j} is an element of the feature matrix of the test sample (T). In Equation 5, if s_{i,j} = t_{i,j}, 1 ≤ i ≤ m, 1 ≤ j ≤ n, then ρ = 1; otherwise, ρ = 0, so f(S, T)_1 counts the number of unequal elements in the feature matrices S and T. In Equations 6 and 7, f(S, T) is the sum of the absolute differences between all corresponding elements of S and T. Equation 8 counts the number of non-approximate elements in S and T, using the parameters α and β as empirical values. In this work, α = 0.9 and β = 1.1.
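A minimal sketch of the 1-NN classification step using the feature-1 distance of Equation 5; the dictionary-of-prototypes layout and the function names are our assumptions for illustration:

```python
def dist_feature1(S, T):
    """Equation 5: fraction of grid cells where prototype S and sample T differ."""
    m, n = len(S), len(S[0])
    equal = sum(1 for i in range(m) for j in range(n) if S[i][j] == T[i][j])
    return (m * n - equal) / (m * n)

def knn_classify(prototypes, T):
    """1-NN (K = 1): return the label of the prototype class closest to T.

    prototypes maps a class label to that class's averaged feature matrix,
    mirroring the per-class averaging described above.
    """
    return min(prototypes, key=lambda label: dist_feature1(prototypes[label], T))
```

With K = 1 and one averaged feature matrix per class, classification reduces to a single nearest-prototype lookup, which is why no explicit training step is needed.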
3.4.2 Bayesian decision theory
Bayesian decision theory underlies many statistics-based methods. Classifiers based on it are simple probabilistic classifiers that apply Bayes' rule with conditional independence assumptions, providing a straightforward approach to discriminative classification learning [57].
In this work, the conditional probabilities P(T|S_{k})_{r}, 1 ≤ r ≤ 4, for each of the four features of the Bayesian decision theory classifier are calculated as follows:
$$P(T|S_k)_1=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } s_{i,j}^{k}=t_{i,j}, \text{ otherwise } \rho=0$$
(9)
$$P(T|S_k)_2=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}^{k}-t_{i,j}\right|}{m\times n}$$
(10)
$$P(T|S_k)_3=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left|s_{i,j}^{k}-t_{i,j}\right|}{m\times n}$$
(11)
$$P(T|S_k)_4=\frac{m\times n-\sum_{i=1}^{m}\sum_{j=1}^{n}\rho}{m\times n},\quad \rho=1 \text{ if } \alpha<\frac{t_{i,j}}{s_{i,j}^{k}}<\beta, \text{ otherwise } \rho=0$$
(12)
where P(T|S_{k})_{1–4} is the conditional probability for the corresponding feature 1–4, S = {S_{1}, …, S_{k}, …, S_{c}} is the set of prototype classes, $s_{i,j}^{k}$ is an element of the feature matrix of the k-th class S_{k} in the set of prototype classes, and t_{i,j} is an element of the feature matrix of the test sample (T). In Equation 9, if $s_{i,j}^{k}$ = t_{i,j}, then ρ = 1; otherwise, ρ = 0. In Equation 12, α and β are again empirical values, and in this work, we set α = 0.9 and β = 1.1.
In the experiment, the prior probabilities P(S_{k}), 1 ≤ k ≤ c, of the different symbols S_{k} are not equal; all prior probabilities come from statistics of the training dataset. For example, the beat symbol ‘’ has the prior probability 0.354511. If P(S_{k}|T) = max_{j} P(S_{j}|T), then T is classified as S_{k}. Bayes' rule,
$$P(S_k|T)=\frac{P(T|S_k)_r\,P(S_k)}{\sum_{i=1}^{c}P(T|S_i)_r\,P(S_i)},\quad 1\le r\le 4,$$
gives four expressions for estimating P(S_{k}|T), one for each of the four features.
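The posterior computation can be sketched as follows, given per-class likelihoods P(T|S_k)_r and priors P(S_k). This is a generic Bayes-rule illustration under one feature r, not the KOMR code; the dictionary layout and names are our assumptions:

```python
def bayes_posterior(likelihoods, priors):
    """Posterior P(S_k|T) via Bayes' rule from per-class likelihoods
    P(T|S_k)_r and priors P(S_k), keyed by class label."""
    evidence = sum(likelihoods[k] * priors[k] for k in priors)
    return {k: likelihoods[k] * priors[k] / evidence for k in priors}

def bayes_classify(likelihoods, priors):
    """Classify T as the class S_k maximizing the posterior P(S_k|T)."""
    post = bayes_posterior(likelihoods, priors)
    return max(post, key=post.get)
```

Note that the evidence term in the denominator is the same for every class, so for classification alone it could be dropped; it is kept here so the posteriors sum to 1.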
3.4.3 Genetic algorithm
In the field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. GAs generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. In GAs, the search space parameters are encoded as strings, and a collection of strings constitutes a population. The processes of selection, crossover, and mutation continue for a fixed number of generations or until some condition is satisfied. GAs have been applied in such fields as image processing, neural networks, and machine learning [58].
In this work, the key GA parameter values are as follows:

- Selection-reproduction rate: p_{s} = 0.5
- Crossover rate: p_{c} = 0.6
- Mutation rate: p_{m} = 0.05
- Population class: C = 12
- Number of individuals in the population: N_{p} = 200
- Maximum iteration number: G = 300
An individual's fitness value is determined from the following fitness function:
$$F(I_u)=\sum_{v=1}^{R}f\left(e\left(h_{u,v},s_{u,v}\right)\right)=\sum_{v=1}^{R}f(\rho)=\sum_{v=1}^{R}\frac{\rho}{m\times n},\quad \rho=1 \text{ if } h_{u,v}=s_{u,v}, \text{ otherwise } \rho=0$$
(13)
where I_{u} is the u-th individual, F(I_{u}) is the fitness value of I_{u}, R is the number of gene bits in I_{u}, h_{u,v} is the v-th gene of I_{u}, s_{u,v} is the corresponding element of the feature matrix of the prototype classes, the function e() computes the comparison between h_{u,v} and s_{u,v}, and m and n are the width and length, respectively, of a symbol's grid.
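The fitness function of Equation 13 can be sketched as below, with an individual's genes and the prototype feature matrix both flattened to sequences of length R = m × n (an assumption made here for illustration; the names are ours):

```python
def fitness(individual, prototype, m, n):
    """Equation 13: sum rho / (m * n) over the R gene bits, where
    rho = 1 when a gene matches the corresponding prototype element."""
    assert len(individual) == len(prototype)  # both of length R = m * n
    return sum((1 if h == s else 0) / (m * n)
               for h, s in zip(individual, prototype))
```

With R = m × n the fitness lies in [0, 1], reaching 1 only when every gene matches the prototype, which is what drives selection toward the correct symbol class.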
3.5 Semantic analysis and MIDI representation
After all the stages of the OMR system are complete (see Figure 6), the recognized symbols can be used to write the score in different data formats, such as MIDI, Nyquist, MusicXML, WEDELMUSIC, MPEG-SMR, the notation interchange file format (NIFF), and the standard music description language (SMDL). Although various representation formats are available, no standard exists for symbolic music representation on computers. However, MIDI is the most popular digital music format in modern China, analogous to the status of the GCN format in ancient China.
This work selected MIDI for music representation because the musical melody in a Kunqu Opera GCN score provides monophonic information. Thus, the symbols recognized from the GCN score can be represented by an array melody[L][2], with L representing the number of notes in the GCN score, the first dimension of the array representing the pitch of each note, and the second dimension representing its duration.
Finally, the note information in melody[L][2] can be transformed into a MIDI file using an associated coding language, such as Visual C++, and the MIDI file of the Kunqu Opera GCN score can then be disseminated globally via the Internet.
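Converting melody[L][2] into MIDI messages amounts to emitting a note-on/note-off pair per note, with the note-off delta time derived from the note's duration. A minimal sketch that produces event tuples rather than a binary MIDI file; the tick resolution and the assumption that durations are expressed in beats are ours:

```python
def melody_to_events(melody, ticks_per_beat=480):
    """Convert melody rows of (MIDI pitch, duration in beats) into a flat
    list of (delta_ticks, event_type, pitch) MIDI-style messages.

    Because a GCN melody is monophonic, each note's note_off simply
    precedes the next note's note_on; no overlapping voices arise.
    """
    events = []
    for pitch, beats in melody:
        events.append((0, 'note_on', pitch))                        # sound the note
        events.append((int(beats * ticks_per_beat), 'note_off', pitch))  # release after its duration
    return events
```

Writing these events into an actual Standard MIDI File then only requires serializing the deltas as variable-length quantities, which any MIDI library or a short routine in the chosen coding language can do.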