2.1 Psim matrix
Traditional similarity measures (e.g., the Euclidean distance, Jaccard coefficient [12], and Pearson coefficient [12]) fail in high-dimensional space, where points tend to be nearly equidistant from one another; hence, the calculated distance does not reflect the real dissimilarity. To solve this problem, the Hsim function [13] was proposed; however, it considers neither the relative differences nor the noise distribution. The Gsim function [14] analyzes the relative differences of the properties in different dimensions but ignores the weight discrepancy. The Close function [15] reduces the influence of components in dimensions with large variances; however, it does not consider the relative differences and is affected by noise. The Esim function [16] improves on the Hsim and Close functions by considering the influence of each property on the similarity, and its component in every dimension is positively correlated with the overall similarity. Esim divides the dimensions into normal and noisy ones, noise being the main ingredient of a noisy dimension; when the noise is comparable to or larger than the signal in a normal dimension, Esim becomes invalid. The secondary measurement method [17] calculates the similarity by considering the property distribution, space distance, etc.; however, it takes neither the noise distribution nor the weights into account, and its formula is complicated and slow to evaluate. In high-dimensional space, even similar data can differ greatly in certain dimensions [10]. These differences occupy a large portion of the similarity calculation; hence, the calculated results for all point pairs become similar. The Psim function [10] was therefore proposed to diminish the influence of noise on the similarity data; experimental results indicate that this method is suitable.
When using the Psim function to measure the similarity, the data components in every dimension must be sorted and the value range divided into several intervals. The similarity between X and Y in the j-th dimension contributes to the Psim value if and only if their data components lie in the same interval.
In an n-dimensional space, the Psim value between X and Y is as follows:
$$ \mathrm{Psim}\left(X,Y\right)=\sum \limits_{j\in {D}_s\left(X,Y\right)}\left(1-\frac{\left|{X}_j-{Y}_j\right|}{l_j-{u}_j}\right)\frac{\left|{D}_s\left(X,Y\right)\right|}{n} \tag{1} $$
where $X_j$ and $Y_j$ are the data components of X and Y in the j-th dimension, $D_s(X, Y)$ is the set of subscripts j for which $X_j$ and $Y_j$ lie in the same interval $[u_j, l_j]$, and $|D_s(X, Y)|$ is the number of elements in $D_s(X, Y)$. The above is an outline of the Psim function; a detailed introduction can be found in [10].
Data organization is critical in a clustering algorithm. In traditional methods, the data space is partitioned using an index tree and mapped onto the index-tree nodes. Commonly used index trees include the R tree [18], cR tree [19], VP tree [20], M tree [21], and SA tree [22]. Partitioning the data space is the foundation for building an index tree, but its complexity grows with the dimensionality; thus, it is difficult to build index trees for high-dimensional data. In addition, the retrieval efficiency of an index tree falls sharply as the dimensionality increases: retrieval works effectively when the dimensionality is less than 16, but it degrades rapidly for dimensionalities greater than 16, even down to the level of a linear search [23]. A sequential Psim matrix is used to solve this problem. First, all the Psim values between the points $S_1, S_2, \cdots, S_M$ are calculated to build a Psim matrix, PsimMat, of size M × M. PsimMat(i, t) is composed of three properties: i, t, and Psim($S_i$, $S_t$). Next, the sequential Psim matrix, SortPsimMat, is generated by sorting the elements in every row of PsimMat in descending order of the Psim value. The elements in the i-th row represent the similarities between $S_i$ and the other points; from left to right, the Psim values gradually diminish, indicating decreasing similarity. The sequential Psim matrix is not affected by the dimensionality and can represent the similarity distribution of all the points; therefore, it is suitable for high-dimensional data clustering.
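A brief sketch of this construction, reusing the `psim` sketch above; each entry carries the three properties (i, t, Psim($S_i$, $S_t$)), and every row is then sorted by descending Psim value:

```python
def build_sorted_psim_matrix(S, edges):
    """Build PsimMat and SortPsimMat for the points S_1, ..., S_M.

    S is an (M, n) array of points; edges is as defined for psim().
    """
    M = len(S)
    # PsimMat(i, t) keeps the three properties (i, t, Psim(S_i, S_t))
    psim_mat = [[(i, t, psim(S[i], S[t], edges)) for t in range(M)]
                for i in range(M)]
    # sort every row in descending order of the Psim value
    sort_psim_mat = [sorted(row, key=lambda e: e[2], reverse=True)
                     for row in psim_mat]
    return psim_mat, sort_psim_mat
```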
2.2 Differential truncation
The elements in every row of SortPsimMat are regarded as a sequence, A, of length M. The sequential Psim differential matrix, DeltaPsimMat, is generated by applying a differential operation to the sequence A; its size is M × (M − 1). The elements in the i-th row of SortPsimMat represent the similarities between $S_i$ and the other points. The points corresponding to the leading elements of this row, from the left, form a cluster centered at $S_i$, because the similarity between the elements inside the cluster is higher than that of those outside; consequently, the similarity differences between the elements inside the cluster are smaller than the others. Assuming that the cluster centered at $S_i$ has $p_i$ elements, the leftmost $p_i - 1$ elements in the i-th row of DeltaPsimMat are less than the differential threshold, $\Delta A_{\max}$. Thus, a reasonable $\Delta A_{\max}$ is set, and all the elements less than $\Delta A_{\max}$ in the i-th row of DeltaPsimMat are found to form the cluster centered at $S_i$.
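The truncation can be sketched as follows. It assumes each row of SortPsimMat begins with $S_i$ itself (the largest Psim value) and, as one reasonable reading of the rule above, ends the cluster at the first differential that reaches the threshold `delta_max` ($\Delta A_{\max}$):

```python
def differential_truncation(sort_psim_mat, delta_max):
    """Cut each row of SortPsimMat into a cluster centered at S_i.

    Rows are sorted in descending order of Psim, so the consecutive
    differences (the i-th row of DeltaPsimMat) are non-negative; the
    cluster keeps the leading points while those differences stay
    below the threshold delta_max.
    """
    clusters = []
    for row in sort_psim_mat:
        members = [row[0][1]]                  # S_i itself heads the row
        for k in range(len(row) - 1):
            delta = row[k][2] - row[k + 1][2]  # element of DeltaPsimMat
            if delta >= delta_max:
                break                          # first large gap ends the cluster
            members.append(row[k + 1][1])
        clusters.append(set(members))
    return clusters
```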
2.3 Tabu Search
After differential truncation, the intersections of some of the clusters may not be empty; the overlapping elements should therefore be eliminated by refinement. The clusters to be refined are called the imminent-refining cluster set, whose initial value is the set of clusters after differential truncation. The clusters that have already been refined form the refined cluster set, whose initial value is empty. The refinement of the clusters is an iterative process. Taking the average Psim of the remaining elements in the i-th row of SortPsimMat after differential truncation as the similarity of the cluster centered at $S_i$, the operation in every iteration is as follows: first, the similarity of every cluster is calculated; next, the cluster with the highest similarity is added to the refined cluster set, and any element of another cluster is deleted if it also belongs to the selected cluster. However, a problem arises: after the overlapping elements are deleted, the similarity of a cluster in the imminent-refining cluster set may become greater than that of the selected cluster. To solve this problem, Tabu Search is used for refinement.
Tabu Search is an expansion of the neighborhood search and a global optimization algorithm [11], mainly used for combinatorial optimization. A roundabout search can be avoided using the Tabu rule and the aspiration criterion, which improves the global search efficiency. The method can accept an inferior solution and thus has a strong “climbing” ability, giving it a higher probability of obtaining the globally optimal solution.
The main process of Tabu Search is as follows. Initially, a random solution is taken as the current solution, and several neighboring solutions are considered as candidate solutions. If the objective function value of a certain candidate solution satisfies the aspiration criterion, the current solution is replaced by this candidate, which is then added to the Tabu list. Otherwise, the best non-Tabu candidate is taken as the new current solution, and the corresponding object is added to the Tabu list [24]. These steps are repeated until the termination criterion is satisfied.
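As a hedged illustration (a generic skeleton, not the refinement procedure itself), the sketch below maximizes an arbitrary objective; the aspiration criterion assumed here is the common one of accepting a Tabu candidate that beats the best solution found so far:

```python
def tabu_search(initial, neighbours, objective, tabu_len, max_iters=100):
    """Generic Tabu Search skeleton following the steps above.

    neighbours(s) returns candidate solutions near s; objective(s) is
    the value to maximize (both are caller-supplied assumptions).
    """
    current = best = initial
    tabu = []                                  # FIFO Tabu list
    for _ in range(max_iters):
        chosen = None
        for cand in sorted(neighbours(current), key=objective, reverse=True):
            if cand not in tabu:               # best non-Tabu candidate
                chosen = cand
                break
            if objective(cand) > objective(best):
                chosen = cand                  # aspiration criterion
                break
        if chosen is None:
            break                              # every candidate is Tabu
        current = chosen
        tabu.append(chosen)
        if len(tabu) > tabu_len:
            tabu.pop(0)                        # expire the oldest Tabu object
        if objective(current) > objective(best):
            best = current
    return best
```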
To use Tabu Search for refining the clusters, an appropriate Tabu object, Tabu list, aspiration criterion, and termination criterion are required. The Tabu objects are the elements of the refined cluster set; they are saved in the Tabu list to prevent the Tabu Search from falling into a local optimum. The Tabu length is set to the number of clusters after differential truncation. In every iteration, the selected cluster is taken as the Tabu object. If, after the overlapping elements are eliminated, a cluster's similarity is higher than that of the previously selected cluster, it is considered a better cluster and replaces the previously selected cluster; the previously selected cluster is removed from the Tabu list and returned to the imminent-refining cluster set. This “eliminate the overlapping elements, then search for a better cluster” process is repeated until no better cluster can be found, at which point the currently selected cluster is taken as the optimal cluster of this iteration. The search for the better cluster of the next iteration then commences, until the imminent-refining cluster set is empty.
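The sketch below assembles this refinement under stated assumptions: `clusters` maps each center i to the member set produced by differential truncation, `psim_of(i, t)` returns Psim($S_i$, $S_t$) (both names are hypothetical), and a cluster's similarity is the average Psim between its center and its current members. The bookkeeping for restoring elements that were eliminated before a cluster returns to the imminent-refining set is deliberately simplified:

```python
def refine_clusters(clusters, psim_of):
    """Refine overlapping clusters with the Tabu scheme described above."""
    def similarity(center, members):
        others = [t for t in members if t != center]
        if not others:
            return 0.0
        return sum(psim_of(center, t) for t in others) / len(others)

    imminent = {c: set(m) for c, m in clusters.items()}
    refined = {}
    tabu = []                                   # Tabu list of selected centers
    while imminent:
        # select the cluster with the highest similarity
        selected = max(imminent, key=lambda c: similarity(c, imminent[c]))
        members = imminent.pop(selected)
        tabu.append(selected)
        while True:
            # eliminate the overlapping elements from the other clusters
            for c in imminent:
                imminent[c] -= members
            # search for a better non-Tabu cluster after elimination
            better = next((c for c, m in imminent.items()
                           if c not in tabu
                           and similarity(c, m) > similarity(selected, members)),
                          None)
            if better is None:
                break                           # selected is this iteration's optimum
            # return the previous selection to the imminent-refining set
            tabu.remove(selected)
            imminent[selected] = members
            selected, members = better, imminent.pop(better)
            tabu.append(selected)
        refined[selected] = members             # optimal cluster of this iteration
    return refined
```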