TY - GEN
T1 - Efficient subsequence matching using the longest common subsequence with a dual match index
AU - Han, Tae Sik
AU - Ko, Seung Kyu
AU - Kang, Jaewoo
PY - 2007
Y1 - 2007
N2 - The purpose of subsequence matching is to find a query sequence from a long data sequence. Due to the abundance of applications, many solutions have been proposed. Virtually all previous solutions use the Euclidean measure as the basis for measuring distance between sequences. Recent studies, however, suggest that the Euclidean distance often fails to produce proper results due to the irregularity in the data, which is not so uncommon in our problem domain. Addressing this problem, some non-Euclidean measures, such as Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), have been proposed. However, most of the previous work in this direction focused on the whole sequence matching problem where query and data sequences are the same length. In this paper, we propose a novel subsequence matching framework using a non-Euclidean measure, in particular, LCS, and a new index query scheme. The proposed framework is based on the Dual Match framework where data sequences are divided into a series of disjoint equi-length subsequences and then indexed in an R-tree. We introduced similarity bound for index matching with LCS. The proposed query matching scheme reduces significant numbers of false positives in the match result. Furthermore, we developed an algorithm to skip expensive LCS computations through observing the warping paths. We validated our framework through extensive experiments using 48 different time series datasets. The results of the experiments suggest that our approach significantly improves the subsequence matching performance in various metrics.
AB - The purpose of subsequence matching is to find a query sequence from a long data sequence. Due to the abundance of applications, many solutions have been proposed. Virtually all previous solutions use the Euclidean measure as the basis for measuring distance between sequences. Recent studies, however, suggest that the Euclidean distance often fails to produce proper results due to the irregularity in the data, which is not so uncommon in our problem domain. Addressing this problem, some non-Euclidean measures, such as Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), have been proposed. However, most of the previous work in this direction focused on the whole sequence matching problem where query and data sequences are the same length. In this paper, we propose a novel subsequence matching framework using a non-Euclidean measure, in particular, LCS, and a new index query scheme. The proposed framework is based on the Dual Match framework where data sequences are divided into a series of disjoint equi-length subsequences and then indexed in an R-tree. We introduced similarity bound for index matching with LCS. The proposed query matching scheme reduces significant numbers of false positives in the match result. Furthermore, we developed an algorithm to skip expensive LCS computations through observing the warping paths. We validated our framework through extensive experiments using 48 different time series datasets. The results of the experiments suggest that our approach significantly improves the subsequence matching performance in various metrics.
KW - Dual match
KW - Longest common subsequence
KW - Subsequence matching
UR - http://www.scopus.com/inward/record.url?scp=37249057768&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-73499-4_44
DO - 10.1007/978-3-540-73499-4_44
M3 - Conference contribution
AN - SCOPUS:37249057768
SN - 9783540734987
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 585
EP - 600
BT - Machine Learning and Data Mining in Pattern Recognition - 5th International Conference, MLDM 2007, Proceedings
PB - Springer Verlag
T2 - 5th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2007
Y2 - 18 July 2007 through 20 July 2007
ER -