Eamonn Keogh and Jessica Lin. "Clustering of time-series subsequences is meaningless: implications for previous and future research." *Knowledge and Information Systems*, Vol. 8, No. 2 (August 2004), pp. 154–177. doi:10.1007/s10115-004-0172-7

```bibtex
@article{keogh-and-lin-2005,
  author    = {Keogh, Eamonn and Lin, Jessica},
  title     = {Clustering of time-series subsequences is meaningless: implications for previous and future research},
  journal   = {Knowledge and Information Systems},
  volume    = {8},
  number    = {2},
  pages     = {154--177},
  month     = aug,
  day       = {31},
  year      = {2004},
  publisher = {Springer London},
  issn      = {0219-1377},
  doi       = {10.1007/s10115-004-0172-7},
  url       = {http://dx.doi.org/10.1007/s10115-004-0172-7},
  keywords  = {algorithmic, clustering, learning, som, time-series},
  abstract  = {Given the recent explosion of interest in streaming data and online algorithms, clustering of time-series subsequences, extracted via a sliding window, has received much attention. In this work, we make a surprising claim. Clustering of time-series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising because it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method that, based on the concept of time-series motifs, is able to meaningfully cluster subsequences on some time-series datasets.}
}
```


Keogh and Lin claim that what they call time-series subsequence clustering is meaningless. Concretely: clustering all fixed-length subsequences of a time series, extracted via a sliding window, yields essentially the same (kind of) clusters as clustering random data.

Intuitively, clustering time series subsequences is meaningless for two reasons:

- The element-wise mean of all subsequences of a sufficiently long time series is approximately a straight (horizontal) line, and the cluster centers found by $k$-means and related algorithms must average to this global mean. An algorithm can therefore only recover the *interesting* features of a time series if those features happen to average out to a straight line, which is rarely the case.
- Any meaningful clustering should place subsequences that start at nearby time points into the same cluster (Keogh and Lin call such pairs *trivial matches*). However, the similarity between nearby subsequences depends strongly on the local rate of change: a clustering algorithm will place cluster centers near subsequences with a low rate of change, i.e. typically near the *less* interesting parts of the series.
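The first point is easy to check directly. A minimal NumPy sketch (using a hypothetical sine-wave series, not data from the paper) shows that the element-wise mean of all sliding-window subsequences is nearly flat even though the series itself oscillates strongly:

```python
import numpy as np

# A toy periodic series: 20 full sine periods (hypothetical data).
ts = np.sin(np.linspace(0, 40 * np.pi, 1000))

w = 32  # sliding-window (subsequence) length
# One row per subsequence: shape (len(ts) - w + 1, w).
subs = np.lib.stride_tricks.sliding_window_view(ts, w)

# The element-wise mean of all subsequences is (almost) a flat line,
# even though the series itself has amplitude 1.
mean_sub = subs.mean(axis=0)
print(np.abs(mean_sub).max())  # tiny compared to the series amplitude
print(np.abs(ts).max())        # close to 1.0
```

Intuitively, each element of the mean subsequence is an average over almost the whole series, so all its elements are nearly equal — which is the constraint the cluster centers are forced to obey.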

Keogh and Lin propose focusing on finding motifs — approximately repeated subsequences — in time series instead of clustering all subsequences.