How to cluster time series in a dataframe using KNN/K-means

Date: 2023-01-23

Question

Suppose a dataframe which contains 1000 rows. Each row represents a time series.

Then I built a DTW algorithm to calculate the distance between 2 rows.

I don't know what to do next to accomplish an unsupervised classification task on the dataframe.

How can I label all rows of the dataframe?

Answer

Definitions

KNN algorithm = K-nearest-neighbour classification algorithm

K-means = centroid-based clustering algorithm

DTW = Dynamic Time Warping, a similarity-measurement algorithm for time series

Below I show step by step how the two time series can be built and how the Dynamic Time Warping (DTW) distance between them can be computed. You can then build an unsupervised k-means clustering with scikit-learn. Note that if you do not specify the number of centroids, scikit-learn does not infer it from the data: KMeans falls back to its default of n_clusters=8, so in practice you should choose n_clusters yourself.

Building the time-series and computing the DTW

Suppose you have two time series and compute the DTW between them as follows:

import pandas as pd
import numpy as np
import random
from dtw import dtw
from matplotlib.pyplot import plot
from matplotlib.pyplot import imshow
from matplotlib.pyplot import cm

from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer 
#About classification, read the tutorial
#http://scikit-learn.org/stable/tutorial/basic/tutorial.html


def createTs(myStart, myLength):
    index = pd.date_range(myStart, periods=myLength, freq='H')
    values = [random.random() for _ in range(myLength)]
    series = pd.Series(values, index=index)
    return series


#Time series of length 30, start from 1/1/2000 & 1/2/2000 so overlap
myStart='1/1/2000'
myLength=30
timeS1=createTs(myStart, myLength)
myStart='1/2/2000'
timeS2=createTs(myStart, myLength) 

#This could be your dataframe but unnecessary here
#myDF = pd.DataFrame({'data1': timeS1.values, 'data2': timeS2.values})

#Series.data was removed from pandas; .values gives the underlying array
x=[xxx*100 for xxx in sorted(timeS1.values)]
y=[xx for xx in timeS2.values]

choice="dtw"

if (choice=="timeseries"):
    print(timeS1)
    print(timeS2)
if (choice=="drawingPlots"):
    plot(x)
    plot(y)
if (choice=="dtw"):
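    #DTW with the 1st order norm as the local distance between two points
    #(assumes the `dtw` package whose dtw() returns dist, cost, acc, path)
    #For scalars the 1-norm is abs(a - b); the distance must depend on its arguments
    dist, cost, acc, path = dtw(x, y, dist=lambda a, b: abs(a - b))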
    imshow(acc.T, origin='lower', cmap=cm.gray, interpolation='nearest')
    plot(path[0], path[1], 'w')
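If the `dtw` package is not available, the accumulated-cost recursion behind DTW can be sketched in plain Python. This is a minimal sketch, not the package's implementation: the function name `dtw_distance` is hypothetical, the local cost is assumed to be the absolute difference between two points, and no warping window is applied.

```python
def dtw_distance(a, b):
    """DTW distance between two numeric sequences, with |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] holds the cost of the cheapest warping path
    # that aligns the first i points of a with the first j points of b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = cost + min(acc[i - 1][j],
                                   acc[i][j - 1],
                                   acc[i - 1][j - 1])
    return acc[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
offset = dtw_distance([0, 0], [1, 1])
```

Calling this function on every pair of rows gives the pairwise distances that the question refers to. The double loop is quadratic in the series length, so for 1000 rows a compiled implementation is likely to be preferable.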

Classification of the time-series with KNN

It is not clear from the question what should be labelled and with which labels, so please provide the following details:

What should we label in the dataframe? The path computed by the DTW algorithm?
Which type of labelling? Binary? Multiclass?

Once those details are known, we can choose a classification algorithm, which may well be KNN. KNN is supervised, so you need two separate data sets: a training set and a test set. The training set, whose labels are known, teaches the algorithm how to label a time series; the test set is held back to measure how well the model performs, using evaluation metrics such as AUC.
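The train/test idea can be sketched with a small k-nearest-neighbour classifier that uses DTW as its distance. Everything here is an assumption made for illustration: the labels "flat" and "wave", the toy data generator `make_series`, the helper names, and the choice of k=3. It reports plain accuracy on the test set rather than AUC to keep the sketch short, and it redefines a simple DTW helper so that it runs on its own.

```python
import math
import random
from collections import Counter


def dtw_distance(a, b):
    """DTW distance with |x - y| as local cost (simple quadratic version)."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def knn_predict(train_series, train_labels, query, k=3):
    """Label `query` by majority vote among its k DTW-nearest training series."""
    dists = sorted(
        (dtw_distance(s, query), label)
        for s, label in zip(train_series, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def make_series(kind, rng, length=30):
    # Hypothetical toy data: "wave" is a noisy sine with a random phase shift,
    # "flat" is noise around zero.
    if kind == "wave":
        shift = rng.uniform(0, 2)
        return [3 * math.sin(0.4 * t + shift) + rng.gauss(0, 0.1) for t in range(length)]
    return [rng.gauss(0, 0.1) for _ in range(length)]


rng = random.Random(0)
labels = ["flat", "wave"] * 20
series = [make_series(label, rng) for label in labels]

# Train on the first 30 labelled series, hold out the last 10 as the test set.
train_x, train_y = series[:30], labels[:30]
test_x, test_y = series[30:], labels[30:]

predicted = [knn_predict(train_x, train_y, s) for s in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
print("test accuracy:", accuracy)
```

The point of the split is that the accuracy is computed only on series the classifier never saw during training. With real data the labels would have to come from you, which is why the details requested above matter.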

A small puzzle, left open until the details asked for above are provided:

#PUZZLE
#from tutorial (#http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
newX = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
newY = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
newY = MultiLabelBinarizer().fit_transform(newY)
#Continue to the article.

A scikit-learn article comparing classifiers is listed as the second further-reading item.

Clustering with K-means (not the same as KNN)

K-means is a clustering algorithm, and it is unsupervised. You can use it like this:

#KMeans without an explicit n_clusters: scikit-learn falls back to its default of 8 clusters
myClusters=KMeans()
#myClusters.fit(YourDataHere)
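To make the labelling step concrete, here is a minimal sketch of clustering the rows of a dataframe with scikit-learn's `KMeans` and storing the resulting cluster label next to each row. The data is a made-up stand-in for the 1000-row frame (20 rows of length 30 in two obvious groups), and `n_clusters=2` is chosen by hand for that toy data. Note that `KMeans` works with Euclidean distance on the raw values, so this sketch does not use the DTW distance.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real frame: each row is one time series.
# Ten rows hover around 0 and ten rows hover around 5.
low = rng.normal(loc=0.0, scale=0.5, size=(10, 30))
high = rng.normal(loc=5.0, scale=0.5, size=(10, 30))
df = pd.DataFrame(np.vstack([low, high]))

# n_clusters is set explicitly; KMeans does not infer it from the data.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one integer cluster label per row.
df["cluster"] = model.fit_predict(df.to_numpy())

print(df["cluster"].value_counts())
```

The integer labels are arbitrary names for the clusters, so only the grouping carries meaning, not which group happens to be called 0 or 1.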

This is a very different algorithm from KNN: here we do not need any labels. The first item under further reading covers this topic in more detail.

Further reading

Does K-means incorporate the K-nearest-neighbour algorithm?
A comparison of classifiers in scikit-learn