Here is a Jupyter notebook I was using today to parse the classifications from the Steelpan Vibrations project. I'm leaving some of the notes here as a reminder to myself for the future. (I learned how to put the Jupyter notebook into the blog from this page.)
I really want to share this because, in all my reading on using DBSCAN for cluster analysis, I had a hard time finding any page online that described how the coordinates of the points identified in a cluster can be paired with the matching rows from the larger (original) data set. When I found the solution (see the link in the comments between cells below) it was really obvious, but it was painful not even knowing how to google for what I was looking for.
Function to do the cluster identification with DBSCAN:
In [31]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def dbscan(crds):
    bad_xy = []  # might need to change this
    X = np.array(crds)
    db = DBSCAN(eps=18, min_samples=3).fit(X)
    # Boolean mask that is True for the core samples.
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    # The label -1 marks noise, so it is not counted as a cluster.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    unique_labels = set(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'
        class_member_mask = (labels == k)
        # These are the definitely "good" xy values.
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        #print("\n Good? xy = ",xy)
        #print("X = ",X)
        # These are the "bad" xy values. Note that some maybe-bad and maybe-good are included here.
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
        #print("\n Bad? xy = ",xy)
        bad_xy.append(xy)
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.xlim(0, 512)
    plt.ylim(0, 384)
    # One array of points per cluster, in label order (noise excluded).
    clusters = [X[labels == i] for i in range(n_clusters_)]
    #print(clusters)
    #print(db.labels_)
    return clusters, labels
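To see what the function gets back from DBSCAN, here is a minimal sketch on made-up coordinates (the points below are invented for illustration and are not from the Steelpan Vibrations data). It uses the same `eps=18` and `min_samples=3` as above and shows that `labels_` has one entry per input row, in the same order, with `-1` for noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up coordinates: two tight groups plus one isolated point.
points = np.array([
    [100, 100], [105, 102], [98, 97], [103, 95],      # group near (100, 100)
    [300, 200], [304, 203], [297, 198], [301, 205],   # group near (300, 200)
    [450, 350],                                       # far from everything else
])

db = DBSCAN(eps=18, min_samples=3).fit(points)
labels = db.labels_

# One label per input row, in the same order as `points`.
# Noise points get the label -1 and are not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```

Because the labels come back in input order, `labels[i]` always describes `points[i]`, which is what makes the pairing with the original data possible later on.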
Import the classifications into a pandas DataFrame. I'm using header=None because there were no headings in the csv file:
In [32]:
import pandas as pd
df = pd.read_csv('averages-strike1.csv', sep=',', header=None)
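As a small illustration of what `header=None` does, the sketch below reads two invented rows from a string instead of a file. The values and file names in `csv_text` are placeholders I made up; only the seven-column layout is meant to resemble the real file.

```python
import io
import pandas as pd

# Two invented rows with seven columns and no header line.
csv_text = (
    "101.5,220.0,frame_0001.png,3,40.0,25.0,15.0\n"
    "250.0,90.5,frame_0002.png,5,38.0,22.0,-10.0\n"
)

toy = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None the first row is kept as data and the
# columns are simply numbered 0, 1, 2, ...
print(list(toy.columns))
print(toy.shape)
```

Those integer column names are why the code further down refers to `centers[0]`, `centers[3]` and so on until the columns are renamed.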
This is the main part of the code that ends up calling the dbscan function at the end:
In [34]:
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as col
import numpy as np
from sklearn.cluster import DBSCAN

# Map fringe values in the range 1 to 11 onto the gist_rainbow colormap.
cmap_1 = cm.ScalarMappable(norm=col.Normalize(1, 11), cmap=cm.gist_rainbow)

x_val = []
y_val = []
frng = []
crds = []
ell = []
for centers in df.values:
    x_val.append(centers[0])
    y_val.append(centers[1])
    frng.append(centers[3])
    crds.append([centers[0], centers[1]])
    ell.append(Ellipse(xy=[centers[0], centers[1]], width=centers[4], height=centers[5], angle=centers[6]))
centers_raw = {'XVal': x_val,
               'YVal': y_val,
               'Fringe': frng}
centers_df = pd.DataFrame(centers_raw, columns=['XVal', 'YVal', 'Fringe'])
plt.figure(0)
plt.scatter(centers_df.XVal, centers_df.YVal, s=20, c=cmap_1.to_rgba(centers_df.Fringe), alpha=.6)
plt.xlim(0, 512)
plt.ylim(0, 384)
#plt.title('Subject id = %s'%(coords_x[0][2]))
plt.show()
#print(crds)
plt.figure(1)
clusters, labels = dbscan(crds)
Check the first rows of the DataFrame, and then look at the cluster labels that DBSCAN returned:
In [30]:
df[:15]
Out[30]:
In [7]:
labels
Out[7]:
These next two lines are the magic that connects the clusters identified by DBSCAN with the original classifications, so that we can plot the fringe measurements for each cluster over time.
Finally figured this out by reading the question posted here: https://datascience.stackexchange.com/questions/29587/python-clustering-and-labels
In [8]:
cluster=pd.Series(labels)
df["cluster"] = cluster
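The reason this works is that DBSCAN returns one label per input row, in input order, and a freshly read DataFrame has the default index 0, 1, 2, ... that `pd.Series(labels)` also gets. Here is a rough sketch of that idea on an invented table (`toy`, `toy_labels` and `subset` are hypothetical names and values, not part of the notebook). It also shows what I would watch out for if the rows were filtered before clustering: a Series is matched by index, while a plain array is matched by position.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the classification table.
toy = pd.DataFrame({"x": [10, 12, 200, 202, 400],
                    "y": [5, 6, 90, 91, 300],
                    "fringe": [3, 3, 7, 8, 5]})
# Invented labels in the shape DBSCAN's labels_ has: one per row, in row order.
toy_labels = np.array([0, 0, 1, 1, -1])

# Same length and same order, so row i simply gets label i.
assert len(toy_labels) == len(toy)
toy["cluster"] = toy_labels

# If rows are filtered first, the index is no longer 0..n-1.
subset = toy[toy["fringe"] > 3].drop(columns="cluster")   # index is [2, 3, 4]
subset_labels = np.array([0, 0, -1])                      # labels for those 3 rows

# A plain array is assigned by position, which is what we want here.
by_position = subset.assign(cluster=subset_labels)

# A Series carries its own index [0, 1, 2] and is aligned on it,
# so rows end up with NaN or with a label meant for a different row.
by_index = subset.assign(cluster=pd.Series(subset_labels))

print(by_position)
print(by_index)
```

In the notebook above nothing is filtered between reading the CSV and clustering, so the Series version lines up correctly.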
Rename the DataFrame columns:
In [10]:
df = df.rename(columns={0: "x", 1: "y", 2: "filename", 3: "fringe", 4: "rx", 5: "ry", 6: "angle"})
Assign each cluster its own variable:
In [27]:
cluster0 = df[df['cluster']==0]
cluster1 = df[df['cluster']==1]
cluster2 = df[df['cluster']==2]
cluster3 = df[df['cluster']==3]
cluster4 = df[df['cluster']==4]
cluster5 = df[df['cluster']==5]
cluster6 = df[df['cluster']==6]
cluster7 = df[df['cluster']==7]
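Writing out one variable per cluster is fine when the number of clusters is known, but it changes whenever `eps` or `min_samples` changes. As an alternative sketch (not what the notebook does), `groupby` can build a dictionary keyed by cluster label. The table below is invented and `clusters_by_label` is a hypothetical name.

```python
import pandas as pd

# Invented table with a cluster column; -1 marks noise.
toy = pd.DataFrame({"fringe": [3, 4, 7, 8, 5],
                    "cluster": [0, 0, 1, 1, -1]})

# One sub-DataFrame per cluster label, skipping the noise label.
clusters_by_label = {
    label: group
    for label, group in toy.groupby("cluster")
    if label != -1
}

for label, group in clusters_by_label.items():
    print(label, len(group))
```

With this, `clusters_by_label[0]` plays the role of `cluster0`, and no edit is needed if DBSCAN finds more or fewer clusters.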
Make plots!!!
In [29]:
plt.scatter(cluster0.index, cluster0.fringe)
plt.show()
In [36]:
plt.scatter(cluster1.index, cluster1.fringe)
plt.show()
In [37]:
plt.scatter(cluster2.index, cluster2.fringe)
plt.show()
In [38]:
plt.scatter(cluster3.index, cluster3.fringe)
plt.show()
In [39]:
plt.scatter(cluster4.index, cluster4.fringe)
plt.show()
In [43]:
plt.scatter(cluster5.index, cluster5.fringe)
plt.show()
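The repeated plotting cells could also be collapsed into a loop. Below is a minimal sketch under a few assumptions: the DataFrame has `fringe` and `cluster` columns as above, the row index stands in for time, and `plot_fringe_by_cluster` is a helper name I made up. The data in `toy` is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_fringe_by_cluster(frame):
    """Scatter fringe against row index, one panel per cluster (noise excluded)."""
    groups = [(label, g) for label, g in frame.groupby("cluster") if label != -1]
    # squeeze=False keeps `axes` two-dimensional even with a single cluster.
    fig, axes = plt.subplots(len(groups), 1, sharex=True, squeeze=False)
    for ax, (label, g) in zip(axes[:, 0], groups):
        ax.scatter(g.index, g["fringe"])
        ax.set_title("cluster %d" % label)
    return fig

# Invented data: two clusters and one noise row.
toy = pd.DataFrame({"fringe": [3, 4, 7, 8, 5, 6],
                    "cluster": [0, 0, 1, 1, -1, 0]})
fig = plot_fringe_by_cluster(toy)
```

In a notebook the figure shows up at the end of the cell; in a script, call `plt.show()` afterwards.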