Spatial Clustering in C++ (post 2 of 5) – Running DBSCAN

In the last post in this series I explained how to get and read in some spatial data points for California. In this post we’ll run the DBSCAN algorithm to perform spatial clustering.

DBSCAN is a well-known algorithm that identifies clusters based on density. A major advantage of DBSCAN is that it can identify arbitrary shape objects (ie. it does not presuppose the size, shape, or number of clusters in the data, as some other methods do), and it removes noise during the clustering process. It’s also quite a simple algorithm to implement.

The two parameters controlling DBSCAN are eps and minPts. They work together to define how the clusters should “grow”: if more than minPts points are within a radius of eps of a point, then those neighbouring points are also included in the cluster; each point in the expanded cluster is then subjected to the same test.

Our kdtree data structure provides just what we need for the “neighbours” test with its kd_nearest_range function. However as a side note we should observe that our eps (radius) parameter here is expressed in latitude/longitude distance units – in a real application we would likely want to use a more natural unit such as metres or kilometres. (One can convert, but the conversion is dependent on latitude). One consequence of using lat/long distance, apart from the its non-intuitiveness, is that the radius will be slightly different depending on the latitude of the point.

So, we’ll now show the code to get the points in a given radius – the GetNodesInRadius function:

bool GetNodesInRadius(Node* pt, double dblRadius, int nMinPts, vector<Node*>& rgpNodesFound)

{

double pos[2] = {pt->GetLongitude(), pt->GetLatitude()};

kdres* res = kd_nearest_range(kdTree, pos, dblRadius);

int number_near = kd_res_size(res);

if (number_near >= nMinPts)

{

while (!kd_res_end(res))

{

Node* ptNear = (Node*)kd_res_item(res, pos);

rgpNodesFound.push_back(ptNear);

kd_res_next(res);

}

}

kd_res_free(res);

return number_near >= nMinPts;

}

And this is called by our DBSCAN code, which is as follows.

void ExpandCluster(vector<Node*>& rgp, int nCluster, double dblEpsilon, int nMinPts)

{

vector<Node*> rgpNeighbourhood;

for (int i = 0; i < (int)rgp.size(); i++)

if (!rgp[i]->m_fVisited)

{

rgp[i]->m_nCluster = nCluster;

rgp[i]->m_fVisited = true;

rgpNeighbourhood.push_back(rgp[i]);

}

for (int i = 0; i < (int)rgpNeighbourhood.size(); i++)

{

Node* pNodeNear = rgpNeighbourhood[i];

vector<Node*> rgpNeighbourhood2;

if (GetNodesInRadius(pNodeNear, dblEpsilon, nMinPts, rgpNeighbourhood2))

{

// append to rgpNeighbourhood items in rgpNeighbourhood2 that aren’t already in rgpNeighbourhood

for (int j = 0; j < (int)rgpNeighbourhood2.size(); j++)

{

Node* pNode = rgpNeighbourhood2[j];

if (!pNode->m_fVisited)

pNode->m_fVisited = true;

if (pNode->m_nCluster < 0)

{

pNode->m_nCluster = nCluster;

rgpNeighbourhood.push_back(pNode);

}

}

}

}

}

int RunDBScan()

{

double dblEpsilon = 0.0045;

int nCluster = 0, nMinPts = 8;

for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)

{

Node* pNode = *it;

if (!pNode->m_fVisited)

{

pNode->m_fVisited = true;

vector<Node*> rgpNeighbourhood;

if (GetNodesInRadius(pNode, dblEpsilon, nMinPts, rgpNeighbourhood))

{

pNode->m_nCluster = nCluster;

pNode->m_fVisited = true;

ExpandCluster(rgpNeighbourhood, nCluster, dblEpsilon, nMinPts);

nCluster++;

}

}

}

return nCluster;

}

Here I’ve implemented a very simple performance improvement over the DBSCAN pseudo-code as it appears in Wikipedia. I don’t add the point P’ to the current neighbourhood N unless it has no cluster already; this guarantees that subsequent iterations over N will only process unvisited points. The results are exactly the same but it runs 2-3 times faster (at least with this STL vector-based implementation).

Even so, this code uses a fairly naive implementation with STL vectors. We’ll see in the next blog post in this series how to refine this implementation and in the process make it run very fast.

With this example data set (California census block data) there tend to be many clusters, including many small ones (though, to be sure, there is at least one truly massive cluster – Los Angeles!). One could obviously cull out the smallest clusters after DBSCAN is run if that is appropriate for your application.

I’m writing out the cluster points to Mid/Mif files and importing them into Tab using MapInfoPro. The code to write the points is:

bool OpenMidMifAndWritePreamble(string strFileName, ofstream& osFileMid, ofstream& osFileMif)

{

osFileMid.open((strFileName + ".mid").c_str(), ios::trunc);

if (!osFileMid.is_open())

return false;

osFileMif.open((strFileName + ".mif").c_str(), ios::trunc);

if (!osFileMif.is_open())

return false;

osFileMid << setprecision(20);

osFileMif << "Version 300" << endl;

osFileMif << "Charset \"Neutral\"" << endl;

osFileMif << "Delimiter \",\"" << endl;

osFileMif << "CoordSys Earth Projection 1, 0 Bounds (-1000, -1000) (1000, 1000)" << endl;

osFileMif << "Columns 1" << endl;

osFileMif << "  Cluster Integer" << endl;

osFileMif << "Data" << endl << endl;

return true;

}

bool WritePoints(string strFileName)

{

ofstream osFileMid, osFileMif;

if (!OpenMidMifAndWritePreamble(strFileName, osFileMid, osFileMif))

return false;

for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)

{

osFileMid << (*it)->m_nCluster << endl;

osFileMif << “Point ” << (*it)->GetLongitude() << ” ” << (*it)->GetLatitude() << endl;

int c = (*it)->m_nCluster;

int rgb = (40 + ((c * 13) % 215)) * 255 * 255 + (40 + ((c * 4573) % 215)) * 255 + (40 + (c % 215));

int size = ((*it)->m_nCluster >= 0 ? 6 : 2);

osFileMif << ”  Symbol (34,” << ((*it)->m_nCluster >= 0xFFFFFF ? rgb : size) << “,” << 8 << “)” << endl;

}

osFileMif.close();

osFileMid.close();

return true;

}

Here I’ve used a cute little one-liner to generate unique colours for each cluster. Non-clustered points (“noise” in DBSCAN terms) are coloured white (with black outline) and drawn smaller than the clustered points.

The results with eps set to 0.0045 and minPts = 8 give 3706 clusters, and 488,761 out of the total of 710,146 points end up in a cluster. Zooming in on the San Francisco area looks like:

Changing eps to 0.009 gives larger clusters, as you would expect. This time there are 3473 clusters, and 592,652 points are clustered. San Francisco looks like:

In the next post I’ll show how to significantly speed up DBSCAN, and in further posts demonstrate how to get our output to look like this:

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published.