Spatial Clustering in C++ (Part 1 of 5)

Welcome to the first of a five-part blog post where I work through an end-to-end example of spatial data clustering using C++. I’ll be using a number of tools and techniques, including:

– processing spatial point data using MapInfoPro

– storing points in a kdtree data structure (which will enable fast lookup of nearby points)

– performing spatial clustering using an implementation of the DBSCAN algorithm

– creating Alpha Shapes (think “shrink-wrapped” polygons) to visualise the output

– using OGR for various miscellaneous geometry tasks

Aside from the commercial product MapInfoPro (which is optional for this), all of the tools and codes I use are free/open source with permissive licences. We’ll also see (particularly by blog post #3 in this series) that the combination of these tools means our spatial clustering can be blazingly fast to compute.

The simple version of spatial data clustering I’ll be doing here is to develop code that will automatically figure out groups (clusters) of densely connected points on a map. A typical application for this might be to automatically find towns or cities from a population map.

The eventual output we’ll get looks like the following (and I hope this whets your appetite to follow the entire 5 parts of this blog post):

The above picture shows each cluster found in the data, and coloured uniquely.

The above picture shows the output of running Alpha Shapes to obtain polygons which represent the clusters.

 

For this exercise I’ll be using a subset of the US Tiger spatial data. In this case I’m choosing California block data. (I did in fact find it very hard to find a freely available file of spatial data points – e.g. addresses for an entire country – to use as my sample data). To get this TIGER data, go to the TIGER download site and Select Block data, California, All Counties.

There are at least two easy ways to get points from this data. The first, and probably quickest (if you have MapInfoPro installed), is to open the block file in MapInfoPro, and go Query / SQL Select, and select CentroidX(obj), CentroidY(obj) from the table, then save this selection as a separate file and export it to MID/MIF format. The MID file will then contain all the centroid points. An alternative shown below is to use OGR-computed centroids. There are 710,146 points in all.

Now let’s write some C++ to read in these points, and store them in a 2-dimensional kdtree, as well as in an STL vector of Node objects. Our kdtree code here is provided by the excellent implementation hosted at Google Code (and consists of just two files, kdtree.h and kdtree.cpp). This code is compiled with Visual Studio but should be relatively cross-platform.

First we define a little helper class to store our points (which we’ll call Nodes). The Visited and Cluster members will be used later for DBSCAN.

#include

class Node : public std::pair<double, double>

{

public:

Node(double x, double y) : std::pair<double, double>(y,x)

{

m_fVisited = false;

m_nCluster = -1;

}

double GetLongitude() const {return second;}

double GetLatitude() const {return first;}

bool m_fVisited;

int m_nCluster;

};

Next is the main code that reads in the points and stores them in both a kdtree (so that we can get nearest nodes later) and a vector. I’ve included all the #includes that we’re going to need in the code of subsequent blog posts.

#include "kdtree.h"

#include

#include

#include

#include

#include

#include

#include "node.h"

using namespace std;

kdtree* kdTree;

vector<Node*> vecNodes;

bool ReadInPointsFromFile(string strFile)

{

ifstream in;

in.open(strFile.c_str());

if (!in.is_open())

return false;

kdTree = kd_create(2);

while (!in.eof())

{

double x, y;

char comma;

in >> x >> comma >> y;

Node* pt = new Node(x, y);

vecNodes.push_back(pt);

double xy[2] = {x, y};

kd_insert(kdTree, xy, pt);

}

return true;

}

int main(int argc, char* argv[])

{

ReadInPointsFromFile(“C:/Temp/CentroidPoints.MID”);

}

The alternative approach (mentioned above) for getting points from the block data – without needing MapInfoPro – is to read in the shape file data directly and use OGR library functions to calculate the polygon centroids. This code is shown below. You’ll need to have OGR installed for this – the easiest way is to install the FWTools binaries. You’ll then need to make sure your project includes the FWTools Include file, and links with the appropriate lib.

This approach is a little slower to run than using MapInfoPro. (If you just wanted a fast approximation of the block centroids, however, you could probably just take the centroid of the bounding box of the polygons as they are read in.)

bool ReadInPolygonsFromFile(string strFile)

{

kdTree = kd_create(2);

OGRRegisterAll();

OGRDataSource* poDS = OGRSFDriverRegistrar::Open(strFile.c_str(), false);

if (poDS == NULL || poDS->GetLayerCount() <= 0)

return false;

OGRLayer* poLayer = poDS->GetLayer(0);

OGRFeature* poFeature;

while ((poFeature = poLayer->GetNextFeature()) != NULL)

{

OGRGeometry* poGeometry = poFeature->GetGeometryRef();

if (poGeometry != NULL)

{

OGRwkbGeometryType nType = wkbFlatten(poGeometry->getGeometryType());

OGRFeatureDefn* poFDefn = poLayer->GetLayerDefn();

long lType = wkbFlatten(nType);

if (lType == wkbPolygon)

{

if (static_cast<OGRPolygon*>(poGeometry)->get_Area() != 0)

{

OGRPoint pt;

static_cast<OGRPolygon*>(poGeometry)->Centroid(&pt);

Node* node = new Node(pt.getX(), pt.getY());

vecNodes.push_back(node);

double xy[2] = {pt.getX(), pt.getY()};

kd_insert(kdTree, xy, node);

}

}

else if (lType == wkbMultiPolygon)

{

OGRMultiPolygon* pmpg = static_cast<OGRMultiPolygon*>(poGeometry->clone());

for (int nPoly = 0; nPoly < pmpg->getNumGeometries(); nPoly++)

{

OGRGeometry* poSubGeometry = pmpg->getGeometryRef(nPoly);

if (poSubGeometry != NULL && static_cast<OGRPolygon*>(poSubGeometry) != NULL)

{

if (static_cast<OGRPolygon*>(poSubGeometry)->get_Area() != 0)

{

OGRPoint pt;

static_cast<OGRPolygon*>(poSubGeometry)->Centroid(&pt);

Node* node = new Node(pt.getX(), pt.getY());

vecNodes.push_back(node);

double xy[2] = {pt.getX(), pt.getY()};

kd_insert(kdTree, xy, node);

}

}

}

}

else  //  maybe lType == wkbPoint or wkbLineString ?

return false;

}

OGRFeature::DestroyFeature(poFeature);

}

return false;

}

int main(int argc, char* argv[])

{

ReadInPolygonsFromFile("C:/Temp/tl_2010_06_tabblock10/tl_2010_06_tabblock10.shp");

}

Stay tuned for the next post, where we’ll run the DBSCAN algorithm and start to see some nice outputs!

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published.