
Cross Platform Development By David

I was involved in the development of a ‘simple’ application in C++ on Windows and wanted to get it working on multiple versions of Linux as well. By ‘simple’, I mean there is no Windows GUI and no links to complicated third-party libraries, so most of the C++ should port straight over to Linux. Below are a few tips and lessons learned while I went about this task.

VirtualBox and Code Repository

I was working on a Windows box and wanted to port to Ubuntu (a Debian flavour of Linux) and Oracle Enterprise Linux (a Red Hat flavour). Virtualisation is definitely your friend here: on my Windows box I ran virtual machines with Ubuntu and Oracle Enterprise Linux.

As we have a number of developers in our organisation working on a number of projects, with libraries shared across several applications, it is logical that we have a code repository. As I spend most of my time in a Windows environment and am more familiar with it, I prefer to do most of my editing in Windows (the application was originally written on Windows, and most of our applications don’t need to work on other platforms).

I could have chosen to set up the code repository (in our case Mercurial) on each of the Linux VMs. However, this would have been more time consuming, and I didn’t want to have to push code to the main shared repository and then pull onto the Linux VMs every time I wanted to test changes on other platforms, especially when the other platforms are all on the same computer. As a result, I set up shared folders in the VMs and pointed them at the copy of the repo I had in Windows. Now I could easily make changes in Windows and build and test them in all the environments.

Building and Testing

A couple of things to note: while Windows generally builds and runs quite happily without impacting what goes on in the Linux world, the two Linux worlds do impact each other, because they both use the same makefile and clean/build binaries with the same names. I found I had to be careful when building in the different Linux environments that the clean was done completely before the build started. The clean script, when run in Ubuntu, did not always clean/remove the binaries created in the Oracle Enterprise Linux environment. If the binaries didn’t get cleaned properly I got build errors (e.g. /usr/bin/ld: cannot open output file executable_filename: Operation not permitted) or a segmentation fault at run time.

Different versions of g++ in Ubuntu (4.5.2) and Oracle Enterprise Linux (4.1.2) also meant there were different compilation issues to deal with, but overall these were not too difficult to work through. Some of the issues revolved around simple compilation problems (e.g. g++ 4.1.2 was stricter about linking to libraries whose names started with the letters ‘lib’; 4.5.2 didn’t seem to mind as much).

Another main set of issues related to the third-party libraries we used (e.g. curl, Coin Osi). The Coin library problems were overcome by simply ensuring the source code was downloaded, built and installed on the required Linux platforms (i.e. no changes were required to the source code itself). Curl behaved a bit differently (i.e. it didn’t work on Oracle Enterprise Linux), but that was because of a difference in the way the g++ versions treated the addition of strings. Once I made a small change to the code it worked fine.
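
The post wouldn’t be complete without an illustration, so here is a hedged sketch of the kind of construct involved – the function and names are mine, not the original code – showing string concatenation written so it does not rely on implicit conversions that different g++ versions may treat differently:

#include <string>

// Illustrative only: write the concatenation so that operator+ resolves to
// the std::string overloads on every compiler, by making the left-hand
// operand an explicit std::string rather than a raw string literal.
std::string MakeUrl(const char* host, const char* path)
{
    return std::string("http://") + host + path;
}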

The end result was that I managed to have the same source code build and run in 3 different environments: Windows, Ubuntu and Oracle Enterprise Linux.

Spatial Clustering in C++ (post 5 of 5) – The final outputs

In this fifth and final post of the series on Spatial Clustering in C++, we’ll look at some of the final outputs produced and wrap up the discussion of the techniques and code used.

In the last post I presented some code that can be used to generate an Alpha Shape from a set of points, create it as an OGR Polygon, and output these polygons to MID/MIF files. I’ve since found that one of the functions used in that code, GetNextRing, is quite inefficient when the polygon has a complex perimeter (many points on its boundary), which happens with very small values of alpha. Still, it wouldn’t be too hard to speed that code up.
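
One possible speed-up (a sketch of an approach, not part of the original code): index the edges by endpoint up front, so that following a ring around the perimeter becomes a map lookup instead of a linear scan over the whole edge list:

#include <map>
#include <utility>
#include <vector>

// Hypothetical speed-up for GetNextRing: pre-index each edge by both of its
// endpoints, so the next edge around a ring is found with a map lookup.
typedef std::pair<int, int> Edge;

std::multimap<int, size_t> BuildEdgeIndex(const std::vector<Edge>& edges)
{
    std::multimap<int, size_t> index; // endpoint -> position in `edges`
    for (size_t i = 0; i < edges.size(); i++)
    {
        index.insert(std::make_pair(edges[i].first, i));
        index.insert(std::make_pair(edges[i].second, i));
    }
    return index;
}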

It’s interesting to generate some different Alpha Shapes by varying the alpha parameter. Recall that alpha corresponds to how tightly the polygon is “shrink wrapped” around the points it encloses; it is actually the radius of the circles that define the “bites” taken out around the perimeter of the shape. If alpha is too high, one gets a very approximate shape that starts to approach the convex hull of the points, while if alpha is too low, the shape gets very “spiky”, and eventually whole parts of the polygon are lost completely. The good news is that the polygons produced are not too visually dissimilar if you stay within reasonable bounds for alpha – that is, they are not overly sensitive to its value.

Let’s have a look with the San Francisco bay area as our example. Here I’m using a DBSCAN eps of 0.009 with minPts = 8, which gives the following clusters:

And for the Alpha Shapes, an alpha of 2500 gives the following:

In the next picture you can see how the shapes represent the points (note also the hole in the large green cluster – the Alpha Shapes code we’re using doesn’t allow for actual holes within the shapes produced):

Here I’ve used Bing Aerial in MapInfoPro to give the shapes produced some spatial context, and I’ve used 33% transparency on the polygon layer (use Layer Properties). Incidentally you can use Style Overrides to give patterned output as well, such as:

With alpha 5000, the results are very similar, though some of the “holes” along the polygon boundaries have now been “filled in”:

The colours of the shapes are different here just because of the funky unique RGB generator function I’m using which depends on the creation index of the polygon; the clusters are the same, only alpha is different.

Even with alpha = 20000, the shapes are still good visual approximations of the points:

With 50000 the shapes are still OK but some big holes have now been filled in:

And with alpha 5000000 one starts to get nearer to the convex hull:

With smaller values of alpha, such as 1000, the shapes are similar to alpha 2500. Below that, however, the Alpha Shapes algorithm starts to produce multiple, very spiky (many points defining a complex boundary) polygons for each cluster, such as this example with alpha 500:

These anemic shapes no longer approximate the points very well.

With larger data sets, just showing the clustered points gets slower, so you could potentially make a geoset and use zoom levels: show the polygons at higher zoom levels, then show the clustered points when zoomed in closer.

In terms of output, I’ve found Tab2Tab is a useful little program to convert between Mid/Mif and Tab formats, though it is not without its problems: older versions lose precision, and current versions appear to lose the colours of the polygons (regions, in the parlance). One could also write out the polygons or clusters using OGR functions directly in the desired output format. And if you don’t have MapInfoPro, QGIS is a good open source alternative.

If you’ve made it this far, then I hope you’ve enjoyed this five part blog post. I’ll leave you with a summary of the tools and algorithms that have been deployed in this exercise:
  • TIGER spatial data – source of free US spatial data. Used in this exercise as the data source of points for spatial clustering.
  • MapInfoPro – display and edit spatial data. Commercial product, but very good.
  • kdtree – data structure for storing 2-dimensional points. Allows fast lookup of nearby points. Good C++ implementation here.
  • OGR – library for computational geometry. Good utility functions for reading and writing points and polygons.
  • DBSCAN – algorithm for spatial clustering. Simple to implement; previous blog posts in this series show how to make this run really fast.
  • Convex Hull – shape containing a set of points. Not a good approximation to spatial clusters, but easy and fast to compute, e.g. with OGR functions.
  • Clarkson code for Alpha Shapes – old-style C code for producing Alpha Shapes (polygons representing sets of points). Slightly dodgy code that doesn’t produce polygons with holes inside, but runs OK if called as a black-box executable.
  • Tab2Tab – map data conversion utility. Older versions lose precision; newer versions lose polygon style/colour information.
  • QGIS (Quantum GIS) – open source alternative to MapInfoPro. Slower and not as feature-rich as MapInfoPro.

Spatial Clustering in C++ (post 4 of 5) – Generating Alpha Shapes

In previous posts we saw how to run the DBSCAN spatial clustering algorithm over a set of spatial data points: in our example, all block centroids in California. We also saw how to make it run really quickly. In this post we’ll investigate a way to turn those clusters into polygons. These polygons will represent our clusters in an approximate way, but will be fast to display and provide a visually compelling way to appraise the clusters we’ve created.

One way to get a polygon from a set of points is to form the Convex Hull. However, for our clusters, convex hulls, though easily generated using OGR’s ConvexHull() on a MultiPoint object, do not look so great – they cover large swathes of what we’d intuitively think of as the “shape” describing the points. An example from a real cluster produced by DBSCAN is shown below:

In this example, the light green coloured cluster would be badly represented by a convex hull (which is shown in pink):

Another example can be found in Part 1 of this blog series, for the largest cluster in the picture.
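
For reference, generating such a hull with OGR takes only a few lines; here is a minimal sketch, assuming the Node class and OGR geometry headers used elsewhere in this series:

#include <vector>
#include "ogr_geometry.h"
#include "node.h"

// Build a MultiPoint from a cluster's nodes and ask OGR for its convex hull.
// The caller owns (and must eventually delete) the returned geometry.
OGRGeometry* ClusterConvexHull(const std::vector<Node*>& pts)
{
    OGRMultiPoint mp;
    for (size_t i = 0; i < pts.size(); i++)
    {
        OGRPoint pt(pts[i]->GetLongitude(), pts[i]->GetLatitude());
        mp.addGeometry(&pt); // addGeometry copies the point
    }
    return mp.ConvexHull();
}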

Unfortunately there’s no unambiguous way to wrap a set of points in a non-convex polygon (that is, there’s no unique “bounding polygon” – think about it…). However, we can use something called Alpha Shapes, which have a parameter (alpha) describing how “concave” we’d like our shape to be (intuitively, how tightly to “shrink wrap” our points). Good descriptions of Alpha Shapes are provided here, here and here, and in the context of spatial data, here.

For the above example, with a suitably chosen alpha (in this case, 2500), the alpha shape looks like the following:

There’s not a lot of Alpha Shape C++ code around, and it’s not something you’d want to code up yourself unless you’re very interested in computational geometry and have lots of spare time. One source is Ken Clarkson’s code. This is old-style C code which I initially made the mistake of trying to integrate into my own code as a library, eschewing its command line interface. Unfortunately this meant making the code (which is full of global variables) re-entrant; it turns out the code contains subtle crashes and can generate unexpected results if repeatedly run over many inputs. So keeping it as a “black box” and calling it as an exe from my own code seems like the only option. It does work well that way, though it is somewhat slower, as it has to invoke the exe for every polygon to be created.

Another shortcoming of this particular code is that the alpha shapes produced do not include any “holes”.

For my alpha I’ve chosen 2500, and I’ve also used a multiplier in the command line to reduce the risk of numerical instability with points that are very close to each other (latitude/longitude coordinates going down to 6 or more decimal places).

Another minor issue to note: when your data includes more than one data item “stacked” at the same spatial point (for example, several points representing the distinct addresses within an apartment building), then in the worst case, if such stacked points are isolated, you can end up with a cluster that occupies only a single point and hence doesn’t get processed as a polygon.

For very large clusters (bigger than say, 40000 or so) one could potentially also reduce the time for the Alpha Shape calculation by putting a grid over the points and taking only a sample of points in each grid cell.
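
A sketch of that idea (illustrative only; it assumes the Node class from part 1, and the cell size would need tuning):

#include <cmath>
#include <map>
#include <utility>
#include <vector>
#include "node.h"

// Thin a large cluster by keeping one representative point per grid cell.
std::vector<Node*> SampleOnGrid(const std::vector<Node*>& pts, double cellSize)
{
    std::map<std::pair<long, long>, Node*> grid;
    for (size_t i = 0; i < pts.size(); i++)
    {
        long gx = (long)std::floor(pts[i]->GetLongitude() / cellSize);
        long gy = (long)std::floor(pts[i]->GetLatitude() / cellSize);
        grid[std::make_pair(gx, gy)] = pts[i]; // last point seen in the cell wins
    }
    std::vector<Node*> sample;
    for (std::map<std::pair<long, long>, Node*>::const_iterator it = grid.begin(); it != grid.end(); ++it)
        sample.push_back(it->second);
    return sample;
}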

In the next post we’ll see the results of running the Alpha Shapes code on our clusters. For now, here is the code that calls the Clarkson executable. Note that it contains a Windows-specific function, RunExeWithCommandLine, which is not cross-platform.

#include <string>
#include <vector>
#include <set>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <windows.h> // for SHELLEXECUTEINFO etc
#include "ogr_geometry.h" // OGR geometry types (OGRPolygon, OGRLinearRing, OGRPoint)
#include "node.h"

using namespace std;

bool RunExeWithCommandLine(wstring wstrExelocation, wstring wstrCommand)
{
    SHELLEXECUTEINFO lpExecInfo = {0}; // zero-initialise so unused members (e.g. hProcess) start as NULL
    lpExecInfo.cbSize = sizeof(SHELLEXECUTEINFO);
    lpExecInfo.lpFile = wstrExelocation.c_str();
    lpExecInfo.fMask = SEE_MASK_DOENVSUBST | SEE_MASK_NOCLOSEPROCESS;
    lpExecInfo.hwnd = NULL;
    lpExecInfo.lpVerb = L"open";
    lpExecInfo.lpParameters = wstrCommand.c_str();
    lpExecInfo.lpDirectory = NULL;
    lpExecInfo.nShow = SW_SHOWMINNOACTIVE;
    lpExecInfo.hInstApp = (HINSTANCE)SE_ERR_DDEFAIL;

    BOOL fSuccess = ShellExecuteEx(&lpExecInfo);
    if (lpExecInfo.hProcess != NULL)
    {
        // Wait for the spawned process to finish before returning
        ::WaitForSingleObject(lpExecInfo.hProcess, INFINITE);
        ::CloseHandle(lpExecInfo.hProcess);
    }
    return fSuccess == TRUE;
}

OGRLinearRing* GetNextRing(vector<pair<int, int> >& rgptsPoly, const vector<Node*>& rgpts, int first, set<int>& new_points)
{
    int rgpt_index = first;
    OGRLinearRing* ring = NULL;
    for (;;)
    {
        // Linear scan for an unused edge touching the current point index
        // (this is the inefficiency noted in the final post of the series)
        int from = -1, to = -1;
        for (vector<pair<int, int> >::iterator it = rgptsPoly.begin(); it != rgptsPoly.end(); it++)
        {
            if (it->first == rgpt_index)
            {
                from = it->first;
                to = it->second;
                *it = pair<int, int>(-1, -1); // mark the edge as used
                break;
            }
            if (it->second == rgpt_index)
            {
                from = it->second;
                to = it->first;
                *it = pair<int, int>(-1, -1);
                break;
            }
        }
        if (from == -1)
            return NULL;
        OGRPoint point(rgpts[from]->second, rgpts[from]->first); // (longitude, latitude)
        if (ring == NULL)
            ring = new OGRLinearRing;
        ring->addPoint(&point);
        rgpt_index = to;
        new_points.insert(from);
        new_points.insert(to);
        if (to == first)
            break; // ring closed
    }
    return ring;
}

int GetFirstNonZero(vector<pair<int, int> >& rgptsPoly)
{
    // Return the first endpoint of any edge not yet consumed by GetNextRing
    for (vector<pair<int, int> >::const_iterator it = rgptsPoly.begin(); it != rgptsPoly.end(); it++)
        if (it->first >= 0)
            return it->first;
    return -1;
}

int GetRings(vector<pair<int, int> >& rgptsPoly, const vector<Node*>& rgpts, vector<OGRLinearRing*>& rings)
{
    set<int> setStartingPoints;
    setStartingPoints.insert(GetFirstNonZero(rgptsPoly));
    for (;;)
    {
        if (GetFirstNonZero(rgptsPoly) < 0)
            return rings.size(); // all edges consumed
        set<int> new_points;
        for (set<int>::const_iterator it = setStartingPoints.begin(); it != setStartingPoints.end(); it++)
        {
            OGRLinearRing* ring = GetNextRing(rgptsPoly, rgpts, *it, new_points);
            if (ring != NULL && ring->getNumPoints() >= 3)
                rings.push_back(ring);
        }
        if (new_points.size() == 0)
        {
            // No progress from the current starting points; restart from any
            // remaining unconsumed edge, or stop if there are none
            new_points.insert(GetFirstNonZero(rgptsPoly));
            if (new_points.size() == 0 || *(new_points.begin()) < 0)
                break;
        }
        setStartingPoints.clear();
        setStartingPoints.insert(new_points.begin(), new_points.end());
    }
    return rings.size();
}

int GetPolygonBoundary(const vector<Node*>& rgpts, int nCluster, vector<OGRPolygon*>& rgNewPolys)
{
    wstring wstrPathForTempFiles = L"C:/temp/";
    wstring wstrTempFileInput = wstrPathForTempFiles + L"sites";
    wstring wstrTempFileOutput = wstrPathForTempFiles + L"sites-alf";
    wstring wstrCommandLineArgs = L"-A -oN -i" + wstrTempFileInput + L" -aa2500 -m10000";

    // Write the cluster's points (longitude latitude) to the input file
    ofstream os;
    os.open(wstrTempFileInput.c_str(), ios::trunc);
    if (!os.is_open())
        return 0;
    for (vector<Node*>::const_iterator it = rgpts.begin(); it != rgpts.end(); it++)
        os << setprecision(9) << (*it)->second << " " << (*it)->first << endl;
    os.close();

    if (RunExeWithCommandLine(L"C:/Temp/hull/Release/addcl.exe", wstrCommandLineArgs))
    {
        // Read back the alpha shape's edges as pairs of point indices
        vector<pair<int, int> > rgptsPoly;
        ifstream in;
        in.open(wstrTempFileOutput.c_str());
        if (!in.is_open())
            return 0;
        char blank[500];
        in.getline(blank, 500); // ignore first line
        int n1, n2;
        while (in >> n1 >> n2)
            rgptsPoly.push_back(pair<int, int>(n1, n2));
        if (rgptsPoly.size() == 0)
            return 0;
        vector<OGRLinearRing*> rings;
        if (rgptsPoly.size() <= 2)
            return 1;
        if (GetRings(rgptsPoly, rgpts, rings) <= 0)
            return 0;
        int nTotalNumberOfPointsInRings = 0;
        for (vector<OGRLinearRing*>::const_iterator it = rings.begin(); it != rings.end(); it++)
            nTotalNumberOfPointsInRings += (*it)->getNumPoints();
        // Note: if nTotalNumberOfPointsInRings < (int)rgptsPoly.size(), then GetNextRing only
        // processed nTotalNumberOfPointsInRings out of rgptsPoly.size() points. This is OK, as
        // sometimes the shape includes a line (polygon with 2 points), which we'll generally ignore.
        for (vector<OGRLinearRing*>::const_iterator it = rings.begin(); it != rings.end(); it++)
        {
            OGRPolygon* poly = new OGRPolygon;
            poly->addRing(*it);
            if (poly->getExteriorRing()->getNumPoints() >= 3)
                poly->closeRings();
            rgNewPolys.push_back(poly);
        }
        return rgNewPolys.size();
    }
    return 0;
}

Note that I do not consider the code above to be particularly good; I was still trying to gain an understanding of the output of Clarkson’s alpha shape code as I was developing it, and it shows. Anyhow, it has the useful property of working fairly robustly.

The above code is called with code like the following:

vector<OGRPolygon*> rgNewPolys;
map<int, vector<Node*> > mapNodesInEachCluster;
// Group the nodes by cluster, ignoring noise points (cluster < 0)
for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)
    if ((*it)->m_nCluster >= 0)
        mapNodesInEachCluster[(*it)->m_nCluster].push_back(*it);
for (map<int, vector<Node*> >::const_iterator it = mapNodesInEachCluster.begin(); it != mapNodesInEachCluster.end(); it++)
    if (GetPolygonBoundary(it->second, it->first, rgNewPolys) <= 0)
        cout << "Failure with " << it->second.size() << " points" << endl;
WritePolygonsToMidMifFile("C:/Temp/Polygons", rgNewPolys);
Where the WritePolygonsToMidMifFile function is:

bool WritePolygonsToMidMifFile(string strFileName, vector<OGRPolygon*>& polys)
{
    ofstream osFileMid, osFileMif;
    if (!OpenMidMifAndWritePreamble(strFileName, osFileMid, osFileMif))
        return false;
    int i = 0;
    for (vector<OGRPolygon*>::const_iterator it = polys.begin(); it != polys.end(); it++, i++)
    {
        osFileMid << i << endl; // the polygon's creation index as its "Cluster" attribute
        osFileMif << "Region  1" << endl
                  << "    " << (*it)->getExteriorRing()->getNumPoints() << endl;
        for (int n = 0; n < (*it)->getExteriorRing()->getNumPoints(); n++) // first and last point are the same
        {
            OGRPoint pt;
            (*it)->getExteriorRing()->getPoint(n, &pt);
            osFileMif << pt.getX() << " " << pt.getY() << endl;
        }
        // Generate a unique-ish RGB colour from the polygon's creation index
        int rgb = (40 + ((i * 13) % 215)) * 255 * 255 + (40 + ((i * 4573) % 215)) * 255 + (40 + (i % 215));
        osFileMif << " Pen (1,2,0) " << endl
                  << "    Brush (" << 2 << "," << rgb << ")" << endl;
    }
    osFileMif.close();
    osFileMid.close();
    return true;
}

We’ll see some of the outputs produced in the next and final post in this series, where we’ll also look at the effect of different values of alpha.

Spatial Clustering in C++ (post 3 of 5) – A Speedier DBSCAN

Welcome to part 3 of this 5 part blog post on Spatial Clustering using C++. In the two previous posts we read in some map data (Californian census block centroids) and performed some clustering on the points using the DBSCAN algorithm. In this post we’ll investigate how to make the code run faster. This is important if you want this type of algorithm to run in reasonable time over a very large data set (e.g. all of the USA), or you want to run it repeatedly (e.g. to investigate different parameters).

In this investigation I’ll employ some of the techniques for code optimisation that I keep in my mental kit bag, namely: using raw arrays instead of container objects; pre-calculating and caching values wherever possible; using statics to avoid memory allocation overheads; and “pooling” frequently re-allocated objects.

But first, the golden rule: always profile. The first thing I’ll do is surround the call to RunDBScan() with some simple time clocking:

time_t tmStartTime;
time(&tmStartTime);

int nClusters = RunDBScan();

time_t lTimeNow;
time(&lTimeNow);
time_t tsDuration = lTimeNow - tmStartTime;
cout << tsDuration << " seconds to run DBScan" << endl;
The version of DBSCAN presented in the previous post runs in about 86 seconds, so this is the benchmark we start from (note that this is in the Debug build configuration).

Next I do my “poor man’s profiling” – in Visual Studio, I just hit Debug / Break when the main algorithm is running. Breaking a few times gives a picture of where the most time is being spent. [Incidentally, I love this method of profiling, as it’s just so quick, given that it doesn’t require any external tools or specialised performance builds, yet it still often enables you to optimise 90% of the worst bottlenecks in the code.]

In this case I find the majority of time is spent allocating memory when pushing items onto the STL vector rgpNeighbourhood2 inside ExpandCluster / GetNodesInRadius.

I tried using vector’s “reserve” function to allocate memory in larger chunks. However, it’s not clear what the best chunk size would be; in fact, experimenting with different values can dramatically affect the overall performance. For my test data set, I found the best value was 25, and this reduced the running time to 60 seconds.
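
The change itself is a one-liner, with 25 being the empirically best chunk size for this data set:

vector<Node*> rgpNeighbourhood2;
rgpNeighbourhood2.reserve(25); // pre-allocate; tune this value for your data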

Another similar strategy when using STL vectors is to make the vector static and clear it on each use; replace the declaration of the vector with the following code:

static vector<Node*> rgpNeighbourhood2;

rgpNeighbourhood2.clear();

vector::clear does not change the vector’s capacity in this version of STL (MS Visual Studio 8), so the vector is constructed only once, and it is re-allocated presumably only a handful of times (whenever it’s required to grow larger than it previously was). This method is better than using reserve, as it doesn’t require guessing a chunk size, and it is faster too – only 53 seconds.

Of course, one can also fall back on good old raw C arrays instead of vectors. I haven’t posted the code for this, but it reduces the time further, to 39 seconds. Raw C arrays do really impact the readability and maintainability of the code, though, and would need to be wrapped up in some nice class wrapper or suchlike in an industrial-strength application.
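
A hedged sketch of the general shape such a replacement might take (the capacity and names are assumptions, not the original code):

#include <cstddef>
#include "node.h"

// Hypothetical raw-array replacement for the STL vector: a fixed-capacity
// buffer with an explicit count, so growing the neighbourhood never allocates.
// (Make instances static or global; this is far too big for the stack.)
const size_t MAX_NEIGHBOURS = 100000; // capacity assumption

struct NeighbourBuffer
{
    Node*  items[MAX_NEIGHBOURS];
    size_t count;
    void Clear()       { count = 0; }
    void Push(Node* p) { if (count < MAX_NEIGHBOURS) items[count++] = p; }
};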

Once this is all done, the bottleneck is now inside kdtree, specifically where it allocates and de-allocates its result objects (objects of type kdres). Looking into the code in this area, I saw the allocation guarded with a #define: USE_LIST_NODE_ALLOCATOR. The documentation advocates switching this flag on. In Windows you will also need to define NO_PTHREADS and comment out the #error line that warns about the code no longer being thread-safe.
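
Concretely, the build changes amount to something like the following, either in the kdtree source or as compiler defines:

// Switch kdtree to its list node allocator, as its documentation advocates.
#define USE_LIST_NODE_ALLOCATOR
// On Windows there are no pthreads; this makes the allocator non-thread-safe,
// and the accompanying #error line in the kdtree source must be commented out.
#define NO_PTHREADS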

Using the list node allocator reduces the time to an astounding 15 seconds. Finally, running the code in Release instead of Debug reduces the time to 4 seconds – not bad considering the first naive version took nearly a minute and a half to run! Profiling now shows most time is spent inside the kdtree functions that return the nearest nodes, and there’s not much we can do about that.

Summarising what we’ve found:

  • 86 seconds for the first version of the algorithm
  • 60 seconds with reserve of 25 on rgpNeighbourhood2
  • 53 seconds with static instead of reserve
  • 39 seconds with raw C arrays instead of STL vectors
  • 15 seconds with kdtree’s list node allocator switched on (and raw C arrays)
  • 4 seconds in Release configuration (and raw C arrays)

And the lessons learnt (as always, it seems) from optimising code:

  • always measure timings – bottlenecks are very often not where you’d think they would be
  • it’s often sufficient to use very simple profiling
  • it’s frequently surprising how fast you can eventually make the code run

Now that we have our DBSCAN running blazingly fast, in the next post we’ll see how to create Alpha Shapes that will give the output more visual appeal.

Spatial Clustering in C++ (post 2 of 5) – Running DBSCAN

In the last post in this series I explained how to get and read in some spatial data points for California. In this post we’ll run the DBSCAN algorithm to perform spatial clustering.

DBSCAN is a well-known algorithm that identifies clusters based on density. A major advantage of DBSCAN is that it can identify arbitrarily shaped clusters (i.e. it does not presuppose the size, shape, or number of clusters in the data, as some other methods do), and it removes noise during the clustering process. It’s also quite a simple algorithm to implement.

The two parameters controlling DBSCAN are eps and minPts. They work together to define how the clusters should “grow”: if at least minPts points are within a radius of eps of a point, then those neighbouring points are also included in the cluster; each point in the expanded cluster is then subjected to the same test.

Our kdtree data structure provides just what we need for the “neighbours” test with its kd_nearest_range function. As a side note, though, our eps (radius) parameter here is expressed in latitude/longitude degrees – in a real application we would likely want to use a more natural unit such as metres or kilometres. (One can convert, but the conversion depends on latitude.) One consequence of using lat/long distance, apart from its non-intuitiveness, is that the effective radius on the ground will differ slightly depending on the latitude of the point.
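
For the record, a rough conversion looks like the following – a sketch assuming a spherical earth; the function is illustrative and not part of the series’ code:

#include <cmath>

// Convert a radius in metres to an eps in lat/long degrees at a given
// latitude. A degree of longitude spans fewer metres away from the equator,
// so using the longitude scale guarantees at least `metres` of coverage
// east-west (and somewhat more north-south).
double MetresToEpsDegrees(double metres, double latitudeDeg)
{
    const double metresPerDegreeLat = 111320.0; // roughly constant everywhere
    double metresPerDegreeLon = metresPerDegreeLat * std::cos(latitudeDeg * 3.14159265358979 / 180.0);
    return metres / metresPerDegreeLon;
}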

So, we’ll now show the code to get the points in a given radius – the GetNodesInRadius function:

bool GetNodesInRadius(Node* pt, double dblRadius, int nMinPts, vector<Node*>& rgpNodesFound)
{
    double pos[2] = {pt->GetLongitude(), pt->GetLatitude()};
    kdres* res = kd_nearest_range(kdTree, pos, dblRadius);
    int number_near = kd_res_size(res);
    if (number_near >= nMinPts)
    {
        while (!kd_res_end(res))
        {
            Node* ptNear = (Node*)kd_res_item(res, pos);
            rgpNodesFound.push_back(ptNear);
            kd_res_next(res);
        }
    }
    kd_res_free(res);
    return number_near >= nMinPts;
}

And this is called by our DBSCAN code, which is as follows.

void ExpandCluster(vector<Node*>& rgp, int nCluster, double dblEpsilon, int nMinPts)
{
    // Seed the neighbourhood with the unvisited points of the initial radius search
    vector<Node*> rgpNeighbourhood;
    for (int i = 0; i < (int)rgp.size(); i++)
        if (!rgp[i]->m_fVisited)
        {
            rgp[i]->m_nCluster = nCluster;
            rgp[i]->m_fVisited = true;
            rgpNeighbourhood.push_back(rgp[i]);
        }

    // Grow the cluster; note rgpNeighbourhood may gain items as we iterate
    for (int i = 0; i < (int)rgpNeighbourhood.size(); i++)
    {
        Node* pNodeNear = rgpNeighbourhood[i];
        vector<Node*> rgpNeighbourhood2;
        if (GetNodesInRadius(pNodeNear, dblEpsilon, nMinPts, rgpNeighbourhood2))
        {
            // Append to rgpNeighbourhood the items in rgpNeighbourhood2 that aren't already in it
            for (int j = 0; j < (int)rgpNeighbourhood2.size(); j++)
            {
                Node* pNode = rgpNeighbourhood2[j];
                if (!pNode->m_fVisited)
                    pNode->m_fVisited = true;
                if (pNode->m_nCluster < 0)
                {
                    pNode->m_nCluster = nCluster;
                    rgpNeighbourhood.push_back(pNode);
                }
            }
        }
    }
}

int RunDBScan()
{
    double dblEpsilon = 0.0045;
    int nCluster = 0, nMinPts = 8;
    for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)
    {
        Node* pNode = *it;
        if (!pNode->m_fVisited)
        {
            pNode->m_fVisited = true;
            vector<Node*> rgpNeighbourhood;
            if (GetNodesInRadius(pNode, dblEpsilon, nMinPts, rgpNeighbourhood))
            {
                // pNode is a core point: start a new cluster and grow it
                pNode->m_nCluster = nCluster;
                ExpandCluster(rgpNeighbourhood, nCluster, dblEpsilon, nMinPts);
                nCluster++;
            }
        }
    }
    return nCluster;
}

Here I’ve implemented a very simple performance improvement over the DBSCAN pseudo-code as it appears on Wikipedia: I don’t add the point P’ to the current neighbourhood N unless it has no cluster already, which guarantees that subsequent iterations over N only process unvisited points. The results are exactly the same, but it runs 2-3 times faster (at least with this STL vector-based implementation).

Even so, this code uses a fairly naive implementation with STL vectors. We’ll see in the next blog post in this series how to refine this implementation and in the process make it run very fast.

With this example data set (California census block data) there tend to be many clusters, including many small ones (though, to be sure, there is at least one truly massive cluster – Los Angeles!). One could obviously cull out the smallest clusters after DBSCAN is run if that is appropriate for your application.
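
A sketch of such a cull (illustrative; it relabels the points of small clusters as noise, using the Node members defined in part 1):

#include <map>
#include <vector>
#include "node.h"

// Relabel points belonging to clusters smaller than minClusterSize as noise (-1).
void CullSmallClusters(std::vector<Node*>& nodes, size_t minClusterSize)
{
    std::map<int, size_t> counts;
    for (size_t i = 0; i < nodes.size(); i++)
        if (nodes[i]->m_nCluster >= 0)
            counts[nodes[i]->m_nCluster]++;
    for (size_t i = 0; i < nodes.size(); i++)
    {
        int c = nodes[i]->m_nCluster;
        if (c >= 0 && counts[c] < minClusterSize)
            nodes[i]->m_nCluster = -1;
    }
}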

I’m writing out the cluster points to Mid/Mif files and importing them into Tab format using MapInfoPro. The code to write the points is:

bool OpenMidMifAndWritePreamble(string strFileName, ofstream& osFileMid, ofstream& osFileMif)
{
    osFileMid.open((strFileName + ".mid").c_str(), ios::trunc);
    if (!osFileMid.is_open())
        return false;
    osFileMif.open((strFileName + ".mif").c_str(), ios::trunc);
    if (!osFileMif.is_open())
        return false;
    osFileMid << setprecision(20);
    // MIF header: coordinate system, delimiter and the single attribute column
    osFileMif << "Version 300" << endl;
    osFileMif << "Charset \"Neutral\"" << endl;
    osFileMif << "Delimiter \",\"" << endl;
    osFileMif << "CoordSys Earth Projection 1, 0 Bounds (-1000, -1000) (1000, 1000)" << endl;
    osFileMif << "Columns 1" << endl;
    osFileMif << "  Cluster Integer" << endl;
    osFileMif << "Data" << endl << endl;
    return true;
}

bool WritePoints(string strFileName)
{
    ofstream osFileMid, osFileMif;
    if (!OpenMidMifAndWritePreamble(strFileName, osFileMid, osFileMif))
        return false;
    for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)
    {
        osFileMid << (*it)->m_nCluster << endl;
        osFileMif << "Point " << (*it)->GetLongitude() << " " << (*it)->GetLatitude() << endl;
        // Unique-ish colour per cluster; noise (cluster -1) is drawn small and white
        int c = (*it)->m_nCluster;
        int rgb = (40 + ((c * 13) % 215)) * 255 * 255 + (40 + ((c * 4573) % 215)) * 255 + (40 + (c % 215));
        int size = ((*it)->m_nCluster >= 0 ? 6 : 2);
        osFileMif << "  Symbol (34," << ((*it)->m_nCluster >= 0 ? rgb : 0xFFFFFF) << "," << size << ")" << endl;
    }
    osFileMif.close();
    osFileMid.close();
    return true;
}

Here I’ve used a cute little one-liner to generate a unique colour for each cluster. Non-clustered points (“noise” in DBSCAN terms) are coloured white (with a black outline) and drawn smaller than the clustered points.

The results with eps set to 0.0045 and minPts = 8 give 3706 clusters, and 488,761 out of the total of 710,146 points end up in a cluster. Zooming in on the San Francisco area looks like:

Changing eps to 0.009 gives larger clusters, as you would expect. This time there are 3473 clusters, and 592,652 points are clustered. San Francisco looks like:

In the next post I’ll show how to significantly speed up DBSCAN, and in further posts demonstrate how to get our output to look like this:

Spatial Clustering in C++ (Part 1 of 5)

Welcome to the first of a five-part blog post where I work through an end-to-end example of spatial data clustering using C++. I’ll be using a number of tools and techniques, including:

– processing spatial point data using MapInfoPro

– storing points in a kdtree data structure (which will enable fast lookup of nearby points)

– performing spatial clustering using an implementation of the DBSCAN algorithm

– creating Alpha Shapes (think “shrink-wrapped” polygons) to visualise the output

– using OGR for various miscellaneous geometry tasks

Aside from the commercial product MapInfoPro (which is optional for this), all of the tools and codes I use are free/open source with permissive licences. We’ll also see (particularly by blog post #3 in this series) that the combination of these tools means our spatial clustering can be blazingly fast to compute.

The simple version of spatial data clustering I’ll be doing here is to develop code that will automatically figure out groups (clusters) of densely connected points on a map. A typical application for this might be to automatically find towns or cities from a population map.

The eventual output we’ll get looks like the following (and I hope this whets your appetite to follow the entire 5 parts of this blog post):

The above picture shows each cluster found in the data, and coloured uniquely.

The above picture shows the output of running Alpha Shapes to obtain polygons which represent the clusters.

 

For this exercise I’ll be using a subset of the US TIGER spatial data; in this case I’m choosing California block data. (I did in fact find it very hard to find a freely available file of spatial data points – e.g. addresses for an entire country – to use as my sample data.) To get this TIGER data, go to the TIGER download site and select Block data, California, All Counties.

There are at least two easy ways to get points from this data. The first, and probably quickest (if you have MapInfoPro installed), is to open the block file in MapInfoPro, go to Query / SQL Select, and select CentroidX(obj), CentroidY(obj) from the table, then save this selection as a separate file and export it to MID/MIF format. The MID file will then contain all the centroid points. An alternative, shown below, is to use OGR-computed centroids. There are 710,146 points in all.

Now let’s write some C++ to read in these points, and store them in a 2-dimensional kdtree, as well as in an STL vector of Node objects. Our kdtree code here is provided by the excellent implementation hosted at Google Code (and consists of just two files, kdtree.h and kdtree.cpp). This code is compiled with Visual Studio but should be relatively cross-platform.

First we define a little helper class to store our points (which we’ll call Nodes). The Visited and Cluster members will be used later for DBSCAN.

#include <utility>

class Node : public std::pair<double, double>
{
public:
    Node(double x, double y) : std::pair<double, double>(y, x) // stored as (latitude, longitude)
    {
        m_fVisited = false;
        m_nCluster = -1;
    }
    double GetLongitude() const {return second;}
    double GetLatitude() const {return first;}
    bool m_fVisited;
    int m_nCluster;
};

Next is the main code that reads in the points and stores them in both a kdtree (so that we can get nearest nodes later) and a vector. I’ve included all the #includes that we’re going to need in the code of subsequent blog posts.

#include "kdtree.h"

#include

#include

#include

#include

#include

#include

#include "node.h"

using namespace std;

kdtree* kdTree;

vector<Node*> vecNodes;

bool ReadInPointsFromFile(string strFile)

{

ifstream in;

in.open(strFile.c_str());

if (!in.is_open())

return false;

kdTree = kd_create(2);

while (!in.eof())

{

double x, y;

char comma;

in >> x >> comma >> y;

Node* pt = new Node(x, y);

vecNodes.push_back(pt);

double xy[2] = {x, y};

kd_insert(kdTree, xy, pt);

}

return true;

}

int main(int argc, char* argv[])

{

ReadInPointsFromFile(“C:/Temp/CentroidPoints.MID”);

}

The alternative approach (mentioned above) for getting points from the block data – without needing MapInfoPro – is to read in the shape file data directly and use OGR library functions to calculate the polygon centroids. This code is shown below. You’ll need to have OGR installed for this – the easiest way is to install the FWTools binaries. You’ll then need to make sure your project includes the FWTools include files, and links with the appropriate lib.

This approach is a little slower to run than using MapInfoPro. (If you just wanted a fast approximation of the block centroids, however, you could probably just take the centroid of the bounding box of each polygon as it is read in.)

bool ReadInPolygonsFromFile(string strFile)
{
    kdTree = kd_create(2);
    OGRRegisterAll();
    OGRDataSource* poDS = OGRSFDriverRegistrar::Open(strFile.c_str(), false);
    if (poDS == NULL || poDS->GetLayerCount() <= 0)
        return false;
    OGRLayer* poLayer = poDS->GetLayer(0);
    OGRFeature* poFeature;
    while ((poFeature = poLayer->GetNextFeature()) != NULL)
    {
        OGRGeometry* poGeometry = poFeature->GetGeometryRef();
        if (poGeometry != NULL)
        {
            long lType = wkbFlatten(poGeometry->getGeometryType());
            if (lType == wkbPolygon)
            {
                if (static_cast<OGRPolygon*>(poGeometry)->get_Area() != 0)
                {
                    // Store the polygon's centroid as a Node
                    OGRPoint pt;
                    static_cast<OGRPolygon*>(poGeometry)->Centroid(&pt);
                    Node* node = new Node(pt.getX(), pt.getY());
                    vecNodes.push_back(node);
                    double xy[2] = {pt.getX(), pt.getY()};
                    kd_insert(kdTree, xy, node);
                }
            }
            else if (lType == wkbMultiPolygon)
            {
                // Take the centroid of each sub-polygon separately
                OGRMultiPolygon* pmpg = static_cast<OGRMultiPolygon*>(poGeometry->clone());
                for (int nPoly = 0; nPoly < pmpg->getNumGeometries(); nPoly++)
                {
                    OGRGeometry* poSubGeometry = pmpg->getGeometryRef(nPoly);
                    if (poSubGeometry != NULL && static_cast<OGRPolygon*>(poSubGeometry)->get_Area() != 0)
                    {
                        OGRPoint pt;
                        static_cast<OGRPolygon*>(poSubGeometry)->Centroid(&pt);
                        Node* node = new Node(pt.getX(), pt.getY());
                        vecNodes.push_back(node);
                        double xy[2] = {pt.getX(), pt.getY()};
                        kd_insert(kdTree, xy, node);
                    }
                }
                OGRGeometryFactory::destroyGeometry(pmpg); // free the clone
            }
            else // maybe lType == wkbPoint or wkbLineString?
                return false;
        }
        OGRFeature::DestroyFeature(poFeature);
    }
    return true;
}

int main(int argc, char* argv[])
{
    ReadInPolygonsFromFile("C:/Temp/tl_2010_06_tabblock10/tl_2010_06_tabblock10.shp");
}

Stay tuned for the next post, where we’ll run the DBSCAN algorithm and start to see some nice outputs!

Function within a Function in C++

I have been coding in C++ for a number of years, and only recently did I find out that there is an indirect way to write a function within a function. I have found myself using this little workaround more and more. When coupled with the use of a static variable within a function, the ability to have “Everything In Its Right Place”, i.e. avoid unnecessary functions and variables in the global namespace, is particularly appealing to my inner code nazi.

The below code snippet shows an example…

Note

– I have only tested the code snippet below for compilation using Visual Studio
– Visual Studio’s debugger does not allow you to “watch” spndI when execution is inside the function NodeIsOfSameTypeAsIButNotI

– I have formatted the code for the narrow width afforded by this blog! So the example looks more verbose than it would otherwise be, and the blog software may mangle some of the code when it is posted (causing me to re-edit the post over and over until Google gets it right)

#include <vector>

using namespace std;

struct CNode
{
    // shell struct to enable code compilation
    double GetLongitude() const {return 0;}
    double GetLatitude() const {return 0;}
    double GetType() const {return 0;}
};

template <class T>
class CSomeClassHoldingNodesForFastFinding
{
public:
    typedef bool (*NodeCanBeClosest)(const T*);
    inline T* FindClosest
    (
        double dblLongitude,
        double dblLatitude,
        NodeCanBeClosest pNodeCanBeClosest
    )
    {
        return NULL; // Really return a T*, not NULL
    }
};

void FindClosestNodePairsAndDoSomethingWithThem
(
    vector<CNode*>& rgpnd,
    CSomeClassHoldingNodesForFastFinding<CNode>* pFastFindNodes
)
{
    // A static local is visible to the local struct's static member below
    static CNode* spndI = NULL;
    for (size_t nNodeI = 0; nNodeI != rgpnd.size(); nNodeI++)
    {
        spndI = rgpnd[nNodeI];
        // The "function within a function": a local struct with a static member
        struct Func
        {
            static bool NodeIsOfSameTypeAsIButNotI(const CNode* pnd)
            {
                return
                    pnd != spndI
                    &&
                    pnd->GetType() == spndI->GetType();
            }
        };
        CNode* pndJ = pFastFindNodes->FindClosest
        (
            spndI->GetLongitude(),
            spndI->GetLatitude(),
            &Func::NodeIsOfSameTypeAsIButNotI
        );
        // .. Do something with spndI and pndJ
    }
}

Some Linux IDEs for C++

I recently went on a short excursion looking for good, lightweight IDEs for C++ in Linux, similar to my adventures with Python IDEs. My (modest) needs are similar to my Python IDE needs; specifically, I want very good step-through debugging support, and, in terms of a lightweight solution, a tool that doesn’t have a big project/workspace overhead with the IDE creating many of its own files (our C++ programs all have simple makefiles, with debugging info turned on and code optimisation suppressed by compiling with the -g and -O0 flags).

Here is what I found:

  • GDB is the GNU debugger used with GCC, and has a text interface for step-through debugging that is actually usable – if you are very patient. There is also a graphical front end called DDD, though I did not try it.
  • Anjuta DevStudio is an IDE for GNOME. Debugging is activated with a plugin found under Edit/Preferences/Installed Plugins. This IDE worked OK for me in 32 bit Linux, but failed (shutting itself down) when trying to debug under 64 bit Linux.
  • CodeLite is a cross-platform IDE that seemed to work well at first; however, I could not get it to respect my program’s command line arguments.
  • Geany is an IDE with a clean presentation that I have used in the past just to look at C++ code; debugging is supposedly achieved with a plug-in called GeanyGDB, however I did not try it.
  • Code::Blocks is a very mature IDE that is a little chunky in appearance (though its download size suggests it is more lightweight than the IDEs listed above), and it seemingly requires you to create various project files (at least that seemed the easiest way to get it to a point where I could use it for debugging). However, the debugging does work rather well (and it even supports the Mighty Middle Click). Though it does not seem to show variable internals for more complex variables (dereferencing STL iterators would be nice, but maybe I’m missing something there), this will be my IDE of choice in Linux for a little while at least, I think.

Trying out these IDEs in Linux has, if nothing else, emphasised to me just how impressive, powerful and usable Microsoft’s Visual Studio is (for Windows only, of course). I have used Visual Studio in its various incarnations since 1996, and it keeps raising the bar, even in the 2010 version, with awesome debugging features like “Pin to source”, which lets you make little variable watch windows right alongside the line of code they are relevant to. It offers so much at a glance (local variable values, current stack, etc.) in an elegant way, and one quickly becomes used to being able to look deep into the internals of STL data types, change the line of execution or a variable’s value while debugging, and the like.

One thing that would be really good is if all these IDEs standardised their debugging hotkeys; as the table below shows (where I have also included Eric and Firebug), they are generally all different. I think Eric’s scheme is the best, as it seems logical and doesn’t require any contortions with Shift and Control keys.

 

                            Start  Continue  Step  Step into  Step out       Stop
Anjuta                      F4     F4        F6    F5         Shift-F5       none
Code::Blocks                F8     Ctrl-F7   F7    Shift-F7   Ctrl-Shift-F7  none
Visual Studio and CodeLite  F5     F5        F10   F11        Shift-F11      Shift-F5
Eric                        F5     F6        F7    F8         F9             F10
Firebug                     n/a    F8        F10   F11        Shift-F11      n/a

Cross-platform development

During the course of developing Biarri’s flagship Workbench product, we’ve taken pains to ensure that our (GUI-less) optimisation “engines” work well under both Windows and Linux operating systems (so-called cross-platform). This turns out to be relatively easy as long as you stay away from the big OS-specific frameworks (e.g. Microsoft’s MFC/COM/ATL etc). We’ve picked up some handy tips along the way, particularly applicable to C++ development, which are worth sharing here.

  • Be aware of differences in line endings – Windows uses carriage return and line feed \r\n, while Linux/Unix uses just line feed \n. (Note that Visual Studio will show files with Linux line feeds correctly, but Notepad won’t – this is one way to tell what line endings your file has in Windows). This can be particularly important when importing data e.g. into databases where the file originates from another OS.
  • Always use forward slashes for file paths, not backslashes. Also, file names and folder paths are case sensitive under Linux but not under Windows. And don’t assume there is a C: or D: drive!
  • You may have to be careful writing to temporary files and folders. In Linux /tmp is often used; in Windows /[user]/AppData/local/temp (location of the TEMP environment variable – e.g. type “%TEMP%” into the start menu or Windows Explorer). For Linux, it is sometimes necessary to manipulate a folder’s “sticky bit” to ensure that the folder is accessible by other users (e.g. a Postgres database user) – e.g. in Python:
os.chmod(temp_dir_name, os.stat(temp_dir_name).st_mode | stat.S_ISVTX | stat.S_IRGRP | stat.S_IROTH | stat.S_IWGRP | stat.S_IXOTH)
  • Be aware of the differences in file permissions in Windows and Linux. In Linux files have an “executable” bit. chmod a+x [file] makes a file an exe, which can then be run with “./filename”.

For C++ development:

  • Name all cpp and h files in lower case if possible. File names are case sensitive in Linux, and this includes #include’s!
  • For compiling with GCC under Linux, every C++ file must end with a newline (GCC warns if the final newline is missing).
  • In Linux C++ programs, general exception handling with catch(…) will not trap crashes such as segmentation faults. You can use signal handlers instead (see this for example, and the sketch after this list), though it’s not as good – it is more equivalent to an exit(), with a chance to clean up.
  • Beware doubles comparison and inequality checking, at least in C++ programs. Always use a delta: A == B may not hold in both Windows and Linux even when they are essentially the same number, so test fabs(A - B) < epsilon instead.
  • Build tips for Linux: type “make” when you are in the project directory to build it. This will search for a file called “Makefile” and run it (use “make -f filename” to make from a different makefile). To force a recompile you can “touch” a file using “touch filename”.
    To clean out all object files, type “make clean” (as long as your makefile defines what cleaning does…). Use “make -j4” to run make with four concurrent jobs, to take advantage of multicore.
  • In bash, to get a recursive line count of .cpp/.h files: find [directory] -type f \( -name '*.cpp' -o -name '*.h' \) -exec wc -l {} \; | awk '{total += $1} END{print total}'
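
As promised above, here is a minimal sketch of the signal handler approach to catching crashes under Linux. It is illustrative only – bear in mind that only a limited amount of work is safe inside a signal handler:

#include <csignal>
#include <cstdio>
#include <cstdlib>

// Stand-in for catch(...): trap fatal signals, clean up, then exit.
void CrashHandler(int sig)
{
    fprintf(stderr, "Fatal signal %d received; cleaning up\n", sig);
    // ... remove temp files, flush logs ...
    exit(1);
}

int main()
{
    signal(SIGSEGV, CrashHandler);
    signal(SIGFPE, CrashHandler);
    // ... run the engine ...
    return 0;
}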

An IDE for Python

For some time now I have been trying to find a decent IDE with step-through debugging support for Python. I’ve wanted it for Linux, but Windows support would also be a bonus.

There’s some debate about the need for an IDE for Python, which (as a veteran of C++ development with Visual Studio) I am still pondering. I get that Python is a higher level language (umm, that’s why I’m using it), but the central problem of ironing out the kinks in the business/engine logic of my code is never going to go away. It really makes me wonder what size and types of code bases the IDE/debugger naysayers are building.

People also talk about Eclipse with PyDev, but I’m deterred by the reputedly formidable learning curve, the reportedly sluggish performance, and the apparent bloat of it. I wanted something lighter, but still free. And I didn’t want something that would require a big project hierarchy, settings tweaking etc., just to run a small Python script. I don’t think my needs are outlandish: easy things should be easy, hard things should be possible…

This comparison of Python IDEs – the first hit on Google – seems good but is 5 years old (ancient in software development terms). And the Wikipedia comparison table is just the basic facts, ma’am. The Python.org list of IDEs is better, but without some sort of detail or commentary it’s difficult to figure out what’s best for your needs, and parts of it are out of date. Mile long feature lists are all well and good, but how well do the features do what they’re supposed to do?

So I embarked on a trial of a few of the free IDEs out there. First stop was SPE – “Stani’s Python Editor”… which I couldn’t get to install. I know, I know, in Linux Land you’ve got to be prepared to tinker in the innards with spanner in hand… but a frustrating hour or two later, no go. Perhaps because this tool doesn’t seem to be actively developed any more (as I found out afterwards). Next I tried Boa Constructor. It installed, and first impressions were cautiously positive, though it felt like beta software to me. Sure enough, after trying to use it in anger, the pain points came – I couldn’t figure out the rhyme or reason for why it wouldn’t just run the Python script I had open, I had to constantly restart, breakpoints weren’t always respected, etc. Overall it seems more aimed at GUI building than running scripts.

Next was the Eric IDE. Eric installs with a simple “apt-get install eric”. The Python file you have open runs with “Start/Debug Script…”. Breakpoints and stepping through just work (in fact, debugging is an absolute breeze). Lines with Python parse errors get a little bug icon in the editor margin (cute, but also handy). It’s not perfect by any means – it takes a while to start up, it occasionally breaks automatically at places where you have no breakpoints, and so on. Its GUI is formidable at first glance, but it’s set out logically and should seem familiar to those, like me, steeped in Visual Studio. It’s also still being very actively developed.

One interesting aspect of Eric’s editor is that it uses rich text with proportionally spaced fonts, in contrast to the plain monospaced font that most code editors sport. This might seem sacrilegious to some, but it seems to work fine for me, and in fact lets me see more code on the screen. It’s not so good for ASCII art though.

Obviously I haven’t tried the trials of the commercial IDEs out there – Komodo Edit, WingWare, etc. – and I’d be curious what “killer features” (that work out of the box!) they have that Eric doesn’t. But for now, the journey’s over, with a clear winner.