Adventures with Javascript Graphing Libraries

I was looking for a Javascript library that would display a graph – in the mathematical sense of a node-arc network such as the following:

I might require curved lines and maybe arrows, and I need it to be zoomable and pannable.  I also need node labels and click events on both nodes and edges.

Remembering a few links someone sent around showing off impressive applications of libraries such as d3.js and jQuery Sparklines, I had high hopes of finding something I could use that was polished and snazzy, with superb documentation and an enthusiastic community.

The first thing I noticed is that it seems like everyone is in love with Force-Directed Layouts these days – you know those graphs where the nodes appear “springy” and the spatial layout is “discovered” by the physics algorithm, such as this example.  By their nature, however, these graphs have a dynamic layout – you generally can’t specify exact positions for your nodes.

The libraries I looked at included the following.  A discussion of SVG vs Canvas is here.

d3.js – this library is SVG-based (as opposed to, say, HTML5 canvas-based).  From a cursory look I couldn't see how to zoom (you should be able to, though – it's scalable vector graphics, after all!), and the docs are pretty large and initially confusing.  The consensus seems to be that it is a power user's tool.

Sigma.js has some nice-looking examples and appears to scale to hundreds of nodes and arcs nicely. It can use GEXF graph format. However, it lacks good documentation and doesn’t seem to have much momentum – the last source code change was 9 months ago.

jsPlumb has demos that do not appear to be zoomable, though the library seems to be still actively developed and has good docs.  It can render by canvas/SVG/VML using jQuery/MooTools etc.  However it seems more aimed at connecting elements with drag and drop than showing graphs per se.

Raphael seems nice enough: it is SVG-based, has intriguing demos and reasonable docs, but the last change on GitHub was 10 months ago, and again the demos had no zoom.

The Javascript InfoVis Toolkit strikes the right balance of functionality and simplicity for me – there's enough simplicity that I can actually delve into the source (only one file needed – jit.js) and pretty easily see what's going on.  It's canvas-based, but supports mousewheel zoom (very smoothly) using canvas' scale function.  The project was also involved in a Google Summer of Code.  Although there was no static graph example, I found it quite easy to adapt the force-directed example code to show a nice zoomable graph with node coordinates that I explicitly specified (and with node and edge click events).  (In doing this, I also discovered something interesting – the force-directed layout example gives you a totally new layout every time you refresh; you can see this by refreshing the demo page here multiple times.  This randomised behaviour seems somewhat less than useful.  Also interesting is the drag and drop of the nodes in the force-directed example.)

So I did not find anything that fit squarely with my requirements, though I'm going with the InfoVis Toolkit for now.  (Of course I could always hand-code it myself, which for reasonably simple requirements is an option, though a less preferred one, particularly as you might encounter issues such as cross-browser bugs that have already been addressed in the libraries and frameworks.)

On the plus side I truly learnt what Bezier curves actually are (and how cubic Bezier curves differ from quadratic ones), thanks to some help from the Wikipedia page and this neat interactive page where you can drag the control points around to see the effect on the curve.
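
For reference, the standard formulas (as given on that Wikipedia page): a quadratic Bezier curve has a single control point between its endpoints, while a cubic Bezier curve has two.

B_{\text{quad}}(t) = (1-t)^2 P_0 + 2(1-t)t\,P_1 + t^2 P_2, \qquad t \in [0,1]

B_{\text{cubic}}(t) = (1-t)^3 P_0 + 3(1-t)^2 t\,P_1 + 3(1-t)t^2\,P_2 + t^3 P_3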

Some time after doing the above (admittedly rather cursory) research, I then came across another cluster of libraries with a slightly different focus.  These libraries are more geared towards interactive elements on canvas (e.g. drag and drop) and persisting the state of the elements when they are changed – a good starting discussion is at the Stack Overflow page here.  The “master list” of these libraries I found in a Google doc here.  These libraries provide more of a “scene graph” implementation which would be useful if you need a framework for proper tracking of the elements being drawn (especially if they are to be animated, e.g. a particle simulator).

Displaying multiple KML files in CloudMade

The documentation for CloudMade’s Web Maps API is often a bit wanting. For example, a few pages consist of the words “Interface description coming soon…”  In any case, I wanted to load multiple KML files at once, and the way to get that working ended up being a bit unintuitive, so I thought I’d share my experiences here.

Loading one KML file is easy enough. Straight from the Web Maps API wiki:

var geoxml = new CM.GeoXml('http://earthquake.usgs.gov/eqcenter/catalogs/eqs7day-M2.5.xml');

CM.Event.addListener(geoxml, 'load', function() {
    map.zoomToBounds(geoxml.getDefaultBounds());
    map.addOverlay(geoxml);
});

One weakness with the API is that you can't pass CM.GeoXml a string. You can, however, provide {local: true} as an additional argument if the URL points to a local address:

var geoxml = new CM.GeoXml('/static/kmls/catalogs/eqs7day-M2.5.xml', {local: true});

CM.Event.addListener(geoxml, 'load', function() {
    map.zoomToBounds(geoxml.getDefaultBounds());
    map.addOverlay(geoxml);
});

For my current project, I have about 30 KML files of location data, each KML depicting the state of a network in a different year. I want to show one of these KMLs at a time, but I'd like to have them all loaded up front to keep interaction with the user nice and quick. Rather than combining all 30 KMLs into one master KML file and then using Javascript and CloudMade to manually control which objects to hide and show, I wanted to write a function that just looped through the 30 filenames and loaded them all using the method shown above.

The problem was that doing so caused my browser to go into meltdown.

Here’s an example of that first (unsuccessful) attempt to load all the KML files:

var geoxmls = [];
for (var i in filenames)
{
    geoxmls[i] = new CM.GeoXml(filenames[i], {local: true});

    CM.Event.addListener(geoxmls[i], 'load', function() {
        map.zoomToBounds(geoxmls[i].getDefaultBounds());
        map.addOverlay(geoxmls[i]);
        alert("Finished loading XML #" + i);
    });
}

Honestly, I’ve never seen a browser freak out more, except maybe ten years ago on Internet Explorer when I inevitably stumbled onto an endless minefield of pop-up windows all offering me chances to win free iPods and free trials of get-rich-quick schemes.

Anyway, the next thing I did was try to understand what this code was actually doing–a strange second step, I realize, but usually having faith in code examples works just fine for me.

Apparently, when you call CM.GeoXml, the request to the URL you provide is made right away. But there is a CM.Event attached to that GeoXml object, called 'load', that fires whenever the server call is complete. The listener function that you add will then be called as soon as the KML object is loaded.

In other words, I can't be guaranteed that my KML files will be loaded in the same order as they are listed in my "filenames" array. This was immediately evident when, just before my browser melted, I saw a flash of alert windows listing the recently loaded KML file indices out of order.

Possibly, the reason the above code doesn't work is that all the load events fire at unpredictable times, and maybe it's too much for the CM.Event listeners to handle. Maybe calling map.addOverlay on multiple things at once just screws things up. Who knows.

In the end, I decided to nest my Listeners so that each KML file would not be loaded until the one before it had fired its ‘load’ Event and been added to the overlay. This might take a while, so I made sure to temporarily dim the screen so that the user doesn’t try to interact with anything until every file is loaded.

The final (working) result looks something like this:

initialize_assignments();

load_kml_and_listen(0, 0);

function initialize_assignments()
{
    current_bmap.kmls = [];
    current_bmap.kmls["all"] = [];
    for (var i in all_subproblems)
    {
        cur_sub = all_subproblems[i];
        current_bmap.kmls.all[cur_sub] = [];
    }
    $.loading({pulse: 'fade', mask: true, align: 'center', text: 'Loading KMLs...'}); // dim screen
}

function load_kml_and_listen(i, j)
{
    cur_subproblem = all_subproblems[i];
    cur_product = all_products[j];
    current_bmap.kmls.all[cur_subproblem][cur_product] = new CM.GeoXml('quoll/assignments_to_kml?subproblem_name=' + cur_subproblem + '&product_name=' + cur_product, {local: true});
    CM.Event.addListener(current_bmap.kmls.all[cur_subproblem][cur_product], 'load', function()
    {
        if (i == 0 && j == 0)
        {
            show_kml(current_bmap, cur_subproblem, cur_product);
        }

        if (i < all_subproblems.length - 1)
        {
            next_i = i + 1;
            next_j = j;
        }
        else
        {
            next_i = 0;
            next_j = j + 1;
        }
        if (next_j <= all_products.length - 1)
        {
            load_kml_and_listen(next_i, next_j);
        }
        else
        {
            $.loading(); // lighten screen
        }
    });
}

A build and test script in Python

I’ve recently created a script in Python for continuous get-build-test.

It pulls the latest code from a Mercurial repository, builds a bunch of C++ projects, then runs pytest.

The script demonstrates a few simple, but tasty Python techniques including:
– parsing command line flags using optparse
– using subprocess to run shell commands from Python, and capturing the output as it runs
– archiving the build and test results to a log file
– scraping the last line of output of hg pull, py.test etc as a simple (albeit fragile) way to detect success / failure

I’ve set up a cron job to run this every hour. It only actually does anything if there is changed code from the hg pull.

The cron job is set up with crontab -e and the file looks like:

SHELL=/bin/bash
PATH=/usr/bin:/bin:/usr/local/bin
0 * * * * cd /vol/automatic_build_area && python pull_code_and_build_and_test.py

The path /usr/local/bin had to be added as py.test would not run without it (the path was discovered with the useful “which” command, as in “which py.test”). Furthermore, pytest seemed to need to be run with < /dev/null. (I have noticed that, despite its general awesomeness, pytest does have some strange quirks when it comes to general environment issues – the above for example, plus treatment of global variables).

Here is the script:

from optparse import OptionParser
import subprocess
import datetime

brief_output = False
all_lines = []

def runProcess(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True)
    while p.poll() is None:
        if not p.stdout.closed:
            data = p.communicate()[0]
            if data is not None:
                for line in data.split("\n"):
                    yield line

def run_shell_cmd(cmd, force_brief_output = False):
    if type(cmd) is str:
        cmd = [cmd]

    lines = ["Running: " + " ".join(cmd) + "\n"]
    print "".join(lines)
    for line in runProcess(cmd):
        if not brief_output and not force_brief_output:
            print line.replace("\n", "")
        lines.append(line + "\n")

    while not lines[-1] or lines[-1] == "\n":  # pop off trailing empty lines
        lines.pop()

    if not force_brief_output:
        all_lines.extend(lines)
    return lines

def pull_build_and_test(build_only, test_only):
    if build_only and not test_only:
        print "Build only - not pulling or testing"
    if not build_only and test_only:
        print "Test only - not pulling or building"
    if build_only and test_only:
        print "Build and test only - not pulling"

    pull_output = []
    if not build_only and not test_only:
        pull_output = run_shell_cmd("hg pull -u")

    if not build_only and not test_only and (not pull_output or pull_output[-1] == "no changes found\n"):
        print "No changes to repo from hg pull"
    else:
        if not (not build_only and test_only):  # skip the build in test-only mode
            make_clean_output = run_shell_cmd("make clean")
            make_output = run_shell_cmd("make")

            if not make_output or make_output[-1] != "-- Success --\n":
                print "Build failure!"
                # send an email here, for example with subject
                # "Failure at " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + " - Build failure"
                # and body "".join(pull_output + ["\n\n"] + make_output)
                return False
            print "Build success!  C++ engines all built"

        if not (build_only and not test_only):  # skip the tests in build-only mode
            pytest_output = run_shell_cmd("py.test -v < /dev/null")

            if not pytest_output or not "======================" in pytest_output[-1] or not " passed" in pytest_output[-1] \
                    or " failed" in pytest_output[-1] or " error" in pytest_output[-1]:
                print "Pytest failure!"
                # send an email here, for example with subject
                # "Failure at " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + " - Pytest failure"
                # and body "".join(pull_output + ["\n\n"] + pytest_output)
                return False
            print "Test success!  All tests have passed"
    return True

if __name__ == "__main__":
    all_lines = ["\n\n\n-----**---- Automatic build and test " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + "\n\n"]
    parser = OptionParser()
    parser.add_option("-b", "--build_only", dest="build_only", action="store_true", default=False)
    parser.add_option("-t", "--test_only", dest="test_only", action="store_true", default=False)
    parser.add_option("-l", "--less_output", dest="less_output", action="store_true", default=False)
    (options, args) = parser.parse_args()
    brief_output = options.less_output
    success = pull_build_and_test(options.build_only, options.test_only)
    all_lines.append("\n\n-------------------- Automatic build and test summary: success = " + str(success) +
                     " ------- Finished running " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + " -------------\n\n")
    open("automatic_build_and_test.log", "a").write("".join(all_lines))    # append results to the log file


Acknowledgments to this Stack Overflow solution for pointers on how to capture subprocess output as it’s running, although the above function is much more robust (doesn’t seem to fail from timing problems when there is multiline output etc).

An efficient and effective research environment

So, I would like to share the environment that I have created for the purposes of doing research. Specifically it is an environment that allows me to:

  • Gather research papers,
  • Comment on them in various ways, and review these comments at large,
  • Store this information in source control for the purposes of sharing between my machines, and
  • Write up and deal with ideas in a systematic fashion.

So, perhaps the first component of this system is the operating system. Happily, it doesn't much matter. I use Ubuntu, but I also use Windows 7. The great thing about this scheme is that it is adaptable to any environment that runs the tools (well, obviously), and the tools all have multi-platform support.

I personally find editing in vim nicer on Ubuntu, and there are one or two arguably minor things that Linux has that Windows does not (XMonad, for example), but I will elaborate on these later.

The general scheme

This approach is, obviously, tailored specifically for me, and given that I have a significant programming background, I am happy to solve some problems with actual programming. I also quite enjoy "systematic" approaches; i.e. things that may not necessarily be the most user friendly or easy, but that follow a specific and clear pattern that makes logical sense.

This approach may not suit everyone, but hopefully there are at least interesting and useful ideas here that you could adapt to your own situation.

The beginning – reference management

So, of course in order to gather research papers it is necessary to store them in a useful way. JabRef is free, and is a very nice option for this. I described my custom JabRef setup on my own personal blog a few months ago; please read that for how to do this.

One thing about using JabRef is that sometimes you need to correct the format that BibTeX exports give you. For example, one thing I am often changing is lists of authors like "Billy Smith, Joey Anderson" into "Billy Smith and Joey Anderson".

It's not immediately clear to me why the BibTeX entries are generated the wrong way by some of these sites, but nevertheless, this simple correction is necessary for the data to be stored properly, the authors to be picked up correctly, and so on.
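
Since this comes up often, here is a tiny Python sketch of that correction (note the assumption that authors are written "First Last, First Last" rather than "Last, First"):

import re

def fix_author_list(authors):
    # Turn "Billy Smith, Joey Anderson" into BibTeX's "Billy Smith and Joey Anderson".
    # Assumes "First Last" name format; it would mangle "Smith, Billy" style entries.
    return re.sub(r"\s*,\s*", " and ", authors.strip())

print(fix_author_list("Billy Smith, Joey Anderson"))   # Billy Smith and Joey Anderson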

Okay, now that you’ve read that, you understand that I save all my PDFs to a specific location, a folder like “…/research/reference library”. Where is this folder? It’s in the research repository.

The “research” repository

I keep all my research, on any topic, in one generic folder, called "research". This is a private git repository hosted on BitBucket.org. I chose Bitbucket over GitHub because Bitbucket has free unlimited-space private repositories, while GitHub's cost money. It is necessary for the research repository to be private for two reasons: one obvious one is that it contains paywall-restricted PDFs, and the other is that it's just not appropriate to have in-progress research notes viewable by anyone.

So, the general structure of my research repository is as follows:

~/research
    /articles
    /conferences
    /diary
    /jabref
    /reference library
    /projects
        /semigroups
        /...
    /quantum-lunch
    /...

The contents of these folders are as follows:

~/research/articles – Articles

This contains folders that map directly to papers that I’m trying to write (or more correctly at the moment, scholarships that I’m applying for, and misc notes). These are all, unsurprisingly, LaTeX documents that I edit with Vim.

When I complete an article, I create a "submitted" folder under the specific article, and put the generated PDF in there, but up until that time I only add the non-generated files to source control (this is the generally correct practice for any kind of source control: only "source" files are included; anything that is generated from the source files is not).

~/research/conferences – Notes on conferences

In here I have folders that map to the short conference name, for example "ACC" for Australian Control Conference. Under that I have the year, and within that I have my notes, and also any agendas I may have needed, to see what lectures I would attend. The notes should be in vimwiki format (I will describe this later) for easy/nice reading and editing.

~/research/diary – Research diary and general ideas area

This is the main place I work on day-to-day. It contains all my notes, and minutes from various meetings and lectures I attend. It contains a somewhat-daily research diary, and a list of current research ideas, past research ideas (that were bad, and reasons why) and so on.

My preferred note taking form is vimwiki (to be described below), so in here are purely vimwiki files.

It's not essential that you also use vim (and hence vimwiki), but it is appropriate that whatever mechanism you use, it is a format that is amenable to source control (i.e. allows nice text-based diffs). Emacs or any plain-text editor will be sufficient here.

~/research/jabref – Bibtex files

This is perhaps not the most appropriately named folder, but nevertheless, it contains all my .bib databases. I actually only have three. One is very inappropriately called "2010.bib", with the view that I would store research by the year I gathered it. I'm not following that approach any more; I actually just keep all my research related to quantum computing (and more general subjects) in here.

I have two other bib files. One is related to a secondary field that I am interested in researching. That is to say, in 2010.bib I have only documents related to quantum computing, theoretical physics and some theoretical computer science; I have a different .bib for research in completely separate fields, say investment. The other is "lectures.bib", and it is obvious what that contains.

It’s worth noting that I actually have two systems for storing lectures. One is the above, where the lecture set fits into a nice single PDF. The other is when the lecture series is split over several PDFs. These I store under a generic “University” repository that I use for my current studies (including assignments and so on). This component of my current setup needs work, and I’m open to suggestions here.

~/research/reference library – All the PDFs

Every PDF related to my research is stored in here, prefixed with the BibTeX key. So, for example, I have "Datta2005, 0505213v1.pdf". JabRef associates this with the relevant item in the .bib file by the prefix, and by virtue of this link in the .bib I have a trivial way to programmatically (if I'm so inclined) get this PDF.
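
As a small illustration, here is a Python sketch of that lookup (the folder path is an assumption that matches the layout above):

import glob
import os

REFERENCE_LIBRARY = os.path.expanduser("~/research/reference library")

def pdfs_for_key(bibtex_key):
    # Return every PDF in the reference library whose filename starts with the given BibTeX key.
    return glob.glob(os.path.join(REFERENCE_LIBRARY, bibtex_key + "*.pdf"))

print(pdfs_for_key("Datta2005"))   # e.g. ['.../Datta2005, 0505213v1.pdf']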

I don't ever browse this folder directly, and currently it contains ~600 PDFs and is about 1 GB in size. Storing this much data in a git repository may offend some people, but essentially they are wrong to be offended. It is okay to store binary files as long as they are not constantly changing, and they are considered a key component of the repository, which in this case they are.

There are some downsides to this, though, and I think it’s plausible to consider alternative arrangements.

The viable alternatives are:

  • Dropbox for PDFs,
  • Secondary git repository for PDFs only, and include as submodule, or
  • Some other non-local storage (say, the one offered by Zotero).

I dislike all of them, because I prefer to have everything together. I could see Dropbox being suitable, though, because technically it's not necessary to have versioned PDFs.

If you have any comments on this, please share them.

~/research/projects – Specific Projects

You'll notice I have one folder here, "semigroups". This relates directly to a research scholarship I completed at RMIT. It actually involved a Python program, which I have in this directory, as well as some miscellaneous files. It may be appropriate to have nicer codenames for projects, or to somehow relate them directly to the scholarship details. I think the best approach here is to have a codename, which is detailed in the "diary" folder; then there is no risk of confusion or duplicate names. The scholarship details could be held separately in the folder, because the work could perhaps be continued across scholarships.

Anyway, it's probably not necessary to overwork this structure. It can always be changed, and that shouldn't be prohibitively difficult.

~/research/quantum-lunch – Files related to my reading group

This folder is indicative of the other types of folders you may find in this directory. In here, I have some misc python scripts related to this group. There are no notes in here, they are kept in the “diary” folder.

Technically this should be a transition area, where scripts and programs that reach an appropriate level of maturity/usefulness are either published publicly (in a different repository) or moved to an appropriate folder under projects, but I've not yet gotten to that stage.

It's worth noting that I do have a public GitHub profile, silky, under which I will, and do, publish any tools that are worth making public. If one of these projects reaches that stage, I'd essentially move it out of here (delete it) and continue it in that area.

The tools

So, with the repository layout described, let me now discuss the tools I use. We've already covered JabRef, for reference management, so we have:

  • JabRef (as mentioned), for reference management,
  • Vim + Vimwiki plugin, for taking notes, keeping ideas, and writing LaTeX,
  • Okular, for reading PDFs, and annotating them [linux],
  • Python, for programming small scripts,
  • XMonad, for window management [linux], and
  • pdflatex and bibtex, for compiling latex (from the TeXLive distribution).

So, almost all of these are available on any platform. Okular is worth commenting on, because it has an important feature that I make use of – it stores annotations not in the PDF but in a separate file that can then be programmatically investigated. If you can't use Okular, then you may find that your annotations to PDFs are written back into the PDF itself, and it will be difficult to extract them. You can decide whether or not this bothers you when I describe how I use my annotations.

I will now describe the usage pattern for the various tools, in order from easiest to hardest.

Tools – Okular

So, install Okular via your favourite method, say "sudo apt-get install okular", and then open it. You will want to make it your default PDF viewer, and I also choose to have it be very minimal in its display: setting the toolbar to text only, hiding the menu, and hiding the list of pages on the left. I also configured a shortcut for exiting, namely pressing "qq".

For me this is indicative of an important principle – make small customisations that improve your life. It's worth thinking about, as they can often be trivial, but provide a nice noticeable benefit.

You will also want to enable the ‘Review’ toolbar. This allows you to highlight lines of interest, and also add comments. Your comments are saved in a location like:

~/.kde/share/apps/okular/docdata/[number].[pdf-name].xml

This is where it gets fun. I’ve written a program to capture these comments, as well as comments in the ‘Review’ field of the .bib file. This tool is available on my github: get-notes.

You may need to adjust the ‘main.conf’ to suit your needs, or even change the source in some fashion. The code is pretty trivial, but requires some python libraries that you can install with easy_install.
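
If you just want the gist of what such a script does, here is a minimal Python sketch (this is not the get-notes code itself) that walks Okular's docdata folder and prints whatever annotation text it can find; it assumes the annotation text sits in a 'contents' attribute somewhere in each XML file, which may differ between Okular versions:

import glob
import os
import xml.etree.ElementTree as ET

DOCDATA = os.path.expanduser("~/.kde/share/apps/okular/docdata")

for path in glob.glob(os.path.join(DOCDATA, "*.xml")):
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        continue   # skip anything that isn't well-formed XML
    for elem in root.iter():
        text = elem.get("contents")
        if text:
            print("%s: %s" % (os.path.basename(path), text))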

get-notes produces vimwiki output (you can trivially change this however you like, if you program in Python). I then symbolically link the generated file ("AutoGeneratedNotes.wiki") into my "~/research/diary". Of course, following the general strategy of not including generated files in source control, I do not commit this file. There is one perhaps obvious downside to this: the output might be different on different machines, because the ~/.kde/… folder is not under source control. I consider this acceptable, because this file is a "transitional" file, in that it is not supposed to be for long-term storage of ideas and comments.

The contents of this file should be reviewed occasionally, and then moved into either a PDF of comments, or into the research diary as an idea to investigate, or removed because you've dealt with it.

For example, I have a comment in the "Review" field of the file "Arora2002". It says: "Has a section on the Fourier Transform that might be interesting". This should eventually be transitioned into a minor topic to investigate further, or a small writeup in a private "Comments on things" LaTeX document, where you write up, slightly more formally and with maths, your thoughts on things you've learned. I have this document under my "articles" folder.

With this in mind, it is then not an issue that the generated output is different between machines, because ideally there will be no output on any machine once it has been sufficiently transitioned.

Tools – LaTeX

As indicated, I use LaTeX to write up maths and more detailed notes, proposals, applications, etc. You may wish to use some front-end for LaTeX authoring, for example LyX, but as I already do a lot of work in vim, I prefer to also do LaTeX in here. If I were to switch to another editor, it would probably be Emacs.

Tools – Python

As mentioned in the above comment, I use python to write small scripts. Because they’re in python, they are essentially directly runnable on any system (provided the associated packages can be installed).

I also like Python because it provides various packages that are actually useful for my research (like, for example, numpy). You can get similar functionality from free maths environments such as Octave, though.

Tools – XMonad

XMonad is not particularly necessary for this workflow, but I include it because I find its ease of use aids efficient reading and editing. I don't want to go into significant detail of XMonad configuration (though it's a fun way to spend your time); you may simply review my XMonad configuration on github.

What I like about it is the concept of focus. You can simply and easily make a PDF full screen, for distraction-free reading, and then switch things around to have vim side-by-side for commenting with context.

Feel free to disregard this, if you are using Windows, as it’s equally possible to do fullscreen and side-by-side editing in Windows 7. XMonad also offers other benefits for general programming, which is the main reason I have it.

Tools – Vim + Vimwiki + LaTeX Box

Essentially the last item in my setup is Vim. It's hard to express the level of obsession one develops for Vim after a while of using it. It is highly customisable, and includes an inbuilt help system, which I used all the time when initially learning it.

Most people will find Vim initially difficult to use (I did, when I first learned it when starting work here), but if you dedicate a few days to using it correctly, and you make significant use of the “:help [topic]” command, you will get the hang of it.

You aren't truly using Vim correctly (or, indeed, living a full life) if you don't get various plugins. The necessary ones for LaTeX + note taking are Vimwiki and LaTeX Box or Vim-LaTeX.

You can find the current state of my Vim configuration, again on my github – .vim

I actually currently use Vim-LaTeX, but I am planning on changing to LaTeX Box because it is more lightweight, so I would recommend starting with LaTeX Box.

The nice thing about using okular is that you can recompile your LaTeX document with the PDF open, and it will refresh it, keeping the current page open. This is very useful when typing long formulas, and reviewing your work.

I have configured Vimwiki to open with "rw", so I can type this at any time in Vim and be looking at the index to all my research notes. In this I have links to all my diaries, my storage spots for old research ideas, and a big list of topics to look into. I also make "TODO" notes in here, and review them with one of my other tools, "find-todo" (on the aforementioned github, under /utils). This gives me a list inside Vim, and I can easily navigate to the appropriate file. Again, the TODOs are items that should be transitioned.
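
To give the flavour of what find-todo does, here is a rough Python sketch (not the actual tool): scan the wiki files for TODO lines and print them with file name and line number, assuming the notes live under ~/research/diary.

import os

NOTES_DIR = os.path.expanduser("~/research/diary")

for dirpath, dirnames, filenames in os.walk(NOTES_DIR):
    for name in filenames:
        if not name.endswith(".wiki"):
            continue
        path = os.path.join(dirpath, name)
        for lineno, line in enumerate(open(path), 1):
            if "TODO" in line:
                print("%s:%d: %s" % (path, lineno, line.strip()))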

Review

I have documented my research environment, as it stands currently. It allows me to make notes easily, transition them in an appropriate workflow, and access all my documents at any time, from any computer.

The proof of a good research environment obviously isn't in the blogging, it's in the producing of legitimately good research output, and of course that's yet to be delivered (by myself personally), so it's not possible to objectively rate this strategy for its actual effectiveness. Nevertheless, I do feel comfortable with this layout; I feel like I can take the appropriate amount of notes; I feel my notes are always tracked, and I feel that I have a nice and readable history of what I've done. I like that I can track bad ideas; I like that I can make comments "anywhere" (i.e. in Okular or in JabRef) and have them captured automatically for later review, and I like the feeling of having everything organised.

I hope this description has been useful, and I would love to hear about any adjustments you’d propose, or just your own research strategies.

— Noon

Selecting a cell in an HTML table with jQuery

I wanted to get (and set) a cell value in a table using jQuery, selecting it by row and column index (its x and y position in the table, if you like). Frustratingly, I couldn't find a simple one-liner to do it: I could find code that gets all the cells, or gets them by ID (but I don't want to have to make explicit IDs for each cell of my table), or gets all the values in a given column (close, and it did lead me in the right direction).

So here are two similar ways to select a given cell, where row and col are my x, y positions, and my_table is the ID of the table:

$('#my_table tr:eq(' + row + ') td').eq(col)

and

$('#my_table tbody tr:eq(' + row + ') td:eq(' + col + ')')


PostgreSQL Tricks

Over the last year or so I’ve gathered a little trick bag of PostgreSQL recipes. Here are some of the best.

These all work in PostgreSQL version 8.4.7 and probably most other versions. The "--" below are PostgreSQL comments.

To see your PostgreSQL version, use:

select version();

It’s sometimes useful to be able to discover the tables in a schema automatically. To do this you can use the following command:

>>> \dt my_schema.* -- here the * is a wildcard

Or this command, which uses the PostgreSQL internal tables pg_class and pg_namespace:

>>> select n.nspname, c.relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where n.nspname='my_schema';

Where “my_schema” is a schema name. Once you’ve found the table you’re interested in, \d table_name gives you the columns and their types.

To discover the columns in a table (that is in a schema):

>>> select a.attname from pg_class as c, pg_namespace as n, pg_attribute as a where n.oid=c.relnamespace and n.nspname='my_schema' and c.relname='my_table' and a.attrelid=c.oid;

Where “my_table” is the table name in your schema. This uses the PostgreSQL internal tables pg_class, pg_namespace and pg_attribute.

To discover the type of a column in a table (that is in a schema)

>>> select a.attname, t.typname, a.atttypid from pg_class as c, pg_namespace as n, pg_attribute as a, pg_type as t where n.oid=c.relnamespace and n.nspname='my_schema' and c.relname='my_table' and a.attrelid=c.oid and a.attname='my_column' and t.oid=a.atttypid;

Where “my_column” is the column name in the table. This is similar to the previous command, but also uses internal table pg_type.

If you have some duplicate rows in your table, you can delete them using the hidden ctid column, e.g.:

>>> select ctid, * from my_table; -- show me the repeats

>>> delete from my_table where CAST("ctid" as text) like '%3%'; -- kill some according to some wildcard pattern

A more automatic way is:

>>> delete from my_table where ctid not in (select max(dup.ctid) from my_table as dup group by dup.x, dup.y, dup.z); -- substitute appropriate column names for x, y, z

Allowable formats for dates are controlled by the PostgreSQL setting ‘datestyle’. To see it, type:

show datestyle;

To change the datestyle you can use (for example):

set datestyle to 'iso, dmy';

But note that this change is not a permanent change (it only applies to this session/cursor).

Lastly, I sometimes use this command:

table my_table;

as a shorthand for select * from my_table.

Pre-emptive optimisation

For one of our long-standing clients we have been running vehicle routing optimisations on a daily basis. A file of daily orders is uploaded into our Workbench system, and is split up into several regions, each of which needs to be separately optimised. A planner works through each region, going through a series of data checks (e.g. location geocode checking), before hitting an “Optimise” button.

All of the heavy lifting is done on a server (the planner accesses it through a web app via a browser), so it's possible for the server to silently start up the required optimisations without the planner's involvement; in the (fairly common) case where the region does not require any data corrections, by the time the planner is up to the optimisation stage the result can be immediately available (as it has already been run, or is in the process of being run). This idea has now been implemented and took only a small amount of Python code.

Furthermore, it runs in parallel as each optimisation is itself split into a separate child process (running a C++ exe) which Linux distributes across the 8 cores of our server machine.
The pre-emptive optimisations are kicked off using Python’s multiprocessing package, as follows:

from multiprocessing import Process

p = Process(target=start_preemptive_optimisations, args=(…))

p.start()

Control is returned to the user at this point while the optimisations run in the background. Results are stored in a folder whose name is stored in a database table; when the planner then comes to press Optimise, the system checks if there have been any data corrections – if so, it runs the optimisation from scratch as usual (the pre-emptive result for that region is thus never referenced); however, if there are no corrections, the system simply uses the stored result from the folder.

The end result for the user is that in many cases the optimisations appear to run almost instantaneously. There are really no downsides as we do not pay for our servers on a CPU cycle basis, so we can easily be wasteful of our server CPU time and run these optimisations even if their results are sometimes not needed.

One "wrinkle" we discovered with this is that we had to make our process checking more robust. There is Javascript in our browser front end that polls for certain events, such as an optimisation finishing, which is indicated by a process ceasing to exist. The Python for this is shown below, where "pid" is a process ID. The function returns True if the given process has finished.
import multiprocessing
import os

def check_pid_and_return_whether_process_has_finished(pid):
    if pid and pid > 0:
        multiprocessing.active_children()   # reap all zombie children first; this also seems to pick up non-children processes
        try:
            os.waitpid(pid, os.WNOHANG)     # reaps zombies that are child processes, giving them a chance to report their final return value to the OS
        except OSError as e:
            if int(e.errno) != 10:  # errno 10 (ECHILD) means pid is not a child process; in that case do nothing and let os.kill below be the call that raises and returns True (process finished)
                return True
        try:
            os.kill(pid, 0)   # doesn't actually kill the process, but raises an OSError if the process pid does not exist; this indicates the process is finished.  Applies to all processes, not just children.
        except OSError as e:
            return True
    else:
        return True
    return False

Note the “reaping” of zombie processes here, and the complication that arises if a process is not a direct child of the calling Python process (it might be a “child of a child”). In this case we use a call to the (in this case rather mis-named) function os.kill.

Spatial Clustering in C++ (post 5 of 5) – The final outputs

In this fifth and final post of the series on Spatial Clustering in C++, we’ll look at some of the final outputs produced and wrap up the discussion of techniques and codes used.

In the last post I presented some code that can be used to generate an Alpha Shape from a set of points, create it as an OGR Polygon, and output these polygons to MID/MIF files. I’ve since found that one of the functions used in that code, GetNextRing, is quite inefficient when the polygon has a complex perimeter (many points on its boundary), which happens with very small values for alpha. Still it wouldn’t be too hard to speed that code up.

It's interesting to generate some different Alpha Shapes by varying the alpha parameter. Recall that alpha corresponds to how tightly the polygon is "shrink wrapped" around the points it encloses; it is actually the radius of the circles that define the "bites" taken out around the perimeter of the shape. If alpha is too high, one gets a very approximate shape that starts to approach the convex hull of the points, while if alpha is too low, the shape starts to get very "spiky", and eventually one starts to lose whole parts of the polygon completely. The good news is that the polygons produced are not too visually dissimilar if you stay within reasonable bounds for alpha – that is, they are not overly sensitive to its value.

Let’s have a look with the San Francisco bay area as our example. Here I’m using an eps in the DBSCAN of 0.009 with minPts = 8, which gives the following clusters:

And for the Alpha Shapes, an alpha of 2500 gives the following:

In the next picture you can see how the shapes represent the points (note also the hole in the large green cluster – the Alpha Shapes code we’re using doesn’t allow for actual holes within the shapes produced):

Here I’ve used Bing Aerial in MapInfoPro to give the shapes produced some spatial context, and I’ve used 33% transparency on the polygon layer (use Layer Properties). Incidentally you can use Style Overrides to give patterned output as well, such as:

With alpha 5000, the results are very similar, though some of the “holes” along the polygon boundaries have now been “filled in”:

The colours of the shapes are different here just because of the funky unique RGB generator function I’m using which depends on the creation index of the polygon; the clusters are the same, only alpha is different.

Even with alpha = 20000, the shapes are still good visual approximations of the points:

With 50000 the shapes are still OK but some big holes have now been filled in:

And with alpha 5000000 one starts to get nearer to the convex hull:

With smaller values of alpha, such as 1000, the shapes are similar to alpha 2500. Below that, however, the Alpha Shapes algorithm starts to produce multiple, very spiky (many points defining a complex boundary) polygons for each cluster, such as this example with alpha 500:

These anemic shapes no longer approximate the points very well.

With larger data sets just showing the clustered points gets slower, so you could potentially also make a geoset and use zoom levels, with polygons shown at higher zoom levels, then show the clustered points when zoomed in closer.

In terms of output, I’ve found Tab2tab is a useful little program to convert between Mid/Mif and Tab formats, though it is not without its problems: older versions have a loss of precision, and current versions appear to lose the colours of the polygons (regions in the parlance). One could potentially also write out the polygons or clusters using OGR functions directly in the output format desired. And if you don’t have MapInfoPro, QGIS is a good open source alternative.

If you’ve made it this far, then I hope you’ve enjoyed this five part blog post. I’ll leave you with a summary of the tools and algorithms that have been deployed in this exercise:
  • TIGER spatial data – Source for free US spatial data. In this exercise, used as a data source of points for spatial clustering.
  • MapInfo – Display and edit spatial data. Commercial product, but very good.
  • kdtree – Data structure for storing 2 dimensional points. Allows fast lookup of nearby points. Good C++ implementation here.
  • OGR – Library for computational geometry. Good utility functions for reading and writing points and polygons.
  • DBSCAN – Algorithm for spatial clustering. Simple to implement; previous blog posts in this series show how to make this run really fast.
  • Convex Hull – Shape containing a set of points. Not a good approximation to spatial clusters, but easy and fast to compute e.g. with OGR functions.
  • Clarkson code for Alpha Shapes – Old style C code for producing Alpha Shapes (polygons representing sets of points). Slightly dodgy code, doesn't produce polygons with holes inside, but runs OK if called as a black box executable.
  • Tab2Tab – Map data conversion utility. Older versions lose precision; new versions lose polygon style/colour information.
  • QGIS (Quantum GIS) – Open source alternative to MapInfoPro. Slower and not as feature rich as MapInfoPro.

Spatial Clustering in C++ (post 4 of 5) – Generating Alpha Shapes

In previous posts we saw how to run the DBSCAN spatial clustering algorithm over a set of spatial data points: in our example, all block centroids in California. We also saw how to make it run really quickly. In this post we’ll investigate a way to turn those clusters into polygons. These polygons will represent our clusters in an approximate way, but will be fast to display and provide a visually compelling way to appraise the clusters we’ve created.

One way to get a polygon from a set of points is to form the Convex Hull. However, for our clusters, convex hulls, though easily generated using OGR’s ConvexHull() on a MultiPoint object, do not look so great – they cover over large swathes of what we’d intuitively think of as the “shape” describing the points. An example from a real cluster produced by DBSCAN is shown below:

In this example, the light green coloured cluster would be badly represented by a convex hull (which is shown in pink):

Another example can be found in Part 1 of this blog series, for the largest cluster in the picture.

Unfortunately there’s no unambiguous way to wrap a set of points in a non-convex polygon (that is, there’s no such thing as a “bounding polygon” – think about it…). However, we can use something called Alpha Shapes, which have a parameter (alpha) which describes “how concave” we’d like our shape to go (intuitively, how tightly to “shrink wrap” our points). Good descriptions of Alpha Shapes are provided here, here and here, and in the context of spatial data, here.

For the above example, with a suitably chosen alpha (in this case, 2500), the alpha shape looks like the following:

There’s not a lot of Alpha Shape C++ code around and it’s not something you’d probably want to code up yourself unless you’re very interested in computational geometry and have lots of spare time. One source is Ken Clarkson’s code. This is old-style C code which I initially made the mistake of trying to integrate into my own code as a library, eschewing its command line interface. Unfortunately this meant making the code (which is full of global variables) re-entrant; it turns out the code contains subtle crashes and can generate unexpected results if repeatedly run over many inputs. So keeping it as a “black box” and calling it as an exe from my own code seems like the only option – however it does work well in that way, though it is somewhat slower as it has to invoke an exe for every polygon to be created.

Another shortcoming of this particular code is that the alpha shapes produced do not include any “holes”.

For my alpha I’ve chosen 2500, and I’ve also used a multiplier in the command line to reduce the risk of numerical instability with points that are very close to each other (latitude/longitude coordinates going down to 6 or more decimal places).

Another minor issue to note is that when your data includes more than one data item “stacked” at the same spatial point (for example if the data has several points representing the distinct addresses within an apartment building or suchlike), then in the worst case if such stacked points are isolated, you can end up with a cluster that only occupies a single point and hence doesn’t get processed as a polygon.

For very large clusters (bigger than say, 40000 or so) one could potentially also reduce the time for the Alpha Shape calculation by putting a grid over the points and taking only a sample of points in each grid cell.
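
As a sketch of that idea (in Python here for brevity – the code in this series is C++): bucket the points into cells of a chosen size and keep one representative per cell before handing them to the alpha shape step.

def sample_on_grid(points, cell_size):
    # points: iterable of (x, y) pairs; returns at most one point per grid cell.
    representatives = {}
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        representatives.setdefault(cell, (x, y))   # keep the first point seen in each cell
    return list(representatives.values())

print(sample_on_grid([(0.10, 0.10), (0.12, 0.11), (1.30, 0.90)], 0.5))   # two cells, so two points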

In the next post we’ll see the results of running the Alpha Shapes code on our clusters. For now, here is the code that calls the Clarkson executable. Note that it contains a Windows-centric function which is not cross-platform, RunExeFromCommandLine.

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <vector>
#include <set>
#include <map>

#include <windows.h>  // for SHELLEXECUTEINFO etc

#include "ogr_geometry.h"  // OGR geometry classes (OGRPolygon, OGRLinearRing, OGRPoint); adjust to your GDAL/OGR install
#include "node.h"

using namespace std;

bool RunExeWithCommandLine(wstring wstrExelocation, wstring wstrCommand)
{
    SHELLEXECUTEINFO lpExecInfo;
    lpExecInfo.cbSize = sizeof(SHELLEXECUTEINFO);
    lpExecInfo.lpFile = wstrExelocation.c_str();
    lpExecInfo.fMask = SEE_MASK_DOENVSUBST | SEE_MASK_NOCLOSEPROCESS;
    lpExecInfo.hwnd = NULL;
    lpExecInfo.lpVerb = L"open";
    lpExecInfo.lpParameters = wstrCommand.c_str();
    lpExecInfo.lpDirectory = NULL;
    lpExecInfo.nShow = SW_SHOWMINNOACTIVE;
    lpExecInfo.hInstApp = (HINSTANCE)SE_ERR_DDEFAIL;

    BOOL fSuccess = ShellExecuteEx(&lpExecInfo);
    if (lpExecInfo.hProcess != NULL)
    {
        ::WaitForSingleObject(lpExecInfo.hProcess, INFINITE);
        ::CloseHandle(lpExecInfo.hProcess);
    }
    return fSuccess == TRUE;
}

OGRLinearRing* GetNextRing(vector<pair<int, int> >& rgptsPoly, const vector<Node*>& rgpts, int first, set<int>& new_points)
{
    int rgpt_index = first;
    OGRLinearRing* ring = NULL;
    for (;;)
    {
        int from = -1, to = -1;
        for (vector<pair<int, int> >::iterator it = rgptsPoly.begin(); it != rgptsPoly.end(); it++)
        {
            if (it->first == rgpt_index)
            {
                from = it->first;
                to = it->second;
                *it = pair<int, int>(-1, -1);
                break;
            }
            if (it->second == rgpt_index)
            {
                from = it->second;
                to = it->first;
                *it = pair<int, int>(-1, -1);
                break;
            }
        }
        if (from == -1)
            return NULL;

        OGRPoint point(rgpts[from]->second, rgpts[from]->first);
        if (ring == NULL)
            ring = new OGRLinearRing;
        ring->addPoint(&point);
        rgpt_index = to;
        new_points.insert(from);
        new_points.insert(to);
        if (to == first)
            break;
    }
    return ring;
}

int GetFirstNonZero(vector<pair<int, int> >& rgptsPoly)
{
    for (vector<pair<int, int> >::const_iterator it = rgptsPoly.begin(); it != rgptsPoly.end(); it++)
        if (it->first >= 0)
            return it->first;
    return -1;
}

int GetRings(vector<pair<int, int> >& rgptsPoly, const vector<Node*>& rgpts, vector<OGRLinearRing*>& rings)
{
    set<int> setStartingPoints;
    setStartingPoints.insert(GetFirstNonZero(rgptsPoly));
    for (;;)
    {
        if (GetFirstNonZero(rgptsPoly) < 0)
            return rings.size();

        set<int> new_points;
        for (set<int>::const_iterator it = setStartingPoints.begin(); it != setStartingPoints.end(); it++)
        {
            OGRLinearRing* ring = GetNextRing(rgptsPoly, rgpts, *it, new_points);
            if (ring != NULL && ring->getNumPoints() >= 3)
                rings.push_back(ring);
        }
        if (new_points.size() == 0)
        {
            new_points.insert(GetFirstNonZero(rgptsPoly));
            if (new_points.size() == 0 || *(new_points.begin()) < 0)
                break;
        }
        setStartingPoints.clear();
        setStartingPoints.insert(new_points.begin(), new_points.end());
    }
    return rings.size();
}

int GetPolygonBoundary(const vector<Node*>& rgpts, int nCluster, vector<OGRPolygon*>& rgNewPolys)
{
    wstring wstrPathForTempFiles = L"C:/temp/";
    wstring wstrTempFileInput = wstrPathForTempFiles + L"sites";
    wstring wstrTempFileOutput = wstrPathForTempFiles + L"sites-alf";
    wstring wstrCommandLineArgs = L"-A -oN -i" + wstrTempFileInput + L" -aa2500 -m10000";

    ofstream os;
    os.open(wstrTempFileInput.c_str(), ios::trunc);
    if (!os.is_open())
        return 0;
    for (vector<Node*>::const_iterator it = rgpts.begin(); it != rgpts.end(); it++)
        os << setprecision(9) << (*it)->second << " " << (*it)->first << endl;
    os.close();

    if (RunExeWithCommandLine(L"C:/Temp/hull/Release/addcl.exe", wstrCommandLineArgs))
    {
        vector<pair<int, int> > rgptsPoly;
        ifstream in;
        in.open(wstrTempFileOutput.c_str());
        if (!in.is_open())
            return 0;

        char blank[500];
        in.getline(blank, 500); // ignore first line
        while (!in.eof())
        {
            int n1, n2;
            in >> n1 >> n2;
            rgptsPoly.push_back(pair<int, int>(n1, n2));
        }
        if (rgptsPoly.size() == 0)
            return 0;

        vector<OGRLinearRing*> rings;
        if (rgptsPoly.size() <= 2)
            return 1;
        if (GetRings(rgptsPoly, rgpts, rings) <= 0)
            return 0;

        int nTotalNumberOfPointsInRings = 0;
        for (vector<OGRLinearRing*>::const_iterator it = rings.begin(); it != rings.end(); it++)
            nTotalNumberOfPointsInRings += (*it)->getNumPoints();
        // Note: if nTotalNumberOfPointsInRings < (int)rgptsPoly.size(), then GetRings only processed
        //       nTotalNumberOfPointsInRings out of rgptsPoly.size() points.  This is OK, as sometimes
        //       the shape includes a line (polygon with 2 points), which we'll generally ignore.

        for (vector<OGRLinearRing*>::const_iterator it = rings.begin(); it != rings.end(); it++)
        {
            OGRPolygon* poly = new OGRPolygon;
            poly->addRing(*it);
            if (poly->getExteriorRing()->getNumPoints() >= 3)
                poly->closeRings();
            rgNewPolys.push_back(poly);
        }
        return rgNewPolys.size();
    }
    return 0;
}

Note that I do not consider the code above to be particularly good; I was still trying to gain an understanding of the output of Clarkson’s alpha shape code as I was developing it, and it shows. Anyhow, it has the useful property of working fairly robustly.

The above code is called with code like the following:

vector<OGRPolygon*> rgNewPolys;
map<int, vector<Node*> > mapNodesInEachCluster;

for (vector<Node*>::const_iterator it = vecNodes.begin(); it != vecNodes.end(); it++)
    if ((*it)->m_nCluster >= 0)
        mapNodesInEachCluster[(*it)->m_nCluster].push_back(*it);

for (map<int, vector<Node*> >::const_iterator it = mapNodesInEachCluster.begin(); it != mapNodesInEachCluster.end(); it++)
    if (GetPolygonBoundary(it->second, it->first, rgNewPolys) <= 0)
        cout << "Failure with " << it->second.size() << " points" << endl;

WritePolygonsToMidMifFile("C:/Temp/Polygons", rgNewPolys);

Where the WritePolygonsToMidMifFile function is:

bool WritePolygonsToMidMifFile(string strFileName, vector<OGRPolygon*>& polys)
{
    ofstream osFileMid, osFileMif;
    if (!OpenMidMifAndWritePreamble(strFileName, osFileMid, osFileMif))
        return false;

    int i = 0;
    for (vector<OGRPolygon*>::const_iterator it = polys.begin(); it != polys.end(); it++, i++)
    {
        osFileMid << "i" << endl;
        osFileMif << "Region  1" << endl
                  << "    " << (*it)->getExteriorRing()->getNumPoints() << endl;
        for (int n = 0; n < (*it)->getExteriorRing()->getNumPoints(); n++) // first and last point are the same
        {
            OGRPoint pt;
            (*it)->getExteriorRing()->getPoint(n, &pt);
            osFileMif << pt.getX() << " " << pt.getY() << endl;
        }
        int rgb = (40 + ((i * 13) % 215)) * 255 * 255 + (40 + ((i * 4573) % 215)) * 255 + (40 + (i % 215));
        osFileMif << " Pen (1,2,0) " << endl
                  << "    Brush (" << 2 << "," << rgb << ")" << endl;
    }
    osFileMif.close();
    osFileMid.close();
    return true;
}

We’ll see some of the outputs produced in the next and final post in this series, where we’ll also look at the effect of different values of alpha.

Spatial Clustering in C++ (post 3 of 5) – A Speedier DBSCAN

Welcome to part 3 of this 5 part blog post on Spatial Clustering using C++. In the two previous posts we read in some map data (Californian census block centroids) and performed some clustering on the points using the DBSCAN algorithm. In this post we’ll investigate how to make the code run faster. This is important if you want this type of algorithm to run in reasonable time over a very large data set (e.g. all of the USA), or you want to run it repeatedly (e.g. to investigate different parameters).

In this investigation I’ll employ some of the techniques for code optimisation that I keep in my mental kit bag, namely: using raw arrays instead of container objects; pre-calculate and cache values wherever possible; use statics to avoid memory overheads; “pool” up frequently re-allocated objects.

But first, the golden rule: always profile. The first thing I’ll do is surround the call to DBScan() with a simple time clocking:

time_t tmStartTime;
time(&tmStartTime);

int nClusters = RunDBScan();

time_t lTimeNow;
time(&lTimeNow);
time_t tsDuration = lTimeNow - tmStartTime;
cout << tsDuration << " seconds to run DBScan" << endl;
The version of DBSCAN presented in the previous post runs in about 86 seconds, so this is the benchmark we start from (note that this is in the Debug build configuration).

Next I do my "poor man's profiling" – in Visual Studio, I just hit Debug / Break when the main algorithm is running. Breaking a few times gives a picture of where the time is being spent the most. [Incidentally, I love this method of profiling, as it's just so quick, given that it doesn't require any external tools or specialised performance builds, yet it still often enables you to optimise 90% of the worst bottlenecks in the code.]

In this case I find the majority of time is spent allocating memory when pushing items onto the STL vector rgpNeighbourhood2 inside ExpandCluster / GetNodesInRadius.

I tried to use vector’s “reserve” function to allocate memory in larger chunks. However, it’s not clear what the best chunk size to use would be; in fact experimenting with different values can dramatically affect the overall performance. For my test data set, I found the best value was 25, and this reduced the running time to 60 seconds.

Another similar strategy when using STL vectors is to use static and clear; replace the declaration of the vector with the following code:

static vector<Node*> rgpNeighbourhood2;

rgpNeighbourhood2.clear();

vector::clear does not change the vector’s capacity in this version of STL (with MS Visual Studio 8), so the vector is made only once, and it is re-allocated presumably only a handful of times (whenever it’s required to get larger than it previously was). This method is better than using reserve, as it doesn’t require any arguments or parameters, and it is faster too – only 53 seconds.

Of course one can also fall back on good old raw C arrays instead of using vectors. I haven’t posted the code for this, but it reduces the time further to 39 seconds. Using raw C arrays does really impact the readability and maintainability of the code and would need to be wrapped up in some nice class wrapper or suchlike in an industrial strength application.

Once this is all done, the bottleneck is now inside kdtree, specifically where it allocates and de-allocates its result objects (objects of type kdres). Looking into the code in this area, I saw the allocation guarded with #defines: USE_LIST_NODE_ALLOCATOR. Looking at the documentation, it advocates switching this tag on. In Windows you will also need to define NO_PTHREADS and comment out the line #ERROR that warns about the code no longer being thread-safe.

Using the list node allocator reduces the time down to an astounding 15 seconds. Finally, running the code in Release instead of Debug reduces the time down to 4 seconds – not bad considering the first naive version took nearly a minute and a half to run! Profiling now shows most time is spent inside kdtree functions to return the nearest nodes, and there’s not much we can do about that.

Summarising what we’ve found:

86 seconds for the first version of the algorithm

60 seconds with reserve of 25 on rgpNeighbourhood2

53 seconds with static instead of reserve

39 seconds with raw C arrays instead of STL vectors

15 seconds with kdtree list_node_allocator switched on (and raw C arrays)

4 seconds in Release configuration (and raw C arrays)

And the lessons learnt (as always, it seems) from optimising code:

  • always measure timings – bottlenecks are very often not where you’d think they would be
  • it’s often sufficient to use very simple profiling
  • it’s frequently surprising how fast you can eventually make the code run

Now that we have our DBSCAN running blazingly fast, in the next post we'll see how to create Alpha Shapes that give the output more visual appeal.