
A build and test script in Python

I’ve recently created a script in Python for continuous get-build-test.

It pulls the latest code from a Mercurial repository, builds a bunch of C++ projects, then runs pytest.

The script demonstrates a few simple, but tasty Python techniques including:
– parsing command line flags using optparse
– using subprocess to run shell commands from Python, and capturing the output as it runs
– archiving the build and test results to a log file
– scraping the last line of output of hg pull, py.test etc as a simple (albeit fragile) way to detect success / failure

I’ve set up a cron job to run this every hour. It only actually does anything if there is changed code from the hg pull.

The cron job is set up with crontab -e and the file looks like:

SHELL=/bin/bash
PATH=/usr/bin:/bin:/usr/local/bin
0 * * * * cd /vol/automatic_build_area && python pull_code_and_build_and_test.py

The path /usr/local/bin had to be added as py.test would not run without it (the path was discovered with the useful “which” command, as in “which py.test”). Furthermore, pytest seemed to need to be run with < /dev/null. (I have noticed that, despite its general awesomeness, pytest does have some strange quirks when it comes to general environment issues – the above for example, plus treatment of global variables).

Here is the script:

from optparse import OptionParser
import subprocess
import datetime

brief_output = False
all_lines = []

def runProcess(cmd):
    # Run a shell command, yielding its output one line at a time.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True)
    while p.poll() is None:
        if not p.stdout.closed:
            data = p.communicate()[0]
            if data is not None:
                for line in data.split("\n"):
                    yield line

def run_shell_cmd(cmd, force_brief_output = False):
    if type(cmd) is str:
        cmd = [cmd]

    lines = ["Running: " + " ".join(cmd) + "\n"]
    print "".join(lines)
    for line in runProcess(cmd):
        if not brief_output and not force_brief_output:
            print line.replace("\n", "")
        lines.append(line + "\n")

    while not lines[-1] or lines[-1] == "\n":  # pop off trailing empty lines
        lines.pop()

    if not force_brief_output:
        all_lines.extend(lines)
    return lines

def pull_build_and_test(build_only, test_only):
    if build_only and not test_only:
        print "Build only - not pulling or testing"
    if not build_only and test_only:
        print "Test only - not pulling or building"
    if build_only and test_only:
        print "Build and test only - not pulling"

    if not build_only and not test_only:
        pull_output = run_shell_cmd("hg pull -u")

    if not build_only and not test_only and (not pull_output or pull_output[-1] == "no changes found\n"):
        print "No changes to repo from hg pull"
    else:
        if not (not build_only and test_only):  # build unless we are in test-only mode
            make_clean_output = run_shell_cmd("make clean")
            make_output = run_shell_cmd("make")

            if not make_output or make_output[-1] != "-- Success --\n":
                print "Build failure!"
                # send an email, for example: "Failure at " + datetime.datetime.now().strftime("%d %b %Y %H:%M")
                #     + " - Build failure" with body "".join(pull_output + ["\n\n"] + make_output)
                return False
            print "Build success!  C++ engines all built"

        if not (build_only and not test_only):  # test unless we are in build-only mode
            pytest_output = run_shell_cmd("py.test -v < /dev/null")

            if not pytest_output or not "======================" in pytest_output[-1] or not " passed" in pytest_output[-1] \
                    or " failed" in pytest_output[-1] or " error" in pytest_output[-1]:
                print "Pytest failure!"
                # send an email, for example: "Failure at " + datetime.datetime.now().strftime("%d %b %Y %H:%M")
                #     + " - Pytest failure" with body "".join(pull_output + ["\n\n"] + pytest_output)
                return False
            print "Test success!  All tests have passed"
    return True

if __name__ == "__main__":
    all_lines = ["\n\n\n-----**---- Automatic build and test " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + "\n\n"]
    parser = OptionParser()
    parser.add_option("-b", "--build_only", dest="build_only", action="store_true", default=False)
    parser.add_option("-t", "--test_only", dest="test_only", action="store_true", default=False)
    parser.add_option("-l", "--less_output", dest="less_output", action="store_true", default=False)
    (options, args) = parser.parse_args()
    brief_output = options.less_output
    success = pull_build_and_test(options.build_only, options.test_only)
    all_lines.append("\n\n-------------------- Automatic build and test summary: success = " + str(success) + \
        " ------- Finished running " + datetime.datetime.now().strftime("%d %b %Y %H:%M") + " -------------\n\n")
    open("automatic_build_and_test.log", "a").write("".join(all_lines))    # append results to the log file

 

Acknowledgments to this Stack Overflow solution for pointers on how to capture subprocess output as it’s running, although the above function is much more robust (doesn’t seem to fail from timing problems when there is multiline output etc).
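
As an aside, a common alternative way to stream a child process’s output line by line is to read from its pipe directly. The sketch below is illustrative only (the function name is made up, and it is not part of the script above):

import subprocess

def stream_output(cmd):
    """Yield lines from a shell command as they are produced."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True)
    for line in iter(p.stdout.readline, ""):   # "" signals end-of-file on the pipe
        yield line.rstrip("\n")
    p.wait()   # collect the exit status once the output is exhausted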

An efficient and effective research environment

So, I would like to share the environment that I have created for the purposes of doing research. Specifically it is an environment that allows me to:

  • Gather research papers,
  • Comment on them in various ways, and review these comments at large,
  • Store this information in source control for the purposes of sharing between my machines, and
  • Write up and deal with ideas in a systematic fashion.

So, perhaps the first component of this system is: what operating system? Happily, it doesn’t exactly matter. I use Ubuntu, but I also use Windows 7. The great thing about this scheme is that it is adaptable to any environment that runs the tools (well, obviously) and the tools all have multi-platform support.

I personally find editing in vim nicer on Ubuntu, and there are one or two arguably minor things that Linux has that Windows does not (XMonad, for example), but I will elaborate on these later.

The general scheme

This approach is, obviously, tailored specifically for me, and given that I have a significant programming background, I am happy to solve some problems with actual programming. I also quite enjoy “systematic” approaches; i.e. things that may not necessarily be the most user-friendly or easy, but ones that follow a specific and clear pattern that makes logical sense.

This approach may not suit everyone, but hopefully there are at least interesting and useful ideas here that you could adapt to your own situation.

The beginning – reference management

So, of course in order to gather research papers it is necessary to store them in a useful way. JabRef is free, and is a very nice option for this. I described my custom JabRef setup on my own personal blog a few months ago; please read that to see how to do this.

One thing to note about using JabRef is that sometimes you need to correct the format that BibTeX exports give you. For example, one thing I often have to change is lists of authors like “Billy Smith, Joey Anderson” to “Billy Smith and Joey Anderson”.

It’s not immediately clear to me why the BibTeX entries are generated the wrong way by some of these sites, but nevertheless, this simple correction is necessary for the data to be stored properly, the authors to be picked up correctly, etc.

Okay, now that you’ve read that, you understand that I save all my PDFs to a specific location, a folder like “…/research/reference library”. Where is this folder? It’s in the research repository.

The “research” repository

I keep all my research, on any topic, in one generic folder, called “research”. This is a private git repository hosted on Bitbucket.org. I chose Bitbucket over GitHub because Bitbucket has free unlimited-space private repositories, while GitHub’s cost money. It is necessary for the research repository to be private for two reasons: one obvious one is that it contains paywall-restricted PDFs, and the other is that it’s just not appropriate to have in-progress research notes viewable by anyone.

So, the general structure of my research repository is as follows:

~/research
    /articles
    /conferences
    /diary
    /jabref
    /reference library
    /projects
        /semigroups
        /...
    /quantum-lunch
    /...

The contents of these folders are as follows:

~/research/articles – Articles

This contains folders that map directly to papers that I’m trying to write (or more correctly at the moment, scholarships that I’m applying for, and misc notes). These are all, unsurprisingly, LaTeX documents that I edit with Vim.

When I complete an article, I create a “submitted” folder under the specific article, and put the generated PDF in there; up until that time I only add the non-generated files to source control (this is the generally correct practice for any kind of source control: only “source” files are included, and anything that is generated from the source files is not).

~/research/conferences – Notes on conferences

In here I have folders that map to the short conference name, for example “ACC” for Australian Control Conference. Under that, I have the year, and within that I have my notes, and also any agendas that I may have needed to see which lectures I would attend. The notes should be in vimwiki format (I will describe this later) for easy/nice reading and editing.

~/research/diary – Research diary and general ideas area

This is the main place I work on day-to-day. It contains all my notes, and minutes from various meetings and lectures I attend. It contains a somewhat-daily research diary, and a list of current research ideas, past research ideas (that were bad, and reasons why) and so on.

My preferred note taking form is vimwiki (to be described below), so in here are purely vimwiki files.

It’s not essential that you also use vim (and hence vimwiki), but it is appropriate that whatever mechanism you use, it is a format that is amenable to source control (i.e. allows nice text-based diffs). Emacs or any plain-text editor will be sufficient here.

~/research/jabref – Bibtex files

This is perhaps not the most appropriately named folder, but nevertheless. It contains all my .bib databases. I actually only have 3. One is very inappropriately called “2010.bib”, with the view that I would store research by the year I gathered it. I’m not following this approach and I actually just keep all my research related to quantum computing (and more general subjects) in here.

I have two other bib files; one is related to a secondary field that I am interested in researching. That is to say, in 2010.bib I have only documents related to quantum computing, theoretical physics and some theoretical computer science; I have a different .bib for research in completely separate fields, say investment. The other is “lectures.bib”, and it is obvious what that contains.

It’s worth noting that I actually have two systems for storing lectures. One is the above, where the lecture set fits into a nice single PDF. The other is when the lecture series is split over several PDFs. These I store under a generic “University” repository that I use for my current studies (including assignments and so on). This component of my current setup needs work, and I’m open to suggestions here.

~/research/reference library – All the PDFs

Every PDF related to my research is stored in here, prefixed with the BibTeX key. So, for example, I have “Datta2005, 0505213v1.pdf”. JabRef associates this with the relevant item in the .bib file by the prefix, and by virtue of this link in the .bib I have a trivial way to programmatically (if I’m so inclined) get this PDF.
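
As a small illustration of that (a hypothetical sketch, not a tool I use), finding the PDF for a given BibTeX key is just a filename-prefix lookup:

import glob
import os

LIBRARY = os.path.expanduser("~/research/reference library")

def pdf_for_key(bibtex_key):
    """Return the path of the first PDF whose filename starts with the given BibTeX key."""
    matches = glob.glob(os.path.join(LIBRARY, bibtex_key + "*.pdf"))
    return matches[0] if matches else None

print pdf_for_key("Datta2005")   # e.g. ".../reference library/Datta2005, 0505213v1.pdf"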

I don’t ever browse this folder directly, and currently it contains ~600 PDFs and is about 1 gig in size. Storing this much data in a git repository may offend some people, but essentially they are wrong to be offended. It is okay to store binary files as long as they are not constantly changing, and they are considered a key component of the repository; which in this case they are.

There are some downsides to this, though, and I think it’s plausible to consider alternative arrangements.

The viable alternatives are:

  • Dropbox for PDFs,
  • Secondary git repository for PDFs only, and include as submodule, or
  • Some other non-local storage (say, the one offered by Zotero).

I dislike all of them because I prefer to have everything together. I could see Dropbox being suitable, because technically it’s not necessary to have versioned PDFs.

If you have any comments on this, please share them.

~/research/projects – Specific Projects

You’ll notice I have one folder here, “semigroups”. This relates directly to a research scholarship I completed at RMIT. This actually involved a Python program, which I have in this directory, as well as some miscellaneous files. It may be appropriate to have nicer codenames for projects, or somehow relate them directly to the scholarship details. I think the best approach here is to have a codename, which is detailed in the “diary” folder, and then there is no risk of confusion or duplicate names. The scholarship details could be held separately in the folder, because perhaps the work could be continued across scholarships.

Anyway, it’s probably not necessary to overwork this structure. It can always be changed, and doing so shouldn’t be prohibitively difficult.

~/research/quantum-lunch – Files related to my reading group

This folder is indicative of the other types of folders you may find in this directory. In here, I have some misc python scripts related to this group. There are no notes in here, they are kept in the “diary” folder.

Technically this should be a transition area, where scripts and programs that reach an appropriate level of maturity/usefulness are either published publicly (in a different repository), or moved to an appropriate folder under projects, but I’ve not yet gotten to that stage.

It’s worth noting that I do have a public GitHub profile: silky, under which I will, and do, publish any tools that are worth making public. If one of these projects reaches that stage, I’d essentially move it out of here (delete it) and continue it in that area.

The tools

So, with the repository layout described, let me now discuss the tools I use. We’ve already covered JabRef, for reference management, so we have:

  • JabRef (as mentioned), for reference management,
  • Vim + Vimwiki plugin, for taking notes, keeping ideas, and writing LaTeX,
  • Okular, for reading PDFs, and annotating them [linux],
  • Python, for programming small scripts,
  • XMonad, for window management [linux], and
  • pdflatex and bibtex, for compiling latex (from the TeXLive distribution).

So, almost all of these are available on any platform. Okular is worth commenting on, because it has an important feature that I make use of – it stores annotations not in the PDF but in a separate file, which can then be programmatically investigated. If you can’t use Okular, then you may find that your annotations to PDFs are written back into the PDF itself, and it will be difficult to extract them. You can decide whether or not this bothers you when I describe how I use my annotations.

I will now describe the usage pattern for the various tools, starting in order from easiest to hardest.

Tools – Okular

So, install Okular via your favourite method, say “sudo apt-get install okular”, and then open it. You will want to make it your default PDF viewer, and I also choose to have it be very minimal in its display; setting the toolbar to text only, hiding the menu, and hiding the list of pages on the left. I also configured a shortcut for exiting, namely pressing “qq”.

For me this is indicative of an important principle – make small customisations that improve your life. It’s worth thinking about, as they can often be trivial, but provide a nice noticeable benefit.

You will also want to enable the ‘Review’ toolbar. This allows you to highlight lines of interest, and also add comments. Your comments are saved in a location like:

~/.kde/share/apps/okular/docdata/[number].[pdf-name].xml
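
As an illustration only (this is not the actual tool described below, just a minimal sketch, and it assumes the annotation text sits in “contents” attributes of those XML files), pulling the comments out programmatically might look like:

import glob
import os
from xml.etree import ElementTree

DOCDATA = os.path.expanduser("~/.kde/share/apps/okular/docdata")

def okular_comments():
    """Yield (file name, comment text) pairs found in Okular's docdata annotation files."""
    for xml_path in glob.glob(os.path.join(DOCDATA, "*.xml")):
        for element in ElementTree.parse(xml_path).getroot().iter():
            text = element.get("contents")   # assumption: annotation text is stored here
            if text:
                yield os.path.basename(xml_path), text

for name, comment in okular_comments():
    print name + ": " + comment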

This is where it gets fun. I’ve written a program to capture these comments, as well as comments in the ‘Review’ field of the .bib file. This tool is available on my github: get-notes.

You may need to adjust the ‘main.conf’ to suit your needs, or even change the source in some fashion. The code is pretty trivial, but requires some python libraries that you can install with easy_install.

This tool produces vimwiki output (you can trivially change this however you like, if you program in Python). I then symbolically link the generated file (“AutoGeneratedNotes.wiki”) into my “~/research/diary”. Of course, following the general strategy of not including generated files in source control, I do not break this rule for this file. There is one perhaps obvious downside to this: the output might be different on different machines, because the ~/.kde/… folder is not under source control. I consider this acceptable, because this file is a “transitional” file, in that it is not supposed to be for long-term storage of ideas and comments.

The contents of this file should be reviewed, occasionally, and then moved into either a PDF of comments, or into the research diary as an idea to investigate, or removed because you’ve dealt with it.

For example, I have a comment in the “Review” field of the file “Arora2002”. It says: “Has a section on the Fourier Transform that might be interesting”. This should, eventually, be transitioned into a minor topic to investigate further, or a small writeup in a private “Comments on things” LaTeX document, where you write up, slightly more formally, and with maths, your thoughts on things you’ve learned. I have this document under my “articles” folder.

With this in mind, it is then not an issue that the generated output is different between machines, because ideally there will be no output on any machine, once it has been sufficiently transitioned.

Tools – LaTeX

As indicated, I use LaTeX to write up maths and more detailed notes, proposals, applications, etc. You may wish to use some front-end for LaTeX authoring, for example LyX, but as I already do a lot of work in vim, I prefer to also do LaTeX in here. If I were to switch to another editor, it would probably be Emacs.

Tools – Python

As mentioned in the above comment, I use python to write small scripts. Because they’re in python, they are essentially directly runnable on any system (provided the associated packages can be installed).

I also like Python because it provides various packages that are actually useful for my research (like, for example, numpy). You can get similar functionality from free maths environments, though, such as Octave.

Tools – XMonad

XMonad is not particularly necessary for this workflow, but I include it because I find its ease of use aids in efficient reading and editing. I don’t want to go into significant detail of XMonad configuration (but it’s a fun way to spend your time); you may simply review my XMonad configuration on GitHub.

What I like about it is the concept of focus. You can simply and easily make a PDF full screen, for distraction-free reading, and then switch things around to have vim side-by-side for commenting with context.

Feel free to disregard this, if you are using Windows, as it’s equally possible to do fullscreen and side-by-side editing in Windows 7. XMonad also offers other benefits for general programming, which is the main reason I have it.

Tools – Vim + Vimwiki + LaTeX Box

Essentially the last item in my setup is Vim. It’s hard to express the level of obsession one has for Vim, after a while of using it. It is highly customisable, and includes an inbuilt help system, which I used all the time when initially learning it.

Most people will find Vim initially difficult to use (I did, when I first learned it when starting work here), but if you dedicate a few days to using it correctly, and you make significant use of the “:help [topic]” command, you will get the hang of it.

You aren’t truly using Vim correctly (or, indeed, living a full life), if you don’t get various plugins. The necessary ones for LaTeX + note taking are: Vimwiki and LaTeX Box or Vim-LaTeX.

You can find the current state of my Vim configuration, again on my github – .vim

I actually currently use Vim-LaTeX, but I am planning on changing to LaTeX Box because it is more lightweight, so I would recommend starting with LaTeX Box.

The nice thing about using okular is that you can recompile your LaTeX document with the PDF open, and it will refresh it, keeping the current page open. This is very useful when typing long formulas, and reviewing your work.

I have configured Vimwiki to open with “rw”, so I can type this at any time, in Vim, and be looking at the index to all my research notes. In this I have links to all my diaries, my storage spots for old research ideas, and a big list of topics to look into. I also make “TODO” notes in here, and review them with one of my other tools, “find-todo” (on the aforementioned GitHub, under /utils). This gives me a list inside Vim, and I can easily navigate to the appropriate file. Again, the TODOs are items that should be transitioned.

Review

I have documented my research environment, as it stands currently. It allows me to make notes easily, transition them in an appropriate workflow, and access all my documents at any time, from any computer.

The proof of a good research environment obviously isn’t in the blogging; it’s in the production of legitimately good research output, and of course that’s yet to be delivered (by myself personally), so it’s not possible to objectively rate this strategy for its actual effectiveness. Nevertheless, I do feel comfortable with this layout; I feel like I can take the appropriate amount of notes; I feel my notes are always tracked, and I feel that I have a nice and readable history of what I’ve done. I like that I can track bad ideas; I like that I can make comments “anywhere” (i.e. in Okular or in JabRef) and have them captured automatically for later review, and I like the feeling of having everything organised.

I hope this description has been useful, and I would love to hear about any adjustments you’d propose, or just your own research strategies.

— Noon

Pre-emptive optimisation

For one of our long-standing clients we have been running vehicle routing optimisations on a daily basis. A file of daily orders is uploaded into our Workbench system, and is split up into several regions, each of which needs to be separately optimised. A planner works through each region, going through a series of data checks (e.g. location geocode checking), before hitting an “Optimise” button.

All of the heavy lifting is done on a server (the planner accesses it through a web app via a browser), so it’s possible for the server to silently start up the required optimisations without the planner’s involvement. In the (fairly common) case where the region does not require any data corrections, by the time the planner is up to the optimisation stage the result can be immediately available (as it has already been run, or is in the process of being run). This idea has now been implemented, and took only a short amount of Python code.

Furthermore, it runs in parallel as each optimisation is itself split into a separate child process (running a C++ exe) which Linux distributes across the 8 cores of our server machine.
The pre-emptive optimisations are kicked off using Python’s multiprocessing package, as follows:

from multiprocessing import Process

p = Process(target=start_preemptive_optimisations, args=(…))

p.start()

Control is returned to the user at this point while the optimisations run in the background. Results are stored in a folder whose name is stored in a database table; when the planner then comes to press Optimise, the system checks if there have been any data corrections – if so, it runs the optimisation from scratch as usual (the pre-emptive result for that region is thus never referenced); however, if there are no corrections, the system simply uses the stored result from the folder.

The end result for the user is that in many cases the optimisations appear to run almost instantaneously. There are really no downsides as we do not pay for our servers on a CPU cycle basis, so we can easily be wasteful of our server CPU time and run these optimisations even if their results are sometimes not needed.

One “wrinkle” we discovered with this is that we had to make our process checking more robust. There is JavaScript in our browser front end that polls for certain events, such as an optimisation finishing, which is indicated by a process ceasing to exist. The Python for this is shown below, where “pid” is a process ID. The function returns True if the given process has finished, and False otherwise.

import multiprocessing
import os

def check_pid_and_return_whether_process_has_finished(pid):
    if pid and pid > 0:
        multiprocessing.active_children()   # reap all zombie children first; this also seems to pick up non-children processes
        try:
            # this reaps zombies that are child processes, as it gives these processes a chance
            # to output their final return value to the OS.
            os.waitpid(pid, os.WNOHANG)
        except OSError as e:
            if int(e.errno) != 10:   # 10 (ECHILD) indicates pid is not a child process; in that case we want to do
                return True          # nothing and let os.kill below be the call that throws and returns True.
        try:
            # doesn't actually kill the process, but raises an OSError if the process pid does not exist;
            # this indicates the process is finished.  Applies to all processes, not just children.
            os.kill(pid, 0)
        except OSError:
            return True
    else:
        return True
    return False

Note the “reaping” of zombie processes here, and the complication that arises if a process is not a direct child of the calling Python process (it might be a “child of a child”). In that case we use a call to the (here rather mis-named) function os.kill.

Database vs pure python performance

I was just looking through some of our common code and noticed some maths code in Python that I knew could be replaced by a function in our geo-database, PostGIS. However, I wondered what the performance difference would be between calling out to some C code via the database and doing a simple bit of maths in Python. The code has no loops, is simple and should be very fast.
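
For context (the actual common.lat_long_distance isn’t shown here), a typical pure-Python great-circle calculation is just a handful of trig calls; the sketch below is a hypothetical stand-in, assuming the arguments are (lon1, lon2, lat1, lat2) as the example coordinates in the timings suggest:

import math

def lat_long_distance(lon1, lon2, lat1, lat2):
    """Haversine great-circle distance in km between two (longitude, latitude) points."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))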

It turns out the database approach is around 70X slower…


>>> timeit.Timer('common.lat_long_distance_db(db_wrap, 144.865906, 144.865237, -37.836183, -37.831359 + random.random())',
         "from __main__ import common, db_wrap, random").timeit(10000)
2.001802921295166
>>> timeit.Timer('common.lat_long_distance(144.865906, 144.865237, -37.836183, -37.831359 + random.random())',
         "from __main__ import common, random, db_wrap").timeit(10000)
0.02936482429504394

I would suggest this isn’t the right way to use the DB! 🙂

If we write a function where we just select a constant value from the database, we can find the call overhead of the database calls and remove the PostGIS processing and parsing cost.


timeit.Timer('lat_long_distance_db_just_select(db_wrap, 144.865906, 144.865237, -37.836183, -37.831359 + random.random())',
         "from __main__ import db_wrap, random, lat_long_distance_db_just_select").timeit(10000)
1.4460480213165283

Large percentage of overhead.

This seems to suggest that I should look into the difference between selecting out the data, applying the Python function, then updating the DB with the results. I suspect applying the PostGIS functions to data that is already in the DB, instead of looping in Python, will be best, but empirical data is enlightening!

Cross-platform development

During the course of developing Biarri’s flagship Workbench product, we’ve taken pains to ensure that our (GUI-less) optimisation “engines” work well under both Windows and Linux operating systems (so-called cross-platform). This turns out to be relatively easy as long as you stay away from the big OS-specific frameworks (e.g. Microsoft’s MFC/COM/ATL etc). We’ve picked up some handy tips along the way, particularly applicable to C++ development, which are worth sharing here.

  • Be aware of differences in line endings – Windows uses carriage return and line feed \r\n, while Linux/Unix uses just line feed \n. (Note that Visual Studio will show files with Linux line feeds correctly, but Notepad won’t – this is one way to tell what line endings your file has in Windows). This can be particularly important when importing data e.g. into databases where the file originates from another OS.
  • Always use forward slashes for file paths, not backslashes. Also, file names and folder paths are case sensitive under Linux but not under Windows. And don’t assume there is a C: or D: drive!
  • You may have to be careful writing to temporary files and folders. In Linux /tmp is often used; in Windows /[user]/AppData/local/temp (location of the TEMP environment variable – e.g. type “%TEMP%” into the start menu or Windows Explorer). For Linux, it is sometimes necessary to manipulate a folder’s “sticky bit” to ensure that the folder is accessible by other users (e.g. a Postgres database user) – e.g. in Python:
os.chmod(temp_dir_name, os.stat(temp_dir_name).st_mode | stat.S_ISVTX | stat.S_IRGRP | stat.S_IROTH | stat.S_IWGRP | stat.S_IXOTH)
  • Be aware of the differences in file permissions in Windows and Linux. In Linux files have an “executable” bit. chmod a+x [file] makes a file executable, which can then be run with “./filename”.

For C++ development:

  • Name all .cpp and .h files in lower case if possible. Filenames are case sensitive in Linux, and this includes #includes!
  • For compiling with GCC under Linux, a C++ file should end with a newline (GCC will complain about a missing newline at the end of the file).
  • In Linux C++ programs, general exception handling with catch(…) does not catch hardware faults such as segmentation violations. You can use signal handlers instead (see this for example), though it’s not as good – it is more equivalent to an exit(), with a chance to clean up.
  • Beware of comparing doubles for equality or inequality, at least in C++ programs. A == B may not hold in both Windows and Linux even when the values are essentially the same number, so always compare against a delta, i.e. use fabs(A - B) < epsilon.
  • Build tips for Linux: Type “make” when you are in the directory to build the project. This will search for a file called “Makefile” and run it. (Use “make -f filename” to make from a different makefile). To force a recompile you can “touch” a file using “touch filename”.
    To clean out all object files type “make clean” (as long as your make file defines what cleaning does…). Use “make -j4” to run make with four concurrent jobs, to take advantage of multiple cores.
  • In bash, to get a recursive line count of .cpp/.h files: find [directory] -type f \( -name '*.cpp' -o -name '*.h' \) -exec wc -l {} \; | awk '{total += $1} END {print total}'

Biarri Workbench Technology Stack

Over the course of developing our Workbench solution we’ve adopted a powerful set of interconnecting components. It’s worth mentioning what these are and how they fit together.

Almost all the components of the stack are free and/or open source. We want to be as platform independent as possible and not get too locked in to one technology paradigm. This means that parts should be as “hot swappable” as possible – which also helps encourage strong componentisation. Using components with mature and open/standardised interfaces is very necessary when you’re crossing language boundaries (most notably, JavaScript-Python and Python-C++) and client/server boundaries; otherwise you risk re-inventing the wheel. Ideally each component we use should also still be in active development (in the IT world – with the odd highly venerable exception – if software is not growing and evolving, it’s usually either dying, already in its death throes, or extinct).

There’s an art to using the right tool for the job, and we’ve made mistakes. We over-used Mako (see Loki’s blog post) and also originally used a slightly inferior lib for the C++ xmlrpc back end; both these mis-steps were fairly easily rectified. Arguably, we probably still use too much C++ and not enough Python – the C++ line count dwarfs the Python line count by a considerable margin. One last interesting point is that, at the moment, we’re still eschewing use of an ORM (Object Relational Mapping layer – such as SQL Alchemy) – time will tell whether that is a good idea or not.

Client:

  • JavaScript – client-side browser language
  • jQuery – JavaScript library for event handling and more
  • CloudMade – OSM map provider/server

Data Interchange:

  • JSON – JavaScript Object Notation
  • XML – eXtensible Markup Language

Mathematical Engines (mostly C++ using the STL):

  • CppUnit – C++ library for unit testing
  • OGR – map data library, part of GDAL – used to read map data
  • libxmlrpc-c – C++ back end for XML-RPC – used by a running process to communicate with the front end via Python

Server:

  • Python – language
  • CherryPy – Python-based web app framework
  • PostgreSQL – open source RDBMS
  • Psycopg2 – database adaptor for PostgreSQL/Python
  • xmlrpclib – Python XML-RPC library (used to communicate with some of the engines)
  • Mako – Python template library
  • Repoze – Zope/WSGI Python middleware (for authentication)

An IDE for Python

For some time now I have been trying to find a decent IDE with step-through debugging support for Python. I’ve wanted it for Linux, but Windows support would also be a bonus.

There’s some debate about the need for an IDE for Python, which (as a veteran of C++ development with Visual Studio) I am still pondering. I get that Python is a higher level language (umm, that’s why I’m using it), but the central problem of ironing out the kinks in the business/engine logic of my code is never going to go away. It really makes me wonder what size and types of code bases the IDE/debugger naysayers are building.

People also talk about Eclipse with PyDev, but I’m deterred by the reputedly formidable learning curve, the reportedly sluggish performance, and the apparent bloat of it. I wanted something lighter, but still free. And I didn’t want something that would require a big project hierarchy, settings tweaks, etc., just to run a small Python script. I don’t think my needs are outlandish: easy things should be easy, hard things should be possible…

This comparison of Python IDEs – the first hit on Google – seems good but is 5 years old (ancient in software development terms). And the Wikipedia comparison table is just the basic facts, ma’am. The Python.org list of IDEs is better, but without some sort of detail or commentary it’s difficult to figure out what’s best for your needs, and parts of it are out of date. Mile long feature lists are all well and good, but how well do the features do what they’re supposed to do?

So I embarked on a trial of a few of the free IDEs out there. First stop was SPE – “Stani’s Python Editor”… which I couldn’t get to install. I know, I know, in Linux Land you’ve got to be prepared to tinker in the innards with spanner in hand… but a frustrating hour or two later, no go. Perhaps because this tool doesn’t seem to be actively developed any more (as I found out afterwards). Next I tried Boa Constructor. It installed, and first impressions were cautiously positive, though it felt like beta software to me. Sure enough, after trying to use it in anger the pain points came – I couldn’t figure out the rhyme or reason for why it wouldn’t just run the Python script I had open, I had to constantly restart, breakpoints weren’t always respected, etc. Overall it seems more aimed at GUI building than running scripts.

Next was the Eric IDE. Eric installs with just a simple “apt-get install eric”. The Python file you have open runs with “Start/Debug Script…” Breakpoints and stepping through just work (in fact, debugging is an absolute breeze). Lines with Python parse errors get a little bug icon on them in the editor margin (cute, but also handy). It’s not perfect by any means – it takes a while to start up, it occasionally automatically breaks at places where you have no breakpoints, etc. Its GUI is formidable at first glance, but it’s set out logically and should seem familiar to those, like me, steeped in Visual Studio. It’s also still being very actively developed.

One interesting aspect of Eric’s editor is that it uses rich text with proportionally spaced fonts, in contrast to the plain monospaced font that most code editors sport. This might seem sacrilegious to some, but it seems to work fine for me, and in fact lets me see more code on the screen. It’s not so good for ASCII art though.

Obviously I haven’t tried the trial versions of the commercial IDEs out there – Komodo Edit, WingWare, etc. – and I’d be curious what “killer features” (that work out of the box!) they have that Eric doesn’t. But for now, the journey’s over, with a clear winner.

Choosing a web framework

In the ongoing evolution of the workbench I’ve solved some interesting problems and found some hard ones I haven’t solved yet. I also made some poor choices and later found more elegant solutions. All this is a learning process and I’d like to share some of those findings in this series. I’ll cover templates, web framework choices and making your own widgets.

Templates:

Originally I searched for the most elegant and expressive templating language for what I needed to do and found Mako. It’s fast, good for simple interpolation and some basic control structures. However, after trying to write even one of our basic workflows in it, I was less enthused. I needed a range of slightly dodgy hacks to get what I wanted out.

For instance, I wanted to have a Mako function put a piece of text (JavaScript) in a buffer and output it later (in the head section). I found a way of doing this:


<%!
    def string_buffer(fn, target_buffer='html_buf'):
        def decorate(context, *args, **kw):
            context['attributes'][target_buffer]  = runtime.capture(context, fn, *args, **kw)
            return ''        
        return decorate
%>

<%def name="doc_ready_js()" decorator="partial(string_buffer, target_buffer='doc_ready_buf')">
    ${caller.body()}

<%def name="main_js()" decorator="partial(string_buffer, target_buffer='main_js_buf')">
    ${caller.body()}

Needing to use decorators, partial function application and weird context/global variable magic for quite simple features freaked me out enough and convinced me to hunt for a better solution. I wanted the simple to be easy and the hard to be possible. The template was making the simple hard. I looked at a pile of templating languages and I still liked Mako the most out of them, but decided after reading this wonderful rant by Tavis Rudd that the solution is obvious: do simple stuff with templates if you need to, but mostly avoid them.

The clearest way to demonstrate the advantages of a pure Python solution is with code; both sections of code below generate the same simplified workflow.

The Mako:


<%namespace name="zui" file="zui.mako"/>
<%namespace name="importer" file="importer.mako"/>
<%namespace name="widgets" file="widgets.mako"/>

<%inherit file="workflow_wrapper.mako"/>
<%def name="workflow()">
    <%
        current_workflow_id = attributes['get_new_id']()
        attributes['workflow_steps'][current_workflow_id] = [[],[],[]]
        grid = attributes['get_new_id']()
        map = attributes['get_new_id']()
    %>

    <%call expr="importer.importer({'name':'text', 'address':'text'}, ['name', 'address'], r'''
    Here you need to select a CSV file with address and a name for the locations
    you want to geocode. Names have to be unique.''', table_name,  'import_done()')">

    ## events are run on the client side so are javascript
    <%call expr="widgets.run_engine_step('Geocode',
                                         {'onclick':'''calculate_button('geocode_addresses',
                                         geocode_results);'''},
                                         workflow_icon='geocode')">

    <%call expr="zui.workflow_step('output', 'View Results')">
        ${widgets.grid_holder(table_name, grid)}
        ${widgets.map_holder(map)}
    

    <%zui:main_js>
        function import_done(){
            console.log("run when import is finished");
        }

        function geocode_results(){
            ${widgets.grid_init(grid, table_name, '')}
        }
    

The Python:


import zui, widgets, importer, layout

class Geocoder(zui.Workflow):
    def __init__(self, the_zui, table):
        zui.Workflow.__init__(self, the_zui, table)
        importer.Importer(self, {'name':'text', 'address':'text'}, ['name', 'address'], r'''
            Here you need to select a CSV file with address and a name for the locations
            you want to geocode. Names have to be unique.''', self.table_name,
            'Geocoder.import_done()')

        ## events are run on the client side so are javascript
        widgets.RunEngineStep(self, 'Geocode',
                {'onclick':'''calculate_button('geocode_addresses',
                    Geocoder.geocode_results);'''}, workflow_icon='geocode')

        grid = widgets.Grid(self.table_name)
        cm_map = widgets.Map()

        self.workflow_step('output', 'View Results', container_size = "container-results1", body =
                layout.multi_column('', grid, cm_map))

        self.main_js(
            '''
            function import_done(){
                console.log("run when import is finished");
            }
            function geocode_results(){
                '''+grid.load()+'''
            }
            ''')

It is so much easier to be expressive and elegant when you minimise the usage of templates. You can clean things up by using some of the expressive power of a multi-paradigm language and not have to do anything too strange. I still see the use for templates when you need to interpolate a few variables into a long piece of HTML. The built-in Python template facility seems fine for this (a quick example is below). In my next post I’ll cover web frameworks and what I think about widgets. You should be able to see from today’s bit of code that I think they are important and hard to do with templates.
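
For that kind of simple interpolation, the standard library’s string.Template (presumably the built-in facility referred to above) is enough; for example:

from string import Template

snippet = Template("<div id='$target'><h1>Hello, $name!</h1></div>")
print snippet.substitute(target="main", name="Loki")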

Loki

p.s.

If you feel like relaxing after reading this, check out my band’s YouTube channel. 🙂

SQLite

I recently needed to organise and filter a heap of data from a new client. I didn’t want to deal with the overhead of a full-blown database and decided to try sqlite3. As it turns out, it was really easy to work with since the bindings are included with Python 2.6. All I needed to do was read a bit on how to interface Python with SQLite here: http://docs.python.org/library/sqlite3.html#module-sqlite3 and I was good to go! I also installed a nice database management utility called SQLite Database Browser v2.0b1, which you can get here: http://sqlitebrowser.sourceforge.net. It makes managing the structure of the database a bit easier than working in a Windows command prompt, and you can write SQL on-the-fly if you’re having some problems with your Python. I find that it is pretty stable (though some of my poorly written SQL queries do send it into a tizzy and I need to kill it and reopen it).
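
For anyone who hasn’t used it, the interface really is minimal. A small sketch (the table and file names here are made up for illustration):

import sqlite3

conn = sqlite3.connect("client_data.db")   # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, address TEXT, weight REAL)")
cur.execute("INSERT INTO orders (address, weight) VALUES (?, ?)", ("1 Example St", 42.0))
conn.commit()
for row in cur.execute("SELECT address, weight FROM orders WHERE weight > ?", (10.0,)):
    print row
conn.close()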

Now, I’ve decided to create an SQLite database and integrate it with the Excel front-end for one of our solvers (used when clients require desktop deployment). I anticipated that integration with an SQL database would greatly simplify and speed up the reporting (with the added bonus of a significant reduction in the need for me to write complex VBA code). Initially I banged around, getting really frustrated with Excel and DAO (even after I installed the ODBC driver available here: http://www.ch-werner.de/sqliteodbc/). Then I discovered SQLite for Excel here: http://sqliteforexcel.codeplex.com/. Whew! So far, I have found it very easy to work with and I am busy completing my reporting tool.

XSLT vs Python/LXML

I recently had to adapt some old XSLT which read in an XML document and did some transformations to turn it into CSV data. For those who don’t know what XSLT is (consider yourself lucky!), it is a declarative, XML-based transformation language usually used for transforming a source XML document into a destination XML document.

Now the XSLT I was dealing with also had some Javascript and VBScript functions inside it, and after struggling with it for a while I eventually realised – to my horror, as I needed it to run cross-platform on both Windows and Linux – that it also incorporated some Microsoft-specific extensions. So I ditched the XSLT and switched to writing it from scratch in Python with the LXML library instead. Less than 2 hours later, to do the exact same task – including thorough error checking – the Python turned out to be 205 lines, while the XSLT was 714 lines.
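
To give a flavour of the lxml side (this is a made-up sketch, not the actual converter, and the element names are hypothetical):

import csv
from lxml import etree

def orders_xml_to_csv(xml_path, csv_path):
    """Flatten <order> elements from an XML file into CSV rows."""
    tree = etree.parse(xml_path)
    with open(csv_path, "wb") as f:   # "wb" for the Python 2 csv module
        writer = csv.writer(f)
        writer.writerow(["id", "address"])
        for order in tree.findall(".//order"):
            writer.writerow([order.get("id"), order.findtext("address", "")])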

For a long time I wondered if it was just me, that I was too much of a procedural, C++ thinker, and just didn’t “get” XSLT. XSLT is supposed to be a purpose-built tool for the job, right? Well, I’ll say now what I’ve always secretly thought, namely that XSLT is obtuse and horrible and I’ll steer clear of it forever. And evidently I’m not alone.