Wednesday 31 December 2014

Finding one's way around Buenos Aires in 1870: a proof-of-concept for routing with OpenHistoricalMap data

A critical point about using OpenStreetMap technologies in OpenHistoricalMap is that we should get lots of useful tools for free.

Plaza 25 de Mayo
Plaza de Mayo 25, Buenos Aires in the 1860s
Source: Wikimedia Commons CC-BY-SA

We touched on this point during our end-of-year Google Hangout. In particular, Karl Grossner wanted to know more about how one might use the data for routing. Karl is part of the team behind the awesome ORBIS project at Stanford, which allows routing across Europe during the Roman Empire.

Tuesday 30 December 2014

From Mapping Trees to Tree Trails: some thoughts

The other day I engaged in a Twitter conversation with Oliver Pescott, a biologist at the Centre for Ecology & Hydrology who is very active in promoting citizen science. I was intrigued to see that he had created a map of interesting street trees in the centre of Sheffield using Google Maps (and see also the follow-up blog post).

Naturdenkmal 567 GuentherZ 2010-08-25 0129 Wien01 Rathauspark Platane
London Plane in Rathausgarten, Vienna
Probably this one.
Source: Wikimedia Commons
The chat prompted me to look at displaying information about street trees from OpenStreetMap using OverpassTurbo. Although I've mapped trees when I can, it's fairly arduous work if one wants to be reasonably complete. However, in several cities we have good quality tree data which has been imported from tree registers held by the local authority:
  • London Borough of Southwark. This was organised by Tom Chance, with a particular view to providing information on urban foraging. See his blog and a map about this.
  • Vienna. The Vienna OSM community imported a file with a large number of street and park trees for the Land. (I have a minor quibble about this because they didn't cross check against already mapped trees, and I'd added a very fine Scholar's Tree in the Rathausgarten which is now duplicated).
  • Bologna. A similar import which needs a bit of tidying up on the tagging.
These data are useful because they contain information over and above the species of tree: collecting things like girth is too much for the regular mapper at the moment.

I've played with the Vienna data before so it was a good place to see what could be done with Overpass.

Oak trees (Genus Quercus) in Vienna on OpenStreetMap via OverpassTurbo
Species are colour coded using MapCSS as follows: green=robur; blue=petraea; red=rubra; orange=cerris; cyan=pubescens; purple=frainetto; small yellow dots with blue border are not identified to species.
Now this is very useful for the sort of things I want to do: either visit particular trees to learn how to identify them, or to look for insects and fungi associated with that species. For instance a few years ago I found galls of a fairly newly arrived gall wasp (Andricus grossulariae), and ended up collecting the galls for research by Graham Stone's team at Edinburgh University. This required knowing where I was going to find Turkey Oaks reasonably local to me. I have therefore been interested in fairly comprehensive maps of individual species.

Oliver's interest is one of education.

Providing a selection of interesting trees can really help people get started in learning more about the trees (and other wildlife) around them. Having a large dataset, such as an entire city's tree register, is far too much for this purpose. Indeed it might be too much for the person wanting to create a tree trail or even a curated list of interesting trees: it's not just the location of the trees that matters, as some will be more useful for this purpose than others.

Cedar of Lebanon, Wollaton Hall gardens
These gardens contain several old trees of this species.
Source: Andrew Abbott via Geograph

So the question arises: "As we get more tree data in OpenStreetMap, how can we make it usable for such purposes?" In particular, our focus in OSM on 'ground truth' (repeatably observable features of the things we map) makes this more of a challenge. This is actually a general problem: as more data gets added to OSM, it can get harder to find sub-sets for particular uses.

Once again I turned to osmfilter to help resolve the problem. Firstly I created a file just containing the highly attributed trees from Vienna, using a simple filter (--keep= "natural=tree && species= && taxon= "). An additional useful feature is that osmfilter can create simple tab-separated counts of given tags, so I was quickly able to find the numbers of trees of each species: there are 265 different species comprising over 122,000 trees. Nearly 20% are a single species, Norway Maple. Here's a pie chart of the top 20 species (about 75% of the total):
The 20 commonest street trees in Vienna (total about 90,000 trees).
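
The same kind of tally can be sketched in a few lines of Python. This is not how the counts above were produced (those came from osmfilter); it is only an illustration which assumes the species tag values have already been read into a list. The function name and the sample values are made up for the example.

```python
from collections import Counter

def species_summary(species_values, top_n=20):
    """Count trees per species and work out the share held by the top_n commonest."""
    # Skip trees whose species tag is empty or missing
    counts = Counter(s for s in species_values if s)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    top_share = sum(n for _, n in top) / total if total else 0.0
    return counts, top, top_share

# Made-up sample, NOT the Vienna data
sample = (["Acer platanoides"] * 4 + ["Tilia cordata"] * 3
          + ["Quercus cerris"] * 2 + ["Davidia involucrata", ""])
counts, top, share = species_summary(sample, top_n=2)
print(top, round(share, 2))
```

With the real data the share for the top 20 species should come out at roughly three quarters, matching the pie chart.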

Clearly with these trees we need to be highly selective in choosing examples. Indeed this might be true for any tree with over 100 specimens in the city (72 species altogether). At the opposite end of the scale there are nearly 100 species with fewer than 10 specimens, and amongst these are some of considerable interest or beauty, such as Red Maple (Acer rubrum) and the Handkerchief or Dove Tree (Davidia involucrata).

Cherry Blossom Grove on the National Mall
Flowering Cherry Trees, Washington D.C.
Examples of a collection of trees with strong seasonal interest
Source: Wikimedia Commons, CC-BY-SA
A quick check of some GPS traces from a couple of guided tree walks which I have attended suggests that a reasonable number to cover is 10-15 in an hour. The more interesting and unusual trees obviously are likely to have longer stories. This also means that a walk needs to be in a smallish area, with a variety of trees. Most of the ones I've been on, or followed using a leaflet, have been in parks (Graham Piearce has put together several for both Nottingham City Council and Nottingham University (pdf)), but this is mainly a combination of convenience and concentration of suitable trees.

The advantage of having large data sets is that it creates the possibility of having an endless suite of walks which can start from anywhere: it's not just the centres and parks of large cities which have interesting trees.

In order to select trees for a computer-generated trail they need to be scored. We are unlikely to capture human or historical interest associated with trees on OpenStreetMap, so scoring is likely to have to rely on other factors. These are the ones I have come up with whilst writing the article:
  • Native Trees. Tree walks provide a great opportunity to familiarise people with trees they might see in the countryside. As they will also have more associated wildlife they also can introduce topics such as pollination and pollinators, microfungi, plant galls etc. I would include extensively naturalised trees in this category (such as Sycamore in the UK). Scoring a tree as native/naturalised requires a list of species for a given geographical area with its status. I would be very hesitant about the sense of adding such data to OSM.
  • Locally Rare or Unusual Trees. These can be determined by choosing the lowest quartile (or some other means) of all trees mapped in the district.
  • Taxonomic Variety. Including trees from a good range of plant families heightens interest, but also starts building the ability to recognise characters of the family (something which I used a lot in Argentina, where much of the flora was unfamiliar). Variety within a common genus, such as different types of Oaks or Maples, is also a common theme. Taxonomic data can be acquired in an automated fashion from places such as Wikipedia or the Encyclopedia of Life.
  • Large Trees. Trees with a large girth are likely to be old, and distinctive. Often photogenic, but may be too large to show features of the leaves as these will be above head height. (Street trees often have lower branches and epicormic growth removed). Requires girth or diameter to be tagged.
  • Avenues or other distinctive planting patterns. In principle the denotation tag allows these to be determined, but I suspect that identifying most of them will require some geospatial processing.
  • Trees with non-tree tagging. In Vienna a few trees are also tagged as historic=tree_shrine, and in many cities some trees are planted to commemorate events or are memorials.
  • Commonest Trees. Although a tree trail's primary interest is in the less known specimens, the really common trees cannot be ignored. For instance in towns and cities throughout Europe, the London Plane is a common non-native street tree (indeed there are many in places like Buenos Aires), and many are old and large.
  • Fruit Trees.  Trees which produce fruits or nuts. Again would need some kind of external tabulation of properties.
  • Cultivars. Other things being equal it may be more interesting to show a cultivar of a common tree, a Copper Beech rather than an ordinary one, a Norway Maple with variegated foliage rather than an ordinary one, etc.

    Buenos Aires - Jacarandá
    Jacaranda in Buenos Aires
    Beatrice Murch (see her blog and photos on Flickr, of Buenos Aires trees)
    via Wikimedia Commons, CC-BY-SA
  • Beauty. An abstract & subjective property. It may be possible to approximate it with more objective properties, such as size of flowers, known colour of autumn foliage, or listing by horticultural authorities (such as the Royal Horticultural Society).
  • Seasonal Interest. If a tree's most distinctive features are only visible at certain times of year, this might be factored into altering the trail according to the season. For instance flowering cherries and other Prunus are highly valued when in flower, but not particularly rewarding at other times of year.
Some of these potential parameters are easier to deal with than others. In particular anything which requires creating an external data source of criteria will be harder to use. Simple lists of species/taxa meeting a criterion (such as a list of native trees) are easy to use as filters.
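
As a rough illustration of the two easiest criteria, here is a Python sketch. It assumes trees are held as simple dictionaries with a "species" key, reads "lowest quartile" as the least common quarter of species ranked by number of trees (one possible interpretation), and uses a tiny made-up native list; none of the names here come from an existing tool.

```python
from collections import Counter

def rare_species(species_values, fraction=0.25):
    """Return the least common `fraction` of species, ranked by number of trees."""
    counts = Counter(species_values)
    # Ascending count, ties broken by name so the result is repeatable
    ranked = sorted(counts, key=lambda s: (counts[s], s))
    cutoff = max(1, int(len(ranked) * fraction))
    return set(ranked[:cutoff])

def filter_by_list(trees, allowed_species):
    """Keep only trees whose species appears in an external list (e.g. native trees)."""
    return [t for t in trees if t["species"] in allowed_species]

# Made-up sample data and a hypothetical native list
trees = ([{"species": "Acer platanoides"}] * 5 + [{"species": "Betula pendula"}] * 3
         + [{"species": "Quercus rubra"}] * 2 + [{"species": "Davidia involucrata"}])
NATIVE = {"Betula pendula", "Acer platanoides"}

rare = rare_species([t["species"] for t in trees])
native_trees = filter_by_list(trees, NATIVE)
print(rare, len(native_trees))
```

Criteria needing richer external data (beauty, seasonal interest) would need a lookup table keyed by species rather than a plain set.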

A reasonable balance in a trail might be a third rare trees, a third common trees, with the remaining third chosen more randomly. In other words there is an initial scoring process and then selection which uses scoring plus some mechanism to ensure variety.
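
A minimal sketch of that balance, assuming a list of species already ranked from commonest to rarest. Taking the very top and very bottom of the ranking for the common and rare thirds is a simplification made for the example, and the function name is hypothetical.

```python
import random

def pick_species(ranked_species, n_total=12, seed=None):
    """Pick a third from the common end, a third from the rare end, the rest at random.

    ranked_species must be ordered commonest first.
    """
    rng = random.Random(seed)
    third = n_total // 3
    common = ranked_species[:third]
    rare_start = max(0, len(ranked_species) - third)
    # Avoid picking the same species twice when the list is short
    rare = [s for s in ranked_species[rare_start:] if s not in common]
    remaining = [s for s in ranked_species if s not in common and s not in rare]
    n_random = min(n_total - len(common) - len(rare), len(remaining))
    return common + rare + rng.sample(remaining, n_random)

ranked = list("ABCDEFGHIJKL")  # stand-ins for species names, commonest first
picked = pick_species(ranked, n_total=6, seed=1)
print(picked)
```

The random third is what provides variety between walks generated from the same starting point.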

We also want the trail to be more or less circular and constrained by time.

These sound like a lot of complex criteria. However, there is a nice precedent. Dimi Sztanko created a tool a couple of years ago which creates circular walks from a given location using a scoring system to define some measure of 'interestingness'. Needless to say I'm going to try and chat to him about the ideas above!

I've also tried out some of the ideas, by partitioning the data into up to 5 classes based on ordered ranking for some of the above parameters: age, girth, height, rarity, native/non-native, cultivar or not. (I cheated and used a UK list for nativeness.) Selecting all trees within about 500 metres of the Rathauspark, I selected at random 4 species from the most common class, and 4 each from classes 2 & 3, and 4 & 5. This is the list I came up with:
  • Acer platanoides, Norway Maple
  • Betula pendula, Silver Birch
  • Broussonetia papyrifera, Paper Mulberry
  • Carpinus betulus
  • Chamaecyparis pisifera, Sawara Cypress
  • Crataegus monogyna, Hawthorn
  • Davidia involucrata, Handkerchief or Dove Tree
  • Platanus orientalis, Oriental Plane
  • Quercus rubra, Red Oak
  • Rhamnus cathartica, Purging Buckthorn
  • Sophora japonica, Scholar's Tree
  • Ulmus minor
Not too bad, but a bit light on conifers, and plenty of native trees. This gives a total of 365 trees to then be filtered down to a single specimen for each species.

Selected species around the Rathaus in Vienna
Overpass Query ('cos I had trouble with QGIS)

I used a combination of weights to select a single tree from each category.
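
The exact weights are not recorded here, so the following Python sketch only shows the general shape of such a selection. It assumes each tree carries attributes already normalised to the range 0 to 1; the attribute names, weights and node ids are all invented for the example.

```python
def best_specimen_per_species(trees, weights):
    """For each species keep the id of the tree with the highest weighted score."""
    best = {}
    for tree in trees:
        # Missing attributes simply contribute nothing to the score
        score = sum(w * tree.get(key, 0.0) for key, w in weights.items())
        current = best.get(tree["species"])
        if current is None or score > current[0]:
            best[tree["species"]] = (score, tree["id"])
    return {species: tree_id for species, (score, tree_id) in best.items()}

# Hypothetical weights and sample trees
weights = {"girth": 0.7, "height": 0.3}
trees = [
    {"id": 1, "species": "Quercus rubra", "girth": 0.9, "height": 0.2},
    {"id": 2, "species": "Quercus rubra", "girth": 0.5, "height": 1.0},
    {"id": 3, "species": "Ulmus minor", "girth": 0.1},
]
chosen = best_specimen_per_species(trees, weights)
print(chosen)
```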

The selected specimens around the Rathaus
Overpass Query on Node Ids

Now it's just necessary to create a route. For this example I've done this manually.
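
One simple way to get a first ordering automatically would be a nearest-neighbour loop over the selected trees, sketched below in Python. This is not what was done for the map here. It uses straight-line (haversine) distances rather than real paths, so a proper router would still be needed to turn the ordering into a walkable route, and the coordinates are invented.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def nearest_neighbour_loop(start, stops):
    """Visit every stop by always walking to the closest unvisited one, then return."""
    route, here, todo = [start], start, list(stops)
    while todo:
        nxt = min(todo, key=lambda p: haversine_m(here, p))
        todo.remove(nxt)
        route.append(nxt)
        here = nxt
    route.append(start)  # close the loop
    return route

def loop_length_m(route):
    """Total straight-line length of a route in metres."""
    return sum(haversine_m(p, q) for p, q in zip(route, route[1:]))

# Invented coordinates laid out along a line, for illustration only
start = (0.0, 0.0)
stops = [(0.0, 0.003), (0.0, 0.001), (0.0, 0.002)]
route = nearest_neighbour_loop(start, stops)
print(route, round(loop_length_m(route)))
```

Nearest-neighbour ordering is quick but can produce awkward doubling back, which is one reason a hand-drawn route may still be better for a dozen trees.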


Of course this sort of thing is not a substitute for a trail designed by a knowledgeable person, but it does show some of the possibilities for creating things where such a person is not available.

Updates & Corrections

Change "Lewisham" to "Southwark" in reference to imported data sets. 2016-01-14.

Thursday 11 December 2014

City Stripping : building historical road layouts from today's data

Buenos Aires - Plano de Basch (1895)
Street map of Buenos Aires 1895 (plano de Basch)
Source: Wikimedia Commons

For at least one year I have been thinking about the best way to create vector streetmaps for different historical periods for the same city. The basic principle is simple: take the existing street layout and remove what is new! Essentially, this presumes that any current vector data will be more precise than digitising vectors from old maps.

I have even given the process a name: "City Stripping", but in practice my attempts so far have been less successful than I hoped. I think that I kept too much existing OpenStreetMap data in my first attempts, because in practice I found it easier to re-trace the old data (as here for Tartu/Dorpat).

Thursday 30 October 2014

War Memorials: revisiting an OpenStreetMap Project of the Week

Great Gable from Broad Crag col - - 768103
Great Gable, a mountain in the English Lake District.
The summit and around 1200 ha of the surrounding area are dedicated as a war memorial.

Four years ago I proposed mapping war memorials as the OpenStreetMap Project of the Week. It ran in early November 2010 to coincide with the anniversary of Armistice Day, when several countries honour their war dead. At the time I was intrigued at how this particular topic resonated with mappers.

I was gratified that both Peter Reed and Chris Hill felt engaged by the idea. Richard Weait, co-ordinator of Project of the Week, wrote a very interesting post about the poem "In Flanders Fields" which was written by a Canadian.

Tuesday 28 October 2014

Strava & OpenStreetMap GPS traces: a quick comparison

Strava introduced their heatmaps and their Strava Slide tool in the spring at SotM-US, the conference of the OSM US community in Washington DC. I had a quick look at the time and it seemed interesting, but there was little data in areas which I map.

A question came up recently regarding how accurate the Strava heatmaps are for mapping routes on OpenStreetMap, particularly in wooded areas. This prompted me to have another look at the data.

It happens that I have made a lot of traces across two paths on an open playing field (an area sufficiently unobstructed that it is used periodically for calibrating and testing professional grade GPS equipment). A short distance from these two paths is NCN 6, a heavily used national cycle route. It was therefore very easy to take some screen grabs of Strava and OSM data:

Jubilee Campus, Nottingham : Strava Heatmap
Strava Cycling Heat Map
(NCN 6 on left, university cycle paths centre & right)

Jubilee Campus, Nottingham : OSM GPS traces
The same area showing traces which have been uploaded to OpenStreetMap
(probably mainly by me)

Thursday 18 September 2014

OpenStreetMap at the UK Open Addresses Symposium

I attended the Open Addresses Symposium organised by Jeni Tennison of the Open Data Institute last month. This brought together a host of people and organisations interested in having an open alternative to the Postcode Address File (PAF).

Somewhat foolishly I'd suggested to Harry Wood that I might speak about addresses on OpenStreetMap.

Addresses mapped on OpenStreetMap in Britain
Density of address mapping in (southern) Britain on OpenStreetMap by local authority
(Northern Scotland not shown because little data, full map)
See text on map for full explanation.
I was glad to see that my talk was relatively late on in the day: the audience were unfamiliar and many of them came from large organisations, so I appreciated the chance to get an impression of them.

Tuesday 2 September 2014

Woodland Cartography

This is an expanded version of my talk at sotm-eu:

I start by seeking inspiration from the many ways in which woods and trees have been shown on maps in the past, and then consider what elements we may want for OSM data, and how we might depict such elements.

Monday 1 September 2014

Contributing to the Lesotho Mapathon

At the start of August I appeared in the OpenStreetMap stats for users adding most data in a day. This was the first time in ages that I've made enough edits to appear. The reason: I've been contributing this past week to a mapathon to map as much of Lesotho as possible. This has been co-ordinated by Irish OSM contributors, some of whom will travel to Lesotho early next year.

View from Lesotho village (5297237744)
A village in Mokhotlong District.
This is S of the area I have mapped, but looks similar on aerial photos.
Source: Wikimedia Commons.
The co-ordination makes use of the HOT Task Manager: a piece of software which has distant origins in something, long gone, called QualityStreetMap.

I've used the Task Manager fairly rarely, but development over the past year has added one feature which makes it much easier to use: the creation of a bounding box in the JOSM editor. It is now much simpler to see the area one has undertaken to map. This in turn is important in reducing editing conflicts and redundant work.

Friday 22 August 2014

WWII Bombs in Nottingham : discovering local history whilst mapping

Last Saturday I showed 2 visitors how I mapped addresses (more on this later).

Infill housing, St Cuthbert's Road, Nottingham NG3

One little vignette stood out.

Sunday 10 August 2014

Perhaps I was too cryptic

maprebus_2 maprebus_3 maprebus_1 maprebus_4

Each of the above maps represents objects mapped on OpenStreetMap which have a tag which corresponds to a single word. Together they spell out a topical phrase:
  • Happy. Actually name=Happy's as for some reason I had not spotted lots of name=Happy in Europe.
  • 10th. A minor fudge, using tiger:name_base=10th and restricting it to Pennsylvania to reduce the amount of data returned.
  • Birthday. A couple of retail places in Japan have name=Birthday.
  • OSM. A real fudge: I used source:name=OSM (an interesting recursive definition in itself).
Happy 10th Birthday OSM!

Sunday 13 July 2014

Upland woods in the Schwarzwald

I wasn't done with looking at woodlands from a mapping perspective in Baden-Württemberg. A couple of days after SotM I walked from the top of the gondola at Feldberg to the Feldberg summit and then back to Hinterzarten.

The first part of the walk was somewhat marred by a fierce hailstorm which left me fairly damp. However I recuperated by drying out a bit over Kaffee and Kuchen at the Baldenweger Huette below the summit.

Forest path, part of the Emil-Thoma-Weg.

Thursday 3 July 2014

Weingartnermoor Woodland Walk : SotM-EU Mapping Workshop

At State of the Map Europe in Karlsruhe I used the opportunity to develop some of the themes I have already outlined for woodland mapping. Essentially I have three lines of attack:
  • extending the cartography of woodlands in OSM ;
  • finding richer ways of tagging woodlands in OSM ;
  • looking at how we can collect data about woodlands (mapping).
I didn't do anything about the second, but gave a lightning talk outlining just some ideas about woodland cartography (I got a few more over the course of the conference). For the third I thought the best approach would be to get some OSMers in some real woods because real things are much easier to discuss than abstract ideas on the wiki.

We spend much of the conference listening to and discussing abstract ideas, and (too) little time using the fact that we come from many countries to share our knowledge of tagging. (A little Guerrilla Mapping is not out of place too.) So this was a small innovation for an SotM conference too.

SotM-EU Woodland Workshop Participants
Participants on Woodland Mapping Workshop, Karlsruhe June 2014
(minus the author who can be seen in this photo)
I was fortunate in the first place that the Karlsruhe Stammtisch offered some good ideas and advice and the conference organisers chose to add the event to the programme. Secondly, I was fortunate in being supported by more OSMers than I expected: and I know I lost a few due to the early start (necessary to enable not missing the whole Hackday).

Sunday 29 June 2014

Visit to Poland April-May 2012

Visit to Poland May 2012
My Polish travels Spring 2012

Whilst completing this post I learnt of the death of a dear friend of many years, Prof. Dr. hab. Jacek Hennel, who died aet 88 on 2nd June 2014.

2008.06.22. Jacek Hennel Fot Mariusz Kubik 01

In the Spring of 2012 I spent just over two weeks in Poland: the main aim was to attend my cousin's wedding in Lubartów. Poland is a country I've always been about to visit, so I made sure to extend my trip. SteveC's "OSM ready for prime-time" blog post prompted me to finish this article, because 2 years ago I successfully used lots of OSM-based data for my Polish trip.

This really should be about how Poland, Polishness and Poles have always impinged on my life: and why visiting Poland was bound to be imbued with lots of personal meaning. However, to do justice to the subject would just take too long, so what follows is just impressionistic.


I have known Poles all my life:
  • One of my father's closest colleagues was a Polish physicist (see above, and below); 
  • On my first day at school a neglectful parent forgot to collect me at the end of the day and I was consoled by a thoughtful Polish ice-cream man. He was one of many expatriate Poles, Ukrainians, Latvians and other Balts who had been placed in a refugee camp after WWII, at Ruddington, outside Nottingham. 
  • Others were parents of children in my class at school. 
  • Some of my closest friends at University had Polish fathers, many of whom had fought in WWII, and, usually, Irish catholic mothers. I will always remember one friend's father, saying in passing through the room when we were watching the end of Oh! What a Lovely War on TV: "There were lots of poppies on Monte Cassino in the Spring". 
  • A fellow student spent a summer in Gdansk and married a Polish girl he met there.
  • My cousin married one (see below).
  • There was even a Brighton-based punk band with a partially Polish name.
After graduating I worked in a lab with strong connections to a Warsaw research group: a fellow student spent 3 months in Tarkovski's lab in Warsaw: his brother was attending the Warsaw Conservatoire at the same time. When martial law was declared I knew people who were put in prison or had to leave Poland by clandestine means: I refused to speak to the Russian in our department at the time. Man of Iron and Man of Marble spoke strongly to me, not just as wonderful films, but because they represented a new hope for Europe arising from Poland. (I still remember my annoyance on leaving the Academy Cinema on Oxford Street, and overhearing some contemptuous remark about the cinematography.)

Later I lived near Ealing which has a vibrant Polish community. Many of my Anglo-Polish friends resented having atypical fathers, with eccentric habits like making sauerkraut and raising ducklings for the pot, and of course, in general not just being embarrassing in the way parents usually are, but being embarrassingly different. However, in Ealing people of my own age seemed very comfortable with both British and Polish heritages.


Of course Krakow is a jewel of a city, a heart of Polish culture, (and the destination of choice for British Stag parties).

My reason for visiting was to meet Jacek and his wife Jozefa: I've known Jacek as long as I've lived. They are prominent liberal Catholic intellectuals who have led lives of extraordinary change, under four completely different political systems: pre-war Poland, the Nazi General Government, the Soviet puppet communist regime, and the modern Poland of the EU.

At supper in the old Jewish quarter of Kazimierz, Agnieszka, Jacek's daughter, joined us. I remember going shopping for Beatles records with her when she stayed with my family in the early 70s. She still likes The Beatles. A little later, around 1974, she gave me Enigmatic, an album by Czesław Niemen.

On the Sunday I visited them at home, and I bought flowers for Jozefa on the Rynek. I was able to enjoy the  beautiful scent as I carried them on the tram. Later Jacek told me about his father filming him and his brother with an early 8mm cine camera on the same square before WWII.

High Tatra

Kuźnice and the High Tatra from the slopes of Nosal
Early morning view of Kuźnice and the High Tatra

Another place I had to visit was Zakopane: my father first went there in the early 60s; and, as a child, I had a promise that I would be taught to ski there. Jacek's aunt remembered seeing V.I. Lenin in the village of Poronin just before WWI: it was conveniently close to the border for Lenin to retain contact with revolutionaries within Russia. Unfortunately I ended up visiting Zakopane at the start of May when two national holidays run into one another. This meant that I could not get the cable car to the peak of Kasprowy Wierch : the queues were already vast by 07:30, and the town was very crowded.

On the other hand I saw Zakopane as a uniquely Polish resort. A lot was quite vulgar: lots of eating and drinking; and yet the town has the typical faded allure of a 19th Century spa town too. Of course its real strength is its presence at the foot of the mountains, and although the main paths were busy this was not unpleasant. I hiked up to a big hut just on the snowline, reaching it just before a sharp shower of rain. Most of the way I kept passing, and being passed by, a group of young Polish women who were celebrating their graduation (one had her treasured certificate with her). They kindly made space for me at their table in the hut. The highlight of the walk was seeing (and photographing) a very sluggish Adder which had just come out of hibernation.

Adder, detail of head
Sleepy Adder in the Tatra

Somewhere I have a photo of my Dad outside this same hut in the early '60s.

The rain meant I started back later than planned: I got a bit worried that  I wouldn't be out of the Spruce forest before dark, and I had noted the signs about bears too:

There are Bears in the woods
The point when I wished I understood more Polish
Lubartów, Nałęczów, Lublin and Kozłówka: 

I returned to Krakow to meet other members of my family and do some proper tourism. On the day of the wedding we drove to Lubartów, stopping for lunch at another slightly faded spa town, Nałęczów.

Jacek and Jozefa were by now taking a cure and staying at a sanatorium here. This was the only chance for my siblings to see them. I met Jacek in the hall of the sanatorium: he was amazed that we had found it so easily (OSM of course) and we then took a walk in the park and took the waters before returning to see Jozefa. We then adjourned for lunch at the very pleasant restaurant Ewelina in a villa slightly away from the centre of the spa, with some literary associations with Bolesław Prus.

Taking the waters at Nałęczów
Jacek Hennel, with the author (right) and his siblings.
We still had to make it to Lubartów, which we did comfortably in time for the wedding, although changing was a bit complicated as none of us were actually staying in Lubartów. The wedding itself was not particularly traditional: it was conducted in French and the initial music was provided by Breton bagpipes and bombard. There was much cross-cultural interchange on the music front as the church organist learnt some of the traditional Breton tunes, and the Bretons joined in the interminable renditions of Sto lat over the following 24 hours.

I got to my bed around 4 in the morning in a little country inn at  Kozłówka to the W of Lubartow. As we arrived I heard the wonderful song of a Golden Oriole from the park across the road. I pottered about the courtyard of the inn the following morning before we resumed eating and drinking in Polish style for the rest of the afternoon and early evening.

Kozlowka palac front 01
Kozłówka Palace, garden front
CC-BY-SA Wikimedia Commons
I was completely unaware that the park across the road contained a fine palace, Kozłówka Palace, which I could have visited that morning. Instead I only found it the following day, a Monday, when, unfortunately, it was closed. I'd returned to  Kozłówka after I went into Lublin to collect my hire car. In the short time available I was able to walk through the historic centre of Lublin, from the Castle to the Krakow gate passing by the court house in the main square. There are numerous other historic monuments in Lublin given its central role in the establishment of the Polish-Lithuanian Union at the Treaty of Lublin.


The rest of my trip was devoted to visiting a couple more National Parks in search of wildlife. I started by heading north to the Biebrza National Park, staying the night on the outskirts of Białystok. I first became aware of this in a programme made by Bill Oddie for the BBC, and had subsequently met his guide Marek Borkowski at the Rutland Bird Fair.

The Biebrza area is a huge (over 2000 km²) area of wetlands stretching along the river Biebrza to its junction with the . At the centre of the area, and the base for the administration of the National Park, is the former fortress of Osowiec, parts of which are still under military control. The fortress was built when this part of Poland was in the Russian Empire. It controlled a strategic crossing of the marshes with a road and railway line, and was relatively close to the frontier of East Prussia prior to WWI.

Elk at Biebrza
Female Elk (Moose to N. Americans) at dawn.
The problem of wildlife watching in early May is that the key times are dawn and dusk which makes for very long days. One of the highlights of the Biebrza marshes is that it still contains a decent population of Aquatic Warblers (Acrocephalus paludicola). This bird is now classified as vulnerable by BirdLife International. The favoured place for seeing them was a boardwalk in the S part of the reserve which was about 20-25 minutes drive from my hotel. This led out into an area of marshland full of the flowers of Bog Bean (Menyanthes trifolia). One morning I was lucky enough to see three Elk from an observation tower a little way N of this boardwalk. The drive back after dark brought another nice find: Hawfinches feeding on the road. My last day I moved over to the N side of the marshes and visited, all too briefly, an area known as the Red Marsh (Czerwone Bagno). I met some other birders as I returned and we lamented the absence of many warblers.


Białowieża was my last wildlife destination. I really hoped to see European Bison in the wild, and I used a guide for an exhausting 6 hour trip starting at 3:45. Ultimately we were unsuccessful, but I did see many Woodpeckers, and a male Citrine Wagtail. I was also very shocked when pulled over by the police at about 5:30 in the morning (they actually turned out to be the Border Guard), and took a long time to find all the documents. I really thought I was in trouble despite my guide's re-assurances. (Later I got stopped again, but was well prepared: the moral is that if you go within 2-3 kilometres of a Border Guard barracks in a car with plates from out of the area expect to be stopped).

Closest I got to European Bison, at Rezerwat Pokazowy Żubrów
Using a guide meant that I now knew the places where I might find Bison. I visited these morning and evening but still failed. Therefore I had to see them in the little zoo close to Białowieża. Walking from the car park to the entrance I got a great view of a Black Woodpecker which flew up to a nest hole. Whilst watching it a guy accosted me asking if I was interested in any guiding. After a brief conversation I realised that he was Mateusz, the son of my guide Arek, and he realised that I was the English guy staying in Siolo Budy. He had noticed this nest at the same time as me.

Black Woodpecker at nest hole
the tree is probably Aspen (Populus tremula)
I found it amusing that in a few days I had encountered most of the professional guides in Poland.

This part of Poland is great for wildlife, but it doesn't have quite the infrastructure or the number of visitors which would make it easier to get the most out of it. There are plenty of places to stay, but to my surprise most guests were Poles taking weekend breaks (as can be seen from the language of reviews on Trip Advisor). The small number of birders, mostly in guided parties, meant that there was little opportunity to learn about things by word of mouth. This contrasts strongly with Monfragüe in Spain, where the number of birders is not enormous, but at most viewing locations there were other people, which increases the chances of catching unusual species without a guide.

Tatar Country

Another aspect of Eastern Poland, which I only learnt about when planning the trip, is the existence of a few villages which still perpetuate a distinct Tatar culture. These are situated to the NE of Bialystok, which itself is probably where most Poles of Tatar descent live. I managed to visit one late in the afternoon when I'd given up on bird-watching because of the rain. Fortunately the mosque was still open even though it was about 6 in the evening. I was welcomed by a young couple and given a tour of the mosque (image below) by the man, who had a cousin living in Chesterfield!

Mosque at Kruszyniany
This area was fairly remote: the main road into the village from the Bialystok - Brest highway was unsurfaced, a strange contrast after having driven past kilometres of nose-to-tail trucks queuing up for the border crossing into Belarus.

I also wandered around the graveyard a few hundred metres from the mosque: a very pleasant wooded area with a number of active woodpeckers. Many of the gravestones had both Polish and Arabic inscriptions, and to my surprise many had photographs of the deceased on them. I took some photos of these, but don't want to upset any religious sensibilities by posting them here.

Remnants of the Russian Empire

Church of the Nativity of John the Baptist, Nowa Wola
belonging to the autocephalous Polish Orthodox Church

Throughout Eastern Poland I kept coming across reminders that this area had once been part of the Russian Empire. A road I drove regularly in Biebrza, before dawn each morning and after dusk each evening, was known as the Tsar's road. Osowiec was an immense Russian-era fortification. Even churches were different because people belonged to different confessions, with both Uniate and Orthodox churches still common.

It was Białowieża which had been a hunting estate of the Tsars for several centuries where the former Russian influence was most obvious. The Tsar's hunting lodge has gone, but the formal park is still there. Throughout the forest area the rides are spaced at an interval of one verst with little marker posts at each corner.

Forest compartment marker, Białowieża
 No doubt there were many other survivals and markers of the hundred or so years when this part of Poland was Russian which I missed.

Being an OSM User 

Now I've not said a word about maps so far.

I actually travelled over 2000 kilometres by car, used buses, hiked trails in 3 national parks and wandered around a couple of cities and several towns. I used OSM exclusively for this, with one exception. In the main I used OSM on a Garmin device and used Navit on an Android smartphone. The latter was great, but as I didn't have an in-car charger it was not useful for long journeys. Instead I relied on the Garmin's beeps for upcoming turns. In the whole time I found only a single trail and one poor-quality linking road E of Bialystok to be missing.

Early morning ferry across the San at Czekaj Pniowski E of Sandomierz
(I was totally alarmed to see a ferry symbol on a sign, so was very relieved to find such a simple low-key affair.
I was totally reliant on OSM routing at this point.)
Most of the places I stayed were on OSM, but relatively few POIs such as shops, fire stations and churches were. I think a lot of the data came from imports and was subsequently lost at the licence change. Some of it was clearly out of date: in Lubartow the station has been closed for years and buses stop on the main street, not in the location marked as the bus station. I did no conscious mapping whilst I was there: I had too many other things to do. I did however keep traces and took many geolocated photos.

What I did do 2 years ago was to navigate entirely using OSM across a broad range of Polish landscapes with no serious difficulties.

A lot of this OSM data was removed at the time of the licence change, but the vast majority was restored by the concerted efforts of many mappers (see talk by Marek Kleciak at SotM Baltics). Places like Krakow and Lublin have good quality data, particularly in the old centres, but smaller towns which I visited like Lubartow, Goniadz, and Monki are seriously deficient in POIs. Even a popular tourist centre like Zakopane could do with more on-the-ground mapping (not least of the seriously good purveyor of cheesecake "Samanta" which has several outlets in Zakopane).


A couple of weeks ago I returned very briefly to Krakow for the funeral of Jacek Hennel. The experience really emphasised why Poland has always been, and always will be, a country with emotional significance for me.

Friday 23 May 2014

Fuzzy ideas on fuzzy matching

The UK Food Hygiene data set (FHRS) is just one example of many which it would be nice to be able to compare with OpenStreetMap in a semi-automated manner.

External open data can both be a useful source of missing data and an important tool for evaluating completeness and quality of OSM. FHRS has a number of nice properties:
  • it's large, but not too large; 
  • it is generally of high quality; 
  • it has reasonable precision geolocation; 
  • it is pretty current (most data - five-sixths - is less than 3 years old); 
  • it covers a wide range of different classes of feature (hospitals, schools, pubs, butchers etc.); 
  • and it is comprehensive.
Even with good quality data there are always problems in matching data from two sources (conflation seems to be the GIS word for this):
  • Firstly, the location data provided is often not precise enough to do direct comparisons based on location. 
  • Secondly, elements like names and addresses may have enough variation in them to be non-trivial to match automatically, even if to a human the redundancy in the data means that matching is possible. 
  • Thirdly, names and functions of amenities change. 
  • Fourthly, different sources of data may encode features in different ways. 
  • Fifthly, all large data sets have errors in them.
The last point is the critical one: it is not wise to assume that any particular attribute of a given data set has a fixed level of data quality. So matching must be reasonably tolerant of missing values, values subject to typographical errors, encoding differences and variable accuracy. Also attributes may be interdependent in the incoming data set (for example locations in FHRS are derived from look-ups on postcode centroids): thus a trivial error in one value may result in a serious error in a dependent attribute, such as a single letter typo in a postcode moving a point 1000 kilometres away.

I have therefore been trying to think of ways in which matches can be made independently and with a degree of fuzziness. To this end I've been trying to come up with a listing of the more important types of matching operations involved with FHRS data. The hope is that I cover most use-cases for other data sets.

How fuzzy matches are scored and combined is not treated here. Equally, there are tools already within the OSM community (Nominatim, Osmose, OS Locator Musical Chairs) which provide elements of the functionality I desire.

Non-locational matching

There are many matching operations which require no knowledge of the associated geographies, although in practice the strings being matched may themselves correspond to geographical objects (e.g., matching on a local authority name is in practice also matching to a polygon). Despite the fact that some of these things can be inextricably linked I want to treat them entirely separately for matching purposes.


Basic fuzzy matching of names is handled directly with PostgreSQL by a package called fuzzystrmatch. This provides several different ways of comparing the similarity of strings. The same algorithms are available in other RDBMS packages, and for many programming languages. Robert Scott used the Levenshtein algorithm very successfully for OSM Musical Chairs which helped mappers in Great Britain to track down discrepancies between OSM street names and Ordnance Survey (open) data. (These did not always turn out to be typos on our part).
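As an illustration of what the Levenshtein measure computes, here is a minimal pure-Python sketch. It is not the fuzzystrmatch implementation, just the textbook dynamic-programming version, and the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("Rose and Crown", "Rose & Crown"))              # 3
print(levenshtein("Robin Hood", "Robin Hood and Little John"))    # 16
```

The second result shows how quickly the distance grows when one name is simply a truncation of the other, even though a human would match the two at a glance.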

However Levenshtein only works if strings are of similar length. In the case of FHRS I want to be able to find likely matches in these typical cases:

FHRS Name | OSM Name | Reason for difference
Sycamore Primary School | Sycamore Academy | Name change (a common type in UK at present)
Robin Hood | Robin Hood and Little John | Name truncation
Rose & Crown | Rose and Crown | Use of abbreviations
Rose and Crown Inn | Rose and Crown | Name variant
The Rose and Crown Hotel | Rose and Crown | Name variant
Rose and Crown Public House | Rose and Crown | Unnecessary explicit non-name info
Royal Gourock YC | Royal Gourock Yacht Club | Use of abbreviations

Clubhouse, Royal Gourock Yacht Club
© CC-BY-SA Thomas Nugent on Geograph via Wikimedia Commons

The major common feature of these strings is that there are elements which obviously match when inspected visually. Common sense and real-world knowledge (that lots of schools are changing their name to use "Academy", for instance) tells us that many of the elements are unimportant and can be ignored in matching. It's a bit harder to persuade a computer to do the same!

My basic idea is that we should match individual words, so that each string is divided into individual word strings, which I refer to as tokenising. Very common tokens ("and", "the", "&") and domain specific ones ("school", "academy", "primary", "hotel", "inn", "pub", "public house", "club", "cafe") are eliminated before attempting matches.
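A minimal sketch of this tokenising step in Python might look like the following. The stop lists are simply the examples quoted above, and the names `COMMON`, `DOMAIN`, `DOMAIN_PHRASES` and `tokenise` are illustrative assumptions rather than anything from an existing tool.

```python
import re

# Hypothetical stop lists, taken from the examples in the text.
COMMON = {"and", "the", "&"}
DOMAIN = {"school", "academy", "primary", "hotel", "inn", "pub", "club", "cafe"}
DOMAIN_PHRASES = ["public house"]  # multi-word entries are removed before splitting


def tokenise(name: str) -> list[str]:
    """Lower-case a name, drop stop phrases and stop words, return the remaining tokens."""
    text = name.lower()
    for phrase in DOMAIN_PHRASES:
        text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z0-9&']+", text)
    return [t for t in tokens if t not in COMMON and t not in DOMAIN]


print(tokenise("The Rose and Crown Hotel"))     # ['rose', 'crown']
print(tokenise("Rose and Crown Public House"))  # ['rose', 'crown']
print(tokenise("Royal Gourock YC"))             # ['royal', 'gourock', 'yc']
```

Abbreviations such as "YC" survive this step untouched, so they would still need either an expansion table or a looser comparison later on.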

One of the many Rose & Crown pubs in Britain:
this one has particular significance for OpenStreetMap
© CC-BY-SA Francois Thomas on Geograph via Wikimedia Commons.
A totally naive way would just be to look for single token matches and then apply increasingly stringent matching criteria. Unfortunately this falls down on performance grounds: there are a lot of "Rose and Crown" pubs in Britain (236 in FHRS data, 174 on OSM), and a not inconsiderable number of "Rose" and "Crown" pubs, not to say "Rose Cottage" restaurants. A crude initial single token match will result in thousands of potential matches just for these strings (a minimum of 2*236*174 => 82k+). At this stage I think a means of matching multiple tokens in the first pass is less likely to end up running into major performance issues when scaled for larger data sets.

One reason to worry about this performance is that we also need to take account of typos in both data sets, and therefore either as part of the initial match or as a subsequent trawl through unmatched items we also need to use Levenshtein distance for token comparison.

It may well be that the most effective route is to apply stringent matching criteria (all tokens match and are in the same order) before progressively relaxing the constraints. Only looking at real world examples will help evaluate which method will be most effective. In practice the method used may need to be parameterizable to reflect the quality of one or more data sets. For instance I remember being frustrated trying to match GNIS names to US military maps of Pakistan during the 2011 flooding, and I want to work on techniques which are just as applicable to humanitarian scenarios as to improving pub coverage in the UK.
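The idea of starting with stringent criteria and progressively relaxing them can be sketched as a cascade of tests on two token lists. This is only one possible ordering: the level names, the use of Jaccard overlap for the loosest level, and the `min_overlap` default of 0.5 are all assumptions made for illustration.

```python
from typing import Optional


def match_level(a: list[str], b: list[str], min_overlap: float = 0.5) -> Optional[str]:
    """Return the strictest level at which two token lists match, or None."""
    if not a or not b:
        return None
    if a == b:
        return "exact"        # all tokens match and are in the same order
    set_a, set_b = set(a), set(b)
    if set_a == set_b:
        return "same-tokens"  # all tokens match, order ignored
    if set_a <= set_b or set_b <= set_a:
        return "subset"       # one name is a truncation of the other
    overlap = len(set_a & set_b) / len(set_a | set_b)  # Jaccard similarity
    if overlap >= min_overlap:
        return "partial"
    return None


print(match_level(["robin", "hood"], ["robin", "hood", "little", "john"]))  # subset
print(match_level(["rose", "cottage"], ["rose", "crown"]))                  # None
```

Making `min_overlap` a parameter is one way the method could be tuned to reflect the quality of a particular data set.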

Feature (Point of Interest) Type 

FHRS data comes with a range of Business Types:

  • "School/college/university"
  • "Pub/bar/nightclub"
  • "Restaurant/Cafe/Canteen"
  • "Importers/Exporters"
  • "Mobile caterer"
  • "Retailers - other"
  • "Retailers - supermarkets/hypermarkets"
  • "Takeaway/sandwich shop"
  • "Farmers/growers"
  • "Distributors/Transporters"
  • "Other catering premises"
  • "Hospitals/Childcare/Caring Premises"
  • "Hotel/bed & breakfast/guest house"
  • "Manufacturers/packers"
Although how these correspond to OSM tags is, in the main, fairly obvious, most of the FHRS categories are broader.

There is also the straightforward classification problem: it is not always easy to decide if a place is a restaurant or a pub. Yesterday, I had the problem of deciding if a place is a hospital or a care home (the probable answer is that once it was a hospital but now is a care home). Thus one needs some way of matching semantic categories (presumably in the first instance using trivial rules such as pub ⇔ "Pub/bar/nightclub"), but also a means of identifying potential overlaps or 'spill-over' between semantic categories.
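One way to express both the trivial rules and the spill-over is a small lookup table. The sketch below is hypothetical throughout: the OSM values chosen, the spill-over sets and the scores of 1.0 and 0.5 are placeholders, not values derived from real data.

```python
# Hypothetical rule table: OSM amenity value -> (primary FHRS type, spill-over FHRS types)
RULES = {
    "pub": ("Pub/bar/nightclub",
            {"Restaurant/Cafe/Canteen", "Hotel/bed & breakfast/guest house"}),
    "restaurant": ("Restaurant/Cafe/Canteen",
                   {"Pub/bar/nightclub", "Takeaway/sandwich shop"}),
    "hospital": ("Hospitals/Childcare/Caring Premises", set()),
    "school": ("School/college/university", {"Other catering premises"}),
}


def category_score(osm_value: str, fhrs_type: str) -> float:
    """Score agreement of categories: 1.0 primary rule, 0.5 spill-over, 0.0 otherwise."""
    rule = RULES.get(osm_value)
    if rule is None:
        return 0.0
    primary, spill_over = rule
    if fhrs_type == primary:
        return 1.0
    if fhrs_type in spill_over:
        return 0.5  # arbitrary placeholder weight for a plausible overlap
    return 0.0


print(category_score("pub", "Pub/bar/nightclub"))        # 1.0
print(category_score("pub", "Restaurant/Cafe/Canteen"))  # 0.5
```

In a fuller treatment the spill-over sets and weights would presumably be learnt from matched examples rather than written by hand.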

The Manor at Toton Corner
A classic pub building, but these days a restaurant with a small bar area: places like this are likely to be tagged or coded differently by different people.
Other examples are: pubs with overnight accommodation; hotels with pub-like bars; and petrol stations only marked as such, but with a convenience store on site too.

My impression is that it will be easier to identify rules for the fuzzy aspects of matching semantic categories by using training sets. Once again FHRS data, simply because it has lots of other detail (addresses, limited geolocation and names) is a decent starting point for identifying how fuzzy OSM tag categories might be. 

There is another problem here: systematic mis-classification. At least one local authority, Gedling, places all school contract caterers in the category "Other Caterers", when it is clear that they should be in "School/college/university".  Of course, in cases like this, one would hope that we can work to get the data properly coded: no-one can assess hygiene of school catering in this district easily with the data as it stands.


It may be odd to treat addresses as non-locational data, but here I am largely referring to string matching of one or more parts of the address, independent of any awareness of the locations associated with these strings.

Ideally, the address is parsable into discrete elements. This is probably true for most UK addresses which consist of a number, street and post town, but as is usually the case all the difficulty lies in the exceptions. Furthermore each country requires parsing rules specific to its own particular cases. For instance in Spain, it is not unusual for forms requesting address data to ask for the door (often izquierda (left) and derecha (right)) and floor number as well as the rest of the address. Following edits on the OSM wiki I have also learnt about addresses in Florence (businesses and residential addresses have separate but overlapping numbers) and in the Czech Republic. Places like Japan and South Korea have quite different addressing schemes too.

Thus decent functionality for address parsing is to my mind rather too complicated to tackle straight away. Instead we can focus on parts of the address which we are highly likely to have already captured in OSM. Notably these are the street name and the postal town/village/city. The former is easier to use, not least because the Royal Mail in Britain insists on some very odd uses of locations in addresses.

Once again very common names present a matching volume problem (Church Street, Main Street etc.) but this can be greatly reduced by applying other non-geographical constraints (such as the local authority which provided the data) from the data set. (This may seem to be making everything too hard, but I really want to keep pure geo-matching separate: ultimately it should make for a cleaner architecture). One important reason for doing this is error handling.
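A minimal sketch of such a constraint, assuming each record is a dictionary with hypothetical `authority` and `street` keys: records are only paired when both the local authority and the street name agree, so the many "Church Street" entries across the country never meet.

```python
from collections import defaultdict


def candidate_pairs(fhrs_rows, osm_rows):
    """Yield (fhrs, osm) pairs that share both local authority and street name."""
    index = defaultdict(list)
    for row in osm_rows:
        index[(row["authority"], row["street"].lower())].append(row)
    for row in fhrs_rows:
        key = (row["authority"], row["street"].lower())
        for other in index.get(key, []):
            yield row, other


fhrs = [{"id": 1, "authority": "Gedling", "street": "Church Street"}]
osm = [
    {"id": "a", "authority": "Gedling", "street": "church street"},
    {"id": "b", "authority": "Broxtowe", "street": "Church Street"},
]
print([(f["id"], o["id"]) for f, o in candidate_pairs(fhrs, osm)])  # [(1, 'a')]
```

Note that no geometry is involved: the authority is compared purely as a string, which keeps this step separate from any geo-matching.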

A simple inspection of addresses in the Land Registry Prices Paid file for the town of Maidenhead (for obscure technical reasons I used a postcode, SL6, as a proxy) reveals a small number of addresses where the postcode does not match the polygon of the named district in the file.

Land Registry records with either an erroneous assignment of postcode or of local authority
Data are for post district SL6 (orange line) where the postcode centroid was not located in the boundary of the local authority in the records.
Boundaries from OS Open Data Boundary Line, Post Code centroids from CodePoint Open,
SL6 boundary from Geolytix (based on OSGB Open Data sets)
In large datasets these types of errors are invariably present: failing to cope with them (including chucking them out) often leads to obscure complications both with the data and code to manipulate it.
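A check of this kind can be sketched with a plain ray-casting point-in-polygon test. The record layout (`x`, `y`, `authority`) and the `boundaries` dictionary are assumptions for illustration, and real boundaries would normally be handled by a GIS library rather than by hand.

```python
def point_in_polygon(x: float, y: float, polygon: list[tuple[float, float]]) -> bool:
    """Ray-casting test: count edge crossings of a ray running in +x from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def misplaced(records, boundaries):
    """Return records whose centroid lies outside their stated authority's polygon."""
    return [r for r in records
            if not point_in_polygon(r["x"], r["y"], boundaries[r["authority"]])]


boundaries = {"A": [(0, 0), (10, 0), (10, 10), (0, 10)]}  # made-up square
records = [{"id": 1, "x": 5, "y": 5, "authority": "A"},
           {"id": 2, "x": 15, "y": 5, "authority": "A"}]
print([r["id"] for r in misplaced(records, boundaries)])  # [2]
```

The suspect records are returned rather than silently dropped, so that they can be examined or handled by a looser rule later.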

Locational Matching

By locational matching I mean comparing data sets based on geolocated data, whether this is a point (as with FHRS data located at postcode centroids) or an area (again with FHRS data, the local authority which has collected the data).

It should be noted that most datasets will have two implicit sources of geolocated data: scope (the defined area of the data set, typically with OpenData sets, scope will be a country, state, or local authority) and source (who collected the data, sometimes identical to scope). The important aspect of these two implicit sets of location data is that they are likely to be free of basic locational errors. A national data set is very unlikely to include data for other countries; inaccurate locations outside the source local authority are likely to be erroneous. This basic information must not be overlooked as it provides a good control on data reliability and will often enable other matching to be much more constrained.


The FHRS data potentially covers the entire United Kingdom, which is its basic scope. The Scottish part of the scheme has different data and therefore also has a separate scope.


As FHRS data is collected independently by each local authority, and this information is contained in the source data, this provides a finer grain of location data which can be treated as having a very low error rate. Source is important because data quality is likely to vary by source (it certainly does in FHRS data).

Explicit Locations

My expectation is that most data sets are likely to provide explicit location information in the form of lat/lon pairs (or eastings and northings in other co-ordinate systems), with line or polygon information being rarer. Certainly Nottingham OpenData have been removing many data formats in favour of plain CSV: this is pragmatic, it is easy to load the data in a spreadsheet, but not so simple to look at it in a GIS. Local users are also more likely to be able to make use of the data without requiring that it be mapped in the first instance. With this in mind most of what follows assumes data is delivered with a point location. In many cases if data does have a more elaborate geographical content, much of the matching may still be carried out based on centroids.

The key problem with centroids is that one has no idea of the degree of imprecision of the data. For instance playing with GNIS data exposes many data items which are located a long way from their true location, whereas the Nottingham OpenData on Streetlights is accurate to the nearest metre. Use of postcode centroids makes things slightly harder as their degree of accuracy will be a function of local postcode density. The worst case for a postcode in Great Britain is a farm which is over 11 km from the centroid.

Therefore, at least for things like UK postcodes and GNIS data, plain locational matching will need to start with a fairly tight circumscription of the area of potential matches, which can be progressively relaxed once most matches have been made.
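A minimal sketch of this widening search, using the haversine great-circle distance on lat/lon pairs. The radii of 50, 200 and 1000 metres are arbitrary assumptions, as are the function names.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_within(point, candidates, radii_m=(50, 200, 1000)):
    """Try each radius in turn; return (radius, nearest candidate) or None."""
    distances = [(haversine_m(point[0], point[1], c[0], c[1]), c) for c in candidates]
    for radius in radii_m:
        hits = [d for d in distances if d[0] <= radius]
        if hits:
            return radius, min(hits, key=lambda d: d[0])[1]
    return None


centroid = (52.95, -1.15)
print(nearest_within(centroid, [(52.9501, -1.15), (52.96, -1.15)]))
```

Returning the radius at which the match was found gives a crude indication of how much the constraint had to be relaxed, which could feed into a likelihood score.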

Other datasets which provide metadata on precision (Nottingham Streetlights, q.v.) can probably be handled with a single suitable matching operation.

Other implicit locations

Oddly, postcodes feature here too. It is usual, although not mandatory, that a postcode can belong to one and only one street. The main exceptions are in rural areas where all houses in an area share a postcode (usually there are no street names in this case), or when a group of houses are associated with a subsidiary street as well as the main one. In the case where the postcode does belong to the street, the data can be matched to the geometry of the street (which may or may not be helpful).


Here's a brief outline of what I think the main components should be:
  • One or more matching engines. These apply a rule-driven matching technique to two data sets, with an optional rule-driven filter (for performance reasons). Matching could be bi-directional, but for simplicity I assume it is uni-directional as the other direction can be done by swapping the two dataset parameters. Output from the matching engine is a set of matches scored by a likelihood measure. A minimum of two matching engines can be recognised: one based on strings, the other on geographies. Filters are likely to be a straight datasetA.attributeX = datasetB.attributeX (e.g., local authority identifier).
  • Matching should allow increasing/decreasing tightness of constraints. Probably by allowing recursive calls within a matching rule.
  • A match selector. Given n different matching routines each producing a likelihood estimate, something needs to evaluate these scores and output final matches.
  • A matching routine chooser. With different data sets the best order of application of matching routines may differ, so it may be better to train the system with a known data set in order to find the most efficient way to apply matching routines.
  • A simple way of specifying rules.
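To make the outline above a little more concrete, here is a rough Python sketch of a uni-directional engine runner with an optional filter, plus a very simple match selector. Everything in it is an assumption: records are plain dictionaries with an `id` key, engines are functions returning a score between 0 and 1, and the selector just averages scores across engines, which is only a stand-in for whatever scoring scheme is eventually chosen.

```python
from typing import Callable, Optional

Record = dict
Engine = Callable[[Record, Record], float]   # returns a likelihood in [0, 1]
Filter = Callable[[Record, Record], bool]    # cheap pre-filter for performance


def run_engine(engine: Engine, dataset_a, dataset_b, prefilter: Optional[Filter] = None):
    """Uni-directional: score each record of A against the permitted records of B."""
    results = []
    for a in dataset_a:
        for b in dataset_b:
            if prefilter is None or prefilter(a, b):
                score = engine(a, b)
                if score > 0:
                    results.append((a["id"], b["id"], score))
    return results


def select_matches(results_per_engine, threshold: float = 0.5):
    """Average scores over all engines; keep the best B for each A above threshold."""
    totals = {}
    n = len(results_per_engine)
    for results in results_per_engine:
        for a_id, b_id, score in results:
            totals[(a_id, b_id)] = totals.get((a_id, b_id), 0.0) + score / n
    best = {}
    for (a_id, b_id), score in totals.items():
        if score >= threshold and score > best.get(a_id, (None, 0.0))[1]:
            best[a_id] = (b_id, score)
    return best


# Hypothetical engines and filter, for illustration only.
same_authority = lambda a, b: a["authority"] == b["authority"]
name_engine = lambda a, b: 1.0 if a["name"].lower() == b["name"].lower() else 0.0
street_engine = lambda a, b: 1.0 if a["street"].lower() == b["street"].lower() else 0.0
```

Swapping the two dataset arguments of `run_engine` gives the match in the other direction, and further engines (for example a geographic one) can be added to the list passed to `select_matches` without changing either function.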


I came to this problem with some old experience of use of Harte Hanks Trillium software for keeping track of commercial customers in a banking application. I didn't use the tool myself but it was an important part of being able to build a single common view of customers, and part of this was matching up different versions of a business name captured both in internal and external systems.

Years before on another (antique) banking system we came upon the unfortunate decision to create internal keys based on the client name, which meant that we lost any history when the name altered (potentially just a typo). I mention these just to illustrate that string matching and intelligent address parsing have always had important business applications. However, OpenSource resources to do similar things are few and far between.

Twice I have been struck by how fairly simple matching operations would have made mapping during HOT activations a bit easier: locating hospitals in Haiti after the January 2010 'quake, and matching GNIS nodes to names on old US Military Maps during the 2011 Pakistan floods. Ideally we would be able to use a range of matching techniques to enrich the map data created from aerial and satellite imagery at the time of a crisis. Not all such data would be directly suitable for OSM, as there would be potential for trying to match stuff from non-open sources too, but in the main I see the whole process as being an aid to mapping, not a way to directly generate map data.

I have only set out some desiderata here, although I've played a little with some of the basic techniques described here. I certainly have not attempted any mechanism for fuzzy matching, although I have discussed the viability of using Bayesian approaches with a couple of folk who know much more than I do. For me the key thing is to have a plug-in framework for matching engines and matching rules. The flexibility gained will not just allow increasing refinement of techniques, but also enable only appropriate techniques to be used and in the most efficient order.

Friday 9 May 2014

Editing historical road layouts : Persistence in the Urban Landscape 3

It's a while since I've written a post on the theme of persistence in urban landscapes, and this despite covering some additional examples in my talk at SotM13 in Birmingham. This post takes one example I included in that talk: how road layouts persist.

However, I have also used it as a convenient hook to discuss how we might enable the capture of such data for something like OHM (Open Historical Map). I hope the latter discussion might provoke further ideas from those interested in developing both OHM and links with Wikimedia Commons and Wikidata in the future.

Derby Road, Nottingham

Current OSM Map of Derby Road compared with
Sanderson's Map of 1835. Changes in alignment marked with arrows

I've touched on the history of Derby Road before. Here I'll show what I know of its history in detail from around 1800. The key point is that the basic line of the road has changed very little: the major changes being well understood and mostly comprehensively documented by the Lenton Local History Society.