Sunday, 13 July 2014

Upland woods in the Schwarzwald

I wasn't done with looking at woodlands from a mapping perspective in Baden-Württemberg. A couple of days after SotM I walked from the top of the gondola at Feldberg to the Feldberg summit and then back to Hinterzarten.

The first part of the walk was somewhat marred by a fierce hailstorm which left me fairly damp. However I recuperated by drying out a bit over Kaffee and Kuchen at the Baldenweger Huette below the summit.

Forest path, part of the Emil-Thoma-Weg.

Thursday, 3 July 2014

Weingartnermoor Woodland Walk : SotM-EU Mapping Workshop

At State of the Map Europe in Karlsruhe I used the opportunity to develop some of the themes I have already outlined for woodland mapping. Essentially I have three lines of attack:
  • extending the cartography of woodlands in OSM ;
  • finding richer ways of tagging woodlands in OSM ;
  • looking at how we can collect data about woodlands (mapping).
I didn't do anything about the second, but gave a lightning talk outlining just some ideas about woodland cartography (I got a few more over the course of the conference). For the third I thought the best approach would be to get some OSMers into some real woods, because real things are much easier to discuss than abstract ideas on the wiki.

We spend much of the conference listening to and discussing abstract ideas, and (too) little time using the fact that we come from many countries to share our knowledge of tagging. (A little Guerilla Mapping is not out of place either.) So this was a small innovation for an SotM conference too.

SotM-EU Woodland Workshop Participants
Participants on Woodland Mapping Workshop, Karlsruhe June 2014
(minus the author who can be seen in this photo)
I was fortunate in the first place that the Karlsruhe Stammtisch offered some good ideas and advice and the conference organisers chose to add the event to the programme. Secondly, I was fortunate in being supported by more OSMers than I expected, and I know I lost a few due to the early start (necessary to avoid missing the whole Hackday).


Sunday, 29 June 2014

Visit to Poland April-May 2012


Visit to Poland May 2012
My Polish travels Spring 2012

Whilst completing this post I learnt of the death of a dear friend of many years, Prof. Dr. hab. Jacek Hennel, who died aet 88 on 2nd June 2014.

2008.06.22. Jacek Hennel Fot Mariusz Kubik 01

In the Spring of 2012 I spent just over two weeks in Poland: the main aim was to attend my cousin's wedding in Lubartów. Poland is a country I've always been about to visit, so I made sure to extend my trip. SteveC's "OSM ready for prime-time" blog post prompted me to finish this article, because 2 years ago I successfully used lots of OSM-based data for my Polish trip.

This really should be about how Poland, Polishness and Poles have always impinged on my life: and why visiting Poland was bound to be imbued with lots of personal meaning. However, to do justice to the subject would just take too long, so what follows is just impressionistic.

Background

I have known Poles all my life:
  • One of my father's closest colleagues was a Polish physicist (see above, and below); 
  • On my first day at school a neglectful parent forgot to collect me at the end of the day and I was consoled by a thoughtful Polish ice-cream man. He was one of many expatriate Poles, Ukrainians, Latvians and other Balts who had been placed in a refugee camp after WWII, at Ruddington, outside Nottingham. 
  • Others were parents of children in my class at school. 
  • Some of my closest friends at University had Polish fathers, many of whom had fought in WWII, and, usually, Irish catholic mothers. I will always remember one friend's father, saying in passing through the room when we were watching the end of Oh! What a Lovely War on TV: "There were lots of poppies on Monte Cassino in the Spring". 
  • A fellow student spent a summer in Gdansk and married a Polish girl he met there.
  • My cousin married one (see below).
  • There was even a Brighton-based punk band with a partially Polish name.
After graduating I worked in a lab with strong connections to a Warsaw research group: a fellow student spent 3 months in Tarkovski's lab in Warsaw, and his brother was attending the Warsaw Conservatoire at the same time. When martial law was declared I knew people who were put in prison or had to leave Poland by clandestine means: I refused to speak to the Russian in our department at the time. Man of Iron and Man of Marble spoke strongly to me, not just as wonderful films, but because they represented a new hope for Europe arising from Poland. (I still remember my annoyance on leaving the Academy Cinema on Oxford Street, and overhearing some contemptuous remark about the cinematography.)

Later I lived near Ealing which has a vibrant Polish community. Many of my Anglo-Polish friends often resented having atypical fathers, with eccentric habits like making sauerkraut and raising ducklings for the pot, and of course, in general not just being embarrassing in the way parents usually are, but being embarrassingly different. However, in Ealing people of my own age seemed very comfortable with both British and Polish heritages.

Krakow  

Of course Krakow is a jewel of a city, a heart of Polish culture, (and the destination of choice for British Stag parties).

My reason for visiting was to meet Jacek and his wife Jozefa: I've known Jacek as long as I've lived. They are prominent liberal catholic intellectuals who have led lives of extraordinary change, under four completely different political systems: pre-war Poland, the Nazi Generalgouvernement, the Soviet puppet communist regime, and modern Poland of the EU.

At supper in the old Jewish quarter of Kazimierz, Agnieska, Jacek's daughter, joined us. I remember going shopping for Beatles records with her when she stayed with my family in the early 70s. She still likes The Beatles. A little later, around 1974, she gave me Enigmatic, an album by Czesław Niemen.

On the Sunday I visited them at home, and I bought flowers for Jozefa on the Rynek. I was able to enjoy the beautiful scent as I carried them on the tram. Later Jacek told me about his father filming him and his brother with an early 8mm cine camera on the same square before WWII.

High Tatra

Kuźnice and the High Tatra from the slopes of Nosal
Early morning view of Kuźnice and the High Tatra

Another place I had to visit was Zakopane: my father first went there in the early 60s; and, as a child, I had a promise that I would be taught to ski there. Jacek's aunt remembered seeing V.I. Lenin in the village of Poronin just before WWI: it was conveniently close to the border for Lenin to retain contact with revolutionaries within Russia. Unfortunately I ended up visiting Zakopane at the start of May when two national holidays run into one another. This meant that I could not get the cable car to the peak of Kasprowy Wierch : the queues were already vast by 07:30, and the town was very crowded.

On the other hand I saw Zakopane as a uniquely Polish resort. A lot was quite vulgar: lots of eating and drinking; and yet the town has the typical faded allure of a 19th Century spa town too. Of course its real strength is its presence at the foot of the mountains, and although the main paths were busy this was not unpleasant. I hiked up to a big hut just on the snowline, reaching it just before a sharp shower of rain. Most of the way I kept passing, and being passed by, a group of young Polish women who were celebrating their graduation (one had her treasured certificate with her). They kindly made space for me at their table in the hut. The highlight of the walk was seeing (and photographing) a very sluggish Adder which had just come out of hibernation.

Adder, detail of head
Sleepy Adder in the Tatra

Somewhere I have a photo of my Dad outside this same hut in the early '60s.

The rain meant I started back later than planned: I got a bit worried that I wouldn't be out of the Spruce forest before dark, and I had noted the signs about bears too:

There are Bears in the woods
The point when I wished I understood more Polish

Lubartów, Nałęczów, Lublin and Kozłówka

I returned to Krakow to meet other members of my family and do some proper tourism. On the day of the wedding we drove to Lubartów, stopping for lunch at another slightly faded spa town, Nałęczów.

Jacek and Jozefa were by now taking a cure and staying at a sanatorium here. This was the only chance for my siblings to see them. I met Jacek in the hall of the sanatorium: he was amazed that we had found it so easily (OSM of course) and we then took a walk in the park and took the waters before returning to see Jozefa. We then adjourned for lunch at the very pleasant restaurant Ewelina in a villa slightly away from the centre of the spa, with some literary associations with Bolesław Prus.

Taking the waters at Nałęczów
Jacek Hennel, with the author (right) and his siblings.
We still had to make it to Lubartów, which we did comfortably in time for the wedding, although changing was a bit complicated as none of us were actually staying in Lubartów. The wedding itself was not particularly traditional: it was conducted in French and the initial music was provided by Breton bagpipes and bombard. There was much cross-cultural interchange on the music front as the church organist learnt some of the traditional Breton tunes, and the Bretons joined in the interminable renditions of Sto lat over the following 24 hours.


I got to my bed around 4 in the morning in a little country inn at Kozłówka to the W of Lubartów. As we arrived I heard the wonderful song of a Golden Oriole from the park across the road. I pottered about the courtyard of the inn the following morning before we resumed eating and drinking in Polish style for the rest of the afternoon and early evening.

Kozlowka palac front 01
Kozłówka Palace, garden front
CC-BY-SA Wikimedia Commons
I was completely unaware that the park across the road contained a fine palace, Kozłówka Palace, which I could have visited that morning. Instead I only found it the following day, a Monday, when, unfortunately, it was closed. I'd returned to Kozłówka after I went into Lublin to collect my hire car. In the short time available I was able to walk through the historic centre of Lublin, from the Castle to the Krakow gate, passing by the court house in the main square. There are numerous other historic monuments in Lublin given its central role in the establishment of the Polish-Lithuanian Commonwealth by the Union of Lublin.

Biebrza

The rest of my trip was devoted to visiting a couple more National Parks in search of wildlife. I started by heading North to the Biebrza National Park, staying the night on the outskirts of Białystok. I first became aware of the park in a programme made by Bill Oddie for the BBC, and had subsequently met his guide Marek Borkowski at the Rutland Bird Fair.

The Biebrza area is a huge (over 2000 km²) area of wetlands stretching along the river Biebrza to its junction with the . At the centre of the area, and the base for the administration of the National Park, is the former fortress of Osowiec, parts of which are still under military control. The fortress was built when this part of Poland was in the Russian Empire. It controlled a strategic crossing of the marshes with a road and railway line, and was relatively close to the frontier of East Prussia prior to WWI.

Elk at Biebrza
Female Elk (Moose to N. Americans) at dawn.
The problem of wildlife watching in early May is that the key times are dawn and dusk, which makes for very long days. One of the highlights of the Biebrza marshes is that they still contain a decent population of Aquatic Warblers (Acrocephalus paludicola). This bird is now classified as vulnerable by BirdLife International. The favoured place for seeing them was a boardwalk in the S part of the reserve which was about 20-25 minutes' drive from my hotel. This led out into an area of marshland full of the flowers of Bog Bean (Menyanthes trifolia). One morning I was lucky enough to see three Elk from an observation tower a little way N of this boardwalk. The drive back after dark brought another nice find: Hawfinches feeding on the road. On my last day I moved over to the N side of the marshes and visited, all too briefly, an area known as the Red Marsh (Czerwone Bagno). I met some other birders as I returned and we lamented the absence of many warblers.

Białowieża

Białowieża was my last wildlife destination. I really hoped to see European Bison in the wild, and I used a guide for an exhausting 6 hour trip starting at 3:45. Ultimately we were unsuccessful, but I did see many Woodpeckers, and a male Citrine Wagtail. I was also very shocked when pulled over by the police at about 5:30 in the morning (they actually turned out to be the Border Guard), and I took a long time to find all the documents. I really thought I was in trouble despite my guide's re-assurances. (Later I got stopped again, but was well prepared: the moral is that if you go within 2-3 kilometres of a Border Guard barracks in a car with plates from out of the area, expect to be stopped.)

Closest I got to European Bison, at Rezerwat Pokazowy Żubrów
Using a guide meant that I now knew the places where I might find Bison. I visited these morning and evening but still failed. Therefore I had to see them in the little zoo close to Białowieża. Walking from the car park to the entrance I got a great view of a Black Woodpecker which flew up to a nest hole. Whilst watching it a guy accosted me asking if I was interested in any guiding. After a brief conversation I realised that he was Mateusz, the son of my guide Arek, and he realised that I was the English guy staying in Siolo Budy. He had noticed this nest at the same time as me.

Black Woodpecker at nest hole
the tree is probably Aspen (Populus tremula)
I found it amusing that in a few days I had encountered most of the professional guides in Poland.

This part of Poland is great for wildlife, but it doesn't have quite the infrastructure or the number of visitors which would make it easier to get the most out of it. There are plenty of places to stay, but to my surprise most guests were Poles taking weekend breaks (as can be seen by the language of reviews on Trip Advisor and Booking.com). The small number of birders, mostly in guided parties, meant that there was little opportunity to learn about things by word of mouth. This contrasts strongly with Monfragüe in Spain, where the number of birders is not enormous, but in most viewing locations there were other people, which increases the chances of catching unusual species without a guide.

Tatar Country

Another aspect of Eastern Poland, which I only learnt about when planning the trip, was the existence of a few villages which still perpetuate a distinct Tatar culture. These are situated to the NE of Białystok, which itself is probably where most Poles of Tatar descent live. I managed to visit one late in the afternoon when I'd given up on bird-watching because of the rain. Fortunately the mosque was still open even though it was about 6 in the evening. I was welcomed by a young couple and given a tour of the mosque (image below) by the man, who had a cousin living in Chesterfield!

The Mosque at Kruszyniany
Mosque at Kruszyniany
This area was fairly remote: the main road into the village from the Białystok - Brest highway was unsurfaced, a strange contrast after having driven past kilometres of nose-to-tail trucks queuing up for the border crossing into Belarus.

I also wandered around the graveyard a few hundred metres from the mosque: a very pleasant wooded area with a number of active woodpeckers. Many of the gravestones had both Polish and Arabic inscriptions, and to my surprise many had photographs of the deceased on them. I took some photos of these, but don't want to upset any religious sensibilities by posting them here.

Remnants of the Russian Empire

Church of Nativity of John the Baptist at Nowa Wola
Church of the Nativity of John the Baptist, Nowa Wola
belonging to the autocephalous Polish Orthodox Church

Throughout Eastern Poland I kept coming across reminders that this area had once been part of the Russian Empire. A road I drove regularly early in the morning before dawn and each evening after dusk in Biebrza was known as the Tsar's road. Osowiec was an immense Russian-era fortification. Even churches were different because people belonged to different confessions, with both Uniate and Orthodox churches still common.

It was at Białowieża, which had been a hunting estate of the Tsars for several centuries, that the former Russian influence was most obvious. The Tsar's hunting lodge has gone, but the formal park is still there. Throughout the forest area the rides are spaced at an interval of one verst with little marker posts at each corner.

Forest compartment marker, Białowieża
No doubt there were many other survivals and markers of the hundred or so years when this part of Poland was Russian which I missed.

Being an OSM User 

Now I've not said a word about maps so far.

I actually travelled over 2000 kilometres by car, used buses, hiked trails in 3 national parks and wandered around a couple of cities and several towns. I used OSM exclusively for this, with one exception. In the main I used OSM on a Garmin device and used Navit on an Android smartphone. The latter was great, but as I didn't have an in-car charger it was not useful for long journeys. Instead I relied on the Garmin's beeps for upcoming turns. In the whole time I found only a single trail and one linking road of poor quality E of Białystok missing.

Early morning ferry across the San at Czekaj Pniowski E of Sandomierz
(I was totally alarmed to see a ferry symbol on a sign, so was very relieved to find such a simple low-key affair.
I was totally reliant on OSM routing at this point.)
Most of the places I stayed were on OSM, but relatively few POIs such as shops, fire stations and churches were. I think a lot of the data came from imports and was subsequently lost at the licence change. Some of it was clearly out of date: in Lubartów the station has been closed for years and buses stop on the main street, not in the location marked as the bus station. I did no conscious mapping whilst I was there: I had too many other things to do. I did however keep traces and took many geolocated photos.

What I did do 2 years ago was to navigate entirely using OSM across a broad range of Polish landscapes with no serious difficulties.

A lot of this OSM data was removed at the time of the licence change, but the vast majority was restored by the concerted efforts of many mappers (see talk by Marek Kleciak at SotM Baltics). Places like Krakow and Lublin have good quality data, particularly in the old centres, but smaller towns which I visited like Lubartów, Goniadz, and Monki are seriously deficient in POIs. Even a popular tourist centre like Zakopane could do with more on the ground mapping (not least of the seriously good purveyor of cheesecake "Samanta" which has several outlets in Zakopane).

Coda

A couple of weeks ago I returned very briefly to Krakow for the funeral of Jacek Hennel. The experience really emphasised why Poland has always been, and always will be, a country with emotional significance for me.

Friday, 23 May 2014

Fuzzy ideas on fuzzy matching

The UK Food Hygiene data set (FHRS) is just one example of many which it would be nice to be able to compare with OpenStreetMap in a semi-automated manner.

External open data can both be a useful source of missing data and an important tool for evaluating completeness and quality of OSM. FHRS has a number of nice properties:
  • it's large, but not too large; 
  • it is generally of high quality; 
  • it has reasonable precision geolocation; 
  • it is pretty current (most data - five-sixths - is less than 3 years old); 
  • it covers a wide range of different classes of feature (hospitals, schools, pubs, butchers etc.); 
  • and it is comprehensive.
Even with good quality data there are always problems in matching data from two sources (conflation seems to be the GIS word for this):
  • Firstly, the location data provided is often not precise enough to do direct comparisons based on location. 
  • Secondly, elements like names and addresses may have enough variation in them to be non-trivial to match automatically, even if to a human the redundancy in the data means that matching is possible. 
  • Thirdly, names and functions of amenities change. 
  • Fourthly, different sources of data may encode features in different ways. 
  • Fifthly, all large data sets have errors in them.
The last point is the critical one: it is not wise to assume that any particular attribute of a given data set has a fixed level of data quality. So matching must be reasonably tolerant of missing values, values subject to typographical errors, encoding differences and variable accuracy. Also attributes may be interdependent in the incoming data set (for example locations in FHRS are derived from look-ups on postcode centroids): thus a trivial error in one value may result in a serious error in a dependent attribute, such as a single letter typo in a postcode moving a point 1000 kilometres away.

I have therefore been trying to think of ways in which matches can be made independently and with a degree of fuzziness. To this end I've been trying to come up with a listing of the more important types of matching operations involved with FHRS data. The hope is that I cover most use-cases for other data sets.

How fuzzy matches are scored and combined is not treated here. Equally there are tools already within the OSM community: Nominatim, Osmose, OS Locator Musical Chairs which provide elements of the functionality I desire.

Non-locational matching

There are many matching operations which require no knowledge of the associated geographies: although in practice they may correspond to strings used in matching operations (e.g., matching on a local authority name is in practice also matching to a polygon). Despite the fact that some of these things can be inextricably linked I want to treat them entirely separately for matching purposes.

Names

Basic fuzzy matching of names is handled directly within PostgreSQL by a module called fuzzystrmatch. This provides several different ways of comparing the similarity of strings. The same algorithms are available in other RDBMS packages, and for many programming languages. Robert Scott used the Levenshtein algorithm very successfully for OSM Musical Chairs, which helped mappers in Great Britain to track down discrepancies between OSM street names and Ordnance Survey (open) data. (These did not always turn out to be typos on our part.)
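As a rough illustration of what a Levenshtein comparison measures, here is a minimal pure-Python sketch. The function name and the example strings are chosen for illustration only; within PostgreSQL the levenshtein() function supplied by fuzzystrmatch does the equivalent job.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# A transposed pair of letters costs two edits.
print(levenshtein("Chruch Street", "Church Street"))            # 2
# A truncated name costs one edit per missing character.
print(levenshtein("Robin Hood", "Robin Hood and Little John"))  # 16
```

The second result shows why a raw distance is a poor score when one name is simply a shortened form of the other.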

However Levenshtein only works if strings are of similar length. In the case of FHRS I want to be able to find likely matches in these typical cases:

| FHRS Name                   | OSM Name                   | Reason for difference                        |
|-----------------------------|----------------------------|----------------------------------------------|
| Sycamore Primary School     | Sycamore Academy           | Name change (a common type in UK at present) |
| Robin Hood                  | Robin Hood and Little John | Name truncation                              |
| Rose & Crown                | Rose and Crown             | Use of abbreviations                         |
| Rose and Crown Inn          | Rose and Crown             | Name variant                                 |
| The Rose and Crown Hotel    | Rose and Crown             | Name variant                                 |
| Rose and Crown Public House | Rose and Crown             | Unnecessary explicit non-name info           |
| Royal Gourock YC            | Royal Gourock Yacht Club   | Use of abbreviations                         |

Royal Gourock Yacht Club - geograph.org.uk - 858830
Clubhouse, Royal Gourock Yacht Club
© CC-BY-SA Thomas Nugent on Geograph via Wikimedia Commons

The major common feature of these strings is that there are elements which obviously match when inspected visually. Common sense and real-world knowledge (that lots of schools are changing their name to use "Academy", for instance) tells us that many of the elements are unimportant and can be ignored in matching. It's a bit harder to persuade a computer to do the same!

My basic idea is that we should match individual words, so that each string is divided into individual word strings, which I refer to as tokenising. Very common tokens ("and", "the", "&") and domain specific ones ("school", "academy", "primary", "hotel", "inn", "pub", "public house", "club", "cafe") are eliminated before attempting matches.
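A minimal sketch of this tokenising step, assuming the stop lists are just the examples given above. The names COMMON, DOMAIN and tokenise are hypothetical, and splitting "public house" into two separate stop tokens is a simplification.

```python
import re

# Illustrative stop lists only: a real list would need tuning per data set.
COMMON = {"and", "the", "&"}
DOMAIN = {"school", "academy", "primary", "hotel", "inn", "pub",
          "public", "house", "club", "cafe"}


def tokenise(name):
    """Lower-case a name, split it into word tokens and drop stop tokens."""
    tokens = re.findall(r"[a-z0-9']+|&", name.lower())
    return [t for t in tokens if t not in COMMON and t not in DOMAIN]


print(tokenise("The Rose and Crown Hotel"))   # ['rose', 'crown']
print(tokenise("Rose & Crown"))               # ['rose', 'crown']
print(tokenise("Sycamore Primary School"))    # ['sycamore']
print(tokenise("Royal Gourock YC"))           # ['royal', 'gourock', 'yc']
```

Abbreviations such as "YC" survive as tokens in this sketch; expanding them would need a separate lookup.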

Charlbury, the Rose and Crown pub - geograph.org.uk - 801266
One of the many Rose & Crown pubs in Britain :
this one has particular significance for OpenStreetMap
© CC-BY-SA Francois Thomas on Geograph via Wikimedia Commons.
A totally naive way would just be to look for single token matches and then apply increasingly stringent matching criteria. Unfortunately this falls down on performance grounds: there are a lot of "Rose and Crown" pubs in Britain (236 in FHRS data, 174 on OSM), and a not inconsiderable number of "Rose" and "Crown" pubs, not to say "Rose Cottage" restaurants. A crude initial single token match will result in thousands of potential matches just for these strings (a minimum of 2*236*174 => 82k+). At this stage I think a means of matching multiple tokens in the first pass is less likely to end up running into major performance issues when scaled for larger data sets.

One reason to worry about this performance is that we also need to take account of typos in both data sets, and therefore either as part of the initial match or as a subsequent trawl through unmatched items we also need to use Levenshtein distance for token comparison.

It may well be that the most effective route is to apply stringent matching criteria (all tokens match and are in the same order) before progressively relaxing the constraints. Only looking at real world examples will help evaluate which method will be most effective. In practice the method used may need to be parameterizable to reflect the quality of one or more data sets. For instance I remember being frustrated trying to match GNIS names to US military maps of Pakistan during the 2011 flooding, and I want to work on techniques which are just as applicable to humanitarian scenarios as to improving pub coverage in the UK.
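One way to picture the stringent-first approach is a function that reports the strictest level at which two token lists agree. This is only a sketch: the four levels, their names and the max_typos parameter are assumptions made for illustration, not a tested scheme.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def match_level(tokens_a, tokens_b, max_typos=1):
    """Return the strictest (hypothetical) level at which two token lists
    match, or None if they do not match even at the loosest level."""
    if not tokens_a or not tokens_b:
        return None
    if tokens_a == tokens_b:
        return "exact"       # all tokens match and are in the same order
    if set(tokens_a) == set(tokens_b):
        return "reordered"   # same tokens, different order
    short, long_ = sorted((tokens_a, tokens_b), key=len)
    if set(short) <= set(long_):
        return "subset"      # e.g. a truncated name
    # Loosest level: each token of the shorter name has a near neighbour.
    if all(any(edit_distance(s, t) <= max_typos for t in long_)
           for s in short):
        return "fuzzy"
    return None


print(match_level(["rose", "crown"], ["rose", "crown"]))
print(match_level(["robin", "hood"], ["robin", "hood", "little", "john"]))
print(match_level(["rose", "crowm"], ["rose", "crown"]))
print(match_level(["rose", "cottage"], ["rose", "crown"]))
```

Running the strict levels over the whole data set first, and the expensive fuzzy level only over what remains unmatched, keeps the number of Levenshtein comparisons down. Setting max_typos per data set is one way of making the method parameterizable.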

Feature (Point of Interest) Type 

FHRS data comes with a range of Business Types:

  • "School/college/university"
  • "Pub/bar/nightclub"
  • "Restaurant/Cafe/Canteen"
  • "Importers/Exporters"
  • "Mobile caterer"
  • "Retailers - other"
  • "Retailers - supermarkets/hypermarkets"
  • "Takeaway/sandwich shop"
  • "Farmers/growers"
  • "Distributors/Transporters"
  • "Other catering premises"
  • "Hospitals/Childcare/Caring Premises"
  • "Hotel/bed & breakfast/guest house"
  • "Manufacturers/packers"
Although how these correspond to OSM tags is, in the main, fairly obvious, most of the FHRS categories are broader.

There is also the straightforward classification problem: it is not always easy to decide if a place is a restaurant or a pub. Yesterday, I had the problem of deciding if a place is a hospital or a care home (the probable answer is that once it was a hospital but now is a care home). Thus one needs some way of matching semantic categories (presumably in the first instance using trivial rules such as pub ⇔ "Pub/Bar/Nightclub"), but also a means of identifying potential overlaps or 'spill-over' between semantic categories.

Manor Pub Restaurant on Nottingham Road Toton Corner - geograph.org.uk - 1058543
The Manor at Toton Corner
A classic pub building, but these days a restaurant with a small bar area: places like this are likely to be tagged or coded differently by different people.
Other examples are: pubs with overnight accommodation; hotels with pub-like bars; and petrol stations only marked as such, but with a convenience store on site too.
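The trivial rules and the spill-over between categories could be held in two lookup tables, along these lines. Both tables are guesses made for illustration: the pairing of FHRS business types with OSM tags has not been checked against real data, and the names PRIMARY, SPILL_OVER and category_match are hypothetical.

```python
# Assumed "obvious" correspondences: FHRS business type -> OSM (key, value).
PRIMARY = {
    "Pub/bar/nightclub": {("amenity", "pub"), ("amenity", "bar"),
                          ("amenity", "nightclub")},
    "Restaurant/Cafe/Canteen": {("amenity", "restaurant"),
                                ("amenity", "cafe")},
    "Takeaway/sandwich shop": {("amenity", "fast_food")},
    "Hotel/bed & breakfast/guest house": {("tourism", "hotel"),
                                          ("tourism", "guest_house")},
}

# Assumed spill-over: tags a mapper might plausibly have used instead.
SPILL_OVER = {
    "Pub/bar/nightclub": {("amenity", "restaurant"), ("tourism", "hotel")},
    "Restaurant/Cafe/Canteen": {("amenity", "pub"), ("amenity", "fast_food")},
    "Hotel/bed & breakfast/guest house": {("amenity", "pub")},
}


def category_match(fhrs_type, osm_tags):
    """Return 'primary', 'spill-over' or None for an FHRS business type
    compared with a dict of OSM tags."""
    tags = set(osm_tags.items())
    if tags & PRIMARY.get(fhrs_type, set()):
        return "primary"
    if tags & SPILL_OVER.get(fhrs_type, set()):
        return "spill-over"
    return None


# A pub building now run as a restaurant still matches, but more weakly.
print(category_match("Pub/bar/nightclub",
                     {"amenity": "restaurant", "name": "The Manor"}))
```

Keeping spill-over separate from the primary rules means a scoring step can weight the two kinds of match differently.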

My impression is that it will be easier to identify rules for the fuzzy aspects of matching semantic categories by using training sets. Once again FHRS data, simply because it has lots of other detail (addresses, limited geolocation and names) is a decent starting point for identifying how fuzzy OSM tag categories might be. 

There is another problem here: systematic mis-classification. At least one local authority, Gedling, places all school contract caterers in the category "Other Caterers", when it is clear that they should be in "School/college/university".  Of course, in cases like this, one would hope that we can work to get the data properly coded: no-one can assess hygiene of school catering in this district easily with the data as it stands.

Addresses

It may be odd to treat addresses as non-locational data, but here I am largely referring to string matching of one or more parts of the address : independent from any awareness of the locations associated with these strings.

Ideally, the address is parsable into discrete elements. This is probably true for most UK addresses which consist of a number, street and post town, but as is usually the case all the difficulty lies in the exceptions. Furthermore each country requires parsing rules specific to its own particular cases. For instance in Spain, it is not unusual for forms requesting address data to ask for the door (often izquierda (left) and derecha (right)) and floor number as well as the rest of the address. Following edits on the OSM wiki I have also learnt about addresses in Florence (businesses and residential addresses have separate but overlapping numbers) and in the Czech Republic. Places like Japan and South Korea have quite different addressing schemes too.

Thus decent functionality for address parsing is to my mind rather too complicated to look at straight away. Instead we can focus on parts of the address which we are highly likely to have already captured in OSM. Notably these are the street name and the postal town/village/city. The former is easier to use, not least because the Royal Mail in Britain insists on some very odd uses of locations in addresses.

Once again very common names present a matching volume problem (Church Street, Main Street etc.) but this can be greatly reduced by applying other non-geographical constraints (such as the local authority which provided the data) from the data set. (This may seem to be making everything too hard, but I really want to keep pure geo-matching separate: ultimately it should make for a cleaner architecture). One important reason for doing this is error handling.
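A small sketch of how a non-geographical constraint cuts the volume: streets are indexed by the pair (local authority, street name), so a lookup for "Church Street" only ever sees candidates from the same authority. The sample rows, identifiers and function names are invented for illustration.

```python
from collections import defaultdict


def norm(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def build_index(osm_streets):
    """Index (authority, street name, id) rows by authority and street."""
    index = defaultdict(list)
    for authority, street, osm_id in osm_streets:
        index[(norm(authority), norm(street))].append(osm_id)
    return index


def candidates(index, authority, street):
    """Return ids of streets with this name inside this authority only."""
    return index.get((norm(authority), norm(street)), [])


# Invented sample data: the same street name in three authorities.
osm_streets = [
    ("Nottingham", "Church Street", 101),
    ("Gedling", "Church Street", 102),
    ("Broxtowe", "Church Street", 103),
    ("Gedling", "Main Street", 104),
]
index = build_index(osm_streets)
print(candidates(index, "Gedling", "church  street"))   # [102]
```

Because the authority is compared as a plain string here, no geometry is involved at this stage, which keeps this step separate from the geo-matching.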

A simple inspection of addresses in the Land Registry Prices Paid file for the town of Maidenhead (for obscure technical reasons I used a postcode, SL6, as a proxy) reveals a small number of addresses where the postcode does not match the polygon of the named district in the file.

Land Registry records with either an erroneous assignment of postcode or of local authority
Data are for post district SL6 (orange line) where the postcode centroid was not located in the boundary of the local authority in the records.
Boundaries from OS Open Data Boundary Line, Post Code centroids from CodePoint Open,
SL6 boundary from Geolytix (based on OSGB Open Data sets)
In large datasets these types of errors are invariably present: failing to cope with them (including chucking them out) often leads to obscure complications both with the data and code to manipulate it.
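The kind of check behind the figure can be sketched as a point-in-polygon test that flags records for review instead of silently dropping them. This is a simplified illustration: the boundary is assumed to be a single ring of (easting, northing) vertices with no holes, and the record layout and all names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the ring of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def flag_mismatches(records, boundaries):
    """Return ids of records whose point lies outside the boundary of the
    authority named in the record (or whose authority is unknown)."""
    suspect = []
    for rec_id, authority, easting, northing in records:
        ring = boundaries.get(authority)
        if ring is None or not point_in_polygon(easting, northing, ring):
            suspect.append(rec_id)
    return suspect


# Invented square "authority" and two records, one of them misplaced.
boundaries = {"Exampleshire": [(0, 0), (10, 0), (10, 10), (0, 10)]}
records = [("a", "Exampleshire", 5, 5), ("b", "Exampleshire", 15, 5)]
print(flag_mismatches(records, boundaries))   # ['b']
```

Flagged records can then be matched with looser rules or set aside, so that they do not contaminate the main matching pass.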

Locational Matching

By locational matching I mean comparing data sets based on geolocated data: whether this is a point (as with FHRS data located at postcode centroids), or an area (again with FHRS data, the local authority which has collected the data).

It should be noted that most datasets will have two implicit sources of geolocated data: scope (the defined area of the data set, typically with OpenData sets, scope will be a country, state, or local authority) and source (who collected the data, sometimes identical to scope). The important aspect of these two implicit sets of location data is that they are likely to be free of basic locational errors. A national data set is very unlikely to include data for other countries; inaccurate locations outside the source local authority are likely to be erroneous. This basic information must not be overlooked as it provides a good control on data reliability and will often enable other matching to be much more constrained.

Scope

The FHRS data potentially covers the entire United Kingdom, which is its basic scope. The Scottish part of the scheme has different data and therefore also has a separate scope.

Source

As FHRS data is collected independently by each local authority, and this information is contained in the source data, this provides a finer grain of location data which can be treated as having a very low error rate. Source is important because data quality is likely to vary by source (it certainly does in FHRS data).

Explicit Locations

My expectation is that most data sets are likely to provide explicit location information in the form of lat/lon pairs (or eastings and northings in other co-ordinate systems), with line or polygon information being rarer. Certainly Nottingham OpenData have been removing many data formats in favour of plain CSV: this is pragmatic, it is easy to load the data in a spreadsheet, but not so simple to look at it in a GIS. Local users are also more likely to be able to make use of the data without requiring that it be mapped in the first instance. With this in mind most of what follows assumes data is delivered with a point location. In many cases if data does have a more elaborate geographical content, much of the matching may still be carried out based on centroids.

The key problem with centroids is that one has no idea of the degree of imprecision of the data. For instance playing with GNIS data exposes many data items which are located a long way from their true location, whereas the Nottingham OpenData on Streetlights is accurate to the nearest metre. Use of postcode centroids makes things slightly harder as their degree of accuracy will be a function of local postcode density. The worst case for a postcode in Great Britain is a farm which is over 11 km from the centroid.

Therefore at least for things like UK postcodes and GNIS data the plain locational matching will need to start with a fairly tight circumscription of the area of potential matches, which can be progressively relaxed as most matches are made.

Other datasets which provide metadata on precision (Nottingham Streetlights, q.v.) can probably be handled with a single suitable matching operation.
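
The idea of starting tight and progressively relaxing the search area might be sketched as follows. This is only an illustration under stated assumptions: coordinates are planar (eastings and northings in metres), the radii are arbitrary, and a match is accepted only when exactly one unmatched candidate lies within the current radius. The function name and data layout are hypothetical.

```python
import math

def match_by_distance(source, targets, radii=(50, 200, 1000)):
    """Match source points to target points, tightest radius first.

    source, targets: dicts of id -> (easting, northing) in metres.
    Returns a dict of source_id -> (target_id, radius_used).
    """
    matches = {}
    free_targets = dict(targets)  # targets not yet claimed by a match
    for radius in radii:
        for sid, (sx, sy) in source.items():
            if sid in matches:
                continue
            near = [tid for tid, (tx, ty) in free_targets.items()
                    if math.hypot(sx - tx, sy - ty) <= radius]
            # Accept only an unambiguous candidate at this radius
            if len(near) == 1:
                matches[sid] = (near[0], radius)
                del free_targets[near[0]]
    return matches

# Made-up points: s1 has a close neighbour, s2 only a distant one.
source = {"s1": (0, 0), "s2": (100, 0)}
targets = {"t1": (10, 0), "t2": (250, 0)}
result = match_by_distance(source, targets)
```

Recording the radius at which each match was made gives a crude confidence measure: matches found only at the widest radius deserve more scrutiny.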

Other implicit locations

Oddly, postcodes feature here too. It is usual, although not mandatory, that a postcode can belong to one and only one street. The main exceptions are in rural areas where all houses in an area share a postcode (usually there are no street names in this case), or when a group of houses are associated with a subsidiary street as well as the main one. In the case where the postcode does belong to the street, the data can be matched to the geometry of the street (which may or may not be helpful).

Outline 

Here's a brief outline of what I think the main components should be:
  • One or more matching engines. These apply a rule-driven matching technique to two data sets, with an optional rule-driven filter (for performance reasons). Matching could be bi-directional, but for simplicity I assume it is uni-directional as the other direction can be done by swapping the two dataset parameters. Output from the matching engine is a set of matches scored by a likelihood measure. A minimum of two matching engines can be recognised: one based on strings, the other on geographies. Filters are likely to be a straight datasetA.attributeX = datasetB.attributeX (e.g., local authority identifier).
  • Matching should allow increasing/decreasing tightness of constraints. Probably by allowing recursive calls within a matching rule.
  • A match selector. Given n different matching routines each producing a likelihood estimate, something needs to evaluate these scores and output final matches.
  • A matching routine chooser. With different data sets the best order of application of matching routines is likely to differ, so it may be better to train the system with a known data set in order to find the most efficient way to apply matching routines.
  • A simple way of specifying rules.
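
To make the outline above a little more concrete, here is a small Python sketch of two matching engines, a filter and a match selector. Everything in it is an assumption made for illustration: the record fields (name, x, y, authority), the scoring functions, the averaging of scores and the threshold. It uses difflib from the standard library as a stand-in for proper string matching.

```python
from difflib import SequenceMatcher
import math

def string_engine(a, b):
    """Likelihood (0..1) from similarity of the two names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def geo_engine(a, b, scale=100.0):
    """Likelihood from distance in metres: 1 at zero distance, then decaying."""
    d = math.hypot(a["x"] - b["x"], a["y"] - b["y"])
    return 1.0 / (1.0 + d / scale)

def same_authority(a, b):
    """Filter of the form datasetA.attributeX = datasetB.attributeX."""
    return a["authority"] == b["authority"]

def run_engine(engine, set_a, set_b, prefilter=None):
    """Uni-directional match: score each pair (a, b) that passes the filter."""
    scores = {}
    for ka, a in set_a.items():
        for kb, b in set_b.items():
            if prefilter is None or prefilter(a, b):
                scores[(ka, kb)] = engine(a, b)
    return scores

def select_matches(score_sets, threshold=0.6):
    """Match selector: average the engines' scores, keep the best b for each a."""
    best = {}
    for pair in set().union(*score_sets):
        avg = sum(s.get(pair, 0.0) for s in score_sets) / len(score_sets)
        ka, kb = pair
        if avg >= threshold and avg > best.get(ka, (None, 0.0))[1]:
            best[ka] = (kb, avg)
    return best

# Made-up records: b2 has a similar name but a different authority.
set_a = {"a1": {"name": "The Rose and Crown", "x": 0, "y": 0,
                "authority": "Nottingham"}}
set_b = {"b1": {"name": "Rose & Crown", "x": 20, "y": 0,
                "authority": "Nottingham"},
         "b2": {"name": "Rose and Crown", "x": 5000, "y": 0,
                "authority": "Derby"}}
by_name = run_engine(string_engine, set_a, set_b, same_authority)
by_place = run_engine(geo_engine, set_a, set_b, same_authority)
best = select_matches([by_name, by_place])
```

Because each engine is just a function returning a score for a pair of records, further engines can be plugged in without touching the selector.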

Summary

I came to this problem with some old experience of use of Harte Hanks Trillium software for keeping track of commercial customers in a banking application. I didn't use the tool myself but it was an important part of being able to build a single common view of customers, and part of this was matching up different versions of a business name captured both in internal and external systems.

Years before on another (antique) banking system we came upon the unfortunate decision to create internal keys based on the client name, which meant that we lost any history when the name altered (potentially just a typo). I mention these just to illustrate that string matching and intelligent address parsing have always had important business applications. However, OpenSource resources to do similar things are few and far between.

Twice I have been struck how fairly simple matching operations would have made mapping during HOT activations a bit easier: locating hospitals in Haiti after the January 2010 'quake, and matching GNIS nodes to names on old US Military Maps during the 2011 Pakistan floods. Ideally we would be able to use a range of matching techniques to enrich the map data created from aerial and satellite imagery at the time of a crisis. Not all such data would be directly suitable for OSM, as there would be potential for trying to match stuff from non-open sources too, but in the main I see the whole process as being an aid to mapping, not a way to directly generate map data.

I have only set out some desiderata here, although I've played a little with some of the basic techniques described here. I certainly have not attempted any mechanism for fuzzy matching, although I have discussed the viability of using Bayesian approaches with a couple of folk who know much more than I do. For me the key thing is to have a plug-in framework for matching engines and matching rules. The flexibility gained will not just allow increasing refinement of techniques, but also enable only appropriate techniques to be used and in the most efficient order.


Friday, 9 May 2014

Editing historical road layouts : Persistence in the Urban Landscape 3

It's a while since I've written a post on the theme of persistence in urban landscapes, and this despite covering some additional examples in my talk at SotM13 in Birmingham. This post takes one example I included in that talk: how road layouts persist.

However, I have also used it as a convenient hook to discuss how we might enable the capture of such data for something like OHM (Open Historical Map). I hope the latter discussion might provoke further ideas from those interested in developing OHM and both links with Wikimedia Commons and Wikidata in the future.

Derby Road, Nottingham


Current OSM Map of Derby Road compared with
Sanderson's Map of 1835. Changes in alignment marked with arrows

I've touched on the history of Derby Road before. Here I'll show what I know of its history in detail from around 1800. The key point is that the basic line of the road has changed very little: the major changes being well understood and mostly comprehensively documented by the Lenton Local History Society.

The main changes of line that I am aware of are:

A hollow or bank: part of the old alignment of Derby Road
  • In the 1820s the owner of Wollaton Park built a brick wall surrounding the estate. This was part of a series of works designed to defend Wollaton Hall and its park from attack by discontented local people. As a result of the wall, the route of Derby Road was pushed further S from the top of Adam's Hill (today marked by its use as the name of the residential service road on the N side of Derby Road) to where the road crossed, in quick succession, the Nottingham Canal and the River Leen. This change in alignment is obvious if one inspects modern maps: my brother drew my attention to this several years ago. At the entrance to Wollaton Park there are traces of the original alignment, as shown on the 6 inch map of a hundred years ago. My general impression is that the original route was less steep than the current route.

    This image of Wollaton Park Housing Estate, Wollaton, 1928
    from Britain from Above (CC-BY-SA-NC) shows two changes of alignment
    1: the carriage road from Lenton Lodge which is disappearing under housing is on the old alignment of Derby Road,
    2: in the near foreground right the old Rose and Crown pub can be seen with a grassy area to its right (a bowling green

  • Realignment around the Rose and Crown pub. The original Rose and Crown building created an awkward pinch point on the road, with possibly a slightly blind corner. As motor traffic built up this no doubt became more and more inconvenient. In 1935-6, a new, more substantial pub was built in the grounds of the old pub. Once the new pub was open, the old pub was demolished and the road widened and straightened. Details in the Lenton Times article online.

    Although taken with rain running down the bus window this image
    shows the sharp change of alignment introduced by the railway bridge on Derby Road
    (starting at the second set of traffic lights).
  • Bridge over the railway. Originally Derby Road crossed the railway by a level crossing next to Lenton Station. The original alignment is still present next to the Three Wheatsheaves to the W of the railway. In 1909 a railway bridge was built slightly to the S of this alignment. The details of these changes are described in the Lenton Times Issue 26.
  • Junction with Nottingham Ring Road. Originally this was a T-junction with a minor lane called Sandy Lane. The junction as it stands was created in the 1920s just after the City Council had bought Wollaton Park and started to build social housing in the E part of the park. Middleton Boulevard and (the then) Abbey Boulevard (now Clifton Boulevard) were part of a set of dual carriageways built round the W side of the city simultaneously with extensive tracts of council houses. The principal engineer was a Mr Clarke. He was the father of our GP when I was a child.

    Originally this junction was a roundabout. Sometime in the 1960s it was re-configured to be signal-controlled, but with rising traffic this still wasn't enough. In the early 1980s the ring road bypassed the junction entirely through the construction of an underpass. Finally in the past 5 years the new roundabout has been altered so that traffic flow is all under the control of traffic signals. However none of these changes has resulted in any significant alteration of the basic line of the road.
I have started trying to document the alignment of the road at the start of the 19th century on Open Historical Map. This is a bit of a cheat as my primary source is Sanderson's Map 20 Miles around Mansfield which was compiled in the early 1830s and thus post-dates the first alteration of alignment.


Wollaton Park area, showing Derby Road alignment c. 1820
vector map on OHM View Larger Map

One of the nice things about OHM is that it is vector data, so it is very easy to compare current OSM ways for Derby Road and the ones I have entered in OHM, and furthermore look at the temporal tagging needed to enable use of a single way in different time periods (as distinct from creating new vector data sets for each time period snapshot).

Temporal Tagging of Ways 

By temporal tagging I mean using some tags to indicate the time period when a way was in existence with the associated attributes. Typically this is done in temporal databases by using start_date and end_date columns, and the basic idea is to do this with tags. We know this can work for OSM data. Since my original tests, MaZDerMind, the guy behind the OSH full history extracts of the OSM database has developed a comprehensive suite of tools to populate and render OSM data using a temporal schema (although this is a system date, the time data entered the system, not a real world date which is what we want with OHM). For a neat example of what is possible see Joost Schouppe's diary entry which arrived at an opportune moment for this post.

Derby Road is a nice simple case as long as we avoid worrying about attributes: in the 200 years under consideration things like speed limits, surfaces, streetlighting, width will all have changed in many places many times. (Even in the past 5 years changes in bus routes have led to many changes to the composition of the ways which make up Derby Road).

So if we ignore attributes we can take every existing section of Derby Road and add them to OHM with tags of start_date=1800-01-01 and either not create an end_date tag or populate it with some far off date in the future (conventionally 9999-12-31 is used). Now for those sections of road which are newer than 1800 we need to adjust the start date. Finally the ways which have gone in the alignment which I have already put into OHM can be similarly tagged with a start_date of 1800-01-01, but with different end_dates according to the information above.

In outline:


No. | Start location | End location | Start Date | End Date | Comment
1 | Priory Roundabout | Adams' Hill | 1800 | 9999 |
2 | Adams Hill | Lenton Lodge | 1800 | 1825 | Through Wollaton Park
3 | Adams Hill | Lenton Lodge | 1825 | 9999 | S of Wollaton Park
4 | Lenton Lodge | W of Rose and Crown | 1800 | 9999 |
5 | W of Rose and Crown | E of Rose and Crown | 1800 | 1936 |
6 | W of Rose and Crown | E of Rose and Crown | 1800 | 1936 |
7 | Three Wheatsheaves | Faraday Road | 1800 | 1910 | Level Crossing
8 | Three Wheatsheaves | Faraday Road | 1910 | 9999 | Railway Bridge
9 | Faraday Road | Tollhouse Hill | 1800 | 9999 |

I have ignored the extra complexity of the Middleton Boulevard/Clifton (Abbey) Boulevard/Sandy Lane junction because the changes are more recent and are much more intricate.
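
As a small illustration of how such start and end dates could be used, the following Python sketch takes a few of the sections from the outline above and picks out those which exist at a given date. The outline only gives years, so 1 January is an assumed placeholder, and treating the end date as exclusive is also an assumption. The function names are hypothetical.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # conventional "no end date" value

# (section no., start_date, end_date, comment) from the outline;
# the day and month are assumed, as the outline gives years only.
SECTIONS = [
    (2, date(1800, 1, 1), date(1825, 1, 1), "Through Wollaton Park"),
    (3, date(1825, 1, 1), FAR_FUTURE, "S of Wollaton Park"),
    (7, date(1800, 1, 1), date(1910, 1, 1), "Level Crossing"),
    (8, date(1910, 1, 1), FAR_FUTURE, "Railway Bridge"),
]

def to_tags(start, end):
    """Build the temporal tags, omitting end_date for open-ended ways."""
    tags = {"start_date": start.isoformat()}
    if end != FAR_FUTURE:
        tags["end_date"] = end.isoformat()
    return tags

def exists_at(start, end, when):
    """True if the way exists on the date 'when' (end date exclusive)."""
    return start <= when < end

def snapshot(sections, when):
    """Section numbers of the ways which exist at the given date."""
    return [no for no, start, end, _ in sections
            if exists_at(start, end, when)]

in_1820 = snapshot(SECTIONS, date(1820, 6, 1))
in_1900 = snapshot(SECTIONS, date(1900, 1, 1))
in_2014 = snapshot(SECTIONS, date(2014, 5, 9))
```

With the end date exclusive, a way which ends in 1825 and its replacement which starts in 1825 never both appear in the same snapshot.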
 

Editor support for Temporal tagging and data entry 

So far all I have done is show that:
  • We broadly know how we ultimately want to store historical geographical information (as a temporal database of some form).
  • We could use tags for the temporal information.
  • Creating such data for road networks is not too horrendous.
  • The more attributes are added, the more different temporal records are needed.
I'd also hoped to discuss the following point in detail, but this post is long enough already.
  • Holding attributes in a different federated data store has some appealing properties; not the least of which is a potentially larger community to crowd source such information.

However in the short term we want to make it easier to capture such data, and in doing so make the most of all the software infrastructure developed by OSM over the past 10 years. OSM has no support for temporality in its toolsets, and to my mind trying to create a modified OSM by adding temporality is a step too far for OHM in its early stages.

For some time I have been thinking about how one can do this by extending the editors. A recent discussion suggested it was really time for me to write up some of these notions. I'm doing this in a bit of a hurry because there is a big hack weekend in Zurich imminent and I hope these might at least provoke some discussion.

My starting point is as follows:
  • Only OHM elements with attributes need to have temporal tags. Tags solely concerned with metadata (source, fixme, note, etc.) do not count.
  • Temporal tags are special, I call them 'privileged' tags as the editors treat them in special ways from ordinary tags. In practice I think they should be hidden from the end-user who can only manipulate them through the editing interface. (There is already some support in editors for recognising and handling some particular tags, although this is to delete them!)
  • The basic editing would be on a historical snapshot with defined start and end dates. This would be defined at the start of the editing session, and initially the two may be identical.
  • Adding tags to an untagged element would automatically cause the element to inherit the base temporal tags of the edit session.
  • Individual elements could have their temporal ranges changed by an additional control within the editor (for instance a slider on the left-hand tag panel in Potlatch2). Again this is a second stage step.
  • Temporal tags would be all UPPERCASE to highlight their distinctness. So something like OHM_START_DATE and OHM_END_DATE.
  • Temporal tags would probably include some indication of the fuzziness of the dates in the value field, so something like earlier_than 1800-01-01, or no_later_than 1963-11-22. The sophisticated way in which opening hours tagging has been handled suggests that this is viable if the values are controlled by the edit interface and are thus not truly freeform.
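
If the values of the temporal tags are controlled by the editor, parsing them is straightforward. The following Python sketch handles a plain ISO date or one of the two qualifiers used as examples in the list above; the grammar, the class and the function name are assumptions for illustration, not a proposed standard.

```python
from datetime import date
from typing import NamedTuple, Optional

# Only the two qualifiers given as examples; a real scheme may need more.
QUALIFIERS = {"earlier_than", "no_later_than"}

class FuzzyDate(NamedTuple):
    qualifier: Optional[str]  # None means an exact date
    value: date

def parse_temporal_value(raw: str) -> FuzzyDate:
    """Parse 'YYYY-MM-DD' or '<qualifier> YYYY-MM-DD'."""
    parts = raw.split()
    if len(parts) == 1:
        return FuzzyDate(None, date.fromisoformat(parts[0]))
    if len(parts) == 2 and parts[0] in QUALIFIERS:
        return FuzzyDate(parts[0], date.fromisoformat(parts[1]))
    raise ValueError(f"unrecognised temporal value: {raw!r}")

exact = parse_temporal_value("1800-01-01")
fuzzy = parse_temporal_value("no_later_than 1963-11-22")
```

Anything which does not fit the expected pattern raises an error, which is reasonable if only the edit interface ever writes these values.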
This provides a baseline which requires the addition of
  • a dialogue or control on start of the edit session to set the basic time snapshot
  • an additional control on the standard edit panel: for Potlatch this may be directly configurable, although preventing it being shown in Advanced Edit mode will require coding work.
Now come the harder bits. These are:
  • Modifying how the editor pulls data down from OHM through the standard OSM API which is not temporally aware. The simplest case is to pull everything down and only expose elements which fall into the appropriate temporal range. The best analogy to this is JOSM's Purge menu option. Untagged nodes will present a particular problem, as they need to be purged based on their parents' temporal tags.
  • Changing the temporal tags of an existing element. If temporal ranges are extended this may create conflicts with existing ways in another time period. I think we have to live with that and have tools to resolve such conflicts. If the range of a temporal element is contracted then the relevant element needs to be split into two (when only start or end date is changed) or three (when both are changed).
  • Changing the geometry of an existing temporal element. Geometry changes may only apply to the relevant time box of the snapshot. Therefore such changes should be treated in a similar way to changing the temporal bounds of an element. For instance if I start with Derby Road as it is now as a single way, then if I realign it in an 1800-1820 snapshot view, then this creates two ways: one 1800-1820, the other 1820-future. (Of course ideally the editor would split the ways in such a way that this does not occur.)
  • Absence of linkage across time. This is analogous to the fact that OSM has no notion of a street, so for now I don't think this is an issue. The main problem is that the time boxes of temporally adjacent elements may come adrift if edited as part of different time snapshots. Anyway, we always have relations which are the time honoured way of dealing with data which is not conveniently handled by existing editors or data consumers: i.e., an institutionalised hack!
  • Shared untagged nodes. It is entirely possible and reasonable for ways from different times (epochs?) to share nodes: this is certainly allowed if temporal tags are only needed on elements with 'normal' tags. A classic example, which current OSM tagging allows for, are cycleways on old railway alignments. The problem arises when someone wants to refine the geometry of one of the relevant ways for a given snapshot time. Logically we cannot know that the shared nodes are always shared in all time periods, so a change in geometry of a shared node should require that the new position be represented by a new node.
  • Merging elements across time periods. All our editor views are geographical. Additional tools or interfaces will be needed to recombine ways with disjunct temporal tags. This will be necessary as the above default actions may create many ways representing the same object which could be combined.
  • Validation and QA tools. A host of new validation and QA utilities will be needed. Particularly tools will be needed to spot overlaps and gaps in elements over time rather than space. Suitable visualisations of
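
Two of the points in the list above, splitting an element when its temporal range is contracted and checking for gaps and overlaps over time, can be sketched in a few lines of Python. This is a minimal illustration only: time boxes are treated as half-open (start inclusive, end exclusive), which is an assumption, and the function names are hypothetical.

```python
from datetime import date

FAR_FUTURE = date(9999, 12, 31)

def split_on_contraction(old_start, old_end, new_start, new_end):
    """Time boxes left after narrowing [old_start, old_end) to [new_start, new_end).

    Returns (start, end, edited) tuples: two pieces if only one bound
    changed, three if both did.
    """
    if not (old_start <= new_start < new_end <= old_end):
        raise ValueError("new range must lie within the old one")
    pieces = []
    if new_start > old_start:
        pieces.append((old_start, new_start, False))
    pieces.append((new_start, new_end, True))
    if new_end < old_end:
        pieces.append((new_end, old_end, False))
    return pieces

def gaps_and_overlaps(boxes):
    """Report gaps and overlaps between temporally adjacent boxes.

    boxes: (start, end) pairs for what should be one continuous feature.
    Only neighbouring boxes are compared after sorting by start date, so
    overlaps between non-adjacent boxes are not reported.
    """
    problems = []
    ordered = sorted(boxes)
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 > e1:
            problems.append(("gap", e1, s2))
        elif s2 < e1:
            problems.append(("overlap", s2, min(e1, e2)))
    return problems

# Derby Road as one way, realigned in an 1800-1820 snapshot view.
two = split_on_contraction(date(1800, 1, 1), FAR_FUTURE,
                           date(1800, 1, 1), date(1820, 1, 1))
three = split_on_contraction(date(1800, 1, 1), FAR_FUTURE,
                             date(1820, 1, 1), date(1909, 1, 1))
issues = gaps_and_overlaps([(date(1800, 1, 1), date(1820, 1, 1)),
                            (date(1825, 1, 1), FAR_FUTURE)])
```

A check of this kind needs some way of knowing which ways represent the same object over time, which is exactly the linkage across time discussed in the list.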
I can see existing features in various OSM editors which handle some of these aspects, but given the different approaches and code-bases I suspect some things are much easier to add to one editor than another. One approach would be to try and decompose the actions further so that there might be some possibility of adding them directly to the underlying object model of the editor: that of iD was nicely described by John Firebaugh in a series of posts last year.

I know chippy thinks there is an element of running before we can walk in these thoughts. I certainly don't have his technical knowledge or, perhaps more pertinently, his knowledge of the Ruby on Rails platform. However, these notes are really aimed at avoiding, at least for now, any kind of fork of the core OSM infrastructure for OHM.

In the meantime I'm going to spend some time with 19th century street directories like this sample:

Extract from 1840 Street Directory of Nottingham
CC-BY-SA-NC from Leicester University Special Collections.
A list of street names is a great place to start the hunt for reconstructing the road layout of the city 175 years ago. Unfortunately I can't use this particular source because of the restrictive licensing, but I will be visiting a local library which has these volumes next week so will get some copies then.

Monday, 14 April 2014

10 years of footpath mapping for OpenStreetMap

On a Saturday in late March I joined Nick Whitlegg, one of the earliest OpenStreetMap contributors, for a session mapping footpaths in the Weald. I was introduced to Nick several years ago on the basis that we were both walkers. But this was the first time we'd actually done a walk together, and my first opportunity to see how a real expert on footpath mapping did things.

Nickw with lots of footpath detail needing mapping
near Ockley, Surrey

Friday, 11 April 2014

Getting all knitted up with childcare models : tagging and gender-bias

I've remarked in passing that the tagging of certain things in OpenStreetMap can be highly variable.

This post started because I just wanted to show a few examples from shops which were putatively related to gender of mappers. However, I've also got massively sidetracked into looking at the (notorious) childcare issue, largely because this is becoming the exemplar of gender biases in tagging.

My shop examples (way down below the fold) were chosen because they were ones which I would have expected to show gender biases in tagging:
  • Likelihood of the object being mapped
  • Precision of the tagging if the object is mapped
I've done this because we have known for a long time that OSM contributors are predominantly male, and, furthermore, have strong technical backgrounds (see Yu-Wei Lin's first paper on the subject, based on interviews in 2010).

The childcare tag controversy

For the last year or two a meme about OSM preferring to map strip-clubs to childcare facilities has been used to highlight this discrepancy.

A Day Nursery in Rise Park, Nottingham
Source: the author via OpenStreetView
Whilst the latter is excellent for polemics, (and, indeed, initially fitted my own preconceptions both about mappers and wiki contributors), it's not really supported by actual mapping: approximately 125,000 facilities related to childcare (mainly pre-school establishments), to under 2000 brothels and strip clubs worldwide.