Tuesday, 24 July 2018

Can we identify 'completeness' of OpenStreetMap features from the data?

At the Milan SotM conference Stefan Keller from the Geometalab at HSR (Rapperswil) will talk about recent work of his group on identifying "Areas of Interest" (AoI) from OpenStreetMap data. Stefan has been kind enough to involve me in some discussions about this work as it has progressed, but in this post I am solely concerned with a separate issue arising from the use of points of interest in this work.

Growth of shops mapped on OSM for selected Local Authorities
(See Analysis section below for commentary)

Areas of Interest were introduced on Google Maps back in 2016. Loosely they correspond to shopping, entertainment and cultural areas with large clusters of relevant points of interest. No doubt Google not only used map features, but also other sources of data such as location of Android phones to calculate the footprints for Areas of Interest (shown in a pale orange or salmon colour on Google Maps).

There are issues with the Google implementation, some discussed in this CityLab article from 2016. My own examination of Google Maps confirms that shopping areas which are otherwise equivalent in range and type of shops are chosen as AoI in wealthy areas, but not in poorer areas dominated by social housing. I also found some places, notably the UBS IT centre in Altstetten, Zurich, which have erroneously been identified as AoI by Google. The work of Geometalab is therefore interesting not just in terms of whether OSM data can be used to calculate similar areas, but also to provide suitable data where biases based on socioeconomic status can, at least, be identified and corrected because data and code are open.

Zurich, centre and Aussersihl districts, showing Areas of Interest.
Work of Geometalab, derived from OpenStreetMap data.
The starting point for this type of work relies on areas where POI mapping density is high and reasonably complete (for instance, the areas of Switzerland which Stefan's group have looked at, and areas of the English East Midlands and Germany which I have looked at both recently and in the past). Given that it is possible to calculate reasonable AoIs from OSM data where POI density is high, the question arises: "Can we identify which areas are 'reasonably' complete?". Normally, this type of work has involved comparing OSM data to some external reference data which are assumed for the purposes of comparison to be complete (for instance Peter Reed's work on UK retail). However, in many parts of the world, and for many topic domains, there is no readily usable data for this purpose. So the ancillary clause for the question is ", and can we do this with OSM data alone?"

This post is a first look at the problem for one class of POIs:  shops.

Species Accumulation Curves

My starting point comes from familiarity with something called a Species Accumulation Curve. I believe that there are strong points of commonality between how OSM data is accrued and these curves.

For many groups of plants, animals, and other biota, it is nigh on impossible to find, in a single survey, all the different species which grow or live in a particular area. Numerous factors influence this:
  • Surveyors' skills. Not every surveyor has the same skill set, training, or even just visual acuity. One of the best naturalists I know is a care worker, who can trump national and international scientific authorities by finding more species than they can in the field.
  • Seasonality. Plants flower at different times, birds migrate, some insects are on the wing for a short time.
  • Weather. The hot dry weather in Britain has greatly reduced the number of flowers I have seen in the past few weeks, and consequently their insect visitors. On Sunday I was heartened to lead a field meeting where we found 44 species in our target group; but 10 years ago in the same location & at the same time of year we found nigh on 30 more.
  • Predator-prey relationships. Numbers of many species go in cycles (for instance Lemming years), and for some insects at least the difference in population density between the troughs and the peaks has been estimated to be of the order of 10^12. Ideally one surveys through 2-3 full cycles: problematic if they are 17-year cicadas, or bamboos which flower and die on a 70-year cycle.
  • Increasing knowledge. Sharing of techniques for searching or recognising different plants and animals can have an amazing influence on total numbers of species found. This is true even in Britain for as well studied a group as the higher plants. The BSBI's Atlas 2020 project, which will be completed in 3 years' time, will not only show changes in plant distribution brought about by agricultural intensification, increased urbanisation and climate change, but also changes from looking more closely for a wider range of plants (notably urban weeds and garden escapes).
  • Sheer cussedness. Fungi are particularly awkward customers. Most spend their time invisibly underground, only showing fleetingly as fruiting bodies (mushrooms) when they feel the time and weather are right. Even with the most capable surveyors in the world the full extent of species complexity can only be appreciated by continual regular surveying of the same place. There are two locations in England which demonstrate this point. Esher Commons have been regularly surveyed for fungi for many years by scientists from Kew Gardens, including global authorities on some fungal groups. No other place in the world is known to have as many fungi, and around 20% of all fungi known in the British Isles have been found at Esher. Slapton Ley in Devon has also received decades of regular surveying effort for fungi, and has over 2000 known fungi species. It may come second to Esher for known fungal diversity!
  • Recorder effects. Even for professional scientists it is often difficult to maintain a constant recording effort. Most biological data is gathered by citizen scientists who can only devote what leisure time they can spare for the activity. Recorders tend to be located in larger cities, rather than in potentially highly species diverse remote areas. For many species groups only a few people are seriously dedicated (as in my own interest in Plant Galls).
The key advantage of species accumulation curves is that, whilst not impervious to these effects, they are a relatively robust measure. For my fungi-loving friends they are a useful tool to work out when to move on from one area to another. At the scientific level the curves are well studied and there is a good framework of statistical techniques for analysing them.
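
For readers who have not met one, a species accumulation curve is simply the running total of distinct species found after each successive survey. Here is a minimal Python sketch; the function name accumulation_curve and the survey data are made up for illustration and are not taken from any real recording scheme.

```python
from typing import Iterable, List, Set


def accumulation_curve(surveys: Iterable[Iterable[str]]) -> List[int]:
    """Return the cumulative number of distinct species after each survey."""
    seen: Set[str] = set()
    curve: List[int] = []
    for survey in surveys:
        seen.update(survey)      # repeat sightings add nothing new
        curve.append(len(seen))  # running total of distinct species
    return curve


# Hypothetical survey results, in date order.
surveys = [
    {"oak apple gall", "robin's pincushion"},
    {"oak apple gall", "knopper gall"},
    {"knopper gall", "robin's pincushion"},
    {"spangle gall"},
]
print(accumulation_curve(surveys))  # [2, 3, 3, 4]
```

The curve flattens when repeat surveys stop turning up anything new, which is the behaviour one hopes to see for a well-surveyed site.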

Differences between OpenStreetMap & Biological Record Data

Data collection for biological recording differs from that for OSM in one particularly important aspect. For biological records all observations in each survey count. In OSM repeat observations of the same POI (pub, shop, restaurant) are never collected. This means we can make no use of properties of each individual survey activity which contributed POIs. Also there is absolutely no equivalent of an OSM import for biological records. I've focused on Great Britain, so the latter has little impact on the results I present below.

We can still look at two types of accumulation:
  • of individual shops;
  • of shop tags.
Note that empty shops are meaningful data in the first grouping.

Null Hypothesis

Typical small shops which get mapped piecemeal on OSM:
near Conde de Casal Metro station
Av. Mediterraneo, Madrid, Nov. 2016

My null hypothesis is that over time we should see the number of shops on OSM for a given area tailing off towards an asymptote once the area is well-mapped. I know that surveys I did in the Spring of 2013 changed the percentage of shops mapped in Nottingham from around 40% towards 90%. Even earlier Jean-Louis Zimmerman and Tony Emery had mapped Orange in great detail and a map of the town was published. I therefore took these two towns and a few others to see if this was plausible. Data were gathered point by point using Overpass-turbo, and plotted in LibreOffice.

The two towns (Nottingham & Orange) where I had good reason to believe that shops were reasonably complete some years ago formed a baseline and did appear to show curves with asymptotic properties.
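
One crude way to put a number on "tailing off" is to compare recent growth with growth over the whole period. The sketch below is not what I did for this post (I simply eyeballed the plots); the function recent_growth_ratio, its window parameter and the example series are all hypothetical.

```python
from typing import Sequence


def recent_growth_ratio(counts: Sequence[int], window: int = 12) -> float:
    """Mean growth per step over the last `window` steps, divided by the
    mean growth per step over the whole series.

    Values near 1 suggest steady (linear) growth; values near 0 suggest
    the series is levelling off.
    """
    if window < 1 or len(counts) <= window:
        raise ValueError("need more observations than the window length")
    overall = (counts[-1] - counts[0]) / (len(counts) - 1)
    if overall == 0:
        raise ValueError("series shows no growth at all")
    recent = (counts[-1] - counts[-1 - window]) / window
    return recent / overall


# Made-up series: one growing steadily, one levelling off.
steady = list(range(0, 100, 10))
levelling = [0, 50, 80, 95, 99, 100, 100, 100]
print(recent_growth_ratio(steady, window=3))
print(recent_growth_ratio(levelling, window=3))
```

A low ratio on its own says nothing about why growth has slowed, so it cannot distinguish a complete map from a mapper who has stopped.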

The remaining places I chose on the basis that I knew that they are well-mapped, but without knowing if retail properties had been mapped to completion. I thought that it was plausible that this would prove to be true for Zurich & Karlsruhe, certainly not for Madrid, and likely not for Dakar. San Francisco was chosen as a well-mapped location in North America. In practice none of the graphs for number of shops over time suggests that effort to map shops has reached an inflexion. Even for somewhere like Karlsruhe, which has had active mappers for as long as anywhere, the graph suggests that there is still scope for mapping shops.

Gathering data point-by-point is fine for a quick test, but far too tedious (& expensive in use of free resources), so my next step was to wrestle with OSM History files.

Extracting OSM History data

I already had an OSM History file for Great Britain for June 2017 downloaded from Geofabrik. Unfortunately, these files do not appear to have been updated since Geofabrik changed the user metadata available on their public servers. Also, because history files contain user information protected by GDPR, these files are now only available through an OSM sign-on.

Manipulating history files effectively means either using the command line osmium tool or writing programs using the osmium library. This in turn means installing osmium. I therefore did this under Ubuntu 16.04. There is a packaged version of osmium for Ubuntu, but it is ancient, so it is necessary to compile and install the current version 1.8.

Osmium is very much designed for heavy duty sophisticated processing of OSM data. It's not really a toolset for quick-and-dirty ad hoc investigations of the kind I do. I was apprehensive about getting tied up in knots getting the Osmium tool installed, particularly when I read the list of dependencies.

In practice the only problem I had initially was due to not cloning a couple of packages into the right location in my osmium build directories. As I've never used CMake in my life I was certainly intimidated by the simple statement "Please read the CMake documentation and get familiar with the cmake and ccmake tools which have many more options.". However, reassured by Richard Fairhurst & Andy Townsend that it wasn't too difficult to install, I persevered and soon had it installed. One thing I would have found helpful would have been an outline of the directory tree for a build.

I also made a 5-minute attempt to compile Peter Mazdermind's OSM History tool, but this has not been maintained and uses very ancient versions of osmium, so I did not persevere.

The key reason for using the 1.8 version is that it has better support for extracting dependent data. Thus in a two step process it is possible to filter a history file for all elements tagged with shop and then find all their dependent elements. This is well covered in the osmium manual.

For pragmatic reasons I chose the One Per Line (OPL) format, as I could very quickly load this data "as is" into a Postgres database.

Wrangling the Data

As I have done for years I loaded the data exactly as it was stored in the source file, so that I could start from the raw data at any time all within Postgres. In practice I loaded nodes, ways & relations with distinct COPY FROM statements.

I then processed each element type into base tables, transforming the main columns from strings to the proper datatype. Tags involve converting the string into an array separated by commas in the form key, value, key1, value1 .... This in turn converts simply to hstore.
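
I did the tag conversion in SQL, but the idea can be sketched in Python. This assumes the OPL tag field looks like Tkey=value,key1=value1 and that awkward characters are escaped as a hex code point between percent signs (for example %20% for a space), which is my understanding of the OPL format. The function names are hypothetical, and a dict stands in for hstore.

```python
import re
from typing import Dict, List

# Assumed OPL escape: '%' + hex code point + '%', e.g. '%20%' for a space.
_ESCAPE = re.compile(r"%([0-9a-fA-F]+)%")


def unescape(text: str) -> str:
    """Replace each assumed OPL escape sequence with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def tags_to_flat_list(field: str) -> List[str]:
    """Turn 'Tk1=v1,k2=v2' into ['k1', 'v1', 'k2', 'v2']."""
    body = field[1:] if field.startswith("T") else field
    flat: List[str] = []
    if not body:
        return flat  # element version without tags
    for pair in body.split(","):
        key, _, value = pair.partition("=")
        flat.extend([unescape(key), unescape(value)])
    return flat


def flat_list_to_dict(flat: List[str]) -> Dict[str, str]:
    """Pair up alternating keys and values (a dict stands in for hstore)."""
    return dict(zip(flat[0::2], flat[1::2]))


flat = tags_to_flat_list("Tshop=bakery,name=Corner%20%Shop")
print(flat)
print(flat_list_to_dict(flat))
```

Splitting on raw commas and equals signs is only safe because, under the assumed escaping, literal commas and equals signs inside keys or values never appear unescaped.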

For each element the next thing was to calculate the end date for each version and add this to the base table. This can be done with a window function, or by joining the base table to itself (a left join). (See my very old post for one way to do this).
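
The logic is the same whichever way it is done: the end date of a version is the start date of the next version of the same element, and the latest version has no end date. A Python sketch of that logic follows (in SQL a window function such as lead() partitioned by element id does the same job); the tuple layout and the name add_end_dates are assumptions for illustration, and deleted versions get no special handling here.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter
from typing import List, Optional, Tuple

Version = Tuple[int, int, datetime]                   # (element id, version, valid_from)
Interval = Tuple[int, int, datetime, Optional[datetime]]  # ... plus valid_to


def add_end_dates(versions: List[Version]) -> List[Interval]:
    """Attach to each version the start date of the next version of the
    same element; the latest version gets None (still current)."""
    out: List[Interval] = []
    ordered = sorted(versions, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        for i, (elem_id, ver, start) in enumerate(rows):
            end = rows[i + 1][2] if i + 1 < len(rows) else None
            out.append((elem_id, ver, start, end))
    return out


# Made-up versions for two elements, deliberately out of order.
versions = [
    (1, 2, datetime(2012, 5, 1)),
    (1, 1, datetime(2010, 1, 1)),
    (2, 1, datetime(2011, 3, 1)),
]
for row in add_end_dates(versions):
    print(row)
```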

The major disadvantage of my pragmatic approach is that one has to reassemble geometries, but before one can do that it is necessary to determine the potential number of distinct geometries for each version of a way or relation element. As others have done before me I ignored relations (very few shops are mapped as relations), and just worked with ways. To do this I first found all distinct start dates for all the nodes versions which contributed to any given way element, which can then be treated as minor versions of each way version. I actually prefer the term geometry version. You can see something very similar if you look at the history of a way in Potlatch2.
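
To make the idea of geometry versions concrete, here is a Python sketch for a single way version. It collects the start of the way version plus every start date of a member node version that falls inside the way version's lifetime. The function and parameter names are hypothetical, and my actual processing was done with SQL joins rather than code like this.

```python
from datetime import datetime
from typing import Dict, List, Optional


def geometry_version_starts(
    way_start: datetime,
    way_end: Optional[datetime],
    node_ids: List[int],
    node_version_starts: Dict[int, List[datetime]],
) -> List[datetime]:
    """Start dates of the geometry versions of one way version."""
    starts = {way_start}
    for node_id in node_ids:
        for ts in node_version_starts.get(node_id, []):
            # Only node changes made while this way version was current
            # create a new geometry version.
            if ts > way_start and (way_end is None or ts < way_end):
                starts.add(ts)
    return sorted(starts)


# Made-up example: one way version with two member nodes.
starts = geometry_version_starts(
    way_start=datetime(2010, 1, 1),
    way_end=datetime(2012, 1, 1),
    node_ids=[1, 2],
    node_version_starts={
        1: [datetime(2009, 6, 1), datetime(2010, 6, 1)],
        2: [datetime(2009, 7, 1), datetime(2011, 3, 1), datetime(2013, 1, 1)],
    },
)
print(starts)
```

In this example the way version has three geometry versions, starting when the way version was created and when each of the two nodes was moved during its lifetime.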

Once I had the start and end dates for each geometry version, the linestrings for the way can be assembled by joining the way_node_history, way_geom_history and node_history tables. All ways should produce valid linestrings. By storing the linestrings it is possible to perform multiple checks so that the code only attempts to assemble valid polygons: st_npoints(geom) > 3 and st_isclosed(geom) worked for me.
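
The polygon check is easy to express outside PostGIS too. A minimal Python equivalent of the two conditions, with a hypothetical function name and coordinates held as plain tuples:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def can_be_polygon(coords: List[Point]) -> bool:
    """Mirror the two checks: more than 3 points, and first point equals last."""
    return len(coords) > 3 and coords[0] == coords[-1]


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
open_line = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
out_and_back = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

print(can_be_polygon(square))        # True
print(can_be_polygon(open_line))     # False
print(can_be_polygon(out_and_back))  # False
```

A closed ring needs at least four points because the first point is repeated at the end, so a triangle is stored as four points.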

The shop data is relatively small: under 70k ways, totalling around 125k versions, which expands to 180k geometry versions. For nodes, of course, the number of versions is the total: 100k elements, 200k versions. Given the data goes back to 2007, the increase in data volume to handle history is very modest.

The last thing I did was calculate a centroid for all the data (this can include ways which do not form polygons). All analysis used the shop centroids.

It's worth noting that another paper in the SotM academic track by Alexander Zipf's group at Heidelberg may presage much easier analysis by anyone of OSM historical data without the need for this kind of data manipulation.


Shops by Local Authority

My starting point was to look at how many shops have been mapped within each local authority across Great Britain. This enables looking at a much more representative sample than the few cities I looked at earlier. There is a disadvantage in that local authorities do not correspond to cities and therefore may not make natural mapping units.
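
The counting itself is straightforward once each shop has a start date and, if it was later deleted or lost its shop tag, an end date. A Python sketch of counting the shops present at the start of each month follows; the names and the sample lifespans are made up, and the real work was done in Postgres after assigning each shop centroid to a local authority.

```python
from datetime import date
from typing import List, Optional, Tuple

Lifespan = Tuple[date, Optional[date]]  # (first mapped, removed or None)


def shops_on(day: date, lifespans: List[Lifespan]) -> int:
    """Count shops which existed on the given day."""
    return sum(
        1 for start, end in lifespans
        if start <= day and (end is None or day < end)
    )


def month_starts(first: date, last: date) -> List[date]:
    """First day of every month from `first` to `last` inclusive."""
    out: List[date] = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        out.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return out


# Made-up lifespans for three shops in one authority.
lifespans = [
    (date(2013, 3, 15), None),
    (date(2013, 4, 2), date(2013, 6, 10)),
    (date(2013, 5, 20), None),
]
months = month_starts(date(2013, 3, 1), date(2013, 7, 1))
print([shops_on(m, lifespans) for m in months])  # [0, 1, 2, 3, 2]
```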

A first couple of quick plots show that there is a huge diversity in numbers of shops mapped, when they are mapped, intensity of mapping activity, and so on. When all are plotted together it's difficult to pick out any other trends:

Progress in shop mapping for all Local Authorities in Britain
(the top lines with over 2000 shops are: Birmingham, Bristol, Edinburgh, Leeds, Nottingham and London Borough of Westminster)

Percentage of shops mapped at June 2017 in prior months,
all Local Authorities Great Britain.
If we just look at the raw number of shops mapped, the accretion curve is more or less a straight line, with no sign of tapering off:

Just looking at the top local authorities can highlight a few other features:
Local Authorities with more than 2000 shops mapped mid-2017

Most of these places have seen a fairly steady increase in total shops mapped, but there are a few step changes:
  • Birmingham, 2015: Mattijs Melissen (Math1985) was a very active shop mapper at this time. The subsequent tailing off may merely be because he returned to The Netherlands on completing his post doc.
  • Edinburgh, 2014. The Edinburgh MESH (social history) project were actively mapping the inner city during this period.
  • Nottingham, 2013. My own deliberate attempt to map most shops from around March to June 2013.
Clearly spurts of activity such as these are fairly typical of many areas. The extreme is Darlington where virtually all shops on OSM were mapped over at most a couple of months. Pulses of activity may therefore result in curves which are apparently asymptotic, but these only reflect individual mapper activity. I have not included counts of mappers touching elements tagged with shop, but this suggests some such metric may need to be used to avoid false positives from dedicated mapping of shops by single individuals.
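
As an indication of the kind of metric I have in mind (I have not calculated anything like this for the post), one could look at how many distinct mappers created the shops in an area and what share came from the single most active one. The function name and the sample data below are purely hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def mapper_concentration(creators: Iterable[str]) -> Tuple[int, float]:
    """Return (number of distinct mappers, share of shops created by the
    most active mapper) for one area."""
    counts = Counter(creators)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no shops supplied")
    top_share = counts.most_common(1)[0][1] / total
    return len(counts), top_share


# Made-up area where one mapper added 9 of the 10 shops.
creators = ["mapper_a"] * 9 + ["mapper_b"]
print(mapper_concentration(creators))  # (2, 0.9)
```

An area with an apparently asymptotic curve but a very high top share would be a candidate false positive.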

A much easier way to look at the data is by looking at graphs side by side. I selected all LAs with over 800 shops mapped, which gives a convenient set of 45 different ones. The graphs are at the head of the blog. Out of these 45, only 2 suggest there might be an asymptotic relationship in the data: Nottingham and Tendring. There are plenty more examples of many shops mapped over short periods (e.g., Gateshead, Sefton).

Shops by Tag

We can also look at whether there is any indication that we have mapped a given category of shop to exhaustion. Here are accumulation curves for the top 35 tag values (those with over 1000 elements mapped):

Only shop=supermarket and shop=doityourself have any appearance of slowing down, and it would be a long stretch to say they were trending towards a given number.

In pretty much all cases the accumulation curves are linear. Thus individual shop tags are much less vulnerable to individual mapper activity. The one step function is shop=bookmaker where Math1985 initiated an effort to reduce the number of synonyms. Slightly worrying is the steady increase in shop=yes.

Lastly we can look at the total number of shop tags:

This looks more like the kind of graph I had been hoping to see. The drop around early 2015 was again, no doubt, due to Math1985's rationalisation efforts. It perhaps suggests no more than 2000 tags are needed to map shops in the UK.

In previous work I showed that there is a long tail of very low usage shop tags, and that the presence of these tags is usually in the noise (I was able to assign 98-99% of all shop tags to specific general categories). It struck me that removing this noise may provide a more informative graph. I therefore excluded any shop tag which had been used 5 times or fewer in June 2017:

Finally, I have the kind of graph I predicted. It actually looks as though 500 shop tags pretty much meets all our tagging needs.
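
The filtering step can be sketched in Python as follows: first find the tag values used more than 5 times in the final snapshot, then count how many of those values are in use in each earlier period. The function names, the period labels and the sample values are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def common_tags(final_values: List[str], min_uses: int = 6) -> Set[str]:
    """Tag values used at least `min_uses` times in the final snapshot
    (6 corresponds to dropping tags used 5 times or fewer)."""
    counts = Counter(final_values)
    return {tag for tag, n in counts.items() if n >= min_uses}


def distinct_tags_per_period(
    values_by_period: Dict[str, List[str]], keep: Set[str]
) -> Dict[str, int]:
    """Number of distinct retained tag values in use in each period."""
    return {
        period: len(set(values) & keep)
        for period, values in values_by_period.items()
    }


# Made-up snapshot: two well-used values and one rarity.
final = ["bakery"] * 6 + ["butcher"] * 7 + ["oddity"] * 2
keep = common_tags(final)
by_period = {"2015-06": ["bakery", "oddity"], "2017-06": final}
print(distinct_tags_per_period(by_period, keep))  # {'2015-06': 1, '2017-06': 2}
```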

Even without trying to fit curves to the rest of the data, it is clear that even in well-mapped cities, the data from 2017 suggest that there are plenty of shops to map. It may be that with more tightly constrained boundaries we see more curves suggestive of saturation. I'll look at this in the next post.


  1. Very cool research! Could you generate/share some more graphs, for example for Luxembourg City, Amsterdam and The Hague?

  2. I can say that in my city (Kraków, Poland) many things are well mapped - roads (except driveways) are 100% mapped, parks are 100% mapped, waterways and water areas are 100% mapped, railways are 100% mapped, churches are 100% mapped, buildings are in large part mapped, bicycle infrastructure is relatively well mapped (there was a huge burst of oneway:bicycle=no after elimination of parking areas illegally marked by city government that is still not mapped, but cycleways are 100% mapped).

    But shops? On every street there are several unmapped. I would expect that shops, restaurants, cafes and pubs are at most maybe 20% mapped.

    There may be an asymptotic curve in the data but it would reflect editor fatigue rather than anything else.