Sunday, 25 October 2015

Urban Areas: a meditation on why simple global geographical datasets are so poor

Puerto Vallarta, an aerial view of an urban area missing many roads on OpenStreetMap.
The area in the middle distance away from the sea was particularly lacking.
Fortunately the centre of the hurricane didn't pass over this area.
Source: Wikimedia Commons, (c) CC-BY-SA

The other night, as Hurricane 'Patricia' bore down on the Pacific coast of Mexico, I had a Twitter conversation with Bill Morris and others about how well mapped Puerto Vallarta was on OpenStreetMap. (BTW: I'm sure it's much better mapped now).
Of course, OSM is about fixing things, so I carried out the conversation in between adding around a hundred streets to the city. However the really interesting question was this one:

Whilst at breakfast I thought a little more about this. I decided it ought to be possible to do something fairly simple with data which already exists.

My basic idea was as follows: create a set of urban area polygons (from the Natural Earth data) for cities over a certain size (from Geonames/GNS data), and use these polygons to count total road length and other parameters of OSM data. I would then plot these values against population for various areas (continents, countries), choose well-mapped places and build a regression model to predict poorly mapped places.
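A minimal sketch of the regression step, assuming road length and population have already been collected per city. The function names, the 0.5 threshold and all the figures are invented for illustration; it only shows how a line fitted to well-mapped places could flag places whose road length falls far below the prediction.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def flag_undermapped(cities, a, b, threshold=0.5):
    """Names of cities whose observed road length is below threshold * predicted.

    The threshold is an arbitrary assumption, not a value from the post.
    """
    flagged = []
    for name, population, road_m in cities:
        predicted = a + b * population
        if predicted > 0 and road_m < threshold * predicted:
            flagged.append(name)
    return flagged


# Made-up (population, road length in m) pairs standing in for well-mapped places.
well_mapped = [(100_000, 400_000.0), (200_000, 800_000.0), (300_000, 1_200_000.0)]
a, b = fit_line([p for p, _ in well_mapped], [r for _, r in well_mapped])

# Made-up candidates: (name, population, observed road length in m).
candidates = [("City A", 150_000, 580_000.0), ("City B", 250_000, 200_000.0)]
print(flag_undermapped(candidates, a, b))  # ['City B']
```

A single linear fit is the simplest possible choice here; a real model would probably need separate fits per country or continent, as suggested above.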

Of course I thought the long-winded bit would be extracting OSM data for several thousand cities and then processing it to get the various metrics. Unfortunately I haven't got that far!

As with any sensible analytical approach it's always worthwhile visualising the data: mainly because this is the easiest way to do some basic sense checks. Sadly, both my datasets failed this initial step.

Natural Earth Urban Areas

Even before I loaded the city data I was suspicious of the utility of the Natural Earth data. I looked in two sparsely populated areas which are familiar to me: Patagonia & Extremadura. In both cases there were urban areas which looked spurious at first glance: NW of Rio Gallegos, and somewhere close to the Sierra de Gata.

Spurious Urban Areas (outlined in orange) in Patagonia from Natural Earth
Imagery via Bing (most likely Landsat).
Rio Gallegos (population 100,000) on the same scale as above.
The urban area is clearly visible, but not in Natural Earth
Ushuaia (population 50,000) on the same scale as above.
The urban area is clearly visible.
Punta Arenas (population 127,000) on the same scale as above.
The urban area is clearly visible.
El Calafate (population 22,000) at a much larger scale.
For once some correspondence of urban areas.
Furthermore, many of the larger cities in these areas had no urban areas in the Natural Earth data: in Patagonia these include Rio Gallegos, Ushuaia and Punta Arenas; in Extremadura, Cáceres is missing.

Cáceres (population 90,000), Extremadura.
The urban area is clearly visible.
Coria (pop 12,000) a small town in Extremadura
which has an urban polygon in Natural Earth.
This is all without looking at the data in more populated parts of the world. So I had a look at Western Europe. Firstly, I looked at my main mapping area, Nottinghamshire. Two problems were immediately apparent: many urban areas are shown as one contiguous strip; minor towns were appearing with substantial urban areas.
Natural Earth Urban Areas: showing the 'Northern Powerhouse' area.
Urban areas for 7 major cities are run together in the centre.
Urban area polygon for Gainsborough (pop: 20,000).
This is very rural agricultural land as can be seen by my Mapillary traces!
In practice false positives don't matter too much as they can be filtered out by only selecting urban areas for larger cities. Contiguous areas can be separated by building a Voronoi tessellation around the city points and using it to split the polygons (as suggested on the Natural Earth site, where the cells are called Thiessen polygons), but more on this below.
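A point belongs to the Thiessen (Voronoi) polygon of whichever city seed is nearest, so the split can be sketched without building any polygons at all. This is a minimal illustration under that definition, not a real polygon-cutting routine; the coordinates are approximate and the function names are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def nearest_city(lat, lon, cities):
    """Name of the city seed whose Voronoi cell contains (lat, lon)."""
    return min(cities, key=lambda c: haversine_m(lat, lon, c[1], c[2]))[0]


# Approximate (name, lat, lon) seeds, for illustration only.
cities = [
    ("Nottingham", 52.95, -1.15),
    ("Derby", 52.92, -1.48),
    ("Leicester", 52.63, -1.13),
]

# A point roughly between Nottingham and Derby.
print(nearest_city(52.90, -1.27, cities))  # Nottingham

# A point on the edge of Derby, with and without a Derby seed in the list.
print(nearest_city(52.92, -1.45, cities))  # Derby
without_derby = [c for c in cities if c[0] != "Derby"]
print(nearest_city(52.92, -1.45, without_derby))  # Nottingham
```

The last call shows why the seed list matters: if a city is missing from the point data, its area is silently handed to a neighbour.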

The key point is that I want a polygon set which is reliable for places I don't know in Central America, Brazil, Indonesia, Africa, India and China.  A cursory inspection shows that it is unreliable in places I do know.

Straightforward conclusions about this dataset are:
  • It is not usable for analytical applications
  • It has major errors for cartographical applications
  • Relying on Remote Sensing alone doesn't always give good data
  • No-one sense checked the data
  • Not all Natural Earth data has been curated and checked for fitness for purpose.
I still need a decent set of urban area polygons!

Geonames (15k+ populated places)

The other dataset I needed was a geolocated list of cities, of say over 100,000.

Wikipedia has such a list, but the idea was to do something quick, not learn how to scrape Wikipedia.

There seemed to be two readily available datasets: from the UN Statistical Division and from Geonames. The UN data lacks lat/lon so that was out. I've used the Geonames data of cities with populations over 15,000 before: it certainly has the types of data I wanted.

Once again I did a quick visualisation, and just as before problems were immediately apparent. Take a look at the image of Northern England. Firstly, note the node labelled High Peak: this is not a city or town, it's an administrative district. Secondly, look for Derby (pop: 250,000) & Grimsby (pop: 90,000): they're not there, are they? Scotland is affected too: no Perth, Dundee or Inverness. I didn't bother looking further.

For my purpose it's important that the population data are also reasonably accurate.

My first check, Nottingham, is shown as having a population of 246,000. The last time the city had such a low population was around 1900 (see the Demographics section here): current estimated population is 314,000 for the city and 730,000 for the urban area. The Geonames figure is therefore more than 20% below the city estimate, and only about a third of the urban area figure, which is probably too much for the purpose.

So I need a better list of cities, with more useful population figures too!

Combining the two

On a worldwide basis around 4000 places with populations over 100,000 are not covered by the Natural Earth urban polygons. Mindanao in the Philippines seems particularly badly affected:

Places over 100,000 from GNS data without corresponding urban area from Natural Earth
Again not good.

Some of the missing places are actually a reflection of the generalisation of the NE data: this mainly shows up along coastlines. In these cases the centroid of the city is not just outside the urban area of NE data, but in the sea!
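The coverage check itself is a point-in-polygon test. The following is a minimal sketch using the standard ray-casting method on simple rings (no holes, no multipolygons); the square "urban area" and the two places are toy data, and `uncovered_places` is a hypothetical helper name.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def uncovered_places(places, polygons):
    """Names of places whose point falls inside none of the polygons."""
    return [
        name
        for name, lon, lat in places
        if not any(point_in_polygon(lon, lat, ring) for ring in polygons)
    ]


# Toy data: one generalised urban polygon and two city points.
urban = [[(0, 0), (4, 0), (4, 3), (0, 3)]]
places = [("Inland", 2.0, 1.5), ("Coastal", 4.5, 1.0)]
print(uncovered_places(places, urban))  # ['Coastal']
```

A point just outside a generalised coastline, like "Coastal" here, is reported as uncovered even though the city obviously exists, so a real check would probably want a distance tolerance.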

And of course losing places like Derby also makes the idea of cutting the urban areas up using Voronoi polygons suspect:

Urban Areas for GNS populated places over 50000
cut using Voronoi polygons
Obviously Nottingham now includes most of Derby and has a population value which is too small. This is a heavily mapped area so I'd predict this as a significant outlier. Doncaster also seems to be missing, and Rotherham is mopping up many smaller places.

At least this is an area I can look at statistically. I just did this for a single metric (non-motorway road length in metres) using OSM data from May 2015. I counted road length within each polygon and tweaked the population figure by adding in the population of all city nodes within the polygon. I plotted the results using R.
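A minimal sketch of the road-length metric, assuming the ways have already been clipped to one polygon. The dictionary layout is hypothetical, and treating `motorway_link` as a motorway is an assumption rather than something stated above; lengths use a spherical haversine distance.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def way_length_m(coords):
    """Length of a polyline given as a list of (lat, lon) pairs."""
    return sum(
        haversine_m(lat1, lon1, lat2, lon2)
        for (lat1, lon1), (lat2, lon2) in zip(coords, coords[1:])
    )


def total_road_length_m(ways, excluded=("motorway", "motorway_link")):
    """Sum the length of all ways whose highway value is not excluded."""
    return sum(way_length_m(w["coords"]) for w in ways if w["highway"] not in excluded)


# Toy ways: one degree of latitude of residential road, plus a motorway to skip.
ways = [
    {"highway": "residential", "coords": [(0.0, 0.0), (1.0, 0.0)]},
    {"highway": "motorway", "coords": [(0.0, 0.0), (0.0, 1.0)]},
]
total = total_road_length_m(ways)
print(round(total))  # about 111195 m, one degree of latitude
```

The population side of the calculation is then just a sum over the city nodes that fall inside the same polygon.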

Plot of road length (in m) on Y axis (V3) against Population
for most urban areas in Great Britain
This uncovered another problem with the data: there were nodes for both London and City of London, each assigned the population of the Greater London area. As the data also included most significant London suburbs, these two records caused lots of problems. Fortunately their Voronoi polygons were small, and just removing them allowed a more sensible plot (see above), and even a decent enough regression line.
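One cheap way to catch this kind of artefact before plotting is to look for large places that share exactly the same population figure. This is only a sketch: the function name and cut-off are hypothetical, and the population values below are placeholders, not figures from the Geonames file.

```python
from collections import defaultdict


def suspicious_duplicates(places, min_population=1_000_000):
    """Group place names by population and keep large values that occur more than once."""
    by_population = defaultdict(list)
    for name, population in places:
        by_population[population].append(name)
    return {
        population: names
        for population, names in by_population.items()
        if population >= min_population and len(names) > 1
    }


# Placeholder figures for illustration only.
places = [
    ("London", 7_000_000),
    ("City of London", 7_000_000),
    ("Croydon", 300_000),
    ("Nottingham", 246_000),
]
print(suspicious_duplicates(places))  # {7000000: ['London', 'City of London']}
```

Exact equality is a blunt test, but two large places with identical populations are unlikely to be a coincidence and are worth a manual look.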

Plot as above Road Length on OSM (V3) vs Population (V4)
with added regression line.
I spent too long fiddling with the data to have time to spend tarting up my R plots. Apologies.

Urban Areas in OpenStreetMap

So we could do with better Urban Area data: why not OSM?

Analytical uses did not figure over much at the outset of OSM. I suspect I was relatively unusual in seeing OSM as a potential way to capture data for analytical uses. As I have tried to show for other data there are no good technical reasons why OSM data can't be used for analytics: it is absence, patchiness and inconsistency of data which are much more problematic.

Whilst not of any consolation for people who want the data now, it's not something I worry about over much. A few years ago OSM wasn't much use for car routing: more mappers and a greater variety of source data fixed it.

I think it's a fairly natural progression for mappers to move on to capturing other features once the road network is in place: in the main these are the things of most use for analytical applications. Furthermore, in places where mappers are focused on maintaining data, consistency becomes more desirable for the mapper: it really helps spot change. Mappers are also often motivated by seeing the impact of their activities: thematic maps can nudge contributors to add things outside their own direct interests. Robert Whittaker's Post Hoc site is a good example.

There is another problem, which I find more intriguing. Urban areas exemplify this perfectly. We tend to map things at a fine granular level; and the grains get smaller as detail gets added. A city mapped in a lot of detail will have well over ten different types of landuse/landcover: residential, commercial, parks, recreation grounds, pitches, hospitals, education, railways, brownfield, construction, retail, allotments, industrial, cemeteries & so on. There is no single tag to identify a built-up urban area.

Candidate Urban Areas from OpenStreetMap cf. NE polygons
OSM data includes a wide range of polygon types.
Note industrial area in North Sea, and Chatsworth Park in Peak District: good examples of potential spurious data
Source: (c) OpenStreetMap contributors & Natural Earth
It also turns out that we can't rely on some concatenation of all the different landuses: there are numerous exceptions for many of the categories which are not urban. Examples include industrial areas in the countryside (nuclear power stations, water works), formal parks associated with country estates, sports clubs and so on. However, at least for Great Britain, it does provide a reasonably convincing candidate dataset.

We can try a couple of fudges: first merge polygons matching those criteria which touch or overlap. Merged areas can then be tidied up by enlarging them with a positive buffer value and shrinking them again. This will probably fill any holes left by unmapped landuse, or by categories not selected initially. I had a quick go at this for Derby. The results compared with the NE data, my source data from OSM and urban areas from the Ordnance Survey Meridian 2 Open Data are shown below.

Comparison between non-OSM sources for Urban Area around Derby
and candidate polygons from OSM (orange: original; blue stipple: merged with some gaps filled using buffering).
Sources: OpenStreetMap; Ordnance Survey Meridian 2 (Open Data); Natural Earth.
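Buffering outwards and then inwards is a morphological closing. The sketch below shows the same idea on a grid of cells rather than on vector polygons, purely to illustrate why it fills small holes and bridges narrow gaps; it is a raster analogue, not the buffering used for the Derby figure, and the function names and radius are assumptions.

```python
def neighbourhood(x, y, radius):
    """All cells within a square window of the given radius around (x, y)."""
    return [
        (x + dx, y + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
    ]


def dilate(cells, radius=1):
    """Grow the set of cells outwards (the positive buffer)."""
    return {n for (x, y) in cells for n in neighbourhood(x, y, radius)}


def erode(cells, radius=1):
    """Shrink the set of cells inwards (the negative buffer)."""
    return {
        (x, y)
        for (x, y) in cells
        if all(n in cells for n in neighbourhood(x, y, radius))
    }


def close_gaps(cells, radius=1):
    """Morphological closing: dilate, then erode by the same radius."""
    return erode(dilate(cells, radius), radius)


# A 3x3 block of landuse cells with the centre cell unmapped.
block = {(x, y) for x in range(3) for y in range(3)}
with_hole = block - {(1, 1)}
print(close_gaps(with_hole) == block)  # True: the hole is filled

# Two cells separated by a one-cell gap are joined into one strip.
print(sorted(close_gaps({(0, 0), (2, 0)})))  # [(0, 0), (1, 0), (2, 0)]
```

The radius plays the role of the buffer distance: gaps wider than twice the radius survive, which is what stops separate towns from being fused.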

Conclusions & Next Steps

Doing simple geo-experiments with existing worldwide datasets is not really possible. Most available data has not really been created for any kind of analytical purpose. Furthermore, the datasets often require substantial cleaning up to remove obvious artefacts (erroneous data, duplicates etc). Many really need to be supplemented from additional sources of data, and in many cases OpenStreetMap is a really convenient way to do so (more on this at some later time).

I'm actually going to have a go at trying to improve the datasets referenced here, and shortly will create projects on GitHub to do so. It turns out that decent Urban Area data would be useful for all sorts of things.

However, I'll first return to the issue of deriving built-up areas from OSM: it's something I've thought about from time-to-time in the past, and the Derby experiments look better than I expected. I'll also have to give more thought as to how we can identify less well mapped cities given that my first idea won't work.

1 comment:

  1. Mapbox did an experiment using the size of the sat pic tile as a proxy of information to be mapped. The idea being that complex places are probably built-up places. Then they compared that to node density to find undermapped places, and tile views to identify priority places. I don't know if this is available somewhere.

    Another quick and dirty approach could be one of the earth by night pictures. If there's light and no roads, that warrants closer inspection. The biggest problem probably being resolution.