The other night, as Hurricane 'Patricia' bore down on the Pacific coast of Mexico, I had a Twitter conversation with Bill Morris and others about how well mapped Puerto Vallarta was on OpenStreetMap. (BTW: I'm sure it's much better mapped now).
Latin America is a weak spot for @openstreetmap. A town in the #Patricia track: https://t.co/pk4UYHHan3 pic.twitter.com/3t7ZQkWkWz
— Bill Morris (@vtcraghead) October 23, 2015
Of course, OSM is about fixing things, so I carried out the conversation in between adding around a hundred streets to the city. However the really interesting question was this one:
@vtcraghead @hotosm @openstreetmap Would there be some sort of algorithm that can be run to detect built up areas from imagery not mapped?
— Antonio Locandro (@antoniolocandro) October 23, 2015
Whilst at breakfast I thought a little more about this. I decided it ought to be possible to do something fairly simple with data which already exists.
My basic idea was as follows: create a set of urban area polygons (from the Natural Earth data) for cities over a certain size (from Geonames/GNS data), and use these polygons to count total road length and other parameters of OSM data. I would then plot these values against population for various areas (continents, countries), choose well-mapped places and build a regression model to predict poorly mapped places.
Of course I thought the long-winded bit would be extracting OSM data for several thousand cities and then processing it to get the various metrics. Unfortunately I haven't got that far!
As with any sensible analytical approach it's always worthwhile visualising the data: mainly because this is the easiest way to do some basic sense checks. Sadly, both my datasets failed this initial step.
Natural Earth Urban Areas
Even before I loaded the city data I was suspicious of the utility of the Natural Earth data. I looked in two sparsely populated areas which are familiar to me: Patagonia & Extremadura. In both cases there were urban areas which looked spurious at first glance: NW of Rio Gallegos, and somewhere close to the Sierra de Gata.
|Spurious Urban Areas (outlined in orange) in Patagonia from Natural Earth
Imagery via Bing (most likely Landsat).
|Rio Gallegos (population 100,000) on the same scale as above.
The urban area is clearly visible, but not in Natural Earth
|Ushuaia (population 50,000) on the same scale as above.
The urban area is clearly visible.
|Punta Arenas (population 127,000) on the same scale as above.
The urban area is clearly visible.
|El Calafate (population 22,000) at a much larger scale.
For once some correspondence of urban areas.
|Cáceres (population 90,000), Extremadura.
The urban area is clearly visible.
|Coria (pop 12,000) a small town in Extremadura
which has an urban polygon in Natural Earth.
|Natural Earth Urban Areas: showing the 'Northern Powerhouse' area.
Urban areas for 7 major cities are run together in the centre.
|Urban area polygon for Gainsborough (pop: 20,000).
This is very rural agricultural land as can be seen by my Mapillary traces!
The key point is that I want a polygon set which is reliable for places I don't know in Central America, Brazil, Indonesia, Africa, India and China. A cursory inspection shows that it is unreliable in places I do know.
Straightforward conclusions about this dataset are:
- It is not usable for analytical applications
- It has major errors for cartographical applications
- Relying on Remote Sensing alone doesn't always give good data
- No-one sense checked the data.
- Not all Natural Earth data has been curated and checked for fitness for purpose.
Geonames (15k+ populated places)
The other dataset I needed was a geolocated list of cities, of say over 100,000.
Wikipedia has such a list, but the idea was to do something quick, not learn how to scrape Wikipedia.
There seemed to be two readily available datasets: from the UN Statistical Division and from Geonames. The UN data lacks lat/lon so that was out. I've used the Geonames data of cities with populations over 15,000 before: it certainly has the types of data I wanted.
Once again I did a quick visualisation, and just as before problems were immediately apparent. Take a look at the image of Northern England. Firstly, note the node labelled High Peak: this is not a city or town, it's an administrative district. Secondly, look for Derby (pop: 250,000) & Grimsby (pop: 90,000): they're not there, are they? Scotland is affected too: no Perth, Dundee or Inverness. I didn't bother looking further.
For my purpose it's important that the population data are also reasonably accurate.
My first check, Nottingham, is shown as having a population of 246,000. The last time the city had such a low population was around 1900 (see the Demographics section here): current estimated population is 314,000 for the city and 730,000 for the urban area. A 33% error is probably too much for the purpose.
So I need a better list of cities, with more useful population figures too!
Combining the two
On a worldwide basis around 4000 places with populations over 100,000 are not covered by the Natural Earth urban polygons. Mindanao in the Philippines seems particularly badly affected:
|Places over 100,000 from GNS data without corresponding urban area from Natural Earth
Some of the missing places are actually a reflection of the generalisation of the NE data: this mainly shows up along coastlines. In these cases the centroid of the city is not just outside the urban area of NE data, but in the sea!
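The check for places without a corresponding urban area boils down to a point-in-polygon test. Here is a minimal sketch of that test using ray casting in plain Python. The function names, the square "urban area" and the two places are all made up for illustration; a real run would read the Natural Earth polygons and the city list from their files and would probably use a GIS library.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray heading east from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def places_without_polygon(places, polygons):
    """Names of places whose point falls inside none of the polygons."""
    return [
        name
        for name, lon, lat in places
        if not any(point_in_polygon(lon, lat, ring) for ring in polygons)
    ]


# Hypothetical data: one generalised square polygon and two invented places.
urban_areas = [[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]]
places = [("Inland Town", 1.0, 1.0), ("Harbour City", 2.5, 1.0)]

missing = places_without_polygon(places, urban_areas)
print(missing)
```

The invented "Harbour City" sits just outside the simplified polygon edge, which is the same situation as a coastal city whose centroid ends up outside the generalised urban area.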
And of course losing places like Derby also makes the idea of cutting the urban areas up using Voronoi polygons suspect:
|Urban Areas for GNS populated places over 50000
cut using Voronoi polygons
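To see why a missing place matters for the Voronoi split: a location belongs to the Voronoi cell of whichever seed point is nearest, so dropping one seed silently hands its territory to the neighbours. The sketch below shows this with three invented cities on a flat plane. The names, coordinates and the `nearest_seed` helper are assumptions for illustration only.

```python
import math


def nearest_seed(point, seeds):
    """Name of the seed closest to `point`, i.e. the Voronoi cell it falls in."""
    return min(seeds, key=lambda name: math.dist(point, seeds[name]))


# Hypothetical city nodes in arbitrary planar units.
seeds = {"City A": (0.0, 0.0), "City B": (10.0, 0.0), "City C": (5.0, 8.0)}

# A patch of urban area that really belongs to City C.
sample = (4.0, 6.0)

with_all = nearest_seed(sample, seeds)

# Drop City C from the list, as happens when a place is missing from the source data.
reduced = {name: xy for name, xy in seeds.items() if name != "City C"}
without_c = nearest_seed(sample, reduced)

print(with_all, "->", without_c)
```

With all three seeds the patch is assigned to City C; without it the same patch is credited to City A, along with whatever road length it contains.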
At least this is an area I can look at statistically. I just did this for a single metric (non-motorway road length in metres) using OSM data from May 2015. I counted road length within each polygon and tweaked the population figure by adding in the population of all city nodes within the polygon. I plotted the results using R.
|Plot of road length (in m) on Y axis (V3) against Population
for most urban areas in Great Britain
|Plot as above Road Length on OSM (V3) vs Population (V4)
with added regression line.
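The plots above were made in R. As a rough illustration of the same idea in Python, the sketch below fits an ordinary least-squares line of road length against population for some reference areas and then flags areas whose mapped road length falls well below the prediction. All numbers are invented (the reference areas are constructed to have exactly 5 m of road per person), the straight-line model and the 50% threshold are assumptions, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def flag_undermapped(cities, slope, intercept, threshold=0.5):
    """Names of cities with less than `threshold` times the predicted road length."""
    flagged = []
    for name, population, road_m in cities:
        predicted = slope * population + intercept
        if predicted > 0 and road_m < threshold * predicted:
            flagged.append(name)
    return flagged


# Invented reference areas, assumed to be well mapped.
populations = [50_000, 100_000, 200_000, 400_000]
road_lengths_m = [250_000, 500_000, 1_000_000, 2_000_000]
slope, intercept = fit_line(populations, road_lengths_m)

# Invented areas to test: (name, population, mapped road length in metres).
candidates = [("Town A", 150_000, 700_000), ("Town B", 150_000, 200_000)]
flagged = flag_undermapped(candidates, slope, intercept)
print(flagged)
```

The approach only works if the population figures and the polygons are trustworthy, which is exactly what the checks above call into question.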
Urban Areas in OpenStreetMap
So we could do with better Urban Area data: why not OSM?
Analytical uses did not figure over much at the outset of OSM. I suspect I was relatively unusual in seeing OSM as a potential way to capture data for analytical uses. As I have tried to show for other data there are no good technical reasons why OSM data can't be used for analytics: it is absence, patchiness and inconsistency of data which are much more problematic.
Whilst not of any consolation for people who want the data now, it's not something I worry about over much. A few years ago OSM wasn't much use for car routing: more mappers and a greater variety of source data fixed it.
I think it's a fairly natural progression for mappers to move on to capturing other features once the road network is in place: in the main these are the things of most use for analytical applications. Furthermore, in places where mappers are focused on maintaining data, consistency becomes more desirable for the mapper: it really helps spot change. Mappers are also often motivated by seeing the impact of their activities: thematic maps can nudge contributors to add things outside their own direct interests. Robert Whittaker's Post Hoc site is a good example.
There is another problem, which I find more intriguing. Urban areas exemplify this perfectly. We tend to map things at a fine granular level; and the grains get smaller as detail gets added. A city mapped in a lot of detail will have well over ten different types of landuse/landcover: residential, commercial, parks, recreation grounds, pitches, hospitals, education, railways, brownfield, construction, retail, allotments, industrial, cemeteries & so on. There is no single tag to identify a built-up urban area.
We can try a couple of fudges: first, merge polygons matching those criteria which touch or overlap. Merged areas can then be tidied up by enlarging them with a positive buffer value and shrinking them again by the same amount. This will probably fill any holes left by unmapped landuse, or by categories not selected initially. I had a quick go at this for Derby. The results compared with the NE data, my source data from OSM and urban areas from the Ordnance Survey Meridian 2 Open Data are shown below.
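Enlarging and then shrinking by the same amount is what image processing calls a morphological closing. The sketch below is a raster analogue of the buffer trick rather than the vector operation itself: landuse is represented as a set of grid cells, grown by one cell in every direction and then shrunk again. The grid, the function names and the one-cell radius are assumptions for illustration; on real OSM polygons one would use the buffer operation of a GIS tool instead.

```python
def dilate(cells, radius):
    """Grow a set of (row, col) cells by `radius` cells in every direction."""
    return {
        (r + dr, c + dc)
        for r, c in cells
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
    }


def erode(cells, radius):
    """Keep only cells whose whole square neighbourhood is also in `cells`."""
    return {
        (r, c)
        for r, c in cells
        if all(
            (r + dr, c + dc) in cells
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
        )
    }


def close_gaps(cells, radius=1):
    """Positive buffer followed by an equal negative buffer, on a grid."""
    return erode(dilate(cells, radius), radius)


# Hypothetical merged landuse: a 5 x 5 block with one unmapped cell in the middle.
mapped = {(r, c) for r in range(5) for c in range(5)} - {(2, 2)}

closed = close_gaps(mapped, radius=1)
print((2, 2) in closed, len(closed))
```

The hole in the middle is filled while the outer edge returns to where it started. Holes wider than twice the radius survive, so the buffer value has to be chosen with the size of typical unmapped gaps in mind.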
Conclusions & Next Steps
Doing simple geo-experiments with existing worldwide datasets is not really possible. Most available data has not been created for any kind of analytical purpose. Furthermore, the datasets often require substantial cleaning up to remove obvious artefacts (erroneous data, duplicates etc). Many really need to be supplemented with additional sources of data, and in many cases OpenStreetMap is a really convenient way to provide them (more on this at some later time).
I'm actually going to have a go at trying to improve the datasets referenced here, and shortly will create projects on GitHub to do so. It turns out that decent Urban Area data would be useful for all sorts of things.
However, I'll first return to the issue of deriving built-up areas from OSM: it's something I've thought about from time-to-time in the past, and the Derby experiments look better than I expected. I'll also have to give more thought as to how we can identify less well mapped cities given that my first idea won't work.