Saturday 15 September 2018

A few oddities of OpenStreetMap history files

I've been working with the OSM history data for Great Britain for a while now. This has mainly involved removing bugs in the initial processing (yup, out-by-one errors mainly) and, as a side effect, working out how to improve the speed of extracting way geometries. En route I have noted a few quirks which may be helpful for others working with the data.

Some of these should be obvious, but I feel that they are worth stating nonetheless.

Here they are:
  • Elements can start their existence with a deleted status. Node #1 is the best example of which I'm aware. Here is the OSM Change XML (.osc) from that deletion:
<osmChange version="0.6" generator="CGImap 0.6.1 (1966" copyright="OpenStreetMap and contributors" attribution="" license="">
<node id="1" visible="false" version="1" changeset="9257" timestamp="2006-05-10T18:27:47Z" user="τ12" uid="1298"/>
No idea how this came about.
  • Ways can have nodes which come into existence later than the way itself. Way #180 was created by Deanna Earley in June 2006:

    <way id="180" visible="true" version="1" changeset="34555" timestamp="2006-06-04T00:35:25Z" user="Deanna Earley" uid="2231">
    <nd ref="635109"/>
    <nd ref="15433291"/>
    <nd ref="635110"/>
    <nd ref="15433288"/>
    <nd ref="15433290"/>
    <nd ref="635111"/>
    <tag k="highway" v="residential"/><tag k="name" v="Green Lane"/></way>
    But some of the nodes appear to have been created only later, like this one in September 2006:
    <node id="635109" visible="true" version="1" changeset="109674" timestamp="2006-09-13T00:37:15Z" user="Deanna Earley" uid="2231" lat="50.8889009" lon="-1.3269173"/>
    <node id="635109" visible="true" version="2" changeset="6669766" timestamp="2010-12-15T17:48:55Z" user="0123456789" uid="55782" lat="50.8888938" lon="-1.3270150"/>
    Again I don't know why.
  • The opposite is also true: nodes can be deleted whilst they still remain in the list of way nodes. Here's a road in Thurso, way #4839614:
    <way id="4839614" visible="true" version="1" changeset="122653" timestamp="2007-07-02T11:09:11Z" user="rjmunro" uid="729">
    <nd ref="31140984"/>
    <nd ref="31140986"/>
    <nd ref="31140987"/>
    <nd ref="31140988"/>
    <nd ref="31140991"/>
    <nd ref="31140992"/>
    <nd ref="31140994"/>
    <nd ref="31140996"/>
    <nd ref="31140997"/>
    <nd ref="31140999"/>
    <tag k="created_by" v="Tways 0.2"/>
    <tag k="highway" v="unclassified"/>
    <way id="4839614" visible="true" version="14" changeset="52740714" timestamp="2017-10-08T19:56:51Z" user="woodpeck_repair" uid="145231">
    <nd ref="31140984"/>
    <nd ref="31140986"/>
    <nd ref="31140987"/>
    <nd ref="31140988"/>
    <nd ref="766071173"/>
    <nd ref="31140991"/>
    <nd ref="2344890723"/>
    <nd ref="2347341994"/>
    <nd ref="31140992"/>
    <nd ref="766073124"/>
    <nd ref="31140994"/>
    <nd ref="766071341"/>
    <nd ref="31140996"/>
    <nd ref="538050262"/>
    <nd ref="538049944"/>
    <nd ref="2409031255"/>
    <nd ref="31140997"/>
    <nd ref="538049236"/>
    <tag k="highway" v="unclassified"/>
    <tag k="ref" v="U4135"/>
    <tag k="source:ref" v="official"/>
    <tag k="source_ref:ref" v=""/>
    Node #31140999 was deleted by user Ollie in 2010:

    <node id="31140999" visible="true" version="1" changeset="121208" timestamp="2007-07-01T18:51:46Z" user="rjmunro" uid="729" lat="58.5975437" lon="-3.5137181"><tag k="created_by" v="JOSM"/></node>
    <node id="31140999" visible="false" version="2" changeset="4923978" timestamp="2010-06-06T23:45:04Z" user="Ollie" uid="10785"><tag k="created_by" v="JOSM"/></node>

    In this case it's clear what has happened. All versions of the way from 2 to 13 needed to be redacted, and redaction cannot hope to restore all changes completely.
  • A logical corollary of redactions is that version numbers are not guaranteed to be sequential, although they do always increase monotonically. This is particularly important if you use SQL window functions. I've found it useful to have an element sequence number which is sequential.
  • Element versions can share the same timestamp. OSM XML only stores timestamps on objects to the nearest second. Whilst it must be possible to adjust the position of a node several times in one second, it's harder to see how these can be saved to the API so quickly. One case where this might work is where work is saved, and whilst the data is being uploaded some fault is noted, corrected and immediately saved. Another possibility, which I haven't investigated, lies in the mechanisms by which editors and the API generate timestamps. Effectively I haven't worried about this and have 'thrown away' the earlier elements sharing the timestamp, but it does mean that you can't rely on date order alone to correctly sequence element history.

    Simon Poole asked me for an example, so here is one: node #107298, where versions 2 through 4 all have the same timestamp:

    <node id="107298" visible="true" version="1" changeset="66" timestamp="2005-06-05T18:31:25Z" user="zool" uid="131" lat="51.5496368" lon="0.0051554"/>
    <node id="107298" visible="false" version="2" changeset="415456" timestamp="2007-11-05T22:02:42Z" user="Paul Todd" uid="12503"/>
    <node id="107298" visible="false" version="3" changeset="415456" timestamp="2007-11-05T22:02:42Z" user="Paul Todd" uid="12503"/>
    <node id="107298" visible="true" version="4" changeset="415456" timestamp="2007-11-05T22:02:42Z" user="Paul Todd" uid="12503" lat="51.5496390" lon="0.0051490"><tag k="created_by" v="Potlatch 0.4c"/></node>
    This was back in the early days of Potlatch when edit changes went live immediately on the API database. This 'live mode' of Potlatch prior to Potlatch 1.0 may explain the existence of most of these very transient versions.
  • History files for an area may contain data outside that area. I've used Geofabrik's history files (now requiring an OSM logon) for Great Britain, but they contain all versions of node #1. It spent some of its existence in Argentina. I suspect the same is true if one creates files using Osmium. Mainly this is because one needs the full history, including deleted versions, which lack any geographical information. As far as I know it adds little overhead.
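The sequence-number idea and the shared-timestamp problem can be illustrated with a minimal sketch. It is written in Python rather than SQL and is not the processing code referred to in this post. It assumes each element version has already been loaded as an (element_id, version, timestamp) tuple; the function names and the "n"/"w" prefixes on the ids are made up for the illustration. The sample rows are only the versions quoted above. Numbering by version within each element is roughly what ROW_NUMBER() over a partition by element id would give in SQL.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def add_sequence_numbers(versions):
    """Return (element_id, version, timestamp, seq) rows.

    seq runs 1, 2, 3, ... within each element and is ordered by version
    number, not by timestamp, because timestamps can be shared.
    """
    rows = sorted(versions, key=itemgetter(0, 1))
    out = []
    for _, group in groupby(rows, key=itemgetter(0)):
        for seq, (element_id, version, timestamp) in enumerate(group, start=1):
            out.append((element_id, version, timestamp, seq))
    return out


def version_gaps(rows):
    """List (element_id, previous_version, version) where versions jump."""
    gaps = []
    for prev, cur in zip(rows, rows[1:]):
        if prev[0] == cur[0] and cur[1] - prev[1] > 1:
            gaps.append((cur[0], prev[1], cur[1]))
    return gaps


def shared_timestamps(rows):
    """List (element_id, timestamp) pairs used by more than one version."""
    counts = Counter((element_id, ts) for element_id, _, ts, _ in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Only the versions quoted in this post; ids are prefixed by element type.
history = [
    ("w4839614", 14, "2017-10-08T19:56:51Z"),
    ("w4839614", 1, "2007-07-02T11:09:11Z"),
    ("n107298", 1, "2005-06-05T18:31:25Z"),
    ("n107298", 2, "2007-11-05T22:02:42Z"),
    ("n107298", 3, "2007-11-05T22:02:42Z"),
    ("n107298", 4, "2007-11-05T22:02:42Z"),
]

rows = add_sequence_numbers(history)
print(version_gaps(rows))
print(shared_timestamps(rows))
```

On these rows the gap check reports the jump from version 1 to version 14 of way #4839614, and the timestamp check reports node #107298. Comparing an element's previous and next versions through the sequence number, rather than through version - 1 or the timestamp, avoids both problems.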
A quick recap of the key points:
  • Element history in OSM full planets is not complete (at least for early stages of the project).
  • There can be gaps in the sequence of OSM element version numbers (usually caused by redactions).
  • Redactions may affect the integrity of the relationships of geometries.
  • Time granularity is not enough to separate all object versions.
  • History files do not cleanly contain only data from the relevant area.
Taken together these facts mean that processing OSM history data requires a fair degree of defensive programming. I've introduced a few such measures as I've been working through the data for GB, but clearly I need more, because it's otherwise hard to separate programming bugs from data glitches. For instance, I am now introducing node counts, point counts and deleted node counts for geometries as I assemble them.

This also means that I'm delaying placing the basic SQL processing code for Osmium OPL data on GitHub until I have done some more checking. However, the specific examples I've posted above represent decent candidates for building up a small dataset for tests.
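As a rough illustration of the per-geometry counts mentioned above, here is a minimal Python sketch. It is not the SQL used for the real processing. The dictionary layout, the `GeometryCounts` class and the function names are assumptions made for the example, and timestamps are compared as strings, which only works if they all use the same UTC ISO 8601 format. Only the node versions quoted in this post are included, so each way is cut down to the single node ref whose history is known.

```python
from dataclasses import dataclass


@dataclass
class GeometryCounts:
    node_refs: int      # nd refs listed on the way version
    points: int         # refs resolved to a visible node with coordinates
    deleted_nodes: int  # refs whose node is deleted at that moment
    missing_nodes: int  # refs with no node version yet at that moment


def node_version_at(node_versions, when):
    """Latest node version at or before `when`, or None if there is none.

    `node_versions` is assumed to be sorted by version number, so versions
    sharing a timestamp resolve to the highest version.
    """
    current = None
    for candidate in node_versions:
        if candidate["timestamp"] <= when:
            current = candidate
    return current


def count_geometry(way, nodes, when=None):
    """Count how the way's node refs resolve at `when` (default: way timestamp)."""
    when = when or way["timestamp"]
    points = deleted = missing = 0
    for ref in way["refs"]:
        current = node_version_at(nodes.get(ref, []), when)
        if current is None:
            missing += 1
        elif not current["visible"]:
            deleted += 1
        else:
            points += 1
    return GeometryCounts(len(way["refs"]), points, deleted, missing)


# Node histories quoted in this post (coordinates omitted for brevity).
nodes = {
    635109: [
        {"version": 1, "timestamp": "2006-09-13T00:37:15Z", "visible": True},
        {"version": 2, "timestamp": "2010-12-15T17:48:55Z", "visible": True},
    ],
    31140999: [
        {"version": 1, "timestamp": "2007-07-01T18:51:46Z", "visible": True},
        {"version": 2, "timestamp": "2010-06-06T23:45:04Z", "visible": False},
    ],
}

# Cut-down ways: only the one ref with a known node history is kept.
way_180_v1 = {"timestamp": "2006-06-04T00:35:25Z", "refs": [635109]}
way_thurso_v1 = {"timestamp": "2007-07-02T11:09:11Z", "refs": [31140999]}

print(count_geometry(way_180_v1, nodes))
print(count_geometry(way_thurso_v1, nodes, when="2010-06-07T00:00:00Z"))
```

A geometry where points is lower than node_refs can then be flagged as a data glitch rather than silently producing a shorter line. Here the first call reports one missing node (the node is created after the way) and the second reports one deleted node (the node is deleted while way version 1 still lists it).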


2018-09-16 12:00: Example of a node with multiple versions with a single timestamp added.

1 comment:

  1. Re: multiple object versions with the same timestamp: you can modify the same node multiple times within the same osmChange message upload. However, deleting the same node 107298 v2 again in v3 looks like an old bug.

