A few weeks ago there was an involved, and sometimes heated, discussion on the main OSM mailing list about imports. Many of the comments were interesting and useful, but one particular strand caught my attention. A slightly grotesque paraphrase of these various messages might be "OSM is a poorly managed Computer Science project, with inadequate tools, particularly for version control."
Leaving aside that OSM is neither a project nor managed, I'd like to focus on what seems to be a surprising misperception about OSM as a database.
Firstly, databases don't have to be digital or stored electronically. Phone books, card indexes, and many reference books are easily recognisable as databases. Therefore a primitive database operation is only valid if it can also be applied to non-digital media.
Secondly, and this directly follows, information stored in a database does not have to have a unique identifier (a primary key). There's nothing stopping a telephone number appearing several times in a phone book, or complete pages being duplicated.
Once data is moved onto digital media, it really helps to assign unique identifiers: data can be restructured to be stored more efficiently or be easier to change; it's easier to spot duplication or bad data. This is exactly what OSM does: nodes, ways and relations are identified internally by system-generated keys.
And it's what my local council does. They maintain an asset register of street furniture (bollards, traffic signals, parking signs, parking meters) and within this application assign identifiers. These days all this information is geocoded and available in the council's GIS. This information is obviously useful for financial planning, maintenance and other activities. BUT, they've gone a step further and each asset now has its system-assigned number marked on it. WHY? Because "replace the bulb in lamp 35621" is a lot more specific than "replace the bulb in the lamp outside 25 Main Street". There may be lamp standards opposite each other at that location, or there might be two Main Streets, or there may be no number on 25 Main Street, or the house might have been demolished. However, this number DOES NOT uniquely identify the lamp-post on its own: it might do so when combined with its location data, location data of the organisation which has assigned the number, and information about the status of the system used to generate the number.
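The point about the lamp number can be made concrete with a small sketch. Everything named here (the Asset record, the two councils, the coordinates) is invented for illustration and is not taken from any real council system. The only claim is that a locally generated number needs its issuing context before it can serve as a key.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Asset:
    authority: str   # hypothetical: organisation that issued the number
    asset_no: int    # the number stencilled on the object
    kind: str
    lat: float       # made-up coordinates, for illustration only
    lon: float

assets = [
    Asset("Council A", 35621, "lamp", 52.9540, -1.1500),
    Asset("Council B", 35621, "lamp", 53.4000, -2.9800),
    Asset("Council A", 35622, "bollard", 52.9541, -1.1502),
]

def index_by(items, key):
    """Group items by key(item); a group with more than one item is a clash."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups

by_number = index_by(assets, lambda a: a.asset_no)
by_authority_and_number = index_by(assets, lambda a: (a.authority, a.asset_no))

# Keys that point at more than one real-world object
ambiguous = {k for k, v in by_number.items() if len(v) > 1}
still_ambiguous = {k for k, v in by_authority_and_number.items() if len(v) > 1}
```

With the bare number as the key, lamp 35621 refers to two different lamps; adding the issuing authority removes the clash in this toy data. It says nothing about what happens when a council renumbers or replaces its assets.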
What does this have to do with OSM (apart from the wonderful possibility of collecting lamp-post numbers)? Well, OSM is like my local council, except that our local patch is a bit bigger, and we cannot go stencilling numbers on anything we map. So we have no means to tie an OSM object to its corresponding thing in the real world.
A corollary to this is that we cannot confidently tie OSM objects to geolocated objects in other databases. There are far too many variables to even inspire confidence in fuzzy matches: when was the OSM data mapped? what sources were used? how accurately was it mapped? when was the external data mapped? how current is it? how complete is it? does it have unique identifiers? are the identifiers persistent?
So, we have a host of problems in matching data from an OSM dataset and an equivalent external dataset. These problems relate to location accuracy, temporal accuracy, matching identifiers, and accuracy of associated data.
A good example of these problems is shown by OS Locator Musical Chairs and ITO's OSM Analysis, which compare OSM street name data for Great Britain with the OpenData Locator dataset from the Ordnance Survey. This is a nice clear domain, with the OS Locator data being from a known source and date and from a highly reputable national mapping agency. In some areas we have enough separately sourced data in OSM to have a handle on how accurately we can match these datasets. In most areas in England about 0.5-1.0% of Locator records cannot be matched to OSM. (I am not aware of reverse statistics, but in a recent survey aimed at hunting down some 20-odd of the mismatches I found 5 street names used for addresses which are not present in the OS dataset.) Even different datasets from OSGB have enough inconsistency to prevent complete matching. And these cases are relatively easy ones.
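A rough sketch of this kind of comparison might look like the following. It is not how OS Locator Musical Chairs or ITO's OSM Analysis actually work; the normalisation rules, the 200 m tolerance and the sample records are all assumptions made up for illustration.

```python
import math

def normalise(name):
    """Lower-case, drop apostrophes and full stops, collapse whitespace."""
    cleaned = name.lower().replace("'", "").replace(".", "")
    return " ".join(cleaned.split())

def distance_m(a, b):
    """Approximate distance in metres between two (lat, lon) points
    using an equirectangular projection; adequate over short distances."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371000 * math.hypot(x, y)

def unmatched(reference, candidates, tolerance_m=200):
    """Return reference names with no same-named candidate within tolerance."""
    missing = []
    for name, pos in reference:
        key = normalise(name)
        if not any(normalise(n) == key and distance_m(pos, p) <= tolerance_m
                   for n, p in candidates):
            missing.append(name)
    return missing

# Invented sample records: (street name, (lat, lon))
reference = [
    ("Main Street", (52.9540, -1.1500)),
    ("St. Mary's Close", (52.9550, -1.1520)),
    ("Church Lane", (52.9560, -1.1540)),
]
osm = [
    ("Main Street", (52.9541, -1.1501)),
    ("St Marys Close", (52.9551, -1.1521)),   # spelling differs, same street
    ("Church Lane", (52.9900, -1.2000)),      # same name, different place
]

missing = unmatched(reference, osm)
rate = len(missing) / len(reference)
```

Even this toy version has to make arbitrary choices (which punctuation to ignore, how far apart two records may be), and each choice trades false matches against missed ones.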
There are also problems relating to the purpose of external datasets: cadastral data might not reflect the building outlines we would draw naturally (e.g., French & Spanish cadasters); hydrography data might be segmented for water-flow measurement (e.g., NHD); vector data might be optimised for rendering (OS OpenData VectorMapDistrict); road data might not need to be very accurate (TIGER). The imported data should be restructured to reflect what is important for OSM, not maintained in aspic for some putative update.
So those advocating data imports or having 'development forks' of OSM need to answer: how on earth can you easily relate objects between two different data sets, or even the same data set at different times? Alternatively, we could all add some stencils to our mapping toolkits, but even then we'd have to leave our armchairs.
Postscript: the council are busy replacing all the local lamp-posts; I wonder what number they'll put on the new ones.