Friday, 23 May 2014

Fuzzy ideas on fuzzy matching

The UK Food Hygiene data set (FHRS) is just on example of many which it would be nice to be able to compare with OpenStreetMap in a semi-automated manner.

External open data can both be a useful source of missing data and an important tool for evaluating completeness and quality of OSM. FHRS has a number of nice properties:
  • it's large, but not too large; 
  • it is generally of high quality; 
  • it has reasonable precision geolocation; 
  • it is pretty current (most data - five-sixths - is less than 3 years old); 
  • it covers a wide range of different class of feature (hospitals, schools, pubs, butchers etc.); 
  • and it is comprehensive.
Even with good quality data there are always problems in matching data from two sources (conflation seems to be the GIS word for this):
  • Firstly, the location data provided is often not precise enough to do direct comparisons based on location. 
  • Secondly, elements like names and addresses may have enough variation in them to be non-trivial to match automatically, even if to a human the redundancy in the data means that matching is possible. 
  • Thirdly, names and functions of amenities change. 
  • Fourthly, different sources of data may encode features in different ways. 
  • Fifthly, all large data sets have errors in them.
The last point is the critical one: it is not wise to assume that any particular attribute of a given data set has a fixed level of data quality. So matching must be reasonably tolerant of missing values, values subject to typographical errors, encoding differences and variable accuracy. Also attributes may be interdependent in the incoming data set (for example locations in FHRS are derived from look-ups on postcode centroids): thus a trivial error in one value may result in a serious error in a dependent attribute, such as a single letter typo in a postcode moving a point 1000 kilometres away.

I have therefore been trying to think of ways in which matches can be made independently and with a degree of fuzziness. To this end I've been trying to come up with a listing of the more important types of matching operations involved with FHRS data. The hope is that I cover most use-cases for other data sets.

How fuzzy matches are scored and combined is not treated here. Equally there are tools already within the OSM community: Nominatim, Osmose, OS Locator Musical Chairs which provide elements of the functionality I desire.

Non-locational matching

There are many matching operations which require no knowledge of the associated geographies: although in practice they may correspond to strings used in matching operations (e.g., matching on a local authority name is in practice also matching to a polygon). Despite the fact that some of these things can be inextricably linked I want to treat them entirely separately for matching purposes.

Names

Basic fuzzy matching of names is handled directly with PostgreSQL by a package called fuzzystrmatch. This provides several different ways of comparing the similarity of strings. The same algorithms are available in other RDBMs packages, and for many programming languages. Robert Scott used the Levenshtein algorithm very successfully for OSM Musical Chairs which helped mappers in Great Britain to track down discrepancies between OSM street names and Ordnance Survey (open) data. (These did not always turn out to be typos on our part).

However Levenshtein only works if strings are of similar length. In the case of FHRS I want to be able to find likely matches in these typical cases:

FHRS Name
OSM Name
Reason for difference
Sycamore Primary School
Sycamore Academy
Name change (a common type in UK at present)
Robin Hood
Robin Hood and Little John
Name truncation
Rose & Crown
Rose and Crown
use of abbreviations
Rose and Crown Inn
Rose and Crown
Name variant
The Rose and Crown Hotel
Rose and Crown
Name Variant
Rose and Crown Public House
Rose and Crown
Unnecessary explicit non-name info
Royal Gourock YC
Royal Gourock Yacht Club
use of abbreviations

Royal Gourock Yacht Club - geograph.org.uk - 858830
Clubhouse, Royal Gourock Yacht Club
© CC-BY-SA Thomas Nugent on Geograph via Wikimedia Commons

The major common feature of these strings is that there are elements which obviously match when inspected visually. Common sense and real-world knowledge (that lots of schools are changing their name to use "Academy", for instance) tells us that many of the elements are unimportant and can be ignored in matching. It's a bit harder to persuade a computer to do the same!

My basic idea is that we should match individual words, so that each string is divided into individual word strings, which I refer to as tokenising. Very common tokens ("and", "the", "&") and domain specific ones ("school", "academy", "primary", "hotel", "inn", "pub", "public house", "club", "cafe") are eliminated before attempting matches.

Charlbury, the Rose and Crown pub - geograph.org.uk - 801266
One of the many Rose & Crown pubs in Britain :
this one has particular significance for OpenStreetMap
© CC-BY-SA Francois Thomas on Geograph via Wikimedia Commons.
A totally naive way would just to be to look for single token matches and then apply increasingly stringent matching criteria. Unfortunately this falls down on performance grounds: there are a lot of "Rose and Crown" pubs in Britain (236 in FHRS data, 174 on OSM), and a not inconsiderable number of "Rose" and "Crown" pubs, not to say "Rose Cottage" restaurants. A crude initial single token match will result in thousands of potential matches just for these strings (a minimum of 2*236*174 => 82k+). At this stage I think a means of matching multiple tokens in the first pass is less likely to end up running into major performance issues when scaled for larger data sets.

One reason to worry about this performance is that we also need to take account of typos in both data sets, and therefore either as part of the initial match or as a subsequent trawl through unmatched items we also need to use Levenshtein distance for token comparison.

It may well be that the most effective route is to apply stringent matching criteria (all tokens match and are in the same order) before progressively relaxing the constraints. Only looking at real world examples will help evaluate which method will be most effective. In practice the method used may need to be parameterizable to reflect the quality of one or more data sets. For instance I remember being frustrated trying to match GNIS names to US military maps of Pakistan during the 2011 flooding, and I want to work on techniques which are just as applicable to humanitarian scenarios as to improving pub coverage in the UK.

Feature (Point of Interest) Type 

FHRS data comes with a range of Business Types:

  • "School/college/university"
  • "Pub/bar/nightclub"
  • "Restaurant/Cafe/Canteen"
  • "Importers/Exporters"
  • "Mobile caterer"
  • "Retailers - other"
  • "Retailers - supermarkets/hypermarkets"
  • "Retailers - other"
  • "Takeaway/sandwich shop"
  • "Farmers/growers"
  • "Distributors/Transporters"
  • "Other catering premises"
  • "Hospitals/Childcare/Caring Premises"
  • "Hotel/bed & breakfast/guest house"
  • "Manufacturers/packers"
Although how these correspond to OSM tags is in the main, fairly obvious,  most of the FHRS categories are broader.

There is also the straightforward classification problem: it is not always easy to decide if a places is a restaurant or a pub. Yesterday, I had the problem of deciding if a place is a hospital or a care home (the probable answer is that once it was a hospital but now is a care home). Thus one needs some way of matching semantic categories (presumably in the first instance using trivial rules such as pub ⇔ "Pub/Bar/Nightclub"), but also a means of identifying potential overlaps or 'spill-over' between semantic categories. 

Manor Pub Restaurant on Nottingham Road Toton Corner - geograph.org.uk - 1058543
The Manor at Toton Corner
A classic pub building, but these days a restaurant with a small bar area: places like this are likely to be tagged or coded differently by different people.
Other examples are: pubs with overnight accomodation; and hotels with pub-like bars; petrol stations only marked as such, but with a convenience store on site too.

My impression is that it will be easier to identify rules for the fuzzy aspects of matching semantic categories by using training sets. Once again FHRS data, simply because it has lots of other detail (addresses, limited geolocation and names) is a decent starting point for identifying how fuzzy OSM tag categories might be. 

There is another problem here: systematic mis-classification. At least one local authority, Gedling, places all school contract caterers in the category "Other Caterers", when it is clear that they should be in "School/college/university".  Of course, in cases like this, one would hope that we can work to get the data properly coded: no-one can assess hygiene of school catering in this district easily with the data as it stands.

Addresses

It may be odd to treat addresses as non-locational data, but here I am largely referring to string matching of one or more parts of the address : independent from any awareness of the locations associated with these strings.

Ideally, the address is parsable into discrete elements. This is probably true for most UK addresses which consist of a number, street and post town, but as is usually the case all the difficulty lies in the exceptions. Furthermore each country requires parsing rules specific to its own particular cases. For instance in Spain, it is not unusual for forms requesting address data to ask for the door (often izquierda (left) and derecha (right)) and floor number as well as the rest of the address. Following edits on the OSM wiki I have also learnt about addresses in Florence (businesses and residential addresses have separate but overlapping numbers) and in the Czech Republic. Places like Japan and South Korea have quite different addressing schemes too.

Thus decent functionality for address parsing is to my mind rather more complicated to look at straight away. Instead we can focus on parts of the address which we are highly likely to have already captured in OSM. Notably these are the street name and the postal town/village/city. The former is easier to use, not least because the Royal Mail in Britain insist in some very odd uses of locations in addresses.

Once again very common names present a matching volume problem (Church Street, Main Street etc.) but this can be greatly reduced by applying other non-geographical constraints (such as the local authority which provided the data) from the data set. (This may seem to be making everything too hard, but I really want to keep pure geo-matching separate: ultimately it should make for a cleaner architecture). One important reason for doing this is error handling.

A simple inspection of addresses in the Land Registry Prices Paid file for the town of Maidenhead (for obscure technical reasons I used a postcode, SL6, as a proxy) reveals a small number of addresses where the postcode does not match the polygon of the named district in the file.

Land Registry records with either an erroneous assignment of postcode or of local authority
Data are for post district SL6 (orange line) where the postcode centroid was not located in the boundary of the local authority in the records.
Boundaries from OS Open Data Boundary Line, Post Code centroids from CodePoint Open,
SL6 boundary from Geolytix (based on OSGB Open Data sets)
In large datasets these types of errors are invariably present: failing to cope with them (including chucking them out) often leads to obscure complications both with the data and code to manipulate it.

Locational Matching

By locational matching I mean comparing data sets based on a geolocated data: whether this is a point (as with FHRS data located at postcode centroids), or an area (again with FHRS data, the local authority which has collected the data.

It should be noted that most datasets will have two implicit sources of geolocated data: scope (the defined area of the data set, typically with OpenData sets, scope will be a country, state, or local authority) and source (who collected the data, sometimes identical to scope). The important aspect of these two implicit sets of location data is that they are likely to be free of basic locational errors. A national data set is very unlikely to include data for other countries; inaccurate locations outside the source local authority are likely to be erroneous. This basic information must not be overlooked as it provides a good control on data reliability and will often enable other matching to be much more constrained.

Scope

The FHRS data potentially covers the entire United Kingdom, which is its basic scope. The Scottish part of the scheme has different data and therefore also has a separate scope.

Source

As FHRS data is collected independently by each local authority, and this information is contained in the source data, this provides a finer grain of location data which can be treated as having a very low error rate.Source is important because data quality is likely to vary by source (it certainly does in FHRS data).

Explicit Locations

My expectation is that most data sets are likely to provide explicit location information in the form of lat/lon pairs (or eastings and northings in other co-ordinate systems), with line or polygon information being rarer. Certainly Nottingham OpenData have been removing many data formats in favour of plain CSV: this is pragmatic, it is easy to load the data in a spreadsheet, but not so simple to look at it in a GIS. Local users are also more likely to be able to make use of the data without requiring that it be mapped in the first instance. With this in mind most of what follows assumes data is delivered with a point location. In many cases if data does have a more elaborate geographical content, much of the matching may still be carried out based on centroids.

The key problem with centroids is that one has no idea of the degree of imprecision of the data. For instance playing with GNIS data exposes many data items which are located a long way from their true location, whereas the Nottingham OpenData on Streetlights is accurate to the nearest metre. Use of postcode centroids makes things slightly harder as their degree of accuracy will be a factor of local postcode density. The worst case for a postcode in Great Britain is a farm which is over 11 km from the centroid.

Therefore at least for things like UK postcodes and GNIS data the plain locational matching will need to start with a fairly tight circumspection of the area of potential matches, which can be progressively relaxed as most matches are made.

Other datasets which provide metadata on precision (Nottingham Streetlights, q.v.) can probably be handled with a single suitable matching operation.

Other implicit locations

Oddly, postcodes feature here too. It is usual, although not mandatory, that a postcode can belong to one and only one street. The main exceptions are in rural areas where all houses in an area share a postcode (usually there are no street names in this case), or when a group of houses are associated with a subsidiary street as well as the main one. In the case where the postcode does belong to the street, the data can be matched to the geometry of the street (which may or may not be helpful).

Outline 

Here's a brief outline of what I think the main components should be:
  • One or more matching engines. These use a rule driven matching technique to two data sets with an optional rule driven filter (for performance reasons). Matching could be bi-directional, but for simplicity I assume it is uni-directional as the other direction can be done by swapping the two dataset parameters. Output from the matching engine is a set of matches scored by a likelihood measure. A minimum of two matching engines can be recognised: one based on strings the other on geographies. Filters are likely to be a straight datasetA.attributeX = datasetB.attributeX (e.g., local authority identifier).
  • Matching should allow increasing/decreasing tightness of constraints. Probably by allowing recursive calls within a matching rule.
  • A match selector. Given n different matching routines each producing a likelihood estimate, something needs to evaluate these scores and output final matches.
  • A matching routine chooser. With different data sets the order of application of matching routines it may be better to train the system with a known data set in order to use the most efficient way to apply matching routines.
  • A simple way of specifying rules.

Summary

I came to this problem with some old experience of use of Harte Hanks Trillium software for keeping track of commercial customers in a banking application. I didn't use the tool myself but it was an important part of being able to build a single common view of customers, and part of this was matching up different versions of a business name captured both in internal and external systems.

Years before on another (antique) banking system we came upon the unfortunate decision to create internal keys based on the client name, which meant that we lost any history when the name altered (potentially just a typo). I mention these just to illustrate that string matching and intelligent address parsing have always had important business applications. However, OpenSource resources to do similar things are few and far between.

Twice I have been struck how fairly simple matching operations would have made mapping during HOT activations a bit easier: locating hospitals in Haiti after the Januaru 2010 'quake, and matching GNIS nodes to names on old US Military Maps during the 2011 Pakistan floods. Ideally we would be able to use a range of matching techniques to enrich the map data created from aerial and satellite imagery at the time of a crisis. Not all such data would be directly suitable for OSM, as there would be potential for trying to match stuff from non-open sources too, but in the main I see the whole process as being an aid to mapping, not a way to directly generate map data.

I have only set out some desiderata here, although I've played a little with some of the basic techniques described here. I certainly have not attempted any mechanism for fuzzy matching, although I have discussed the viability of using Bayesian approaches with a couple of folk who know much more than I do. For me the key thing is to have a plug-in framework for matching engines and matching rules. The flexibility gained will not just allow increasing refinement of techniques, but also enable only appropriate techniques to be used and in the most efficient order.


No comments:

Post a Comment

Sorry, as Google seem unable to filter obvious spam I now have to moderate comments. Please be patient.