Sunday, 8 December 2013

Food Hygiene Open Data: an easy way into mapping addresses and postcodes

I've written a bit about the Food Standards Agency's Food Hygiene Open Data (henceforward FHRS) before, but as it was the focus of my hacking at last week's London Hack Weekend, it's about time it had a dedicated post.

FHRS premises spread out along a line normal to the shortest line to the associated street (see article for details)
Data (c) from OSM contributors 2013, and Food Hygiene Rating Scheme (OGL)

To recap, the salient points about FHRS data are:
  • It's pretty comprehensive, covering most local authorities in the UK.
  • It represents more than 15% of all postcodes.
  • Most records come with full address data and a location (the postcode centroid).
  • Premises (shops, pubs, cafes, etc.) are grouped into categories which map fairly easily to OSM tags.
  • Daily updates. High data currency (last reviewed date on all records).
  • It's Open Data under the plain vanilla Open Government Licence.
For me the obvious uses of this data with OpenStreetMap are:
  • Analysis of data quality and completeness. (This can be for a given retailer, say, Tesco or Aldi; a given local authority (as in my work with Nottingham data), or for a class of premises (see below)).
  • Enrichment of existing OSM data (e.g., pubs) with addresses and postcodes.
  • Provision of targeted mapping destinations (through the location data).
  • 'Prompted recall'. In many cases one can remember pubs, restaurants and hotels from visits many years ago. I have always resisted mapping such places because it is difficult to know whether they are still in business and have not changed their names. It's also possible to remember these places and not immediately recall the name. However, the FHRS data enables one to check if a place is still in business (or likely to still be in business) and to refresh one's memory of the names. I used this to add the extant pubs in Hampton Wick (it was the one I used most which has gone).
  • Identifying change. In the simplest case this is just spotting new, closed or renamed premises. On the other hand it may also identify more major developments.
In my view the one thing this data is not particularly useful for in an OSM context is storing Food Hygiene ratings. I think these change too often (some places get inspected every 6 months), and maintaining such data is known not to be one of our strengths. Furthermore, there are plenty of other ways of acquiring this information, including the FSA's own Android, iPhone and Windows 8 apps.

To date I have been using and manipulating this data manually: I loaded the Nottingham data into a spreadsheet from which I generated a GPX file to load on my Garmin. Each time I visited a new postcode I checked off the premises I'd surveyed on the spreadsheet. I only used a static snapshot of the data. For a while I've been looking at how this data can be exploited in OSM at a national level. This is a rundown of some things I've done, and things I'd like to do.

Getting the Data

The data is available as (currently) 397 XML files on the FHRS website. Data is updated regularly (whenever a local authority needs to add or change the data), but there does not seem to be anything like an RSS feed. There are lots of attributes in the XML data most of which are not of great interest for OSM.

It is easy to read the data in with Excel or other spreadsheets, but when it comes to processing it as a whole there seem to be a number of problems. I tried loading files directly into Postgres but failed. Instead I wrote a simple-minded XSLT program which converts the data into a CSV file (actually with "::" as separator, because of commas in the data). Even then I had problems with a small number of records, mainly from Northern Ireland. Again, a simple kludge is to eliminate offending characters such as the backslash ('\') from the incoming data. At the Hack Weekend MickO wrote another parser which converts the data to a shapefile.
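
As a rough illustration of this conversion step, here is a minimal Python sketch. It is a stand-in for the XSLT program, not a copy of it. It assumes each record is an EstablishmentDetail element with child elements such as FHRSID, BusinessName and PostCode; those element names, the sample data and the helper names (clean, xml_to_rows) are assumptions made for illustration and should be checked against a real FHRS file.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the element names are assumptions, not taken from a real file.
SAMPLE = r"""<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <FHRSID>1</FHRSID>
      <BusinessName>Rose &amp; Crown</BusinessName>
      <BusinessType>Pub/bar/nightclub</BusinessType>
      <AddressLine1>1 High Street</AddressLine1>
      <PostCode>NG1 1AA</PostCode>
      <RatingValue>5</RatingValue>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <FHRSID>2</FHRSID>
      <BusinessName>Corner\Shop, Ltd</BusinessName>
      <BusinessType>Retailers - other</BusinessType>
      <PostCode>BT1 1AA</PostCode>
      <RatingValue>EXEMPT</RatingValue>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>"""

SEP = "::"  # the separator mentioned in the text
FIELDS = ["FHRSID", "BusinessName", "BusinessType",
          "AddressLine1", "PostCode", "RatingValue"]  # assumed element names


def clean(value):
    """Drop backslashes and stray separators, and collapse whitespace."""
    value = (value or "").replace("\\", "").replace(SEP, ":")
    return " ".join(value.split())


def xml_to_rows(xml_text, fields=FIELDS):
    """Return one SEP-delimited line per establishment; missing fields become empty."""
    root = ET.fromstring(xml_text)
    return [
        SEP.join(clean(est.findtext(name, default="")) for name in fields)
        for est in root.iter("EstablishmentDetail")
    ]


for row in xml_to_rows(SAMPLE):
    print(row)
```

Commas in business names survive untouched because the separator is "::", and a missing address line simply yields an empty field.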

The XML format means that individual attributes are not consistently typed: food ratings can be numeric values or "EXEMPT". I therefore load the CSV data into a Postgres table where all columns are strings (I call this an image table), and then parse the data to load it into the main table. The principal objective of using the text image table is to reduce or eliminate the possibility of losing data during the load, or of the load breaking on unexpected values.
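
The parsing step between the all-strings table and the typed table can be sketched in Python as follows. This is only an illustration of the idea, under the assumption that a rating is either a string of digits or some other status word; the function name parse_rating and the choice to keep the status in a separate value are hypothetical, and the real work in the article is done inside Postgres.

```python
from typing import Optional, Tuple


def parse_rating(raw: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Split a raw rating string into (numeric_rating, status).

    Exactly one of the two is set for non-empty input: the number when the
    text is all ASCII digits, otherwise the upper-cased text (e.g. "EXEMPT").
    Nothing is discarded, so unexpected values can be inspected later.
    """
    text = (raw or "").strip()
    if not text:
        return None, None
    if text.isascii() and text.isdigit():
        return int(text), None
    return None, text.upper()


for value in ["5", "EXEMPT", " 3 ", "", None, "AwaitingInspection"]:
    print(repr(value), "->", parse_rating(value))
```

Keeping the unparsed status alongside the number means an unexpected value never aborts the load; it just ends up in the status column.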

This whole process needs to be automated. In particular finding diffs and only loading those together with some temporal columns on the main FHRS table would enable more interesting handling of the data. To date I only have a very crude shell script to pull the data from the main FHRS site.
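
Finding diffs between two downloads could look something like the sketch below. It assumes each record carries a stable identifier (called FHRSID here, which is an assumption) and that a snapshot has already been read into a dictionary keyed by that identifier; the function name diff_snapshots and the sample records are made up for illustration.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by record id.

    Returns (added, removed, changed), where changed maps each id to an
    (old_record, new_record) pair.
    """
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return added, removed, changed


# Made-up snapshots keyed by an assumed FHRSID.
last_week = {
    "1": {"name": "Rose & Crown", "postcode": "NG1 1AA"},
    "2": {"name": "Corner Shop", "postcode": "NG1 1AB"},
    "3": {"name": "Sycamore Primary School", "postcode": "NG3 4QP"},
}
this_week = {
    "1": {"name": "Rose & Crown", "postcode": "NG1 1AA"},
    "3": {"name": "Sycamore Academy", "postcode": "NG3 4QP"},
    "4": {"name": "The Albany", "postcode": "NG7 2RD"},
}

added, removed, changed = diff_snapshots(last_week, this_week)
print("added:", sorted(added))
print("removed:", sorted(removed))
print("changed:", sorted(changed))
```

The three result sets correspond to new, closed and renamed premises, and a load date attached to each would supply the temporal columns mentioned above.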

Simple Data Analysis

The simplest way to look at the data is just to compare for a given area the number of premises (schools, pubs etc) from FHRS with similar types of POI from OpenStreetMap. By far and away the easiest is looking at the data by local authority, but I've gone for a finer grained approach using postcodes and the Geolytix postcode boundaries. (This does have some deficiencies because not all data has complete or accurate postcode data).
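
The counting side of such a comparison can be sketched in a few lines of Python. The sketch assumes the postal sector is the postcode minus its last two letters (so "NG7 2RD" belongs to sector "NG7 2"), and that OSM POIs have already been assigned to sectors by a spatial join against the boundaries, which is not shown. The function names and sample values are hypothetical.

```python
from collections import Counter


def postal_sector(postcode):
    """'NG7 2RD' -> 'NG7 2'; None for missing, partial or malformed postcodes."""
    compact = "".join((postcode or "").upper().split())
    if not 5 <= len(compact) <= 7:
        return None
    outward, inward = compact[:-3], compact[-3:]
    if not (inward[0].isdigit() and inward[1:].isalpha()):
        return None
    return f"{outward} {inward[0]}"


def count_by_sector(postcodes):
    """Count records per sector, silently skipping unusable postcodes."""
    sectors = (postal_sector(pc) for pc in postcodes)
    return Counter(s for s in sectors if s is not None)


def compare_counts(fhrs_counts, osm_counts):
    """Map each sector to an (fhrs_count, osm_count) pair."""
    sectors = sorted(set(fhrs_counts) | set(osm_counts))
    return {s: (fhrs_counts.get(s, 0), osm_counts.get(s, 0)) for s in sectors}


# Made-up inputs: FHRS pub postcodes, and OSM pub counts per sector.
fhrs_pubs = count_by_sector(["NG7 2RD", "ng7 2ab", "NG1 1AA", "NG7", None])
osm_pubs = Counter({"NG7 2": 1, "NG9 1": 3})

for sector, (fhrs_n, osm_n) in compare_counts(fhrs_pubs, osm_pubs).items():
    print(sector, fhrs_n, osm_n)
```

Records with incomplete postcodes drop out at the postal_sector step, which is one way the deficiency noted above shows up in the totals.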


Panels: Fastfood Outlets; Pubs; Restaurants and Cafes; Other (food) retail; Schools; Supermarkets
Comparison of OSM and FHRS Data at level of Postal Sector
Contains data from Code Point Open 2013 (c) and database right: Royal Mail, Ordnance Survey, Geolytix Ltd
Contains data from OpenStreetMap 2013 (c) and database right OSM contributors

The above comparisons show how the idea works in practice, but they must be taken with a pinch of salt as I have not ironed out steps which may lose data in constructing these choropleths.

It's important to be clear that such comparisons are of aggregated data, and not necessarily like-for-like datasets: for instance FHRS data only includes pubs which serve food, and not every authority collects data on every category (Argyll and Bute don't do convenience stores at present). In some cases the categories are misused: Gedling appears to place external caterers at schools in the "Other Caterer" category. For more sophisticated approaches see below.

Tools for using FHRS data in conjunction with OpenStreetMap

Working with FHRS data has suggested a range of steps and processes where fairly simple tools could be used. Some exist already, others we have made a start on. Ideally manipulating this data would lead to more generic approaches suitable for handling other sources of Open Data.

This is a wish list of a few things I would like (there are others): some definitely exist, some probably are available in one part of the OSM software ecosystem, but some need creating.
  • FHRS Data Grabber: Scripts and parsers to grab the data from the website on a regular basis, extract the useful data from the XML files, format it and find diffs.
  • Postgres and Shapefile based FHRS Data: Experience shows that XML is a bit tedious to manipulate for the type of base data feeds we are interested in. FHRS data is only one example of many similar data sources. At a minimum it is nice to be able to have data in shapefiles, and in a format convenient for import to PostgreSQL, such as CSV. Others may wish to have it as JSON.
  • RSS Feeds: Converting extracts of data into diffs means that it is possible to contemplate something like the RSS feed which it would have been nice to see from FHRS in the first place! Such a service would be much more useful for people who curate areas of the map, and are more likely to be solely interested in changes.
  • Tiled FHRS Layer: A tiled rendered layer similar to the very useful oscompare layer of Postcodes from Chris Hill, would make checking FHRS data against existing OSM data something which was easy to do from within the editor. (Not as useful as using snapshot server to do the same thing, but probably a bit quicker to implement).
  • Snapshot Server: Snapshot Server is a Rails app which allows a small database in Osmosis snapshot format to be served to editors. It was written by Andy Allan, and so far has mainly been used to allow DfT cyclepath open data to be brought into OSM in a much more controlled way than a classic import. Providing FHRS data in the snapshot server would enable much faster merging of address data to existing POI nodes and ways.
  • Address parsing: FHRS has a wide range of address forms, ranging from standard housenumbers on a street to places like St Alban's Cathedral. Furthermore individual authorities have used the 4 address fields in the data in very different ways: Argyll put the whole address in the first field, Nottingham only put information in the first field for subsidiary parts of the address, Rotherham don't seem to bother with postcodes. Ideally we would automate parsing of the address to chunk addresses into fragments which correspond to things like addr:housenumber, addr:street etc. (See below for matching streetnames only).
  • House number vectors and Street Topology generalisation: Currently it is only possible to sort POIs at one postcode in an arbitrary direction. Ideally we would know which way housenumbers run on the street (forward or backward for streets with even/odd numbering, clockwise or anticlockwise for streets with sequential numbering). This information is available, but is a little tricky to process: in particular it is much easier with generalised forms of the streets. My main problems so far are: major roads with large roundabouts, flared junctions, dual carriageways, service roads, etc.; residential roads which loop back on themselves; and roads with branches. In each case it is not possible to simply chain along the ways which make up the road (which anyway is a painful thing to have to do just to reconstruct the entire street).

    The availability of data on the direction of increasing house numbers could greatly reduce the need for initial detailed address surveys. This becomes more significant as we get additional large but partial sources of addresses, notably the Land Registry Price Paid data.
  • Data Merging (Conflation) tools. This is a huge topic in its own right, so I'm going to say relatively little here. I would like not just to identify prospective similar data items by co-location, but through a range of fuzzy matching techniques. So far I have looked at tokenising names to have a better chance of matching "Rose & Crown" to "The Rose and Crown Inn" or "Sycamore Primary School" to "Sycamore Academy"; and methods for fuzzily matching POI attributes: a pub may be encoded as a bar or even a restaurant. Better tools in this area would help drive a more sophisticated matching process for analysing completeness. However, this is a longer term project. FHRS provides a good set of test data.
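
The name tokenising idea from the conflation item can be sketched as below. This is an illustrative assumption about how such matching might work, not the code used at the Hack Weekend: names are lower-cased, "&" is treated as "and", a small hand-picked list of filler and generic words is dropped, and the remaining tokens are compared with a Jaccard score. The word lists and function names are hypothetical.

```python
import re

# Assumed filler and generic words; a real list would need tuning per category.
STOPWORDS = {"the", "and", "inn", "ltd"}
GENERIC = {"primary", "school", "academy"}


def tokens(name):
    """Lower-case a name and return its distinctive word tokens as a set."""
    words = re.findall(r"[a-z0-9]+", name.lower().replace("&", " and "))
    return {w for w in words if w not in STOPWORDS and w not in GENERIC}


def similarity(name_a, name_b):
    """Jaccard similarity of the token sets, from 0.0 to 1.0."""
    a, b = tokens(name_a), tokens(name_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


pairs = [
    ("Rose & Crown", "The Rose and Crown Inn"),
    ("Sycamore Primary School", "Sycamore Academy"),
    ("Rose & Crown", "The Albany"),
]
for left, right in pairs:
    print(left, "|", right, "|", similarity(left, right))
```

A score like this would only be one signal; in practice it would be combined with distance and with the fuzzy category matching described in the same item.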

Helping to Map using FHRS

In the short term my main goal is to provide simple tools which will help people use FHRS data more effectively.

The image at the top shows the type of output I'm looking to produce either as a tiled layer or through the snapshot server.

Each FHRS premises with a full postcode is tested to see if any part of the address matches the name of any highway on OSM within 100 metres. (Non-matches can be re-processed with increasingly large buffers.) Data without a street name, or without a full postcode, is currently lost.
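
The text-matching half of this test might be sketched as follows. The sketch assumes the candidate street names within the buffer have already been fetched by a spatial query (not shown), and it uses a simple whole-word comparison after stripping case and punctuation; the function names and sample addresses are made up.

```python
import re


def normalise(text):
    """Lower-case, drop apostrophes and punctuation, collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower().replace("'", "")))


def match_street(address_lines, nearby_streets):
    """Return the first nearby street whose name appears in an address line.

    Longer street names are tried first so that 'Upper High Street' wins
    over 'High Street'. Returns None when nothing matches.
    """
    candidates = sorted(nearby_streets, key=len, reverse=True)
    for line in address_lines:
        padded = f" {normalise(line)} "
        for street in candidates:
            key = normalise(street)
            if key and f" {key} " in padded:
                return street
    return None


print(match_street(["12 High Street", "Hampton Wick"], ["Park Road", "High Street"]))
print(match_street(["3 Upper High Street"], ["High Street", "Upper High Street"]))
print(match_street(["St Alban's Cathedral"], ["Holywell Hill"]))
```

A None result is the case that would be retried with a larger buffer, that is, with a longer list of nearby streets.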

Then I find the shortest line to the matching road (thick red line in diagram), and draw a line normal to it whose length is 10 metres multiplied by the number of premises minus one. So a postcode with 1 retail outlet will have a zero-length line at the postcode centroid, whereas one with 6 premises will have a line of 50 metres. The line is segmentized every 10 metres, and the individual premises are assigned to the nodes of the line. An attempt is made to sort premises by housenumber, but this is very rough and ready, and anyway the order may be back-to-front.
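
The geometry can be sketched in plain Python as below. The article does not say which tools were used, so this is only an illustration under several assumptions: coordinates are in a projected system measured in metres (not latitude and longitude), the road is simplified to a single straight segment, and the line of premises is centred on the postcode centroid. The function names are hypothetical.

```python
import math


def closest_point_on_segment(p, a, b):
    """Closest point to p on the segment a-b (all points are (x, y) in metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def spread_premises(centroid, road_a, road_b, count, spacing=10.0):
    """Place `count` points along a line normal to the shortest line to the road.

    The line has length spacing * (count - 1) and is centred on the centroid
    (the centring is an assumption of this sketch).
    """
    if count <= 0:
        return []
    cx, cy = centroid
    qx, qy = closest_point_on_segment(centroid, road_a, road_b)
    vx, vy = qx - cx, qy - cy
    if vx == 0 and vy == 0:
        # Centroid lies on the road: spread along the road direction instead.
        ux, uy = road_b[0] - road_a[0], road_b[1] - road_a[1]
    else:
        ux, uy = -vy, vx  # shortest line rotated by 90 degrees
    norm = math.hypot(ux, uy) or 1.0  # degenerate road: every point stays at the centroid
    ux, uy = ux / norm, uy / norm
    start = -spacing * (count - 1) / 2.0
    return [
        (cx + (start + i * spacing) * ux, cy + (start + i * spacing) * uy)
        for i in range(count)
    ]


points = spread_premises((0.0, 0.0), (-100.0, 20.0), (100.0, 20.0), count=6)
for point in points:
    print(point)
```

With six premises the first and last points end up 50 metres apart, and with a single premises the only point is the centroid itself, matching the lengths described above.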

Lastly, each premises can be rendered with an icon representing its class in FHRS.

The example I have chosen is an area where I have mapped all the FHRS premises. I have used OpenCycleMap as a background simply to reduce the amount of detail in the background layer, but this leaves only a small number of POIs to compare: notably pubs, restaurants and cafes. The location of "The Albany" pub demonstrates how the FHRS location (postcode centroid) may be some way from the actual POI itself.

Conclusion

It's not been possible to touch on half the things which can be done with FHRS data, but I hope that I've shown that work done at the Hack Weekend may be of some use to regular mappers fairly soon now.






