Sunday, 14 February 2016

Distribution of Contributions in Volunteer-generated Datasets : Gall or Fruit Fly Records

I remarked in my OpenCageData interview that I see many similarities between biological recording and OpenStreetMap contributions. Indeed, I've had some interesting discussions about this with Prof. Muki Haklay at UCL. Muki's group now do extensive research across the gamut of activities which fall under the rubric of "citizen science", so I'm hopeful that they will elucidate which features are common across this spectrum.

Chaetorellia jaceae f : 5532b
A female Chaetorellia jaceae, a tephritid fly whose larvae feed on Knapweed.
Photo: (c) mausboam, Flickr.
Basically, we know that there is a very long tail of smaller contributions to OpenStreetMap. Both Harry Wood and Frederik Ramm gave presentations on aspects of this at SotM-14 in Buenos Aires, and Richard Fairhurst also touched on this at SotM-US in 2013. Very recently Marc Zoutendijk has used data collected by Pascal Neis to examine a cohort of new Dutch OpenStreetMap contributors from 2014 and 2015.

The usual hope expressed by people doing this type of analysis with OSM data is that by better understanding these contributions we can increase the number of people who continue to contribute after the initial sign-up and first edit.

My perspective is slightly different, because it is coloured by knowledge of the much longer history of biological recording.

In Britain this can roughly be dated to the 17th century, and John Ray's publication of a flora of the Cambridge area. In the early days compilation of records relied on exchanges of letters, but by the 18th century the collation of data had grown to the extent that for many counties it was possible to produce a flora for the local area. For instance in Nottinghamshire, the first flora was produced by Deering in 1738, followed by a second by Ordoyno in 1807. (In fact there was another written in the 1830s, followed by a gap of over a hundred years until the last county flora was written by the Howitts in 1963.) The sheer scope of this literature can be seen in Tim Rich's digitised copy of Simpson's 900-odd-page A Bibliographical Index of the British Flora (see the BSBI website). Here's a small extract of works which mention aspects of the Nottinghamshire flora from around 1750-1825:

(Nottinghamshire) A catalogue of plants ... about Loughborough; R. Pulteney 1747; Manuscript Leicester Museum 1749, and library Linnean Society
Nottingham; Historical account of the town of, C. Deering 1751, 90.
(Nottinghamshire) An account of some of the more rare English plants observed in Leicestershire; R. Pulteney Philosophical Transactions XLIX 2 (1757) 803.
(Nottingham) A catalogue of some of the more rare plants found in the neighbourhood of Leicester, Loughborough and in Charley Forest; R. Pulteney Philosophical Transactions XLIX (1757) 803, 866; and in ‘The history ... of Leicester', I. Nichols I (1795) clxxvii.
Nottinghamshire. Plantae Cantabrigienses; T. Martyn 1763, 83, [from Deering's Catalogue].
Nottinghamshire The present state of all nations; T. Smollett II (1768) 408.
Nottingham. Description of England and Wales; [Society of Gentlemen] VII (1769) 135.
Nottingham. The complete English traveller; N. Spencer 1771, 495; 1773, 495.
(Nottinghamshire) Manuscript notes in Ray's Synopsis iii; J. Lightfoot [died 1788] & J. Hill [died 1775] library Botany Department Oxford.
Nottingham; Topographical and statistical description of the county of, G. A. Cooke [c. 1802-10] 121.
Nottinghamshire. Botanist's guide; D. Turner & L. W. Dillwyn II (1805) 482.
(Nottinghamshire) Midland flora; T. Purton 1817 2 volumes; appendix, parts 1, 2 & 3 (in 2 parts) 1821.
Nottinghamshire. The scientific tourist in England ...; T. Walford II (1818).
Nottinghamshire. The new British traveller; J. Dugdale IV (1819) 6.
Nottinghamshire; Botanical calendar for, T. Jowett 1826. By "Il Rosajo" in local paper*.
Perhaps my favourite quote from this period is about the Maiden Pink, Dianthus deltoides, for instance in an English-language edition of Camden's Britannia from around 1722:
John Ray says, " I find this to be the same pink which groweth so plentifully by the road side on the sandy hill you ascend going from Lenton to Nottingham." Catalogus Plantarum 2 ed., 1677. p. 57.
Maiden Pink in Nottingham,
sadly not native but an escape from a green roof.
Photo: copyright the author

Ray's correspondence has more (gruesome detail):

These details just emphasise how much has changed: the sandy hill between Lenton & Nottingham is a four-lane road, Derby Road. This Mapillary sequence is roughly in the relevant location.

Indeed the area was long known as Lenton Sands, although this name is, I think, falling into disuse.

I wasn't sure where the gallows were located, but a quick web search shows that they were close to what is now the junction of Forest and Mansfield Roads.

Another evocative historic plant location is Nottingham Castle Rock. This is the type locality for the Nottingham Catchfly, Silene nutans, but it is long since extinct in the area. However, at the foot of the rock is another plant location known since John Ray's time: here Alexanders, Smyrnium olusatrum, still grows in a small patch along Peveril Drive.

Foot of Castle Rock, Nottingham.
The gates are the old entrance gates to The Park Estate.
The green plants behind the left-hand gate are Alexanders.
These little examples are just some of the ones most familiar to me. Throughout the British Isles there are thousands of such places where rare plants grow which have been known for hundreds of years. Occasionally the places get lost, and then by chance, or as the result of diligent research and field work, get relocated (see here for examples from Wales).

It's not just plants either: a friend, David Brown, runs a regular field course in Scotland called Special Spring Moths. He takes the participants to locations which have been known for at least 170 years to see such exotica as the Rannoch Sprawler and the Rannoch Brindled Beauty.

Brachionycha nubeculosa2
A Rannoch Sprawler in Poland, a rather better photo than mine of a Scottish moth.
By Adam Furlepa  CC BY-SA 4.0, via Wikimedia Commons
It should be obvious by now why naturalists are often keenly interested in topography. Rarer species highlight why accurate records and a means of sharing them have been so important to British naturalists for such a long time.

Until the mid-20th century the principal means of record keeping were personal card indexes (David Brown still does things this way). Collation of records over a particular area would be entrusted to a particularly knowledgeable and enthusiastic individual. The ideal was that these records would be periodically consolidated and published, either as a journal article, or, for larger groups, as a book.

In 1964 this changed when the Biological Records Centre was set up and records started to be computerised. Slightly earlier, in 1962, the first Plant Atlas of Britain & Ireland was published. I think this was the first major publication to organise records on the basis of Ordnance Survey grid squares, which had only appeared relatively recently on consumer, rather than military, map products (the New Popular Edition). Since then major atlases have appeared for a number of groups, with birds and plants each having several editions.

For these better known groups, atlas data pertains to surveys carried out in a defined period. For instance the last Bird Atlas surveys ran from 2007-2011, and the next plant atlas recording period runs until 2020. For most insects (the exceptions being Butterflies and Dragonflies) there are just not enough people interested, or with the specialist knowledge, for such an effort. For these groups, all known records, however old, need to be used. There are currently around 100 beetles known from Nottinghamshire, where the last known sighting was pre-1916. But they can be re-found, as was the nationally rare Hazel Pot Beetle, Cryptocephalus coryli, by Trevor & Dilys Pendleton in 2008. So these older records can still be totally relevant today.

Now to return to the long tail phenomenon.

The point about the various BRC Atlas initiatives is that, in addition to all the information about plants and animals, they also form a range of large and valuable datasets for understanding aspects of how people contribute data in this type of undertaking. (Of course they don't answer the WHY?, but the National Biodiversity Network has recently surveyed contributors, and answers there may be of interest to people concerned with similar issues with OpenStreetMap).

I, of course, don't have access to any of these large datasets at a granularity at which one can ask questions about relative frequency of contributions. However, I have contributed to a small niche dataset, that for Tephritid flies (often called Picture-winged Flies, or Gall Flies, but more widely called Fruit Flies; unfortunately in Britain fruit fly usually means Drosophila melanogaster, which belongs to a different family). These are small but distinctive flies, most of which spend their larval phase feeding on fruits of various plants, some galling their hosts. They are common, but not well recorded.

Tephritis bardanae (m)
Tephritis bardanae on Arctium tomentosum, Puchberg-am-Schneeberg, Austria, 2011
Taken the day before SotM-EU 2011.
Laurence Clemons, the co-ordinator of the scheme, is a near contemporary from university. He has run it for over 30 years, combining it with his day job of teaching. From time to time he circulates updated maps for each tephritid species. Fortunately for me he also includes a list of all recorders with the number of records and species submitted. The records go back to the 19th century, so although the data set is relatively small (3 572 observations), it has broad temporal and geographical extent. I have used this list from around 2008-9 to look at recorder contribution distributions.

Terellia tussilaginis (m) : 9037a
A male of Terellia tussilaginis on Arctium minus, Nottingham.
Despite the name, these insects only feed on Burdocks.
A quick scan of the people contributing records shows many entomologists who are specialists in other fields, for instance Raymond Uffen, a lepidopterist with 1 record, and Brian Spooner, former head of Mycology at Kew Gardens, with 2 records. The people with lots of records are well known dipterists: Derek Whiteley (371 records) from the Sheffield area, Steven Falk (865) from Warwickshire, John Coldwell (170) from Barnsley. Quite a few are museum professionals: Bill Ely (161), Jerry Bowdrey (184). Many of the recorders with a small number of records will be general naturalists who happen to come across galls, flies or leaf mines in the course of other surveying work.

A page from Carr (1916) showing early records of Tephritidae in Nottinghamshire.
Not all of these records have found their way to the national scheme.
I don't recognise many names of historical, rather than active, recorders, but two stand out: J.L. Carr (1 record), and J.W. Saunt (209), see short biographies at The Coleopterist. Carr was the author of the formidable Invertebrate Fauna of Nottinghamshire (1916), keeper of the Nottingham Natural History Museum and Professor of Zoology. I know less of Saunt, other than that he contributed many records to Carr's book, but he was apparently a car-body builder in Coventry.

Records / recorder GB & Ireland Tephritid Recording Scheme 2008

I took the list of records from Laurence's PDF and lightly munged the data into a CSV file which I then pulled into R to plot some histograms. (As usual I find I always have to consult a few webpages to even get started again with R after a break of a few weeks).
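For anyone who wants to try the same thing without R, here is a minimal sketch in Python of the load-and-bin step. It is not the script used for the plots in this post. The CSV holds only the handful of record counts quoted in the text, not the full recorder list, and the column names `recorder` and `records` are assumptions.

```python
import csv
import io
from collections import Counter

# Only the record counts quoted in this post; the real list is far longer.
# The column names "recorder" and "records" are assumed for illustration.
SAMPLE_CSV = """recorder,records
Uffen,1
Spooner,2
Carr,1
Whiteley,371
Falk,865
Coldwell,170
Ely,161
Bowdrey,184
Saunt,209
"""

def load_counts(text):
    """Return the number of records submitted by each recorder."""
    reader = csv.DictReader(io.StringIO(text))
    return [int(row["records"]) for row in reader]

def histogram(counts, bin_width):
    """Count recorders per bin; bin k covers k*bin_width+1 .. (k+1)*bin_width records."""
    bins = Counter((n - 1) // bin_width for n in counts)
    return {(k * bin_width + 1, (k + 1) * bin_width): bins[k] for k in sorted(bins)}

counts = load_counts(SAMPLE_CSV)
hist = histogram(counts, bin_width=100)

# Crude text histogram: one '#' per recorder in each non-empty bin.
for (low, high), recorders in hist.items():
    print(f"{low:>4}-{high:<4} {'#' * recorders}")
```

Because the sample is biased towards the names picked out in the text, it will not reproduce the shape of the real plot; it only shows the mechanics.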

Of course, and entirely as expected, the histogram shows an exponential decay of numbers of recorders against records. This is exactly like the graph which Marc created which set me off looking at these numbers. The extreme outlier with nearly 2000 records is, again as might be expected, Laurence Clemons. No-one devotes their leisure time to running a recording scheme unless it is a passion. Also note that 2000 records implies considerably fewer than 100 a year: each record probably represents significant effort.
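The records-per-year figure can be checked with a couple of lines. This is a rough sketch only: it assumes the outlier's records were all made during the 30-odd years the scheme has been running, which the list itself does not confirm.

```python
# Rough bounds taken from the text, not exact figures.
records = 2000  # "nearly 2000 records", so an upper bound
years = 30      # "over 30 years", so a lower bound (assumed recording period)

per_year = records / years
per_week = per_year / 52

print(f"at most about {per_year:.0f} records a year")
print(f"or roughly {per_week:.1f} a week")
```

Both rounding choices push the estimate upwards, so the true rate is, if anything, lower.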

If I change the size of bins and exclude the more extreme outlier values, the graph looks remarkably the same! Below I show the graph with 50 bins for the visible range with an upper limit of 200 and then 50.

The same graph as above, but for recorders with under 200 records

The same graph as above, but for recorders with under 50 records
I'm in the 16 records bin.
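The re-binning used for these two graphs (50 bins over a capped range) can be sketched as follows, again in Python and not the R code behind the plots. The function name `binned_counts` is hypothetical, and the input is only the counts quoted in this post plus the 16 from the caption above, assuming that bin means 16 records.

```python
def binned_counts(counts, upper, n_bins=50):
    """Histogram of recorders with 1..upper records, split into n_bins equal bins."""
    width = upper / n_bins
    bins = [0] * n_bins
    for n in counts:
        if 1 <= n <= upper:
            # Records 1..width fall in bin 0, width+1..2*width in bin 1, and so on.
            index = min(int((n - 1) // width), n_bins - 1)
            bins[index] += 1
    return bins

# Counts quoted in the post, plus the 16 from the caption (an assumption).
quoted = [1, 2, 1, 16, 161, 170, 184, 209, 371, 865]

under_200 = binned_counts(quoted, upper=200)  # each bin is 4 records wide
under_50 = binned_counts(quoted, upper=50)    # each bin is 1 record wide

print(sum(under_200), "recorders shown with a cap of 200")
print(sum(under_50), "recorders shown with a cap of 50")
```

Capping the range simply drops the large contributors from view, so each tighter cap zooms in on the left-hand end of the same distribution.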
This is only one dataset. I'm pretty confident that the same patterns will be seen in other atlas datasets, and, as I stated at the outset, in other citizen science datasets. It's pretty much the same pattern we see when we look at OSM contributions.

For me the main point about this is to emphasise that there is much to learn from the long experience of Biological Recording. I think the idea that somehow we can do magical things to change the shape of the contribution curve is belied by these other data collection experiences. Indeed, the co-ordinator of the fantastically successful National Earthworm Recording scheme wrote recently that "I speak from experience when I say that teaching someone to identify a group does not make them record (and build up the necessary experience to become an expert)". We should therefore accept that the long tail is most likely a reality which we need to recognise and work with.

This does not of course mean that we should cease in efforts to make OpenStreetMap look:
  •  important (for instance as Missing Maps & HOT have done most successfully);
  •  useful (Richard Fairhurst's 'driven by cyclist' idea);
  •  fun (it is, really!);
  •  interesting (that too);
  •  or just providing an excuse to get outside.
OSM is all of these things; it's also educative, a way to meet like-minded folk; a way of learning new skills (mainly in informatics, but also observational). Nor is this pattern a reason not to strive for a more diverse community of contributors. In fact, the very likelihood that mappers "are made not created" tells us that new contributors are as important as ever.

OSM is different in one other important way from biological records: there is a founder effect. 

If I see a bird in one place today and the same species tomorrow, both are useful records. In OSM if something is already mapped one can't contribute it again (although one can alter what already exists). Over time the really easy things get mapped, leaving those that are harder, more tedious or just less useful. This probably doesn't matter too much for the already engaged, but it does possibly put a limit on what a newcomer might feel able to do. However, to date, there is no sign of this happening: see Simon Poole's recent diary post on the subject.

On a more general level, as I said at the outset of this piece, I hope to see some much more detailed and meticulous academic research in the near future. There are now hundreds of different citizen science datasets which can be examined across a range of different domains and types of data acquisition.

1 comment:

  1. An interesting and, as usual, thoughtful and thought provoking post. More contributors would be useful. Every new contributor is potentially a future crazy mapper and even if they just add to the long tail, their edits are still valuable.