What is our error rate?


The Protea Atlas Project has been accused of having a 20% error rate by a prominent taxonomist. However, this figure is just pie in the sky. What are more realistic estimates of our error rates?

Data capture errors:
Detailed statistics on data capture errors are available because two separate programmes monitor "illegal values".
The first is undertaken by Blikbrein using a FORTRAN programme called CAPTURE, which converts the Sight Record Sheet data to a format suitable for the five database files which hold all the data. This provides detailed statistics on errors. The average detected error rate is 0.33 errors per Sight Record Sheet (33%), or 0.11 errors per species (11%). Of these about two thirds are typographical errors (20% due to mis-typing and 80% due to misinterpretation of writing), with most of the mis-typing resulting in a cascade of errors - one extra inserted letter makes the rest of the line nonsense. The remaining third are illegal codes and incomplete data (no subspecies codes, no height codes, incomplete co-ordinates and altitudes, etc.).
The second programme is a dBase routine, SRS_CHK, that checks file integrity, valid codes, problem co-ordinates and relations between files. It is needed primarily to check that the corrections to errors identified by CAPTURE have been correctly made. The detected error rate here is very low - 1 per 500 SRS - and comprises mainly missing data and incorrect citing of Repeat Localities.
At present all known (detected) errors have been corrected, with the exception of six SRS from Zambia which lack co-ordinates and two SRS for which we are awaiting altitudes.
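
The CAPTURE figures above can be broken down per category with a little arithmetic. The following is only a sketch of that calculation in Python, using the rounded rates quoted in the text; the function and variable names are illustrative assumptions and are not part of CAPTURE or SRS_CHK.

```python
# Rounded figures quoted in the text for errors detected by CAPTURE.
ERRORS_PER_SRS = 0.33      # detected errors per Sight Record Sheet
ERRORS_PER_SPECIES = 0.11  # detected errors per species record
TYPO_SHARE = 2 / 3         # "about two thirds" are typographical
MISTYPED_SHARE = 0.20      # share of typographical errors due to mis-typing
MISREAD_SHARE = 0.80       # share due to misinterpretation of writing


def capture_breakdown(errors_per_srs=ERRORS_PER_SRS):
    """Split the detected error rate into the categories named in the text."""
    typographical = errors_per_srs * TYPO_SHARE
    return {
        "mistyped": typographical * MISTYPED_SHARE,
        "misread_handwriting": typographical * MISREAD_SHARE,
        "illegal_or_incomplete": errors_per_srs - typographical,
    }


breakdown = capture_breakdown()
for category, rate in breakdown.items():
    print(f"{category}: {rate:.3f} errors per SRS")

# The two quoted rates together imply the average number of species per sheet.
species_per_srs = ERRORS_PER_SRS / ERRORS_PER_SPECIES
print(f"implied species per SRS: {species_per_srs:.1f}")
```

On these rounded figures, mis-typing accounts for only about 0.04 errors per sheet, and the two quoted rates imply roughly three species per Sight Record Sheet.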

It must be noted that at this stage the electronic data have not yet been checked against the paper Sight Record Sheets, so some errors may still exist. However, we suspect the rate to be very low in the current database - about 1 per 100 SRS for the Protea and Habitat boxes and between 1 and 5 per 100 SRS for the Locality box. This low rate is due to the integrity of the codes, with the only commonly shared code being "N" (for nothing, none). Moreover, these errors will be confined to population, phenology and height codes, and not to species. Similarly, Locality errors in degrees of Latitude or Longitude should be very low (1 per 8 000 SRS), and in minutes fairly low (1 per 1 000 SRS), although the error in decimal minutes may be higher. More details are provided under Co-ordinates below.

Atlasser errors:

Although atlassers may make many errors, most of these will be picked up by the data checking programmes. The exceptions are "allocation errors" and computing errors. We cannot detect when atlassers have used the wrong code (an allocation error). By far the most serious of these are identification errors. But we can check on computing errors. Estimates here are less accurate than above, being based on what I remember having processed. If a numerist is interested in tallying up the corrections, their types and their rates, we should be most grateful.

Computing errors - Co-ordinates and Altitude:
We have a dedicated team of data checkers, comprising David Louw, Peter Ross and Chris Van Vuuren. Together they have checked the mapwork of over one third of all Sight Record Sheets sent in. To do this, a sheet must have all three major fields in the Locality Box (Co-ordinates, Nearest Named Place and Further Details of Locality) adequately filled in, or alternatively the co-ordinates and a copy of a map showing the sites. This is a routine task in which every fifth Sight Record Sheet is checked; when an error is detected, the entire batch for that atlasser is checked by day, month or year, depending on the nature and extent of the error. Each atlasser has their own foibles and our team now routinely inspects data from specific atlassers for personal quirks. However, personal idiosyncrasies can only be detected after about 20 SRS over a fair period, by which stage most atlassers are proficient at map work, having been personally guided by the atlas co-ordinator where necessary (and, if not corrected, demonstrating interesting variations in what can only be termed "New Maths").
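
The sampling routine described above (every fifth sheet, escalating to the whole batch when an error turns up) can be sketched as follows. This is a simplified illustration in Python and not the team's actual procedure: the record layout, the `has_error` callback standing in for the manual map check, and the choice to define a batch as one atlasser's sheets for one day are all assumptions.

```python
from collections import defaultdict


def select_for_checking(sheets, has_error, step=5):
    """Return the ids of sheets that end up being checked.

    sheets: list of dicts with 'id', 'atlasser' and 'date', in order received.
    has_error: callable(sheet) -> bool; stands in for the manual map check.
    A batch is assumed here to be one atlasser's sheets for one day; the text
    says the real scope may be a day, month or year depending on the error.
    """
    batches = defaultdict(list)
    for sheet in sheets:
        batches[(sheet["atlasser"], sheet["date"])].append(sheet["id"])

    checked = set()
    for position, sheet in enumerate(sheets, start=1):
        if position % step == 0:  # every fifth sheet is sampled
            checked.add(sheet["id"])
            if has_error(sheet):  # escalate to the atlasser's whole batch
                checked.update(batches[(sheet["atlasser"], sheet["date"])])
    return checked


# Hypothetical example: ten sheets from two atlassers, with an error on sheet 5.
sheets = [
    {"id": i, "atlasser": "A" if i <= 5 else "B", "date": "day1"}
    for i in range(1, 11)
]
checked = select_for_checking(sheets, has_error=lambda s: s["id"] == 5)
print(sorted(checked))
```

In this example sheets 5 and 10 are sampled, and the error on sheet 5 pulls in the rest of atlasser A's batch.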

In addition, the checking programme CAPTURE locates each site to a broad biogeographical zone and reports any new or "out of range" species records. For the Cape Floral Region, it reports new records (to the atlas, and to the atlasser) to a 12 X 12 km grid cell (See species errors, below). These records are thus automatically flagged for attention by the atlasser and, if required, the co-ordinate checking team.
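
Flagging records that are new to a grid cell might look something like the sketch below. It is written in Python for illustration only (CAPTURE itself is a FORTRAN programme), and the way co-ordinates are mapped to 12 km cells here, a flat-earth approximation with an assumed reference latitude, is a guess and not the project's actual grid. It covers only records new to the atlas, not records new to the atlasser.

```python
import math

KM_PER_DEGREE = 111.0  # approximate length of one degree of latitude
CELL_KM = 12.0         # grid cell size quoted in the text


def grid_cell(lat_deg, lon_deg, ref_lat_deg=-34.0):
    """Map co-ordinates to a (row, col) cell index on a 12 km grid.

    Assumption: a simple flat-earth conversion, with longitude scaled by the
    cosine of a fixed reference latitude. South latitudes are negative.
    """
    y_km = lat_deg * KM_PER_DEGREE
    x_km = lon_deg * KM_PER_DEGREE * math.cos(math.radians(ref_lat_deg))
    return (math.floor(y_km / CELL_KM), math.floor(x_km / CELL_KM))


def flag_new_records(records, known_cells):
    """Return records of a species in a cell where it was not yet recorded.

    records: iterable of (atlasser, species_code, lat_deg, lon_deg).
    known_cells: dict of species_code -> set of cells; updated in place.
    """
    flagged = []
    for atlasser, code, lat, lon in records:
        cell = grid_cell(lat, lon)
        cells = known_cells.setdefault(code, set())
        if cell not in cells:
            flagged.append((atlasser, code, cell))
            cells.add(cell)
    return flagged


# Hypothetical records: two at the same site, one about a degree away.
records = [
    ("A", "Pr cord", -34.0, 18.5),
    ("B", "Pr cord", -34.0, 18.5),
    ("A", "Pr cord", -33.0, 19.5),
]
flagged = flag_new_records(records, known_cells={})
print(flagged)
```

Only the first record in each cell is flagged; the repeat record at the same site passes without comment.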

Similarly, during routine checks, the generation of species lists for specific areas and nature reserves (at the request of atlassers intending to visit them) and the listing of species localities (at the request of atlassers wanting to see certain species in an area), any odd records are first flagged for co-ordinate checks before possible allocation errors are checked.

Detected (and corrected) locality errors are about:
Swapping co-ordinates (E before S): 1 per 400 SRS;
Locality errors at degree resolution: 1 per 200 SRS;
Locality errors at minute resolution: 1 per 60 SRS;
Locality errors at decimal-minute resolution: 1 per 50 SRS;
Distance to nearest place errors: 1 per 80 SRS;
Direction to nearest place errors: 1 per 40 SRS;
Unresolvable errors at minute resolution (due to incomplete Locality data): 1 per 100 SRS;
Altitude errors (total): 1 per 85 SRS;
Altitude errors (feet instead of metres): 1 per 200 SRS;
Errors - GPS: unknown (Locality data inadequate).
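
Two of the error types listed above, swapped co-ordinates and altitudes given in feet, lend themselves to simple automated screening. The Python sketch below shows one way this could be done; it is not how the checking team or SRS_CHK works. The bounding box (very roughly the Cape Floral Region), the 50 m tolerance and the idea of comparing against an altitude read off a map are all assumptions made for illustration.

```python
# Assumed plausible ranges, very roughly the Cape Floral Region.
LAT_RANGE = (30.0, 35.0)  # degrees South
LON_RANGE = (17.0, 28.0)  # degrees East
FEET_TO_METRES = 0.3048


def in_range(value, bounds):
    """True if value lies within the inclusive (low, high) bounds."""
    return bounds[0] <= value <= bounds[1]


def check_swapped(lat_s, lon_e):
    """Classify a co-ordinate pair as 'ok', 'swapped' or 'out of range'."""
    if in_range(lat_s, LAT_RANGE) and in_range(lon_e, LON_RANGE):
        return "ok"
    if in_range(lon_e, LAT_RANGE) and in_range(lat_s, LON_RANGE):
        return "swapped"  # E was probably written before S
    return "out of range"


def looks_like_feet(recorded_alt, map_alt_m, tolerance_m=50.0):
    """True if the recorded value matches the map altitude only as feet."""
    fits_as_metres = abs(recorded_alt - map_alt_m) <= tolerance_m
    fits_as_feet = abs(recorded_alt * FEET_TO_METRES - map_alt_m) <= tolerance_m
    return fits_as_feet and not fits_as_metres


print(check_swapped(34.1, 18.4))
print(check_swapped(18.4, 34.1))
print(looks_like_feet(3000, 914))
```

Because the assumed latitude and longitude ranges do not overlap, a swapped pair can be told apart from a valid one; over a wider area the two ranges would overlap and the test would no longer be decisive.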

The current co-ordinate error in the database is estimated at about 1 per 300, most of these in recently received (and thus still to be checked) data.

Allocation errors - Species:
The incorrect coding of species cannot automatically be distinguished from identification errors. The only clues to an allocation error are records beyond a species' known distributional range. These are routinely reported by CAPTURE and are conspicuous in area reports as relatively poorly sampled species.

In order to resolve some of these possible errors, we "empower" atlassers (who request species lists for areas they intend to visit) with comprehensive lists of species and localities which require checking. We have also sent letters to atlassers with suspect data, asking them for more details. Herbarium specimens are requested for noteworthy localities, gap fillers and range extensions. This is ongoing. The major problems are:

Subspecies codes:
This is not strictly an error, but many atlassers have not recorded subspecies data. Lack of these codes accounted for 1 in 250 species records, but with follow-up this is currently about 1 in 700 species records. Pr caffra and Ld glaberrimum are the worst cases at present.

Planted species:
Planted species are a major problem. The extent of this problem was not appreciated when the Protea Atlas Project was initiated. The Mark 2 version of the Sight Record Sheet contains a new (office) field for planted records. Nature Reserves such as Salmonsdam and Silvermine may contain between 5 and 15% of their species complements as planted or escaped (gone wild following fires). This does not include the formal garden areas (such as at Helderberg), but only occurrences in apparently natural veld. As a consequence, simply using distributional ranges to resolve apparent identification problems requires that planted records first be checked. Where atlassers have noted that the species are planted, or localized near a campsite or gate, these potential errors can readily be resolved. A further clue is that usually several species simultaneously crop up outside of their known distribution ranges.
At present 1 in 84 species records are of planted or escaped species, with an estimate of 1 in 90.

Substitution codes:
The rate of substitution of species codes is unknown, but is considered to be very low (about 1 in 7 000 species records). The major problems are Pr laur and Ld laur, and Ld cord and Pr cord (Ls cord is an invalid code). These are easily picked up on distributional records. The substitution of letters, for example Pr cord by Pr coro, has not been detected to date, and is easily checked. The data capture software available to atlassers was found to substitute Pa cand for Serruria candicans, but this has been fixed.
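
A minimal Python sketch of how substitution candidates could be listed for checking against distribution records is given below. The code list contains only codes mentioned in the text and is far from the project's full list; the function name and its return format are assumptions for illustration.

```python
# Only codes mentioned in the text; the real list is much longer.
VALID_CODES = {"Pr laur", "Ld laur", "Ld cord", "Pr cord", "Pr coro"}


def check_code(code, valid=VALID_CODES):
    """Report whether a code is valid and which valid codes share its epithet.

    Codes sharing the four-letter epithet under another genus prefix are the
    likely substitutions worth checking against distribution records.
    """
    _, _, epithet = code.partition(" ")
    same_epithet = sorted(
        other for other in valid
        if other != code and other.partition(" ")[2] == epithet
    )
    return {"valid": code in valid, "same_epithet": same_epithet}


print(check_code("Ls cord"))
print(check_code("Pr laur"))
```

An invalid code such as Ls cord is caught outright, whereas a valid but substituted code (Pr laur for Ld laur) can only be listed as a candidate and resolved on distribution.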

Substitution species - problem species:
Problems between geographically distinct species and subspecies are easy to check, query and obtain more data on. Overlapping distribution ranges present more of a problem. However, the major intractable problems are taxonomic: Paranomus and Serruria require revisions before certain groups can be resolved.

The following species routinely present identification problems (from worst to not so bad):
Pr lepidocarpodendron and neriifolia (PAN 9.8);
Se acrocarpa-dodii complex (PAN 15.7, 19.5);
Pa bracteolaris & lagopus - Bokkeveld (PAN 18.8);
Se phylicoides, rosea and heterophylla;
Pr aspera and Pr scabra;
Pr lacticolor, punctata, mundii & hybrids (PAN 8.17);
Pr caffra and related species;
Ld coniferum and xanthoconus (PAN 8.5);
Pa esterhuysenae and dregei (PAN 24.9).

Protea neriifolia and hybrids of the White Water Proteas are widely planted within the distribution ranges of related species and require much effort (usually revisits to the sites) to distinguish between them. None of this is helped by the fact that many of the more experienced atlassers routinely identify species outside of flowering and fruiting times.

A further problem is the identification of atypical specimens. Thus Ld uliginosum uliginosum with larger leaves in the Kouga has been identified as Ld loeriense. A previously unrecorded small-flowered form of Pr scorzonerifolia was identified as Pr piscina (PAN 27.7). Pr laurifolia and Pr neriifolia seem to intergrade where they co-occur, although most (99%) of the time there is no problem distinguishing between the two species.

The degree to which atlas data will modify species concepts in the genera is at present unclear, but many problems experienced by amateurs are real and not merely careless mistakes.

Furthermore, some identification problems are seasonal. Out-of-range records of Ls conocarpodendron conocarpodendron are only noted during the period of new leaf growth, when subspecies viridum produces silvery-haired leaves.

Some identification problems can be traced to a particular source. Thus Botanical Society Members of Professor Jackson's A-team routinely atlassed Se glomerata as Se hirsuta. Neither species is illustrated in Mary Maytham Kidd's Cape Peninsula guide, and presumably the error originated in the A-team. This error now only occasionally crops up during the period of new growth when flowerheads are absent.

Statistics on incorrect identifications are difficult to obtain. Many problems are resolved before the Sight Record Sheets are submitted, when atlassers bring in "ecoscraps" for identification. Obvious errors are often resolved before data are captured, simply by pointing out differences between confusing species and requesting confirmation. By carefully vetting the first 20-50 SRS sent in by any atlasser the major identification problems are easily forestalled. About 1 in 400 species records have been changed since capture, mostly as a result of atlassers collecting additional data or having their specimens verified by professionals. Queries on over 400 species records (1 in 200) have been sent to atlassers, but many of these have been verified or confirmed.

Correcting errors, possible errors and verifying new records is an ongoing process. We hope that atlassers are improving all the time, and that this is reflected in the quality of our data.

Tony Rebelo

Back PAN 30