
What is our error rate?
The Protea Atlas Project has
been accused of having a 20% error rate by a prominent taxonomist. However, this figure is
just pie in the sky. What are more realistic estimates of our error rates?
Data capture errors:
Detailed statistics on data capture errors are available because two separate programmes
monitor "illegal values".
The first is undertaken by Blikbrein using a FORTRAN programme called CAPTURE, which
converts the Sight Record Sheet data to a format suitable for the five database files
which hold all the data. This provides detailed statistics on errors. The average detected
error rate is 0.33 errors per Sight Record Sheet (33%), or 0.11 errors per species (11%).
Of these about two thirds are typographical errors (20% due to mis-typing and 80% due to
misinterpretation of writing), with most of the mis-typing resulting in a cascade of
errors - one extra inserted letter makes the rest of the line nonsense. The remaining
third are illegal codes and incomplete data (no subspecies codes, no height codes,
incomplete co-ordinates and altitudes, etc.).
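The CAPTURE figures quoted above can be broken down with a little arithmetic. The sketch below is illustrative only: it assumes that the two-thirds/one-third split and the 20%/80% split both apply to the average rate of 0.33 errors per Sight Record Sheet, and that the per-sheet and per-species rates count the same errors. The variable names are hypothetical and are not taken from CAPTURE.

```python
# Illustrative breakdown of the CAPTURE error rates quoted in the text.
errors_per_srs = 0.33        # detected errors per Sight Record Sheet
errors_per_species = 0.11    # detected errors per species record

typing_share = 2 / 3         # "about two thirds" are typing-related errors
mistyped = errors_per_srs * typing_share * 0.20   # 20% due to mis-typing
misread = errors_per_srs * typing_share * 0.80    # 80% due to misread handwriting
illegal_or_incomplete = errors_per_srs * (1 - typing_share)

# Assumption: both rates count the same errors, so their ratio
# estimates the average number of species recorded per sheet.
implied_species_per_srs = errors_per_srs / errors_per_species

print(f"mis-typing:            {mistyped:.3f} per SRS")
print(f"misread handwriting:   {misread:.3f} per SRS")
print(f"illegal or incomplete: {illegal_or_incomplete:.3f} per SRS")
print(f"implied species/SRS:   {implied_species_per_srs:.1f}")
```

Under these assumptions, mis-typing contributes roughly 0.04 errors per sheet, misread handwriting roughly 0.18, and illegal codes and incomplete data roughly 0.11, with about three species recorded per sheet.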
The second programme is a dBase routine, SRS_CHK, that checks file integrity, valid codes,
problem co-ordinates and relations between files. It is needed primarily to check that
corrections to the errors identified by CAPTURE have been correctly made. The detected error rate here is
very low - 1 per 500 SRS, and comprises mainly missing data and incorrect citing of Repeat
Localities.
At present all known (detected) errors have been corrected, with the exception of six SRS
from Zambia which lack co-ordinates and two SRS for which we are awaiting altitudes.
It must be noted that at this stage the electronic data have not yet been checked
against the paper Sight Record Sheets, so some errors may still exist. However, we suspect
these to be very low in the current data base - about 1 per 100 SRS for the Protea and
Habitat boxes and between 1 and 5 per 100 SRS for the Locality box. This low rate is due
to the integrity of the codes, with the only commonly shared code being "N" (for
nothing, none). However, these errors will be confined to population, phenology and height
codes, and not to species. Similarly, Locality errors in degree Latitude or Longitude
should be very low (1 per 8 000 SRS), and in minutes fairly low (1 per
1 000 SRS), although the error in decimal minutes may be higher. More details are
provided under Co-ordinates below.
Atlasser errors:
Although atlassers may make many errors, most of these will be picked up by the data
checking programmes. The exceptions are "allocation errors" and
computing errors. We cannot detect when atlassers have used the wrong code (an allocation
error). By far the most serious of these are identification errors. But we can check on
computing errors. Estimates here are less accurate than above, being based on what I
remember having processed. If a numerist is interested in tallying up the corrections,
their types and their rates, we should be most grateful.
Computing errors - Co-ordinates and Altitude:
We have a dedicated team of data checkers, comprising David Louw, Peter Ross and
Chris Van Vuuren. Together they have checked the mapwork of over one third of all Sight
Record Sheets sent in. To do this they must have all three major fields in the Locality
Box (Co-ordinates, Nearest Named Place and Further Details of Locality) adequately filled
in, or alternatively, the co-ordinates and a copy of a map showing the sites. This is a
routine task in which every fifth Sight Record Sheet is checked and, when an error is
detected, the atlasser's entire batch is checked by day, month or year, depending on
the nature and extent of the error. Each atlasser has their own foibles and our team now
routinely inspects data from specific atlassers for personal quirks. However, personal
idiosyncrasies can only be detected after about 20 SRS over a fair period, by which stage
most atlassers are proficient at map work, having been personally guided by the atlas
co-ordinator, where necessary (and, if not corrected, demonstrating interesting variations
in what can only be termed "New Maths").
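The sampling routine described above (check every fifth sheet, then widen the check to the atlasser's whole batch when an error turns up) can be sketched roughly as follows. This is a hypothetical illustration, not the team's actual procedure: the Sheet record, the has_error callback and the scope parameter are all assumed names, and in practice the choice of day, month or year is a human judgement based on the nature and extent of the error.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sheet:
    """Hypothetical stand-in for one Sight Record Sheet."""
    srs_id: int
    atlasser: str
    visit: date


def sheets_to_check(
    sheets: Iterable[Sheet],
    has_error: Callable[[Sheet], bool],
    step: int = 5,
    scope: str = "day",
) -> List[int]:
    """Return the ids of sheets to check: every `step`-th sheet, plus the
    atlasser's whole batch (by day, month or year) whenever a sampled
    sheet turns out to contain an error."""
    sheets = list(sheets)

    def batch_key(s: Sheet):
        if scope == "day":
            return (s.atlasser, s.visit)
        if scope == "month":
            return (s.atlasser, s.visit.year, s.visit.month)
        if scope == "year":
            return (s.atlasser, s.visit.year)
        raise ValueError(f"unknown scope: {scope}")

    selected = set()
    for index, sheet in enumerate(sheets):
        if index % step != step - 1:      # sample the 5th, 10th, 15th ... sheet
            continue
        selected.add(sheet.srs_id)
        if has_error(sheet):              # escalate to the whole batch
            key = batch_key(sheet)
            selected.update(s.srs_id for s in sheets if batch_key(s) == key)
    return sorted(selected)


# Ten sheets: atlasser A on one day (ids 1-5), atlasser B on the next (ids 6-10).
demo = [Sheet(i, "A" if i <= 5 else "B", date(2000, 1, 1 if i <= 5 else 2))
        for i in range(1, 11)]
to_check = sheets_to_check(demo, has_error=lambda s: s.srs_id == 5)
print(to_check)
```

In the demonstration, sheets 5 and 10 are sampled; sheet 5 contains an error, so all of atlasser A's sheets for that day are added to the list.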
In addition, the checking programme CAPTURE locates each site to a broad
biogeographical zone and reports any new or "out of range" species records. For
the Cape Floral Region, it reports new records (to the atlas, and to the atlasser) to a
12 x 12 km grid cell (see species errors, below). These records are thus automatically
flagged for attention by the atlasser and, if required, the co-ordinate checking team.
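The grid-cell reporting described above might work along the following lines. This is only a sketch under stated assumptions: CAPTURE is a FORTRAN programme whose internals are not described here, the sites are assumed to be already expressed as planar distances in kilometres from some origin (the conversion from latitude and longitude is omitted), and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, int]
Record = Tuple[str, float, float]   # (species code, x in km, y in km)


def grid_cell(x_km: float, y_km: float, size_km: float = 12.0) -> Cell:
    """Map a planar position in km to the index of its 12 x 12 km cell."""
    return (int(x_km // size_km), int(y_km // size_km))


def flag_new_records(
    records: Iterable[Record],
    known: Set[Tuple[str, Cell]],
) -> List[Record]:
    """Return the records whose species has not yet been seen in that cell.

    `known` holds (species code, cell) pairs already in the database.
    A species is flagged only the first time it appears in a new cell.
    """
    seen = set(known)
    flagged = []
    for species, x_km, y_km in records:
        key = (species, grid_cell(x_km, y_km))
        if key not in seen:
            flagged.append((species, x_km, y_km))
            seen.add(key)
    return flagged


# Pr cord is already known from cell (0, 0) only.
known_cells = {("Pr cord", (0, 0))}
incoming = [("Pr cord", 5.0, 5.0), ("Pr cord", 13.0, 5.0), ("Pr cord", 14.0, 6.0)]
flagged = flag_new_records(incoming, known_cells)
print(flagged)
```

Here only the second record is flagged: the first falls in a cell where the species is already known, and the third falls in the same new cell as the second.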
Similarly, during routine checks, the generation of species lists for specific areas
and nature reserves (at the request of atlassers intending to visit them) and the listing
of species localities (at the request of atlassers wanting to see certain species in an
area), any odd records are first flagged for co-ordinate checks before possible allocation
errors are checked.
Detected (and corrected) locality errors are about:
Swapping co-ordinates (E before S): 1 per 400 SRS;
Locality errors at degree resolution: 1 per 200 SRS;
Locality errors at minute resolution: 1 per 60 SRS;
Locality errors at decimal minute resolution: 1 per 50 SRS;
Distance to nearest place errors: 1 per 80 SRS;
Direction to nearest place errors: 1 per 40 SRS;
Unresolvable errors at minute resolution (due to incomplete Locality data): 1 per 100 SRS;
Altitude errors (total): 1 per 85 SRS;
Altitude errors (feet instead of metres): 1 per 200 SRS;
GPS errors: unknown (Locality data inadequate).
The current co-ordinate error rate in the database is estimated at about 1 per 300 SRS, most of
these in recently received (and thus still to be checked) data.
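To compare the detected locality error rates listed above, it can help to convert each "1 per N SRS" figure to errors per 100 sheets. The short sketch below does only that; the labels are abbreviated from the list, and the rates are not added together because the categories overlap (for example, the feet-for-metres errors are part of the altitude total).

```python
# Detected (and corrected) rates from the text, as "1 error per N SRS".
one_per_n_srs = {
    "swapped co-ordinates (E before S)": 400,
    "degree resolution": 200,
    "minute resolution": 60,
    "decimal minute resolution": 50,
    "distance to nearest place": 80,
    "direction to nearest place": 40,
    "unresolvable at minute resolution": 100,
    "altitude (total)": 85,
    "altitude (feet instead of metres)": 200,
}


def per_100_srs(n: int) -> float:
    """Convert '1 error per n sheets' to errors per 100 sheets."""
    return 100.0 / n


# Most frequent error type first (smallest N).
ranked = sorted(one_per_n_srs.items(), key=lambda item: item[1])
for label, n in ranked:
    print(f"{label:<36} {per_100_srs(n):5.2f} per 100 SRS")
```

On these figures, direction to the nearest named place is the most frequent detected error (2.5 per 100 sheets) and swapped co-ordinates the least frequent (0.25 per 100 sheets).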
Allocation errors - Species
The incorrect coding of species cannot automatically be distinguished from
identification errors. The only clues to an allocation error are records beyond a species'
known distributional range. These are routinely reported by CAPTURE and are conspicuous in
area reports as relatively poorly sampled species.
In order to resolve some of these possible errors, we "empower" atlassers
(who request species lists for areas they intend to visit) with comprehensive lists of
species and localities which require checking. We have also sent letters to atlassers with
suspect data, asking them for more details. Herbarium specimens are requested for
note-worthy localities, gap fillers and range extensions. This is ongoing. The major
problems are:
Subspecies codes:
This is not strictly an error, but many atlassers have not recorded subspecies data.
Lack of these codes accounts for 1 in 250 species records, but with follow-up this is
currently about 1 in 700 species records. Pr caffra and Ld glaberrimum are
the worst cases at present.
Planted species:
Planted species are a major problem. The extent of this problem was not appreciated
when the Protea Atlas Project was initiated. The Mark 2 version of the Sight Record Sheet
contains a new (office) field for planted records. Nature Reserves such as
Salmonsdam and Silvermine may contain between 5 and 15% of their species complements as
planted or escaped (gone wild following fires). This does not include the formal garden
areas (such as at Helderberg), but only occurrences in apparently natural veld. As a
consequence simply using distributional ranges to resolve apparent identification problems
requires that planted records must be checked. Where atlassers have noted that the species
are planted, or localized near a campsite or gate, these potential errors can readily be
resolved. A further clue is that usually several species simultaneously crop up outside of
their known distribution ranges.
At present 1 in 84 species records are of planted or escaped species, with an estimate of
1 in 90.
Substitution codes:
The rate of substitution of species codes is unknown, but is considered to be very low
(about 1 in 7 000 species records). The major problems are Pr laur & Ld laur,
and Ld cord & Pr cord (Ls cord is an invalid code). These
are easily picked up on distributional records. The substitution of letters, for example
Pr cord by Pr coro, has not been detected to date, and is easily checked. The
data capture software available to atlassers was found to substitute Pa
cand for Serruria candicans, but this has been fixed.
Substitution species - problem species:
Problems between geographically distinct species and subspecies are easy to check,
query and obtain more data on. Overlapping distribution ranges present more of a problem.
However, the major intractable problems are taxonomic: Paranomus and Serruria
require revisions before certain groups can be resolved.
The following species routinely present identification problems (from worst to not so
bad):
Pr lepidocarpodendron and neriifolia (PAN 9.8);
Se acrocarpa-dodii complex (PAN 15.7, 19.5);
Pa bracteolaris & lagopus - Bokkeveld (PAN 18.8);
Se phylicoides, rosea and heterophylla;
Pr aspera and Pr scabra;
Pr lacticolor, punctata, mundii & hybrids (PAN 8.17);
Pr caffra and related species;
Ld coniferum and xanthoconus (PAN 8.5);
Pa esterhuysenae and dregei (PAN 24.9).
Protea neriifolia and hybrids of the White Water Proteas are widely planted within
the distribution ranges of related species and require much effort (usually revisits to
the sites) to distinguish between them. None of this is helped by the fact that many of
the more experienced atlassers routinely identify species outside of flowering and
fruiting times.
A further problem is the identification of atypical specimens. Thus Ld uliginosum uliginosum
with larger leaves in the Kouga has been identified as Ld loeriense. A
previously unrecorded small-flowered form of Pr scorzonerifolia was identified
as Pr piscina (PAN 27.7). Pr laurifolia and Pr neriifolia
seem to intergrade where they co-occur, although most (99%) of the time there is no problem
distinguishing between the two species.
The degree to which atlas data will modify species concepts in the genera is at present
unclear, but many problems experienced by amateurs are real and not merely careless
mistakes.
Furthermore, some identification problems are seasonal. Out-of-range records of Ls
conocarpodendron conocarpodendron are only noted during the period of new leaf
growth, when subspecies viridum produces silvery-haired leaves.
Some identification problems can be traced to a particular source. Thus Botanical
Society Members of Professor Jackson's A-team routinely atlassed Se glomerata
as Se hirsuta. Neither species is illustrated in Mary Maytham Kidd's Cape
Peninsula guide, and presumably the error originated in the A-team. This error now only
occasionally crops up during the period of new growth when flowerheads are absent.
Statistics on incorrect identifications are difficult to obtain. Many problems are
resolved before the Sight Record Sheets are submitted, when atlassers bring in
"ecoscraps" for identification. Obvious errors are often resolved before data
are captured, simply by pointing out differences between confusing species and requesting
confirmation. By carefully vetting the first 20-50 SRS sent in by any atlasser the major
identification problems are easily forestalled. About 1 in 400 species records have been
changed since capture, mostly as a result of atlassers collecting additional data or
having their specimens verified by professionals. Queries on over 400 species records (1
in 200) have been sent to atlassers, but many of these have been verified or confirmed.
Correcting errors, possible errors and verifying new records is an on-going process. We
hope that atlassers are improving all the time, and that this is reflected in the quality
of our data.
Tony Rebelo