Bayesian Modelling

What is Bayesian modelling and why is the Protea Atlas Project data being modelled? Nigel Forshaw

We are modelling the probability of a species occurring (presence/absence) in any area of the Cape Flora. We can summarize the model as follows:

P(Y_ij = 1) = p_ij = F(A_i, x_i, z_j, r_i, f_j)

The components of this equation are:

P(Y_ij = 1) is the probability that species j is present at place i. Y_ij records whether species j is found at place i: present (1) or absent (0). Adding up these probabilities over all species gives the expected species richness at place i. <The next step will be to model abundance based on Protea Atlas Codes, but initially the models will be tested to see if we can get the distributions correct. If this works well we can do the densities. But even with abundance, we will need a cutoff point to say whether a species is "present" or "absent" when counting species richness at place i.>

F is a function that we will apply to the other data at hand. As far as the model is concerned, we can use any function for the various variables. To get meaningful results, we want functions in which the chance of a species occurring rises or falls as a variable (e.g. rainfall, temperature) increases, or changes when a variable is in a particular state (e.g. geology = sandstone).

A_i is the area of the unit. The model will be applicable over a variety of scales, from a minimum of 1' x 1' (about 1.5 km x 1.7 km) to the trivial case of the entire flora. We are constrained to a minimum of 1' x 1' by the availability of environmental data.

x_i is the vector of areal unit characteristics. These include temperature, rainfall, seasonality of rainfall, slope, altitude, evapotranspiration, geology and land transformation. At the finest scale we have these data at a 1' x 1' cell size, but we will summarize them to coarser scales later.

z_j is the vector of species-level characteristics. These include seed dispersal, pollination, fire survival, plant size, seed production and other such variables. We will be using subspecies as our finest unit here, and we can summarize upwards to subgenus, genus or supergenus at a later stage.

r_i and f_j are random effects for areas and species, respectively. They are what is left over once we have taken the areal and species effects into account: in other words, noise. If they are small we have a good model; if they are large, the model has little predictive ability. We will use two extra means to try to reduce this random noise.
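To make the pieces of the equation concrete, here is a minimal sketch in Python. It assumes one particular choice of F, a logistic curve over a weighted sum, which the text does not prescribe (any function could be used). The names presence_probability, beta, gamma and alpha, and all the numbers, are hypothetical and only for illustration.

```python
import math


def logistic(t):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))


def presence_probability(area, x_i, z_j, r_i, f_j, beta, gamma, alpha=1.0):
    """One ASSUMED form of F: p_ij = logistic(weighted sum of the inputs).

    area  : A_i, the area of unit i (must be positive)
    x_i   : list of areal variables for place i (e.g. scaled rainfall)
    z_j   : list of species-level variables for species j
    r_i   : random effect for place i
    f_j   : random effect for species j
    beta, gamma, alpha : hypothetical weights; in a real analysis these
                         would be estimated from the atlas data
    """
    eta = alpha * math.log(area)                      # larger units, more chance
    eta += sum(b * x for b, x in zip(beta, x_i))      # areal effects
    eta += sum(g * z for g, z in zip(gamma, z_j))     # species effects
    eta += r_i + f_j                                  # leftover noise
    return logistic(eta)


# Illustrative call with made-up values for one place and one species.
p_ij = presence_probability(
    area=2.5, x_i=[0.8, -0.2], z_j=[1.0],
    r_i=0.1, f_j=-0.3, beta=[1.2, 0.5], gamma=[0.4],
)

# Expected species richness at a place: add up p_ij over all species.
species_effects = [-0.3, 0.2, 1.1]  # hypothetical f_j for three species
richness = sum(
    presence_probability(2.5, [0.8, -0.2], [1.0], 0.1, f, [1.2, 0.5], [0.4])
    for f in species_effects
)
```

The logistic curve is used here only because it always returns a value between 0 and 1; it should not be read as the function the project actually fitted.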

For species, we will use the family trees to try and reduce the noise. More closely related species should be more similar in some aspects than less-related species (and less similar in others). This is where Gail comes in with her phylogeny.

For areas, we will model the spatial structure of the Cape Flora, so that a species is more likely to be present in an area close to areas where it already occurs, and less likely to be present in areas further away.
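One simple way to express "nearby areas behave alike" is to let the link between the area effects of two cells fade with the distance between them. The sketch below assumes an exponential fade and a hypothetical range_km setting; neither comes from the project, and the cell coordinates are made up.

```python
import math


def distance_km(a, b):
    """Straight-line distance between two cell centres given as (x_km, y_km)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def spatial_correlation(cells, range_km=10.0):
    """Correlation between the area effects of every pair of cells.

    ASSUMPTION: correlation = exp(-distance / range_km), so it is 1 for a
    cell with itself and shrinks towards 0 for cells far apart.
    """
    return [
        [math.exp(-distance_km(a, b) / range_km) for b in cells]
        for a in cells
    ]


# Three hypothetical cell centres: two neighbours and one 30 km away.
cells = [(0.0, 0.0), (1.5, 0.0), (30.0, 0.0)]
corr = spatial_correlation(cells)
```

In this toy layout the first two cells come out strongly linked and the third only weakly linked to either, which is the behaviour described above.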

Big deal! How does this work and what is Bayesian modelling?

Bayesian probability theory is a branch of mathematics that models uncertainty by combining common sense with observations.

We can model the real world by including every possible thing. But this is unrealistic. If we want to know where Pr cynaroides occurs, it is pointless to try and predict its presence from roads or towns or oil wells. We need to choose variables that are likely to influence its distribution, like rainfall, soil and slope. Choosing too few variables will result in poor predictions. Using too many variables will be too difficult to compute and will contribute little explanation as to where a species occurs.

We also need to know which variables are connected. Thus rainfall and geology determine soil fertility, so both are connected to it. All three affect the presence of Pr cyna (both directly and indirectly, e.g. rainfall directly and via fertility), but thunderstorm frequency and distance to the sea are unlikely to be important. Rainfall and geology are independent of each other, and so we don't have to model a connection between them. Modelling only the important connections increases understanding and speed of computation. There is no point in being able to accurately predict tomorrow's weather from today's data when your model takes two days to run.
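The connections described here can be written as one product of smaller probabilities. The symbols are our own shorthand and not the project's notation: R for rainfall, G for geology, S for soil fertility and Y for the presence of Pr cyna.

```latex
P(R, G, S, Y) = P(R)\, P(G)\, P(S \mid R, G)\, P(Y \mid R, G, S)
```

Because rainfall and geology are treated as independent, each gets a term of its own, P(R) and P(G), with no term linking them. Variables judged unimportant, such as thunderstorm frequency, simply do not appear.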

The last thing we need for our model is the numerical probability of each connection. These "beliefs" can be predictive (e.g. half of all Conebush seeds will be male) or can be based on the evidence to hand (e.g. 95% of records of Pr cyna are from sandstone, 3% from shale).
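As an illustration of how such beliefs are combined, here is Bayes' theorem applied to the record percentages quoted above. Only the 95% and 3% figures come from the text. The other three numbers (how common the species is overall, and how much of the region is sandstone or shale) are invented for the example.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    if not 0.0 < evidence <= 1.0:
        raise ValueError("evidence must be a probability greater than 0")
    return likelihood * prior / evidence


# From the text: share of Pr cyna records on each geology.
p_sandstone_given_present = 0.95
p_shale_given_present = 0.03

# HYPOTHETICAL values, for illustration only.
p_present = 0.20     # chance the species is in a randomly chosen cell
p_sandstone = 0.50   # share of cells on sandstone
p_shale = 0.30       # share of cells on shale

p_present_given_sandstone = posterior(
    p_sandstone_given_present, p_present, p_sandstone
)
p_present_given_shale = posterior(
    p_shale_given_present, p_present, p_shale
)
```

With these made-up inputs the chance of finding the species comes out at about 0.38 in a sandstone cell and about 0.02 in a shale cell. Different inputs would give different answers; the point is only how the pieces fit together.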

The power of Bayesian modelling is that it calculates probabilities for each connection between the variables of interest. Of course, this requires extensive computing, but today's computers are up to the task. (Bayes' theorem, the work of Thomas Bayes, was published in 1763, two years after his death, and was consigned to curiosity because only trivial cases could be calculated until the late 1990s.)

The power of the model becomes evident in the complex example we are using. By holding j (the species, say Pr cyna) constant, we can inspect the probabilities of its occurrence against the environmental variables and see how the areal variables differ for this species. Thus, by way of a fictitious example, on sandstone Pr cyna occurs over most rainfall zones, but on shale only where rainfall is above 800 mm. We can also tease out the importance of temperature, and use this to find out whether global warming will affect the distribution of a species. Again, some fiction: it may be that temperature does not affect Pr cyna on sandstone, but strongly influences its range on shale. We should be able to tease out these variables and thus model the effects of global warming on our proteas.

At present we have modelled the distribution of many species in Kogelberg and expanded this northward to Bainskloof. We still need to include geology, and we need to move to the rest of the biome. But first, we want to include Gail’s data on relatedness between species. We will also include morphological data – which we will also use for an electronic key to proteas.

Watch this space!

Tony Rebelo

Have a look at a talk given by Professor JA Silander on "Spatial patterns of species richness and geographic ranges in African Proteaceae: Bayesian Hierarchical Models" and at his website. Also, have a look at "Latest developments in the Protea Atlas Project".

Back PAN 54