## Decluttering ordination plots in vegan part 2: orditorp()

In the earlier post in this series I looked at the ordilabel() function as a way to tidy up ordination biplots in vegan. An alternative function vegan provides is orditorp(), the last four letters abbreviating the words text or points. That is a pretty good description of what orditorp() does: it draws sample or species labels as text where there is room, and draws a plotting character where there isn't. Essentially it boils down to being a one-stop shop for calls to text() or points() as needed. Let's see how it works… Continue reading
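As a taste of the idea, here is a minimal sketch (assuming the vegan package is installed) using the dune data that ships with vegan; the colours and plotting characters are arbitrary choices for illustration:

```r
## a minimal orditorp() sketch using vegan's built-in dune data
library(vegan)

data(dune)
ord <- rda(dune) # unconstrained PCA of the dune data

## draw an empty plot, then let orditorp() decide: a text label where
## there is room, a plotting character (pch) where labels would overplot
plot(ord, type = "n")
orditorp(ord, display = "species", col = "red", pch = "+")
orditorp(ord, display = "sites", col = "blue", pch = 19, air = 1.2)
```

The air argument controls how much clear space a label must have before text is drawn rather than a point.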

Posted in Plotting, R, vegan | 4 Comments

## Decluttering ordination plots in vegan part 1: ordilabel()

In an earlier post I showed how to customise ordination diagrams produced by our vegan package for R through the use of colours and plotting symbols. In this series of short posts I want to cover some of the options available in vegan that help in producing better, clearer, less cluttered ordination diagrams.

First up we have ordilabel(). Continue reading
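A minimal sketch of the sort of call involved, again using the dune data and assuming vegan is installed (the styling arguments are illustrative, not a recommendation):

```r
## a minimal ordilabel() sketch: labels drawn on opaque boxes so the
## ones plotted last remain readable in crowded regions
library(vegan)

data(dune)
ord <- rda(dune) # unconstrained PCA of the dune data

plot(ord, type = "n")
ordilabel(ord, display = "sites", font = 3, fill = "hotpink", col = "blue")
## priority controls which labels are drawn last (on top); here the
## most abundant species win
ordilabel(ord, display = "species", font = 2, priority = colSums(dune))
```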

Posted in Plotting, R, vegan | 1 Comment

## Shading regions under a curve

Over on the Clastic Detritus blog, Brian Romans posted a nice introduction to plotting in R. At the end of his post, Brian mentioned he would like to colour in areas under the data curve corresponding to particular ranges of grain sizes. The comment area on a blog isn’t really amenable to giving a full answer to the problem posed so I gave a few pointers. Other commenters also suggested solutions.

The problem is how to shade or colour in areas under a curve. The more general problem is how to do this when none of your data points fall exactly on the margins of the regions you wish to shade. Here is my solution to that more general problem. Continue reading
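A sketch of the general approach in base R: use approx() to interpolate the curve at the region boundaries, then pass the boundary and interior points to polygon(). The data and the shadeUnder() helper are invented for illustration; this is the gist, not necessarily the code from the post:

```r
## shade the region from <= x <= to under a curve sampled at arbitrary x,
## interpolating the curve at the region boundaries with approx()
set.seed(1)
x <- seq(0, 10, length.out = 25)
y <- sin(x) + 2

shadeUnder <- function(x, y, from, to, ...) {
    ## curve values at the region edges (which may fall between data points)
    yfrom <- approx(x, y, xout = from)$y
    yto   <- approx(x, y, xout = to)$y
    inside <- x > from & x < to
    ## closed polygon: down to the baseline at each edge
    polygon(c(from, from, x[inside], to, to),
            c(0, yfrom, y[inside], yto, 0), ...)
}

plot(x, y, type = "l", ylim = c(0, max(y)))
shadeUnder(x, y, from = 2.5, to = 6.2, col = "lightblue", border = NA)
lines(x, y) # redraw the curve on top of the shading
```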

Posted in Plotting, R | 5 Comments

## Monotonic deshrinking in weighted averaging models

Weighted averaging regression and calibration is the most widely used method for developing a palaeolimnological transfer function. Such models are used to reconstruct properties of the past lake environment such as pH, total phosphorus, and water temperature with, it has to be said, varying degrees of success and usefulness.

In simple weighted averaging (WA) there is little to specify other than the predictors (the species or other proxy data) and the response (the thing you wish to build a model for and predict). The one user-specified option in a simple WA is the type of deshrinking to use.
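To make the pieces concrete, here is a bare-bones sketch of WA with inverse deshrinking in base R. The spp matrix and env variable are simulated for illustration, and this is a teaching sketch, not the implementation used in any particular package:

```r
## weighted averaging with inverse deshrinking, from first principles
set.seed(42)
spp <- matrix(rpois(20 * 5, 3), nrow = 20) # 20 samples x 5 taxa, made up
env <- runif(20, 4, 9)                     # e.g. lake-water pH

## taxon optima: abundance-weighted mean of env for each taxon
opt <- colSums(spp * env) / colSums(spp)

## raw WA estimates: abundance-weighted mean of the optima for each sample
raw <- rowSums(sweep(spp, 2, opt, "*")) / rowSums(spp)

## inverse deshrinking: regress observed env on the raw estimates and take
## the fitted values as the final, deshrunk, predictions
deshrunk <- fitted(lm(env ~ raw))
```

The repeated averaging pulls the raw estimates towards the mean of env, which is exactly the shrinkage that the deshrinking step undoes.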

## A new version of analogue for a new year

Yesterday I rolled up a new version (0.10-0) of analogue, my R package for analysing palaeoecological data. It is now available from CRAN.

There were lots of incremental changes to Stratiplot() to improve the quality of the stratigraphic diagrams produced and to fix several annoying bugs. The definition of the standard error of MAT reconstructions was also fixed; it is essentially a weighted variance, but the original version assumed the weights summed to 1, which is not the case for the dissimilarities of the k nearest neighbours.

• caterpillarPlot() produces caterpillar plots showing species WA optima and tolerance ranges
• splitSample() is a convenience function for sampling a subset of training set samples whilst ensuring that the entire environmental gradient of interest in the training set is evenly sampled.
• The wa() function received a lot of love in this iteration. The main addition allows non-linear deshrinking of the raw WA estimates alongside the more common inverse and classical deshrinking techniques. The deshrinking is achieved using a cubic regression spline fitted with the gam() function in package mgcv. The spline is constrained to be monotonic so that the deshrunk values increase with increasing raw values. Small-tolerance handling in wa() with tolerance downweighting gained the option to replace small tolerances with the mean of all taxon tolerances.
• logitreg(), which applies logistic regression to the problem of identifying a critical threshold in compositional dissimilarity for MAT models, saw a major update. The returned object was substantially altered to supply more information to the user. fitted() and predict() methods for class "logitreg" were also added; these compute the fitted probabilities for the training-set samples and for new (e.g. fossil) samples respectively. The probabilities relate to how analogous samples are to the groups in the training set (e.g. vegetation biomes in the case of pollen data).
These changes allow an analysis similar in spirit to that of Figure 8 in Gavin et al. (2003, Quaternary Research 60, 356–367), though here logistic regression fits are used rather than the ROC method of that paper.
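As a hedged illustration of the general idea (not logitreg() itself), a logistic regression of analogue status on dissimilarity can be fitted with glm() and a critical dissimilarity read off at a chosen probability; the data here are simulated:

```r
## sketch: choose a dissimilarity threshold via logistic regression
set.seed(7)
diss <- c(runif(50, 0, 0.6), runif(50, 0.4, 1)) # hypothetical dissimilarities
analog <- rep(c(TRUE, FALSE), each = 50)        # do the pairs share a group?

mod <- glm(analog ~ diss, family = binomial)

## dissimilarity at which P(analogue) = 0.9; invert the logit link
thresh <- unname((qlogis(0.9) - coef(mod)[1]) / coef(mod)[2])
```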

I’ll be writing more on these ideas, especially the monotonic deshrinking and the logistic regression approach to dissimilarity threshold choice, in future posts.

Posted in analogue, Palaeoecology, Palaeolimnology, R | 1 Comment

## Processing sample labels using regular expressions in R

I am often found in possession of palaeo core data where the sample identifiers contain a core code or label plus the sample depth. Often these are generated by colleagues using other software who, for one reason or another, don’t want to store the depth information as a separate numeric variable. I also generate such data sets, not because I want to, but because the software supplied with lab equipment (the most recent example being the Thermo Flash EA/Delta V I’ve been running stable N and C isotope measurements on) often records measurements against a single character identifier variable.

The information in these labels is useful and I really don’t want to type out all the depths again. It’s not just because I am lazy; the more times you have to enter data, the more opportunities there are for transcription errors to creep into your work and analysis. So I have things like this:

R> (eg1 <- paste0("CORE", 0:10 + 0.5))
[1] "CORE0.5"  "CORE1.5"  "CORE2.5"  "CORE3.5"  "CORE4.5"  "CORE5.5"
[7] "CORE6.5"  "CORE7.5"  "CORE8.5"  "CORE9.5"  "CORE10.5"
R> (eg2 <- paste0("FOO_", 0:10 + 0.5))
[1] "FOO_0.5"  "FOO_1.5"  "FOO_2.5"  "FOO_3.5"  "FOO_4.5"  "FOO_5.5"
[7] "FOO_6.5"  "FOO_7.5"  "FOO_8.5"  "FOO_9.5"  "FOO_10.5"


What can be done to process these sorts of data with R to extract the useful information?
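One possible approach, not necessarily the one the post settles on, is to split the labels with base R regular expressions via sub(), capturing the alphabetic code and the numeric depth separately:

```r
## split "CORE0.5"-style labels into a core code and a numeric depth
eg <- c("CORE0.5", "CORE10.5", "FOO_0.5", "FOO_10.5")

## group 1 is the leading letters/underscores, group 2 the trailing number
code  <- sub("([A-Za-z_]+)([0-9.]+)", "\\1", eg)
depth <- as.numeric(sub("([A-Za-z_]+)([0-9.]+)", "\\2", eg))

code  # "CORE" "CORE" "FOO_" "FOO_"
depth # 0.5 10.5 0.5 10.5
```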

Posted in Palaeoecology, Palaeolimnology, R | 1 Comment

## What’s wrong with LOESS for palaeo data?

Locally weighted scatterplot smoothing (LOWESS) or local regression (LOESS) is widely used to highlight “signal” in variables from stratigraphic sequences. It is a user-friendly way of fitting a local model that derives its form from the data themselves rather than having to be specified a priori by the user. There are generally two things a user has to specify when using LOESS: $\alpha$, the span or bandwidth of the local window, and $\lambda$, the degree of polynomial used in the local regression. Both control the smoothness of the fitted model, with smaller spans and higher-degree polynomials giving rougher (less smooth) fits. Usually only the span is varied, for expedience.
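For readers unfamiliar with fitting such smoothers in R, a quick sketch with stats::loess() on a simulated stratigraphic-style series shows how the span controls the roughness of the fit:

```r
## two loess fits to the same series, differing only in span
set.seed(3)
depth <- 1:100
proxy <- sin(depth / 10) + rnorm(100, sd = 0.3) # hypothetical proxy series

smooth1 <- loess(proxy ~ depth, span = 0.75, degree = 2) # the default span
smooth2 <- loess(proxy ~ depth, span = 0.25, degree = 2) # rougher fit

plot(depth, proxy)
lines(depth, fitted(smooth1), col = "blue") # smoother
lines(depth, fitted(smooth2), col = "red")  # tracks the data more closely
```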

What could I possibly have against this? What is wrong with using LOESS for palaeo data?

Posted in Palaeoecology, Palaeolimnology, R, Science, Time series | 4 Comments

## Sometimes I am lost for words…

Following on from yesterday’s momentous announcements from the UK Government and RCUK on opening up access to research outputs funded by the British taxpayer, John Wiley & Sons Inc. have announced the Geoscience Data Journal in partnership with the Royal Meteorological Society, the Met Office and NERC.

I first heard about this exciting new initiative via a submitted abstract for the EGU conference this summer. There need to be incentives for scientists to do more than just deposit data in repositories; data papers are a way to document the collection and processing of data, and to acquire citations that indicate to the bean counters that scientists are making worthwhile contributions: having impact, in the common parlance.

So I should be happy, right? Nope, not one bit!

Posted in OpenAccess, Science | 4 Comments

## Quantitative palaeolimnology: my book chapters are finally out!

Today I received confirmation that the delayed fifth volume in the Developments in Palaeoenvironmental Research series has been published. The book is titled Data Handling and Numerical Techniques, though it covers more of the latter and, IMHO, is far more interesting than the dry title would suggest (who gets excited by Data Handling? Well, one or two people perhaps ;-)

A full table of contents can be found on the SpringerLink website, though, in their infinite wisdom, this material is not available as HTML but as embedded previews of the pages, which you can download as PDFs (as you can each chapter, for the appropriate fee).

I authored or co-authored three of the 21 chapters.

[Figure: pruned classification tree fitted to the three-fuel spheroidal carbonaceous particle (SCP) example data. The predicted fuel types for each terminal node are shown, as are the split variables and thresholds that define the prediction rules.]

All three chapters relied heavily upon R, the first two being produced entirely in R. Chapter 19 was written such a long time ago now that it wasn’t all done in R (WA-PLS and maximum likelihood transfer function methods weren’t then available in R, nor were some of the ordination-based methods I used). However, I’m confident the entire thing (at least the acidification parts) could be done in R now.

I have scripts for all the analyses performed using R. Some need a little work before I post them (mainly for Chapter 9) but I aim to maintain up-to-date scripts on my blog. Details soon.

Although writing these chapters took on a life of their own and used up far more time than they should have done, I am genuinely pleased with the results. I certainly learned a huge amount more about the statistics that underlay the techniques that palaeolimnologists and palaeoecologists use day in day out.

To pre-empt requests for PDFs of the chapters: I don’t have any, so don’t ask. I’m not sure what arrangements were originally made with the publisher in that regard. I haven’t even seen the final book yet either; I’m still waiting for my copy to pop through the letter box. If you want a copy and didn’t get in a pre-order at the discount rate, I suspect the best bet would be to pick one up later this summer at the International Paleolimnology Symposium in Glasgow, where I’m sure there will be a conference discount. If I hear of any offers in the meantime I’ll post something here.

## Customising vegan’s ordination plots

As a developer on the vegan package for R, one of the questions I am most frequently asked is how to customise ordination diagrams, usually to colour the sample points according to an external grouping variable. That we get asked this a lot is not really a reflection on the quality of the plot() methods available in vegan, however.

Ordination diagrams are difficult beasts to handle with plotting code without ending up with an unwieldy number of arguments. There are potentially five sets of scores that need to be plotted, so the number of arguments could quickly get out of hand if we allowed the user to pass all the relevant graphical parameters for each set of scores. We’ve provided a basic plot() method that is set up with useful defaults and works for all the ordination methods in vegan. This method is really there to allow quick visualisation of the fitted ordination; I use vegan on an almost daily basis and none of my presentation or publication figures uses the default plot method at all. However, we have provided all the tools needed to draw custom ordination plots within vegan. How you use them also provides a useful guide to building up base-graphics plots from lower-level plotting functions. In this post I intend to show two examples of building up a simple PCA biplot from the basic building blocks available in vegan and R’s base graphics.
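To give a flavour of what’s involved, here is a small sketch (assuming vegan is installed) that extracts the scores with scores() and builds a biplot from an empty canvas; the grouping variable is invented purely for illustration:

```r
## building a PCA biplot from lower-level pieces
library(vegan)

data(dune)
ord <- rda(dune) # PCA via rda() without constraints

## extract the two sets of scores we want to plot
site_scores <- scores(ord, display = "sites", scaling = 2)
spp_scores  <- scores(ord, display = "species", scaling = 2)

## hypothetical grouping variable to colour the samples by
grp <- factor(rep(c("A", "B"), length.out = nrow(dune)))

## start from an empty canvas with the correct axes and limits,
## then add each layer ourselves
plot(ord, type = "n", scaling = 2)
points(site_scores, pch = 19, col = c("red", "blue")[grp])
text(spp_scores, labels = rownames(spp_scores), col = "darkgreen", cex = 0.7)
legend("topright", legend = levels(grp), pch = 19, col = c("red", "blue"))
```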

Posted in R, vegan | 45 Comments