New Function Smoothness for the Pipeline

Today I finally managed to sort something out I have been wanting to solve for months. It is a rather small issue but it was a piece of most of the functions I’ve written that apply statistical analyses in between specific sources at sites with more than one source (e.g. DAFF and DEA each having a UTR at the same location). To do a t-test between these data is a simple matter, and writing a function that does this for each site is a no brainer. But how does one write a logic loop that can accommodate an unlimited number of individual sources per site? For example, if you have only two sources at one site, you only need to run one analysis. But if you have three sources, you need to run three analyses. Four sources, six analyses. Each source has to be compared to each other source and as these sources are independent of each other you must run t-tests (assuming heterogeneity of variance… not always a safe assumption) rather than an ANOVA, which would be simple.

So last year I wrote these big awful functions by hand which made the three comparisons necessary if there were three sources. It worked with anything… but only as long as the number of sources was two or three. This works with our current automation process, but is not “future ready”. So I gave it a good think today and after exhausting all of my more traditional methods of using logic loops etc. I started to think outside of the box. Which made me think of boxes. Matrices are boxes.

This clever little function “combn()”, which is a standard R function, takes an integer vector and creates a matrix of all possible matches depending on how many comparisons are made at once. Two in my case. Once a matrix of all possible integers was created, the next step was to assign each integer value (i.e. 1, 2 or 3) to a source name (i.e. “DAFF”, “SAEON”, or “SAWS”). It is then possible to use a for loop to pair all of the sources by simply going from one column of two integers to the next. The logic does all the work for you!


data <- data[complete.cases(data),]</pre>
stats <- data.frame()
for (i in 1:length(levels(data$site))){
data2 <- droplevels(subset(data, site == levels(data$site)[i]))
w <- as.character(levels(as.factor(data2$src))) # Create index of data source names
m <- combn(length(levels(as.factor(w))), 2)# Create matrix by which to compare all sources
for(j in 1:ncol(m)){
x <- subset(data2, src == w[m[1,j]]) # Keep only one source of data
y <- subset(data2, src == w[m[2,j]])
x <- subset(x, (date %in% y$date)) # Match overlapping times
y <- subset(y, (date %in% x$date))



From there you can start adding whatever paired analyses  you would like. Easy as that, something that is fully “automatic”. By just adding that one line with the “combn()” function I will be able to delete 30 lines of code from each of several useful functions. Schweet.

After getting that sorted I read a brief but important paper from Dufois et al. (2012) in which the authors compared Pathfinder (NOAA AVHRR) v5.0 and v5.2 data against MODIS Terra (NASA) data. MODIS is well known for being more accurate along the coast than the other satellite products and this paper is one of the reasons why. When compared against MODIS (and one in situ buoy at Slangkop) the v5.0 data was found to have a 3-5 degree C warm bias along all of the major Eastern Boundary Upwelling Systems (EBUS). When comparing the v5.2 data this warm bias was reduced to 0-2.5 degrees C, with the Benguela having the warmest bias for both data sets. The authors recommended caution when using satellite products for coastal temperature related projects, and from what I’ve seen I agree. In situ all the way!

One thought on “New Function Smoothness for the Pipeline

  1. ANOVAs and t-tests. One of the assumptions of ANOVAs is (like with t-tests) that the data are independent of each other. This is always going to be the case for our data – they are independent. As soon as comparisons are going to involve more than two sources of data it calls for an ANOVA – doing multiple t-tests is not encouraged as this increases the change of committing a type-1 error (coming up with a false positive). If you really want to apply multiple t-tests, then make sure to do a Bonferroni correction to adjust the probabilities.

    Something else… may I suggest you install a R syntax highlighter here in your WordPress Blog? It’ll make your R code more readable. See here:

    Once things start to work nicely you can think about linking selected posts with R-Bloggers.


Leave a Reply