Monthly Archives: February 2015

Polar Plots per site

Today was chaotic and not in a work related way.

Our landlady had workmen come by to replace all of our old wooden window frames with new fancy aluminium ones. As of this writing they are almost finished, after having been working at it for 6 hours. A solid day of work on their part.

Unfortunately all of the banging and foot traffic made it rather difficult to focus on script writing. And I had to stay at the house to make sure nothing grew legs and walked away. I did however manage to write the script I had imagined yesterday. I am now able to specify any site along the coast with a one word command line change and R will produce what I want. All formatted and pretty. So now I just need to produce them and pop them in the prezi before moving on to the second half of the project. Lekker.
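For the curious, the pattern behind a one-word command line switch is roughly this (a minimal sketch only; the file name "seaTemp.csv" and the column names are placeholders, not the actual project files):

```r
# Sketch of a site-selectable plotting script
# Run as: Rscript polarPlots.R "SomeSite"
args <- commandArgs(trailingOnly = TRUE)
site_name <- args[1]

dat <- read.csv("seaTemp.csv") # Placeholder file; assumed to hold a 'site' column
dat <- subset(dat, site == site_name)

pdf(paste0("graph/polar_", site_name, ".pdf"))
# ... polar plot code for the one chosen site goes here ...
dev.off()
```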

Vectors and DPI

Today I got down to the nitty gritty of what prezi is capable of… and it’s not much. Oh, there is certainly a sense of elegance in the presentation format, but there are nowhere near as many options as exist in powerpoint. And yet I feel compelled to use it, if for no other reason than that it is not powerpoint. The zooming features do however allow for a layer of storytelling to be introduced that powerpoint simply cannot match. One must just learn how to move within the constraints of the software in order to make effective use of this advantage.

That being said, I am pretty chuffed with how things are going. I have a concrete plan written out (on paper) of what the first half of the presentation is going to look like. The first quarter is in the system and I need to get going on the busy work for the second quarter. It is going to be great: zooming along the coastline showing polar plots of the four different types of data currently processed (in situ, MUR, terra and pathfinder) next to each other. Rather than edit existing figures, I decided it would be faster to write a script that makes these figures per site. The main problem I have encountered is that the more pdf files you add to a project, the slower the web based interface responds. It can get mildly frustrating…

Anyway, below is the bit of code I wrote that allows one to alter the position of text that will appear on a graph in ggplot2. It’s pretty clumsy, but the Internet was down when I was writing it so I wasn’t able to check for a function that would replace the logic loops I employed instead. The moral of the story is that this works and is fully automated. Also, I haven’t posted any code for a while…


# Logic loop to offset multisite statistics text, fully automated
data <- annual # Rename for ease of use with pre-existing logic structure
stats <- data.frame()
for(i in 1:length(levels(data$site))){
  data2 <- droplevels(subset(data, site == levels(data$site)[i]))
  if(length(levels(as.factor(data2$src))) == 1){
    stats <- rbind(stats, data2)
  } else {
    for(j in 1:length(levels(as.factor(data2$src)))){
      data3 <- subset(data2, src == levels(as.factor(data2$src))[j])
      data3$y <- data3$y + 6 - 3 * j # Shift each source's text down the y axis
      stats <- rbind(stats, data3)
    }
  }
}
rm(data, data2, data3, i, j) # Tidy up the intermediate objects
annual <- stats; rm(stats) # Rename for plotting ease
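The offset y values then feed straight into a text layer so the per-source stats no longer overprint each other. Something like this (the column names x, y, src and label are assumptions about the data frame, not necessarily the real ones):

```r
library(ggplot2)
# Assumed columns: x, y (the offset positions), src and a 'label' text column
ggplot(annual, aes(x = x, y = y, colour = src)) +
  geom_text(aes(label = label)) +
  facet_wrap(~site)
```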


Back to the presentation

Today we had a meet and greet at UWC for the new honours students, so I got to meet the new members of team kelp and hear them talk about what they will be getting up to this year. I then had a chance to talk with my supervisor and he unloaded several new ideas on me for Chapter 2 that are really going to make this chapter meaningful. The end product will effectively give a percentage value to how dependable a time series is in its ability to detect long term change. A large part of this has to do with how much variability is present in the data set, with the coarseness of the thermal resolution being relevant as well. By running a time series analysis to see when, and to what degree, patterns of variation exist, it is then possible to make a time series of anomalies that correctly represents the original series minus the predictable variation. It is this set of anomalies on which a power analysis of trend must be run to determine the power of the data set to avoid a type 2 error, i.e. saying that nothing is happening when indeed something may be, it is just hidden in all of the noise. The mean and time series length are relevant to this analysis as well.
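The anomaly step is simple enough to sketch in base R: estimate the predictable signal (here a monthly climatology, though the real method may be fancier) and keep the residual. The toy series and column names below are made up for illustration:

```r
# Sketch: monthly anomalies from a fake daily temperature series
set.seed(13)
dates <- seq(as.Date("2000-01-01"), as.Date("2009-12-31"), by = "day")
temp <- 15 + 3 * sin(2 * pi * as.numeric(format(dates, "%j")) / 365) +
  rnorm(length(dates), sd = 0.5)
dat <- data.frame(date = dates, temp = temp, month = format(dates, "%m"))

clim <- tapply(dat$temp, dat$month, mean) # Monthly climatology
dat$anom <- dat$temp - clim[dat$month]    # Anomaly = observed minus expected
mean(dat$anom) # Near zero once the predictable seasonal cycle is removed
```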

But before doing all of that, the range, mean, SD and all sorts of other normal stats must be determined in order to run an ordination on the sites to more accurately group them by coast. The current coastal groupings are based on general practice, rather than statistical significance. Anyway, the path is clear and Chapter 2 is looking to be a very exciting process. Until…
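A first pass at that statistical grouping could look something like this in base R (toy numbers, and hierarchical clustering standing in for whichever ordination method we settle on):

```r
# Sketch: grouping sites by their summary stats rather than by convention
# The four sites and their stats are invented for illustration
site_stats <- data.frame(
  row.names = c("A", "B", "C", "D"),
  mean  = c(14, 15, 21, 22),
  sd    = c(2.1, 2.3, 1.2, 1.1),
  range = c(9, 10, 5, 4)
)
d <- dist(scale(site_stats))            # Standardise so no one stat dominates
grouping <- hclust(d, method = "average")
grp <- cutree(grouping, k = 2)          # Two statistically derived "coasts"
grp
```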

I am presenting my PhD summary to the science faculty on March 9th. It must be a 30–40 minute presentation and I want it to be awesome. Luckily for me I already have heaps of great figures available and must do little leg work to correct things. Any little touch ups or minor alterations can quickly be accomplished with image editing software, rather than having to change any code. So that is exciting. Prezi has a “3D” capability in which one can zoom in through photos, which would allow me to do some nice stuff with layers of coastline images that show different types of data depending on the layer. Bathy, bias, temperature. All that good stuff. Anyway, tomorrow is wide open and I hope to have at least a big useful mess by the end of the day, if not something a bit more pulled together.

Chapter 2 – Methods

Today I got started on the rough draft of the methodology section of Chapter 2. In this I mapped out the who-what-where-when-why-how questions and came up with some potential sub-headers to focus everything within the chapter. I also thought of a couple of good figures and one table to include here that already exist in our files, but that need to be updated to v3.4 before being used in a proper publication. These are “figure one” from Smit et al. (2013) in PLoS One and another figure or two showing all of the time series available to us. I am not sure yet whether or not I should address the issue of multisites in this chapter, as v3.4 will largely address any issues that currently exist (e.g. averaging multiple time series at the same site even though the RMSE may be quite high). It will do this by effectively making each time series its own stand-alone point. But that doesn’t mean that it still isn’t worth commenting on how different two points can be when so close to one another…

After the writing I started looking into how I could write a power analysis script in R to begin analyzing our datasets, when I got a bit stumped: the main thing we are looking for is how the accuracy of the instruments used affects the power of the results, and this is not taken into account in a traditional power analysis. So I asked AJ and it turns out I should have been looking into how to do a power analysis for detecting trends. A subtle but important difference. To better help me understand this difference he sent me a paper by Gerrodette (1987) that covers the creation and practical application of this statistical analysis. The test is designed for a single regression and takes into account not only the length and overall change of a time series but also the coefficient of variation (CV), which is the standard deviation of a time series divided by its mean. This allows one to see how the accuracy of the measurements affects the power of the results.

One largely confounding issue is that natural systems vary by more than just one variable, making the direct interpretation of the results a bit cloudy. For example, ocean temperature time series have strong interannual variation, which is to be expected. This variation however could be misinterpreted by the test as sloppy sampling technique, rather than actual change that is happening independent of sampling effort. We are looking for a way to measure the effect of the coarseness of the thermal resolution of the different instruments that are used to measure ocean temperature in our overall dataset. To do this it may be necessary to calculate the test on daily anomaly values rather than the raw data. This would take out a lot of the normal variation and leave only the long term change we are looking for, making the coarseness of the sampling a more prominent factor in the results.

One final thought on this issue is that the test assumes independence of samples, meaning that the high level of autocorrelation in the time series may be an issue. A plan must perhaps be made to address this as well.
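As I understand the paper, for a linear trend with constant CV the core relationship is approximately r²n³ ≥ 12·CV²·(z_α/2 + z_β)². If that recollection is right, it is easy to turn into a rough power calculator (a sketch based on my reading, not a vetted implementation):

```r
# Sketch of Gerrodette's (1987) inequality for a linear trend, constant CV:
#   r^2 * n^3 >= 12 * CV^2 * (z_alpha/2 + z_beta)^2
# solved here for power, given n samples, total fractional change r, and CV
trend_power <- function(n, r, cv, alpha = 0.05) {
  z_beta <- sqrt((r^2 * n^3) / (12 * cv^2)) - qnorm(1 - alpha / 2)
  pnorm(z_beta) # Power = probability of detecting the trend
}
trend_power(n = 30, r = 0.05, cv = 1.0)
```

As expected, power climbs with series length and falls as the CV (noisiness relative to the mean) grows.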

Chapter 2 – Introduction

Today I made an official start on chapter 2 of my dissertation. This chapter is going to investigate the relevance of the differences between the many in situ time series that we are working with. These variables include: length, variance, mean temperature, change in mean over time and the precision (number of significant digits) to which the temperature is actually measured. These differences, and how they affect a time series’ ability to actually detect a significant change in mean ocean temperature, need to be better understood, measured and standardized if the SACTN is to be able to reliably use these in situ data.

This means that I must write an R script that can do power analyses smoothly on all of the time series and create the stats that we would like to know. This will also be done on virtual data created from a modelling script I wrote last year. These virtual data can be controlled so that certain aspects, like length, variance etc., can be individually manipulated in order to better understand which part of a time series has the largest impact on its ability to accurately measure climate change.
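The virtual data idea boils down to something like this (a toy stand-in for the actual modelling script; the function name and parameters are made up):

```r
# Sketch: virtual time series with controllable length, trend and noise
make_series <- function(n_years, trend_per_year, sd_noise, mean_temp = 16) {
  t <- seq_len(n_years * 12) # Monthly steps
  mean_temp + trend_per_year * t / 12 + rnorm(length(t), sd = sd_noise)
}
set.seed(42)
quiet <- make_series(30, 0.02, 0.1) # Low variance: the trend is easy to see
noisy <- make_series(30, 0.02, 2.0) # Same trend, but buried in the noise
```

Holding everything constant except one knob at a time is the whole point: the same trend should come out detectable in one series and not the other.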

I’ve written a rough draft of the introduction for this chapter. Now, while I write up the methodology section, I will write the aforementioned script. Besides power analyses, this script, or perhaps another script (urgg, adding everything to the flow chart really makes me think twice before creating new files, which I suppose is a good thing), must then also devise a method of standardization/appraisal that acts as a metric against which all current and future time series can be compared in order to monitor and evaluate their usefulness. This piece of code could then potentially be added to the “assertr” pipeline and be another quality control step used to filter out data that are not up to scratch.
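For a sense of scale, an assertr-style QC step is only a line or two (the bounds and the column name here are invented for illustration; assert() and within_bounds() are the package’s own functions):

```r
# Sketch of an assertr-style quality control step
# The -2 to 35 bounds and the 'temp' column are made up for illustration
library(assertr)
sst <- data.frame(temp = c(12.3, 14.1, 13.8))
sst_checked <- assert(sst, within_bounds(-2, 35), temp) # Errors if out of bounds
```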

Demersal Regime Shifts

Today I decided to take a break from flow charting. With the graph scripts documented I wanted to catch my digital breath before splitting up and documenting the functions.
So I finally got back to reading research articles, and today I read a paper by Kirkman et al. (2015) that looked at possible regime shifts in demersal fauna at the community and population levels in the Benguela Current Large Marine Ecosystem (BCLME): Angola, Namibia and the west coast of South Africa. Some shifts were detected around the mid 90’s, similar timing to the shift of small pelagics to the Agulhas Bank, but these findings were confounded by the fact that the sampling gear used in the study was changed in the mid 90’s as well, so it was hard to say what the real cause of the change may have been. One possible environmental driver for the regime shift found in several of the species in the study was the large Benguela Niño event in 1995. The authors admitted that sustained fishing pressure may have had some impact on the species in question, but that this likely was not a cause for any of the significant shifts detected.
The authors used three different methods to detect regime shifts: chronological clustering, STARS analysis and change point analysis. None of these methods found regime shifts in the same species in the same years, however; most shifts were detected in the 1994 to 1997 range, with some occurring in the mid 2000’s as well. All in all it is good that someone took the time to do an analysis on this question, even if the findings were not terribly conclusive.
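Of the three, change point analysis is the easiest to try at home. A sketch on a toy series, using the “changepoint” package (the paper’s exact implementation may well differ):

```r
# Sketch: change point detection on a fake series with a shift at t = 51
library(changepoint)
set.seed(7)
x <- c(rnorm(50, mean = 10), rnorm(50, mean = 13))
fit <- cpt.mean(x, method = "AMOC") # "At Most One Change" in the mean
cpts(fit) # Estimated location of the regime shift
```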

Flow Charts Day 6

Victory! For now…

The “graph/*” scripts have all been documented and added to the flow chart. Almost everything that can be upgraded to v3.3 has been, and most scripts that are still v3.2 make more sense if left that way, though v3.3 copies should be made eventually. The final product (excluding the functions, which still need to be documented…) can be found here:

It may look like a mess, but this is a good thing. The functions aren’t terribly important in understanding the flow of the system, so I can now make a more informed decision about how to streamline everything. The next version will look much better.

Flow Charts Day 5

Sometimes things take longer than you would have hoped…

I am now 3/5 of the way through the graph scripts. This is by far the most voluminous folder, weighing in at a total of 30 scripts. Ugghh. And they are mostly all useful, too. Very little redundancy. Many of them are already v3.3 or are easily upgradeable (particularly the satellite scripts), whereas a few make more sense being left as v3.2. These represent an interesting problem, and are perhaps an argument for why v3.2 should be kept alive as the pipeline moves forward. Or perhaps for a special function which would reduce v3.3 files to v3.2 standards, meaning removing multisites and deep sites etc.

I think my eyes will start bleeding if I look at a computer screen for much longer today. So I am stopping now.

Flow Charts Day 4

Moving forward from some work I did over the weekend, I have visualized all of the scripts and files in the “tempssa_v3.0” project (up until “graph”) on prezi. It looks a frightful mess at the moment, but I suppose that is sort of the point. Using brackets, boxes and circles to differentiate between the different types of data, I then use the limited colour palette options in prezi to link the progenitor scripts with their data offspring in other areas of the flow chart. I chose to arrange the boxes in a somewhat circular pattern for now, as it allows for a more clear understanding of which scripts are out of place and which ones flow nicely. If the directions of the arrows are going against the grain or at odd angles, it is immediately clear that something is out of place. This visual flow chart is also a great way to double check the .txt flow chart I have been working on for the past few days.

While going through this undertaking I realized that I couldn’t really call this “flowChart_v3.3”, as so much of the data is still v3.2 and to upgrade everything to v3.3 before completing the flow chart would hinder my progress on the project too much. So I will finish collecting the information for “graph” tomorrow and complete “flowChart_v3.2.5.txt” as well as the prezi representation.

Once this is complete it will be much easier to conceptualize what needs to be changed, and I can potentially jump from v3.2.5 straight to v3.4 in order to save myself some time, as this will prevent a lot of redundant data management. I have started getting some interesting ideas when it comes to streamlining the data and I am starting to really be able to see what a pipeline may look like. The prezi “flowChart_v3.4” is going to be a straight line with nothing out of order. The extra bits: “setupParams”, “stats”, “coords_to_extract” and “func” will be running alongside the pipeline, feeding in as necessary. Running underneath this row of information will be the alternate, optional, satellite pipeline. The whole thing is starting to look more and more beautiful the clearer my idea of how to move forward gets… and the crazier I get from doing it.

Part of making v3.4 a smashing success, is the complete documentation of all of the functions I have made over the last year. I need to document them the same way they are documented in R packages. This makes them more useful, and it will make it easier to wrap all of this up into a package when the time comes.

Anyhoo, a link to the prezi is attached to this blog post for the curious.

Flow Charts Day 3

The quest continues…

You know how sometimes all you need to do is finish one quick thing, and then maybe something small gets rolled in with it, and then by the time you’ve dealt with the other 20 things precipitated by that one unforeseen issue you haven’t made any progress on anything all day… ya…. So that was not super great. Needless to say, I didn’t finish automating all of the “graph/*” scripts today. But I did get through the majority of the “coords_to_extract/*” scripts. So I’ve got that going for me, which is nice.

Also, I added the plugin that allows code pasted to this blog to look pretty. And that is never a bad thing!

HelloWorld <- data.frame("much code", 1:3)

Funnily enough, the R language isn’t directly supported. Rather, “splus” is what the plugin recognizes. So next week the climb up automation mountain begins anew. Though in order to better prepare myself I will be making a visualization of all of the information I have so far assembled. Manually for now, as I don’t have a strong idea of how I would do that via R… Though I’m sure some clever person has figured something out.
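One clever-person option, for the record: a script-dependency flow chart is just a directed graph, which the “igraph” package will happily draw (the script names below are made up for illustration):

```r
# Sketch: drawing a script flow chart in R with igraph
library(igraph)
el <- matrix(c("load.R",  "prep.R",
               "prep.R",  "stats.R",
               "stats.R", "graph.R"), ncol = 2, byrow = TRUE)
g <- graph.edgelist(el, directed = TRUE)
plot(g) # Arrows run from progenitor script to data offspring
```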