Monthly Archives: March 2015

Code Renovation

The game plan for today was to get knitr working on my new Ubuntu OS and to write an ordination function for the coastlines.

I assumed from the get-go that getting knitr to install and run on Linux was going to be a monstrous process, so I went straight to the interwebs for help files and walkthroughs. Turns out I was correct. As of this writing I am still plugging away at it. I’ve tried a couple of different approaches, but there is always one little hump that stops me in my tracks.

Same issue with the ordination function. Rather than go straight in and write the thing, I wanted to get all of the new v3.4 scripts up to speed so I could use them with the ordination. I have done so. All “load” and “prep” scripts have now been updated. So that is great; however, the next three days are going to be very diving intensive and it is unclear if I will have the brain power to get back to this code until Friday. I will at least have “knitr” working by then!


EDIT: Got “knitr” working! And learned a couple of new things in the process. “knitr” is pretty 🙂

UTRs

Today was mostly swamped with admin for our first deployment happening tomorrow. Exciting stuff. I have the UTRs (Underwater Temperature Recorders) here at my house and am currently setting them up and calibrating them for their deployment tomorrow. We are aiming to get UTRs deployed at two new sites tomorrow, weather permitting. The wind can really pick up in the middle of the day. While running the setup and calibration of these UTRs I will be practising the presentation I made in Prezi a while back, which I must present to the science faculty at UWC this Thursday.

Besides this admin I read a paper by Dunham et al. (2005) that covers methods for water temperature data management using waterproof data loggers. Their work is based on rivers in the Pacific Northwest of the USA, but their methods apply pretty well to our situation here in SA. They mainly cover four things: 1) study planning; 2) field procedures; 3) data processing; and 4) data archiving. The most interesting part for me was the data processing. All of the metrics they track depend on the maximum warm temperatures detected, how hot those temperatures were, and across what time span they were detected. I am going to write a script that does the same thing, except also for the cold temperatures. That should be interesting.
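
Just to sketch what I have in mind (this is not the Dunham et al. (2005) method itself; the column names and thresholds below are placeholders), assuming daily UTR data in a data frame “utr” with a “date” and a “temp” column:

# A rough first pass at the metrics script, hot and cold versions together
tempMetrics <- function(df, hot = 20, cold = 12){
  data.frame(
    maxTemp = max(df$temp, na.rm = TRUE), # Hottest temperature recorded
    maxDate = df$date[which.max(df$temp)], # When it occurred
    daysAboveHot = sum(df$temp > hot, na.rm = TRUE), # Time span spent above the hot threshold
    minTemp = min(df$temp, na.rm = TRUE), # The cold-water equivalents
    minDate = df$date[which.min(df$temp)],
    daysBelowCold = sum(df$temp < cold, na.rm = TRUE)
  )
}
# tempMetrics(utr)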

Degrees of significance

An interesting problem with R, and pretty much any statistical software, is that when you create a mean from multiple data points, the result will normally have extra digits added to the end to display the full length of the new number. This is bad and I’ll tell you why. The degrees of significance of a number are how many meaningful digits it contains. This is generally taken to be analogous with the accuracy of the data. If many digits are added to the end, someone could incorrectly infer that these new data are much more accurate, when this is not the case. So for example, if one wants to take population estimates of geese in a park, a rough estimate may say 100 geese, whereas a detailed tagging method would show there are actually 82. The first number, 100, has only one degree of significance as the number has been rounded. Even though it is three digits long, only one of the digits is significant. The second number, 82, has two degrees of significance because each digit is relevant. But what do geese have to do with ocean temperature and how is any of this relevant?!

Well, to answer your question reader, not all of our temperature data have the same degrees of significance. For more precise data, as we like to think ours is, the accuracy of individual data points is not judged by the number of relevant digits before the decimal place, but after. So 10.1 is less accurate than 10.101. In fact it has two fewer degrees of significance. This is relevant because many climate change experts believe that the degrees of significance to which a data point is measured have an impact on the overall accuracy and usability of the time series of which it is a part. The rub here is that going back into the literature so far hasn’t revealed any proof of this hypothesis. It seems as though at some point someone started saying it and people just jumped on the bandwagon. But in order to deal with this issue I have created a function that finds the degrees of significance in a set of numbers that are destined to become one mean, and rounds the result to have the same number of degrees of significance as the predominant degrees of significance in the data. Sounds a bit convoluted? Unfortunately nothing like this exists in R, but I have a very strong feeling I must be wrong about that as this should be a pretty standard issue. Anyway, here is the code that I wrote. It could still use some work, but will do the job for now.


meands <- function(data){
  data1 <- data[complete.cases(data)]
  data2 <- data1[grep(pattern = "[.]", x = data1)] # Keep only numbers with a decimal place
  data2 <- as.numeric(sapply(strsplit(as.character(data2), split = "[.]"), "[", 2)) # Select only the digits after the decimal place
  w <- round((length(data1) - length(data2)) / length(data1), 2) # % of sites with no ds after the decimal place
  x <- round(sum(data2 < 10) / length(data1), 2) # % of sites with one ds after the decimal
  y <- round(sum(data2 >= 10 & data2 < 100) / length(data1), 2) # % of sites with two ds after the decimal
  z <- round(sum(data2 >= 100 & data2 < 1000) / length(data1), 2) # % of sites with three ds after the decimal
  total <- w + x + y + z # Sanity check: the proportions should sum to roughly 1
  if(w > x & w > y & w > z){
    data <- round(mean(data, na.rm = T), 0)
  } else if(x > w & x > y & x > z){
    data <- round(mean(data, na.rm = T), 1)
  } else if(y > w & y > x & y > z){
    data <- round(mean(data, na.rm = T), 2)
  } else if(z > w & z > x & z > y){
    data <- round(mean(data, na.rm = T), 3)
  }
  data # Return the rounded mean
}
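
A quick made-up example of the function in action, where most of the values carry one digit after the decimal:

temps <- c(10.1, 10.25, 11.3, 12.4, NA, 13)
meands(temps) # 11.4: the mean (11.41) rounded to the predominant precision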

The preliminary results of the analyses on our different time series show that the most important variable in a time series that allows it to show climate change with any accuracy is not the degrees of significance, but rather the sheer length of the time series. This makes sense if you think about it. A longer time series naturally has less variability and long term trends are easier to pull out. It doesn’t matter if the data are accurately measured or not. We are talking about 40 years of data. What does an extra 0.001 matter in the face of 2 million data points of consistent fluctuations and increases (or decreases in some areas)? It’s all very interesting if you consider that this is a new (though really, it shouldn’t be) concept. But the big DISCLAIMER is of course that these findings are not yet published and it is possible that a more refined approach will be less conclusive. So that is what I am up to currently: reading literature on in situ temperature quality control and looking for the origins of this idea that degrees of significance are desperately important. Stay tuned, things are heating up!

…. pun ….

SAEON

Today in class we covered mangroves. Very interesting plants and not something I have a lot of experience with. After that we moved on to the last topic in the course, ecophysiology: the effect light has on plants and how light is affected by water, and vice versa. Tomorrow we will cover exactly what nutrients are needed and why, as well as how seaweeds go about dealing with being in salt water most of the time.

Besides that I did a bit of coding. Smoothed some stuff out. Even though the statistical analysis I am currently working on produced unexpected results, it appears that I did everything correctly. So the next step is to consult a proper statistician. Going to see who I can track down tomorrow. Also had a chance to talk with AJ about all of the issues I had tabled for further explanation. Chapter 2 is back on track and I’ll be doing a lot of coding tomorrow. But the most exciting development of the day was the meeting we had with the SAEON team. We have finally made contact and I will be sending the first test batch of data their way tomorrow! The lead on the project is going to be busy throughout April but a prototype should be up by mid-March. Sweet.

False Bay Deployment

Today in class we talked about coralline red algae. These come as articulated or crustose: articulated forms look like filamentous algae but with a hard calcium carbonate shell, while crustose forms appear as a lichen growing on rocks, etc. This was the last group of red algae to cover before we moved on to marine flowering plants and how they fit into the picture.

Afterwards I started organizing myself to get ready for the deployment of 18 new UTRs in and around False Bay. It is a very exciting project and will expand our thermal coverage of the bay immensely. I have been networking with people to see what advice they have and what hardware I will need to move forward with this project. It looks like a lot of this can be done with little overhead, which is a good thing. Watch this space for updates on the deployment. I also wrote the following code to create a map of what we intend to do. The positions of the deployments are still rough estimates and will be replaced with the exact coordinates once they are in the water.


### AUTHOR: Robert Schlegel ###
######################################################################
## This script does:
# 1. Loads False Bay new deployment site list;
# 2. Loads and preps GSHHS coastline;
# 3. Plots sites in and around False Bay;
# 4. Saves as "~/Documents/False Bay Deployment/SiteMap.pdf"
######################################################################

######################################################################
## DEPENDS ON:
require(reshape2); require(maptools); require(mapdata); require(maps); require(sp); require(geosphere)
require(PBSmapping); require(mapproj); require(ggplot2); require(gridExtra); require(rgdal)
#"~/R/gshhg-bin-2.3.0/gshhs_f.b"
######################################################################

######################################################################
## USED BY:
# Nothing
######################################################################

######################################################################
## CREATES:
#"~/Documents/False Bay Deployment/SiteMap.pdf"
######################################################################

######################################################################
# Load site list
sites <- read.csv("~/Documents/False Bay Deployment/SiteList.csv", fileEncoding = "UTF-7")
# Ubuntu produces UTF-7 .csv files rather than UTF-8...

######################################################################
# Load coast outline
lat <- c(-34.4, -34); lon <- c(18.3, 18.9) # Changing these values will change how much of the map is shown
sa_shore <- importGSHHS("~/R/gshhg-bin-2.3.0/gshhs_f.b", xlim = lon, ylim = lat, maxLevel = 1, n = 0)
sa_shore <- droplevels(subset(sa_shore, (PID == 1))) # Cuts off islands

######################################################################
# Load Bathymetry
bathy <- as.matrix(read.table("~/tempssa_v3.0/data/bathy/gebco_sa.asc", sep = "", skip = 6)) # Load in bathymetry plot
info <- read.table("~/tempssa_v3.0/data/bathy/gebco_sa.asc", sep = "", nrows = 6) # Load in plot size parameters

xrng <- seq(from = info[3,2], by = info[5,2], length.out = info[1,2]) # Create the longitude coordinate sequence
yrng <- seq(from = info[4,2], by = info[5,2], length.out = info[2,2]) # Create the latitude coordinate sequence
xy <- expand.grid(xrng, yrng)
bathy[bathy == info[6,2]] <- NA # Replaces error values with NA
bathy[bathy > 0] <- NA # Removes any values above the surface of the water
bathy <- t(bathy[nrow(bathy):1,]) # Reverses the row order and transposes so the values match the expand.grid() lon/lat order
bathy <- as.vector(melt(bathy, value.name = "z", na.rm = FALSE)$z)
bathy <- as.data.frame(cbind(xy, bathy))
colnames(bathy) <- c("lon", "lat", "depth"); rm(xy)

######################################################################
## The scale bar and north arrow
# Create parameters for a scale bar:
scales <- c(0, 2500, 5000) # Sets initial distances for scale in metres
scaleBar <- data.frame("DISTANCE" = scales/1000, # Changes the distance to look like km
                       "b.lon" = 18.6, # Set beginning longitude point for scale bar
                       "b.lat" = -34.3, # Set beginning latitude point for scale bar
                       destPoint(p = c(18.6, -34.3), b = 90, d = scales)) # Set start point, direction and length of scale bar
scaleLength <- scaleBar[3,] # The ending point

# Create parameters for a North arrow:
nArrow <- data.frame("b.lon" = scaleBar[3,4],
                     "b.lat" = -34.28,
                     destPoint(p = c(scaleBar[3,4], -34.28), b = 0, d = 2000))

######################################################################
# Build the map
b <- ggplot(sa_shore, aes(x = X, y = Y)) + theme_bw() + coord_equal() +
  geom_path(size = 1) + # Plot coast outline
  geom_point(data = sites, aes(x = lon, y = lat, colour = factor(day),
                               shape = factor(boat), size = factor(installed))) +
  scale_size_discrete(range = c(3,6)) +
  stat_contour(data = bathy, aes(x = lon, y = lat, z = depth, alpha = ..level..), col = "#254C6D", size = 0.1, binwidth = 10) +
  geom_polygon(aes(x = X, y = Y, group = PID),
               fill = "#929292", colour = "black", size = 0.2, show_guide = FALSE) +
  scale_x_continuous(name = "", breaks = seq(18, 18.9, 0.1), expand = c(0, 0), limits = lon) +
  scale_y_continuous(name = "", breaks = seq(-34.4, -34, 0.1), expand = c(0, 0), limits = lat) +
  scale_alpha_continuous(breaks = c(-10, -20, -30, -40, -50, -60, -70, -80, -90, -100),
                         guide = guide_legend(title = "Depth (m)")) +
  guides(colour = guide_legend(title = "day of deployment"),
         shape = guide_legend(title = "boat necessary"),
         size = guide_legend(title = "in the water")) +
  ### Scale bar and N arrow
  # Insert scale bar bottom:
  geom_segment(data = scaleLength, aes(x = b.lon, y = b.lat, xend = lon, yend = b.lat), color = "BLACK", size = 0.3) +
  # Insert scale bar tips:
  geom_segment(data = scaleBar, aes(x = lon, y = b.lat, xend = lon, yend = b.lat + 0.005), color = "BLACK", size = 0.3) +
  # Label distances (in km, matching the 0, 2500 and 5000 m ticks set above):
  annotate("text", label = c("0", "2.5", "5"), x = scaleBar[,4], y = scaleBar[,3] + 0.01, color = "BLACK", size = 4) +
  annotate("text", label = "km", x = scaleBar[3,4] + .01, y = scaleBar[1,3],  color = "BLACK", size = 4) + # Label scale type
  # Insert N arrow:
  geom_segment(data = nArrow, aes(x = b.lon, y = b.lat, xend = lon, yend = lat), arrow = arrow(length = unit(0.2, "cm"))) +
  annotate("text", label = "N", x = nArrow$b.lon + .005, y = nArrow$b.lat + .009, size = 4) + # Label N arrow
  ###
  theme(panel.border = element_blank(), # Removes the panel border
        plot.background = element_blank()) # Makes background transparent

ggsave("~/Documents/False Bay Deployment/SiteMap.pdf", plot = b, width = 16, height = 12, pointsize = 1)
b

# Clean Up...
rm(nArrow); rm(sa_shore); rm(scaleBar); rm(scaleLength); rm(sites); rm(b)
rm(lon); rm(lat);rm(scales); rm(bathy); rm(info); rm(xrng); rm(yrng)

SiteMap (figure: the resulting map of the planned False Bay UTR deployment sites)

Coastal Scientists

In today’s lecture we began covering the reds, Rhodophyta. These algae have three morphological phases in their life cycle. Very interesting. Though it is more like two and a half, as one of the morphological phases is when a spermatium fuses with the haploid female plant to create a carposporophyte. This is diploid and broadcasts out diploid carpospores that become a diploid iteration of the alga.

Did a bit of editing on Chapter 2 before meeting with all of the coastal ecosystem scientists from UWC and UCT (so pretty much the whole Western Cape). We had a roundtable discussion and everyone let everyone else know what they are up to and how what we are all doing fits together. I was reminded that my temperature data are rather important, and it was a good motivator to make sure that the QC on the SACTN is high. After that I started networking on getting a bunch of new UTRs in the water. That needs to happen soon.

All about .dat data

Today in seaweed class we learned about the order Fucales and the Fucoid algae that call that branch of life home. Though they only have one life cycle phase, they tend to be rather complex, more so than the kelps. After covering the basics on this group we discussed diatoms and dinoflagellates. The little guys on the scene.

Whilst on campus I remembered that the DAFF data we have may or may not have extra bits at the beginnings or ends that are not water temperature data, but rather “sitting in the back of a bakkie before being thrown in the water” data. So I borrowed the paper log books for the deployments and manually went through all of our .dat files. Some of the newer stuff hadn’t had these bits cropped out, so it was good that I went through all of it. I also discovered that there is a lot of old Excel data floating around in there as well. These go back to the 70s in some cases and are daily means in a weird Excel layout that may need to be extracted manually… though this may take hours if not days. Probably worth the time to write a script that does this somehow. Though this also means that the time series will now have multiple different degrees of significance within them. Tricky tricky.
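
A very rough sketch of the sort of extraction script I have in mind. The layout here is a guess (a column of day-of-month plus one column per month of daily means), as is the file name; the real spreadsheets will no doubt need their own tweaking:

require(reshape2)
oldExcel <- read.csv("~/Documents/old_excel_example.csv") # Hypothetical export of one of the old sheets
oldLong <- melt(oldExcel, id.vars = "day", variable.name = "month", value.name = "temp") # Wide layout to long format
oldLong <- oldLong[complete.cases(oldLong), ] # Drop the empty cells that pad out the layout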

Otherwise I figured out the issue causing my new tsa() function not to work. For some reason some of the time series didn’t like it when I used a numeric, rather than an integer, for selecting a related piece of data from another data.frame. And this is within a logic loop, so there shouldn’t have been any change in the types of data. I don’t see how it is possible. But hey, it’s fixed now so I’ll just be happy about that and let the mystery stand. Using the functions nestled within this one I can now do a lot of exciting time series stuffs.
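
I can’t say for certain that this is what happened, but here is a made-up example of how a “numeric” index can silently grab the wrong row where an integer would not, since R truncates fractional subscripts:

df <- data.frame(site = c("A", "B", "C"))
i <- 0.3 / 0.1 # Prints as 3, but is actually stored as 2.9999999999999996
df[i, ] # Returns row 2 ("B"), not row 3
df[as.integer(round(i)), ] # Returns row 3 ("C") as intended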

Missing time

In today’s seaweed lecture we covered brown algae. The kelps. The good stuff.

As for time series analysis: a rough draft of an automated function has been written to analyze the individual time series; however, missing data within a time series prevents it from being interpreted correctly as a “time series” object, meaning it cannot be analyzed by stl() or HoltWinters(). Sad face. Linear models can still be made from these data, which is why I haven’t encountered this issue until now. The easiest fix for this is to first convert the data into a zoo object using the “zoo” package. This can then fill in the missing values with na.approx(). Once filled in, the zoo object can be converted to a time series object and analyzed with stl(). It is unlikely that this is the ideal fix but it provides a bridge for this issue and can be replaced in the future.
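
Not our actual data, but on a toy daily series the bridge looks something like this (the dates and values are made up):

require(zoo)
# Four years of fake daily temperatures with randomly missing values
temp <- 15 + 3 * sin(seq(0, 8 * pi, length.out = 1460)) + rnorm(1460, sd = 0.5)
temp[sample(1460, 100)] <- NA
tempZoo <- zoo(temp, order.by = seq(as.Date("2010-01-01"), by = "day", length.out = 1460)) # Convert to a zoo object
tempZoo <- na.approx(tempZoo) # Linearly interpolate the missing values
tempTs <- ts(coredata(tempZoo), frequency = 365) # Convert to a time series object (daily data, annual cycle)
tempStl <- stl(tempTs, s.window = "periodic") # Seasonal/trend/remainder decomposition
plot(tempStl)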

Delving further into the question of missing data, I found that a number of time series have big chunks (months) of “missing” data at the beginning of the time series, before the recording actually begins. Why these are there is a question that will need to be answered, and the load functions may need to be altered so that they cut these extraneous bits off the beginning (and potentially end) of the raw data as they are loaded; however, na.approx() does this automatically, in what must be the happiest coincidence of the week. After getting this to work I needed to parse out the information from the analysis that was relevant to this project. Conveniently, the lists can be uncoupled and resorted as necessary. One must just figure out what the correct calls are to get these data.
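
Continuing the toy example above, the pieces I care about can be pulled out of the stl() output like so:

components <- tempStl$time.series # The decomposition lives in a matrix inside the stl object
seasonal <- components[, "seasonal"] # The annual cycle
trend <- components[, "trend"] # The long-term signal we care about
remainder <- components[, "remainder"] # What is left over (the noise)
deseasoned <- tempTs - seasonal # The series with the seasonal signal removed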

Now that the seasonal trend, variation and noise detection have been automated for every time series, these data need to be used to run a more precise (i.e. with the noise and seasonal signal removed) power analysis of trend. This will then be able to detect if any significant increase or decrease has occurred in temperature. The coefficient of variation of these de-noised data is almost a quarter of that of the original data set, allowing for a much higher power at smaller levels of change. This change can then be computed with lm() and compared to the results of powertrend(). Though I can’t help but feel like I am missing something. This isn’t really supposed to be the main focus of Chapter 2. Or at least it seems like there is more to it. I’ll need to consult the oracle on this one…
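
Sticking with the toy series (so not our actual results), the idea looks roughly like this; the comparison with powertrend() is left out here as that part is still a work in progress:

time_step <- seq_along(deseasoned)
fit <- lm(as.numeric(deseasoned) ~ time_step) # Linear trend fitted to the de-noised data
summary(fit)$coefficients # The slope and its significance
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
cv(as.numeric(tempTs)) # Coefficient of variation of the raw series
cv(as.numeric(deseasoned)) # Much smaller once the seasonal signal is removed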

Functions, so many functions!

It is amazing the lengths to which the R community has gone to support this free software. I did a lot of exploratory data analysis last year and for that I created a lot of my own functions. This was also one of the main ways I taught myself how to program. Now, going back to all of these old scripts, I am revising them with some new knowledge. Reading other tutorials on how to do what I figured out on my own, I see that functions already exist for basically everything I have already done, and these functions are usually more flexible than mine.

For example, the “stats” package that comes with “R” automatically (not to be confused with the separate “tseries” package on CRAN) is designed to do all sorts of things with time series analysis. The first step can be a simple linear smoothing of the data with filter(). After this one must usually turn a number vector into a time series object with ts(), and then things can really open up. plot(stl()) automatically creates a 4 panel figure showing the data, the seasonal variations, the overall trend and the noise. The more common method of finding a fitted model for a time series is fitting it to a linear model with lm(), which doesn’t require the vector to be a time series object. This is the method that I have used in the past as it is good for plotting in ggplot(). This method also produces nice statistics for the significance of the coefficients etc. Though, as per my previous blog, one mustn’t rely too much on that. Another method of time series analysis is HoltWinters(), which is used for exponential smoothing. This makes a much smoother version of the data, though without the data loss you get with filter() at high levels of smoothing. A more sophisticated method of doing this is by creating ARIMA models (autoregressive integrated moving averages). These show you the residuals within your time series as well as the significance of the Ljung-Box statistic depending on the lag used for comparing the time series. Also, I don’t know what any of that means. It was at this point in the chapter that the author stopped explaining what was going on. I’m not sure who his target audience was intended to be, but apparently they are supposed to already know these esoteric analyses. What an ARIMA model is and does was not made clear, and neither were any of the acronyms or test results associated with it. Though with three other good methods of analyzing time series I see little reason to invest more time in learning about this one.
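
To keep them all in one place, here is roughly how those calls look when pointed at a built-in dataset (nottem, monthly air temperatures at Nottingham) rather than our own data:

smoothed <- stats::filter(nottem, rep(1/12, 12), sides = 2) # 12-month moving average
decomp <- stl(nottem, s.window = "periodic") # Seasonal decomposition
plot(decomp) # The 4 panel figure
hw <- HoltWinters(nottem) # Exponential smoothing with trend and seasonality
plot(hw)
fit <- arima(nottem, order = c(1, 0, 0), seasonal = c(1, 0, 0)) # A simple seasonal ARIMA model
Box.test(residuals(fit), lag = 12, type = "Ljung-Box") # The Ljung-Box statistic mentioned above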

I spent the day learning about these statistical analyses (as well as attending a seaweed lecture talking about green and then a bit about brown algae) and writing an R script as I walked through the tutorials on how to do these analyses. For every tut I also wrote code to do the same analysis on the Oudekraal data, which is my new favourite. So the code to do this on the SACTN time series already exists in “R”, and tomorrow I will revive the time series analysis script I have been working on in order to create the code necessary to get Chapter 2 on its feet. That being said, I still need to get more comfortable with the power analysis of trend package, too.

And about that damn fan. Well, it looks like the fan itself is just noisy. I have found some tutorials that show one how to lubricate fans, helping to minimize the sound that they make while spinning. Though the fan in my laptop isn’t really designed for this, so I opted out of that fix. Plus I don’t have any computer lube on hand at the moment. So I did a bit more digging and it turns out that yes, the BIOS is to blame. But that is because it is working correctly (?). Strange but true. The people at Dell have decided with their recent line of Inspiron 15’s that the temperature used to tell the fan if it needs to turn on or not is the skin temperature of the computer, not the CPU etc. And this temperature threshold is set at 50C. The safe max for your CPU etc. is 70C (with 100C being the critical shutdown/overheating point). And the BIOS also tells the fan to run at max speed whenever it is on. This has the effect, when coupled with an already noisy fan, of having the fan always on at max speed, singing me the song of its people. Sure it keeps the computer ice cold. But so what. It’s sitting on my desk, not a baby kitten. The laptop can get a bit warm, as long as the CPU is fine.

Herein Ubuntu shows some of its indie muscle. With a quick Google search I was able to find two different methods of monitoring and controlling the speed and activity of the fan: “sensors” and “i8kmon”. As these are nested in the Ubuntu OS, they override the BIOS when activated and allow one to set the speed of the fan(s) as well as monitor the temperature of the CPU and any other cores present. This changes the fan activation temperature back to 70C for the CPU. The reason why this helps so much is that by reducing the fan speed and the frequency with which it runs, the noise is greatly reduced in volume and duration. And because the fan only runs when the computer is at 70C, the convective cooling the spinning fan provides is greater, as the air to CPU temperature difference is larger than at only 50C. Happy days.