Coastal transects


This week I have expanded the coastR package with the inclusion of a function that calculates the angle of the heading for alongshore or shore-normal transects. The rest of this blog post is the vignette that I’ve written detailing the set of this function. Next week I’ll likely be taking a break from coastR development to finally create a package for the SACTN dataset. That is a project that has been in the works for a loooong time and it will be good to finally see a development release available to the public.


There are a number of reasons why one would want to calculate transects along or away from a coastline. Examples include: finding the fetch across an embayment, finding the coordinates of a point 200 km from the coast, finding the appropriate series of SST pixels along/away from the coast, (or if one is feeling particular feisty) the creation of shape files for a given area away from the coast. The function that we will be introducing here does none of these things. What the transects() function does do is calculate the angle of the heading along or away from the coast against true North, which is then the basis for all of the other fancy things one may want to do. Baby steps people. Baby steps.

Sample locations

For this vignette we will re-use the same coastlines as those created for the sequential sites vignette. The ordering of the sites remains jumbled up to demonstrate that transects() does not require orderly data. Should one want to order ones site list before calculating transect headings it is possible to do so with seq_sites(). This is of course a recommended step in any workflow.


With our site lists created we now want to see what the correct headings for alongshore and shore-normal transects are for our sites. We will also demonstrate what happens when we increase the spread used in the calculation and also how the inclusion of island masks affects the angle of the headings.


Now that the correct headings have been calculated for our alongshore and shore-normal transects let’s visualise them with ggplot. First we will create a function that does this in order to keep the length of this vignette down.

Cape Point, South Africa

The transect() function is designed to work well at small scales by default. We may see this here with the effortlessness of plotting transects around a peninsula and then across an embayment in one go.


Figure 1: Alongshore and shore-normal transects around Cape Point and False Bay.


South Africa

The intentions one may have for calculating shore-normal transects will differ depending on ones research question. If one is interested in visualising the convolutions of a coastline at a sub-meso-scale then the default spread of the transect() function is probably the way to go, as shown above. If however one is interested in seeing the shore-normal transects broadly for the coastline of an entire country it is likely that one will want to greatly expand the spread of coastline used to calculate said transects. In the figure below we may see how changing the spread of the coastline considered for the transects changes the results. The top row shows the transects resulting from the narrow default spread, while the bottom row shows the results of using a much wider spread for the calculation. Note particularly how the transect changes at St. Helena Bay and Gansbaai (second and fourth sites from the top left), as well as a general smoothing of all of the other transects. This is due to the sensitivity of the function. The St. Helena Bay and Gansbaai sites lay within embayments; therefore, the shore-normal transects that would come out directly from these sites will not follow the general contour of the coastline of South Africa. Should we be interested in the “bigger picture” we must increase the spread argument in transects(). This may require some trial and error for particularly difficult coastlines before a satisfactory result is produced, but it is certainly still faster than running the calculations by hand. Should small scale accuracy along part of the coast, and broader accuracy elsewhere be required, one must simply divide the site list into the different sections and run transects() on each subset with the desired spread.


Figure 2: Alongshore and shore-normal transects around all of South Africa.


Baja Peninsula, Mexico

In the following figure we see how the inclusion of islands affects the results of our transects. The first site up from the tip of the peninsula on the left-hand side is on an island. Note the minor adjustment to the transect when the island mask is used for the calculation. In this case it’s not large, but in other instances it may be massive. By default island masks are removed and it is our advice that they not be used unless extreme caution is observed.


Figure 3: Alongshore and shore-normal transects around the Baja Peninsula.


Bohai Sea, China

This figure serves as a good visualisation for just how localised the coastline is that is used to calculate the shore-normal transects. Note how the alongshore transects look a little dodgy, but when shown as shore-normal transects everything works out. This is something to consider if one is interested in calculating alongshore transects rather than shore-normal transects. For alongshore transects that show more fidelity for coastal direction it is advisable to increase the spread argument.


Figure 4: Alongshore and shore-normal transects within the Bohai Sea.



As we may see in the previous example figures, the transect() function tends to work better by default at smaller scales. This was an intentional decision as it is much more accurate when scaling the function up for larger coastal features than when scaling it down for smaller ones.

The calculation of the heading for alongshore and shore-normal transects is rarely the end goal itself. One then generally wants to find specific points from the coastline along the transects that have been determined. This is done in the code above within the plot_sites() function created within this vignette, but the process is not detailed specifically. How to do more elaborate things with transects will be explained with the following functions to be added to coastR. This will include how to draw coastal polygons based on distance and bathymetry.

Sequential coastal sites


The rest of the blog post after this preface section is a copy of the vignette I’ve written for the first function in the new package I am developing: coastR. This package aims to provide functions that are useful for coastal oceanography but that do not yet exist in the R language. It is not my intention to provide algorithms for physical oceanography as these may already be found elsewhere. This post covers how one may determine the correct sequence of sites along a convoluted coastline.

Now that I’ve handed in my PhD I am a little less pressed as far as deadlines go and I would like to return to my habit of creating a new blog post every Friday. I’ve written quite a bit of code over the last three years and much of it needs to find it’s way into the coastR. Next week I am planning on uploading a function for calculating shore normal transects. Until then, please enjoy the spectacle of sequential ordering.


The human mind prefers to see patterns in whatever it encounters. To this end we try to provide ourselves with data that are stored in a way that will appeal to that disposition. For a time series this usually means that the data are saved sequentially through time. For spatial data this means that the data are saved in some sort of sequential state, too. But what might that be? For 2D, 3D, or 4D data this can get tricky rather quickly and one tends to default to netcdf files. But with 1D data we are able to decide how we want the data to be structured as they will fit within a simple dataframe. But how can spatial data be 1D? Nothing in nature truly is, but I use 1D here as an expedient way of describing data that are constrained to some physical (usually continuous) barrier. Specifically for use with seq_sites() we will be looking at sites along a coastline.

If one has meta-data for a number of sampling sites they should be saved in the order they may be found along the coastline. Some would perhaps prefer to order sites alphabetically, I am not one of them for a number of reasons. Not least of which being that this is too simple a way of organising. One could also choose to organise ones coastal sites in numerical order of either longitude or latitude. This quickly becomes problematic for most stretches of coastline as natural formations such as peninsulas and embayments will prevent the correct ordering of sites based on only latitude or longitude. It is therefore necessary to query the longitude and latitude of each site in a list against a land mask in order to determine the correct order along the coastline. This is what will be demonstrated below.

Sample locations

For the purpose of this vignette we will manually create a few dataframes for different coastlines around the world of varying degrees of complexity and length. The first two dataframes are taken from the SACTN site list included in the coastR package. The rest have their lon/lat values grabbed from Google maps. Note that the order of the sites is intentionally entered incorrectly so as to be able to demonstrate the efficacy of seq_sites().

Sequential sites

Now that we have our sample sites it is time to order them correctly along the coast sequentially. Should one prefer the opposite order to what seq_sites() produces, this may be changed by using the reverse argument found within the function. Additionally, if one has sites located on islands just off the coast, one may choose to allow the algorithm to take these islands into account. Note that this then will force the algorithm to calculate the sequential order of these sites as though they were part of a different sequence because they will no longer be on the same 1D plain. Generally this would not be desirable and one would rather order sites on coastal islands in line with the rest of the coast. This is the default setting, but we may see how this changes with the Baja Peninsula site list.


With the sites correctly ordered sequentially along the coast we may now compare the before and after products. To do so in a tidy way we will first create a function that plots our sites for us on a global map.

Cape Point, South Africa


Figure 1: Comparison of ordering of sites around Cape Point, South Africa.


South Africa


Figure 2: Comparison of ordering of sites around South Africa.


Baja Peninsula, Mexico

Note in the image below that site seven in the ‘Islands’ panel appears to be ordered incorrectly. This is because we have asked the function to first look for sites along the coast, and then order sites around nearby islands by setting the argument coast to TRUE. This is because the algorithm only works on one continuous line. When islands are introduced this then represents a second set of 1D coordinates and so the algorithm plans accordingly. This feature has been added so that if one chooses to have islands be apart from the initial ordering of the coastal sites it may be done. The default however is to remove islands from the coastal land mask so that they are ordered according to their nearest location to the coast.


Figure 3: Comparison of ordering of sites around the Baha Peninsula, Mexico.


Bohai Sea, China

Below in the ‘Sequential’ panel we see the result of having set the reverse argument to TRUE. Hardly noticeable, but potentially useful.


Figure 4: Comparison of ordering of sites around the Bohai Sea, China.



The usefulness of the seq_sites() function is demonstrated above on a number of different scales and coastal features. This is in no way an exhaustive test of this function and I welcome any input from anyone that uses it for their own work. The premise on which this function operates is very basic and so theoretically it should be very adaptive. The only thing to look out for is if one has a very convoluted coastline with a long stretch without any sites the algorithm may think this is two separate coastlines.

Polar Plot Climatologies


Whilst cruising about on Imgur I found a post about science stuff. Not uncommon, which is nice. These sorts of grab-bag posts about nothing in particular often include some mention of climate science, almost exclusively some sort of clever visualisation of a warming planet. That seems to be what people are most interested in. I’m not complaining though, it keeps me employed. The aforementioned post caught my attention more than usual because it included a GIF, and not just a static picture of some sort of blue thing that is becoming alarmingly red (that was not meant to be a political metaphor). I’m referring to the now famous GIF by climate scientist Ed Hawkins (@ed_hawkins) whose blog may be found here, and the specific post in question here. A quick bit of research on this animation revealed that it has likely been viewed by millions of people, was featured in the opening ceremony of the Rio Olympics, and was created in MATLAB. Those three key points made me decide to do a post on how to re-create this exact figure in R via a bit of reverse engineering. The original GIF in question is below.


Figure 1: The ever broadening spiral of global temperatures created by Ed Hawkins.



Figure 1 above uses the global mean temperature anomalies taken from HadCRUT4. These data have an impressive range of collection, going back to 1850. Very few datasets match this length of collection, and I’m not going to attempt to do so here. What I am going to do is use the data that I work with on a daily basis. These are the SACTN data that may also be downloaded here via a GUI. As a coastal oceanographer I am mostly interested in changing climates in the near shore. While not publish explicitly, a paper about the appropriate methodology one should use does exist, and this methodology has been applied to all of the time series in the SACTN dataset accordingly. It is therefore known what the rates of decadal change along the coast of South Africa are, and we may rely on this in order to cherry pick the more dramatic time series in order to make prettier visuals.


With our end goal established (Figure 1), and our dataset chosen (SACTN), we may now get busy with the actual code necessary. As one may have inferred from the title of this post, Figure 1 is what we call a “polar plot”. This may appear complex to some, but is actually a very simple visualisation, as we shall see below. But first we need to prep our data. For consistency in the creation of the anomaly values below I will use 1981 — 2010 for the climatology of each time series.

With our data prepared we may now create the series of functions that will make a spiraling polar plot of temperatures for any time series we feed into it. I prefer to use the animation package to create animations in R. This requires that one also installs image magick beforehand. This is a free software that is available for all major operating systems. There are a few ways to create animations in R, but I won’t go into that now. The method I employ to create the animations below may seem odd at first, but as far as I have seen it is the most efficient way to do so. The philosophy employed here is that we want to have one final function that simply counts forward one step at a time, creating each frame of the GIF. This function calls on other functions that are calculating the necessary stats and creating the visuals from them in the background. By creating animations in this way, our up front prep and calculation time is almost non-existent. It does mean that the animations take longer to compile, but they are also much more dynamic and we may feed any number of different dataframes into them to get different outputs. I have found over the years that the more automated ones code can be the better.

With the above two functions created, we may now call them nested within one another via the saveGIF function below.


As one may see in the following GIFs, local extremes often outpace global averages. This should not be terribly surprising. In order to better illustrate this I have expanded the anomaly labels along the y-axes more so than seen in Figure 1. The increasing patterns are not as clear in these following GIFs as in the original that they are based on. This is because the original is based on a global average, which provides for a much smoother trend. I hope people enjoy these and feel free to plop your own temperature time series into the code to create your own polar plot figures!


Figure 2: The polar plot for Port Nolloth, where temperatures have been increasing.


Figure 3: The polar plot for Sea Point, where temperatures have been decreasing.


Figure 4: The polar plot for Kent Bay, where temperatures have been holding level.

Visa Fun

Hello avid readers. Unfortunately I do not have a post ready for this Monday as I’ve had to spend the whole day on visa application issues. I recently discovered that a Post Doc visa does not fall under the umbrella of the study visa, as I had been led to believe. Instead one must acquire a Scarce Skills Visa, which is a type of proper work visa. This then requires quite a bit more paperwork and several more steps than the renewal of a study visa.

I can happily say however that the systems in place for visa facilitation have definitely improved since my last visa application in 2014. Not just in SA either. More and more things are able to be completed without the need for couriers. This is always great as mailing things internationally is always the slowest least certain part of the compiling of the supporting documents.

I’ll see if I can’t get a post up later this week with some pretty environmental figures.

Mapping with ggplot2


There are many different things that require scientists to use programming languages (like R). Far too many to count here. There is however one common use amongst almost all environmental scientists: mapping. Almost every report, research project or paper will have need to refer to a study area. This is almost always “Figure 1”. To this end, whenever I teach R, or run workshops on it, one of the questions I am always prepared for is how to create a map of a particular area. Being a happy convert to the tidyverse I only teach the graphics of ggplot2. I have found that people often prefer to use the ggmap extension to create ggplot quality figures with Google map backgrounds, but I personally think that a more traditional monotone background for maps looks more professional. What I’ve decided to showcase this week is the data and code required to create a publication quality map. Indeed, the following code will create the aforementioned obligatory “Figure 1” in a paper I am currently preparing for submission.


There are heaps of packages etc. that one may use to create maps. And there is a never ending source of blogs, books and tutorials that illustrate many of the different ways to visualise spatial data. For my international and geographic borders I prefer to use data I’ve downloaded from GSHHSG and then converted to dataframes using functions found in the PBSmapping package. I then save these converted dataframes as .Rdata objects on my computer for ease of use with all of my projects. For the domestic borders of a country, which I won’t use in this post, one may go here. Note however that for some strange reason this website still has the pre-1994 borders for South Africa. For the correct SA borders one must go here. The current SA borders may actually be download in the .Rdata format, which is neat.

Once one has the borders to be used in the map, the next step is to think about what one actually wants to show. The main purpose of this map is to show where several in situ coastal seawater temperature time series were collected. This could be done quite simply but a plain black and white map is offensively boring so we want to make sure there is a good amount of (but not too much!) colour in order to entice the reader. I personally find pictures of meso-scale oceanic phenomena particularly beautiful so try to include them whenever I can. Luckily that is also what I study so it is not strange that I include such things in my work. Now if only I studied panda’s, too…

Panda’s aside, the current work I am engaged in also requires that the atmospheric processes around southern Africa be considered in addition to the oceanography. To visualise both air and sea concurrently would be a mess so we will want to create separate panels for each. Because I have been working with reanalysis data lately, and not satellite data, I am also able to include the wind/ current vectors in order to really help the temperature patterns pop. The oceanic data are from the BRAN2016 product and the atmospheric data are from ERA-Interim. Both of which are available for download for free for scientific pursuits. I’ve chosen here to use the mean values for January 1st as the summer months provide the most clear example of the thermal differences between the Agulhas and Benguela currents. The code used to create the scale bar in the maps may be found here. It’s not a proper ggplot geom function but works well enough. I’ve also decided to add the 200 m isobath to the sea panel. These data come from NOAA.


I find that it is easier to keep track of the different aspects of a map when they are stored as different dataframes. One should however avoid having too many loose dataframes running about in the global environment. It is a balancing act and requires one to find a happy middle ground. Here I am going to cut the all_jan1_0.5 dataframe into 4. One each for air and sea temperatures and vectors. I am also going to reduce the resolution of the wind so that the vectors will plot more nicely.

With just a few alterations to our nicely divided up dataframes we are ready to create a map. We will look at the code required to create each map and then put it all together in the end.

First up is the most busy. The following code chunk will create the top panel of our map, the sea state. It is necessary to label all of the locations mentioned in the text and so they are thrown on here. In order to make the site label easier to read I’ve made them red. This is particularly jarring but I think I like it.

Many of the sites that need to be plotted are laying on top of each other. This is never good, but is made worse when the sites in question are refereed to frequently in the text. For this reason we need to create a little panel inside of the larger figure that shows a zoomed in picture of False Bay. Complete with text labels.

We could possibly create another inset panel for the clomp of sites around Hamburg but this figure is already getting too busy. So we’ll leave it for now. One inset panel will serve to illustrate the code necessary to create a faceted map so for the purposes of this post it will also suffice. That leaves us with only the bottom panel to create. The air state. I’ve decided to put the scale bar/ North arrow on this panel in an attempt to balance the amount of information in each panel.

With our three pieces of the map complete, it is time to stick them together. There are many ways to do this but I have recently found that using annotation_custom allows one to stick any sort of ggplot like object onto any other sort of ggplot object. This is an exciting development and opens up a lot of doors for some pretty creative stuff. Here I will just use it to demonstrate simple faceting, but combined with panel gridding. Really though the sky is the limit.


The developments in the gridding system have brought the potential for using ggplot for these more complex maps forward quite a bit. As long as one does not use a constrained mapping coordinate system (i.e. coord_fixed) the grob-ification of the ggplot objects seems to allow the placing of the pieces into a common area to be performed smoothly. Displaying many different bits of information cleanly is always a challenge. This figure is particularly busy, out of necessity. I think it turned out very nicely though.


Figure 1: Map showing the southern tip of the African continent. The top panel shows the typical sea surface temperature and surface currents on January 1st. The bottom panel likewise shows the typical surface air temperatures and winds on any given January 1st.

Goats per Capita


A few weeks ago for a post about the relationship between gender equality and GDP/ capita I found a nifty website that has a massive amount of census information for most countries on our planet. Much of this information could be used to answer some very interesting and/ or important questions. But some of the data can be used to answer seemingly pointless questions. And that’s what I intend to do this week. Specifically, which countries in the world have the highest rates of goats/ capita?


The goats per capita data were downloaded from the clia-infra website. These data are already in the format we need so there is little to be done before jumping straight into the analysis. We will however remove any records from before 1900 as these are almost entirely estimates, and not real records.


First of all, I would like to know what the global trend in goats/ capita has been since 1900. To do so we need to create annual averages and apply a simple linear model to them. We will also plot boxplots to give us an idea of the spread of goats/ capita over the world.


Figure 1: Boxplots with a fitted linear model showing the global trend in goats/ capita over the last century.


As we may see in Figure 1, the overall trend in goats/ capita in the world has been decreasing very slightly over the last century. The striking result from Figure 1 however is the massive range of values as seen by the outliers from the boxplots. So which countries are these that have so many more goats/ capita than the rest of the world?

We want to see which countries have the most goats/ capita but there are 172 unique countries in this dataset so it would look much too busy to plot them all. To that end we want only the top and bottom 10 countries from the most recent year of reporting (2010).


Figure 2: Line graphs showing the rate of goats/ capita for the top and bottom 10 goat having countries in the world over the last century.



I was a bit surprised to find that Mongolia is far and away the country with the most goats/ capita at 5.140 in 2010. Less surprising is that the other top 9 goat having countries in the world in 2010 were all in Africa and their rate of goats/ capita was between 1.634 (Mauritania) to 0.124 (South Africa). This makes for a massive spread in what is already an outlying set of countries. How is it that Mongolia has so many more goats/ capita? This is a very odd result but the data were reported annually from 2000 to 2010 and they consistently show similarly high rates for Mongolia.

The bottom 10 goat having countries in the world are a mix of European, Asian, North American and Pacific Islands. This mix is not surprising as we may see in Figure 1 that there are no outliers in the bottom of the distribution. The highest value for the bottom 10 countries in 2010 was Tonga at 0.121. This is very close to the lowest value from the top 10 countries, and shows us that most of the 172 countries in this dataset have ~0.12 goats per person. With this average in mind, we see that the other bottom nine countries in Figure 2 really are much lower than the global average with rates approaching 0 goats/ capita. It is worth mentioning that the lowest overall rate of goats/ capita in 2010 was Japan at 0.0001. Meaning that there is only one goat in Japan for every 10,000 people. As opposed to Mongolia that has more than five goats for every one person. Therefore there were 50,000 times more goats/ capita in Mongolia than Japan in 2010…

I supposes the take away message from this analysis is that if one ever wants to get away from it all and just go spend time with a lot of goats, Mongolia is the place for you!

(and definitely avoid Japan)


US Parties and Immigration


As an immigrant myself, all of the talk of immigration to be found in main stream media outlets today makes me a bit nervous. Whereas most people that speak of the pro’s and con’s of immigration do so from the point of view of how it may affect the country of their birth, I view this issue as something that affects my ability to live outside the country of my birth. I immigrated into the Republic of South Africa in 2013 and have been living here since. I would do a piece on South African immigration but the numbers are difficult to get a hold of and honestly most people are less interest in South Africa than the USA.

Immigration is not a new talking point. It’s something that comes up in political and a-political circles all of the time. The current debate on the Muslim Ban in the USA may have reached a new level for this sort of rhetoric in the West, but targeted crackdowns of this sort are not new in the world. I won’t bother with citations here, but if one is interested a quick google of “xenophobia” + “border control” should yield some convincing results. As this current row of immigration debates in the USA has become so partisan, I decided that an interesting question to ask would be “Under which of the two parties have more people immigrated into the USA?” and “Under which of the two parties have more people been removed from the USA?”


The historical data on immigration into the USA are located at the Department of Homeland Securities website. In 2013 the DHS started keeping very detailed reports of all immigration by age, country, marital status, etc. These highly detailed data are very interesting but will not help us to ask our central questions. We want long time series of data so that we may compare many different administrations from each party. For ease of analysis I have chosen to classify the party in power at any point in time based on the party of the President. I understand that the Senate or Congress would perhaps be better, if not more egalitarian choices, but the current focus of this issue has the US President at it’s core, so I decided to keep that theme constant in this analysis. I’ll only start from Eisenhower and go up until Obama as the publicly available DHS data end in 2015. They begin as far back as 1892, but ggplot2 has built into it a US president dataframe and I am going to just use that because I’m lazy.


In this first figure we are defining immigration as the number of people actually receiving a green card in any given year. This is the most strict definition of “immigration” and I think may be best used to show whom the USA was choosing to let in.


Figure 1: Bar charts showing the number of green cards granted each year in the USA. The colour of the bars show the ruling party at the time of issuance. A linear model is imposed in black.


We may see in Figure 1 that the all time high for the granting of green cards was during the four year administration of George Bush Senior. These values are so much higher than the other administrations that it leverages the linear model drawn on these data up past where it should normally be to show the more normal trend exhibited by all of the other administrations since Eisenhower. That being said, we actually see a bit of a turn down during the Obama administration, with the largest year of green card issuance during the George Bush Junior administration larger than any year under Obama. I find that surprising. Figure 1 also shows us that we can’t really directly compare the different administrations because as populations increase, so too will the number of people that want to immigrate. In a quick pinch however we may use the residuals from the linear model to give us a slightly better visualisation of how the parties stack up against one another.


Figure 2: The residuals from a linear model fitted to the data shown in Figure 1.


It appears as though whenever there was a Bush in office it was much easier to get a green card. And that Democrats generally made it more difficult to do so.


Now that we have seen that it is easier to enter the USA during a Republican presidency, let’s see under which party an illegal immigrant is most likely to be expelled. There are two different classes of expulsion: ‘Removal’ and ‘Return’. Removal means that a legal order was issued to remove the individual. Return means that the individual was likewise not legally in the states, but left of their own volition.


Figure 3: Two bar charts showing the rates of returns and removals for immigrants from the USA.


Figure 3 tells a very interesting story. In the top panel (Returns), we see that from 1953-55 (shortly after WWII) there were massive numbers of immigrants that returned to their home countries voluntarily. Then there is a period of increase leading up to the 80’s. This then follows a somewhat normal distribution, peaking in the late 90’s near the end of the Clinton administration before the peaceful return of immigrants drops steadily through the 8 years of Bush then Obama. The bottom panel (Removals) shows the reason for this apparent relaxation on immigrants. It isn’t that immigrants were being sent away less, but rather they began to be removed more forcefully than appears to have been the policy until something changed during the Clinton Administration. The rate of immigrants being removed became greater than those being returned in 2011 under Obama. Again we see a heavy hand on immigration during years with a Democratic president in office. It is hard to compare the parties on this issue as the policy of forcefully removing immigrants in favour of having them leave peacefully has only been in practice over three administrations (Clinton, Bush Jr. and Obama). That being said, it is worth noting that the y axes on these two figures are not the same. The increase in removals does not outweigh the decrease in returns. Overall the rate of expulsion of immigrants from the states declined during the Obama administration. And perhaps also Bush Jr.

It is fair criticism to point out that a green card may take several years to acquire. This means that when one begins the process of applying for a green card it may take so long that a different party will be in power by the time it is granted… or not. I would argue however that most of these parties (with the exception of Carter) are in office for eight year stretches. This is not meant to be a definitive analysis, but I think it has proven to be a rather interesting first step. I didn’t expect the data to look like this. George Bush Senior, saviour to immigrants, who would have thought.

Emulating ODV Figures In R


With more and more scientists moving to open source software (i.e. R or Python) to perform their numerical analyses the opportunities for collaboration increase and we may all benefit from this enhanced productivity. At the risk of sounding sycophantic, the future of scientific research truly is in multi-disciplinary work. What then could be inhibiting this slow march towards progress? We tend to like to stick to what is comfortable. Oceanographers in South Africa have been using MATLAB and ODV (Ocean Data View) since about the time that Jesus was lacing up his sandals for his first trip to Palestine. There has been much debate on the future of MATLAB in science, so I won’t get into that here, but I will say that the package oce contains much of the code that one would need for oceanographic work in R, and the package angstroms helps one to work with ROMS (Regional Ocean Modeling System) output. The software that has however largely gone under the radar in these software debates has been ODV. Probably because it is free (after registration) it’s fate has not been sealed by university departments looking to cut costs. The issue with ODV however is the same with all Microsoft products; the sin of having a “pointy clicky” user interface. One cannot perform truly reproducible research with a menu driven user interface. The steps must be written out in code. And so here I will lay out those necessary steps to create an interpolated CTD time series of temperature values that looks as close to the default output of ODV as possible.


Figure 1: The default output of ODV when plotting temperature by depth through time at one location.


Colour palette

Perhaps the most striking thing about the figures that ODV creates is it’s colour palette. A large criticism of this colour palette is that the range of colours used are not equally weighted visually, with the bright reds drawing ones eye more than the muted blues. This issue can be made up for using the viridis package, but for now we will stick to a ODV-like colour palette as that is part of our current objective. To create a colour palette that appears close to the ODV standard I used GIMP to extract the hexadecimal colour values from the colour bar in Figure 1.


Figure 2: A non-interpolated scatterplot of our temperature (°C) data shown as a function of depth (m) over time.



Figure 2 is a far cry from the final product we want, but it is a step in the right direction. One of the things that sets ODV apart from other visualisation software is that it very nicely interpolates the data you give it. While this looks nice, there is some criticism that may be leveled against doing so. That being said, what we want is a pretty visualisation of our data. We are not going to be using these data for any numerical analyses so the main priority is that the output allows us to better visually interpret the data. The package MBA already has the necessary functionality to do this, and it works with ggplot2, so we will be using this to get our figure. The interpolation method used by is multilevel B-splines.

In order to do so we will need to dcast() our data into a wide format so that it simulates a surface layer. spread() from the tidyr package doesn’t quite do what we need as we want a proper surface map, which is outside of the current ideas on the structuring of tidy data. Therefore, after casting our data wide we will use melt(), rather than gather(), to get the data back into long format so that it works with ggplot2.

It is important to note with the use of that it transposes the values while it is creating the calculations and so creating an uneven grid does not work well. Using the code written below one will always need to give the same specifications for pixel count on the x and y axes.


Figure 3: The same temperature (°C) profiles seen in Figure 2 with the missing values filled in with multilevel B-splines. Note the artefact created in the bottom right corner. The 20°C contour line is highlighted in black.


At first glance this now appears to be a very good approximation of the output from ODV. An astute eye will have noticed that the temperatures along the bottom right corner of this figure are not interpolating in a way that appears possible. It is very unlikely that there would be a deep mixed layer underneath the thermoclines detected during 2015. The reason the splines create this artefact is that they are based on a convex hull around the real data points and so the interpolating algorithm wants to perform a regression towards a mean value away from the central point of where the spline is being calculated from. Because the thermoclines detected are interspersed between times where the entire water column consists of a mixed layer is filling in the areas without data as though they are a fully mixed surface layer.

Bounding boxes

There are many ways to deal with this problem with four possible fixes coming to mind quickly. The first is to set extend = F within This tells the algorithm not to fill up every part of the plotting area and will alleviate some of the inaccurate interpolation that occurs but will not eliminate it. The second fix, which would prevent all inaccurate interpolation would be to limit the depth of all of the temperature profiles to be the same as the shallowest sampled profile. This is not an ideal fix because we would then lose quite a bit of information from the deeper sampling that occurred from 2013 to 2014. The third fix is to create a bounding box and screen out all of the interpolated data outside of it. A fourth option is to use soap-film smoothing over some other interpolation method, such as a normal GAM, concurrently with a bounding box. This is normally a good choice but does not work well with these data so I have gone with option three.


Figure 4: The same temperature (°C) profiles seen in Figure 3 with a bounding box used to screen out data interpolated outside of the range of the recordings.



In this short tutorial we have seen how to create an interpolated temperature depth profile over time after a fashion very similar to the default output from ODV. We have also seen how to either fill the entire plotting area with interpolated data (Figure 3), or quickly generate a bounding box in order to remove any potential interpolation artefacts (Figure 4). I think this is a relatively straight forward work flow and would work with any tidy dataset. It is worth noting that this is not constrained to depth profiles, but would work just as well with map data. One would only need to change the date and depth variables to lon and lat.

R is an amazingly flexible language with an incredible amount of support and I’ve yet to see it not be able to emulate, or improve upon, an analysis or graphical output from any other software.

Gender and GDP


Most people living in the Western World are very quick to extol the virtues of gender equality. There are however many places where this is not so. This then inevitably leads to conflict as cultures and nations are drawn closer together on our ever shrinking earth. Perhaps not the sort of conflict that leads to saber rattling, but certainly ideological disagreements that affect policy and have real impacts on large swathes of humanity. So what is to be done? How can say how anyone else should be. There are of course all sorts of moral back and forth’s that could be pursued cyclically ad nauseum, but what if we could show quantitatively what the benefit of gender equality was to a nation and the lives of it’s people? That is the exact sort of question I like to answer here at this blog and it is a question that came up in my daily life a couple of weeks ago. Because most metrics for most countries are recorded, this is the sort of thing that can be answered. Indeed, it has been before, so here I add another take on an argument that really shouldn’t still be happening…


In order to quantify the benefit of gender equality we are going to need global census data, as this is a conversation that needs to be had at a global level. Happily for us these data may be found at Clio Infra. This is an amazing web hosted database that contains an unexpected range of information as it aggregates data from many other sources. I’ll likely be going back to this data repository on multiple occasions in the future. But for now we’ll stick to gender equality. For this post we will be looking at a number of variables as this allows us to develop a more nuanced view of the relationships that exist between the wealth of nations and the egalitarianism their cultures engender. It is unfortunately not possible to show causality in the relationships we will model below but I think that educated readers will likely arrive at the same inferences as myself. Of course, if anyone has any objections, faults or errors to point out in the analysis that follows I would be very happy to hear them.

The data used in this blog were downloaded in “Compact” form, which comes in .xlsx format. These files were then opened and only the tab containing the data in long format were saved as a .csv file. As an aside, these data are visualised on the Clio Infra website as boxplots using what is clearly ggplot2 code. Also note that unless specified all values are taken as either women/ men or women-men. Meaning that proportions below 1, or negative values signify greater abundance for men than women in those populations.

I’ve chosen o use the metrics seen below as I think they represent inequality between genders as well as throughout populations as a whole.


I recently learned how to use the dplyr pipe (%>%) to perform nested models and so am excited to put that to use here. Our first step is to run simple linear models for each metrics we have loaded against GDP per capita without controlling for time (year of sampling) or space (country). The GDP per capita will be correctly compared against it’s corresponding metric at that year and in that country, but we are then allowing mean values effectively to be made of these results. These models will provide us with a range of R^2 values (coefficients of determination), which is the statistic showing how much of the variance in a dependent variable is explained by the independent variable. This allows us to see which of our metrics have the strongest relationship with GDP per capita. Which we may infer as being related to the prosperity of a nation and it’s people.


Table 2: R^2 and p-values for the relationship between several metrics and GDP per capita. Years and country of sampling were not controlled for.

       n metric                                           R^2   p       
1162 education_years                      0.623 0.000
   215 gender_equality_index          0.523 0.000
   134 gender_equality_education 0.321 0.000
9779 education_inequality             0.318 0.000
   656 income_inequality                   0.106 0.000
   348 gender_equality_numeracy 0.075 0.000
   205 sex_ratio                                      0.053 0.001
1371 total_population                     -0.001 0.760


It is possible however that these relationships have not been static over time. We will also want to display the changes in the R^2 values as time series.


Figure 1: Line graph showing the fluctuations in the strength of the relationship (R^2) between the metrics shown in colour and GDP per capita over time.



Figure 2: Boxplots showing the distribution of the strength of the relationships for many countries between the metrics and GDP per capita. Blue numbers at the bottom show the total number of data points used to calculate the scores whereas the blue dots show the single R^2 value for a country.


Years of education of the populace is constantly coming out on top, with gender equality in education in third. These two metrics must be related in some way though so let’s look specifically at that relationship.


Figure 3: Scatterplot showing two different linear model results. The blue line shows the relationship between total years of education within a populace and the gender based equality of those educations. The red line shows how this relationship flattens out when nations with low (below 5 years) levels of education are removed.



From Table 1 we may see that the three metrics in descending order that best explain the variance occurring in GDP per capita over time and space are: education of the population as a whole, equality between the genders and the equality in education between the genders. Equality in education for the populace as a whole came in at a close fourth and is perhaps a better predictor than equality between genders due to the much higher sample size that the result is based on (this is an oddly well reported metric). The metrics of income inequality for the populace as a whole, equality in numeracy between genders, and the ratio of sex near birth showed significant relationships with GDP per capita but explained little of the variance. The total population of a country has absolutely nothing to do with the GDP per capita when time and space are not controlled for.

We see in Figure 1 that for our top four metrics (education_years, gender_equality_index, gender_equality_education and education_inequality), only education_years shows any consistency over time. We also see that education_inequality has become a much better test of the GDP per capita of a country since the mid 19th century.

Figure 2 shows that there is a very large range in the relationship between many of these metrics and GDP per capita when we control for country of measurement. Interestingly, the total_population of the country is relevant to the GDP per capita when this information is analysed independently by each country. More important than total_population we see are the metrics education_years and gender_equality_index.

When we look at the specific relationship between education_years and gender_equality_education it would at first appear that at an R^2 value of 0.518 this was a strong relationship. Looking more closely at Figure 3 one will notice that this relationship is not linear and that the fit of the linear model is being affected by several scores at very low education levels and equality scores. If one were to only account for scores where the education_years metric is at least 5 years the strength of the relationship falls dramatically to R^2 = 0.199.


If one was satisfied to take all of the GDP data and the several metrics of equality used in this analysis and clump them together one would be led to believe that the most important thing for a prosperous society is the overall years of education for that countries citizens. This metric is clearly the most important in Table 1, Figure 1 and Figure 2. Equality in education between the genders is also of great importance, though perhaps half as much so as the overall education of the populace. It leads that a society with more equitable education levels between genders would have a greater proportion of educated people and this is generally so however, Figure 3 shows that if one removes scores from nations with particularly low levels of overall education this relationship falls apart. This is because most countries in the world have relatively even levels of education between genders. There is little pattern here because there is little variance in gender_equality_education once the less educated nations have been removed from the calculation.

More important to GDP than equal education for genders appears to be the overall equality between genders. This value accounts for much more than education alone and appears to better capture how an egalitarian society benefits from the equal treatment of it’s citizens. Clio Infra states that this metric represents:

The composite index aims to evaluate countries’ performances regarding the progress they made in closing the gender gap in the fields of health, socio-economic resources, politics and household since 1950

When I started this research I had assumed that equal education between genders would clearly be a very important metric in regards to a prosperous society. Surprisingly this appears to not be as concrete as expected. It certainly is important, more so than income inequality, for example, but is not as important as the overall equality between the genders. This finding is deceiving though because the equality in education for most countries (not just rich ones) is high. It is the difference in equality between genders generally that shows a broader range, and the difference relates strongly to GDP per capita. So a potential recipe for success seems to be to first ensure that as much of a populace as possible (of all genders) receives as much education as possible. It seems likely that this would then fuel gender equality, but the data don’t show this unilaterally. So to really move a nation to the top of the GDP per capita list would also require an active effort in ensuring well rounded rights to all genders.


Wind Vectors


As more and more physical scientists (e.g. oceanographers) move to R from other object oriented command line programming languages, such as Matlab, there will be more and more demand for the code that is needed to do some basic things that they may already know how to do in their previous languages that they don’t yet know how to do in R. Surprisingly, there are many things that should be very easy to find how to do in R that are not. Or are at least not widely publicized. One such example is how to plot wind vectors as a time series. This is a very necessary part of any analysis of the winds or currents in a particular area. Making it useful broadly to most climate scientists. Try as I might, I’ve only been able to find one source that gives an example of how to plot wind (or current) vectors as a time series with ggplot2 in R. Having now been asked how to do this by several people I thought it would be useful to write up my workflow and put it on the internet so that there is one more source that people searching for answers may find.


The data used in this example come from the ERA-Interim reanalysis product. This is an amazing product with dozens of data layers stretched over an even global grid at a native resolution of 0.75°. Here we use the one pixel from this product closest to Cape Town in order to illustrate how to use these data as a time series for a single point in space at a daily resolution for one month (December, 2016). For the purposes of this example we will only be using the following three layers: “2 metre temperature”, “10 metre U wind component” and “10 metre V wind component”. The data used in this post may be downloaded here.

Wind Vectors

The data as they exist within ERA-Interim are already in the format that we need. U and V vectors. Had they not been then we would have needed to convert them. The other common format for wind data are bearing (degrees) and speed (m/s). With this information it is possible to use trigonometry to create the necessary U and V vectors. One may do so with the following chunk of code:

Assuming that we now have a u and v column in our wind vector data frame, and that the speed of these vectors are in metres per second, we may now use the functionality within ggplot2 to create our wind vector time series. It’s almost anti-climactic how straight forward it all is.


As we will be using geom_segment() within ggplot2 to create our wind vectors we will want to maintain some control over the output by introducing a scaling object and a default range for the y axis. For now we will just set the wind scalar to 1. We’ll change it later to see what benefit this extra step adds. I have chosen the range of values for the y_axis below based on what I knew would illustrate my point, not through any a priori statistical analysis.

To create a wind vector time series we use the following few lines of code:


Figure 1: Wind speeds and bearings shown as a time series.


Visualizing wind data in this way introduces a problem into our figure that we do not normally need to contend with when we create time series figures. That problem is that wind speed/ bearing is a two dimensional value, not one dimensional. Therefore the y axis as well as the x axis must be utilized in order to show the wind speed/ bearing accurately. We find a compromise by using coord_equal() to ensure that every step on the x axis is the same size as the y axis. Thus preventing any distortion of vectors that are showing mostly an East/ West heading.

For particularly windy parts of the world (such as Cape Town) it may be better to scale the wind vectors so as to help them to look a bit more tidy. But if we do this, our x and y axis relationship will no longer be 1 for 1, as with the figure above. That is where our decision to set a static wind scalar and y axis come into play.


Figure 2: The same as Figure 1 with an expanded y axis to show all of the vectors.


By assigning our scaling variable to an object and using it to control the y axis of our ggplot code we may be sure that no matter how we choose to scale our wind variables they will be correctly represented. This does however limit how large our y axis may be as it must be kept to the same ratio as the x axis. Any larger than the figure shown above and we would start to have unappealingly square figures.

Multiple Y Axes

While it may be frowned upon in the tidyverse, there are still many people that like to use multiple y axes within one figure. I’m not going to get into the pro’s and con’s here, but I will say that of all the possible reasons for using multiple y axes, comparing wind vectors against some other variable is more reasonable than most. This is because the wind vectors themselves do not really have a y axis per se. As we’ve seen above, the actual values on the y axis don’t matter as long as they are shown 1 for 1 against whatever is on the x axis. Therefore we may very easily show temperature on the y axis and still overplot wind vectors without needing a proper second y axis. We will however insert a label onto the figure so as to make it clear how the wind vectors are to be estimated.


Figure 3:Temperature and wind time series shown on the same y axis. The label in the upper left corner shows the scale of the wind vectors.


I’ll leave it to the reader to decide if this is a reasonable way to visualise these data. But I do know that there is some use for displaying the information in this way. For example, one can clearly see that on days were the temperature decreases rapidly there is a south westerly wind moving over the city. Of course this could still be elucidated were these two figures plotted next to each other via facets, but it should be noted that it isn’t necessary.