Monthly Archives: June 2017

Emulating ODV Figures In R

Objective

With more and more scientists moving to open source software (i.e. R or Python) to perform their numerical analyses the opportunities for collaboration increase and we may all benefit from this enhanced productivity. At the risk of sounding sycophantic, the future of scientific research truly is in multi-disciplinary work. What then could be inhibiting this slow march towards progress? We tend to like to stick to what is comfortable. Oceanographers in South Africa have been using MATLAB and ODV (Ocean Data View) since about the time that Jesus was lacing up his sandals for his first trip to Palestine. There has been much debate on the future of MATLAB in science, so I won’t get into that here, but I will say that the package oce contains much of the code that one would need for oceanographic work in R, and the package angstroms helps one to work with ROMS (Regional Ocean Modeling System) output. The software that has however largely gone under the radar in these software debates has been ODV. Probably because it is free (after registration) it’s fate has not been sealed by university departments looking to cut costs. The issue with ODV however is the same with all Microsoft products; the sin of having a “pointy clicky” user interface. One cannot perform truly reproducible research with a menu driven user interface. The steps must be written out in code. And so here I will lay out those necessary steps to create an interpolated CTD time series of temperature values that looks as close to the default output of ODV as possible.

 

Figure 1: The default output of ODV when plotting temperature by depth through time at one location.

 

Colour palette

Perhaps the most striking thing about the figures that ODV creates is it’s colour palette. A large criticism of this colour palette is that the range of colours used are not equally weighted visually, with the bright reds drawing ones eye more than the muted blues. This issue can be made up for using the viridis package, but for now we will stick to a ODV-like colour palette as that is part of our current objective. To create a colour palette that appears close to the ODV standard I used GIMP to extract the hexadecimal colour values from the colour bar in Figure 1.

 

 

Figure 2: A non-interpolated scatterplot of our temperature (°C) data shown as a function of depth (m) over time.

 

Interpolating

Figure 2 is a far cry from the final product we want, but it is a step in the right direction. One of the things that sets ODV apart from other visualisation software is that it very nicely interpolates the data you give it. While this looks nice, there is some criticism that may be leveled against doing so. That being said, what we want is a pretty visualisation of our data. We are not going to be using these data for any numerical analyses so the main priority is that the output allows us to better visually interpret the data. The package MBA already has the necessary functionality to do this, and it works with ggplot2, so we will be using this to get our figure. The interpolation method used by mba.surf() is multilevel B-splines.

In order to do so we will need to dcast() our data into a wide format so that it simulates a surface layer. spread() from the tidyr package doesn’t quite do what we need as we want a proper surface map, which is outside of the current ideas on the structuring of tidy data. Therefore, after casting our data wide we will use melt(), rather than gather(), to get the data back into long format so that it works with ggplot2.

It is important to note with the use of mba.surf() that it transposes the values while it is creating the calculations and so creating an uneven grid does not work well. Using the code written below one will always need to give the same specifications for pixel count on the x and y axes.

 

 

Figure 3: The same temperature (°C) profiles seen in Figure 2 with the missing values filled in with multilevel B-splines. Note the artefact created in the bottom right corner. The 20°C contour line is highlighted in black.

 

At first glance this now appears to be a very good approximation of the output from ODV. An astute eye will have noticed that the temperatures along the bottom right corner of this figure are not interpolating in a way that appears possible. It is very unlikely that there would be a deep mixed layer underneath the thermoclines detected during 2015. The reason the splines create this artefact is that they are based on a convex hull around the real data points and so the interpolating algorithm wants to perform a regression towards a mean value away from the central point of where the spline is being calculated from. Because the thermoclines detected are interspersed between times where the entire water column consists of a mixed layer mba.surf() is filling in the areas without data as though they are a fully mixed surface layer.

Bounding boxes

There are many ways to deal with this problem with four possible fixes coming to mind quickly. The first is to set extend = F within mba.surf(). This tells the algorithm not to fill up every part of the plotting area and will alleviate some of the inaccurate interpolation that occurs but will not eliminate it. The second fix, which would prevent all inaccurate interpolation would be to limit the depth of all of the temperature profiles to be the same as the shallowest sampled profile. This is not an ideal fix because we would then lose quite a bit of information from the deeper sampling that occurred from 2013 to 2014. The third fix is to create a bounding box and screen out all of the interpolated data outside of it. A fourth option is to use soap-film smoothing over some other interpolation method, such as a normal GAM, concurrently with a bounding box. This is normally a good choice but does not work well with these data so I have gone with option three.

 

 

Figure 4: The same temperature (°C) profiles seen in Figure 3 with a bounding box used to screen out data interpolated outside of the range of the recordings.

Summary

In this short tutorial we have seen how to create an interpolated temperature depth profile over time after a fashion very similar to the default output from ODV. We have also seen how to either fill the entire plotting area with interpolated data (Figure 3), or quickly generate a bounding box in order to remove any potential interpolation artefacts (Figure 4). I think this is a relatively straight forward work flow and would work with any tidy dataset. It is worth noting that this is not constrained to depth profiles, but would work just as well with map data. One would only need to change the date and depth variables to lon and lat.

R is an amazingly flexible language with an incredible amount of support and I’ve yet to see it not be able to emulate, or improve upon, an analysis or graphical output from any other software.

Gender and GDP

Objective

Most people living in the Western World are very quick to extol the virtues of gender equality. There are however many places where this is not so. This then inevitably leads to conflict as cultures and nations are drawn closer together on our ever shrinking earth. Perhaps not the sort of conflict that leads to saber rattling, but certainly ideological disagreements that affect policy and have real impacts on large swathes of humanity. So what is to be done? How can say how anyone else should be. There are of course all sorts of moral back and forth’s that could be pursued cyclically ad nauseum, but what if we could show quantitatively what the benefit of gender equality was to a nation and the lives of it’s people? That is the exact sort of question I like to answer here at this blog and it is a question that came up in my daily life a couple of weeks ago. Because most metrics for most countries are recorded, this is the sort of thing that can be answered. Indeed, it has been before, so here I add another take on an argument that really shouldn’t still be happening…

Data

In order to quantify the benefit of gender equality we are going to need global census data, as this is a conversation that needs to be had at a global level. Happily for us these data may be found at Clio Infra. This is an amazing web hosted database that contains an unexpected range of information as it aggregates data from many other sources. I’ll likely be going back to this data repository on multiple occasions in the future. But for now we’ll stick to gender equality. For this post we will be looking at a number of variables as this allows us to develop a more nuanced view of the relationships that exist between the wealth of nations and the egalitarianism their cultures engender. It is unfortunately not possible to show causality in the relationships we will model below but I think that educated readers will likely arrive at the same inferences as myself. Of course, if anyone has any objections, faults or errors to point out in the analysis that follows I would be very happy to hear them.

The data used in this blog were downloaded in “Compact” form, which comes in .xlsx format. These files were then opened and only the tab containing the data in long format were saved as a .csv file. As an aside, these data are visualised on the Clio Infra website as boxplots using what is clearly ggplot2 code. Also note that unless specified all values are taken as either women/ men or women-men. Meaning that proportions below 1, or negative values signify greater abundance for men than women in those populations.

I’ve chosen o use the metrics seen below as I think they represent inequality between genders as well as throughout populations as a whole.

 

Analysis

I recently learned how to use the dplyr pipe (%>%) to perform nested models and so am excited to put that to use here. Our first step is to run simple linear models for each metrics we have loaded against GDP per capita without controlling for time (year of sampling) or space (country). The GDP per capita will be correctly compared against it’s corresponding metric at that year and in that country, but we are then allowing mean values effectively to be made of these results. These models will provide us with a range of R^2 values (coefficients of determination), which is the statistic showing how much of the variance in a dependent variable is explained by the independent variable. This allows us to see which of our metrics have the strongest relationship with GDP per capita. Which we may infer as being related to the prosperity of a nation and it’s people.

 

 

Table 2: R^2 and p-values for the relationship between several metrics and GDP per capita. Years and country of sampling were not controlled for.

       n metric                                           R^2   p       
1162 education_years                      0.623 0.000
   215 gender_equality_index          0.523 0.000
   134 gender_equality_education 0.321 0.000
9779 education_inequality             0.318 0.000
   656 income_inequality                   0.106 0.000
   348 gender_equality_numeracy 0.075 0.000
   205 sex_ratio                                      0.053 0.001
1371 total_population                     -0.001 0.760

 

It is possible however that these relationships have not been static over time. We will also want to display the changes in the R^2 values as time series.

 

Figure 1: Line graph showing the fluctuations in the strength of the relationship (R^2) between the metrics shown in colour and GDP per capita over time.

 

Figure 2: Boxplots showing the distribution of the strength of the relationships for many countries between the metrics and GDP per capita. Blue numbers at the bottom show the total number of data points used to calculate the scores whereas the blue dots show the single R^2 value for a country.

 

Years of education of the populace is constantly coming out on top, with gender equality in education in third. These two metrics must be related in some way though so let’s look specifically at that relationship.

 

Figure 3: Scatterplot showing two different linear model results. The blue line shows the relationship between total years of education within a populace and the gender based equality of those educations. The red line shows how this relationship flattens out when nations with low (below 5 years) levels of education are removed.

 

Results

From Table 1 we may see that the three metrics in descending order that best explain the variance occurring in GDP per capita over time and space are: education of the population as a whole, equality between the genders and the equality in education between the genders. Equality in education for the populace as a whole came in at a close fourth and is perhaps a better predictor than equality between genders due to the much higher sample size that the result is based on (this is an oddly well reported metric). The metrics of income inequality for the populace as a whole, equality in numeracy between genders, and the ratio of sex near birth showed significant relationships with GDP per capita but explained little of the variance. The total population of a country has absolutely nothing to do with the GDP per capita when time and space are not controlled for.

We see in Figure 1 that for our top four metrics (education_years, gender_equality_index, gender_equality_education and education_inequality), only education_years shows any consistency over time. We also see that education_inequality has become a much better test of the GDP per capita of a country since the mid 19th century.

Figure 2 shows that there is a very large range in the relationship between many of these metrics and GDP per capita when we control for country of measurement. Interestingly, the total_population of the country is relevant to the GDP per capita when this information is analysed independently by each country. More important than total_population we see are the metrics education_years and gender_equality_index.

When we look at the specific relationship between education_years and gender_equality_education it would at first appear that at an R^2 value of 0.518 this was a strong relationship. Looking more closely at Figure 3 one will notice that this relationship is not linear and that the fit of the linear model is being affected by several scores at very low education levels and equality scores. If one were to only account for scores where the education_years metric is at least 5 years the strength of the relationship falls dramatically to R^2 = 0.199.

Discussion

If one was satisfied to take all of the GDP data and the several metrics of equality used in this analysis and clump them together one would be led to believe that the most important thing for a prosperous society is the overall years of education for that countries citizens. This metric is clearly the most important in Table 1, Figure 1 and Figure 2. Equality in education between the genders is also of great importance, though perhaps half as much so as the overall education of the populace. It leads that a society with more equitable education levels between genders would have a greater proportion of educated people and this is generally so however, Figure 3 shows that if one removes scores from nations with particularly low levels of overall education this relationship falls apart. This is because most countries in the world have relatively even levels of education between genders. There is little pattern here because there is little variance in gender_equality_education once the less educated nations have been removed from the calculation.

More important to GDP than equal education for genders appears to be the overall equality between genders. This value accounts for much more than education alone and appears to better capture how an egalitarian society benefits from the equal treatment of it’s citizens. Clio Infra states that this metric represents:

The composite index aims to evaluate countries’ performances regarding the progress they made in closing the gender gap in the fields of health, socio-economic resources, politics and household since 1950

When I started this research I had assumed that equal education between genders would clearly be a very important metric in regards to a prosperous society. Surprisingly this appears to not be as concrete as expected. It certainly is important, more so than income inequality, for example, but is not as important as the overall equality between the genders. This finding is deceiving though because the equality in education for most countries (not just rich ones) is high. It is the difference in equality between genders generally that shows a broader range, and the difference relates strongly to GDP per capita. So a potential recipe for success seems to be to first ensure that as much of a populace as possible (of all genders) receives as much education as possible. It seems likely that this would then fuel gender equality, but the data don’t show this unilaterally. So to really move a nation to the top of the GDP per capita list would also require an active effort in ensuring well rounded rights to all genders.

 

Wind Vectors

Objective

As more and more physical scientists (e.g. oceanographers) move to R from other object oriented command line programming languages, such as Matlab, there will be more and more demand for the code that is needed to do some basic things that they may already know how to do in their previous languages that they don’t yet know how to do in R. Surprisingly, there are many things that should be very easy to find how to do in R that are not. Or are at least not widely publicized. One such example is how to plot wind vectors as a time series. This is a very necessary part of any analysis of the winds or currents in a particular area. Making it useful broadly to most climate scientists. Try as I might, I’ve only been able to find one source that gives an example of how to plot wind (or current) vectors as a time series with ggplot2 in R. Having now been asked how to do this by several people I thought it would be useful to write up my workflow and put it on the internet so that there is one more source that people searching for answers may find.

Data

The data used in this example come from the ERA-Interim reanalysis product. This is an amazing product with dozens of data layers stretched over an even global grid at a native resolution of 0.75°. Here we use the one pixel from this product closest to Cape Town in order to illustrate how to use these data as a time series for a single point in space at a daily resolution for one month (December, 2016). For the purposes of this example we will only be using the following three layers: “2 metre temperature”, “10 metre U wind component” and “10 metre V wind component”. The data used in this post may be downloaded here.

Wind Vectors

The data as they exist within ERA-Interim are already in the format that we need. U and V vectors. Had they not been then we would have needed to convert them. The other common format for wind data are bearing (degrees) and speed (m/s). With this information it is possible to use trigonometry to create the necessary U and V vectors. One may do so with the following chunk of code:

Assuming that we now have a u and v column in our wind vector data frame, and that the speed of these vectors are in metres per second, we may now use the functionality within ggplot2 to create our wind vector time series. It’s almost anti-climactic how straight forward it all is.

Plotting

As we will be using geom_segment() within ggplot2 to create our wind vectors we will want to maintain some control over the output by introducing a scaling object and a default range for the y axis. For now we will just set the wind scalar to 1. We’ll change it later to see what benefit this extra step adds. I have chosen the range of values for the y_axis below based on what I knew would illustrate my point, not through any a priori statistical analysis.

To create a wind vector time series we use the following few lines of code:

Figure 1: Wind speeds and bearings shown as a time series.

 

Visualizing wind data in this way introduces a problem into our figure that we do not normally need to contend with when we create time series figures. That problem is that wind speed/ bearing is a two dimensional value, not one dimensional. Therefore the y axis as well as the x axis must be utilized in order to show the wind speed/ bearing accurately. We find a compromise by using coord_equal() to ensure that every step on the x axis is the same size as the y axis. Thus preventing any distortion of vectors that are showing mostly an East/ West heading.

For particularly windy parts of the world (such as Cape Town) it may be better to scale the wind vectors so as to help them to look a bit more tidy. But if we do this, our x and y axis relationship will no longer be 1 for 1, as with the figure above. That is where our decision to set a static wind scalar and y axis come into play.

 

Figure 2: The same as Figure 1 with an expanded y axis to show all of the vectors.

 

By assigning our scaling variable to an object and using it to control the y axis of our ggplot code we may be sure that no matter how we choose to scale our wind variables they will be correctly represented. This does however limit how large our y axis may be as it must be kept to the same ratio as the x axis. Any larger than the figure shown above and we would start to have unappealingly square figures.

Multiple Y Axes

While it may be frowned upon in the tidyverse, there are still many people that like to use multiple y axes within one figure. I’m not going to get into the pro’s and con’s here, but I will say that of all the possible reasons for using multiple y axes, comparing wind vectors against some other variable is more reasonable than most. This is because the wind vectors themselves do not really have a y axis per se. As we’ve seen above, the actual values on the y axis don’t matter as long as they are shown 1 for 1 against whatever is on the x axis. Therefore we may very easily show temperature on the y axis and still overplot wind vectors without needing a proper second y axis. We will however insert a label onto the figure so as to make it clear how the wind vectors are to be estimated.

Figure 3:Temperature and wind time series shown on the same y axis. The label in the upper left corner shows the scale of the wind vectors.

 

I’ll leave it to the reader to decide if this is a reasonable way to visualise these data. But I do know that there is some use for displaying the information in this way. For example, one can clearly see that on days were the temperature decreases rapidly there is a south westerly wind moving over the city. Of course this could still be elucidated were these two figures plotted next to each other via facets, but it should be noted that it isn’t necessary.

Religious Sentiment

Objective

Before we begin, I would like to acknowledge that the framework for this analysis was adapted from a blogpost found on the wonderfully interestingly R-bloggers website. The objective of this analysis is to use sentiment analysis on different religious texts to visualise the differences/ similarities between them. This concept is of course fraught with a host of issues. Not least of which being detractors who will likely not endeavour to engage in rational debate against the findings of this work. This is of course beyond the control of the experimental design so I rather focus here on the issue that the translations of the texts used are not modern, and none of them were written (and generally not intended to be read in) English. Therefore a sentiment analysis of these works will be given to a certain amount of inaccuracy because the sentiment database is a recent project and one must assume that the emotions attached to the words in the English language reflect modern sentiment, and not the emotive responses that were necessarily desired at the time of writing/ translation. That being said, the sentiment that these texts would elicit in a reader today (assuming that anyone actually still reads the source material for their faith) would be accurately reflected by the sentiment analysis project and so this issue is arguably a minor one. As a control group (non-religious text), we will be using Pride and Prejudice, by Jane Austen simply because this has already been ported into R and is readily accesible in the package ‘janeaustenr’.

For more information on the sentiment analysis project one may find the home page here.

Text Downloads

The following links were used to download the religious texts used in this analysis:

I generally used the first link that DuckDuckGo provided when searching for pdf versions of these books. If anyone knows of better sources for these texts please let me know and I would be happy to update the data source.

Packages

library(pdftools)
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(RmarineHeatWaves)

Analysis

First we convert the PDF’s to text and merge them.

## Load the texts
# Pride and Prejudice
pride <- data_frame(txt = prideprejudice,
src = "pride")
# Quran
quran <- data_frame(txt = as.character(pdf_text("~/blog/data/EQuran.pdf")), src = "quran")
# Bible
bible <- data_frame(txt = as.character(pdf_text("~/blog/data/KJBNew.pdf")), src = "bible")
# Torah
torah <- data_frame(txt = as.character(pdf_text("~/blog/data/Torah.pdf")), src = "torah")
# Bhagavad Gita
gita <- data_frame(txt = as.character(pdf_text("~/blog/data/BGita.pdf")), src = "gita")
# Book of the dead
dead <- data_frame(txt = as.character(pdf_text("~/blog/data/EBoD.pdf")), src = "dead")
# Combine
texts <- rbind(pride, quran, bible, torah, gita, dead)

After loading and merging the texts we want to run some quick word counts on them.

# Total relevant words per text
total_word_count <- texts %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words, by = "word") %>%
filter(!grepl('[0-9]', word)) %>%
filter(!grepl('www.christistheway.com', word)) %>%
left_join(get_sentiments("nrc"), by = "word") %>%
group_by(src) %>%
summarise(total = n()) %>%
ungroup()

# Total emotive words per text
emotive_word_count <- texts %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words, by = "word") %>%
filter(!grepl('[0-9]', word)) %>%
filter(!grepl('www.christistheway.com', word)) %>%
left_join(get_sentiments("nrc"), by = "word") %>%
filter(!(sentiment == "negative" | sentiment == "positive" | sentiment == "NA")) %>%
group_by(src) %>%
summarise(emotions = n()) %>%
ungroup()

# Positivity proportion
positivity_word_count <- texts %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words, by = "word") %>%
filter(!grepl('[0-9]', word)) %>%
filter(!grepl('www.christistheway.com', word)) %>%
left_join(get_sentiments("nrc"), by = "word") %>%
filter(sentiment == "positive" | sentiment == "negative") %>%
group_by(src, sentiment) %>%
summarise(emotions = n()) %>%
summarise(positivity = emotions[sentiment == "positive"]/ emotions[sentiment == "negative"]) %>%
ungroup()

# Combine
word_count <- cbind(total_word_count, emotive_word_count[2], positivity_word_count[2]) %>%
mutate(proportion = round(emotions/total, 2))

# Lolliplot
ggplot(data = word_count, aes(x = as.factor(src), y = proportion)) +
geom_lolli() +
geom_point(aes(colour = positivity)) +
geom_text(aes(y = 0.45, label = paste0("n = ", total)), colour = "red") +
scale_y_continuous(limits = c(0,0.5)) +
scale_color_continuous("Proportion of\nPositivity", low = "steelblue", high = "salmon") +
labs(x = "Text", y = "Proportion of Emotive Words")

Figure 1: Proportion of emotive words in each text to their total word counts (shown in red).

 

I find it somewhat surprising that The Egyptian Book of the Dead would have the highest proportion of positive sentiment to negative sentiment.

But let's take a more in depth look at the sentiment for each text.

# Calculate emotions
emotions <- texts %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words, by = "word") %>%
filter(!grepl('[0-9]', word)) %>%
filter(!grepl('King James Version of the New Testament - ', word)) %>%
filter(!grepl('www.christistheway.com', word)) %>%
filter(!grepl('BHAGAVAD GITA', word)) %>%
left_join(get_sentiments("nrc"), by = "word") %>%
filter(!(sentiment == "negative" | sentiment == "positive")) %>%
group_by(src, sentiment) %>%
summarize(freq = n()) %>%
mutate(proportion = round(freq/sum(freq),2)) %>%
# select(-freq) %>%
ungroup()
# Boxplot
ggplot(emotions, aes(x = sentiment, y = proportion)) +
geom_boxplot(fill = "grey70") +
geom_point(aes(colour = src), position = "jitter", size = 2) +
labs(x = "Sentiment", y = "Proportion", colour = "Text")

Figure 2: Boxplots of emotion for the texts. Individual scores shown as coloured points.

 

ggplot(emotions, aes(x = sentiment, y = proportion)) +
geom_polygon(aes(fill = src, group = src), alpha = 0.6, show.legend = FALSE) +
coord_polar() +
scale_y_continuous(limits = c(0, 0.3)) +
facet_wrap(~src) +
labs(x = "", y = "")

Figure 3: Frequency polygons of the proportion of emotions for each text shown with facets.

 

Results

The frequency polygons shown here most notably outline that all of the texts emphasise trust over anything else. Another interesting pattern is the lack of surprise employed in the texts. The Bhagavad Gita seems to have the smoothest distribution of emotion, with the Quran perhaps coming in second. I found it odd that a novel, Pride and Prejudice, would employ a higher proportion of joyous sentiment than the religious texts against which it was compared. The Quran just barely nudges out the Bhagavad Gita to come out on top for angry and fearful sentiment. The Torah stands out from the other texts due to the amount of disgust expressed therein. After Pride and Prejudice, the Bible uses the largest proportion of words that elicited emotions of anticipation.

There is still much more that may or may not be deduced from these cursory results. Most notably lacking from this analysis is any test for significant differences. The reason for that being that I am not interested in supporting any sort of argument for differences between these texts. With n values (word count) in excess of 3,000 significant differences between the different texts is almost guaranteed as well so any use of a p-value here would just be cheap p-hacking anyway.