Tag Archives: R

Mapping with ggplot2

Objective

There are many different things that require scientists to use programming languages (like R). Far too many to count here. There is however one common use amongst almost all environmental scientists: mapping. Almost every report, research project or paper will need to refer to a study area. This is almost always “Figure 1”. To this end, whenever I teach R, or run workshops on it, one of the questions I am always prepared for is how to create a map of a particular area. Being a happy convert to the tidyverse I only teach the graphics of ggplot2. I have found that people often prefer to use the ggmap extension to create ggplot-quality figures with Google map backgrounds, but I personally think that a more traditional monotone background for maps looks more professional. What I’ve decided to showcase this week is the data and code required to create a publication-quality map. Indeed, the following code will create the aforementioned obligatory “Figure 1” in a paper I am currently preparing for submission.

Data

There are heaps of packages etc. that one may use to create maps, and there is a never-ending supply of blogs, books and tutorials that illustrate many of the different ways to visualise spatial data. For my international and geographic borders I prefer to use data I’ve downloaded from GSHHG and then converted to dataframes using functions found in the PBSmapping package. I then save these converted dataframes as .Rdata objects on my computer for ease of use with all of my projects. For the domestic borders of a country, which I won’t use in this post, one may go here. Note however that for some strange reason this website still has the pre-1994 borders for South Africa. For the correct SA borders one must go here. The current SA borders may actually be downloaded in the .Rdata format, which is neat.
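Below is a minimal sketch of that conversion step, assuming a full-resolution GSHHG binary file and lon/lat limits around southern Africa; the file name, limits and object names are illustrative, not the exact ones used for this post.

library(PBSmapping)

# Read the GSHHG coastline for the southern African region (maxLevel = 1 keeps only land)
sa_coast <- importGSHHS("gshhs_f.b", xlim = c(10, 40), ylim = c(-40, -25), maxLevel = 1)

# A PolySet is already a dataframe (PID, SID, POS, X, Y), so it may be saved
# directly and later drawn with geom_polygon(aes(x = X, y = Y, group = PID))
sa_coast <- as.data.frame(sa_coast)
save(sa_coast, file = "sa_coast.Rdata")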

Once one has the borders to be used in the map, the next step is to think about what one actually wants to show. The main purpose of this map is to show where several in situ coastal seawater temperature time series were collected. This could be done quite simply, but a plain black and white map is offensively boring, so we want to make sure there is a good amount of (but not too much!) colour in order to entice the reader. I personally find pictures of meso-scale oceanic phenomena particularly beautiful so I try to include them whenever I can. Luckily that is also what I study, so it is not strange that I include such things in my work. Now if only I studied pandas, too…

Pandas aside, the current work I am engaged in also requires that the atmospheric processes around southern Africa be considered in addition to the oceanography. To visualise both air and sea concurrently would be a mess, so we will want to create separate panels for each. Because I have been working with reanalysis data lately, and not satellite data, I am also able to include the wind/ current vectors in order to really help the temperature patterns pop. The oceanic data are from the BRAN2016 product and the atmospheric data are from ERA-Interim, both of which are freely available for download for scientific pursuits. I’ve chosen here to use the mean values for January 1st as the summer months provide the clearest example of the thermal differences between the Agulhas and Benguela currents. The code used to create the scale bar in the maps may be found here. It’s not a proper ggplot geom function but works well enough. I’ve also decided to add the 200 m isobath to the sea panel. These data come from NOAA.

Mapping

I find that it is easier to keep track of the different aspects of a map when they are stored as different dataframes. One should however avoid having too many loose dataframes running about in the global environment. It is a balancing act and requires one to find a happy middle ground. Here I am going to cut the all_jan1_0.5 dataframe into 4. One each for air and sea temperatures and vectors. I am also going to reduce the resolution of the wind so that the vectors will plot more nicely.
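Something along the following lines does the trick; the column names (type, temp, u, v) are assumptions about the structure of all_jan1_0.5, not the post’s actual names.

library(tidyverse)

# Split the combined dataframe into temperature and vector dataframes
sea_temp <- all_jan1_0.5 %>% filter(type == "sea") %>% select(lon, lat, temp)
air_temp <- all_jan1_0.5 %>% filter(type == "air") %>% select(lon, lat, temp)
sea_vec <- all_jan1_0.5 %>% filter(type == "sea") %>% select(lon, lat, u, v)
air_vec <- all_jan1_0.5 %>% filter(type == "air") %>% select(lon, lat, u, v)

# Keep only whole-degree pixels so the wind vectors plot less densely
air_vec <- air_vec %>%
  filter(lon %in% seq(min(lon), max(lon), by = 1),
         lat %in% seq(min(lat), max(lat), by = 1))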

With just a few alterations to our nicely divided up dataframes we are ready to create a map. We will look at the code required to create each map and then put it all together in the end.

First up is the busiest. The following code chunk will create the top panel of our map, the sea state. It is necessary to label all of the locations mentioned in the text and so they are thrown on here. In order to make the site labels easier to read I’ve made them red. This is particularly jarring but I think I like it.
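A sketch of the sea panel along these lines should give the idea; the dataframe names (sea_temp, sea_vec, bathy, south_africa_coast, sites) and the plotting limits are assumptions rather than the exact objects used for the published figure.

sea_plot <- ggplot() +
  # SST raster from BRAN2016
  geom_raster(data = sea_temp, aes(x = lon, y = lat, fill = temp)) +
  # 200 m isobath from the NOAA bathymetry (assumed negative depths)
  geom_contour(data = bathy, aes(x = lon, y = lat, z = depth),
               breaks = -200, colour = "grey40") +
  # Surface current vectors
  geom_segment(data = sea_vec,
               aes(x = lon, y = lat, xend = lon + u, yend = lat + v),
               arrow = grid::arrow(length = grid::unit(0.1, "cm")), alpha = 0.6) +
  # Land mask from the GSHHG borders prepared earlier
  geom_polygon(data = south_africa_coast, aes(x = X, y = Y, group = PID),
               fill = "grey80", colour = "black") +
  # Site locations and their (jarringly red) labels
  geom_point(data = sites, aes(x = lon, y = lat)) +
  geom_text(data = sites, aes(x = lon, y = lat, label = site),
            colour = "red", size = 2.5, nudge_y = 0.2) +
  scale_fill_viridis_c(name = "Temp.\n(°C)") +
  coord_cartesian(xlim = c(14, 34), ylim = c(-37, -26), expand = FALSE) +
  labs(x = NULL, y = NULL)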

Many of the sites that need to be plotted are lying on top of each other. This is never good, but it is made worse when the sites in question are referred to frequently in the text. For this reason we need to create a little panel inside of the larger figure that shows a zoomed-in picture of False Bay, complete with text labels.
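The inset is just a second, tightly cropped ggplot object; again the site and border dataframe names are assumed.

fb_plot <- ggplot() +
  geom_polygon(data = south_africa_coast, aes(x = X, y = Y, group = PID),
               fill = "grey80", colour = "black") +
  geom_point(data = sites, aes(x = lon, y = lat)) +
  geom_text(data = sites, aes(x = lon, y = lat, label = site),
            colour = "red", size = 2, nudge_y = 0.03) +
  # Crop to False Bay only
  coord_cartesian(xlim = c(18.2, 19.0), ylim = c(-34.5, -33.8), expand = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(fill = NA, colour = "black"))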

We could possibly create another inset panel for the clump of sites around Hamburg, but this figure is already getting too busy, so we’ll leave it for now. One inset panel will serve to illustrate the code necessary to create a faceted map, so for the purposes of this post it will also suffice. That leaves us with only the bottom panel to create: the air state. I’ve decided to put the scale bar/ north arrow on this panel in an attempt to balance the amount of information in each panel.
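The air panel is built in exactly the same way as the sea panel, so only a compact sketch is given here; air_temp and air_vec are the assumed dataframes from the splitting step above.

air_plot <- ggplot() +
  geom_raster(data = air_temp, aes(x = lon, y = lat, fill = temp)) +
  geom_segment(data = air_vec,
               aes(x = lon, y = lat, xend = lon + u, yend = lat + v),
               arrow = grid::arrow(length = grid::unit(0.1, "cm")), alpha = 0.6) +
  geom_polygon(data = south_africa_coast, aes(x = X, y = Y, group = PID),
               fill = "grey80", colour = "black") +
  scale_fill_viridis_c(name = "Temp.\n(°C)", option = "A") +
  coord_cartesian(xlim = c(14, 34), ylim = c(-37, -26), expand = FALSE) +
  labs(x = NULL, y = NULL)
# the scale bar/ north arrow from the function linked above would be added to this panel as extra layers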

With our three pieces of the map complete, it is time to stick them together. There are many ways to do this but I have recently found that using annotation_custom allows one to stick any sort of ggplot like object onto any other sort of ggplot object. This is an exciting development and opens up a lot of doors for some pretty creative stuff. Here I will just use it to demonstrate simple faceting, but combined with panel gridding. Really though the sky is the limit.
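A minimal sketch of that assembly, using the panel names from the sketches above; the placement coordinates are illustrative only.

fig_1 <- ggplot(data.frame(x = c(0, 1), y = c(0, 1)), aes(x = x, y = y)) +
  geom_blank() +
  # Sea state on top, air state below
  annotation_custom(ggplotGrob(sea_plot), xmin = 0, xmax = 1, ymin = 0.5, ymax = 1) +
  annotation_custom(ggplotGrob(air_plot), xmin = 0, xmax = 1, ymin = 0, ymax = 0.5) +
  # False Bay inset placed over an empty corner of the sea panel
  annotation_custom(ggplotGrob(fb_plot), xmin = 0.62, xmax = 0.95, ymin = 0.56, ymax = 0.78) +
  theme_void()

ggsave("figure_1.png", fig_1, width = 8, height = 10)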

Summary

The developments in the gridding system have brought the potential for using ggplot for these more complex maps forward quite a bit. As long as one does not use a constrained mapping coordinate system (i.e. coord_fixed) the grob-ification of the ggplot objects seems to allow the placing of the pieces into a common area to be performed smoothly. Displaying many different bits of information cleanly is always a challenge. This figure is particularly busy, out of necessity. I think it turned out very nicely though.

 

Figure 1: Map showing the southern tip of the African continent. The top panel shows the typical sea surface temperature and surface currents on January 1st. The bottom panel likewise shows the typical surface air temperatures and winds on any given January 1st.

Goats per Capita

Objective

A few weeks ago for a post about the relationship between gender equality and GDP/ capita I found a nifty website that has a massive amount of census information for most countries on our planet. Much of this information could be used to answer some very interesting and/ or important questions. But some of the data can be used to answer seemingly pointless questions. And that’s what I intend to do this week. Specifically, which countries in the world have the highest rates of goats/ capita?

Data

The goats per capita data were downloaded from the Clio Infra website. These data are already in the format we need, so there is little to be done before jumping straight into the analysis. We will however remove any records from before 1900 as these are almost entirely estimates, and not real records.
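Something along these lines is all that is needed; the file name and the goats and country column names are assumed.

library(tidyverse)

goats <- read_csv("goats_per_capita.csv") %>%
  filter(year >= 1900)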

 

Analysis

First of all, I would like to know what the global trend in goats/ capita has been since 1900. To do so we need to create annual averages and apply a simple linear model to them. We will also plot boxplots to give us an idea of the spread of goats/ capita over the world.
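A sketch of that workflow, using the assumed column names from above:

# Annual global means
annual_goats <- goats %>%
  group_by(year) %>%
  summarise(mean_goats = mean(goats, na.rm = TRUE))

# Simple linear model of the global trend
goats_lm <- lm(mean_goats ~ year, data = annual_goats)
summary(goats_lm)

# Boxplots of the spread across countries with the fitted trend on top
ggplot(goats, aes(x = year, y = goats)) +
  geom_boxplot(aes(group = year), outlier.size = 0.5) +
  geom_smooth(data = annual_goats, aes(y = mean_goats),
              method = "lm", colour = "salmon") +
  labs(x = "Year", y = "Goats per capita")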

 

Figure 1: Boxplots with a fitted linear model showing the global trend in goats/ capita over the last century.

 

As we may see in Figure 1, the overall trend in goats/ capita in the world has been decreasing very slightly over the last century. The striking result from Figure 1 however is the massive range of values as seen by the outliers from the boxplots. So which countries are these that have so many more goats/ capita than the rest of the world?

We want to see which countries have the most goats/ capita but there are 172 unique countries in this dataset so it would look much too busy to plot them all. To that end we want only the top and bottom 10 countries from the most recent year of reporting (2010).
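One way to pull out those two groups and their full time series (again with assumed column names):

# Top and bottom 10 countries in 2010
goats_2010 <- goats %>%
  filter(year == 2010)

top_bottom <- bind_rows(
  goats_2010 %>% top_n(10, goats) %>% mutate(rank = "Top 10"),
  goats_2010 %>% top_n(-10, goats) %>% mutate(rank = "Bottom 10")
)

# Full time series for just those countries
goats_tb <- goats %>%
  filter(country %in% top_bottom$country) %>%
  left_join(select(top_bottom, country, rank), by = "country")

ggplot(goats_tb, aes(x = year, y = goats, colour = country)) +
  geom_line() +
  facet_wrap(~rank, scales = "free_y") +
  labs(x = "Year", y = "Goats per capita")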

 

 

Figure 2: Line graphs showing the rate of goats/ capita for the top and bottom 10 goat having countries in the world over the last century.

 

Summary

I was a bit surprised to find that Mongolia is far and away the country with the most goats/ capita, at 5.140 in 2010. Less surprising is that the other top 9 goat-having countries in the world in 2010 were all in Africa and their rates of goats/ capita ranged from 1.634 (Mauritania) to 0.124 (South Africa). This makes for a massive spread in what is already an outlying set of countries. How is it that Mongolia has so many more goats/ capita? This is a very odd result, but the data were reported annually from 2000 to 2010 and they consistently show similarly high rates for Mongolia.

The bottom 10 goat-having countries in the world are a mix of European, Asian, North American and Pacific Island nations. This mix is not surprising as we may see in Figure 1 that there are no outliers at the bottom of the distribution. The highest value among the bottom 10 countries in 2010 was Tonga at 0.121. This is very close to the lowest value from the top 10 countries, and shows us that most of the 172 countries in this dataset have ~0.12 goats per person. With this average in mind, we see that the other bottom nine countries in Figure 2 really are much lower than the global average, with rates approaching 0 goats/ capita. It is worth mentioning that the lowest overall rate of goats/ capita in 2010 was Japan at 0.0001, meaning that there is only one goat in Japan for every 10,000 people, as opposed to Mongolia, which has more than five goats for every one person. Therefore there were roughly 50,000 times more goats/ capita in Mongolia than Japan in 2010…

I suppose the take-away message from this analysis is that if one ever wants to get away from it all and just go spend time with a lot of goats, Mongolia is the place for you!

(and definitely avoid Japan)

 

US Parties and Immigration

Objective

As an immigrant myself, all of the talk of immigration to be found in mainstream media outlets today makes me a bit nervous. Whereas most people who speak of the pros and cons of immigration do so from the point of view of how it may affect the country of their birth, I view this issue as something that affects my ability to live outside the country of my birth. I immigrated to the Republic of South Africa in 2013 and have been living here since. I would do a piece on South African immigration, but the numbers are difficult to get a hold of and honestly most people are less interested in South Africa than the USA.

Immigration is not a new talking point. It’s something that comes up in political and apolitical circles all of the time. The current debate on the Muslim Ban in the USA may have reached a new level for this sort of rhetoric in the West, but targeted crackdowns of this sort are not new in the world. I won’t bother with citations here, but if one is interested a quick Google search of “xenophobia” + “border control” should yield some convincing results. As this current row of immigration debates in the USA has become so partisan, I decided that two interesting questions to ask would be “Under which of the two parties have more people immigrated into the USA?” and “Under which of the two parties have more people been removed from the USA?”

Data

The historical data on immigration into the USA are located on the Department of Homeland Security’s website. In 2013 the DHS started keeping very detailed records of all immigration by age, country, marital status, etc. These highly detailed data are very interesting but will not help us answer our central questions. We want long time series so that we may compare many different administrations from each party. For ease of analysis I have chosen to classify the party in power at any point in time by the party of the President. I understand that the Senate or Congress would perhaps be better, if not more egalitarian, choices, but the current focus of this issue has the US President at its core, so I decided to keep that theme constant in this analysis. I’ll only start from Eisenhower and go up until Obama as the publicly available DHS data end in 2015. They begin as far back as 1892, but ggplot2 has a US presidents dataframe built into it and I am going to just use that because I’m lazy.
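A sketch of how the presidential dataframe bundled with ggplot2 can be expanded into one party label per year and joined onto the DHS time series; the immigration dataframe and its year column are assumed names, and inauguration years are simply assigned to the incoming president.

library(tidyverse)
library(lubridate)

party_by_year <- presidential %>%
  mutate(year = map2(year(start), year(end) - 1, seq)) %>%
  unnest(year) %>%
  filter(year <= 2015) %>%
  select(president = name, party, year)

immigration <- immigration %>%
  left_join(party_by_year, by = "year")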

 

Immigration

In this first figure we are defining immigration as the number of people actually receiving a green card in any given year. This is the strictest definition of “immigration” and I think it may best be used to show whom the USA was choosing to let in.
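A rough sketch of Figure 1, assuming the joined dataframe has a green_cards column:

ggplot(immigration, aes(x = year, y = green_cards)) +
  geom_col(aes(fill = party)) +
  geom_smooth(method = "lm", colour = "black", se = FALSE) +
  scale_fill_manual(values = c(Democratic = "royalblue", Republican = "firebrick")) +
  labs(x = "Year", y = "Green cards issued", fill = "Party")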

 

Figure 1: Bar charts showing the number of green cards granted each year in the USA. The colour of the bars shows the ruling party at the time of issuance. A linear model is superimposed in black.

 

We may see in Figure 1 that the all-time high for the granting of green cards came during the four-year administration of George Bush Senior. These values are so much higher than those of the other administrations that they pull the linear model fitted to these data up above the more typical trend exhibited by all of the other administrations since Eisenhower. That being said, we actually see a bit of a downturn during the Obama administration, with the largest year of green card issuance under George Bush Junior exceeding any year under Obama. I find that surprising. Figure 1 also shows us that we can’t really directly compare the different administrations, because as populations increase, so too will the number of people who want to immigrate. In a pinch however we may use the residuals from the linear model to give us a slightly better visualisation of how the parties stack up against one another.
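A sketch of that residual figure, assuming no missing years in the green card series:

gc_lm <- lm(green_cards ~ year, data = immigration)

immigration %>%
  mutate(residual = resid(gc_lm)) %>%
  ggplot(aes(x = year, y = residual)) +
  geom_col(aes(fill = party)) +
  scale_fill_manual(values = c(Democratic = "royalblue", Republican = "firebrick")) +
  labs(x = "Year", y = "Green cards issued (residuals)", fill = "Party")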

 

 

Figure 2: The residuals from a linear model fitted to the data shown in Figure 1.

 

It appears as though whenever there was a Bush in office it was much easier to get a green card. And that Democrats generally made it more difficult to do so.

Expulsion

Now that we have seen that it is easier to enter the USA during a Republican presidency, let’s see under which party an illegal immigrant is most likely to be expelled. There are two different classes of expulsion: ‘Removal’ and ‘Return’. Removal means that a legal order was issued to remove the individual. Return means that the individual was likewise not legally in the states, but left of their own volition.

 

 

Figure 3: Two bar charts showing the rates of returns and removals for immigrants from the USA.

 

Figure 3 tells a very interesting story. In the top panel (Returns), we see that from 1953-55 (shortly after WWII) there were massive numbers of immigrants who returned to their home countries voluntarily. Then there is a period of increase leading up to the 80s. This then follows a somewhat normal distribution, peaking in the late 90s near the end of the Clinton administration, before the peaceful return of immigrants drops steadily through the Bush Junior and Obama administrations. The bottom panel (Removals) shows the reason for this apparent relaxation on immigrants. It isn’t that immigrants were being sent away less, but rather that they began to be removed more forcefully than appears to have been the policy before something changed during the Clinton administration. The rate of immigrants being removed became greater than that of those being returned in 2011, under Obama. Again we see a heavy hand on immigration during years with a Democratic president in office. It is hard to compare the parties on this issue, as the policy of forcefully removing immigrants rather than having them leave peacefully has only been in practice over three administrations (Clinton, Bush Jr. and Obama). That being said, it is worth noting that the y axes on these two figures are not the same. The increase in removals does not outweigh the decrease in returns. Overall, the rate of expulsion of immigrants from the States declined during the Obama administration, and perhaps also under Bush Jr.

It is fair criticism to point out that a green card may take several years to acquire. This means that when one begins the process of applying for a green card it may take so long that a different party will be in power by the time it is granted… or not. I would argue however that most of these presidents (with the exception of Carter) were in office for eight-year stretches. This is not meant to be a definitive analysis, but I think it has proven to be a rather interesting first step. I didn’t expect the data to look like this. George Bush Senior, saviour to immigrants, who would have thought.

Emulating ODV Figures In R

Objective

With more and more scientists moving to open source software (e.g. R or Python) to perform their numerical analyses, the opportunities for collaboration increase and we may all benefit from this enhanced productivity. At the risk of sounding sycophantic, the future of scientific research truly is in multi-disciplinary work. What then could be inhibiting this slow march towards progress? We tend to like to stick to what is comfortable. Oceanographers in South Africa have been using MATLAB and ODV (Ocean Data View) since about the time that Jesus was lacing up his sandals for his first trip to Palestine. There has been much debate on the future of MATLAB in science, so I won’t get into that here, but I will say that the package oce contains much of the code that one would need for oceanographic work in R, and the package angstroms helps one to work with ROMS (Regional Ocean Modeling System) output. The software that has however largely gone under the radar in these debates is ODV. Probably because it is free (after registration), its fate has not been sealed by university departments looking to cut costs. The issue with ODV however is the same as with all Microsoft products: the sin of having a “pointy clicky” user interface. One cannot perform truly reproducible research with a menu-driven user interface. The steps must be written out in code. And so here I will lay out the steps necessary to create an interpolated CTD time series of temperature values that looks as close to the default output of ODV as possible.

 

Figure 1: The default output of ODV when plotting temperature by depth through time at one location.

 

Colour palette

Perhaps the most striking thing about the figures that ODV creates is its colour palette. A large criticism of this colour palette is that the range of colours used is not equally weighted visually, with the bright reds drawing one’s eye more than the muted blues. This issue can be addressed with the viridis package, but for now we will stick to an ODV-like colour palette as that is part of our current objective. To create a colour palette that appears close to the ODV standard I used GIMP to extract the hexadecimal colour values from the colour bar in Figure 1.
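A sketch of that palette and of the scatterplot in Figure 2; the hex values below are rough cold-to-warm approximations rather than the exact colours extracted with GIMP, and the ctd dataframe (date, depth, temperature) is an assumed structure.

library(tidyverse)

# Approximate ODV-like colours, ordered cold to warm
ODV_colours <- c("#004dcd", "#0068db", "#14a7f0", "#50d0c2", "#96e997",
                 "#f3e45c", "#ffae34", "#f35e22", "#d7191c")

ggplot(ctd, aes(x = date, y = depth)) +
  geom_point(aes(colour = temperature)) +
  scale_colour_gradientn(colours = ODV_colours) +
  scale_y_reverse() +
  labs(x = NULL, y = "Depth (m)", colour = "Temp.\n(°C)")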

 

 

Figure 2: A non-interpolated scatterplot of our temperature (°C) data shown as a function of depth (m) over time.

 

Interpolating

Figure 2 is a far cry from the final product we want, but it is a step in the right direction. One of the things that sets ODV apart from other visualisation software is that it very nicely interpolates the data you give it. While this looks nice, there is some criticism that may be levelled against doing so. That being said, what we want is a pretty visualisation of our data. We are not going to be using these data for any numerical analyses, so the main priority is that the output allows us to better visually interpret the data. The package MBA already has the necessary functionality to do this, and it works with ggplot2, so we will be using it to get our figure. The interpolation method used by mba.surf() is multilevel B-splines.

In order to do so we will need to dcast() our data (from the reshape2 package) into a wide format so that it simulates a surface layer. spread() from the tidyr package doesn’t quite do what we need, as we want a proper surface map, which is outside of the current ideas on the structuring of tidy data. Therefore, after casting our data wide we will use melt(), rather than gather(), to get the data back into long format so that it works with ggplot2.

It is important to note that mba.surf() transposes the values while performing its calculations, and so creating an uneven grid does not work well. Using the code written below, one will always need to give the same pixel count for the x and y axes.
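A hedged sketch of the interpolation follows. For brevity it passes the three numeric columns straight to mba.surf(), which accepts long xyz input, and then uses melt() to return the estimated surface to long format, so the dcast() round trip is skipped; the 300 x 300 grid is an illustrative choice that keeps the pixel counts equal, and the date column is assumed to have been converted to a numeric value first.

library(reshape2)
library(MBA)

ctd_num <- ctd %>%
  mutate(date = as.numeric(date))

ctd_mba <- mba.surf(ctd_num[c("date", "depth", "temperature")],
                    no.X = 300, no.Y = 300, extend = TRUE)

# melt() the estimated surface back into a long dataframe for ggplot2
ctd_smooth <- melt(ctd_mba$xyz.est$z, varnames = c("date", "depth"),
                   value.name = "temperature") %>%
  mutate(date = ctd_mba$xyz.est$x[date],
         depth = ctd_mba$xyz.est$y[depth])

ggplot(ctd_smooth, aes(x = date, y = depth)) +
  geom_raster(aes(fill = temperature)) +
  geom_contour(aes(z = temperature), breaks = 20, colour = "black") +
  scale_fill_gradientn(colours = ODV_colours) +
  scale_y_reverse() +
  labs(x = NULL, y = "Depth (m)", fill = "Temp.\n(°C)")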

 

 

Figure 3: The same temperature (°C) profiles seen in Figure 2 with the missing values filled in with multilevel B-splines. Note the artefact created in the bottom right corner. The 20°C contour line is highlighted in black.

 

At first glance this now appears to be a very good approximation of the output from ODV. An astute eye will have noticed, however, that the temperatures along the bottom right corner of this figure have been interpolated in a way that does not appear physically possible. It is very unlikely that there would be a deep mixed layer underneath the thermoclines detected during 2015. The reason the splines create this artefact is that they are based on a convex hull around the real data points, and so the interpolating algorithm regresses towards a mean value away from the central point from which the spline is being calculated. Because the thermoclines detected are interspersed between times where the entire water column consists of a mixed layer, mba.surf() is filling in the areas without data as though they were a fully mixed surface layer.

Bounding boxes

There are many ways to deal with this problem, with four possible fixes coming to mind quickly. The first is to set extend = FALSE within mba.surf(). This tells the algorithm not to fill up every part of the plotting area and will alleviate some of the inaccurate interpolation that occurs, but will not eliminate it. The second fix, which would prevent all inaccurate interpolation, would be to limit the depth of all of the temperature profiles to that of the shallowest sampled profile. This is not an ideal fix because we would then lose quite a bit of information from the deeper sampling that occurred from 2013 to 2014. The third fix is to create a bounding box and screen out all of the interpolated data outside of it. A fourth option is to use soap-film smoothing over some other interpolation method, such as a normal GAM, concurrently with a bounding box. This is normally a good choice but does not work well with these data, so I have gone with option three.
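A sketch of option three: the deepest sample of each cast defines the bottom of the bounding box, and any interpolated pixels below that line are screened out (dataframe names follow the sketch above).

# The deepest sampled value per cast defines the box
bounding_box <- ctd_num %>%
  group_by(date) %>%
  summarise(deepest = max(depth))

# Interpolate that bottom edge onto the smooth grid and drop pixels below it
ctd_screened <- ctd_smooth %>%
  mutate(deepest = approx(bounding_box$date, bounding_box$deepest,
                          xout = date, rule = 2)$y) %>%
  filter(depth <= deepest)

The resulting ctd_screened dataframe is then plotted exactly as in the previous chunk to produce Figure 4.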

 

 

Figure 4: The same temperature (°C) profiles seen in Figure 3 with a bounding box used to screen out data interpolated outside of the range of the recordings.

Summary

In this short tutorial we have seen how to create an interpolated temperature-depth profile over time in a fashion very similar to the default output from ODV. We have also seen how to either fill the entire plotting area with interpolated data (Figure 3), or quickly generate a bounding box in order to remove any potential interpolation artefacts (Figure 4). I think this is a relatively straightforward workflow that would work with any tidy dataset. It is worth noting that this is not constrained to depth profiles, but would work just as well with map data. One would only need to change the date and depth variables to lon and lat.

R is an amazingly flexible language with an incredible amount of support and I’ve yet to see it not be able to emulate, or improve upon, an analysis or graphical output from any other software.

Gender and GDP

Objective

Most people living in the Western World are very quick to extol the virtues of gender equality. There are however many places where this is not so. This then inevitably leads to conflict as cultures and nations are drawn closer together on our ever-shrinking Earth. Perhaps not the sort of conflict that leads to sabre-rattling, but certainly ideological disagreements that affect policy and have real impacts on large swathes of humanity. So what is to be done? Who is to say how anyone else should live? There are of course all sorts of moral back-and-forths that could be pursued cyclically ad nauseam, but what if we could show quantitatively what the benefit of gender equality is to a nation and the lives of its people? That is the exact sort of question I like to answer here at this blog, and it is a question that came up in my daily life a couple of weeks ago. Because most metrics for most countries are recorded, this is the sort of thing that can be answered. Indeed, it has been before, so here I add another take on an argument that really shouldn’t still be happening…

Data

In order to quantify the benefit of gender equality we are going to need global census data, as this is a conversation that needs to be had at a global level. Happily for us these data may be found at Clio Infra. This is an amazing web-hosted database that contains an unexpected range of information, as it aggregates data from many other sources. I’ll likely be going back to this data repository on multiple occasions in the future, but for now we’ll stick to gender equality. For this post we will be looking at a number of variables, as this allows us to develop a more nuanced view of the relationships that exist between the wealth of nations and the egalitarianism their cultures engender. It is unfortunately not possible to show causality in the relationships we will model below, but I think that educated readers will likely arrive at the same inferences as I have. Of course, if anyone has any objections, faults or errors to point out in the analysis that follows I would be very happy to hear them.

The data used in this blog were downloaded in “Compact” form, which comes in .xlsx format. These files were then opened and only the tab containing the data in long format was saved as a .csv file. As an aside, these data are visualised on the Clio Infra website as boxplots using what is clearly ggplot2 code. Also note that unless specified otherwise, all values are given as either women/ men or women-men, meaning that proportions below 1, or negative values, signify a greater abundance for men than women in those populations.

I’ve chosen to use the metrics seen below as I think they represent inequality between genders as well as throughout populations as a whole.

 

Analysis

I recently learned how to use the dplyr pipe (%>%) to perform nested models and so am excited to put that to use here. Our first step is to run simple linear models for each metric we have loaded against GDP per capita, without controlling for time (year of sampling) or space (country). The GDP per capita will still be correctly compared against its corresponding metric for that year and in that country, but we are effectively averaging over these dimensions in the results. These models will provide us with a range of R^2 values (coefficients of determination), the statistic showing how much of the variance in a dependent variable is explained by the independent variable. This allows us to see which of our metrics have the strongest relationship with GDP per capita, which we may take as a proxy for the prosperity of a nation and its people.
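A sketch of that nested workflow; the gender dataframe with country, year, metric, value and gdp_per_capita columns is an assumed structure.

library(tidyverse)
library(broom)

metric_models <- gender %>%
  group_by(metric) %>%
  nest() %>%
  mutate(n = map_int(data, nrow),
         model = map(data, ~ lm(gdp_per_capita ~ value, data = .x)),
         glanced = map(model, glance)) %>%
  select(metric, n, glanced) %>%
  unnest(glanced) %>%
  select(n, metric, r.squared, p.value) %>%
  arrange(desc(r.squared))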

 

 

Table 1: R^2 and p-values for the relationship between several metrics and GDP per capita. Year and country of sampling were not controlled for.

     n  metric                       R^2     p
  1162  education_years              0.623   0.000
   215  gender_equality_index        0.523   0.000
   134  gender_equality_education    0.321   0.000
  9779  education_inequality         0.318   0.000
   656  income_inequality            0.106   0.000
   348  gender_equality_numeracy     0.075   0.000
   205  sex_ratio                    0.053   0.001
  1371  total_population            -0.001   0.760

 

It is possible however that these relationships have not been static over time. We will also want to display the changes in the R^2 values as time series.
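The same nested approach can be grouped by year as well, which gives the R^2 time series plotted in Figure 1 (years with too few observations to fit a model are skipped):

metric_models_annual <- gender %>%
  group_by(metric, year) %>%
  filter(n() >= 10) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(gdp_per_capita ~ value, data = .x)),
         glanced = map(model, glance)) %>%
  select(metric, year, glanced) %>%
  unnest(glanced)

ggplot(metric_models_annual, aes(x = year, y = r.squared, colour = metric)) +
  geom_line() +
  labs(x = "Year", y = expression(R^2), colour = "Metric")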

 

Figure 1: Line graph showing the fluctuations in the strength of the relationship (R^2) between the metrics shown in colour and GDP per capita over time.

 

Figure 2: Boxplots showing the distribution of the strength of the relationships for many countries between the metrics and GDP per capita. Blue numbers at the bottom show the total number of data points used to calculate the scores whereas the blue dots show the single R^2 value for a country.

 

Years of education of the populace consistently comes out on top, with gender equality in education in third. These two metrics must be related in some way though, so let’s look specifically at that relationship.

 

Figure 3: Scatterplot showing two different linear model results. The blue line shows the relationship between total years of education within a populace and the gender based equality of those educations. The red line shows how this relationship flattens out when nations with low (below 5 years) levels of education are removed.

 

Results

From Table 1 we may see that the three metrics that best explain the variance occurring in GDP per capita over time and space are, in descending order: education of the population as a whole, equality between the genders, and equality in education between the genders. Equality in education for the populace as a whole came in a close fourth and is perhaps a better predictor than equality between genders due to the much higher sample size that the result is based on (this is an oddly well reported metric). The metrics of income inequality for the populace as a whole, equality in numeracy between genders, and the sex ratio near birth showed significant relationships with GDP per capita but explained little of the variance. The total population of a country has absolutely nothing to do with GDP per capita when time and space are not controlled for.

We see in Figure 1 that of our top four metrics (education_years, gender_equality_index, gender_equality_education and education_inequality), only education_years shows any consistency over time. We also see that education_inequality has become a much better predictor of the GDP per capita of a country since the mid-19th century.

Figure 2 shows that there is a very large range in the relationship between many of these metrics and GDP per capita when we control for country of measurement. Interestingly, the total_population of a country is relevant to its GDP per capita when this information is analysed independently for each country. More important than total_population, however, are the metrics education_years and gender_equality_index.

When we look at the specific relationship between education_years and gender_equality_education it would at first appear, at an R^2 value of 0.518, that this is a strong relationship. Looking more closely at Figure 3 one will notice that this relationship is not linear and that the fit of the linear model is being affected by several scores at very low education levels and equality scores. If one only accounts for scores where the education_years metric is at least 5 years, the strength of the relationship falls dramatically to R^2 = 0.199.

Discussion

If one were satisfied to take all of the GDP data and the several metrics of equality used in this analysis and clump them together, one would be led to believe that the most important thing for a prosperous society is the overall years of education of that country’s citizens. This metric is clearly the most important in Table 1, Figure 1 and Figure 2. Equality in education between the genders is also of great importance, though perhaps half as much so as the overall education of the populace. It follows that a society with more equitable education levels between genders would have a greater proportion of educated people, and this is generally so. However, Figure 3 shows that if one removes scores from nations with particularly low levels of overall education this relationship falls apart. This is because most countries in the world have relatively even levels of education between genders. There is little pattern here because there is little variance in gender_equality_education once the less educated nations have been removed from the calculation.

More important to GDP than equal education for the genders appears to be the overall equality between genders. This value accounts for much more than education alone and appears to better capture how an egalitarian society benefits from the equal treatment of its citizens. Clio Infra states that this metric represents:

The composite index aims to evaluate countries’ performances regarding the progress they made in closing the gender gap in the fields of health, socio-economic resources, politics and household since 1950

When I started this research I had assumed that equal education between genders would clearly be a very important metric in regards to a prosperous society. Surprisingly this appears to not be as concrete as expected. It certainly is important, more so than income inequality, for example, but is not as important as the overall equality between the genders. This finding is deceiving though, because the equality in education for most countries (not just rich ones) is high. It is the overall equality between genders that shows the broader range, and that difference relates strongly to GDP per capita. So a potential recipe for success seems to be to first ensure that as much of a populace as possible (of all genders) receives as much education as possible. It seems likely that this would then fuel gender equality, but the data don’t show this unequivocally. So to really move a nation to the top of the GDP per capita list would also require an active effort to ensure well rounded rights for all genders.

 

Wind Vectors

Objective

As more and more physical scientists (e.g. oceanographers) move to R from other object oriented command line programming languages, such as MATLAB, there will be more and more demand for code showing how to do the basic things they already know how to do in their previous language but don’t yet know how to do in R. Surprisingly, there are many things that should be very easy to find out how to do in R that are not, or at least are not widely publicised. One such example is how to plot wind vectors as a time series. This is a very necessary part of any analysis of the winds or currents in a particular area, making it broadly useful to most climate scientists. Try as I might, I’ve only been able to find one source that gives an example of how to plot wind (or current) vectors as a time series with ggplot2 in R. Having now been asked how to do this by several people, I thought it would be useful to write up my workflow and put it on the internet so that there is one more source that people searching for answers may find.

Data

The data used in this example come from the ERA-Interim reanalysis product. This is an amazing product with dozens of data layers stretched over an even global grid at a native resolution of 0.75°. Here we use the one pixel from this product closest to Cape Town in order to illustrate how to use these data as a time series for a single point in space at a daily resolution for one month (December, 2016). For the purposes of this example we will only be using the following three layers: “2 metre temperature”, “10 metre U wind component” and “10 metre V wind component”. The data used in this post may be downloaded here.

Wind Vectors

The data as they exist within ERA-Interim are already in the format that we need: U and V vectors. Had they not been, we would have needed to convert them. The other common format for wind data is bearing (degrees) and speed (m/s). With this information it is possible to use trigonometry to create the necessary U and V vectors. One may do so with the following chunk of code:
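Something like the following, assuming a wind dataframe with bearing (degrees) and speed (m/s) columns, and that the bearing is the direction the wind is blowing towards; if it is the meteorological convention (the direction the wind is blowing from), both components must be negated.

library(tidyverse)

wind <- wind %>%
  mutate(u = speed * sin(bearing * pi / 180),
         v = speed * cos(bearing * pi / 180))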

Assuming that we now have a u and v column in our wind vector dataframe, and that the speed of these vectors is in metres per second, we may now use the functionality within ggplot2 to create our wind vector time series. It’s almost anti-climactic how straightforward it all is.

Plotting

As we will be using geom_segment() within ggplot2 to create our wind vectors we will want to maintain some control over the output by introducing a scaling object and a default range for the y axis. For now we will just set the wind scalar to 1. We’ll change it later to see what benefit this extra step adds. I have chosen the range of values for the y_axis below based on what I knew would illustrate my point, not through any a priori statistical analysis.

To create a wind vector time series we use the following few lines of code:
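A sketch of those few lines; the wind dataframe is assumed to hold a date column along with the u and v components from ERA-Interim.

wind_scale <- 1
y_axis <- c(-5, 5)

ggplot(wind, aes(x = date, y = 0)) +
  # Each day's wind is drawn as a segment starting at zero
  geom_segment(aes(xend = date + u * wind_scale, yend = v * wind_scale),
               arrow = grid::arrow(length = grid::unit(0.15, "cm")), alpha = 0.7) +
  # Keep one step on the x axis the same size as one step on the y axis
  coord_equal(ylim = y_axis * wind_scale) +
  labs(x = NULL, y = "Wind (m/s)")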

Figure 1: Wind speeds and bearings shown as a time series.

 

Visualising wind data in this way introduces a problem into our figure that we do not normally need to contend with when we create time series figures. That problem is that wind speed/ bearing is a two dimensional value, not one dimensional. Therefore the y axis as well as the x axis must be utilised in order to show the wind speed/ bearing accurately. We find a compromise by using coord_equal() to ensure that every step on the x axis is the same size as on the y axis. This prevents any distortion of vectors that have a mostly east/ west heading.

For particularly windy parts of the world (such as Cape Town) it may be better to scale the wind vectors so as to help them to look a bit more tidy. But if we do this, our x and y axis relationship will no longer be 1 for 1, as with the figure above. That is where our decision to set a static wind scalar and y axis come into play.

 

Figure 2: The same as Figure 1 with an expanded y axis to show all of the vectors.

 

By assigning our scaling variable to an object and using it to control the y axis of our ggplot code we may be sure that no matter how we choose to scale our wind variables they will be correctly represented. This does however limit how large our y axis may be as it must be kept to the same ratio as the x axis. Any larger than the figure shown above and we would start to have unappealingly square figures.

Multiple Y Axes

While it may be frowned upon in the tidyverse, there are still many people who like to use multiple y axes within one figure. I’m not going to get into the pros and cons here, but I will say that of all the possible reasons for using multiple y axes, comparing wind vectors against some other variable is more reasonable than most. This is because the wind vectors themselves do not really have a y axis per se. As we’ve seen above, the actual values on the y axis don’t matter as long as they are shown 1 for 1 against whatever is on the x axis. Therefore we may very easily show temperature on the y axis and still overplot wind vectors without needing a proper second y axis. We will however insert a label onto the figure so as to make it clear how the wind vectors are to be estimated.
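A sketch of that figure, assuming the 2 m temperatures are in a t2m column already converted to °C; the vectors are simply anchored on the temperature line and the label text reflects the chosen scalar.

wind_scale <- 4

ggplot(wind, aes(x = date, y = t2m)) +
  geom_line(colour = "grey40") +
  geom_segment(aes(xend = date + u * wind_scale, yend = t2m + v * wind_scale),
               arrow = grid::arrow(length = grid::unit(0.15, "cm")), alpha = 0.7) +
  annotate("text", x = min(wind$date), y = max(wind$t2m) + 2, hjust = 0,
           label = "4 units = 1 m/s") +
  coord_equal() +
  labs(x = NULL, y = "Temperature (°C)")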

Figure 3: Temperature and wind time series shown on the same y axis. The label in the upper left corner shows the scale of the wind vectors.

 

I’ll leave it to the reader to decide if this is a reasonable way to visualise these data. But I do know that there is some use for displaying the information in this way. For example, one can clearly see that on days where the temperature decreases rapidly there is a south-westerly wind moving over the city. Of course this could still be elucidated were these two figures plotted next to each other via facets, but it should be noted that it isn’t necessary.

Religious Sentiment

Objective

Before we begin, I would like to acknowledge that the framework for this analysis was adapted from a blogpost found on the wonderfully interesting R-bloggers website. The objective of this analysis is to use sentiment analysis on different religious texts to visualise the differences/ similarities between them. This concept is of course fraught with a host of issues, not least of which being detractors who will likely not endeavour to engage in rational debate against the findings of this work. This is of course beyond the control of the experimental design, so I rather focus here on the issue that the translations of the texts used are not modern, and none of them were written in (and generally not intended to be read in) English. Therefore a sentiment analysis of these works will be subject to a certain amount of inaccuracy, because the sentiment database is a recent project and one must assume that the emotions attached to the words in the English language reflect modern sentiment, and not the emotive responses that were necessarily desired at the time of writing/ translation. That being said, the sentiment that these texts would elicit in a reader today (assuming that anyone actually still reads the source material for their faith) would be accurately reflected by the sentiment analysis project, and so this issue is arguably a minor one. As a control group (non-religious text), we will be using Pride and Prejudice, by Jane Austen, simply because it has already been ported into R and is readily accessible in the package ‘janeaustenr’.

For more information on the sentiment analysis project one may find the home page here.

Text Downloads

The following links were used to download the religious texts used in this analysis:

I generally used the first link that DuckDuckGo provided when searching for pdf versions of these books. If anyone knows of better sources for these texts please let me know and I would be happy to update the data source.

Packages

library(pdftools)
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(RmarineHeatWaves)

Analysis

First we convert the PDFs to text and merge them.

## Load the texts
# Pride and Prejudice
pride <- data_frame(txt = prideprejudice,
                    src = "pride")
# Quran
quran <- data_frame(txt = as.character(pdf_text("~/blog/data/EQuran.pdf")), src = "quran")
# Bible
bible <- data_frame(txt = as.character(pdf_text("~/blog/data/KJBNew.pdf")), src = "bible")
# Torah
torah <- data_frame(txt = as.character(pdf_text("~/blog/data/Torah.pdf")), src = "torah")
# Bhagavad Gita
gita <- data_frame(txt = as.character(pdf_text("~/blog/data/BGita.pdf")), src = "gita")
# Book of the dead
dead <- data_frame(txt = as.character(pdf_text("~/blog/data/EBoD.pdf")), src = "dead")
# Combine
texts <- rbind(pride, quran, bible, torah, gita, dead)

After loading and merging the texts we want to run some quick word counts on them.

# Total relevant words per text
total_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  group_by(src) %>%
  summarise(total = n()) %>%
  ungroup()

# Total emotive words per text
emotive_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive" | sentiment == "NA")) %>%
  group_by(src) %>%
  summarise(emotions = n()) %>%
  ungroup()

# Positivity proportion
positivity_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment == "positive" | sentiment == "negative") %>%
  group_by(src, sentiment) %>%
  summarise(emotions = n()) %>%
  summarise(positivity = emotions[sentiment == "positive"] / emotions[sentiment == "negative"]) %>%
  ungroup()

# Combine
word_count <- cbind(total_word_count, emotive_word_count[2], positivity_word_count[2]) %>%
  mutate(proportion = round(emotions/total, 2))

# Lolliplot
ggplot(data = word_count, aes(x = as.factor(src), y = proportion)) +
  geom_lolli() +
  geom_point(aes(colour = positivity)) +
  geom_text(aes(y = 0.45, label = paste0("n = ", total)), colour = "red") +
  scale_y_continuous(limits = c(0, 0.5)) +
  scale_color_continuous("Proportion of\nPositivity", low = "steelblue", high = "salmon") +
  labs(x = "Text", y = "Proportion of Emotive Words")

Figure 1: Proportion of emotive words in each text to their total word counts (shown in red).

 

I find it somewhat surprising that The Egyptian Book of the Dead would have the highest proportion of positive sentiment to negative sentiment.

But let's take a more in-depth look at the sentiment of each text.

# Calculate emotions
emotions <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('King James Version of the New Testament - ', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  filter(!grepl('BHAGAVAD GITA', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive")) %>%
  group_by(src, sentiment) %>%
  summarize(freq = n()) %>%
  mutate(proportion = round(freq/sum(freq), 2)) %>%
  # select(-freq) %>%
  ungroup()

# Boxplot
ggplot(emotions, aes(x = sentiment, y = proportion)) +
  geom_boxplot(fill = "grey70") +
  geom_point(aes(colour = src), position = "jitter", size = 2) +
  labs(x = "Sentiment", y = "Proportion", colour = "Text")

Figure 2: Boxplots of emotion for the texts. Individual scores shown as coloured points.

 

ggplot(emotions, aes(x = sentiment, y = proportion)) +
  geom_polygon(aes(fill = src, group = src), alpha = 0.6, show.legend = FALSE) +
  coord_polar() +
  scale_y_continuous(limits = c(0, 0.3)) +
  facet_wrap(~src) +
  labs(x = "", y = "")

Figure 3: Frequency polygons of the proportion of emotions for each text shown with facets.

 

Results

The frequency polygons shown here most notably outline that all of the texts emphasise trust over anything else. Another interesting pattern is the lack of surprise employed in the texts. The Bhagavad Gita seems to have the smoothest distribution of emotion, with the Quran perhaps coming in second. I found it odd that a novel, Pride and Prejudice, would employ a higher proportion of joyous sentiment than the religious texts against which it was compared. The Quran just barely nudges out the Bhagavad Gita to come out on top for angry and fearful sentiment. The Torah stands out from the other texts due to the amount of disgust expressed therein. After Pride and Prejudice, the Bible uses the largest proportion of words that elicited emotions of anticipation.

There is still much more that may or may not be deduced from these cursory results. Most notably lacking from this analysis is any test for significant differences. The reason for that is that I am not interested in supporting any sort of argument for differences between these texts. With n values (word counts) in excess of 3,000, significant differences between the texts are almost guaranteed, so any use of a p-value here would just be cheap p-hacking anyway.