Religious Sentiment

Objective

Before we begin, I would like to acknowledge that the framework for this analysis was adapted from a blog post found on the wonderfully interesting R-bloggers website. The objective of this analysis is to use sentiment analysis on different religious texts to visualise the differences/similarities between them. This concept is of course fraught with a host of issues, not least of which being detractors who will likely not endeavour to engage in rational debate against the findings of this work. That is beyond the control of the experimental design, so I focus here instead on the issue that the translations of the texts used are not modern, and none of them were written in (or generally intended to be read in) English. A sentiment analysis of these works will therefore carry a certain amount of inaccuracy, because the sentiment database is a recent project and one must assume that the emotions attached to words in the English language reflect modern sentiment, not the emotive responses that were necessarily desired at the time of writing/translation. That being said, the sentiment that these texts would elicit in a reader today (assuming that anyone actually still reads the source material for their faith) would be accurately reflected by the sentiment analysis project, so this issue is arguably a minor one. As a control group (non-religious text), we will use Pride and Prejudice, by Jane Austen, simply because it has already been ported into R and is readily accessible in the package ‘janeaustenr’.

For more information on the sentiment analysis project one may find the home page here.

Text Downloads

The following links were used to download the religious texts used in this analysis:

I generally used the first link that DuckDuckGo provided when searching for pdf versions of these books. If anyone knows of better sources for these texts please let me know and I would be happy to update the data source.

Packages

library(pdftools)
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(RmarineHeatWaves)

Analysis

First we convert the PDFs to text and merge them.

## Load the texts
# Pride and Prejudice
pride <- data_frame(txt = prideprejudice, src = "pride")
# Quran
quran <- data_frame(txt = as.character(pdf_text("~/blog/data/EQuran.pdf")), src = "quran")
# Bible
bible <- data_frame(txt = as.character(pdf_text("~/blog/data/KJBNew.pdf")), src = "bible")
# Torah
torah <- data_frame(txt = as.character(pdf_text("~/blog/data/Torah.pdf")), src = "torah")
# Bhagavad Gita
gita <- data_frame(txt = as.character(pdf_text("~/blog/data/BGita.pdf")), src = "gita")
# Book of the dead
dead <- data_frame(txt = as.character(pdf_text("~/blog/data/EBoD.pdf")), src = "dead")
# Combine
texts <- rbind(pride, quran, bible, torah, gita, dead)

After loading and merging the texts we want to run some quick word counts on them.

# Total relevant words per text
total_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  group_by(src) %>%
  summarise(total = n()) %>%
  ungroup()

# Total emotive words per text
emotive_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  # Keep only the eight emotion categories, dropping unmatched words
  filter(!is.na(sentiment), !sentiment %in% c("negative", "positive")) %>%
  group_by(src) %>%
  summarise(emotions = n()) %>%
  ungroup()

# Positivity proportion
positivity_word_count <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment == "positive" | sentiment == "negative") %>%
  group_by(src, sentiment) %>%
  summarise(emotions = n()) %>%
  summarise(positivity = emotions[sentiment == "positive"] / emotions[sentiment == "negative"]) %>%
  ungroup()

# Combine
word_count <- cbind(total_word_count, emotive_word_count[2], positivity_word_count[2]) %>%
  mutate(proportion = round(emotions/total, 2))

# Lolliplot
ggplot(data = word_count, aes(x = as.factor(src), y = proportion)) +
  geom_lolli() +
  geom_point(aes(colour = positivity)) +
  geom_text(aes(y = 0.45, label = paste0("n = ", total)), colour = "red") +
  scale_y_continuous(limits = c(0, 0.5)) +
  scale_color_continuous("Proportion of\nPositivity", low = "steelblue", high = "salmon") +
  labs(x = "Text", y = "Proportion of Emotive Words")

Figure 1: Proportion of emotive words in each text relative to its total word count (totals shown in red).


I find it somewhat surprising that The Egyptian Book of the Dead would have the highest ratio of positive to negative sentiment.

But let's take a more in-depth look at the sentiment for each text.

# Calculate emotions
emotions <- texts %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl('[0-9]', word)) %>%
  filter(!grepl('King James Version of the New Testament - ', word)) %>%
  filter(!grepl('www.christistheway.com', word)) %>%
  filter(!grepl('BHAGAVAD GITA', word)) %>%
  left_join(get_sentiments("nrc"), by = "word") %>%
  filter(!(sentiment == "negative" | sentiment == "positive")) %>%
  group_by(src, sentiment) %>%
  summarise(freq = n()) %>%
  mutate(proportion = round(freq/sum(freq), 2)) %>%
  # select(-freq) %>%
  ungroup()
# Boxplot
ggplot(emotions, aes(x = sentiment, y = proportion)) +
  geom_boxplot(fill = "grey70") +
  geom_point(aes(colour = src), position = "jitter", size = 2) +
  labs(x = "Sentiment", y = "Proportion", colour = "Text")

Figure 2: Boxplots of emotion for the texts. Individual scores shown as coloured points.


ggplot(emotions, aes(x = sentiment, y = proportion)) +
  geom_polygon(aes(fill = src, group = src), alpha = 0.6, show.legend = FALSE) +
  coord_polar() +
  scale_y_continuous(limits = c(0, 0.3)) +
  facet_wrap(~src) +
  labs(x = "", y = "")

Figure 3: Frequency polygons of the proportion of emotions for each text shown with facets.


Results

The frequency polygons shown here most notably show that all of the texts emphasise trust over any other emotion. Another interesting pattern is how little surprise the texts employ. The Bhagavad Gita seems to have the smoothest distribution of emotion, with the Quran perhaps coming in second. I found it odd that a novel, Pride and Prejudice, would employ a higher proportion of joyous sentiment than the religious texts against which it was compared. The Quran just barely nudges out the Bhagavad Gita to come out on top for angry and fearful sentiment. The Torah stands out from the other texts for the amount of disgust expressed therein. After Pride and Prejudice, the Bible uses the largest proportion of words that elicit anticipation.

There is still much more that may or may not be deduced from these cursory results. Most notably lacking from this analysis is any test for significant differences. The reason for that is that I am not interested in supporting any sort of argument for differences between these texts. With n values (word counts) in excess of 3,000, significant differences between the texts are almost guaranteed anyway, so any use of a p-value here would just be cheap p-hacking.
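To see just how cheap, consider a quick hypothetical comparison of two emotion proportions at roughly these sample sizes (the counts below are made up for illustration, not taken from the texts above):

# Hypothetical counts: 30% vs. 27% of words carrying a given emotion
# in two texts of 3000 relevant words each
prop.test(x = c(900, 810), n = c(3000, 3000))
# The p-value comes out well below 0.05 even though the difference is trivial,
# which is why no significance testing is reported above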

SAWS

I received some rather distressing news the other day. It appears that due to budget cuts the South African Weather Service (SAWS) will no longer be recording coastal seawater temperatures. While seemingly esoteric, this is a real concern for climate change scientists in this country, as SAWS produces the longest-running coastal seawater temperature time series for the country and, as we all know, the longer a time series the better. So I've largely spent the last few days of my working hours preparing meta-data, analyses and figures on the SAWS data to better understand how, and to what degree, we will need to act in order to ensure some continuation of these time series. We are making a plan and it looks like we will be able to largely salvage the situation. But more on that as it progresses.

In more positive news, it looks like we are going to be getting a cordless underwater drill! This is terribly exciting as it means we will no longer have to deploy UTRs via the railway bar method, something I've always detested. This will allow us to improve our field methods drastically and increase our potential range of monitoring, not just for us but potentially for the whole country.

2016

My prolonged absence from this blog has not meant that I've been absent from my studies as well. Barring the usual holiday break for the second half of December, I've been hard at work on two main projects. The first has been the wrapping up of the statistics paper. That has since been submitted to the Journal of Climate and is awaiting the outcome of the review process. Once that was out the door I quickly put together a rough draft for the marine heatwave (MHW) collaboration with Thomas Wernberg, Eric Oliver and, naturally, Albertus Smit. A couple of additional steps have been suggested for the methodology, which must be completed before the results are finalised and the discussion can attain a higher level of impact. This will all hopefully be done within the next few weeks and I'll be on to a new project by the end of February. I need to pick up the pace if I am to make up for the time lost on that monstrous statistics paper.

The bathymetry project is still in the works and will likely be the next project after the MHW one. There are several papers expected to be wrung from the MHW idea, and perhaps they will be useful to put an honours or masters student or two to work on, too.

Can’t keep a good bathymetry figure down

Wowee we wow. It's been a couple of weeks of mad admin and bathymetry. It has consumed my consciousness. The Team Kelp jackets are coming along. A couple of snags here and there, but there is enough momentum behind them now, due to the many interested/involved parties, that they will drag themselves forward inexorably from now on. Knock wood… A couple of diving misadventures were planned and fell flat, too, unfortunately. Haven't blown bubbles for at least a couple of months now. Urrgg. The planning for the West Coast trip started in earnest late last week and has almost finished itself already. So that is a positive result. There are a few bits that will require more direct pressure to come to fruition, so stay tuned on that. But a general quote on the cost exists. I hope SAEON is chipping in on this…

But let's talk about bathymetry! Because who doesn't love that topic?!? After having familiarised myself with the many different ways to plot bathymetry in R, I created the figure posted previously showing a bathy plot and ribbon together for South Africa. I have since learned how to automate this process for anywhere in the world and have applied this code to the four EBUS (eastern boundary upwelling system) cells: the Benguela, Humboldt, Canary and California current upwelling cells. I think these look pretty sweet. They still need international borders and some more tweaking, but that can wait until the final polish is applied before submission for publication. The ultimate goal is to combine the statistics behind these figures with measures of upwelling strength to author a paper on the interactions between shelf characteristics and upwelling strength around the world. That is going to be tasty.
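The automation code itself isn't shown in this post, but the gist of fetching and plotting bathymetry for an arbitrary box looks something like the sketch below. The marmap package and the Benguela bounding box are my assumptions here, not necessarily what was used for the EBUS figures.

library(marmap)

# Download ETOPO bathymetry on demand for a rough box around the Benguela
# upwelling cell (coordinates are approximate placeholders)
benguela <- getNOAA.bathy(lon1 = 5, lon2 = 25, lat1 = -40, lat2 = -15,
                          resolution = 4)

# Quick base-graphics map: shaded bathymetry with land drawn on top
plot(benguela, image = TRUE, land = TRUE, lwd = 0.2)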

I probably won't get my greedy mitts on those data for a while, however. So in the meantime I will return to the previous chapter, as that is still hanging. The statisticians I consulted on my methodology have no strong opinions on the correctness of it… So I suppose that is as good as a green light… Everything has been computed, so I need to revisit the write-up, polish the most recent graphical additions and then push it out the door. I have a list of potential journals and need to get over my avoidance of learning the publishing process. It is unfortunately a mental barrier for me.

Bathymetry III

After a couple more half days plugging away at the issues I had failed to notice on the first pass of the bathymetry coding, I finally sorted out the bugs and the new bathymetry figure is looking tasty. It could use some tweaking here and there, but that will hold for another day.

For now I will focus on kelp research, as all of my other projects are currently on hold, waiting for my contacts to get back to me about this or that. I am hoping that I will see something in the kelp literature to give me an idea for the chapter (or two) in my thesis that must handle kelps, preferably in the context of climate change.

So ya, a pretty productive Tuesday all around.

Bathymetry again

The quest for a good bathymetry figure continues. Over the last couple of days I have put the aforementioned figure together, but it still needs a lot of work. The top panel shows the bathymetry off the coast of SA in delightful shades of blue at a fairly high spatial resolution. This figure takes my computer almost a minute to compile, which is pretty hefty. The bottom panel shows the shelf width as calculated by my shelf width function. The width values are geolocated correctly along the coast, as can be seen by matching up the labelled cities on the upper panel with the labels on the x axis in the lower panel. One doesn't need to look very closely to see that these values are far off from where they should be… It turns out my shore-normal transects aren't so normal after all. I need to go back to the drawing board and actually plot the shore-normal transects to see what they are doing, because whatever it is, it isn't right.

So that is the next step in this process: straighten out the transects. This means that all of the statistics I calculated previously were not correct either, so I will need to start over from the beginning. But all of the code that comes afterwards will still be usable, so it's not much of a setback really. The odd thing is that I used code I had previously written to create shore-normal transects, which had worked quite well, so it is strange that it isn't working now. It's probably just some small hiccup somewhere. Also, it would probably help if I smoothed out the coastline shapefile before calculating the shore-normal transects, as sketched below. That will be the first step in the redo. And then I need to make a plan for when the shore-normal transects are in a bay and so don't properly project out over the shelf edge. But that's enough bathymetry for now. Going to do a bit more research on kelp to give the coding side of my brain a rest.
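A rough sketch of that smoothing step, assuming the coastline is stored as an ordered data frame of lon/lat vertices (the object layout and window width below are placeholders, not the actual values used):

library(zoo)

# Running-mean smoother over the coastline vertices; k controls how
# aggressively the coastline is smoothed before shore-normal transects
# are calculated
smooth_coast <- function(coast, k = 9) {
  data.frame(
    lon = rollmean(coast$lon, k = k, fill = NA, align = "center"),
    lat = rollmean(coast$lat, k = k, fill = NA, align = "center")
  )
}

southern_africa_smooth <- na.omit(smooth_coast(southern_africa_coast, k = 9))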


Bathymetry

The code for the bathymetry function I have had knocking around in my head has been brought into existence. It is still pretty rough, but it runs. It could probably be sped up a decent amount, as running it on ~600 lines of data takes about 12 minutes… Not great. Besides that, it's also completely dependent on a coastline shapefile and a bathymetry .asc file saved on my computer, meaning one can only calculate shelf width and slope within the boundaries of the southern African coast. The ideal situation would be for this to be able to run on any place in the world and to use downloading on demand to retrieve the bathy data. Or perhaps I could just get the bathy data for the whole world and save that as an .RData file… That could work actually. Hmm, something to look into.

Anyway, the code shows how to calculate the width of the continental shelf (defined here as everything above -200 m) at a certain point along the coast. This is done by drawing a shore-normal transect from the point. Besides shelf width, this function also calculates how wide the shelf slope is (-200 to -1000 m) and uses this to derive the angle of the continental shelf. This last bit I am not 100% sure is calculated correctly, as the angles are exceedingly small, with the largest being only 0.2 degrees whereas the global average is 4.0 degrees. Though perhaps I am calculating this differently.


shelfWidthSlope <- function(site){
  # Requires knnx.index() from FNN, earth.bear() from fossil, destPoint() from
  # geosphere, and the deg2rad()/gcd.hf() haversine helpers defined elsewhere
  # Find the site on the coastline and its nearest neighbour points
  coords <- data.frame(lon = site$lon, lat = site$lat)
  coords2 <- knnx.index(southern_africa_coast[,1:2], as.matrix(coords), k = 1)
  # The index range below controls how wide the spread of points used to
  # determine the shore-normal transect is; a wider spread gives a more even,
  # less sensitive transect (a spread of +/- 1 point is assumed here)
  coords3 <- data.frame(site = site$site,
                        southern_africa_coast[(coords2 - 1):(coords2 + 1), 1:2])
  coords3 <- coords3[2:1, 1:3]
  # Define the shore-normal transect bearing
  heading <- earth.bear(coords3[1,2], coords3[1,3], coords3[2,2], coords3[2,3]) + 90
  if(heading >= 360){
    heading <- heading - 360
  }
  # Find the depths along this bearing for 300 km
  distances <- seq(from = 0, to = 300000, by = 1000)
  distances2 <- destPoint(p = coords, b = heading, d = distances)
  sitesIdx <- knnx.index(bathy[,1:2], as.matrix(distances2), k = 1)
  bathy2 <- bathy[sitesIdx,]
  bathy2 <- bathy2[complete.cases(bathy2$depth),]
  if(length(bathy2$depth) < 2){
    info <- data.frame(site = site$site, coords, shelf = NA, slope = NA,
                       depth_diff = NA, angle = NA)
  } else {
    bathy3 <- data.frame(sapply(bathy2[1:2], deg2rad), depth = bathy2$depth)
    # Find the point at which the depth first hits/ exceeds 200 m
    shelf_end <- bathy3[bathy3$depth <= -200,][1,]
    # Calculate how far it takes to reach the 200 m isobath
    shelf_width <- gcd.hf(bathy3$lon[1], bathy3$lat[1], shelf_end$lon, shelf_end$lat)
    # Find the point at which the depth first hits/ exceeds 1000 m
    slope_end <- bathy3[bathy3$depth <= -1000,][1,]
    # Calculate how far it takes to reach the 1000 m isobath
    slope_width <- gcd.hf(shelf_end$lon, shelf_end$lat, slope_end$lon, slope_end$lat)
    # From this calculate the shelf slope angle
    # (NB: as mentioned above, I am not sure this is correct; atan() of depth
    # over width in consistent units may be what is wanted)
    slope_depth <- abs(slope_end$depth - shelf_end$depth)
    slope_angle <- tan(slope_width/slope_depth)
    # Put it all together
    info <- data.frame(site = site$site, coords, shelf = shelf_width,
                       slope = slope_width, depth_diff = slope_depth,
                       angle = slope_angle)
  }
  return(info)
}
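As a quick back-of-the-envelope check on the angle doubt raised above (my own sketch, not necessarily how it should be done in the chapter): if the slope angle is taken as atan(rise/run) with rise and run in the same units, a fairly typical shelf break gives a value in the same ballpark as the ~4 degree global average.

# Hypothetical slope: depth drops 800 m over a horizontal distance of 20 km
slope_width_km <- 20
slope_depth_m  <- 800

# atan() of rise over run (both in metres), converted from radians to degrees
slope_angle_deg <- atan(slope_depth_m / (slope_width_km * 1000)) * 180 / pi
slope_angle_deg  # ~2.3 degrees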

Side projects

It has been a week now with no word back from the statisticians. But I certainly haven't been sitting still in the meantime. The figures I referenced in the previous post have been completed. In rough form for the moment, but the data and base code are there. When I have a better idea of how it will all tie together I will polish them off.

After completing as much of my chapter as I could, I moved on to my list of side projects. First up was how to inset map figures within one another (see the sketch below). I also gave my whole workflow for coastal and country border mapping a major overhaul. This really helped to improve the quality of my land/sea based figures. This was then applied to one of the figures that AJ is using for his upcoming publication. It looks real nice. In between these mapping projects I have also been organising the Team Kelp jackets. The logos are all done. Just waiting to hear back on quotes from two more companies.
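The inset trick itself is simple enough to sketch with ggplot2's annotation_custom(); the data source and coordinates below are placeholders rather than the maps actually used for AJ's figure.

library(ggplot2)

# Requires the maps package for map_data(); coordinates below are rough
sa <- map_data("world", region = "South Africa")

base_map <- ggplot(sa, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey70", colour = "black")

# Zoomed-in inset of the False Bay area, framed with a border
inset <- base_map +
  coord_fixed(xlim = c(18.2, 19.0), ylim = c(-34.5, -33.8)) +
  theme_void() +
  theme(panel.border = element_rect(colour = "black", fill = NA))

# Drop the inset into the north-eastern corner of the main map
base_map +
  coord_fixed() +
  annotation_custom(ggplotGrob(inset),
                    xmin = 28, xmax = 33, ymin = -26, ymax = -22)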

The current side project I am working on is another AJ idea. It is important to know how wide/steep a section of continental shelf may be, as this can have an effect on the local oceanography of an area. And this is a central aspect of the kelps and climate change question. So it is up to me to figure out how to calculate this in a very snazzy way that can be applied anywhere in the world. I have found a bathymetry database that can be accessed via a download function within an R script, so the first step is sorted. Now I just need to figure out how to load the .tiff file into R so that I can then calculate shore-normal transects to find shelf widths and slopes for anywhere in the world, all wrapped up in one tidy function. It's all rather exciting really, as no one has done this yet, for some reason.
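Loading the downloaded file should be the easy part. Assuming it is a GeoTIFF, the raster package is one way to do it (the file name below is a placeholder; the bathymetry source isn't named in this post):

library(raster)

# Read the downloaded bathymetry GeoTIFF into R as a RasterLayer
bathy_grid <- raster("~/blog/data/global_bathy.tif")

# Convert to a lon/lat/depth data frame for use with the transect code
bathy_df <- as.data.frame(bathy_grid, xy = TRUE)
names(bathy_df) <- c("lon", "lat", "depth")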

Leaps and bounds

It has been a wild ride these past couple of weeks. After the CV/PSE issue (see previous post) I spent a week digging deep into the literature to make a plan on how to move forward. I also contacted the author of the fishmethods package that contains the powertrend() function, who was of little help. I found some very good papers that changed the focus and approach of my chapter in good ways, regardless of what happens with the variance issue. Ultimately I have settled on a new set of methods that are more appropriate to the question at hand. But then it was time to get ready for my CapeR user group talk.

So the previous week I got a start on that while simultaneously trying to finish up all the new methodology, figures and corrections to my poor, limping, beat-up chapter, now barely a semblance of its previous self, but one that will be stronger once the bandages are removed. Spending perhaps a bit too much time on the chapter, and not enough on my talk, I didn't really finish my presentation until the car ride to the venue at UCT. And even then it was something of a compromise. One can view the CapeR slides here. Ultimately it went well and I made some rather valuable contacts.

The most relevant to this blog being a couple of statisticians, one of whom is a consultant for the stats department at UCT. Very exciting. I have summed up my issue for her and am waiting to hear back from her and/or her colleague while she sees if anyone has previously dealt with this problem. Best case, they will get back to me by the end of the week with an answer about what the correct variance is, or at least set up a time to meet to go over the issue in more detail. Worst case, the power analysis of trend will be all wrong and I'll need to redo/re-plan the majority of it again. But at least I have thought of a couple of clever new figures in the meantime. They still need a bit of polish and I'll have that wrapped up in the next couple of days. If the statisticians don't get back to me by then, I will progress to the next chapter for now: comparing all of the different SST time series to the in situ time series.

Preparing for launch… and then

While adding information gleaned from several new papers I have read over the last week, I realised that I had made an error in my analyses. The power analysis of trend, developed by Gerrodette in 1987, is a method for estimating the power of a time series to detect a trend based on simple linear regression models. The coefficient of variation (CV), which is the standard deviation (SD) divided by the mean, is central to this analysis. I noticed this evening that I had been rounding this value incorrectly, and when I corrected this flaw it made all of the lower precision data unusable, as they were no longer able to accurately show any change. I have now realised that I have been calculating these statistics incorrectly as well, in that I have been rounding these smaller values by decimal places rather than significant figures.

Rounding by significant figures is different in that the information stored within a number means more than simply what the digits are; their placement around the decimal point, and how many zeroes sit in front of any non-zero digits, is relevant too. So for data collected with a precision of 0.1, the statistic must still retain its first significant figure even when that figure falls beyond the first decimal place. But I have incorrectly been rounding by decimal places this whole time, meaning that if the SD of a time series with a precision of 0.1 is found to be 0.043443778…, I have been rounding it down to 0.0, as this is the closest value at a precision of 0.1. However, it would be more appropriate to round this value to 0.04. If the precision had been 0.01, then it would be best to round the previous example to 0.043, and not 0.04 as I would have done previously. This is a pretty stupid mistake to have made, and fixing it is really going to allow my analyses to run a lot more smoothly… but then I noticed something even worse.
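Before getting to that: in R terms, the rounding distinction above is simply round() versus signif(), illustrated here with the SD quoted in the example.

x <- 0.043443778

# Rounding by decimal places (what I had been doing)
round(x, 1)   # 0.0  -- all information lost at a precision of 0.1
round(x, 2)   # 0.04

# Rounding by significant figures (what should be done)
signif(x, 1)  # 0.04
signif(x, 2)  # 0.043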

As mentioned previously, CV is at the heart of the power analysis of trend. This is clearly explained in Gerrodette (1987, 1991), and I assumed it was the case with the function from the fishmethods package in R. Boy was I wrong! 🙁 It turns out they use proportional standard error (PSE), which is calculated by taking the standard error (SE) and dividing that by the sample size. This number is much, much smaller than the SD for these same time series and so will likely cause the power of every time series to increase greatly. Why the authors of this package chose to use PSE rather than CV is confusing and unclear, and frankly rather annoying. I must now redo all of my analyses and re-write my entire paper. Thankfully I have done everything in R, so the analyses should be able to be reworked within a few hours, and much of the re-writing will be minor. But the work is there to do and must be done.