A brilliant grad school friend of mine recently asked if I knew anything about text extraction. From there she unknowingly opened a can of worms for me!
In this blog post, I’d like to show you how you can analyze text from a PDF using the pdftools package. This process is helpful for many types of analyses, but we’ll use it to do some basic text frequency analyses.
First things first, we need an article. I chose one that aligns with something I’m particularly passionate about: force-free dog training. If you’d like to learn more about how to use science-backed training methods that build a positive relationship between you and your dog, check out Zak George’s channel for starters.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pdftools)
Using poppler version 23.08.0
library(tidytext)
library(knitr)
# tell R the name of your pdf file
dogs_pdf <- "dog-training-methods-review.pdf"
There are a few handy functions that we can play around with. Here, I’m using pdf_text() to convert the text on each page of the document to one long string, which gives us a character vector with one element per page. Using cat() we can take a look at one of those pages.
dogs_text <- pdf_text(dogs_pdf)
cat(dogs_text[1])
Review of dog training methods December 2018
March 2023
Prepared for the British Columbia Society for the Prevention of Cruelty to Animals Author:
I.J. Makowska, M.Sc., Ph.D.
Updated by: C.M. Cavalli, Ph.D.
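To see the one-string-per-page structure in isolation, here is a minimal sketch that doesn’t need the report at all. It assumes we can write a throwaway two-page PDF with base R’s pdf() device (the file and its text are made up for illustration) and then read it back with pdf_text().

```r
library(pdftools)

# Hypothetical stand-in for a real document: a temporary two-page PDF
toy_pdf <- tempfile(fileext = ".pdf")
pdf(toy_pdf)
plot.new(); text(0.5, 0.5, "Page one: reward-based training")
plot.new(); text(0.5, 0.5, "Page two: aversive methods")
dev.off()

# pdf_text() returns a character vector with one element per page
toy_text <- pdf_text(toy_pdf)

length(toy_text)  # number of pages read
cat(toy_text[2])  # print only the second page
```

Indexing the vector, as in toy_text[2], is the same trick used to pull out single pages of the report.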
cat(dogs_text[69])
Review of dog training methods March 2023
Dog Welfare (continued)
Study Study type Sample size Task Outcomes
Todd, 2018 Review N/A N/A Barriers to the use of humane training methods:
Disagreeing positions of animal behaviour and veterinary organizations and dog trainers may contribute
to the idea that there is a lack of consensus on appropriate methods.
Lack of knowledge of the welfare risks
Lack of theoretical and practical knowledge of dog training
Poor quality of the information available to guardians
Lack of regulations for dog trainers
Wiliams & Blackwell, 2019 Survey 630 Various Predictors of current use and reported future intention of using positive reinforcement methods:
Perceived efficacy of the method
Guardians perceived ability to effectively implement the method
Woodward et al., 2021 Survey 2154 Various At 16 weeks
guardians in 99.7% of the guardians reported the intention to use positive reinforcement and/or negative
the UK or punishment
Ireland with 84.1% intended to use positive punishment and/or negative reinforcement
puppies <16 15.6% could be classified as reward only
weeks 12.9% could be classified as using a mix of reward and aversive-based training
976 of them At 9 months
completed a 99.7% of the guardians reported using positive reinforcement and/or negative punishment
follow-up 74.2% used positive punishment and/or negative reinforcement
survey at 9 25.8% could be classified as reward only
months. 29.2% could be classified as using a mix of reward and aversive-based training
Guardian factors that increased the likelihood of using both reward and aversive-based training at 9
months:
Males
Age > 55 years
Not having dog related employment
Not having attended a training class in the 2 months before completing the questionnaire
Prepared for the BC SPCA by I.J. Makowska & updated by C.M. Cavalli 69
It’s a bit messy. So instead, we can use pdf_data(), which returns a list of data frames, one per page. The first line, which I commented out, returns those data frames, including information about the location of text on the page and the content of the text (i.e., the words). Here’s an example of the first page.
We can use dplyr to combine the data, that is, bind all of the rows from our 69 tibbles (one per page) together into one big tibble. Now you see we have 25,052 rows.
dogs_text <- pdf_data(dogs_pdf) %>%
  bind_rows() %>%
  select(text) # I only want the column containing the words
str(dogs_text)
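If you’d like to see what bind_rows() is doing here, this small sketch uses two hand-made tibbles as stand-ins for two pages of pdf_data() output. The toy_pages and toy_words names and their contents are hypothetical, and I’ve added the .id argument, which the code above doesn’t use, to show one way to keep track of which page each word came from.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for two pages of pdf_data() output.
# Real output has more columns (width, height, x, y, space, text).
toy_pages <- list(
  tibble(x = c(72L, 110L), y = c(40L, 40L), text = c("Review", "of")),
  tibble(x = c(72L, 105L, 160L), y = c(40L, 40L, 40L),
         text = c("Dog", "Welfare", "(continued)"))
)

toy_words <- toy_pages %>%
  bind_rows(.id = "page") %>%  # .id records which list element each row came from
  select(page, text)           # keep the page label and the words

nrow(toy_words)  # total rows across both pages
```

Keeping a page column is optional, but it can be handy later if you want word counts per page rather than for the whole document.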
Now that we have our data, we need to tidy it up! Literally. Check out Text Mining with R: A Tidy Approach if you want to dig deeper into text mining.
Our data is already unnested, meaning each word is in a row. We do not have sentences. This is one-token-per-document-per-row. A “token” is a meaningful unit of text. Sometimes the unit of text you’d like to analyze is a sentence or phrase, but more commonly, you’ll want your tokens to be words.
One thing we will want to do is remove punctuation from our tokens. Leftover punctuation could cause trouble when we want to aggregate by tokens (e.g., get word counts), because “methods” and “methods,” would be counted as different words. So, while we already mostly have unnested tokens, the unnest_tokens() function will also strip punctuation and convert tokens to lowercase. We’ll also take this opportunity to remove stop words (the little ones that turn out to be most frequent in spoken and written language, such as “the,” “and,” “to”). Just what we needed!
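Here’s a minimal sketch of that step on a handful of made-up tokens rather than the report itself. It assumes the usual tidytext recipe: unnest_tokens() to clean the tokens, an anti_join() against the stop_words data set that ships with tidytext, and count() to tally what’s left. The toy_words and toy_counts names are just for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy tokens standing in for the report's words (not the real data)
toy_words <- tibble(
  text = c("Aversive", "training", "methods,", "the", "dogs", "and", "training.")
)

toy_counts <- toy_words %>%
  unnest_tokens(word, text) %>%           # lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%  # drops stop words such as "the" and "and"
  count(word, sort = TRUE)                # most frequent words first

toy_counts
```

Because the punctuation is stripped before counting, “training” and “training.” end up in the same row of the tally.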
Color me shocked at this list! Seems like these researchers have a lot to say about dog training methods, shock collars, guardians, and aversive.
Now there are so many things you can do with this tidy text. It’s ready for cross-text comparisons, plots of frequencies or proportions, and deeper digging to answer your text-based questions.
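As one possible starting point for the plotting idea, here’s a rough sketch that turns word counts into proportions and draws a bar chart with ggplot2. The counts below are invented for illustration (they are not results from the report), and toy_counts, toy_props, and p are hypothetical names.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up counts for illustration only (not results from the report)
toy_counts <- tibble(
  word = c("training", "dogs", "methods", "aversive"),
  n    = c(8L, 6L, 4L, 2L)
)

toy_props <- toy_counts %>%
  mutate(prop = n / sum(n))  # each word's share of all counted tokens

# Horizontal bars, ordered so the most frequent word is at the top
p <- ggplot(toy_props, aes(x = prop, y = reorder(word, prop))) +
  geom_col() +
  labs(x = "Proportion of tokens", y = NULL)

p
```

Proportions rather than raw counts make it easier to compare documents of different lengths.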