From a PD(woo)F to Text Frequencies

A brilliant grad school friend of mine recently asked if I knew anything about text extraction. With that question, she unknowingly opened a can of worms for me!

In this blog post, I’d like to show you how you can analyze text from a PDF in R using the pdftools package. This process is helpful for many types of analyses, but we’ll use it to do some basic word frequency analysis.

First things first, we need an article. I chose one that aligns with something I’m particularly passionate about: force-free dog training. If you’d like to learn more about how to use science-backed training methods that build a positive relationship between you and your dog, check out Zak George’s channel for starters.

Here’s the document we’re going to be using.

pdftools for text extraction

# packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(pdftools)

Using poppler version 23.08.0

library(tidytext)
library(knitr)

# tell R the name of your pdf file
dogs_pdf <- "dog-training-methods-review.pdf"

There are a few handy functions that we can play around with. Here, I’m using pdf_text() to convert the text on each page of the document into one long string, which gives us a character vector with one element per page. Using cat() we can take a look at a couple of those pages.

dogs_text <- pdf_text(dogs_pdf)
cat(dogs_text[1])

Review of dog training methods                                                   December 2018

                                         March 2023

   Prepared for the British Columbia Society for the Prevention of Cruelty to Animals Author:
                                  I.J. Makowska, M.Sc., Ph.D.
                                Updated by: C.M. Cavalli, Ph.D.

cat(dogs_text[69])

      Review of dog training methods                                                                                                                               March 2023

                                                                                Dog Welfare (continued)
               Study              Study type     Sample size             Task                                                         Outcomes
Todd, 2018                   Review             N/A            N/A                      Barriers to the use of humane training methods:
                                                                                         Disagreeing positions of animal behaviour and veterinary organizations and dog trainers may contribute
                                                                                        to the idea that there is a lack of consensus on appropriate methods.
                                                                                          Lack of knowledge of the welfare risks
                                                                                          Lack of theoretical and practical knowledge of dog training
                                                                                          Poor quality of the information available to guardians
                                                                                          Lack of regulations for dog trainers


Wiliams & Blackwell, 2019    Survey             630            Various                  Predictors of current use and reported future intention of using positive reinforcement methods:
                                                                                         Perceived efficacy of the method
                                                                                         Guardians perceived ability to effectively implement the method

Woodward et al., 2021        Survey             2154           Various                  At 16 weeks
                                                guardians in                             99.7% of the guardians reported the intention to use positive reinforcement and/or negative
                                                the UK or                               punishment
                                                Ireland with                             84.1% intended to use positive punishment and/or negative reinforcement
                                                puppies <16                              15.6% could be classified as reward only
                                                weeks                                    12.9% could be classified as using a mix of reward and aversive-based training

                                                976 of them                             At 9 months
                                                completed a                              99.7% of the guardians reported using positive reinforcement and/or negative punishment
                                                follow-up                                74.2% used positive punishment and/or negative reinforcement
                                                survey at 9                              25.8% could be classified as reward only
                                                months.                                  29.2% could be classified as using a mix of reward and aversive-based training

                                                                                        Guardian factors that increased the likelihood of using both reward and aversive-based training at 9
                                                                                        months:
                                                                                         Males
                                                                                         Age > 55 years
                                                                                         Not having dog related employment
                                                                                         Not having attended a training class in the 2 months before completing the questionnaire

      Prepared for the BC SPCA by I.J. Makowska & updated by C.M. Cavalli                                                 69
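Each element returned by pdf_text() is a single string in which the lines of the page are separated by newline characters. If you ever want to stay with this output, a first step could be splitting a page into lines. Here is a minimal sketch in base R, where `page` is a made-up stand-in for one element of dogs_text, not the real page:

```r
# `page` is a made-up stand-in for one element of pdf_text() output
page <- "Review of dog training methods        March 2023\n\n   Dog Welfare (continued)\n"

# Split on newlines, trim the padding, and drop empty lines
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
page_lines <- trimws(page_lines)
page_lines <- page_lines[page_lines != ""]

print(page_lines)
```

This only separates lines. It does nothing to untangle table columns like the ones on page 69.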

The page text is a bit messy, especially where the PDF contains tables. So instead, we can use pdf_data(), which returns a list of data frames, one per page, with one row per word. Each row records the word itself along with its size and position on the page. The first line below, which I commented out, would return the data frames for every page. The second shows the first five rows for page one.

#pdf_data(dogs_pdf)
pdf_data(dogs_pdf)[[1]] %>% 
  head(5) %>% 
  kable()

width	height	x	y	space	text
31	11	72	32	TRUE	Review
9	11	106	32	TRUE	of
16	11	117	32	TRUE	dog
35	11	137	32	TRUE	training
38	11	174	32	FALSE	methods
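The x and y columns are what make the position information useful. As a rough sketch, here is one way the lines of a page could be rebuilt from word positions. The `words` tibble is made up in the shape of pdf_data() output, and the approach assumes that words on the same line share exactly the same y value, which holds for this toy data but may not hold for every PDF:

```r
library(dplyr)
library(tibble)

# Made-up rows in the shape of pdf_data() output (only the columns used here)
words <- tibble(
  x    = c(117, 72, 106, 72, 110),
  y    = c(32, 32, 32, 60, 60),
  text = c("dog", "Review", "of", "March", "2023")
)

# Order words top-to-bottom and left-to-right, then paste each line together
page_lines <- words %>%
  arrange(y, x) %>%
  group_by(y) %>%
  summarise(line = paste(text, collapse = " "), .groups = "drop")

print(page_lines)
```

We don’t need this for word frequencies, so the rest of the post only keeps the text column.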

We can use dplyr’s bind_rows() to stack the rows of our 69 page-level tibbles into one big tibble. The result has 25,052 rows, one for each word in the document.

dogs_text <- pdf_data(dogs_pdf) %>% 
  bind_rows() %>% 
  select(text) # I only want the column containing the words

str(dogs_text)

tibble [25,052 × 1] (S3: tbl_df/tbl/data.frame)
 $ text: chr [1:25052] "Review" "of" "dog" "training" ...
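If you want to remember which page each word came from, bind_rows() has an .id argument that adds an identifier column. The code above does not use it. This is just a sketch on a made-up list called `pages` that stands in for the output of pdf_data():

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the list that pdf_data() returns: one tibble per page
pages <- list(
  tibble(text = c("Review", "of", "dog")),
  tibble(text = c("training", "methods"))
)

# Stack the pages; .id adds a column recording which list element each row came from
all_words <- bind_rows(pages, .id = "page")

print(all_words)
```

A page column like this would let you count words per page later on, instead of only across the whole document.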

Tidy text

Now that we have our data, we need to tidy it up! Literally. Check out Text Mining with R: A Tidy Approach if you want to dig deeper into text mining.

Our data is already unnested: each row holds a single word rather than a sentence. This is the one-token-per-document-per-row format. A “token” is a meaningful unit of text. Sometimes the unit of text you’d like to analyze is a sentence or phrase, but more commonly, you’ll want your tokens to be words.

One thing we will want to do is remove punctuation from our tokens. Left in place, punctuation causes trouble when we aggregate by token (e.g., “dog” and “dog,” would be counted as two different words). So, while our tokens are already mostly unnested, we’ll still run them through unnest_tokens(), which strips punctuation and converts tokens to lowercase. Just what we needed! We’ll also take this opportunity to remove stop words (the little ones that turn out to be most frequent in spoken and written language, such as “the,” “and,” “to”) with an anti_join() against the stop_words data set from tidytext.

data("stop_words")
dogs_tidy <- dogs_text %>% 
  unnest_tokens(input = text, output = word) %>% 
  anti_join(stop_words)

Joining with `by = join_by(word)`

dogs_tidy %>% 
  head(5) %>% 
  kable()

word
review
dog
training
methods
december
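To see more clearly what those two steps do, here is a minimal sketch on a made-up tibble called `toy`. Its words are invented for illustration, with some capital letters and punctuation thrown in:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up tokens with mixed case and trailing punctuation
toy <- tibble(text = c("Review", "of", "Dog", "training,", "methods."))

toy_tidy <- toy %>%
  unnest_tokens(output = word, input = text) %>%  # lowercases and strips punctuation
  anti_join(stop_words, by = "word")              # drops stop words such as "of"

print(toy_tidy)
```

Passing by = "word" explicitly is optional. It just silences the “Joining with” message you saw above.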

Question time

I know you’re dying to know what words these researchers used most in their report of dog training methods.

dogs_tidy %>% 
  count(word, sort = TRUE) %>% 
  head(10) %>% 
  kable()

word	n
training	388
dogs	387
dog	274
methods	248
collar	219
collars	208
shock	176
based	171
aversive	136
guardians	134

Color me shocked at this list! Seems like these researchers have a lot to say about dog training methods, shock collars, guardians, and aversive-based training.

There are so many things you can do with this tidy text. It’s now ready for cross-text comparisons, plots of frequencies or proportions, and deeper digging into your text-based questions.
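As one small example of a next step, word counts can be turned into proportions with a single mutate(). This is only a sketch: `toy_tidy` below is a made-up set of tokens, not the real dogs_tidy data.

```r
library(dplyr)
library(tibble)

# Made-up tidy tokens, one word per row
toy_tidy <- tibble(word = c("dog", "training", "dog", "collar", "dog", "training"))

toy_freq <- toy_tidy %>%
  count(word, sort = TRUE) %>%         # word counts, most frequent first
  mutate(proportion = n / sum(n))      # share of all tokens for each word

print(toy_freq)
```

Proportions are handy when comparing documents of different lengths, since raw counts would favor the longer one.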