From a PD(woo)F to Text Frequencies

Author: Jayde Homer
Published: November 10, 2023

A brilliant grad school friend of mine recently asked if I knew anything about text extraction. From there she unknowingly opened a can of worms for me!

In this blog post, I’d like to show you how you can analyze text from a PDF using the package pdftools. This process is helpful for many types of analyses, but we’ll use it to do some basic text frequency analyses.

First things first, we need an article. I chose one that aligns with something I’m particularly passionate about: force-free dog training. If you’d like to learn more about how to use science-backed training methods that build a positive relationship between you and your dog, check out Zak George’s channel for starters.

Here’s the document we’re going to be using.

pdftools for text extraction

# packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pdftools)
Using poppler version 23.08.0
library(tidytext)
library(knitr)

# tell R the name of your pdf file
dogs_pdf <- "dog-training-methods-review.pdf"

There are a few handy functions that we can play around with. Here, I’m using pdf_text() to convert the text on each page of the document to one large string, which gives us a character vector with one element per page. Using cat() we can take a look at one of those pages.

dogs_text <- pdf_text(dogs_pdf)
cat(dogs_text[1])
Review of dog training methods                                                   December 2018

                                         March 2023

   Prepared for the British Columbia Society for the Prevention of Cruelty to Animals Author:
                                  I.J. Makowska, M.Sc., Ph.D.
                                Updated by: C.M. Cavalli, Ph.D.
cat(dogs_text[69])
      Review of dog training methods                                                                                                                               March 2023

                                                                                Dog Welfare (continued)
               Study              Study type     Sample size             Task                                                         Outcomes
Todd, 2018                   Review             N/A            N/A                      Barriers to the use of humane training methods:
                                                                                         Disagreeing positions of animal behaviour and veterinary organizations and dog trainers may contribute
                                                                                        to the idea that there is a lack of consensus on appropriate methods.
                                                                                          Lack of knowledge of the welfare risks
                                                                                          Lack of theoretical and practical knowledge of dog training
                                                                                          Poor quality of the information available to guardians
                                                                                          Lack of regulations for dog trainers


Wiliams & Blackwell, 2019    Survey             630            Various                  Predictors of current use and reported future intention of using positive reinforcement methods:
                                                                                         Perceived efficacy of the method
                                                                                         Guardians perceived ability to effectively implement the method

Woodward et al., 2021        Survey             2154           Various                  At 16 weeks
                                                guardians in                             99.7% of the guardians reported the intention to use positive reinforcement and/or negative
                                                the UK or                               punishment
                                                Ireland with                             84.1% intended to use positive punishment and/or negative reinforcement
                                                puppies <16                              15.6% could be classified as reward only
                                                weeks                                    12.9% could be classified as using a mix of reward and aversive-based training

                                                976 of them                             At 9 months
                                                completed a                              99.7% of the guardians reported using positive reinforcement and/or negative punishment
                                                follow-up                                74.2% used positive punishment and/or negative reinforcement
                                                survey at 9                              25.8% could be classified as reward only
                                                months.                                  29.2% could be classified as using a mix of reward and aversive-based training

                                                                                        Guardian factors that increased the likelihood of using both reward and aversive-based training at 9
                                                                                        months:
                                                                                         Males
                                                                                         Age > 55 years
                                                                                         Not having dog related employment
                                                                                         Not having attended a training class in the 2 months before completing the questionnaire

      Prepared for the BC SPCA by I.J. Makowska & updated by C.M. Cavalli                                                 69

It’s a bit messy. So instead, we can use pdf_data() to convert the text to data frames. The first line, which I commented out, returns a list of data frames, one per page, each holding the location of the text on the page and the content of the text (i.e., the words). Here’s an example from the first page.

#pdf_data(dogs_pdf)
pdf_data(dogs_pdf)[[1]] %>% 
  head(5) %>% 
  kable()
width  height  x    y   space  text
31     11      72   32  TRUE   Review
9      11      106  32  TRUE   of
16     11      117  32  TRUE   dog
35     11      137  32  TRUE   training
38     11      174  32  FALSE  methods
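
The x and y columns give each word's position on the page, and space says whether a space follows the word. As a rough sketch of how that layout information could be used, the snippet below retypes the five rows shown above into a small tibble and pastes the words that share a y value back into a line of text. The names first_words and rebuilt are made up for this example, and it assumes that words on the same line have exactly the same y, which may not hold for every PDF.

```r
library(dplyr)

# The five rows printed above, retyped by hand so the example runs without the PDF
first_words <- tibble::tribble(
  ~width, ~height, ~x,  ~y, ~space, ~text,
  31,     11,      72,  32, TRUE,   "Review",
  9,      11,      106, 32, TRUE,   "of",
  16,     11,      117, 32, TRUE,   "dog",
  35,     11,      137, 32, TRUE,   "training",
  38,     11,      174, 32, FALSE,  "methods"
)

# Assumption: words with the same y sit on the same line.
# Sort top to bottom (y) and left to right (x), then paste each line together.
rebuilt <- first_words %>%
  arrange(y, x) %>%
  group_by(y) %>%
  summarise(line = paste(text, collapse = " "), .groups = "drop")

rebuilt$line
```

For this post we only need the words themselves, so the position columns get dropped in the next step.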

We can use dplyr’s bind_rows() to stack the rows of our 69 tibbles (one per page) into one big tibble. The result has 25,052 rows.

dogs_text <- pdf_data(dogs_pdf) %>% 
  bind_rows() %>% 
  select(text) # I only want the column containing the words

str(dogs_text)
tibble [25,052 × 1] (S3: tbl_df/tbl/data.frame)
 $ text: chr [1:25052] "Review" "of" "dog" "training" ...
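
If you ever want to know which page a word came from, bind_rows() can record that while it stacks the tibbles. Here is a minimal sketch using two tiny made-up tibbles as stand-ins for the per-page output of pdf_data(); the names pages and words_by_page are just for illustration.

```r
library(dplyr)

# Two tiny stand-ins for the per-page tibbles that pdf_data() returns
pages <- list(
  tibble::tibble(text = c("Review", "of", "dog")),
  tibble::tibble(text = c("training", "methods"))
)

# .id adds a column recording which list element (here, page) each row came from
words_by_page <- pages %>%
  bind_rows(.id = "page") %>%
  mutate(page = as.integer(page))

words_by_page
```

I don't need page numbers for simple word counts, so I left this out above, but it could be handy for page-level comparisons.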

Tidy text

Now that we have our data, we need to tidy it up! Literally. Check out Text Mining with R: A Tidy Approach if you want to dig deeper into text mining.

Our data is already unnested, meaning each word is in its own row; we do not have sentences. This is the tidy text format: one token per row (or, when you are working with several documents, one-token-per-document-per-row). A “token” is a meaningful unit of text. Sometimes the unit of text you’d like to analyze is a sentence or phrase, but more commonly, you’ll want your tokens to be words.
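
To make the idea of a token concrete, here is a small sketch that splits the same text into word tokens and into sentence tokens with unnest_tokens(). The two-sentence text is made up for illustration and is not from the report.

```r
library(dplyr)
library(tidytext)

# A made-up two-sentence document, just for illustration
toy <- tibble::tibble(text = "Dogs learn quickly. Rewards help them learn.")

# One row per word (words are the default token)
toy_words <- toy %>%
  unnest_tokens(output = word, input = text)

# One row per sentence
toy_sentences <- toy %>%
  unnest_tokens(output = sentence, input = text, token = "sentences")

toy_words
toy_sentences
```

Because pdf_data() already hands us one word per row, sentence tokens are not really an option for our tibble; you would start from the pdf_text() strings for that.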

One thing we will want to do is remove punctuation from our tokens, because leftover punctuation could cause trouble when we aggregate by token (e.g., get word counts). So, while we already mostly have unnested tokens, the unnest_tokens() function will also strip punctuation and convert tokens to lowercase. Just what we needed! We’ll also take this opportunity to remove stop words (the little ones that turn out to be most frequent in spoken and written language, such as “the,” “and,” and “to”) with anti_join().

data("stop_words")
dogs_tidy <- dogs_text %>% 
  unnest_tokens(input = text, output = word) %>% 
  anti_join(stop_words)
Joining with `by = join_by(word)`
dogs_tidy %>% 
  head(5) %>% 
  kable()
word
review
dog
training
methods
december
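
If you'd like to see the cleanup on something smaller, here is a sketch with a handful of made-up tokens that have capitals and trailing punctuation. The names messy and clean are hypothetical. It also passes by = "word" to anti_join(), which is one way to avoid the "Joining with" message.

```r
library(dplyr)
library(tidytext)

# Made-up tokens with capitals and trailing punctuation
messy <- tibble::tibble(
  text = c("Review", "of", "dog", "training", "methods.", "The", "Dog!")
)

clean <- messy %>%
  unnest_tokens(output = word, input = text) %>%  # lowercases and strips punctuation
  anti_join(stop_words, by = "word")              # drops "of" and "the"

clean$word
```

Notice that "dog" and "Dog!" end up as the same token, which is exactly what we want before counting.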

Question time

I know you’re dying to know what words these researchers used most in their report of dog training methods.

dogs_tidy %>% 
  count(word, sort = TRUE) %>% 
  head(10) %>% 
  kable()
word       n
training   388
dogs       387
dog        274
methods    248
collar     219
collars    208
shock      176
based      171
aversive   136
guardians  134

Color me shocked at this list! Seems like these researchers have a lot to say about dog training methods, shock collars, guardians, and aversive-based training.

There are so many things you can do with this tidy text. It’s now ready for cross-text comparison, frequency or proportion plots, and deeper digging to answer your text-based questions.
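
As one possible next step, here is a minimal sketch of a frequency plot. To keep it runnable without the PDF, it retypes the top 10 counts from the table above into a tibble called top_words (a name I made up for this example). With the real data you could start from dogs_tidy %>% count(word, sort = TRUE) %>% head(10) instead.

```r
library(dplyr)
library(ggplot2)

# Top 10 counts copied from the table above
top_words <- tibble::tribble(
  ~word,       ~n,
  "training",  388,
  "dogs",      387,
  "dog",       274,
  "methods",   248,
  "collar",    219,
  "collars",   208,
  "shock",     176,
  "based",     171,
  "aversive",  136,
  "guardians", 134
)

# Order the words by count so the bars are sorted in the plot
plot_data <- top_words %>%
  mutate(word = reorder(word, n))

freq_plot <- ggplot(plot_data, aes(x = n, y = word)) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most frequent words in the report")

freq_plot
```

The reorder() call is a design choice: without it, ggplot2 would sort the bars alphabetically rather than by frequency.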