PDF Text Extraction

Automatically extract tables and text from PDFs.

Thierry Warin (https://www.nuance-r.com/principalInvestigator.html), SKEMA Business School, Raleigh, NC (https://www.skema-bs.fr/campus/campus-raleigh)

Using the SKEMA Quantum Studio framework (Warin 2019), this course teaches you how to extract tables and text from PDFs.

Course objectives

This course shows you how to extract tables and text from PDFs automatically. It presents two R packages for the task: pdftools and tabulizer.

Course plan

1. pdftools: Getting started

Extracting text from PDF.

2. pdftools: Utilities

Extracting the table of contents, PDF author, version and PDF fonts.

3. pdftools: Tables

Extracting tables from PDF.

4. pdftools: Scanned text

5. tabulizer: Getting started

Extracting text from PDF.

6. tabulizer: Utilities

Splitting up a PDF by its pages.

Merging a collection of PDFs.

Getting the number of pages in a PDF.

Getting metadata associated with a PDF.

7. tabulizer: Tables

Extracting tables from PDF.
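
As a preview of the plan above, here is a minimal sketch in R of the kind of calls each part relies on. It assumes the pdftools package is installed and treats tabulizer (which needs a working Java setup) as optional. The file name `sample.pdf` and the object names are hypothetical, and the sample PDF is generated with base R only so that the sketch is self-contained. It is an illustration under these assumptions, not the course's own code.

```r
# Sketch only: assumes pdftools is installed; tabulizer is optional.
library(pdftools)

# Build a one-page sample PDF with base R (hypothetical file name).
sample_pdf <- file.path(tempdir(), "sample.pdf")
pdf(sample_pdf)
plot.new()
text(0.5, 0.5, "Hello PDF")
dev.off()

# pdftools, getting started: one character string per page.
pages <- pdf_text(sample_pdf)
cat(pages[1])

# pdftools, utilities: document info, fonts, table of contents.
info <- pdf_info(sample_pdf)
info$pages
pdf_fonts(sample_pdf)
pdf_toc(sample_pdf)

# pdftools, tables: word-level positions on page 1, a possible
# starting point for rebuilding a table by hand.
words <- pdf_data(sample_pdf)[[1]]
head(words)

# tabulizer: run only if the package (and Java) can be loaded.
if (requireNamespace("tabulizer", quietly = TRUE)) {
  tabulizer::extract_text(sample_pdf)               # text
  tabulizer::get_n_pages(sample_pdf)                # number of pages
  tabulizer::extract_metadata(sample_pdf)           # metadata
  tables <- tabulizer::extract_tables(sample_pdf)   # list of tables (empty here)

  # Split into one file per page, then merge the pieces back together.
  parts <- tabulizer::split_pdf(sample_pdf, outdir = tempdir())
  tabulizer::merge_pdfs(parts, outfile = file.path(tempdir(), "merged.pdf"))
}
```

For scanned documents, pdftools also provides `pdf_ocr_text()`, which depends on the tesseract package. It is left out of the sketch because the sample PDF above contains real text rather than a scanned image.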


Warin, Thierry. 2019. “SKEMA Quantum Studio: A Technological Framework for Data Science in Higher Education.” https://doi.org/10.6084/m9.figshare.8204195.v2.


For attribution, please cite this work as

Warin (2020, March 24). Virtual Campus: PDF Text Extraction. Retrieved from https://virtualcampus.skemagloballab.io/posts/pdf-text-extraction/

BibTeX citation

  author = {Warin, Thierry},
  title = {Virtual Campus: PDF Text Extraction},
  url = {https://virtualcampus.skemagloballab.io/posts/pdf-text-extraction/},
  year = {2020}