PDF Text Extraction

Automatically extract tables and text from PDFs.

Thierry Warin (https://www.nuance-r.com/principalInvestigator.html), SKEMA Business School, Raleigh, NC (https://www.skema-bs.fr/campus/campus-raleigh)
03-24-2020

Using the SKEMA Quantum Studio framework (Warin 2019), this course teaches you how to extract tables and text from PDFs.

Course objectives

The purpose of this course is to enable you to extract tables and text from PDFs automatically. It presents two R packages for this task: pdftools and tabulizer.

Course plan

1. pdftools: Getting started

Extracting text from a PDF.

2. pdftools: Utilities

Extracting a PDF's table of contents, author, version, and fonts.

3. pdftools: Tables

Extracting tables from a PDF.

4. pdftools: Scanned text

5. tabulizer: Getting started

Extracting text from a PDF.

6. tabulizer: Utilities

Splitting up a PDF by its pages.

Merging a collection of PDFs.

Getting the number of pages in a PDF.

Getting metadata associated with a PDF.

7. tabulizer: Tables

Extracting tables from a PDF.
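
As a rough preview of the plan above, here is a minimal R sketch of the kind of calls each part relies on. It is not the course's own material. It assumes pdftools is installed, and it builds a throwaway one-page PDF (the temporary file and the "Hello PDF" text are placeholders) so that it runs without any external document. The tabulizer calls are skipped unless that package, which requires Java, is available.

```r
# Minimal sketch, assuming the pdftools package is installed.
library(pdftools)

# Build a small one-page PDF so the example is self-contained.
# The path and the "Hello PDF" text are placeholders, not course data.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello PDF")
invisible(dev.off())

# pdftools, getting started: one character string per page.
pages <- pdf_text(pdf_path)

# pdftools, utilities: document info, table of contents, fonts.
info  <- pdf_info(pdf_path)   # e.g. info$version, info$pages; author (if set) is under info$keys
toc   <- pdf_toc(pdf_path)
fonts <- pdf_fonts(pdf_path)

# pdftools, tables: a common first step is to split a page into lines
# and then parse the lines that make up the table.
page_lines <- strsplit(pages[1], "\n")[[1]]

# tabulizer: run only if the package (and its Java dependency) is installed.
if (requireNamespace("tabulizer", quietly = TRUE)) {
  n_pages  <- tabulizer::get_n_pages(pdf_path)       # number of pages
  metadata <- tabulizer::extract_metadata(pdf_path)  # PDF metadata
  tables   <- tabulizer::extract_tables(pdf_path)    # list of detected tables (may be empty here)
}
```

With a real document, replace pdf_path with the path to your own PDF. For splitting and merging files, tabulizer provides split_pdf() and merge_pdfs(). For scanned pages, pdftools provides pdf_ocr_text(), which additionally needs the tesseract package.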




Warin, Thierry. 2019. “SKEMA Quantum Studio: A Technological Framework for Data Science in Higher Education.” https://doi.org/10.6084/m9.figshare.8204195.v2.

Citation

For attribution, please cite this work as:

Warin (2020, March 24). Virtual Campus: PDF Text Extraction. Retrieved from https://virtualcampus.skemagloballab.io/posts/pdf-text-extraction/

BibTeX citation

@misc{warin2020pdf,
  author = {Warin, Thierry},
  title = {Virtual Campus: PDF Text Extraction},
  url = {https://virtualcampus.skemagloballab.io/posts/pdf-text-extraction/},
  year = {2020}
}