Data

Extracting data from PDFs using Python

1:30pm on Sunday, November 23

KLCC Level 3 - Room 303

About This Session

Writing Python scripts to extract data from PDFs is always a challenge: misshapen tables, arbitrary form formatting, tiny text, or low-quality scans. This session introduces attendees to Natural PDF, a new Python library for wrangling data that focuses on usability and on packing in as many features as possible, allowing you to write your code just as you'd ask "real-language" questions. Participants will leave with a solid understanding of how to extract data from difficult (and simple) PDFs using Python. The session is best for those with at least beginner-level knowledge of Python.

Speaker

Jonathan Soma

Knight Chair in Data Journalism

Columbia University