Working with Digital Data
in Religious Studies
7. Accessing & Structuring Datasets: Don’t Freak out When You See Code (It’s Only Python!)
Summer Semester 2024
Prof. Dr. Nathan Gibson
Outline
- Review
- Data Cleaning
- Tutorial
–Break–
- Coding
- Why learn code?
- A little Python 🐍
1.1. 📈 Review: Objective
Last objective: Recognize what should be standardized in your data.
1.1. Review: Why clean your data?
Garbage in, garbage out!
- Inputs: heterogeneous sources, human error, conversion/digitization bugs
- Problems: column/row breaks, entity confusion, mixed measurements, wrong text encodings
- Outputs: wrong fields, duplicate/missing data, broken sorting, inaccurate charts, unreadable text
1.1. Review: How to clean your data?
Look for both systematic & random errors by sorting, filtering, grouping, charting, and proofreading.
Using, e.g., OpenRefine, Spreadsheets, Visual Studio Code, RegEx
1.1. Review: OpenRefine vs. Spreadsheets
| OpenRefine | Spreadsheets |
| --- | --- |
| imports many formats | imports tables |
| cells are static | cells can be dynamic, based on formulas |
| uses filters and GREL functions to manipulate data | uses formulas to manipulate data |
| can reconcile your data to web databases & import data from there | no reconciliation, limited web import |
1.2. 📈 Epilogue
JSON data in OpenRefine
🧭 Today’s Learning Objective
Learn enough code to understand why you should learn more!
2.1. What can you do with a little bit of code?

https://xkcd.com/2138/
- fix bugs or problems in your data or in the software you use to analyze or display it
- understand how a tool works (whether it does what you need)
- create utilities for small tasks like converting your data
- “scrape” (download) data from web pages
- create graphs from your data
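To make the "small utility" idea concrete, here is a minimal sketch that converts a CSV table into JSON using only Python's standard library. The sample rows, the column names, and the function name `csv_to_json` are invented for illustration; in practice the text would come from one of your own files.

```python
import csv
import io
import json

# Hypothetical sample data (invented for this example).
SAMPLE_CSV = """title,place,year
Text A,Baghdad,950
Text B,Mosul,1100
"""

def csv_to_json(csv_text):
    """Convert CSV text (with a header row) into a JSON string."""
    # DictReader uses the first row as the keys for every following row.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ensure_ascii=False keeps non-Latin scripts readable in the output.
    return json.dumps(rows, indent=2, ensure_ascii=False)

print(csv_to_json(SAMPLE_CSV))
```

Note that every value comes out as text (for example `"950"`), because CSV does not record data types; converting years to numbers would be an extra step.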
2.1. Psst! You’ve already seen code
- Markdown
- HTML
- XML
- Spreadsheet formulas
- GREL functions in OpenRefine
(Just not code that can run 🏃‍♀️)
2.1. What more can you do with code (beyond spreadsheets)?
- download things from the web
- create and edit files
- write your own functions!!!
- write code that’s easier to read than spreadsheet formulas
- have precise control over graphs
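As a rough illustration of the points about functions and readability, compare a nested spreadsheet formula such as `=IF(ISBLANK(A2), "unknown", LOWER(TRIM(A2)))` with a Python function that does roughly the same job. The function name `clean_name` and the sample values are made up for this sketch.

```python
def clean_name(value):
    """Return a tidied, lowercase name, or 'unknown' if the cell is empty."""
    # Treat missing values and whitespace-only cells as empty.
    if value is None or value.strip() == "":
        return "unknown"
    # split() + join() trims the ends and collapses repeated spaces.
    return " ".join(value.split()).lower()

print(clean_name("  Ibn   SINA "))
print(clean_name(""))
```

Once written, the function can be reused on every value in a column, and its name and comments say what it is for, which a long formula does not.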
2.2. Why Python 🐍?
- easy to learn
- runs on all operating systems
- works well with text data
- lots of “libraries” (plugins) for things like graphs or networks
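A small sketch of what "works well with text data" can look like, tying back to data cleaning: the code below groups variant spellings by lowercasing, trimming, and removing diacritics, then counts them. The list of place names and the function name `normalize` are invented for illustration, and it uses only the standard library.

```python
from collections import Counter
import unicodedata

# Hypothetical column of place names with inconsistent spelling and spacing.
places = ["Mosul", "mosul ", "Mōsul", "Baghdad", "baghdad", "Mosul"]

def normalize(text):
    """Lowercase, trim, and strip diacritics so variant spellings group together."""
    # NFKD splits a letter like "ō" into "o" plus a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", text.strip().lower())
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

counts = Counter(normalize(p) for p in places)
print(counts.most_common())
```

Whether stripping diacritics is appropriate depends on your sources; for transliterated names it can merge spellings that you want to keep distinct.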
2.2. A chance to play with Python 🐍
Monday, 14-15 in my office hours
Preview
Tutorial in Google Colab (like Jupyter Notebooks)
- Accessing & Structuring Datasets: Grab More Data with Scraping and Querying