Working with Digital Data in Religious Studies

7. Accessing & Structuring Datasets: Don’t Freak Out When You See Code (It’s Only Python!)

Summer Semester 2024
Prof. Dr. Nathan Gibson

Outline

  1. Review
    1. Data Cleaning
    2. Tutorial

    Break

  2. Coding
    1. Why learn code?
    2. A little Python 🐍

1.1. 📈 Review: Objective

Last objective: Recognize what should be standardized in your data.

1.1. Review: Why clean your data?

Garbage in, garbage out!

  • Inputs: heterogeneous sources, human error, conversion/digitization bugs
  • Problems: column/row breaks, entity confusion, mixed measurements, wrong text encodings
  • Outputs: wrong fields, duplicate/missing data, broken sorting, inaccurate charts, unreadable text

1.1. Review: How to clean your data?

Look for both systematic & random errors by sorting, filtering, grouping, charting, and proofreading.

Tools for this include OpenRefine, spreadsheets, Visual Studio Code, and regular expressions (RegEx).
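Filtering with a regular expression can be sketched in a few lines of Python. This is only an illustration: the list `years` and its values are invented sample data, and the rule "a year is exactly four digits" is an assumption you would adapt to your own dataset.

```python
import re

# Hypothetical "year" column with a few typical inconsistencies.
years = ["1850", "1851", " 1852", "ca. 1860", "18th c.", "1871"]

# Pattern for the expected form: exactly four digits and nothing else.
four_digit_year = re.compile(r"\d{4}")

# Filter: keep every value that does NOT fully match the expected pattern.
suspicious = [value for value in years if not four_digit_year.fullmatch(value)]

print(suspicious)
```

The values that remain are the ones worth proofreading: a stray leading space is likely a random error, while entries like "ca. 1860" point to a systematic difference in how dates were recorded.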

1.1. Review: OpenRefine vs. Spreadsheets

OpenRefine                                          | Spreadsheets
imports many formats                                | imports tables
cells are static                                    | cells can be dynamic (based on formulas)
uses filters and GREL functions to manipulate data  | uses formulas to manipulate data
can reconcile your data to web databases            | no reconciliation, limited web import
  & import data from there                          |

1.2. 📈 Tutorial Review

https://24data.pages.gwdg.de/openrefine-tutorial

1.2. 📈 Epilogue

JSON data in OpenRefine

Break

🧭 Today’s Learning Objective

Learn enough code to understand why you should learn more!

2.1. What can you do with a little bit of code?

Wanna See the Code? (XKCD)

https://xkcd.com/2138/

2.1. What can you do with a little bit of code?

  • fix bugs or problems in your data or in the software you use to analyze or display it
  • understand how a tool works (whether it does what you need)
  • create utilities for small tasks like converting your data
  • “scrape” (download) data from web pages
  • create graphs from your data
  • machine learning
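As one example of a small utility, here is a minimal sketch that converts CSV data to JSON using only Python's standard library. The sample data and the function name `csv_to_json` are made up for illustration; in practice the text would come from a file.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would be read from a file.
csv_text = "name,tradition\nHildegard of Bingen,Christianity\nRumi,Islam\n"

def csv_to_json(text):
    """Convert CSV text (with a header row) into a JSON string."""
    # DictReader uses the first row as keys for every following row.
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2, ensure_ascii=False)

json_text = csv_to_json(csv_text)
print(json_text)
```

Each CSV row becomes one JSON object whose keys are the column headers.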

2.1. Psst! You’ve already seen code

  • Markdown
  • HTML
  • XML
  • Spreadsheet formulas
  • GREL functions in OpenRefine

(Just not code that can run 🏃‍♀️)

2.1. What more can you do with code (beyond spreadsheets)?

  • download things from the web
  • create and edit files
  • write your own functions!!!
  • write code that’s easier to read than spreadsheet formulas
  • have precise control over graphs
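Writing your own function might look like the sketch below. The function name `clean_label` and the sample labels are invented for illustration; the point is that each step has its own readable line instead of being nested inside one long formula.

```python
def clean_label(text):
    """Trim the ends, collapse repeated whitespace, and lowercase the text."""
    words = text.split()            # split on any run of whitespace
    return " ".join(words).lower()  # rejoin with single spaces, then lowercase

# Hypothetical labels that differ only in spacing and capitalization.
labels = ["  Old   Testament ", "old testament", "OLD TESTAMENT  "]
cleaned = [clean_label(label) for label in labels]
print(cleaned)
```

Once defined, the function can be reused on any column or file without copying a formula from cell to cell.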

2.2. Why Python 🐍?

  • easy to learn
  • runs on all operating systems
  • works well with text data
  • lots of “libraries” (plugins) for things like graphs or networks
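To give a feel for how Python handles text, here is a minimal sketch that counts word frequencies. The sample sentence is only an example, and `Counter` comes from Python's built-in `collections` module, so nothing needs to be installed.

```python
from collections import Counter

# A sample sentence; any text would work here.
text = "In the beginning was the Word, and the Word was with God"

# Split into words, strip simple punctuation, and lowercase each word.
words = [word.strip(",.").lower() for word in text.split()]

# Count how often each word occurs.
counts = Counter(words)
print(counts.most_common(3))
```

The same few lines would work on a whole book once the text is loaded from a file.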

2.2. A chance to play with Python 🐍

Monday, 14:00-15:00, during my office hours

Preview

Tutorial in Google Colab (like Jupyter Notebooks)

  8. Accessing & Structuring Datasets: Grab More Data with Scraping and Querying