Working with Digital Data in Religious Studies

6. Accessing & Structuring Datasets: Clean and Augment Your Data with OpenRefine

Summer Semester 2024
Prof. Dr. Nathan Gibson

Outline

  1. Review
    1. Entities & Relationships
    2. Tutorial – Break
  2. Cleaning & Augmenting Data
    1. Why?
    2. How?

1. 📈 Review: Objective

Last objective: Structure your data as “entities” and “relationships” using tables.

1.1. 📈 Review: Entities & Relationships

Thing 1 > does x to > Thing 2

1.1. 📈 Review: Entities & Relationships: What are they?

  • entity: a thing in a database “with an identity independent of change to its attributes” (Wikipedia)
  • attribute: information about an entity
  • relationship: a statement linking entities
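As a minimal sketch of these three terms, the Python snippet below stores entities in two small tables keyed by ID, keeps their attributes as fields, and records relationships as rows that link IDs. All names, IDs, and values are hypothetical placeholders, not data from the tutorial.

```python
# Hypothetical example data: two entity tables and one relationship table.
# The keys ("P1", "T1", ...) are the entities' IDs; the inner fields are attributes.
persons = {
    "P1": {"name": "Person A", "birth_year": 1801},
    "P2": {"name": "Person B", "birth_year": 1815},
}
texts = {
    "T1": {"title": "Text X"},
}

# Each relationship row links two entities by ID: (Thing 1, does x to, Thing 2).
relationships = [
    ("P1", "wrote", "T1"),
    ("P2", "copied", "T1"),
]


def describe(subject_id, verb, object_id):
    """Render one relationship row as 'Thing 1 > does x to > Thing 2'."""
    return f"{persons[subject_id]['name']} > {verb} > {texts[object_id]['title']}"


statements = [describe(*row) for row in relationships]

# Changing an attribute does not change the entity's identity:
# the relationship rows still point to "P1" and need no editing.
persons["P1"]["name"] = "Person A (corrected spelling)"
statements_after = [describe(*row) for row in relationships]

print(statements)
print(statements_after)
```

Because the relationship rows refer to IDs rather than names, correcting a name in one place updates every statement that mentions that person.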

1.2. 📈 Tutorial Review

  1. Copy the Google Sheets file
  2. Navigate, sort, and filter
    1. Columns, rows, and cells
    2. Sorting and Data Types
    3. Adding columns
    4. Filtering
  3. Chart
    1. Select columns to chart
    2. Edit chart

1.2. 📈 Tutorial Review

  4. Formulas
    1. Insert a formula
    2. Do math
    3. Formula functions
    4. Copy the formula to all rows
    5. Change data type
    6. Adapt the formula

Break

🧭 Today’s Learning Objective

Recognize what should be standardized in your data.

2.1. Why clean your data?

Garbage in, garbage out!

Generated with Stable Diffusion in DiffusionBee 2.5.1 (model bluePencilXL, style vector art) on 2024-05-24 from the prompt "Infographic of garbage being put into a blender".

2.1. Why clean your data? Inputs

All these can cause problems with your input:

  • heterogeneous sources
    • different software
    • different guidelines
    • different data structure (entities, relationships)
  • human error
    • inconsistency
    • mistaken assumptions
  • bugs in retrieving or converting data

2.1. Why clean your data? Problems

Leading to these problems:

  • inconsistent breaks between columns or rows
  • inconsistent identification of entities (e.g., using names instead of IDs)
  • different measurement systems (e.g., “mid-19th century” vs. “1830–1870”, metric vs. customary)
  • wrong text encoding
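Two of these problems can be illustrated with a short Python sketch: text that was saved as UTF-8 but read as Latin-1, and measurements recorded in mixed units. The sample name, the helper `to_cm`, and the unit labels are assumptions made for illustration only. The encoding repair shown works only when the text was misread in exactly this way.

```python
# Wrong text encoding: UTF-8 bytes read as if they were Latin-1.
original = "Müller"
garbled = original.encode("utf-8").decode("latin-1")   # what the reader sees: "MÃ¼ller"
repaired = garbled.encode("latin-1").decode("utf-8")   # reverse the wrong step

# Different measurement systems: convert everything to one unit before comparing.
INCH_TO_CM = 2.54  # exact by definition


def to_cm(value, unit):
    """Convert a length to centimetres (hypothetical helper; only 'cm' and 'in' are handled)."""
    if unit == "cm":
        return value
    if unit == "in":
        return value * INCH_TO_CM
    raise ValueError(f"unknown unit: {unit!r}")


heights = [(30, "cm"), (10, "in")]  # hypothetical measurements
heights_cm = [to_cm(value, unit) for value, unit in heights]

print(garbled, "->", repaired)
print(heights_cm)
```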

2.1. Why clean your data? Outputs

Resulting in this kind of output:

  • data appears in the wrong field (e.g., “blue” for “height”)
  • things are duplicated, conflated, or missing
  • can’t sort, filter, or group things correctly
  • can’t create accurate charts
  • can’t read the text
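The sorting problem is easy to reproduce. In the hedged Python sketch below, a hypothetical column of page counts was imported as text, so it sorts character by character instead of numerically. Converting the values to numbers first gives the expected order.

```python
# Hypothetical column of page counts that was imported as text, not as numbers.
page_counts_as_text = ["9", "10", "100", "25"]

# Text sorts character by character, so "10" and "100" come before "25" and "9".
sorted_as_text = sorted(page_counts_as_text)

# Converting to integers first restores the numeric order.
sorted_as_numbers = sorted(int(value) for value in page_counts_as_text)

print(sorted_as_text)      # ['10', '100', '25', '9']
print(sorted_as_numbers)   # [9, 10, 25, 100]
```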

2.2. How to clean your data?

Look for

  • Systematic errors
  • Random errors

by sorting, filtering, grouping, charting, and proofreading.
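Grouping can be sketched in a few lines of Python: counting how often each value occurs in a column makes variants visible. The column below is invented for illustration. Here the capitalization and trailing-space variants stand in for systematic errors that one rule can fix, while the one-off misspelling stands in for a random error that needs proofreading.

```python
from collections import Counter

# Hypothetical column with spelling, case, and whitespace variants.
tradition_column = [
    "Islam", "islam", "Islam ", "Christianity",
    "Islam", "Chrisitanity", "Christianity",
]

# Step 1: group identical values and count them to see which variants exist.
raw_counts = Counter(tradition_column)


def normalize(value):
    """Apply one rule to all rows: trim whitespace and ignore case."""
    return value.strip().casefold()


# Step 2: after the systematic fix, regroup and look at what is still rare.
normalized_counts = Counter(normalize(value) for value in tradition_column)
rare_values = sorted(v for v, n in normalized_counts.items() if n == 1)

print(raw_counts)
print(normalized_counts)
print(rare_values)  # candidates for manual proofreading
```

A value that occurs only once is not necessarily wrong, so the rare values are a list to check by hand, not a list to delete.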

2.2. How to clean your data?

Tools

  • OpenRefine
  • Spreadsheets
  • Visual Studio Code
  • RegEx
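To give a sense of what a regular expression does, the Python sketch below recognizes year ranges written in a few different ways and returns them in one consistent form. The pattern and the function name `parse_year_range` are assumptions chosen for this illustration. Verbal dates such as “mid-19th century” are deliberately left unmatched, since interpreting them is a scholarly decision rather than a pattern match.

```python
import re

# Four digits, a separator (hyphen, en dash, or the word "to"), four digits.
# Optional spaces are allowed around the separator and at both ends.
YEAR_RANGE = re.compile(r"^\s*(\d{4})\s*(?:-|\u2013|to)\s*(\d{4})\s*$")


def parse_year_range(text):
    """Return (start, end) as integers, or None if the text is not a year range."""
    match = YEAR_RANGE.match(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical date cells from different sources.
cells = ["1830\u20131870", "1830 - 1870", "1830 to 1870", "mid-19th century"]
parsed = [parse_year_range(cell) for cell in cells]

print(parsed)
```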

Preview

  7. Accessing & Structuring Datasets: Don’t Freak Out When You See Code (It’s Only Python!)