Working with Digital Data in Religious Studies
6. Accessing & Structuring Datasets: Clean and Augment Your Data with OpenRefine
Summer Semester 2024
Prof. Dr. Nathan Gibson
Outline
- Review
- Entities & Relationships
- Tutorial
–Break–
- Cleaning & Augmenting Data
- Why?
- How?
1. 📈 Review: Objective
Last objective: Structure your data as “entities” and “relationships” using tables.
1.1. 📈 Review: Entities & Relationships
Thing 1 > does x to > Thing 2
1.1. 📈 Review: Entities & Relationships: What are they?
- entity: a thing in a database “with an identity independent of change to its attributes” (Wikipedia)
- attribute: information about an entity
- relationship: a statement linking entities
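As a minimal sketch of these three terms, the Python below stores entities in one table keyed by ID and relationships in a second table that refers only to those IDs. The IDs, names, and the `describe` helper are made up for illustration; they are not part of the tutorial data.

```python
# Entities: each has a stable ID plus attributes (hypothetical sample data).
entities = {
    "E1": {"type": "person", "name": "Person A"},
    "E2": {"type": "text", "name": "Text B"},
}

# Relationships: statements of the form (Thing 1, does x to, Thing 2),
# referring to entities by ID rather than by name.
relationships = [
    ("E1", "wrote", "E2"),
]

def describe(relationship):
    """Turn one relationship row into a readable statement."""
    subject_id, predicate, object_id = relationship
    return f"{entities[subject_id]['name']} {predicate} {entities[object_id]['name']}"

before = [describe(r) for r in relationships]

# Changing an attribute does not change the entity's identity:
# the relationship table stays untouched and still points to "E1".
entities["E1"]["name"] = "Persona A"
after = [describe(r) for r in relationships]

print(before)
print(after)
```

Because the relationship row uses IDs, correcting the spelling of a name requires a change in only one place.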
1.2. 📈 Tutorial Review
- Copy the Google Sheets file
- Navigate, sort, and filter
- Columns, rows, and cells
- Sorting and Data Types
- Adding columns
- Filtering
- Chart
- Select columns to chart
- Edit chart
1.2. 📈 Tutorial Review
- Insert a formula
- Do math
- Formula functions
- Copy the formula to all rows
- Change data type
- Adapt the formula
🧭 Today’s Learning Objective
Recognize what should be standardized in your data.
2.1. Why clean your data?
Garbage in, garbage out!
Image: generated with Stable Diffusion in DiffusionBee 2.5.1 (model bluePencilXL, style vector art) on 2024-05-24 from the prompt "Infographic of garbage being put into a blender".
2.1. Why clean your data? Problems
“Garbage in” means data with problems like these:
- inconsistent breaks between columns or rows
- inconsistent identification of entities (e.g., using names instead of IDs)
- different measurement systems (e.g., “mid-19th century” vs. “1830–1870”, metric vs. customary)
- wrong text encoding
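To illustrate the last problem in this list, here is a small Python sketch of how wrong text encoding can arise: text saved as UTF-8 is read as if it were Latin-1. The sample name is hypothetical, and this repair only works when the wrong encoding is known and no bytes were lost along the way.

```python
# Hypothetical sample value containing a non-ASCII character.
original = "Müller"

# Saved as UTF-8 but opened as Latin-1: each byte becomes its own character.
garbled = original.encode("utf-8").decode("latin-1")

# If we know which wrong encoding was applied, we can reverse the mistake.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```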
2.1. Why clean your data? Outputs
“Garbage out” means output like this:
- data appears in the wrong field (e.g., “blue” for “height”)
- things are duplicated, conflated, or missing
- can’t sort, filter, or group things correctly
- can’t create accurate charts
- can’t read the text
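Two of these outputs are easy to reproduce in a few lines of Python. This is only a sketch with invented values: numbers stored as text sort alphabetically, and one entity written in two ways gets counted as two.

```python
from collections import Counter

# Hypothetical "height" column in which the numbers were stored as text.
heights_as_text = ["9", "10", "100"]

# Text is sorted character by character, so "10" and "100" come before "9".
sorted_as_text = sorted(heights_as_text)

# Treating the values as numbers gives the expected order.
sorted_as_numbers = sorted(heights_as_text, key=int)

# Hypothetical records in which the same entity appears under two spellings.
records = ["Person A", "Person A", "person a"]

# Grouping by name splits one entity into two groups.
counts = Counter(records)

print(sorted_as_text)
print(sorted_as_numbers)
print(counts)
```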
2.2. How to clean your data?
Look for
- Systematic errors
- Random errors
by sorting, filtering, grouping, charting, and proofreading.
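Grouping can be sketched in plain Python as follows. This is not OpenRefine itself, only an illustration under simple assumptions: the column values are invented, and the `normalize` helper (trim whitespace, ignore capitalization) is one possible rule among many.

```python
from collections import Counter, defaultdict

# Hypothetical column of values, as it might look before cleaning.
values = ["monastery", "Monastery ", "monastery", "mosque", "mosque", "monastry"]

def normalize(value):
    """Trim surrounding whitespace and ignore capitalization."""
    return value.strip().casefold()

# Group the raw spellings under their normalized form and count each spelling.
variants = defaultdict(Counter)
for value in values:
    variants[normalize(value)][value] += 1

# More than one raw spelling in a group suggests a systematic inconsistency.
inconsistent = {key: dict(raw) for key, raw in variants.items() if len(raw) > 1}

# Groups that occur only once are candidates for random errors such as typos.
rare = [key for key, raw in variants.items() if sum(raw.values()) == 1]

print(inconsistent)
print(rare)
```

Rare values are only candidates: a value that occurs once may be a typo or a legitimate entry, so proofreading is still needed.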
2.2. How to clean your data?
Tools
Preview
- Accessing & Structuring Datasets: Don’t Freak out When You See Code (It’s Only Python!)