Working with Digital Data in Religious Studies

2. Working with Text: Get the Most out of “Plain” Text and Regular Expressions

Summer Semester 2024
Prof. Dr. Nathan Gibson

📈 Review

Last objective: Consider your personal goals for the semester within the big picture of digital data in religious studies.

📈 Review: Chocolate Model

  • Pick (cacao tree): Inputs/Sources, e.g., manuscripts, photos, interviews
  • Prepare (cacao pod): Structuring data, e.g., transcribing, collating
  • Process (utensils): Analysis, e.g., textual comparison, criticism, content analysis, coding
  • Package (chocolate bar): Outputs/Presentation, e.g., edition, narrative, thematic discussion, interactive website

📈 Review: Group Work on Sources

https://etherpad.studiumdigitale.uni-frankfurt.de/p/24data2

📈 Review: Optional Assignment

A question for a ChatAI?

🧭 Today’s Learning Objective

Understand plain text as a foundational type of data.

Which of these is text from the computer’s perspective?

a. Text

b. Very fancy

c. Backwards

d. Emojis 🦉👀🐁❤️🐛

e. Invisible character‎

f. Math equation 𝓐 = 4𝛑𝑟²

Which of these is text from the computer’s perspective? (continued)

g. Arabic عربيّة

h. Domino game 🁍🀱🀲🀺🁃

i. Hieroglyphics (version 1) [1]:

Cleopatra

j. Hieroglyphics (version 2):
𐦐𐦗𓂅𓂧𓂋𐦉𓂂𐦇𓂂𓂑𐦓

k. Screenshot

Screenshot

l. Page scan

Faust

Even this is text

⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣾⣿⣷⣾⣿⣿⣿⣷⣤⣤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣤⣿⣿⣷⣢⣌⣭⣍⢻⣿⡟⣽⣿⣿⣿⣦⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣾⣿⠟⣷⡾⠛⠋⠉⠛⠛⠛⢷⣉⣭⣿⣿⣿⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢾⣿⣿⣿⣿⠁⠀⠀⠀⠀⠀⠀⠀⣿⣷⡌⣿⣿⣷⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣸⣿⣿⣿⡟⠀⢀⡀⠀⠀⠀⢀⣀⢹⣿⡿⢿⣿⣿⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣿⣿⣿⣿⡇⣎⢥⣿⣷⡄⣾⣿⣭⢽⣿⡇⣾⣿⣿⡄⠀⠀⠀⠀⠀
⠀⠀⠀⠀⢀⣿⠻⢟⣿⠇⠈⠋⠁⢹⡀⣿⡇⠹⠟⣿⣿⣮⣤⣾⣿⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠈⢿⣷⣿⢿⣠⠀⠀⠀⣤⡄⣻⡧⠀⠀⣿⣿⡟⠿⠿⠃⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿⡄⠀⠠⣤⢌⠭⣥⢀⣼⣿⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢺⣿⣿⣿⠋⢦⡀⠐⠓⠒⣿⣿⣿⠿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿⣆⠈⠻⢶⣶⣾⡿⠟⠁⢀⣿⣿⡟⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⢀⣠⣴⣾⣿⣯⢿⣏⢦⡀⠀⣰⢧⡀⠀⣠⡞⣿⣿⣿⣷⣦⣄⡀⠀⠀⠀
⠀⣤⣾⣿⣿⣿⣿⣿⣿⣷⡻⣦⡉⢿⢭⡞⣻⠿⢋⣼⣿⣿⣿⣿⣿⣿⣿⣷⣦⠀
⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣿⣿⣷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧
⠻⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿

There are 10 types of people in the world …

… Those who understand binary, and those who don’t.
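The joke and the quiz can be made concrete with a small Python sketch (not part of the original slides). It assumes Python 3 and its standard `unicodedata` module; the helper name `describe` is made up for illustration. For each character it shows the number the computer assigns to it (the Unicode code point), its official name, and the ones and zeros used to store it in UTF-8.

```python
import unicodedata

def describe(char):
    """Return (code point, Unicode name, UTF-8 bytes written in binary)."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "(no name)")
    bits = " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))
    return code_point, name, bits

# A Latin letter, an Arabic letter, an emoji, and a Braille pattern:
# to the computer, each one is just a number.
for char in ["T", "ع", "🦉", "⣿"]:
    print(char, *describe(char), sep=" | ")

# The "10" in the joke, read as a binary number
print(int("10", 2))  # prints 2
```

Screenshots and page scans are different: they are stored as pixels, so there are no characters to look up this way.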

Punch Tape

Image by Vanessaezekowitz at English Wikipedia, CC BY-SA 3.0

The importance of text in computing

Text as the most important unit of human-machine interaction

  • communication
  • programming
  • websites

The importance of text in digital humanities

Many humanistic disciplines such as history, philology, literature, theology, and anthropology primarily use textual sources.

Many humanistic disciplines also do their analysis and communicate their results using primarily text.

But the world is not text!

  • Can and should the thing I want to study be represented as text?
    • An ancient manuscript?
    • A work of art?
    • A musical work?
    • An interview?
    • Archeological findings?
    • Polling/survey responses?

“Plain Text”

Text without formatting:

Plain text

Formatted text

Even crazier formatted text!

“Plain Text” (continued)

Plain text can be opened in any text program: Microsoft Word, Notepad, TextEdit, even a web browser. Try it!

You might see “.txt” at the end of a plain text file’s name, but the file extension doesn’t change the contents; it only tells the operating system which program to open the file with. Try it!
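One way to try this yourself is the following Python sketch (an illustration, not part of the original slides). It writes the same sentence into three files with different, made-up extensions inside a temporary folder and checks that the stored bytes are identical.

```python
import tempfile
from pathlib import Path

text = "Plain text is just characters.\n"

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names; only the extension differs
    for filename in ["notes.txt", "notes.md", "notes.anything"]:
        (Path(folder) / filename).write_text(text, encoding="utf-8")
    # Read every file back as raw bytes, keyed by its extension
    contents = {p.suffix: p.read_bytes() for p in Path(folder).iterdir()}

same_bytes = len(set(contents.values())) == 1
print(same_bytes)  # prints True: the extension did not change the content
```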

“Plain Text” (continued)

But plain text still has an “encoding”: a rule for which bytes stand for which characters.

Arabic in an old encoding (Windows-1256) Try it!

Arabic in Unicode (UTF-8) Try it!
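The difference between the two encodings can be sketched in Python (an illustration, not part of the original slides). It assumes Python's built-in codec names `cp1256` (Windows-1256) and `utf-8`. The same six characters become different bytes in each encoding, and reading bytes with the wrong encoding produces garbled text.

```python
# The Arabic word from the quiz (عربيّة), written with escapes so the
# exact characters are unambiguous: ain, reh, beh, yeh, shadda, teh marbuta
word = "\u0639\u0631\u0628\u064a\u0651\u0629"

as_windows_1256 = word.encode("cp1256")  # one byte per character
as_utf8 = word.encode("utf-8")           # two bytes per Arabic character

print(len(word), len(as_windows_1256), len(as_utf8))  # prints 6 6 12

# Opening UTF-8 bytes as if they were Windows-1256 gives nonsense ...
garbled = as_utf8.decode("cp1256", errors="replace")
# ... while the right encoding restores the original word
restored = as_utf8.decode("utf-8")

print(garbled)
print(restored == word)  # prints True
```

This is why a text editor sometimes needs to be told which encoding a file uses.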

📖 Readings: Unicode

  • What is Unicode?
  • What surprised you in the list of Unicode characters?

Break

What are Regular Expressions (RegEx)?

A regular expression is a sequence of characters (a string) that defines a text pattern.

Different programs use different standards for those patterns, but many of them work similarly and are called “RegEx.”

For example, to find numbers in a text, you don’t have to search separately for “0”, then “1”, “2”, etc. You can just search for “\d”, which matches any single digit, or “\d+”, which matches a whole run of digits.
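A minimal sketch in Python (not part of the original slides; the sample sentence is made up for illustration) shows the difference between searching for one digit and searching for a run of digits, using the standard `re` module.

```python
import re

sentence = "Surah 2 has 286 verses; Psalm 119 has 176."

single_digits = re.findall(r"\d", sentence)   # one match per digit
whole_numbers = re.findall(r"\d+", sentence)  # one match per number

print(single_digits)
print(whole_numbers)  # prints ['2', '286', '119', '176']
```

Other programs, such as a text editor's search box, may treat “\d” slightly differently. In Python 3, for instance, it also matches digits from other scripts, such as Arabic-Indic digits.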

What can you do with RegEx?

Search for patterns of text and replace them with other patterns. For example,

  • Change dates from the format 12/9/2020 (month/day/year) to 09.12.2020 (day.month.year).
  • Remove extra spaces and blank lines.
  • Find words or phrases with many variations (Munich, München, Monaco).
  • Find everything in a certain script (all Arabic characters, all Latin characters).
  • Standardize your transcriptions.
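Several of these tasks can be sketched in Python (an illustration, not part of the original slides). The function names and sample strings are made up, the date example assumes the input is month/day/year, and the Arabic search covers only the basic Arabic Unicode block (U+0600 to U+06FF), since Python's standard `re` module has no shortcut for "all characters of a script."

```python
import re

def us_to_german_date(text):
    """Rewrite dates like 12/9/2020 (month/day/year) as 09.12.2020."""
    def swap(match):
        month, day, year = match.groups()
        return f"{int(day):02d}.{int(month):02d}.{year}"
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", swap, text)

def tidy_whitespace(text):
    """Collapse repeated spaces or tabs and remove blank lines."""
    text = re.sub(r"[ \t]{2,}", " ", text)
    return re.sub(r"\n\s*\n", "\n", text)

def find_munich(text):
    """Find the English, German, and Italian names of the same city."""
    return re.findall(r"\b(?:Munich|München|Monaco)\b", text)

def find_arabic(text):
    """Find runs of characters from the basic Arabic block."""
    return re.findall(r"[\u0600-\u06FF]+", text)

print(us_to_german_date("Signed on 12/9/2020."))
print(tidy_whitespace("too   many    spaces\n\n\nand blank lines"))
print(find_munich("Munich, München, and Monaco di Baviera"))
print(find_arabic("The word عربيّة is Arabic."))
```

Note that the pattern for Monaco would also match the unrelated country; patterns like these usually need checking against real data.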

What can you do with RegEx? (continued)

  • Find Arabic/Hebrew words with different vocalizations.
  • Find invisible formatting characters like line breaks, tabs, and right-to-left control characters.
  • Find types of things that use common patterns, like capitalized words, dates, or names (e.g., using Abū/Umm and bin/bint to find all names with the pattern Abū Fulān Fulān bin Fulān).
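These three ideas can also be sketched in Python (an illustration, not part of the original slides). All function names are made up, the vowel-mark range U+064B to U+0652 covers only the most common Arabic marks, the list of invisible characters is a small selection, and the name pattern uses the placeholder names from the slide rather than real data.

```python
import re

VOWEL_MARKS = "\u064B-\u0652"  # tanwin, fatha, damma, kasra, shadda, sukun

def any_vocalization(bare_word):
    """Build a pattern matching bare_word with or without vowel marks."""
    parts = (f"{re.escape(letter)}[{VOWEL_MARKS}]*" for letter in bare_word)
    return re.compile("".join(parts))

# Tabs, line breaks, and some right-to-left/left-to-right control characters
INVISIBLE = re.compile(r"[\t\n\r\u200E\u200F\u202A-\u202E]")

def reveal_invisible(text):
    """Replace each invisible character with a visible code such as <U+200F>."""
    return INVISIBLE.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

# Abū/Umm + two names + bin/bint + one name
NAME = re.compile(r"(?:Abū|Umm) \w+ \w+ (?:bin|bint) \w+")

sample = ("Reported by Abū Fulān Fulān bin Fulān "
          "and by Umm Fulāna Fulāna bint Fulān.")
names = NAME.findall(sample)

print(names)
print(reveal_invisible("left\tright\u200f"))
```

For example, `any_vocalization("كتب")` would match the same three consonants whether or not they carry vowel marks.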

RegEx Tutorial

https://docs.google.com/spreadsheets/d/1jTmHopCz8Il6tBopZlG2LfgMSEOuvMJh-Q7nzUwLkvE/edit?usp=sharing (You can experiment by copying the document for yourself or using rows 30+)

See also:

Visual Studio Code

Download and install https://code.visualstudio.com/

Brainstorming about your projects

https://etherpad.studiumdigitale.uni-frankfurt.de/p/24data2

Preview

Git Versioning

Endnotes

  1. Lundström, Peter (2020). PHARAOH.SE. Available at: https://pharaoh.se/ancient-egypt/pharaoh/cleopatra-vii/ [Accessed 25 Apr. 2024]. CC BY 4.0.