Working with Digital Data
in Religious Studies
3. Working with Text: Git Versioning and Archiving Your Data
Summer Semester 2024
Prof. Dr. Nathan Gibson
Outline
- Review: Plain Text & RegEx
- Discuss: The “Computational Turn”
- Brainstorm: Your Datasets
- Tutorial: Git Versioning
📈 Review
Last objective: Understand plain text as a foundational type of data.
📈 Review: Perspectives on Text
- Text from the computer’s perspective includes emojis, non-Latin scripts, Braille, and even some invisible characters, but not images of text (screenshots in .png or .jpg format or some scanned PDFs).
- Text is fundamental to human-machine and human-machine-humane interaction.
- Text is often vital to research workflows in the humanities (sources, analysis, publication).
- But many objects of humanistic research can only partially be captured in or with text.
📈 Review: Plain Text
- Plain text
- is text without formatting.
- can be opened in any text editor or code editor (even if the filename doesn’t end in “.txt”).
- has an “encoding,” i.e. a mapping between letters and numbers. If you use the wrong encoding to open or save a file, you’ll get gibberish.
📈 Review: Unicode
- Unicode is an encoding that tries to assign numbers to every character, without conflicts between languages/scripts. (The same number is never used for more than one character.)
- Unicode (in the form UTF-8) is the most common text encoding today and is supported by almost all programs.
📈 Review: Regular Expressions (RegEx)
- Regular Expressions (RegEx) are patterns for finding and replacing text.
- RegEx is useful for e.g., changing date formats, finding and cleaning up messy text data (e.g. from OCR) or your footnotes in term papers, removing extra spaces and line breaks, and much more!
📈 Review: Regular Expressions (RegEx)
- For example, if I search for
(\W)[Pp]
in Peter Piper picked a peck of pickled peppers.
and replace it with the pattern $1b
, it will replace all P’s or p’s at the beginning of words:
- The parentheses
()
make a group, which I call in the replacement pattern with $1
-
\W
is a special code meaning “word break” (spaces, punctuation, etc.)
-
[Pp]
finds any character in the brackets, thus uppercase and lowercase “P”.
- Result:
beter biper bicked a beck of bickled beppers.
Discussion: The “Computational Turn”
- What kind of turns were there in humanities before the computational turn?
- What is the computational turn?
🧭 Today’s Learning Objective
Be able to use basic features of git to track changes to your data.
Preview
Mark it down and Mark it up (HTML and XML)