Working with Digital Data in Religious Studies

9. Accessing & Structuring Datasets: Go Meta with “FAIR” Principles

Summer Semester 2024
Prof. Dr. Nathan Gibson

Outline

  1. Review
    1. Scraping & Querying
    2. Tutorial

    Break

  2. Metadata
  3. FAIR Principles

1.1 📈 Scraping & Querying Review: Learning Objective

Use URL queries, APIs, and scraping tools to download data systematically from the web.

1.1 📈 Scraping & Querying Review: Quiz

In the URL https://de.wikipedia.org, which part is the top-level domain (TLD)?

  • https
  • de
  • wikipedia
  • org

1.1 📈 Scraping & Querying Review: Quiz

Where would the URL https://google.images.com.vn take you?

  • A Google Images page in Vietnamese
  • A Vietnamese page not run by Google

1.1 📈 Scraping & Querying Review: Quiz

Which part of the URL https://24data.pages.gwdg.de/scraping-tutorial is a subdomain?

  • scraping-tutorial
  • 24data
  • gwdg
  • pages
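The URL parts from these quizzes can be pulled apart with Python's standard library. This is only a rough sketch: it assumes the last hostname label is the TLD and the label before it is the registered domain. That assumption holds for .org and .de but not for multi-part suffixes such as .com.vn, which would need a public suffix list. The variable names are illustrative.

```python
from urllib.parse import urlsplit

url = "https://24data.pages.gwdg.de/scraping-tutorial"
parts = urlsplit(url)

scheme = parts.scheme                 # the protocol, before "://"
labels = parts.hostname.split(".")    # hostname pieces, left to right
path = parts.path                     # everything after the hostname

# Simplifying assumption: last label = TLD, the one before it = registered
# domain, and anything further to the left = subdomains.
tld = labels[-1]
domain = labels[-2]
subdomains = labels[:-2]

print(scheme, subdomains, domain, tld, path)
```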

1.1 📈 Scraping & Querying Review: Quiz

If you type https://example.com/three words together into your browser address bar, what will your browser change it to?

  • https://example.com/three/words/together
  • https://example.com/three-words-together
  • https://example.com/three%20words%20together
  • https://example.com/three_words_together
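Python's standard library can show how spaces in a URL path are typically encoded. A minimal sketch using `urllib.parse.quote`; a browser may handle edge cases differently from this function.

```python
from urllib.parse import quote

path = "three words together"

# quote() percent-encodes characters that are not allowed in a URL path.
encoded_url = "https://example.com/" + quote(path)

print(encoded_url)
```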

1.1 📈 Scraping & Querying Review: Quiz

If you’re searching Google Books, what would you add to the URL https://www.google.com/search?tbo=p&tbm=bks to search for the book A Wrinkle in Time?

  • &q=A+Wrinkle+in+Time
  • /search=A+Wrinkle+in+Time
  • &book=A%20Wrinkle%20in%20Time
  • /A/Wrinkle/in/Time
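A query string can also be built in code instead of typed by hand. The sketch below assumes the search term goes in a parameter named `q`. The `tbo` and `tbm` values are copied from the URL in the question, and there is no guarantee that Google will keep honoring them.

```python
from urllib.parse import urlencode

base = "https://www.google.com/search"

# Each key/value pair becomes key=value; pairs are joined with "&".
params = {"tbo": "p", "tbm": "bks", "q": "A Wrinkle in Time"}

# urlencode() replaces the spaces in values with "+".
search_url = base + "?" + urlencode(params)

print(search_url)
```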

1.1 📈 Scraping & Querying Review: Quiz

What does API stand for?

  • Artificial Personal Inquirer
  • Antarctica Port Interchange
  • Aardvarks Punching Iguanas
  • Application Programming Interface

1.1 📈 Scraping & Querying Review: Quiz

What 2 structured data formats are you most likely to get back from an API?

  • Cookies
  • JSON
  • PNG
  • XML
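Structured responses can be read with Python's standard library, as sketched below. The two sample strings are invented for illustration; a real API defines its own field and element names.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample responses describing the same book in two formats.
json_text = '{"title": "A Wrinkle in Time", "year": 1962}'
xml_text = "<book><title>A Wrinkle in Time</title><year>1962</year></book>"

book_from_json = json.loads(json_text)     # becomes a Python dict
book_from_xml = ET.fromstring(xml_text)    # becomes an element tree

json_title = book_from_json["title"]
xml_title = book_from_xml.findtext("title")

print(json_title, xml_title)
```

Note that JSON keeps the year as a number, while XML returns every value as text.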

1.1 📈 Scraping & Querying Review: Quiz

When scraping a website, what 2 things should you do to make sure you don’t break it?

  • put a pause between your requests
  • send it good vibes
  • include cookies in your request
  • set a limit on the bandwidth you use
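A pause between requests might look like the sketch below. The function name `fetch_politely`, its parameters, and the two-second default are assumptions for illustration. The sketch covers only the pause, not a bandwidth limit.

```python
import time
from urllib.request import urlopen


def fetch_politely(urls, pause_seconds=2.0, fetch=None):
    """Download each URL in turn, waiting between consecutive requests."""
    if fetch is None:
        # Default: a plain download with the standard library.
        fetch = lambda url: urlopen(url, timeout=30).read()
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(pause_seconds)  # give the server a rest
        pages[url] = fetch(url)
    return pages


# Dry run with a stand-in fetch function, so no real website is contacted.
demo_pages = fetch_politely(
    ["page-1", "page-2"], pause_seconds=0, fetch=lambda url: url.upper()
)
print(demo_pages)
```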

1.2 📈 Scraping & Querying: Tutorial

https://24data.pages.gwdg.de/scraping-tutorial

Break

🧭 Today’s Learning Objective

Assess whether and how to make your data more open.

2. Metadata: What is it?

Metadata is data about your data.

2. Metadata: File metadata

On your computer, you might typically see this information about files:

  • all files: date created, date modified, size, file type
  • photos & videos: dimensions, location, camera settings
  • audio & video: playback length, encoding
  • PDFs: number of pages
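Some of this file information can be read from a script. The sketch below uses a hypothetical helper, `basic_file_metadata`, and a throwaway file. It leaves out "date created" because operating systems record that value differently.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def basic_file_metadata(path):
    """Collect a few metadata fields the operating system stores for a file."""
    path = Path(path)
    info = path.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": path.name,
        "file_type": path.suffix,  # extension only, a rough stand-in for type
        "size_bytes": info.st_size,
        "date_modified": modified.isoformat(),
    }


# Demonstration on a temporary file so the sketch runs on any computer.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "notes.txt"
    sample.write_text("hello", encoding="utf-8")
    sample_metadata = basic_file_metadata(sample)

print(sample_metadata)
```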

2. Metadata: Research metadata

Beyond what your computer records automatically, you can decide what else to add to your metadata.

Research metadata should typically include

  • who created/entered the data
  • how the data was processed
  • according to which guidelines

as well as version information for any specific package that is being made available.

2. Metadata: Version information

Version information includes

  • release date of the current version
  • version number (e.g. “1.12”, for “Semantic Versioning” see https://semver.org/)
  • changes since the last version
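One possible way to record research metadata and version information is a small JSON file stored next to the data. In the sketch below, every field name and value is a placeholder, not a standard. The helper `parse_version` is hypothetical and handles only plain dotted numbers such as "1.12".

```python
import json


def parse_version(text):
    """Turn a dotted version string such as "1.12" into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# Every value below is a placeholder, not taken from a real dataset.
dataset_metadata = {
    "created_by": "Example Researcher",
    "processing": "Transcribed from photos, then collated by hand",
    "guidelines": "Project transcription guidelines (placeholder)",
    "version": {
        "number": "1.12",
        "release_date": "2024-06-01",
        "changes": ["Corrected typos in the transcription"],
    },
}

metadata_json = json.dumps(dataset_metadata, indent=2)
print(metadata_json)

# Version numbers compare part by part, not as decimals: 1.12 is newer than 1.9.
is_newer = parse_version("1.12") > parse_version("1.9")
print(is_newer)
```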

2. Metadata: Examples

Information you might include depending on the type of data:

  • texts: language, author, date composed, number of lines, text encoding
  • tables: what the columns are, what they mean, what each row represents
  • databases: structure of tables, fields, and relationships
  • images/audio/video: what steps and software features were used to process them

2. Metadata: How to include it

Check what is standard for the type of data you have. For example,

  • A README.txt file for folders of files
  • “Header” information in text files, HTML, XML, JSON
  • Built-in metadata for images and PDFs (you can change or add this!)

Often, you may have a table or spreadsheet describing the individual files (e.g., images) in your data.
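Such a table could be written as a CSV file. In the sketch below, the file names, column names, and descriptions are invented for illustration. The table is written to an in-memory buffer rather than to disk.

```python
import csv
import io

# Hypothetical image files and descriptions, invented for illustration.
rows = [
    {"filename": "page_001.jpg", "description": "Title page", "language": "Arabic"},
    {"filename": "page_002.jpg", "description": "First page of text", "language": "Arabic"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["filename", "description", "language"],
    lineterminator="\n",
)
writer.writeheader()     # first row: the column names
writer.writerows(rows)   # one row per file

table_text = buffer.getvalue()
print(table_text)
```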

2. Metadata: What does it do?

Metadata lets your data fly!

[Image: kite]

2. Metadata: What does it do?

Metadata allows people to

  • find your data (libraries and web services can catalog it)
  • connect your data (other researchers can know how to interpret and reuse it)
  • remember your data (you can keep track of what you’ve done)

3. FAIR Principles

Findable
Accessible
Interoperable
Re-usable

3. FAIR Principles: Findable

unique identifiers, metadata, in a searchable resource

Imagine: you go to the doctor …

3. FAIR Principles: Accessible

data can be accessed using a standard system on the basis of identifiers

Imagine: you don’t have SKY TV …

  • second-order effects

3. FAIR Principles: Interoperable

data is in a format that can be used by common systems and is linked to other datasets

Imagine: You can’t make coffee …

3. FAIR Principles: Interoperable

[Image: interoperability cake]

3. FAIR Principles: Re-usable

data is licensed for re-use, source is known, meets community standards

Imagine: Facebook owns all the photos you post …
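As a rough illustration only, the four principles can be turned into a quick check of a metadata record. The field names and the mapping from principles to fields are assumptions made for this sketch, not an official FAIR assessment. A real evaluation looks at much more than whether a field is filled in.

```python
# Hypothetical metadata record; every value here is invented for illustration.
record = {
    "identifier": "https://example.org/datasets/demo-1",
    "title": "Demo dataset",
    "access_url": "https://example.org/datasets/demo-1/download",
    "format": "text/csv",
    "license": "CC-BY-4.0",
    "source": "Transcribed by the project team",
}

# Assumed mapping from each principle to fields that hint at it.
FAIR_HINTS = {
    "Findable": ["identifier", "title"],
    "Accessible": ["access_url"],
    "Interoperable": ["format"],
    "Re-usable": ["license", "source"],
}


def missing_fair_hints(metadata):
    """Return, per principle, the hinted fields that are absent or empty."""
    gaps = {}
    for principle, fields in FAIR_HINTS.items():
        missing = [field for field in fields if not metadata.get(field)]
        if missing:
            gaps[principle] = missing
    return gaps


complete_gaps = missing_fair_hints(record)
partial_gaps = missing_fair_hints({"title": "Untitled", "format": "text/csv"})
print(complete_gaps)
print(partial_gaps)
```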

3. FAIR Principles: Should you make your data FAIR?

Copyright, ethics, and your options

  • Can I legally post my entire dataset? Do others hold the copyright to parts of it?
  • Is FAIR what the community stakeholders want?
  • Do I need a higher budget?

3. FAIR Principles: When should you make your data FAIR?

  • Pick (image: cacao tree): Collecting sources. Manuscripts, Photos, Interviews
  • Prepare (image: cacao pod): Structuring data. Transcribing, Collating
  • Process (image: utensils): Outputs. Textual comparison, criticism, content analysis, coding
  • Package (image: chocolate bar): Presentation. Edition, Narrative, Thematic discussion, Interactive website

Preview

Advanced Processing & AI: Get a Grip on Big Data