Working with Digital Data in Religious Studies

8. Accessing & Structuring Datasets: Grab More Data with Scraping and Querying

Summer Semester 2024
Prof. Dr. Nathan Gibson

Outline

  1. Review
    1. Python 🐍
    2. Tutorial

    Break

  2. Scraping & Querying
    1. Web queries
    2. APIs
    3. Scraping
    4. Ethics & legal

1.1. 📈 Python Review: Learning Objective

Learn enough code to understand why you should learn more!

1.1. 📈 Python Review: With a little bit of Python, you can …

  • understand how a tool works (whether it does what you need)
  • create utilities for small tasks like converting your data
  • “scrape” (download) data from web pages
  • create graphs from your data
  • experiment with machine learning

1.1. 📈 Python Review: What we did with Python in office hours

1.2. 📈 Python Review: Tutorial

Break

🧭 Today’s Learning Objective

Use URL queries, APIs, and scraping tools to download data systematically from the web.

2.1. Web Queries: Domains

URL (Uniform Resource Locator): A web address, that is, a string of text that shows your web browser where to find something.

         
protocol   subdomain   second-level domain   top-level domain
https://   www.        example               .com
https://   images.     google                .com.vn
https://   am.         wikipedia             .org
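To see these parts programmatically, here is a minimal sketch using Python's standard `urllib.parse` module. The helper name `describe_host` is made up for this illustration. Note that the standard library only splits the host on dots; it does not know which labels form the top-level domain (for example `.com.vn`), so that last step is left to the reader.

```python
from urllib.parse import urlsplit


def describe_host(url):
    """Return the protocol, the full host name, and the dot-separated labels."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,            # e.g. "https"
        "host": parts.hostname,              # e.g. "am.wikipedia.org"
        "labels": parts.hostname.split("."), # subdomain, domain, top-level domain
    }


info = describe_host("https://am.wikipedia.org")
print(info["protocol"])
print(info["labels"])
```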

2.1. Web Queries: Pages

     
domain                a directory   an HTML page
https://example.com   /maps         /index.html

domain                a category                                 a page that is easier without “.html” on the end
https://example.com   /topic-that-might-show-up-well-in-google   /because-google-gives-greater-weight-to-words-in-the-web-address

2.1. Web Queries: URL encoding

URLs allow only certain characters. Your browser converts any others (such as spaces) into an encoded form for you.

You see this in your address bar:

https://www.deepl.com/translator#en/de/Translate this sentence with spaces.

But when you copy the address you get this:

https://www.deepl.com/translator#en/de/Translate%20this%20sentence%20with%20spaces.
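The same conversion can be tried in Python. This is a small sketch using `quote` and `unquote` from the standard `urllib.parse` module; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

text = "Translate this sentence with spaces."

# quote() replaces characters that are not allowed in a URL, here each space
encoded = quote(text)
print(encoded)  # Translate%20this%20sentence%20with%20spaces.

# unquote() reverses the conversion
decoded = unquote(encoded)
print(decoded)
```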

2.1. Web Queries: URL encoding

Special characters (including non-Latin ones) are written as a % sign followed by a hex value for each byte of the character's Unicode UTF-8 encoding.

https://ar.wikipedia.org/wiki/فيروز_(مغنية)

is the same as

https://ar.wikipedia.org/wiki/%D9%81%D9%8A%D8%B1%D9%88%D8%B2_(%D9%85%D8%BA%D9%86%D9%8A%D8%A9)

(See the UTF-8 hex encoding for any character at https://symbl.cc.)
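As a rough check of the example above, the following sketch decodes the encoded page title from the slide and encodes it again. It assumes only Python's standard `urllib.parse` module. The `safe="()"` argument is a choice made here so that the parentheses stay unencoded, as they do in the Wikipedia address.

```python
from urllib.parse import quote, unquote

encoded = "%D9%81%D9%8A%D8%B1%D9%88%D8%B2_(%D9%85%D8%BA%D9%86%D9%8A%D8%A9)"

# Decode the %XX sequences back into readable text (UTF-8 is the default)
title = unquote(encoded)
print(title)

# The first letter is stored as two UTF-8 bytes, hence two %XX groups
first_letter_hex = title[0].encode("utf-8").hex().upper()
print(first_letter_hex)

# Encoding the readable title again gives back the original string
reencoded = quote(title, safe="()")
print(reencoded == encoded)
```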

2.1. Web Queries: Dynamic queries

     
base URL                        field name   field value
https://example.com/maps        ?place=      Bora+Bora
https://www.google.com/search   ?q=          rickroll

The field name is sometimes just “q” for “query”. The field value usually has + for spaces.

2.1. Web Queries: Dynamic queries with multiple parameters

       
base URL                        parameter 1 (after ?)   parameter 2 (after &)   parameter 3 (after &)
https://example.com/maps        ?place=Bora+Bora        &zoom=300               &type=terrain
https://www.google.com/search   ?q=rickroll             &hl=he                  &tbm=vid
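A query like the first one can be put together and taken apart in Python. This is a minimal sketch using `urlencode`, `urlsplit`, and `parse_qs` from the standard library. The address `https://example.com/maps` is the placeholder from the slide, not a real service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://example.com/maps"
params = {"place": "Bora Bora", "zoom": 300, "type": "terrain"}

# urlencode() joins the parameters with & and writes spaces as +
url = base + "?" + urlencode(params)
print(url)

# parse_qs() goes the other way: it reads the part after ? into a dictionary.
# Each value is a list, because a field name may appear more than once.
query = parse_qs(urlsplit(url).query)
print(query)
```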

2.2. APIs

API (Application Programming Interface): a way for programs to talk to each other.

Chatting

2.2. APIs: URL queries

  • If a website has an API, you can use a URL query to get information from it (often in JSON format, sometimes in an XML format).
  • Look for the website’s “API documentation” (e.g., under a section for developers) to find out how to use it.

2.2. APIs: Websites

  • Not all websites have APIs.
  • For some websites, you have to get an “API key” or “developer key” to use the API (or certain features of it). Sometimes you can do this with a free account (e.g., Europeana.eu).
  • Some websites have an open API anyone can use.

2.2. APIs: Examples

https://api.vam.ac.uk/v2/objects/search?q=%22china%22&material_technique=Silver

documentation at https://developers.vam.ac.uk/guide/v2/welcome.html

https://api.zotero.org/groups/5490830/items?include=bib&style=chicago-note-bibliography

documentation at https://www.zotero.org/support/dev/web_api/v3/basics

https://syriaca.org/api/geo/json?type=monastery

documentation at https://syriaca.org/api-documentation/index.html
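An API query of this kind could be sent from Python roughly as follows. This is only a sketch under a few assumptions: the helper names `build_url` and `fetch_json` are made up here, the API is assumed to answer with JSON, and nothing is assumed about what the response contains. The request itself is left commented out so that the code runs without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, params):
    """Attach URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Rebuild the first example query from the slide
vam_url = build_url(
    "https://api.vam.ac.uk/v2/objects/search",
    {"q": '"china"', "material_technique": "Silver"},
)
print(vam_url)

# Uncomment to actually send the request:
# data = fetch_json(vam_url)
# print(type(data))
```

Check the API documentation before sending many requests, since some services limit how often you may query them.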

2.2. APIs: Social Media

  • Instagram, Facebook, YouTube, Pinterest, and Reddit all have APIs that you might be able to use for free.
  • You may find it easiest to use a service like https://apify.com/, which has scripts already set up to use with these APIs.

2.3. Scraping: Definition

Web scraping: extracting information from websites, usually by downloading it with a program.

  • wget is a common program used for this, often in combination with Python scripts

2.3. Scraping: What would you use it for?

  • To download a “local” version (on your computer) of a whole website
  • To download all the links from a list
  • To save the results of one or more API queries
  • To save a large amount of information from websites without APIs
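As one illustration of the second use (collecting all the links from a list), here is a minimal sketch with Python's standard `html.parser` module. The class name `LinkCollector` and the sample HTML are invented for this example. In practice the HTML would come from a downloaded page, and many people use third-party libraries for this step instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the target of every <a href="..."> tag as a full URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin() turns relative links into complete addresses
                self.links.append(urljoin(self.base_url, value))


# Invented sample page; a real script would download this first
sample_html = """
<ul>
  <li><a href="/maps/index.html">Maps</a></li>
  <li><a href="https://example.com/about">About</a></li>
</ul>
"""

collector = LinkCollector("https://example.com/")
collector.feed(sample_html)
print(collector.links)
```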

Preview

  1. Accessing & Structuring Datasets: Go Meta with “FAIR” Principles