Summer Semester 2024
Prof. Dr. Nathan Gibson
–Break–
Use URL queries, APIs, and scraping tools to download data systematically from the web.
In the URL https://de.wikipedia.org, which part is the top-level domain (TLD)?
https
de
wikipedia
org
Where would the URL https://google.images.com.vn take you?
Which part of the URL https://24data.pages.gwdg.de/scraping-tutorial is a subdomain?
scraping-tutorial
24data
gwdg
pages
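To check answers to the URL questions above, a host name can be split into its dot-separated labels. This is a minimal sketch using Python's standard urllib.parse module; the helper name host_labels is made up for this example. It assumes a single-label TLD, so it does not recognize multi-label endings such as com.vn.

```python
from urllib.parse import urlsplit


def host_labels(url: str) -> list[str]:
    """Return the dot-separated labels of the URL's host name, left to right."""
    host = urlsplit(url).hostname or ""
    return host.split(".")


url = "https://24data.pages.gwdg.de/scraping-tutorial"
labels = host_labels(url)

print(labels)              # ['24data', 'pages', 'gwdg', 'de']
print(labels[-1])          # the last label is the top-level domain: de
print(urlsplit(url).path)  # the path comes after the host: /scraping-tutorial
```

Reading the labels from right to left goes from the most general part of the host name to the most specific.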
If you type https://example.com/three words together into your browser address bar, what characters will your browser change it to?
https://example.com/three/words/together
https://example.com/three-words-together
https://example.com/three%20words%20together
https://example.com/three_words_together
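The same conversion can be tried outside the browser. This is a small sketch using quote and unquote from Python's standard urllib.parse module, which apply percent-encoding to characters that are not allowed in a URL path. Browsers may differ in small details, so treat it as an illustration rather than a description of any particular browser.

```python
from urllib.parse import quote, unquote

path = "three words together"

# quote() replaces each character that is unsafe in a URL path
# with a percent sign followed by its byte value in hexadecimal.
encoded = quote(path)
print("https://example.com/" + encoded)

# unquote() reverses the encoding.
print(unquote(encoded))
```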
If you’re searching Google Books, what would you add to the URL https://www.google.com/search?tbo=p&tbm=bks to search for the book A Wrinkle in Time?
&q=A+Wrinkle+in+Time
/search=A+Wrinkle+in+Time
&book=A%20Wrinkle%20in%20Time
/A/Wrinkle/in/Time
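Query strings like the one in this question can be built and taken apart programmatically instead of by hand. The sketch below uses urlencode and parse_qs from Python's standard urllib.parse module. The parameter names tbo and tbm come from the URL in the question, and q is assumed to be the search parameter. The snippet only builds the URL; it does not check whether the service still honors these parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://www.google.com/search"
params = {"tbo": "p", "tbm": "bks", "q": "A Wrinkle in Time"}

# urlencode() joins key=value pairs with "&" and encodes spaces as "+".
url = base + "?" + urlencode(params)
print(url)

# parse_qs() reads the query string back into a dictionary of lists.
print(parse_qs(urlsplit(url).query))
```

Building the query from a dictionary makes it easy to loop over many search terms when downloading data systematically.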
What does API stand for?
What 2 structured data formats are you most likely to get back from an API?
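As an illustration, here is the same record read from two widely used structured formats, JSON and XML, using only Python's standard library. The two payloads are invented for this sketch and do not come from any real API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical API responses describing the same book.
json_text = '{"title": "A Wrinkle in Time", "year": 1962}'
xml_text = "<book><title>A Wrinkle in Time</title><year>1962</year></book>"

# JSON maps directly onto Python dictionaries, lists, numbers, and strings.
from_json = json.loads(json_text)

# XML is a tree of elements; here each child element becomes a dictionary entry.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

print(from_json)
print(from_xml)
```

Note that JSON keeps the year as a number, while every value read from XML text arrives as a string and has to be converted if needed.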
When scraping a website, what 2 things should you do to make sure you don’t break it?
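Two common courtesies when scraping are respecting the site's robots.txt rules and pausing between requests. These may or may not be the two points this question has in mind. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the URLs, and the user agent name my-course-scraper are all invented for the example, and no request is actually sent.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

user_agent = "my-course-scraper"  # made-up name for this sketch
urls = [
    "https://example.com/page1",
    "https://example.com/private/data",
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [u for u in urls if parser.can_fetch(user_agent, u)]

# Use the delay the site asks for, or fall back to one second.
delay = parser.crawl_delay(user_agent) or 1

for u in allowed:
    print("would download:", u)  # a real scraper would request the page here
    time.sleep(delay)            # pause so the server is not flooded
```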
Assess whether and how to make your data more open.
Metadata is data about your data.
On your computer, you might typically see this information about files:
But you can decide more things you want to add to your metadata.
Research metadata should typically include
as well as version information for any specific package that is being made available.
Version information includes
Information you might include depending on the type of data:
Check what is standard for the type of data you have. For example,
Often, you may have a table or spreadsheet describing the individual files (e.g., images) in your data.
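A table like this can be started automatically from what the file system already knows. The following is a minimal sketch, not a metadata standard: the function names, the column names, and the choice of a SHA-256 checksum are all assumptions made for this example. It uses only Python's standard library.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone

# Column names chosen for this sketch; adapt them to your field's conventions.
FIELDS = ["filename", "size_bytes", "modified_utc", "sha256", "description"]


def describe_files(folder: str) -> list[dict]:
    """Collect basic metadata for every regular file directly inside folder."""
    rows = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subfolders
        info = os.stat(path)
        with open(path, "rb") as fh:
            checksum = hashlib.sha256(fh.read()).hexdigest()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        rows.append({
            "filename": name,
            "size_bytes": info.st_size,
            "modified_utc": modified.isoformat(),
            "sha256": checksum,
            "description": "",  # left empty, to be filled in by hand
        })
    return rows


def write_table(rows: list[dict], out_path: str) -> None:
    """Write the collected rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

For a hypothetical folder named images, calling write_table(describe_files("images"), "metadata.csv") would produce one row per file. The description column is left empty because that information cannot be read from the file system and has to come from you.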
Metadata lets your data fly!
Metadata allows people to find (libraries and web services can catalog your data), connect (other researchers can know how to interpret and reuse your data), and remember (you can keep track of what you’ve done) your data.
Findable: unique identifiers, metadata, in a searchable resource
Imagine: you go to the doctor …
Accessible: data can be accessed using a standard system on the basis of identifiers
Imagine: you don’t have SKY TV …
Interoperable: data is in a format that can be used by common systems and is linked to other datasets
Imagine: You can’t make coffee …
Re-usable: data is licensed for re-use, source is known, meets community standards
Imagine: Facebook owns all the photos you post …
Copyright, ethics, and your options
Pick | Prepare | Process | Package |
Collecting sources | Structuring data | Outputs | Presentation |
Manuscripts, Photos, Interviews | Transcribing, Collating | Textual comparison, criticism, content analysis, coding | Edition, Narrative, Thematic discussion, Interactive website |
Advanced Processing & AI: Get a Grip on Big Data