Summer Semester 2024
Prof. Dr. Nathan Gibson
–Break–
Learn enough code to understand why you should learn more!
Use URL queries, APIs, and scraping tools to download data systematically from the web.
URL (Uniform Resource Locator): A web address, that is, a string of text that shows your web browser where to find something.
protocol | subdomain | second-level domain | top-level domain
https:// | www. | example | .com
https:// | images. | google | .com.vn
https:// | am. | wikipedia | .org
domain | a directory | an HTML page
https://example.com | /maps | /index.html
domain | a category | a page that is easier without “.html” on the end
https://example.com | /topic-that-might-show-up-well-in-google | /because-google-gives-greater-weight-to-words-in-the-web-address
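The parts labeled above can also be pulled apart by a program. The following is a minimal sketch using Python's standard library; the helper name `describe_url` is made up for this illustration, and splitting the host on dots is a simplification (it cannot tell that ".com.vn" belongs together, for example).

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the parts named on the slides (simplified)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "protocol": parts.scheme,            # e.g. "https"
        "host": host,                        # e.g. "www.example.com"
        "host_labels": host.split("."),      # subdomain, second-level, top-level
        # Directories and pages, without the empty piece before the first "/"
        "path_segments": [seg for seg in parts.path.split("/") if seg],
    }


info = describe_url("https://www.example.com/maps/index.html")
print(info["protocol"])
print(info["host_labels"])
print(info["path_segments"])
```

Here `info["host_labels"]` is a list with the subdomain first and the top-level domain last, and `info["path_segments"]` holds the directory followed by the HTML page.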
URLs allow only certain characters, so your browser converts all the others for you (this is called percent-encoding).
You see this in your address bar:
https://www.deepl.com/translator#en/de/Translate this sentence with spaces.
But when you copy the address you get this:
https://www.deepl.com/translator#en/de/Translate%20this%20sentence%20with%20spaces.
Special characters (including non-Latin ones) are written as the hex values of their UTF-8 bytes, with a % in front of each byte.
https://ar.wikipedia.org/wiki/فيروز_(مغنية)
is the same as
https://ar.wikipedia.org/wiki/%D9%81%D9%8A%D8%B1%D9%88%D8%B2_(%D9%85%D8%BA%D9%86%D9%8A%D8%A9)
(See the UTF-8 hex encoding for any character at https://symbl.cc.)
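The conversion your browser does can be reproduced in a few lines. This is a small sketch with Python's standard `urllib.parse` module, using the two examples above; passing `safe="()"` is a choice made here so that the parentheses stay unencoded, as they do in the Wikipedia address.

```python
from urllib.parse import quote, unquote

# Spaces become %20 when a piece of text is percent-encoded.
sentence = "Translate this sentence with spaces."
sentence_encoded = quote(sentence)
print(sentence_encoded)

# Decoding the Wikipedia page title gives back the Arabic text.
encoded_title = "%D9%81%D9%8A%D8%B1%D9%88%D8%B2_(%D9%85%D8%BA%D9%86%D9%8A%D8%A9)"
title = unquote(encoded_title)
print(title)

# Encoding it again: each character becomes its UTF-8 bytes in hex.
# By default quote() would also encode "(" and ")", so keep them as-is.
reencoded = quote(title, safe="()")
print(reencoded)

# The bytes behind a single character, e.g. the first letter of the title.
first_letter_bytes = title[0].encode("utf-8").hex()
print(first_letter_bytes)
```

The first letter takes two bytes in UTF-8, which is why it shows up as two `%` groups in the encoded address.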
base URL | field name (sometimes just “q” for “query”) | field value (usually with + for spaces)
https://example.com/maps | ?place= | Bora+Bora
https://www.google.com/search | ?q= | rickroll
base URL | parameter 1 (after ?) | parameter 2 (after &) | parameter 3 (after &)
https://example.com/maps | ?place=Bora+Bora | &zoom=300 | &type=terrain
https://www.google.com/search | ?q=rickroll | &hl=he | &tbm=vid
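Because queries follow this fixed pattern, a program can assemble them from a list of fields and values. Below is a minimal sketch with Python's standard library; `build_url` is a helper name invented for this example, and the parameters are the ones from the tables above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url: str, params: dict) -> str:
    """Join a base URL and its parameters with "?" and "&"."""
    # urlencode() writes field=value pairs, joins them with "&",
    # and turns spaces into "+".
    return f"{base_url}?{urlencode(params)}"


maps_url = build_url(
    "https://example.com/maps",
    {"place": "Bora Bora", "zoom": 300, "type": "terrain"},
)
print(maps_url)

search_url = build_url(
    "https://www.google.com/search",
    {"q": "rickroll", "hl": "he", "tbm": "vid"},
)
print(search_url)

# Going the other way: read the parameters back out of a URL.
# Every value comes back as a list of strings.
maps_params = parse_qs(urlsplit(maps_url).query)
print(maps_params)
```

Changing one value in the dictionary (for example, looping over a list of place names) is what makes it possible to download data systematically instead of clicking through pages by hand.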
API (Application Programming Interface): a way for programs to talk to each other.
https://api.vam.ac.uk/v2/objects/search?q=%22china%22&material_technique=Silver
documentation at https://developers.vam.ac.uk/guide/v2/welcome.html
https://api.zotero.org/groups/5490830/items?include=bib&style=chicago-note-bibliography
documentation at https://www.zotero.org/support/dev/web_api/v3/basics
https://syriaca.org/api/geo/json?type=monastery
documentation at https://syriaca.org/api-documentation/index.html
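An API URL like the ones above can be built and requested from a script. The sketch below uses only Python's standard library; the helper names `build_api_url` and `fetch_json` are made up for this illustration, and it assumes the API answers with JSON. What the JSON contains differs per API and is described in each API's documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_api_url(endpoint: str, params: dict) -> str:
    """Attach query parameters to an API endpoint."""
    return f"{endpoint}?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 30):
    """Download a URL and parse the response as JSON (assumes a JSON reply)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# The quotation marks around "china" are encoded as %22 automatically.
vam_url = build_api_url(
    "https://api.vam.ac.uk/v2/objects/search",
    {"q": '"china"', "material_technique": "Silver"},
)
print(vam_url)

syriaca_url = build_api_url(
    "https://syriaca.org/api/geo/json",
    {"type": "monastery"},
)
print(syriaca_url)

# To actually send the request (needs an internet connection):
# data = fetch_json(vam_url)
```

The request itself is left commented out so that the example runs without a network connection; uncommenting the last line would download the search results as a Python object.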
Web scraping: extracting information from websites, usually by downloading it with a program.
wget is a common program used for this, often in combination with Python scripts.
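Once a page has been downloaded, a script can pick out the pieces of interest. The following is a small sketch that collects all links from a page using Python's built-in HTML parser; the class name `LinkExtractor` and the sample page are invented for this example, and the HTML is given as a string here instead of being downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the target of every <a href="..."> link on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser for every opening tag it meets.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into full web addresses.
                    self.links.append(urljoin(self.base_url, value))


# A made-up page standing in for a downloaded file.
sample_html = """
<html><body>
  <h1>Maps</h1>
  <a href="index.html">Start</a>
  <a href="/about">About</a>
  <a href="https://example.org/">Elsewhere</a>
</body></html>
"""

extractor = LinkExtractor("https://example.com/maps/")
extractor.feed(sample_html)
for link in extractor.links:
    print(link)
```

In practice the string would come from a file saved by a download tool such as wget, or from a request made in Python, and many projects use a dedicated scraping library instead of the built-in parser.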