# About this notebook

**You must run the cells of this notebook in order** for them to work properly. Also, if you disconnect and reconnect, you may have to run the earlier cells again.

Programming Historian has [a great introduction to wget](https://programminghistorian.org/en/lessons/automated-downloading-with-wget), and I've liberally taken ideas from there for this tutorial.

# API query

Suppose we want to get some texts from the [Latin Wikisource](https://la.wikisource.org).

First, we can look at its [API documentation here](https://la.wikisource.org/api/rest_v1/) to see what we can query. Among other things, the API can return pages related to a given title.

*   The base url for Latin Wikisource is `https://la.wikisource.org/api/rest_v1/`.
*   To that we add `page/related/` as given in the API documentation.
* Finally, we use the title of the page we want to get related entries for. Let's try `Lex_Duodecim_Tabularum` (which you can see at <https://la.wikisource.org/wiki/Lex_Duodecim_Tabularum>).

Thus, the full URL for this query is `https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum`. If you go to <https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum>, you'll see a JSON-formatted list of results, where "0" is the first result, "1" is the second, etc. Each of these has fields and subfields.

![JSON results](https://24data.pages.gwdg.de/assets/img/scraping-tutorial-json.png)
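
The three pieces listed above can also be put together in Python, which is handy if you later want to query several titles. This is only a small sketch; the variable names (`base_url`, `endpoint`, `title`, `query_url`) are made up for illustration.

```python
# Build the query URL from the three parts described above.
base_url = "https://la.wikisource.org/api/rest_v1/"
endpoint = "page/related/"
title = "Lex_Duodecim_Tabularum"

# Joining the strings gives the address we can open in a browser or pass to wget.
query_url = base_url + endpoint + title
print(query_url)
```

To look up related pages for a different text, you would only need to change `title`.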

# Downloading results

But how do we download the results of this query? We could just right click > Save as in a web browser, which would work nicely for one page of results, but we'll be downloading a lot of pages.

Here's where the wget program comes in. Wget isn't a typical program that you interact with using a mouse. It is a command-line utility that runs in a "shell" (or "terminal"). ([To learn more about the command line, check out this tutorial from Programming Historian.](https://programminghistorian.org/en/lessons/intro-to-bash))

If you're doing this tutorial in Google Colab, you don't need to download wget, since Google will run it for you on its servers.

But if you're running this notebook on your own computer, you'll need to install it (if it's not already on your computer). You can use the [installation instructions for Windows here](https://programminghistorian.org/en/lessons/automated-downloading-with-wget#windows-instructions) or [for Mac here](https://programminghistorian.org/en/lessons/automated-downloading-with-wget#os-x-instructions).

We will also be using Python for some things. Python is a separate program from wget, so we could switch back and forth between the two. Instead, we will stay in Python and use a Python script that calls wget. You **don't need to understand how this script works, but you do need to run it** and understand that the Python function it defines (`runcmd`) will let us put in a wget command.

In [None]:
# @title 🡷 Run this cell to create the `runcmd` function { display-mode: "form" }
# Adapted from https://www.scrapingbee.com/blog/python-wget/

import subprocess # This says to use the Python module "subprocess"

# Now we define a python function that will run a command in the machine's "shell" (command line)
def runcmd(cmd, verbose = True, *args, **kwargs):

    # Start the command ("cmd") in the shell with Popen from the subprocess module, capturing its output and its error messages.
    process = subprocess.Popen(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
        text = True,
        shell = True
    )
    std_out, std_err = process.communicate()
    if verbose:
        print(std_out.strip(), std_err)
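
If you are curious what `runcmd` is doing, the sketch below shows the same idea in a slightly different form: it runs a command in the shell and hands back the results instead of only printing them. The function name `run_and_capture` is made up for this example, and it uses the harmless `echo` command rather than wget so that it runs without a network connection.

```python
import subprocess

def run_and_capture(cmd):
    """Run a shell command and return its exit code, output, and error messages."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout.strip(), result.stderr.strip()

# An exit code of 0 conventionally means the command finished without errors.
code, out, err = run_and_capture("echo hello")
print(code, out)
```

wget writes its status messages to the error stream rather than the normal output, which is why `runcmd` prints both.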

The most basic way to use wget is to just type the command `wget` followed by a space and the URL you want to download. Since we are using this within Python, we'll put it as a string inside the `runcmd` function. Try running this.

In [None]:
runcmd('wget https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum')

You'll see status messages like "Resolving ...," "Connecting ...," and then the size of the file and how long it took to save it: `1.52M=0.02s`.

**But where did it save it?** If you're running this notebook on Google Colab, click the folder icon that looks like this 🗀 in the left toolbar. There you'll see a file "Lex_Duodecim_Tabularum".

If you're running this notebook in Jupyter Notebook on your own computer, it will save the file in the same folder this notebook (.ipynb file) is in.
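
If you are not sure where that folder is, you can also ask Python directly. The sketch below prints the folder Python is currently working in (wget started through `runcmd` saves there by default) and checks whether the downloaded file is present. The variable name `downloaded` is just for illustration.

```python
import os

# The folder Python (and any command it starts) is working in.
print(os.getcwd())

# Everything currently in that folder, in alphabetical order.
print(sorted(os.listdir(".")))

# True if the file from the wget command above is there, False otherwise.
downloaded = os.path.exists("Lex_Duodecim_Tabularum")
print(downloaded)
```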

## Try saving your own file

From anywhere, really! Copy the URL you want to save in place of `https://example.com` below.

In [None]:
runcmd('wget https://example.com')

## Using wget options

How can you save the file with a different name or to a different place? To do that, you use a wget "option". For a command-line program like wget, "options" are the extra instructions you give with the command. First you write the program name (`wget`), then the options, then the URL you want to save, like this:
`wget [options] [URL]`

If you need to see [all the possible options, you can consult them here](https://www.gnu.org/software/wget/manual/wget.html), but there are a lot!

The option to give the downloaded file our own name is `-O ` (short form) or `--output-document=` (long form), followed by the name we want, such as `lex.json`.

In [None]:
runcmd('wget -O lex.json https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum')

The file should appear in the file list on the left after a few seconds. If it doesn't, close and reopen the file list.

Try it now with your own filename in place of `yourfile.json`.

In [None]:
runcmd('wget -O yourfile.json https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum')
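
Since the wget command is just a string, you can also let Python assemble it following the `wget [options] [URL]` pattern. The helper below is a hypothetical sketch (`build_wget_command` is not part of wget or of this notebook); it uses `shlex.quote` so that a filename containing spaces does not get split into several pieces by the shell.

```python
import shlex

def build_wget_command(url, output_name=None):
    """Return a wget command string, optionally with the -O option."""
    parts = ["wget"]
    if output_name is not None:
        parts += ["-O", output_name]
    parts.append(url)
    # Quote each part so the shell treats it as a single item.
    return " ".join(shlex.quote(part) for part in parts)

command = build_wget_command(
    "https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum",
    output_name="lex.json",
)
print(command)
```

The resulting string could then be passed to `runcmd(command)`.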

# Download multiple files using the recursive option

The real power of wget is in downloading multiple files. wget can "follow" the links from one page to another, meaning that you can set it to also save the pages linked from the first page you give it. This is the "recursive" option with `-r`.

Suppose we're interested in the initiative ["International Partnership on Religion and Sustainable Development", which has 171 members](https://www.partner-religion-development.org/members/). Each member has its own page on the site with a short description, under a URL beginning with `https://www.partner-religion-development.org/member/`.

First, we want to **download only the member pages** linked from the page <https://www.partner-religion-development.org/members/>, not all links. So we use the "include directories" option to specify only links that are in the "member" directory: `-I /member`.

But **we need to put some limitations on this**. First, we should tell wget how many layers of links to follow. (The default is 5.) We only want the first page (with the list of members) and the pages linked from there. So we will use the option `-l 2` to say "two layers."

Next, we need to **put in a wait time and a bandwidth limit**, so that we don't overload the website. We can put a 10-second wait between downloads using `-w 10` and limit the bandwidth to 20 kilobytes per second using `--limit-rate=20k`.
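
To get a feeling for what a 10-second wait means in practice, here is a rough calculation. It assumes one download per member page and counts only the waiting time, not the time the downloads themselves take, so treat the result as a lower bound.

```python
members = 171        # number of member pages, from the members list above
wait_seconds = 10    # the value we give to -w

# The waits alone add up to roughly this many seconds.
total_wait_seconds = members * wait_seconds
print(total_wait_seconds)                  # 1710
print(round(total_wait_seconds / 60, 1))   # 28.5 (minutes)
```

That is close to half an hour for the full list of members.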

Let's see what happens! Press the run button, but be aware that this will take a while. You don't have to let it finish -- you can press the stop button after a minute or so.

In [None]:
runcmd('wget -r -l 2 -w 10 --limit-rate=20k -I /member https://www.partner-religion-development.org/members/')

Now look at the files in the left sidebar. You'll see a new folder (`www.partner-religion-development.org`) with subfolders. wget is imitating the structure of the website you're downloading.

If you're trying to analyze information from those pages, you might want all of them in a single file. You can do that using the `-O` output option we used earlier. Now all these pages will be saved in a single file with whatever name you give it, e.g., `pard-members.html`. Try running this:

In [None]:
runcmd('wget -r -l 2 -w 10 --limit-rate=20k -I /member -O pard-members.html https://www.partner-religion-development.org/members/')

But what about that JSON file (`Lex_Duodecim_Tabularum`) we got earlier from Latin Wikisource? It is a list of texts related to Lex Duodecim Tabularum. But since it is a JSON file (instead of an HTML web page), wget won't be able to "see" and follow these links to download them. For that, we need to process them with python first.

# Create a list of links

If we have a list of links in a plain text file, wget can download each one of them. So to download each of the texts related to Lex Duodecim Tabularum, we need to write a python script that will first create that list, from the `Lex_Duodecim_Tabularum` json file we already have.

Python uses the `json` module to work with JSON data, so first we will import that.

In [None]:
import json

Next, we'll load the JSON file `Lex_Duodecim_Tabularum`. (Normally it would have `.json` on the end, but since we didn't tell wget a filename for it earlier, it just used the name of the page.)

To load a file in python, use `with open('filename') as f:`.
- `with` is a keyword in Python that makes sure the file is closed again when we are done with it, even if an error comes up while reading it
- `open()` is a function that takes the filename as a string
- `as f` means the opened file will be made available via the variable `f`

Following the `with` statement, we need to indent. Then we will create the variable `data`, where the file (`f`) is processed as JSON using the `load` function from the `json` module.

In [None]:
with open('Lex_Duodecim_Tabularum') as f:
  data = json.load(f)

Now we have a variable (`data`) containing JSON data that python can access. If you download the `Lex_Duodecim_Tabularum` file from the files sidebar (click on the ... beside it, then on Download), you can look through it in Visual Studio Code and see that the links we need are inside pages > content_urls > desktop > page. (Or you can take my word for it.)
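
To see how that path through the JSON works in Python, here is a made-up miniature with the same nesting. The page name and URL are invented for illustration, and the real file has many more fields per page.

```python
# A made-up miniature of the structure: "pages" is a list, each item is a dictionary.
sample = {
    "pages": [
        {
            "titles": {"canonical": "Example_Page"},
            "content_urls": {"desktop": {"page": "https://example.org/wiki/Example_Page"}},
        }
    ]
}

# Take the first item of the list, then step down one key at a time.
first = sample["pages"][0]
first_link = first["content_urls"]["desktop"]["page"]
print(first_link)
```

Each pair of square brackets goes one level deeper, in the same order as pages > content_urls > desktop > page.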

`pages` here contains a "list" of items that python can loop through.

First we'll create a variable (`links`) to put the list of links in.

In [None]:
links = ''

Then we'll loop through each of the items in the `pages` list of our JSON file. We're using the `page` variable to hold each of those items as we do that.

Then we're accessing the value of content_urls > desktop > page for each item, and adding it (with a line break first, `"\n"`) to the links variable.

At the end, we print the `links` variable to make sure it contains what we expect.

In [None]:
for page in data["pages"]:
  links += "\n" + page["content_urls"]["desktop"]["page"]

print(links)
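
The loop above puts a line break before every link, so the text starts with an empty line. If you would rather avoid that, one alternative is to collect the links in a Python list and join them afterwards. The sketch below uses invented sample data in place of the real `data` variable so that it runs on its own; `link_list` and `links_text` are illustrative names.

```python
# Invented sample data with the same structure as the real JSON file.
sample = {
    "pages": [
        {"content_urls": {"desktop": {"page": "https://example.org/wiki/A"}}},
        {"content_urls": {"desktop": {"page": "https://example.org/wiki/B"}}},
    ]
}

# Collect one link per page in a list.
link_list = [page["content_urls"]["desktop"]["page"] for page in sample["pages"]]

# Join the list into one piece of text, with a line break between the links only.
links_text = "\n".join(link_list)
print(links_text)
```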

Now try it yourself: use your own variable to get something else from the JSON file, for example titles > canonical.

In [None]:
yourvariable  = ''

for page in data["pages"]:
  yourvariable += "\n" + page["content_urls"]["desktop"]["page"] # Replace the keys after `page` with the ones you want to get from the JSON file.

print(yourvariable)

Now we just need to save our `links` variable as a file (`links.txt`), so we can use it with wget.

In [None]:
# Creates the file "links.txt" as the "file" variable, opening it in write mode ('w').
with open('links.txt', 'w') as file:
    # Puts the content of the "links" variable into the "links.txt" file.
    file.write(links)
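
Before handing a file like this to wget, it can be reassuring to read it back and count the links. The sketch below writes a small stand-in file (the name `links-check-example.txt` and its two URLs are invented) so that it runs on its own; in the notebook you would read `links.txt` instead.

```python
from pathlib import Path

# Stand-in file so the sketch is self-contained; it mimics the leading line break of links.txt.
check_file = Path("links-check-example.txt")
check_file.write_text("\nhttps://example.org/wiki/A\nhttps://example.org/wiki/B")

# Read the file back, one entry per line, skipping empty lines.
urls = [line for line in check_file.read_text().splitlines() if line.strip()]
print(len(urls))
print(urls[0])
```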

# Download a list of links

Now we will use the `-i` option in wget with `links.txt` to download every file in that list.

We will also use the `-P` option with `wikisource` to designate a folder we want those files to go into.

Here it goes!

In [None]:
runcmd('wget -i links.txt -P wikisource -w 10 --limit-rate=20k')

In the files sidebar, you'll now (or soon) see the `wikisource` folder with files inside it. When you download these, you might want to add `.html` to the end of them so you can open them easily in your web browser. Also, remember that if you want them all in a single file, you can use the `-O` option.
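
Adding `.html` by hand gets tedious with many files, so here is a sketch of how Python could do the renaming. The function name `add_html_extension` is made up, and the example tries it on a temporary folder with an invented file rather than on your real downloads.

```python
from pathlib import Path
import tempfile

def add_html_extension(folder):
    """Add ".html" to every file in the folder that doesn't already end with it."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and not path.name.endswith(".html"):
            new_path = path.with_name(path.name + ".html")
            path.rename(new_path)
            renamed.append(new_path.name)
    return renamed

# Try it on a throwaway folder containing one invented file.
with tempfile.TemporaryDirectory() as demo_folder:
    (Path(demo_folder) / "Example_Page").write_text("<html></html>")
    result = add_html_extension(demo_folder)
    print(result)
```

On your own files, the call would be `add_html_extension('wikisource')`.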

When you download and open the files, you'll see that they are very basic text, because we only downloaded the HTML files themselves, not the additional files (like CSS) used to make HTML pages look pretty.

Now you have a "corpus" of Latin texts you can use for your analysis!

# Download files from Google Colab to your computer

Google Colab **will not keep your files** when you disconnect. If you want to download them to your computer to use them, you'll have to do the following.

- To download individual files from the Files sidebar to your computer, you can click on the `...` beside the file and then `Download`.
- To download a whole folder, you need to compress it first as a .zip file and then download it, like this (replace `wikisource` with the name of the folder you want to download, if it's different):

In [None]:
!zip -r wikisource.zip wikisource/ # Replace wikisource with the name of the folder you want to download

from google.colab import files
files.download('wikisource.zip') # Replace wikisource with the name of the folder you want to download
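
The `!zip` line and `google.colab` only work inside Colab. If you are running the notebook on your own computer, Python's built-in `shutil` module can create the .zip file instead. The sketch below demonstrates this on a temporary folder with an invented file so that it runs on its own.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as work_dir:
    # Set up a small stand-in "wikisource" folder with one invented file.
    folder = Path(work_dir) / "wikisource"
    folder.mkdir()
    (folder / "Example_Page").write_text("<html></html>")

    # Compress the folder; make_archive adds the ".zip" ending itself.
    archive_path = shutil.make_archive(
        str(Path(work_dir) / "wikisource"),  # archive name without ".zip"
        "zip",
        root_dir=work_dir,       # the folder that contains "wikisource"
        base_dir="wikisource",   # the folder to put into the archive
    )
    archive_name = Path(archive_path).name
    archive_exists = Path(archive_path).exists()
    print(archive_name)
```

For the real folder, next to your notebook, the equivalent call would be `shutil.make_archive('wikisource', 'zip', root_dir='.', base_dir='wikisource')`.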

# The whole script

If you want the whole python script, which you could adapt for your own purposes, here it is.

In [None]:
# Import the modules we need.
import subprocess
import json

# Now we define a python function that will run a command in the machine's "shell" (command line)
# Adapted from https://www.scrapingbee.com/blog/python-wget/
def runcmd(cmd, verbose = True, *args, **kwargs):

    # Start the command ("cmd") in the shell with Popen from the subprocess module, capturing its output and its error messages.
    process = subprocess.Popen(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
        text = True,
        shell = True
    )
    std_out, std_err = process.communicate()
    if verbose:
        print(std_out.strip(), std_err)

# Uses the Latin Wikisource API to get the JSON page listing texts related to Lex Duodecim Tabularum
runcmd('wget https://la.wikisource.org/api/rest_v1/page/related/Lex_Duodecim_Tabularum')

# Loads the JSON file into the data variable
with open('Lex_Duodecim_Tabularum') as f:
  data = json.load(f)

links = ''

# Creates a list of the page links, with each URL on a new line.
for page in data["pages"]:
  links += "\n" + page["content_urls"]["desktop"]["page"]

# Creates the file "links.txt" as the "file" variable, opening it in write mode ('w').
with open('links.txt', 'w') as file:
    # Puts the content of the "links" variable into the "links.txt" file.
    file.write(links)

# Downloads the list of URLs in links.txt to the folder "wikisource".
runcmd('wget -i links.txt -P wikisource -w 10 --limit-rate=20k')

!zip -r wikisource.zip wikisource/ # Replace wikisource with the name of the folder you want to download

from google.colab import files
files.download('wikisource.zip') # Replace wikisource with the name of the folder you want to download

# More data scraping in python

## Getting specific things from HTML pages

Now that you've downloaded HTML pages, how do you extract specific information from them in Python? You can use the tool "Beautiful Soup". [There's a tutorial here](https://programminghistorian.org/en/lessons/retired/intro-to-beautiful-soup) (but note the example web page used there is outdated).
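
Beautiful Soup is a separate package that has to be installed first. To give a taste of the general idea without it, the sketch below uses Python's built-in `html.parser` module instead (so this is not Beautiful Soup itself) to collect all link addresses from a piece of HTML. The class name `LinkCollector` and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it comes across."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

sample_html = '<p>See <a href="/wiki/Example_A">A</a> and <a href="/wiki/Example_B">B</a>.</p>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

For a downloaded page, you would read the file's contents into a string and pass that to `collector.feed()` in place of `sample_html`.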

## Other Tutorials

If you'd like more practice with data scraping in python, check out this tutorial from Programming Historian: <https://programminghistorian.org/en/lessons/applied-archival-downloading-with-wget>.

## Internet Archive Tool

The massive Internet Archive has its own Python tool for downloading data from its site. Check out <https://programminghistorian.org/en/lessons/data-mining-the-internet-archive> for a tutorial on using that.

## Instagram Scraper

Finally, people have developed various python tools for scraping social media platforms. One of them is [Instascrape](https://github.com/chris-greening/instascrape) for downloading Instagram posts. (Thanks A.P. for the recommendation!)