OpenRefine Tutorial

1. Load the text file

Background: This uses the same IMDB data as the spreadsheet tutorial (IMDB movies with “God” in the title, released 2020–2023, with at least 100 votes).

I copied the first 50 of the films from the IMDB result into Visual Studio Code, where I used find/replace to remove the line break before “Metascore” and turn it into a comma instead. (Because some movies had Metascore and some didn’t, it gave me an uneven number of lines per film, which made it difficult to convert to columns in OpenRefine.) This is the a href=”assets/downloads/imdb.txt” download>imdb.txt</a> file we’ll be working with.

  1. Download and install OpenRefine on your computer.
  2. Download the imdb.txt file.
  3. Open OpenRefine. This will open a “local server” in your web browser, with an address like http://127.0.0.1:3333/...).
  4. Under Create Project > This Computer, click Browse and select the imdb.txt file. Click Next.
  5. As you’ll see, the data from IMDB has multiple rows (lines) for each film. We want each film to have only 1 row with multiple columns for this information. In the options at the bottom of the screen change “Parse every 1 line(s) into one row” to “Parse every 5 line(s) into one row”.
  6. At the top right, enter a name for your project, and then click Create project. In the future, you’ll be able to open this project again from the first OpenRefine screen by going to “Open project”.

2. Cleaning and structuring data

  1. At the top, under the heading “50 rows,” you’ll see the option to show more rows. Click 50 to show all the rows at once.
  2. Column 2 shows the title of the films, but we want to get rid of the number at the beginning. Click the arrow beside “Column 2”, then Edit Column > Add column based on this column… We will be doing this a lot. (Instead of adding a new column, you could also go to Edit cells > Transform …, but I like to keep the original column so I can see if anything went wrong, and then delete it later.) Enter “Title” as the New column name. The Expression field uses General Refine Expression Language (GREL), which you can learn in detail here if you’d like. But for today, we will mostly be using the replace function, which is similar to the regexreplace you used in Google Sheets. In the Expression field, paste this: replace(value,/^\d+\. /,""). This means:
    • replace(value,... Replace the value of each cell in this column (value).
    • /^\d+\. / Look for the regex expression between the / symbols. (If you were not using regex, you could just write the text you want to look for between ".)
    • ^\d+\. Beginning of the cell (^), followed by 1 or more digits (\d+), followed by a . (a special symbol in regex, which has to be written as \. in regex if you’re looking for an actual dot), followed by a space ( or \s).
    • ...,"") Replace what was found with nothing, in other words, delete it.
  3. Notice that in the Undo/Redo tab on the left, you can see a detailed history of what you’ve done and go back to any step.
  4. In Column 3, we have three different fields squished together: year, runtime, and rating. Let’s make a year column by adding a column based on column 3. We’ll call it “Year” and use the GREL Expression replace(value,/^(202\d).*/,"$1"). This erases everything after the year, but keeps the part in (), the first “capturing group,” which is $1 in the replacement.

  5. To separate out the rest of Column 3, I could keep adding columns, but I’m going to do something fancy with the “split column” feature instead.
    1. Go to Edit column > Split into several columns …
    2. Uncheck “Remove this column” and check “regular expression”.
    3. As the separator, use the regex 202\d|[hm]. This means, the places to split the text are at the year (202\d, simply removing the year), or (|) at the h or m ([hm]) for hours and minutes.
    4. Delete Column 3 1 (Edit column > Remove this column) and rename the other columns “Runtime H”, “Runtime M”, and “Rating” (Edit column > Rename this column).
  6. Ultimately I want to have runtime in the format H:MM such as “1:07”, but the minutes sometimes have just a single digit (“7”) instead of a leading zero (“07”).
    1. On the column Runtime M, click Text filter. Then click on the Facet/Filter tab on the left side of the screen, check “regular expression”, and enter ^\d$. Since ^ means “beginning of the string” and $ means “end of string,” this shows you only the rows that have a single digit.
    2. On column Runtime M, click Edit cells > Transform … This will change all the cells in the filtered rows only (!) using the expression you enter here.
    3. Enter "0" + value and click OK. This means, “Put the text ‘0’ in front of the cell’s current value.”
    4. The rows will disappear, because they now have 2 digits and n longer match your filter. Under Facet/Filter, click Remove all filters.
  7. Now I want to join the hours and minutes into a single “Runtime” column.
    1. On column Runtime H, click Edit column > Join columns …
    2. In the list of columns on the left, also select Runtime M.
    3. Select options: Enter : as the Separator between the content of each column, and select the option Write result in new column named. Enter Runtime there. Click OK.
    4. If that worked correctly, you can delete Column 3, Runtime H, and Runtime M.
  8. Let’s separate Column 4 into User Rating, Number of Ratings, and Metascore.
    1. Use Edit column > Split into several columns … again, with the regex ?\(|\),? which means that the text should be split where there is a ( possibly with a space before it ( ?\(), or (|) at a ) possibly with a comma after it (\),?).
    2. If this works correctly, you can delete Column 4 and rename the new columns.

    3. The Number of Ratings might not be important to me, but it’s a kind of measure of popularity, so if I do want to use it as a number for sorting or whatever, I should change the “K” into thousands. Do Edit cells > Transform … and enter replace(value,"K","000"). After this, you’ll have some incorrect numbers with a . in them, like “6.2000”. Do Edit cells > Transform … again and enter replace(value,/\.(\d)0/,"$1"), which finds e.g. “.20” and replaces it with just “2”. (Another approach would have been to use a filter to get all the rows containing K, then remove the K, and use a Transform to multiply the cells by 1000.)

3. Pulling in additional data

I might have everything I need now (hooray!), but perhaps I want to do something with the names of the stars in Column 1. OpenRefine actually has a feature to match your data to entities in Wikidata (largely coming from Wikipedia) or other sources. Many of these actors are surely in these databases, so I could get additional info about them. (For example, check out https://www.wikidata.org/wiki/Q349391 on Helen Mirren.)

  1. Separate the actors’ names into individual columns.
    1. Not all the films have the stars listed. The ones that do have the names of the stars followed by “in” and the film. On Column 1, click Text filter and enter ` in ` (including the spaces). You’ll see 19 films listing the stars. (Just to test, you can click “invert” on the filter to see only the ones without stars.)
    2. On the filtered list, click Edit column > Split into several columns … Uncheck “Remove this column” and check “regular expression”. Enter ,( and)? ?| and | in .*$. This will separate the names by commas and “and” including optional spaces (,( and)? ?| and ), and also remove the film title and date at the end by deleting “ in “ and everything after it. (It took me many attempts to get this regex right!)
  2. On a column of names (e.g. Column 1 1), click Reconcile > Start reconciling … As the service (database), I would normally choose Wikidata, but I’m having some difficulty today reconciling person names with that one. So I’ll use “GND reconciliation for OpenRefine” instead. This is the Gemeinsame Normdatei. If you don’t see that one, click Add standard service … and enter the URL https://lobid.org/gnd/reconcile. In the list of types, click “Individualisierte Person”, then Start reconciling … This may take a while.
  3. In the column you reconciled, you’ll now see some different options with each name:
    • Names that were matched with “high confidence” will appear simply as a link that you can click on to go to their page in the database.
    • If the database has multiple matches, you’ll see a list to choose from.
    • If there is no match, you’ll see the option “Create new”.
    • You’ll also see filters in the sidebar where you can show which rows had successful matches.

  4. Now you can add new information from the database you reconciled against.
    1. On Column 1 1, click Reconcile > Add column with URLs of matched entities. Type URL as the column name and click OK. Now you have the web address of the person in the GND database.
    2. On Column 1 1, click Edit column > Add columns from reconciled values … Try something general from the list like “Geburtsdatum” and click OK. Now you have the birth date of the actors you reconciled!

  5. You can try something similar with the film titles using Wikidata (which seems to work for this one).