Uploading a spreadsheet to Mix’n’Match

If your institution maintains a public list of IDs – like the numbers Heritage NZ gives each of its historic buildings, or the IDs that Find NZ Artists uses in its URLs – you can add them to Wikidata items two ways: the slow way, by editing items one after another and adding your number to each; or the quick way, by using a tool like Mix’n’Match.

Mix’n’Match was written by Magnus Manske to allow volunteers to check and confirm computer-based matching of an uploaded database against Wikidata. Nothing is written into Wikidata until a human being confirms it.

To upload a spreadsheet into Mix’n’Match, you need four things for each entry: the unique ID that you want to add to Wikidata items (which should have its own Wikidata property), a Name for the person, place or thing, a short Description, and a URL. The most important things for descriptions of people are birth and death dates, nationality, and profession. Only about 40 words or 250 characters of description will be displayed, so no need to go nuts here. Each ID should have a corresponding URL somewhere, so person 1234 should be found at something like: http://www.mysite.org/people/personlist.1234

The Alexander Turnbull Library, part of the National Library of New Zealand, maintains an authority list for authors in their collection. Last year I proposed a Wikidata property, Alexander Turnbull Library ID (P6683) – you can read the proposal discussion here. Once the property was approved, anyone could add that ID manually to a person in Wikidata. But to speed up the process, Jay Gattuso at the Turnbull extracted a spreadsheet of 2000 names for Mix’n’Match. To increase the chance of finding matches, he picked the 2000 people with the most connections to items in the collections, which should be a reasonable proxy for Wikidata notability.

The 2000 most-connected people in the Alexander Turnbull Library unpublished-works database

To prepare the dataset for Mix’n’Match, I merged the names with =CONCATENATE(B2,C2), merged the dates and bio with a space between them with =CONCATENATE(D2," ",E2), and deleted the XML and relations columns. We could keep the URL column, but every URL follows the same schema, so we’ll just show Mix’n’Match how to assemble URLs from the IDs later.
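
For anyone who would rather script this step than use spreadsheet formulas, here is a minimal Python sketch of the same two merges. It is not the method used for this upload: the function name merge_row, the meaning of each column, and the sample values are all assumptions made up for illustration.

```python
def merge_row(name_part_1, name_part_2, dates, bio):
    """Mimic the two spreadsheet formulas for a single row.

    Hypothetical helper: the column meanings are assumed, not taken
    from the real Turnbull spreadsheet.
    """
    # =CONCATENATE(B2,C2): the name parts are joined with no separator
    name = name_part_1 + name_part_2
    # =CONCATENATE(D2," ",E2): dates and bio joined with a single space
    description = dates + " " + bio
    return name, description


# Invented sample row, for illustration only
print(merge_row("Smith, ", "Jane", "1900-1980", "Poet"))
```

Note that the first merge adds no space, exactly like the formula, so the name columns need to carry any spacing or punctuation themselves.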

The cleaned-up spreadsheet, ready for export

Mix’n’Match wants a tab-delimited text file (cell contents separated by tabs, a paragraph mark at the end of each line). In LibreOffice, you do this by saving the data as a Text CSV file (CSV stands for “comma-separated values”; yes, it’s confusing), and in the next dialogue box setting the delimiter to be tabs, not commas.
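
If your data lives somewhere other than a spreadsheet, the same kind of file can be produced with a few lines of Python. This is a sketch under assumptions: the function name to_tab_delimited is hypothetical, the column order (ID, name, description) is my guess at a sensible layout, and the sample entry is invented.

```python
def to_tab_delimited(rows):
    """Turn rows of cells into tab-delimited text, one entry per line."""
    lines = []
    for row in rows:
        # A stray tab inside a cell would be read as an extra column,
        # so swap any tabs for spaces before joining.
        cells = [str(cell).replace("\t", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


# Invented sample entry: ID, name, description (assumed column order)
rows = [("1234", "Jane Smith", "1900-1980 Poet")]
print(to_tab_delimited(rows), end="")
```

The result can be written to a .txt file or pasted straight into a text box.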

So now we go to Mix’n’Match, click “import”, and fill in the form. Make up a short name and one-sentence description for your dataset. Clicking on this description will send users to the catalogue URL (in this case, Tiaki). Pick a vaguely appropriate category, and add your ID's Wikidata property (or go propose one and come back when it’s approved). The URL pattern for this particular upload is https://tiaki.natlib.govt.nz/#details=ethesaurus.$id (where $id is our Entry ID). And these are all “instances of humans”, so they’re Q5.
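
To see what the URL pattern does, here is a small Python sketch that swaps $id for an entry ID. How Mix’n’Match performs the substitution internally is an assumption on my part, and the function name entry_url and the ID 1234 are made up for illustration.

```python
# The URL pattern given in the import form
URL_PATTERN = "https://tiaki.natlib.govt.nz/#details=ethesaurus.$id"


def entry_url(entry_id, pattern=URL_PATTERN):
    """Build an entry's URL by replacing $id with its ID (hypothetical helper)."""
    return pattern.replace("$id", str(entry_id))


# 1234 is an invented ID, not a real Turnbull entry
print(entry_url(1234))
```

It is worth pasting one or two assembled URLs into a browser before uploading, to check that the pattern really leads to the right records.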

You can upload your tab-delimited file, or just paste the text into a box. So I do so, and oh no! An error: “empty cells in line 200”. I go back to the spreadsheet and realise what the problem is: Dennis Huggard’s description is broken into paragraphs, and extra paragraph marks are a no-no in a CSV file (each paragraph is treated as a new line of data).
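
A quick check before uploading can catch this kind of problem. The Python sketch below lists the line numbers that do not have the expected number of non-empty cells. It is my own guess at a useful check, not Mix’n’Match's validation logic; the function name find_bad_lines, the three-column layout, and the sample text are assumptions.

```python
def find_bad_lines(text, expected_cells=3):
    """Return the 1-based numbers of lines with missing or empty cells."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        cells = line.split("\t")
        wrong_count = len(cells) != expected_cells
        has_empty_cell = any(not cell.strip() for cell in cells)
        if wrong_count or has_empty_cell:
            problems.append(number)
    return problems


# Invented sample: the first description has been split over two lines
sample = "1\tA Person\tFirst paragraph\nSecond paragraph\n2\tB Person\tFine\n"
print(find_bad_lines(sample))
```

A stray paragraph break shows up as a line with too few cells, which is the same symptom the upload form complained about.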

So I scroll through the database and delete extra paragraphs in descriptions where I see them (they’re not hard to spot). I could do a find-and-replace to turn them into spaces, but I cannot make LibreOffice’s regular expressions special-character search work, sigh. Luckily there isn’t much to fix. I export a fresh CSV, try uploading again, get another error, and repeat once more.
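
For a bigger dataset, where fixing things by hand would be too slow, the clean-up could be scripted instead. This is a minimal Python sketch, assuming the descriptions are available as plain strings before export; the function name flatten and the sample description are made up, and it is not what I did here.

```python
import re


def flatten(description):
    """Collapse paragraph breaks and other runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", description).strip()


# Invented description with a paragraph break in the middle
print(flatten("Collector.\n\nDonated recordings."))
```

Because the pattern matches any run of whitespace, this also tidies up stray tabs and double spaces in passing.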

Success!

It looks like hardly anything matches, but don’t be fooled: the software is busily making comparisons behind the scenes, and it will take a few minutes to finish.

Now I’ve started confirming those matches: most are a doddle (“Yes, that Igor Stravinsky… Yup, that Keith Holyoake…”). It looks like the software has found a match for most of the items. When I check the unmatched, some are just cases where the Turnbull uses a person’s full name (John Alan Edward Mulgan) but that full name’s not recorded in their Wikidata item (John Mulgan). So I can find them in Wikidata, add the name, and link them manually in Mix’n’Match by clicking on Set Q. Beginners can stick to the preliminary matches; people who know a little Wikidata can resolve the unmatched and create Wikidata items where needed.

So there you go! I hope this has encouraged beginners to try uploading their own Mix’n’Match dataset. Do spread the word to volunteers about this fun and slightly nerdy activity; something to calm our minds in stressful times, and more socially useful than jigsaws.
