Batch-Downloading from Wikimedia servers

Today I just wanted to download some images from Wikimedia Commons. As it turned out, it's not that simple: how do you get all images from one category (in this case the category of the Chinese stroke order project)?
  1. How to get the pages from one category?

    Well, there is an API for doing stuff like this, you just have to know how to use it:

    http://commons.wikimedia.org/w/query.php?action=query&what=category&cptitle=Chinese_stroke_order_in_animated_GIF&cpnamespace=6&format=xmlfm&cplimit=400

    The limit might have to be increased if the category is bigger; very large values don't seem to be supported. The format "xmlfm" gives nicely formatted output in the browser, "xml" gives the same data without the HTML wrapping. Then grep the URLs of the picture description pages out of the saved output (pictures.xml here):

    cat pictures.xml | grep title | sed 's/\s*<title>/http:\/\/commons.wikimedia.org\/wiki\//' | sed 's/<\/title>\s*//' | grep -e "Image:.*"
  2. How to get the pictures from all the picture description pages?

    Run the following on the file with the description page URLs from step 1 (here saved as strokeorder_url) to get the direct image URLs (a Python version of steps 1 and 2 is sketched after this list):

    wget -qO - -i strokeorder_url | grep fullImageLink | sed 's/^.*src="\([^"]*\)".*/\1/'

  3. How to download these files?

    wget didn't work for me; it messed up about two thirds of the filenames. You might want to try it with Python instead:

    
    #!/usr/bin/python
    # -*- coding: utf8 -*-
    
    import urllib
    import sys
    import re
    import codecs
    
    # Pretend to be a regular browser; without this user agent the
    # server answers with "access denied".
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    
    urllib._urlopener = AppURLopener()
    
    # The file given on the command line contains one image URL per line.
    listFilename = sys.argv[1]
    file = codecs.open(listFilename)
    for url in file:
        url = urllib.unquote(url.strip())
        # The part after the last slash becomes the local file name.
        matchObj = re.search("/([^/]+)$", url)
        if matchObj:
            fileName = matchObj.group(1)
            print "getting", fileName
            urllib.urlretrieve(url, fileName)
    file.close()
    
    Notice the user agent string; without it I would constantly receive an "access denied" warning.
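
If you prefer to stay in Python for steps 1 and 2 as well, the whole pipeline can be scripted roughly like this. This is only a minimal sketch that mirrors the grep/sed pipelines above under the same assumptions (the old query.php API and the "fullImageLink" markup on the description pages); the output file name image_urls.txt is made up for the example:

    #!/usr/bin/python
    # -*- coding: utf8 -*-
    # Sketch: collect the direct image URLs of a category into image_urls.txt.
    # Mirrors the grep/sed pipelines above; file names are just examples.
    
    import re
    import urllib
    
    # Same browser user agent trick as in the download script.
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    
    urllib._urlopener = AppURLopener()
    
    api = ("http://commons.wikimedia.org/w/query.php?action=query"
           "&what=category&cptitle=Chinese_stroke_order_in_animated_GIF"
           "&cpnamespace=6&format=xml&cplimit=400")
    
    # Step 1: the category listing gives the picture description pages.
    xml = urllib.urlopen(api).read()
    titles = re.findall(r"<title>(Image:[^<]+)</title>", xml)
    pages = ["http://commons.wikimedia.org/wiki/" + t.replace(" ", "_")
             for t in titles]
    
    # Step 2: each description page links the full image via "fullImageLink".
    out = open("image_urls.txt", "w")
    for page in pages:
        html = urllib.urlopen(page).read()
        match = re.search(r'fullImageLink.*?src="([^"]*)"', html, re.DOTALL)
        if match:
            out.write(match.group(1) + "\n")
    out.close()

Saving the download script from step 3 as, say, fetch.py (the name is arbitrary) and running "python fetch.py image_urls.txt" should then pull everything down.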

That's it. Leave a comment if that was helpful to you.

Using wget

wget seems not to work out of the box for this case: apparently, for backward compatibility it escapes some characters when building the local file names, which mangles them. See http://thread.gmane.org/gmane.comp.web.wget.general/7127.
They say "--restrict-file-names=nocontrol" should do the trick, though I haven't tested that yet.
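
For the record, the invocation would presumably look something like this (untested; image_urls.txt is the list of direct image URLs from step 2):

    wget --restrict-file-names=nocontrol -i image_urls.txt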