Batch-Downloading from Wikimedia servers

Today I just wanted to download some images from Wikimedia Commons. As it turned out, it's not that simple: how do you get all images from one category (in this case the category of the Chinese stroke order project)?
  1. How to get the pages from one category?

    Well, there is an API for doing stuff like this, you just have to know how to use it:

    http://commons.wikimedia.org/w/query.php?action=query&what=category&cptitle=Chinese_stroke_order_in_animated_GIF&cpnamespace=6&format=xmlfm&cplimit=400

    The limit might have to be increased if the category is bigger; very large values don't seem to be supported. The format "xmlfm" gives nicely formatted output in the browser, "xml" gives the same data without the HTML wrapping. Then grep the URLs of the picture description pages out of the saved output (pictures.xml here):

    cat pictures.xml | grep title | sed 's/\s*<title>/http:\/\/commons.wikimedia.org\/wiki\//' | sed 's/<\/title>\s*//' | grep -e "Image:.*"
  2. How to get the pictures from all the picture description pages?

    Run the following on the file with the description page URLs from step 1 (here saved as strokeorder_url) to get the direct image URLs (a Python version of steps 1 and 2 is sketched after this list):

    wget -qO - -i strokeorder_url | grep fullImageLink | sed 's/^.*src="\([^"]*\)".*/\1/'

  3. How to download these files?

    wget didn't work for me; it messed up about two thirds of the filenames. You might want to try it with Python instead:

    
    #!/usr/bin/python
    # -*- coding: utf8 -*-
    
    import urllib
    import sys
    import re
    import codecs
    
    # Pretend to be a regular browser; without this user agent the
    # server answers with "access denied".
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    
    urllib._urlopener = AppURLopener()
    
    # The file given on the command line contains one image URL per line.
    listFilename = sys.argv[1]
    file = codecs.open(listFilename)
    for url in file:
        url = urllib.unquote(url.strip())
        # The part after the last slash becomes the local file name.
        matchObj = re.search("/([^/]+)$", url)
        if matchObj:
            fileName = matchObj.group(1)
            print "getting", fileName
            urllib.urlretrieve(url, fileName)
    file.close()
    
    Notice the user agent string; without it I would constantly receive an "access denied" warning.
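
If you prefer to stay in Python for steps 1 and 2 as well, the whole pipeline can be scripted roughly like this. This is only a minimal sketch that mirrors the grep/sed pipelines above under the same assumptions (the old query.php API and the "fullImageLink" markup on the description pages); the output file name image_urls.txt is made up for the example:

    #!/usr/bin/python
    # -*- coding: utf8 -*-
    # Sketch: collect the direct image URLs of a category into image_urls.txt.
    # Mirrors the grep/sed pipelines above; file names are just examples.
    
    import re
    import urllib
    
    # Same browser user agent trick as in the download script.
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    
    urllib._urlopener = AppURLopener()
    
    api = ("http://commons.wikimedia.org/w/query.php?action=query"
           "&what=category&cptitle=Chinese_stroke_order_in_animated_GIF"
           "&cpnamespace=6&format=xml&cplimit=400")
    
    # Step 1: the category listing gives the picture description pages.
    xml = urllib.urlopen(api).read()
    titles = re.findall(r"<title>(Image:[^<]+)</title>", xml)
    pages = ["http://commons.wikimedia.org/wiki/" + t.replace(" ", "_")
             for t in titles]
    
    # Step 2: each description page links the full image via "fullImageLink".
    out = open("image_urls.txt", "w")
    for page in pages:
        html = urllib.urlopen(page).read()
        match = re.search(r'fullImageLink.*?src="([^"]*)"', html, re.DOTALL)
        if match:
            out.write(match.group(1) + "\n")
    out.close()

Saving the download script from step 3 as, say, fetch.py (the name is arbitrary) and running "python fetch.py image_urls.txt" should then pull everything down.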

That's it. Leave a comment if that was helpful to you.

Using wget

wget seems not to work out of the box for this case: apparently, for backward compatibility it escapes some characters when building the local file names, which mangles them. See http://thread.gmane.org/gmane.comp.web.wget.general/7127.
They say "--restrict-file-names=nocontrol" should do the trick, though I haven't tested that yet.
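
For the record, the invocation would presumably look something like this (untested; image_urls.txt is the list of direct image URLs from step 2):

    wget --restrict-file-names=nocontrol -i image_urls.txt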