Wikipedia Offline Server 0.2

With this, I am officially releasing the Wikipedia Offline Server 0.21 into the world (thanks Liam and Espen B-P for helping to test 0.1 and 0.2).

UPDATE: I’ve made a static page about the Wikipedia Offline Server, and released version 0.22. Go to the new page.

UPDATE: The new, significantly changed project, now lives on Gitorious.org.

Download the Wikiserver

Here are the release notes:

The WIKIPEDIA OFFLINE SERVER (temporary name)

This is a Ruby script that lets you browse offline HTML dumps of Wikipedia in any language (downloadable from http://download.wikipedia.org). The script uses 7zip to uncompress only the files it needs (it unpacks a few CSS files etc. at startup), keeps a cache of all files already read, and runs an internal web server (WEBrick) that can be viewed at http://localhost:2000/wiki. It is currently very unfinished, but still quite usable. I am putting it out there for people to try, and I think it could become a great tool with some help (especially with 7zip).
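As a rough illustration of the extract-and-cache idea, here is a minimal Ruby sketch. It is not the actual wiki-html.rb code: the class name `ArchiveCache` and its methods are made up for this example, and it assumes a `7za` binary on the path that accepts `e` (extract) and `-so` (write to standard output).

```ruby
require "open3"

# Hypothetical helper, not the real WikiServer code: fetches one member of a
# .7z archive and remembers it so repeat requests skip 7zip entirely.
class ArchiveCache
  # extractor: any callable taking (archive, member) and returning the file
  # contents as a String. The default shells out to 7za (assumed to be on
  # the path).
  def initialize(archive, extractor: nil)
    @archive = archive
    @extractor = extractor || method(:extract_with_7za)
    @cache = {}
  end

  # Returns the member's contents, extracting only on the first request.
  def fetch(member)
    @cache[member] ||= @extractor.call(@archive, member)
  end

  def cached?(member)
    @cache.key?(member)
  end

  private

  # "e" extracts and "-so" sends the extracted data to standard output.
  # Passing an argument list (not one string) avoids shell quoting.
  def extract_with_7za(archive, member)
    out, status = Open3.capture2("7za", "e", "-so", archive, member)
    raise "7za failed for #{member}" unless status.success?
    out
  end
end
```

A web server handler could then call something like `ArchiveCache.new("wikipedia-no-html.7z").fetch(path)` for each request. The in-memory hash is the simplest possible cache; a cache kept on disk would survive restarts.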

## THE PROBLEM WITH 7ZIP
The wikipedia-(language version)-html.7z file is a compressed archive that contains all the individual HTML files (generated from the Wikipedia SQL) that make up a language wiki. This can run to hundreds of thousands of files (or millions for the bigger wikis). If we uncompressed these to the hard drive, they would take up a great deal of space, both because they are big in total and because there is a very large number of small files, each of which still takes up at least one block (depending on the file system). We want to be able to use the 7zip file directly. Currently I use the off-the-shelf 7zip decompressor, but it’s slow: it was not optimized for finding one file among 100,000. The exact time depends on your computer; on my two-year-old iBook it can take up to 14 seconds to locate one file in the Norwegian Wikipedia. This is independent of my script.

Proposed solution: I am quite sure that it would be easy (a few hours) to use the freely available source code for 7zip to implement a small tool that indexes a 7zip archive (writing all the filenames, and their offsets, to a separate file). If we then modified the decompressor so that it reads the offset from this index, instead of going through the entire archive from start to finish, I think we could get the time to uncompress a file down to almost nothing. This would radically improve the usability of this project (and might be useful for other projects too). My problem is that I know nothing of C, and I don’t really have the time to learn. I would REALLY really appreciate some help here!!
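To make the index idea concrete, here is a small Ruby sketch of "scan once, then seek". It uses a made-up, uncompressed container format, NOT the real 7z format, and all function names are hypothetical. Whether an offset alone is enough for a real dump is an open question: 7z archives can be packed "solid", with many files sharing one compressed stream, and then a single file cannot be read without decompressing the data in front of it.

```ruby
require "stringio"

# Illustration only: each record is
# [4-byte name length][name][4-byte data length][data].
def write_container(io, files)
  files.each do |name, data|
    io.write([name.bytesize].pack("N"))
    io.write(name)
    io.write([data.bytesize].pack("N"))
    io.write(data)
  end
end

# One linear pass records where each file's data starts and how long it is.
def build_index(io)
  index = {}
  io.rewind
  until io.eof?
    name_len = io.read(4).unpack1("N")
    name = io.read(name_len).force_encoding(Encoding::UTF_8)
    data_len = io.read(4).unpack1("N")
    index[name] = [io.pos, data_len]
    io.seek(data_len, IO::SEEK_CUR) # skip the data without reading it
  end
  index
end

# With the index in hand, a lookup is a single seek instead of a scan.
def read_member(io, index, name)
  offset, length = index.fetch(name)
  io.seek(offset)
  io.read(length)
end
```

The index would be built once per dump and saved next to it, so that each later request pays only for one seek and one read.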

## LIMITATIONS (UPDATE)
It seems that links containing non-ASCII characters (including the last three letters of the Norwegian alphabet) do not work at all on Mac or Windows (it could be my sh setup that is bad), but they have been reported to work on Linux. Does this suck insanely? YES! Is it easy to fix? Not necessarily, because I need to get the Unicode characters correctly from the web browser, to the web server, through the command line, to the 7zip command-line utility. This is obviously a high priority for me, also because it renders the Chinese stuff completely unusable. Note that the server can display articles containing Unicode perfectly; it just cannot deal with the filenames.
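One possible direction, sketched in Ruby under assumptions: decode the percent-encoded path sent by the browser into a UTF-8 string, then hand the filename to 7za as a separate argument so that no shell sits in between. The function names are hypothetical, and this has not been checked against the Mac and Windows failures described above, where the platform's own handling of command-line encodings may still get in the way.

```ruby
# Hypothetical helper: turns the percent-encoded path a browser sends into
# the UTF-8 name expected inside the archive.
def decode_request_path(raw_path)
  # Work on raw bytes so that multi-byte sequences are reassembled correctly.
  bytes = raw_path.b.gsub(/%(\h\h)/) { [$1].pack("H2") }
  name = bytes.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "path is not valid UTF-8" unless name.valid_encoding?
  name
end

# Builds the 7za command as an argument list. Giving a list to system or
# Open3 bypasses the shell, so shell quoting never touches the filename.
def extract_command(archive, member)
  ["7za", "e", "-so", archive, member]
end
```

For example, `system(*extract_command("wikipedia-no-html.7z", decode_request_path(path)))` would run 7za directly with the decoded name as its last argument.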

Second limitation: No installation routine… But I’ll give you some hints.

## INSTALLATION
Nothing to it really, but you need two things: Ruby and 7zip. After that, you should be able to run ruby wiki-html.rb from within the directory where the file resides, and if there are any “wikipedia-*-html.7z” files present it will work. Use a web browser (I assume you can figure out how to download a web browser) to view http://localhost:2000/wiki, and it will either show a list of available wikis (afterwards available at http://localhost:2000/wiki_list) or, if there is only one wiki dump file present, display the start page for that wiki. Happy surfing (but be prepared to be patient).
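The start-up behaviour described here can be sketched in a few lines of Ruby. This is a guess at the logic, not the code in wiki-html.rb; `find_dumps` and `start_action` are hypothetical names, and the explicit message for the "no dump found" case is an addition for the example.

```ruby
# Hypothetical helper: maps language code => dump path for every
# wikipedia-<language>-html.7z file found in dir.
def find_dumps(dir = ".")
  paths = Dir.glob(File.join(dir, "wikipedia-*-html.7z")).sort
  paths.each_with_object({}) do |path, dumps|
    lang = File.basename(path)[/\Awikipedia-(.+)-html\.7z\z/, 1]
    dumps[lang] = path if lang
  end
end

# Decides what http://localhost:2000/wiki should show.
def start_action(dumps)
  if dumps.empty?
    [:error, "No wikipedia-*-html.7z file found. Put the dump in the " \
             "same directory as wiki-html.rb."]
  elsif dumps.size == 1
    [:start_page, dumps.keys.first]
  else
    [:wiki_list, dumps.keys]
  end
end
```

Checking for the empty case up front gives a readable message instead of a stack trace when the dump is in the wrong directory.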

## GETTING RUBY
If you are on Windows and have not installed Ruby, do it now. The easiest way is the One-Click Installer at
http://rubyforge.org/projects/rubyinstaller/

## OTHER PROGRAMS
You also need 7zip. If you are on Mac, download p7zip from Sourceforge here:
http://p7zip.sourceforge.net/

If you are on Linux, your package manager should have it, or you might have it installed already (try typing 7z or 7za in a shell window).

Note that the WikiServer expects you to have a program called 7za in the path. If you are on a PC, locate the executable 7z.exe in the 7zip folder in Program Files (or equivalent), and copy it to a directory in the path, or to the directory you are launching WikiServer from.

On Mac and Linux, the ruby executable should already be in the path. If you are on a PC, you can add the ruby directory to your path, or just copy the source files you just got from me (and 7za) into the ruby/bin directory, and launch from there. (A messy, but simple solution if you want to try this out).
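If you are unsure whether 7za can be found, a few lines of Ruby can check the path for you. This is a sketch, not part of the WikiServer; the function name `which` is made up here, and the `PATHEXT` handling is an assumption about how Windows names its executables.

```ruby
# Hypothetical helper: returns the full path of the first matching
# executable on the search path, or nil if there is none.
def which(name, search_path = ENV["PATH"].to_s)
  # On Windows, PATHEXT lists executable extensions such as .EXE.
  extensions = [""] + ENV.fetch("PATHEXT", "").split(";")
  search_path.split(File::PATH_SEPARATOR).each do |dir|
    extensions.each do |ext|
      candidate = File.join(dir, name + ext)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end
```

Running `puts(which("7za") || "7za was not found in the path")` before starting the server shows whether the program is where the WikiServer expects it.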

## GETTING DUMP FILES
Static HTML dumps, needed by the server, can be downloaded from http://static.wikipedia.org/. Currently the latest is from November, but the December dump is in progress. Click on the Download link, choose your language, and download the file ending in .7z. The dump files should be in the same directory as the server files.

## THE WAY FORWARD
In addition to making this an easy-to-use way of viewing Wikipedia offline on Windows, Linux and Mac, I am also planning to make CD/DVD distributions that contain all the necessary programs (7zip, Ruby etc.) and can run directly off the CD. (Ideally, I’d get the guys who sell pirated DVDs on street corners in Indonesia to start hawking legal Wikipedia!) Before I start doing any of that though, I need to get the 7zip issue sorted out.

For someone with another approach to packaging Wikipedia, have a look at the MoulinWiki: http://www.moulinwiki.org/.

## FEEDBACK, BUGS
Any feedback is welcome. This is very new, in progress, has barely been tested on computers other than my own (Mac and PC), and might very well not run on other configurations. If something goes wrong, paste the output into an email and send it to me. Thanks.

## THANKS
Thanks a lot to Liam Doherty and Espen Beer-Prydz for helping me test the first version and giving me feedback on instructions on different platforms. Keep testing it guys!

Stian, Jakarta, 2007 - shaklev@gmail.com

Stian


5 Responses to “Wikipedia Offline Server 0.2”

  1. Walter Vermeir
    February 19th, 2007 @ 18:41

    Hi,

    Have tested it and it works fine. Also very fast. But that can be because I have used the dump of a very small wiki (3,5mb)

    System;
    GNU/Linux Ubuntu 06.10
    Installation done of “ruby” and “p7zip-full” (did not work with “p7zip”) with synaptic

    It did not work at first because I had not put the dump in the same subdirectory.

    That did give the output;
    Initializing…
    ./wiki-lib.rb:29:in `initialize': undefined method `+' for nil:NilClass (NoMethodError)
    from ./wiki-html.rb:49

    If that can maybe put a notice in it like ***Make sure you have put the wiki-dump.7z file in the same directory as wiki-html.rb ***

  2. Houshuang
    February 19th, 2007 @ 22:50

    Hi Walter,

    thanks a lot for testing it and getting back to me.

    Your comment is noted - I definitely need to upgrade the
    documentation a bit. I started out with this syntax

    ruby wiki-html.rb /wikipedia-no-html.7z no

    and as far as I know, this will still work (although I am thinking
    about cleaning this up). I then started working on cache, automatic
    pre-extraction of the most important things, etc, and interlanguage
    links, and it made so much more sense to assume that they were in the
    base directory with a specific file name format. But I will specify
    this more clearly.

    You are right - it is snappy because the file is small :) With a 60 MB
    file it’s quite different - because 7zip was never optimized for this.
    We’re working on it though :)

    Let me know if there are any other places I should publicize this, I
    am still worried that people this might be useful for might not know
    about it. (Of course, once I get 7zip to work better, and make better
    installation routines, I might try to make more of a splash - this was
    more to get developer/power-user input anyway)

    Thanks a lot
    Stian

  3. Serpicozaure
    March 1st, 2007 @ 12:38

    Hi Houshuang

    Have tested with wikipedia-id-html.7z and wikipedia-fr-html.7z , works with both.

    System :

    Xubuntu 6.10
    i had “p7zip-full” already
    didn’t know !!!ANYTHING!!! about ruby

    I had the same problem as Walter , dump and ( wiki-html.rb + wiki-lib.rb ) were not in the same directory.

    And at the beginning i didn’t know really what to do about ruby, i read README.txt and went in Synaptic and didn’t know what to install , then i started with :

    ruby1.9 which downloaded and installed the following :

    libruby1.9 (1.9.0+20060609-1)
    ruby1.9 (1.9.0+20060609-1)

    but still didn’t know what to do with this , tried “man ruby” and “ruby -h” in a terminal
    reply “command not found” or something like that !!!!

    i found this page :

    http://www.rubyist.net/~slagell/ruby/getstarted.html

    and with this command

    ruby -v

    i discovered that i don’t have any ruby installed !!!

    ( later i discovered also how to launch wiki-html.rb with this , in a terminal
    ” ruby /full_path/wiki-html.rb ” )

    finally i decided to run

    ” sudo apt-get install ruby ” in a terminal

    i can’t remember exactly what was installed ( though i suppose it was libruby1.8 and ruby1.8 and may be ruby )

    and then it worked !!!!

    About “where to publicize this” , i found the link here http://meta.wikimedia.org/wiki/Static_version_tools

    But it’s true that between meta and MediaWiki web pages it’s quite messy to find up-to-date documentation about those topics

    did u try any MediaWiki mailing lists ?? ( dunno which exactly )

    Anyway your tool seems quite simple to install finally then Bravo !!!!!

    Serpico

  4. Houshuang
    March 3rd, 2007 @ 2:59

    Hi Serpico,

    thanks a lot for testing it out and reporting on your findings! I am sorry that it was so difficult to get it installed. For the next version, or possibly the next version + 1, I am planning to try to wrap the script with Ruby and 7zip in one package, so that everything you’ll need will be included. I’ve also made the script simpler, and I will simplify the instructions.

    I hope you find it useful, and feel free to spread it to your friends. Keep an eye on this space for an updated and quite improved version within a few days.

    Thanks
    Stian

  5. Random Stuff that Matters » Blog Archive » Screencast of Wikipedia offline (zip-doc)
    April 11th, 2008 @ 10:33

    [...] I have mentioned my Wikipedia Offline project before (here and here), not to mention its previous, very different, incarnation. The project is 95% functional, but is still waiting for assistance from someone who is better at Ruby or Python than me. Today I wrote up a number of bullet points, and posted a brief screencast, to share with people that might help out - I thought I’d post it here as well. The screencast (1 minute) is here, and the source is available. I still think it would be too bad to just let all this code go to waste, when it’s so close to being finished. Note: I write this with full respect for the other people working on, and having already produced versions of Wikipedia offline. [...]
