2008-03-31

Why Wikipedia is harder to write than Linux

Interesting article at Wikipedia Review.

The article ends by asking:
It’s basically a failure of leadership. There are plenty of models of successful open source projects. Debian and Ubuntu, for example, are exemplary social contract communities with good project leadership.

For reasons unbeknownst to me, Wikipedia eschewed that proven organizational model in favor of a cultish enterprise with way too much anonymity and way too little organizational vision, and negligible attention to an ethical value system.
I suspect I have a vague understanding of why that is. I'll try to get a thorough overview of the issues together, but for the time being, here's the executive summary:
  • Jimbo, having seen only small wikis in action by the beginning of the 21st century, and being a staunch follower of the cult of Objectivism, irrationally believed in the concept of objective consensus.
  • Some people around him bought into it, for reasons ranging from the idea's apparent neatness to personal charisma.
  • History happened.

2008-03-27

WiFi routers: silent, blinking death?

A nice cartoon about the dangers of tumor-shooting, baby-eating WiFi routers.

2008-03-21

Sibuyeh runtime support

Another runtime library. Sibuyeh itself is a system for constructing parsers.

Civilised file creation in Java

Here's my Outscribe library for (somewhat) reasonable text file creation in Java. Again, no documentation yet.


The current Java-based prototype of Fabricator uses this Outscribe library.

Caveat: currently, some Java, XML and TeX processing facilities are inside the Outscribe library. It is very likely they'll be moved out in the future.

Some Java utilities

Java being an evil language, it inevitably needs quite a bit of padding and upholstery before becoming usable. I've been collecting such pieces for some time now, but here's the part I'm ready to publish:


No documentation at this time. Some documentation will be added when I get around to extending Fabricator. Since it's nowhere near 'release-quality' either, there's no point in fancy versioning.

Working around stupid picture restrictions

Google has managed to be evil again. :-P

Blogger does not support file uploads. Alone, this would not be a problem -- in the context of a blog engine, it could be a mere oversight, or 'unneeded complexity'. Since Blogger supports image uploads, one would presume that concatenating a zip file onto a random image would do the trick. (It needs to be a zip file, because -- unlike most other compression formats -- zips are end-headered, allowing any zip tool to recognise a zip stream at the end of another file. This is useful, for example, in creating and processing self-extracting zips.)
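
For concreteness, here's a minimal Perl sketch of that naive trick; 'cover.png', 'payload.zip' and 'combined.png' are placeholder names of my own, not anything Blogger-specific.

#! /usr/bin/perl -w
# Sketch of the naive approach: append a zip to a cover image.
# The file names are placeholders for illustration only.
use strict;

open my $out, '>:raw', 'combined.png' or die "combined.png: $!";
for my $part ('cover.png', 'payload.zip') {
    open my $in, '<:raw', $part or die "$part: $!";
    local $/;                     # slurp each file whole
    print {$out} scalar <$in>;
    close $in;
}
close $out or die "combined.png: $!";

Any zip tool should still find the end-headered archive inside combined.png.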

And one would be wrong. It turns out Google actively reprocesses uploaded images, removing any such tails.

Anyway, the next step of the workaround is not quite as simple, but still reasonably obvious: encode the file data into an image.
Here's a pair of complementary Perl scripts -- using the Perl GD bindings -- that perform such encoding and decoding. First, an image containing both scripts inside a zip:


And here's the Perl code to perform decoding:

#! /usr/bin/perl -w

use strict;
use GD;

# Slurp the PNG from standard input or the named files.
my $png = '';
undef $/;
$png .= $_ while <>;

# Walk the image row by row; each pixel's palette index is one byte
# of the encoded stream.
my $image = GD::Image->newFromPngData($png);
my ($width, $height) = $image->getBounds;
my $data = '';
for my $offset (0 .. $width * $height - 1) {
    my $byte = $image->getPixel($offset % $width,
                                int($offset / $width));
    die "Too many colours" if $byte >= 256;
    vec($data, $offset, 8) = $byte;
}

# The 8-byte header: the magic 'FiGP' followed by the payload length.
my ($magic, $length) = unpack 'A4N', $data;
die "Unknown magic" unless $magic eq 'FiGP';
die "Too little data (image incomplete?)"
    unless $length <= length($data) - 8;
binmode STDOUT;
print substr $data, 8, $length;

Technical details


Blogger accepts images in three common formats: JPEG, GIF, and PNG. JPEG is generally a lossy format, so it's unsuitable for this usage; this leaves GIF and PNG. A little experimentation shows that internally, Blogger converts GIFs into PNGs, which implies that PNGs have a somewhat lesser chance of getting hosed in the process. Thus, we use PNG as the container format.

A PNG file can be viewed as a matrix of pixels, picked from a palette specified in the file. (Actually, more complex structures are possible, but this is good enough for our purpose.) Thus, an obvious choice is to assign one palette colour to each possible byte value, and then to convert each plaintext byte into a pixel. For simplicity, we don't even find the minimum set of byte values actually occurring in the plaintext, but use the constant set 0-255. Since we don't really need the colours, only their indices -- which means that pretty much the only restriction is that the colour needs to be distinct for each byte -- we construct a palette in which the Nth entry has the red-green-blue values (N, N, N). Note that GD's PNG and GIF construction subsystems assign colour indices in the order the colours are allocated, and that Perl's map processes the given list in order.

Next, there's the issue of PNG being a two-dimensional matrix, and a file being a one-dimensional stream. We could resolve this by constructing a PNG that would be one pixel high or wide, but such a PNG would be inconvenient to handle. Thus, we use larger widths -- by default, 256 pixels --, and map the plaintext to the image by rows. However, this raises an issue of padding, at the very least when the plaintext's original length is a prime, or does not have convenient factors. This could be resolved in several ways; for example, we could use a special, 257th, palette entry to denote unused pixels. However, the way I chose is to add a special 8-pixel (or 8-byte) header to the image, to keep the useful size of the data, and to allow 'format identification' via a magic prefix. This format's magic prefix is 'FiGP', for 'file in grayscale picture'.

Finally, if we used a fixed width and calculated the height based on actual need, small files would produce wide and short images that, again, would be hard to handle. A workaround would be using a minimum height, but small files padded to this minimum height would look ugly, and random-padding to combat ugliness would certainly be overkill. Thus, our solution is to use progressively smaller widths if the height would otherwise be too small, and if that fails, apply the minimum height of 32 pixels.

Note that the decoder doesn't care what the image size is; it just walks the pixels by rows. It also doesn't care what the image's RGB values are, only using their indices. It does check the magic header, however.
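
Since the encoder itself only travels inside the image above, here's a minimal sketch of what it does, reconstructed from the description; the exact width-shrinking heuristic in particular is an assumption on my part.

#! /usr/bin/perl -w
# Hedged sketch of the encoder described above; the width-selection
# heuristic is a guess at the real script's behaviour.
use strict;
use GD;

# Slurp the plaintext from standard input or the named files.
my $data = '';
{
    local $/;
    binmode STDIN;
    $data .= $_ while <>;
}

# Prepend the 8-byte header: the magic 'FiGP' plus the payload length.
my $payload = pack('A4N', 'FiGP', length $data) . $data;

# Default to a width of 256 pixels; shrink it while the image would
# come out shorter than 32 rows, then enforce a minimum height of 32.
my $width = 256;
$width /= 2 while $width > 16 && length($payload) / $width < 32;
my $height = int((length($payload) + $width - 1) / $width);
$height = 32 if $height < 32;

# Build a palette image whose Nth colour is (N, N, N); GD hands the
# indices out in allocation order, so index N really is N.
my $image = GD::Image->new($width, $height);
my @palette = map { $image->colorAllocate($_, $_, $_) } 0 .. 255;

# Map each payload byte to a pixel, row by row; leftover pixels stay
# at colour 0 and are ignored by the decoder thanks to the header.
for my $offset (0 .. length($payload) - 1) {
    $image->setPixel($offset % $width, int($offset / $width),
                     $palette[vec($payload, $offset, 8)]);
}

binmode STDOUT;
print $image->png;

Feeding a file through this sketch and then through the decoder above should reproduce it byte for byte.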

2008-03-17

Traces of famine in Egypt

The BBC reports that Egypt is increasing production of state-subsidised bread, apparently because the rather poor population is increasingly unable to purchase bread at rising unsubsidised prices.

NewsMax versus Wikipedia

Danny Wool writes about an interesting incident back in April 2006:


1. This week, the Foundation received a legal threat, termed by Brad
"the most serious legal threat we have received so far." The basis of
this threat was the statement by a very serious Florida-based group
with a very serious New York attorney that the Wikimedia Foundation
alone is legally liable for the actions of its admins, as they are
working on behalf of the Foundation.
[...]
4. At Brad's request, the two articles in question were stubbed and
protected. Brad wanted this to be done specifically by me, as an
employee of the Foundation, and not as an admin, which would feed into
the argument of those threatening us. The nature of the threat was,
and is, to be kept highly confidential. While the threat may have been
resolved in this instance, it could be used again, and we do NOT as
yet have a satisfactory answer to it.


In the comments for the post, he mentions this post as a relevant explanation. From his contribution list of the time, it's rather obvious the articles in question are about NewsMax, a right-wing faux news website, and Christopher Ruddy, its founder. Apparently, the last pre-protection versions are, respectively, NewsMax Media #48758172 and Christopher Ruddy #48365799.

If I understand the events properly, it seems that back in 2006, NewsMax folks took offence at something in the articles, threatened to sue the Wikimedia Foundation unless they were censored, and Wikipedia obeyed.

TeX on MacOSX

http://tug.org/mactex/ offers a TeX system for MacOSX. But beware, the package is very large, about 750 megabytes.

Hyperlinks in pdfLaTeX

The search was, alas, not trivial, so the results are worth writing down.

A package for comfortable hyperlink processing is 'hyperref', documented reasonably well at http://en.wikibooks.org/wiki/LaTeX/Packages/Hyperref.
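
As a quick reminder of the sort of usage that page documents -- the package and command names are hyperref's own, but the option choices here are merely an illustrative guess at sensible defaults:

% Minimal pdfLaTeX document using hyperref; the options shown are
% illustrative, not the only reasonable ones.
\documentclass{article}
\usepackage[colorlinks=true, urlcolor=blue]{hyperref}
\begin{document}
See \href{http://tug.org/mactex/}{the MacTeX page}, or spell the
address out with \url{http://tug.org/mactex/}.
Internal links work too: section~\ref{sec:example} is
\hyperref[sec:example]{right here}.
\section{Example}\label{sec:example}
Running pdflatex on this file produces clickable links in the PDF.
\end{document}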

2008-03-15

Microsoft to automatically classify news into 'liberal' and 'conservative'

From the Washington Post, via Slashdot.

This has interesting implications. For example, as it catches on -- and I'm sure it will -- it will contribute to the 'realities' of people of differing political positions drifting even further apart than they already are (cue the famed PIPA study: Misperceptions, the Media and the Iraq War).

On another note -- and not entirely unrelatedly -- this kind of classification on a crowd-sourced website is one of the few ways people of irreconcilable beliefs can peacefully coëxist on the "same" website. One of the major problems with Wikipedia is that its Founding Crowd did not realise this, instead opting for an unattainable "consensus" concept.

2008-03-07

Article on Pirahã and Dan Everett

The New Yorker ran an interesting article on the Pirahã language almost a year ago.