2008-09-25

Implementing imperation in Prolog

Two blog posts: the first presents the core of a simple Algol-like language implemented in Prolog, and the second attempts to fix errors in it.

2008-05-29

Another step towards human-faced Java: single binary

I've found a nice way to produce classical 'single binary' applications developed in Java without sacrificing their portability.
Java apps are typically packaged as JAR files.  Format-wise, a JAR is actually a ZIP.  The ZIP file format is end-anchored; a ZIP file remains processable even if random bytes are prepended to it.
Therefore, if we package a Java application as a standard standalone JAR file, make sure the Main-Class manifest attribute is properly set up, and prepend these two lines to the file:
#! /bin/sh
exec java -jar "$0" "$@"
we get a file that is, at the same time, a JAR containing all the original content, and a shell script that, when executed, invokes the JVM, instructing it to interpret the JAR as a standard Java app.  It also passes on all the original arguments and propagates the JVM's exit code, as expected.
The result behaves pretty much like a standard binary.  The primary restrictions are that a JRE needs to be installed somewhere findable via PATH, and that suid doesn't work on such pseudo-binaries on most systems.  The primary advantage is that such a binary does not depend on the host architecture or even the OS (as long as it is a Unix, that is).  An interesting secondary advantage is the standard resource handling mechanism inherent in the concept of using JARs for applications.
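For concreteness, here's a small Perl sketch of the packaging step itself; the file names ('app.jar', 'app') are placeholders of my own, not part of any published tool.

#! /usr/bin/perl -w

# Sketch: turn app.jar (with its Main-Class manifest attribute set) into an
# executable pseudo-binary named 'app'.  The file names are placeholders.

use strict;

open my $out, '>', 'app'     or die "app: $!";
open my $in,  '<', 'app.jar' or die "app.jar: $!";
binmode $in;
binmode $out;

# Prepend the two-line shell stub, then copy the JAR verbatim after it.
print $out "#! /bin/sh\n";
print $out "exec java -jar \"\$0\" \"\$@\"\n";
undef $/;
print $out <$in>;

close $in;
close $out or die "app: $!";

# Make the result executable.
chmod 0755, 'app' or die "chmod: $!";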

2008-04-22

Legislative Programming

Don Knuth's Literate Programming has shown us how to match the order in which a program's entrails are documented with the order in which the programmer thinks of them. When the programmer thinks well, this can result in better programs than would emerge from strictly following the language-compelled structure. Unfortunately, the original WEB's paradigm comes from a simpler era, when a program was just a program, and the programmer's primary work was to write the program.

But within the intervening 30 years, the scope of a programmer's work has widened considerably. A programmer doesn't just write The Program anymore; programmers develop systems by applying the art of programming. This means they also:
  1. design inter-program interfaces
  2. write add-ons to existing programs
  3. even write throwaway code for the express purpose of generating "real" code
  4. construct white box test cases
All of that work is part of the programmer's work, may introduce complicated constructs that should be documented, and merits documentation along with The Program itself. Thus, while Knuth's original intention --
The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.
-- is good, it is not complete enough.

Given that the scope of a programming job is considerably wider than the construction of a source file, I believe the scope of a literate programming document should be expanded to a transcript of whatever the programmer needs to do to solve the problem at hand. The main difference from a mere log of the programmer's activities would be that the document wouldn't need to describe false starts or bugs in detail, although it could caution against the more devious ones. Because such a document could easily be thought of in the style of An Act of the College of Programmers to Develop a System to Gleefully Greet the Globe, I'm jokingly calling this Legislative Programming for now, but a better name will probably be found eventually.

A legislative programming system should be capable of both tangling and weaving, as are Knuth's original system and almost all of its known descendants. However, it should go somewhat further:
  1. It should allow specifying how external commands -- such as the C translator -- are to be invoked. Integration with make or an analogous tool might be desirable. This would define a process of executing the job; the results of such an execution would be all the files the programmer sought to produce -- including the executable, possibly within a container such as JAR or WAR for Java projects.
  2. It should allow specifying several text files in various languages. Many of WEB's descendants, such as noweb, already do this.
  3. While it's essential that the programming job be weavable into a single, all-encompassing architectural document, the system should also support the creation of daughter documents, which should, of course, appropriately appear in the architectural document. Such documents should be used to document specific interfaces of the software system in question -- such as the command line, any configuration parameters or extensibility.
  4. It should provide a convenient approach for building on earlier projects. WEB's change mechanism solves a special case of this problem, offering a way to adapt the underlying system to specific platforms. When a Program Is Just a Program, but there are many systems, this is appropriate. However, by 2008, everybody and his grandma runs some sort of POSIX/Java/ANSI C capable system, which provides for fairly common shared platforms, so adaptation is no longer the primary way of code reuse. Nowadays, the primary ways of code reuse are software extension and aggregation, and neither is easy (or even appropriate) to do via WEB's change mechanism.
  5. It should provide a convenient approach for building on later projects.  For example, consider Forth, a language commonly used for software extensibility.  Let's suppose I've built a framework providing the common Forth services.  A later project exploiting this framework would necessarily want to add new variables, words or perhaps a new interface to virtual files.  To that end, the framework project should be structured to expose a clean interface for such new definitions, and it's the responsibility of the legislative programming system to facilitate this.  The best way I currently know to attain this involves treating every 'programming job' as an object of a Self-like programming language, and defining composition and prototype-based "inheritance" on these objects.
In Knuth's system, there is a considerable emphasis on prettyprinting. While it's a useful feature, I do not consider it essential, as its influence on the way the programmer works is not very significant.  Furthermore, supporting a wide range of languages makes proper prettyprinting expensive to implement.

I used to consider it important to reveal the structure of the literate programming 'source' file through a DOM-like API to client code, but experience has now convinced me this, too, is a luxury of minor practical utility.  Besides, most benefits of such a system can be achieved by a good throwaway script support.

2008-04-17

How to fix Wikipedia, part 1: edit wars

(This post is a part of the series How to fix Wikipedia.)

What's the problem?

From the horse's mouth:
Edit warring occurs when individual editors or groups of editors repeatedly revert content edits to a page or subject area. Such hostile behavior is prohibited, and considered a breach of Wikiquette. Since it is an attempt to win a content dispute through brute force, edit warring undermines the consensus-building process that underlies the ideal wiki collaborative spirit.

Analysis

The necessary ingredients for a sustained edit war are:
  • There are two or more persons or factions with strong yet conflicting beliefs about a particular topic.
  • There's only one article per point of conflict to contain the belief.
  • The persons or factions involved persist.
  • Wikipedia's policies deliberately lack a terminal resolution mechanism for such conflicts.
Removing any of these four factors would make edit wars unsustainable. For example, removing The Other Factions through blocks and bans would guarantee a particular side a victory, and end the particular edit war. (This is a popular pastime for many Wikipedians, and a fundamental reason for most of the Wikipedia politicking.) Of course, this solution has the important downside of reducing the pool of Wikipedians.

Interesting previously used solutions

As I pointed out above, many people on Wikipedia like to use the solution of getting The Other Faction blocked or banned.

A more "orderly" approach sometimes used on Wikipedia is page protection. A well-known quip is that protection results in the Wrong Version of the page remaining unchanged and unchangeable for extended periods of time.

Some people have forked Wikipedia in its entirety, trying to retain the mission, impose their additional guidelines or rules, and replace the content. Consider Conservapedia and Veropedia, for example.

Citizendium is another fork that involves thorough redesign of the underlying sociopolitical system.

My solution

I believe the best way to permanently end edit wars, with the least actual downsides, is to discard the concept of article uniquity, and replace it with a fork-at-will scheme. This would mean:
  • For every article there is, every user can create a new version he believes is improved over the previous version.
  • Such a new version would not replace the previous version but appear parallel to it.
  • Every reader would be able to see all the outstanding branch versions.
An obvious problem with implementing only this bare minimum is that there are a lot of articles, and many of them have a lot of versions -- and this can easily lead to information overload. This can be tackled in two ways:
  • Machine-assisted comparison: the engine should provide a clear and readable overview of how the versions differ.
  • Personal approvals and network trust: users should be able to indicate which version (or versions) they believe is best. Furthermore, the engine should enable automated processing of the social trust network by allowing users to delegate preference to other people whom they trust to make good decisions. This formal preference should give the engine data it could use to prioritise the branches by relevance, according to the particular user's preferences.
This would result in automatic, formal factionalising on all controversial topics, allowing each side to work on its preferred version of any article without disruption from the other sides. Furthermore, because the trust network is formalised and accessible to the engine, the engine might be able to assess the sizes of various factions, the similarities and differences between various people, and even match people with similar understandings of What Should Be in the Wiki. Like Amazon's "Other people who bought these books also bought ..." feature, the wiki engine could offer users an "Other people who approved these versions also trusted these people ..." feature.
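To make the ranking idea concrete, here's a toy Perl sketch of how an engine might score the versions of one article for one particular reader; the data shapes, the weights and the single level of delegation are simplifications of my own, not a worked-out design.

#! /usr/bin/perl -w

# Toy sketch: rank the versions of one article for one reader, combining the
# reader's own approvals with approvals delegated through trusted users.

use strict;

# Whom the reader trusts, and the weight each trusted user's approval carries.
my %trust = (alice => 1.0, bob => 0.5);

# Which versions each user has approved.
my %approvals = (
    me    => ['v3'],
    alice => ['v3', 'v5'],
    bob   => ['v4'],
);

# The reader's own approvals count with full weight.
my %weight = (me => 1.0, %trust);

# Score each version by summing the weights of everyone who approved it.
my %score;
for my $user (keys %weight) {
    for my $version (@{ $approvals{$user} || [] }) {
        $score{$version} = ($score{$version} || 0) + $weight{$user};
    }
}

# Present the branches in order of relevance for this particular reader.
for my $version (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%s: %.1f\n", $version, $score{$version};
}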

Solved issues

This approach obsoletes page protection and replaces it with withholding approval of new versions. If such withholding is automatic unless explicit (although possibly indirect) trust has been invoked, it also obsoletes the weird social construct of vandalism patrol, as nobody would approve random vandalism (or if they did, it would be easy to distrust them). Of course, organised trolling groups such as GNAA might be able to pull off more notable things, but again, it would be easy to keep them off your network of trust. In essence, every singular troll would become a faction of one, and organised trolls would just become their own factions.

This approach allows the users the benefits of their information being properly gatekept, but does it without forcing any gatekeeper and his biases on any user. Everybody who is unsatisfied with the existing articles could become a gatekeeper of their own, or nominate a gatekeeper for themselves.

One persistent kind of conflict in Wikipedia is the continuous battle between the so-called 'deletionists' and 'inclusionists'. One camp wants to enforce a high bar of entry for Wikipedia's articles; the other, in turn, wants to enforce a very low bar of entry. If we recognise "this article shouldn't appear at all" as one possible version, allowing people to "prefer not having an article", we should be able to make both inclusionists and deletionists if not happy then at least content.

When Wikipedia started, it imported the bulk of the 1911 edition of Encyclopædia Britannica. This has had a number of interesting consequences, and some problems. Similarly, when Citizendium started, it pondered for quite a while on whether to start from a blank slate or import the then-current state of Wikipedia. A wiki engine conducive to branching would solve this dilemma by offering the possibility of including any earlier encyclopedic work as one fork. Where, say, the 1911 Encyclopædia Britannica is still adequate, people could check it over and approve its version; where it isn't -- such as nuclear physics --, new articles could be written, and knowledgeable people could approve these over the old encyclopædia.

Downsides

Fork-at-will will lead to compartmentalisation among people, not consensus. At face value, it's true that such a proposal will enable a high level of compartmentalisation, or many parallel Walled Gardens. However, the ability to fork is not the cause of that.

We already live in a world of many, sometimes conflicting, beliefs. There have already been forks of Wikipedia. It appears unreasonable to expect, say, a staunch Muslim and a devoted Catholic to find a consensus on Jesus' status. At best, they may be able to agree to disagree, but this is not always the case -- especially on Wikipedia. Forking-at-will will facilitate such agreements to disagree, by allowing people of different beliefs to have their own -- necessarily different -- wiki in the same space.

Furthermore, when it is possible for different factions to work out a consensus -- for example, by developing a joint version of an article that would discuss all involved sides' approaches --, I fully believe they'd do it even with fork-at-will. (For example, it's not inconceivable that a Catholic and a Presbyterian will find a consensus on Jesus' status, even if they differ widely on the relevance of Mary's sainthood.) Thus, I wouldn't say that fork-at-will leads to new compartmentalisation; instead, it will provide a new layer of flexibility.

Conspiracy theorists, pseudoscientists, and other fringers will be able to push their message. To an extent, this is true. However, to an extent, this is also an inevitable aspect of crowd-sourcing. As consolation, the non-conspiracy theorists will always be able to disapprove of conspiracy theories, and will not have to suffer from them. For pseudoscience, I anticipate a science team arising to keep an eye on science articles, approve the good ones, disapprove the bad ones, and allow other people to trust them. Should -- hypothetically -- the team somehow be overtaken by, say, homœopaths or other woos, the science-minded people's trust in them would wane, and be carried over to a new and better science team.

Unsolved issues

How best to represent trust? I believe this is the issue that will require the most further thought (and careful experimentation) in this whole solution. In the beginning, I would give each user the buttons "I consider this version good" and "I consider this version bad" as well as "I expect this user's new edits will be good/bad" and "I trust this user to have good judgement on article goodness". The first further developments would allow users to form groups, to trust groups, and to limit trust to particular topics ("I trust this user to have good judgement on article goodness for astronomy-related articles"). Further developments should depend on how well this works in practice.

What to offer to random passersby? One of the strong points of Wikipedia is that it makes a handy one-stop link for bloggers, placing its articles very high among most search engines' results. This strength depends, to a considerable degree, on article uniquity, and discarding this uniquity may thus handicap the solution. However, considering that we don't have to discard the uniquity of article topics but only of the best versions, I suspect this problem can be overcome by a combination of two approaches. First, as I mentioned before, the software should be able to easily point out the differences between major versions. (I admit that 'major' is somewhat vague here, but 'approved by the largest number of users' should make a good first approximation until we figure out a better approach.) Second, once we have groups of users and group approvals, it'll be easy for the link author to specify which group's version he wants to link to. Thus, WingNutDaily could always link to the Conservapedia-like version of the article on, say, evolution.

Social issues

If the basic mission of Wikipedia is to be followed, initial groups on natural sciences, arts, and history should be set up. Of course, the spirit of non-exclusivism will allow other groups of similar scope to also be formed. For example, there will almost certainly be a general pseudoscience and ufology group.

Coöperation between various factions will still be useful, but since non-exclusivism will avoid forcing it on people, the social system may have to provide stronger incentives towards coöperation.

If the system becomes popular, there will probably be quite a number of groups of different structure.

Depending on the way groups are formed, some may become susceptible to hostile takeovers. Experience will hopefully teach us how to deal with this danger.

How to fix Wikipedia, part 0

Wikipedia being broken beyond repair, it can't be fixed, per se. However, a thorough understanding of how exactly it is broken may aid people in developing a better replacement.

I have studied a number of incidents, and I think I now understand a large part, although not yet all, of the problems that make Wikipedia suck. For some, I can recommend remedies; for some -- not yet. Although a lot of the problems are social or epistemological (or misological, as the case may be), and not technical, my focus is on technical solutions. Some of them should be able to fix some of the social problems, but there are still issues worth considering.

I'm planning to cover roughly these topics:
  • Edit wars
  • Class conflicts
  • Misgovernance
  • Cabalism
  • Scope of crowd-sourcing
  • Lack of gnomes
  • Fringe faiths
  • The fallacy of consensus
  • Systemic misology
  • Consistency
However, I'm not yet done with all the writing, and the structure may yet change. This post should become an index, though.

2008-03-31

Why Wikipedia is harder to write than Linux

Interesting article at Wikipedia Review.

The article ends in asking:
It’s basically a failure of leadership. There are plenty of models of successful open source projects. Debian and Ubuntu, for example, are exemplary social contract communities with good project leadership.

For reasons unbeknownst to me, Wikipedia eschewed that proven organizational model in favor of a cultish enterprise with way too much anonymity and way too little organizational vision, and negligible attention to an ethical value system.
I suspect I have a vague understanding of why that is. I'll try to get a thorough overview of the issues together, but for the time being, here's the executive summary:
  • Jimbo, having seen only small wikis in action by the beginning of the 21st century, and being a staunch follower of the cult of Objectivism, irrationally believed in the concept of objective consensus.
  • Some people around him bought it, for reasons ranging from the idea's apparent neatness to personal charisma.
  • History happened.

2008-03-27

WiFi routers: silent, blinking death?

A nice cartoon about the dangers of tumor-shooting, baby-eating WiFi routers.

2008-03-21

Sibuyeh runtime support

Another runtime library. Sibuyeh itself is a system for constructing parsers.

Civilised file creation in Java

Here's my Outscribe library for (somewhat) reasonable text file creation in Java. Again, no documentation yet.


The current Java-based prototype of Fabricator uses this Outscribe library.

Caveat: currently, some Java, XML and TeX processing facilities are inside the Outscribe library. It is very likely they'll be moved out in the future.

Some Java utilities

Java being an evil language, it inevitably needs quite a bit of padding and upholstery before becoming usable. I've been collecting such for some time now, but here's the part I'm ready to publish:


No documentation at this time. Some documentation will be added when I get around to extending Fabricator. Since it's nowhere near 'release-quality', there's no point in fancy versioning, either.

Working around stupid picture restrictions

Google has managed to be evil again. :-P

Blogger does not support file uploads. Alone, this would not be a problem -- in the context of a blog engine, it could be a mere oversight, or 'unneeded complexity'. Since Blogger supports image uploads, one would presume that concatenating a zip file onto a random image would do the trick. (It needs to be a zip file, because -- unlike most other compression formats -- zips are end-headered, allowing any zip tool to recognise a zip stream at the end of another file. It's useful, for example, in creating and processing self-extracting zips.)

And one would be wrong. It turns out Google actively reprocesses uploaded images, removing any such tails.

Anyway, the next step of a workaround is not quite as simple, but still, reasonably obvious: encode the file data into an image.
Here's a pair of complementary Perl scripts -- using the Perl GD bindings -- that perform such an encoding and decoding. First, an image containing both inside a zip:


And here's the Perl code to perform decoding:

#! /usr/bin/perl -w

use strict;
use GD;

# Slurp the whole PNG from standard input (or from the files given as arguments).
my $png = '';
undef $/;
$png .= $_ while <>;

my $image = GD::Image->newFromPngData($png);
my ($width, $height) = $image->getBounds;

# Walk the image by rows, treating each pixel's palette index as one data byte.
my $data = '';
for my $offset (0 .. $width * $height - 1) {
    my $byte = $image->getPixel($offset % $width,
                                int($offset / $width));
    die "Too many colours" if $byte >= 256;
    vec($data, $offset, 8) = $byte;
}

# The first 8 bytes are the header: a 4-byte magic and a 4-byte payload length.
my ($magic, $length) = unpack 'A4N', $data;
die "Unknown magic" unless $magic eq 'FiGP';
die "Too little data (image incomplete?)"
    unless $length <= length($data) - 8;

# Emit the payload, skipping the header and any padding pixels.
binmode STDOUT;
print substr $data, 8, $length;

Technical details


Blogger accepts images in three common formats: JPEG, GIF, and PNG. JPEG is generally a lossy format, so it's unsuitable for this usage; this leaves GIF and PNG. A little experimentation shows that internally, Blogger converts GIFs into PNGs, which implies that PNGs have a somewhat lesser chance of getting hosed in the process. Thus, we use PNG as the container format.

A PNG file can be viewed as a matrix of pixels, picked from a palette specified in the file. (Actually, more complex structures are possible, but this is good enough for our purpose.) Thus, an obvious choice is to assign to each possible byte value one colour in the palette, and then to convert each plaintext byte into a pixel. For simplicity, we're not even finding the minimum set of all possible bytes in the plaintext, but using the constant set of 0-255. Since we don't really need the colours, only their indices -- which means that pretty much the only restriction is that the colour needs to be distinct for each byte --, we construct a palette in which the Nth entry has the red-green-blue values of (N, N, N). Note that GD's PNG and GIF construction subsystems assign colour indices in the order they're allocated, and that Perl's map works in the order of the given list.

Next, there's the issue of PNG being a two-dimensional matrix, and a file being a one-dimensional stream. We could resolve this by constructing a PNG that would be one pixel high or wide, but such a PNG would be inconvenient to handle. Thus, we use larger widths -- by default, 256 pixels --, and map the plaintext to the image by rows. However, this raises an issue of padding, at the very least when the plaintext's original length is a prime, or does not have convenient factors. This could be resolved in several ways; for example, we could use a special, 257th, palette entry to denote unused pixels. However, the way I chose is to add a special 8-pixel (or 8-byte) header to the image, to keep the useful size of the data, and to allow 'format identification' via a magic prefix. This format's magic prefix is 'FiGP', for 'file in grayscale picture'.

Finally, if we used a fixed width and calculated the height based on actual need, small files would produce wide and short images that, again, would be hard to handle. A workaround would be using a minimum height, but small files padded to this minimum height would look ugly, and random-padding to combat ugliness would certainly be overkill. Thus, our solution is to use progressively smaller widths if the height would otherwise be too small, and if that fails, apply the minimum height of 32 pixels.
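Since the encoder itself was only published inside the image above, here's a minimal sketch of it reconstructed from this description; the particular sequence of candidate widths and the overall structure are my assumptions, not necessarily what the original script did.

#! /usr/bin/perl -w

# Sketch of the encoder described above.  The exact sequence of candidate
# widths is an assumption; the header format and palette match the text.

use strict;
use GD;

# Slurp the plaintext from standard input.
undef $/;
binmode STDIN;
my $plaintext = <STDIN>;
$plaintext = '' unless defined $plaintext;

# Prepend the 8-byte header: the magic 'FiGP' and the payload's length.
my $data = pack('A4N', 'FiGP', length $plaintext) . $plaintext;

# Pick a width: try progressively smaller widths so that small files don't
# become wide, very short images; if even the smallest candidate gives too
# few rows, pad the height up to the minimum of 32 pixels.
my $pixels = length $data;
my $width = 256;
for my $candidate (256, 128, 64, 32, 16) {
    $width = $candidate;
    last if $pixels / $candidate >= 32;
}
my $height = int(($pixels + $width - 1) / $width);
$height = 32 if $height < 32;

# Build a palette image whose Nth colour is (N, N, N); GD hands out palette
# indices in allocation order, so index N stands for byte value N.
my $image = GD::Image->new($width, $height);
my @palette = map { $image->colorAllocate($_, $_, $_) } 0 .. 255;

# Map the data onto pixels by rows; leftover pixels stay at palette index 0.
for my $offset (0 .. length($data) - 1) {
    $image->setPixel($offset % $width, int($offset / $width),
                     $palette[vec($data, $offset, 8)]);
}

binmode STDOUT;
print $image->png;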

Note that the decoder doesn't care what the image size is; it just walks it by rows. It also doesn't care what the image's RGB values are, only using their indices. It checks the magic header, however.

2008-03-17

Traces of famine in Egypt

BBC reports that Egypt is increasing production of state-subsidised bread, apparently due to the rather poor population being more and more unable to purchase bread at rising unsubsidised prices.

NewsMax versus Wikipedia

Danny Wool writes about an interesting incident back in April 2006:


1. This week, the Foundation received a legal threat, termed by Brad
"the most serious legal threat we have received so far." The basis of
this threat was the statement by a very serious Florida-based group
with a very serious New York attorney that the Wikimedia Foundation
alone is legally liable for the actions of its admins, as they are
working on behalf of the Foundation.
[...]
4. At Brad's request, the two articles in question were stubbed and
protected. Brad wanted this to be done specifically by me, as an
employee of the Foundation, and not as an admin, which would feed into
the argument of those threatening us. The nature of the threat was,
and is, to be kept highly confidential. While the threat may have been
resolved in this instance, it could be used again, and we do NOT as
yet have a satisfactory answer to it.


In the comments for the post, he mentions this post as a relevant explanation. From his contribution list of the time, it's rather obvious that the articles in question are about NewsMax, a right-wing faux news website, and Christopher Ruddy, its founder. Apparently, the last pre-protection versions are, respectively, NewsMax Media #48758172 and Christopher Ruddy #48365799.

If I understand the events properly, it seems that back in 2006, NewsMax folks took offence at something in the articles, threatened to sue the Wikimedia Foundation if they were not censored, and Wikipedia obeyed.

TeX on MacOSX

http://tug.org/mactex/ offers a TeX system for MacOSX. But beware, the package is very large, about 750 megabytes.

Hyperlinks in pdfLaTeX

The search wasn't, alas, trivial, so the results are worth writing down.

A package for comfortable hyperlink processing is 'hyperref', documented reasonably well at http://en.wikibooks.org/wiki/LaTeX/Packages/Hyperref.

2008-03-15

Microsoft to automatically classify news into 'liberal' and 'conservative'

from Washington Post, via Slashdot

This has interesting implications. For example, as it catches on -- and I'm sure it will --, it will contribute to the 'realities' of people of differing political positions drifting even further apart than they already are (cue the famed PIPA study: Misperceptions, the Media and the Iraq War).

On another note -- and not entirely unrelatedly -- this kind of classification in a crowd-sourced website is one of the few ways people of irreconcilable beliefs can peacefully coëxist on the "same" website. One of the major problems with Wikipedia is that its Founding Crowd did not realise this, instead opting for an unattainable "consensus" concept.

2008-03-07

Article on Pirahã and Dan Everett

The New Yorker ran an interesting article on the Pirahã language almost a year ago.