Archive for the ‘Computer Science’ Category

Accidental Pilish: Unintentionally Constrained Writing in English Literature

Posted March 26th, 2010 on Bespoke

Background:

This post is a little late for Pi Day, but it’s never a bad time for discourse related to everyone’s favourite mathematical constant. Twas on Pi Day of this year that I somehow came across this site, which describes the Constrained Writing task of Pilish, in which the length of each word in letters corresponds to the digits of pi:

The first word in this sentence has 3 letters, the next word 1 letter, the next word 4 letters, and so on, following the first fifteen digits of the number π.  A longer example is this poem with ABAB rhyme scheme from Joseph Shipley’s 1960 book Playing With Words:

But a time I spent wandering in gloomy night;

Yon tower, tinkling chimewise, loftily opportune.

Out, up, and together came sudden to Sunday rite,

The one solemnly off to correct plenilune.

Michael Keith, the author of the above website, has created several works in Pilish, including a full-length book covering the first 10,000 digits of pi!

Trying to write under such constraints can feel extremely awkward, but this made me wonder: How often would strings of words adhering to the constraints of Standard Pilish occur unintentionally? Afterall, with the amount of text out there – the sheer rate at which words are being put together by people all over the world every second of every day – it is to be expected that these things should occur with some frequency p > 0. Such is the Law of Large Numbers.

In order to determine this, I would need a large data set. Luckily, such things are readily available. I settled upon the Project Gutenberg ebook catalog – specifically the union of the July 2006 DVD (17,000 books) and the March 2007 Science Fiction Bookshelf CD (most of PG’s Sci-Fi titles). Altogether, this gave me almost 9GB of text (although I later discovered this contained many duplicates, it’s still a hell of alot of words!)

Next I hacked together a small python script which would find, for each file, the longest string of Standard Pilish. Code for this can be checkedout from my SVN repository: http://svn.nfitz.net/pilish

Results:

Somewhat disappointingly, the longest of any Pilish string was 8 digits of pi. The vast majority of books had a longest Pilish string of around 3-5 words. See the histogram below (note the logarithmic scale in the y-axis).

Five books achieved this 8-digit benchmark, listed below, with the section of Pilish text bolded:

Dismounting and throwing the reins over his horse’s head he came to her smiling, sombrero in hand. “Buenas dias, Senorita. Please may I have a drink?”

“Certainly, Mr. Holmes ; help yourself.” She pointed to the olla hanging in the shade of the ramada.

I was weary of the humdrum life of idling on shore or aimless sailing up and down the channel. The admiral’s was a peaceful mission, and no fighting was expected, but I felt a great curiosity to behold new scenes.

And I have a great Objection to firing with powder only amongst People who know not the difference, for by this they would learn to despise fire Arms and think their own Arms superior, and if ever such an Opinion prevailed they would certainly attack you, the Event of which might prove as unfavourable to you as them.

One was part of the empire, the other was enclosed in Poland, and they were separated by Polish territory. They did not help each other, and each was a source of danger for the other. They could only hope to exist by becoming stronger. That has been, for two centuries and a half, a fixed tradition at Berlin with the rulers and the people. They could not help being aggressive, and they worshipped the authority that could make them successful aggressors.

With the most ambitious of the longer poems–”The Four Monarchies”– and one from which her readers of that day probably derived the most satisfaction, we need not feel compelled to linger. To them its charm lay in its usefulness. There were on sinful fancies; no trifling waste of words, but a good, straightforward narrative of things it was well to know, and Tyler’s comment upon it will be echoed by every one who turns the appallingly matter-of-fact pages…

That last one is the only of the five to have one word of double-digit length, thus covering two digits of pi (‘straightforward = 15 letters = ’15′).

Future Work:

I would like to do a similar analysis of an even larger dataset of more modern language. One possibility is a full archive of Wikipedia. I wonder what is the longest string of unintentional Pilish ever produced?

Another interesting question is how the maximum length of Pilish sections in a document scales with the length of the document, and how well this can be modelled with a simple statistical model such as a Markov Chain.


Human Computation: Harnessing the Power of Procrastination

Posted February 26th, 2009 on Terry

In order to comment on a Terry post, you have to fill out a form which looks like this:

recaptcha-example

If you’ve spent any time at all on the interwebs, you’re probably fairly familiar with this kind of thing. They’re called CAPTCHAs, which stands for “Completely Automated Public Turing-test to tell Computers and Humans Apart”, a rather awkward acronym which nonetheless admirably describes their function of screening the automated scripts which might otherwise be hawking their manhood-enhancing wares on our poor, unsuspecting readers. The basic idea behind CAPTCHAs is to use a test which is easy for humans, but impossible for current AI systems - such as reading highly distorted text.

But this specific CAPTCHA used by Terry, and many other sites around the web, is of a very special variety. It is called ReCAPTCHA, and is the work of a group at CMU lead by the brilliant Luis Von Ahn, whose goal is to harness the work people do when filling out CAPTCHAs into a useful purpose. Believe it or not, every day an estimated 150,000 man-hours are spent world-wide filling out these infernal boxes! Wouldn’t it be great if that time could be spent doing something useful? That’s the idea behind ReCAPTCHA…

What ReCAPTCHA does is to combine bot-filtering with another useful project - the digitization of hard-copy texts such as old books. Modern OCR is highly accurate (>99%), but there are still cases where an OCR is unable to ready a given word accurately - usually the result of some damage or distortion to the text itself. In these cases, the given word is converted into an image for ReCAPTCHA and fed to a human, who can succeed where the computer failed. So every time you fill out a ReCAPTCHA, you are helping to digitize and preserve old books!

This is an incredibly clever example of a new field of development called Human Computation (also called crowdsourcing). The idea is to out-source certain elements of computation to humans, who can perform these tasks better than the computer. The challenge comes in creating the incentive for a human to participate. One technique, as used in ReCAPTCHA, is to harness work which is done by humans anyways, such as filling out CAPTCHAs. Another used by Von Ahn’s group is to turn the work into a game - making the incentive fun! To this end they created gwap (Games With A Purpose)- a site devoted to games which accomplish useful work, from image-tagging (useful for improving image web-search and making the web more accessible to the visually-impaired) to text-summary. By their estimates, if their image-tagging game ESP was played the same amount as popular flash games such as Bejewelled, all images on the internet would be completely tagged in a matter of months. The power of procrastination, properly managed, is truly a wonder to behold!

But the power of the Human Computation paradigm extends beyond those application in which it is explicitly designed. Many examples of internet social networking sites can be seen as a form of Human Computation. For example, sites such as Digg or StumbleUpon act as a powerful filter for vast sea of content available online - only the best content bubbles to the top (in theory anyway…). Furthermore, the large data collection of Audioscrobbler and Last.fm acts as a form of music-similarity algorithm, simply by clustering artists based on the people who listen to them. Dave previously wrote about the power of Google Trends to predict flu epidemics. There is exciting potential here.

Human Computation is an incredibly powerful idea which will continue to develop more interesting and useful applications as the techniques are developed further. If anyone can think of ways we could harness the power of procrastination to solve the problems we discuss on Terry, we could really be in business!

Was it good for you, too?

Posted February 13th, 2009 on Bespoke

(follow up to this)

1234567890

OH-EM-GEE: An Epoch to Remember

Posted February 10th, 2009 on Bespoke

Drop what you’re doing and pay attention: The Interwebs have just informed me of something spectacular. Just over 70 hours from this moment, UNIX time will read 1234567890. Watch the countdown here. As far as I can figure this is the last “cool” time we will see before the “Unix Millenium Bug” in 2038.

400 Word Essay 3: Computer Models of Cognitive Processes

Posted January 27th, 2009 on Bespoke

I seriously considered not posting this one. Two things went wrong: I didn’t get my choice of topic in time, meaning I was assigned the negative side of a topic I would usually argue affirmatively on. Secondly, I was super busy this weekend and didn’t leave myself enough time to do a proper job. But I thought in order to maintain the intellectual honesty of this series I should post the less stellar examples along with the ones I am more proud of. It was an interesting exercise the try and argue a position I am opposed to. It’s something everyone should try at least once; I think if you don’t find it difficult you should question how secure your positions really are. So as a last disclaimer, I’m not sure how effective the following arguments are. You decide!

Topic: Human cognitive processes can be investigated by creating computational models. (CON)

Computer models, though important to the study of cognition itself, are of limited use in studying specifically human cognitive processes.

In order to be useful in the study of specifically human cognitive processes, computer models would have to function in exactly the same way – same structure, same speed and, to a certain extent, the same material. Since this is not the case with a computer simulation, we should not expect to garner much serious insight by using these techniques. The differences between a computer model and the real functioning of a human brain are currently too large. Even the most highly parallelized computers nowadays have nowhere near the parallelism necessary to properly emulate the functioning of a human brain. They also run at different speeds, which makes the way they process information vastly different. These problems mean the knowledge gained from computer models based on human will be suspect at best.

Attempting to investigate cognitive processes by building computer models creates a serious boot-strapping problem. The models created will only be as good as the knowledge we have, and so will not be very helpful in gaining new knowledge. We cannot build a good model without good research to base it on – research which must come from primary investigation of humans themselves. At best, computer models can help us evaluate the validity of theories garnered from primary research, but they will not be helpful in adding new knowledge to our understanding.

Looking into the future, as our computer models become more and more accurate, and begin to take on aspects of conciousness, the ethical dilemmas involved with experimenting on such a computer model will approach those involved with experimenting on humans. Afterall, if our computer models can achieve conciousness we will be forced to really seriously consider granting them the rights which go along with such capabilities. We will have to grapple with the fact that a concious system cannot just be tampered with as we would a modern program. It is, afterall, a mind. Therefore computer models will be add nothing of value to our research, and we will be better served investigating real humans instead.

400 Word Essay 1: Public Libraries in the Digital Age

Posted January 19th, 2009 on Bespoke

This term I am taking a really interesting course – “COGS 303 – Research Methods in Cognitive Systems” – which is intended as a guide to doing successful research in Cognitive Science. One of our regular assignments will be to write opinion pieces with a strong 400 word limit – a good exercise in clarity and brevity. We chose our topic from a list and must defend it within the word limit. I’ll post my essays to the interwebs so that they can be evaluated in the harshest of battlefields. Here’s the first:


Topic: The advent of the digital age makes public libraries obsolete. (Affirmative)

Current trends in technological and cultural development make it unlikely that public libraries will survive in their traditional format.

Firstly, the book itself is becoming a thing of the past. Although ebook usage has not become widespread as quickly as many anticipated with the advent of the computer, this slow adoption is beginning to accelerate with the recent development of specialized ebook readers which use electronic-paper technology, such as the high-profile Amazon Kindle, which has sold tens of thousands of units. Just as consumers are moving away from hard-copy formats in music and videos, towards electronic files, the same will happen with books once ereader technology reaches the “killer app” level achieved by the iPod for music. With the decline of the physical book will come the necessary decline of the physical library.

Secondly, the internet is creating a culture where information and files are shared freely, negating the need for public institutions to hoard and distribute books. This has already been observed in music and videos – despite their best efforts, recording companies cannot stop the inevitable free sharing of data. The same process is under way with books – Project Gutenberg makes it possible to find almost any popular public domain classic free on-line, while Google Books is doing the same with more obscure selections. Already there is a large collection of commercial books which have been scanned into digital formats and are available for download (a short search found both textbooks for this course).

The internet presents a better way to achieve the goals of libraries than physical libraries themselves – namely free and open access to information and books. Providing free access to the internet would be a more effective way of making media available than building and supporting large buildings filled with unread books. Once this fact becomes apparent to governments, it will become difficult to justify the larger relative cost of running a traditional library. Relative environmental impact is another point in favour or switching to a digital format.

Furthermore, the internet has demonstrated its effectiveness for bringing people together in a social network to share preferences within a given domain. Last.fm is a popular music sharing and discovery resource. These types of sites are popular amongst the current generation, and are a likely candidate to replace the community fostered by traditional libraries.

Altogether, trends indicate that traditional libraries will become obsolete.

References

http://www.techcrunch.com/2008/05/14/amazon-may-sell-750-million-in-kindles-by-2010-thats-a-lot-of-kindles/

http://www.salon.com/env/ask_pablo/2008/09/08/printers/

Towards a Killer Robot Army: Part 2

Posted November 1st, 2008 on Bespoke

My diabolical plans move on apace; a new generation of infernal contraptions has sprung from the twisted machinations of my technologickal laboratory. No longer fully a slave to it’s programming, the newest addition to my mechanical throng can truly be said to be an automaton.

This most recent task in my COGS lab was once again to traverse a maze, but this time without pre-programming the route. Rather, using just two light sensors, our task was to create a vehicle capable of finding it’s own way – officially called Agent-Environment Interaction. This was a much more rewarding task than the previous Internal Representation lab…

We opted for a faster design than we’d used previously, reasoning that what the vehicle lacked in accuracy it could correct on its own. It is well known that any connected maze can be solved by simply following on wall, so this was what we tried first, but we found it to be too slow, and so went for a hybrid strategy. The robot would drive mostly straight but with a slight list to the left, which would bring it into contact with the left wall fairly regularly. From there, whenever it hit a wall the robot would reverse and correct it’s course. This worked fairly well with the exception of a oscillatory behaviour which would cause it to get stuck in corners. This was corrected with a subroutine which measured the number of times the robot was turning each direction, and would turn it farther if it determined it was stuck.

In the end our strategy proved successful, and we once again reigned victorious. This will no doubt bring us great fame. As my friend Tyler quipped, “I hear robots avoiding black tape is a booming industry. Obviously he is unaware of the recent discovery of Zebracus Alpha, a small planet in the outer solar system whose geographical features are entirely delineated by black lines. We’re hoping to win the NASA grant to develop the lander.

Artist's conception of Zebracus Alpha

Artists conception of Zebracus Alpha

Thanks again to my group members, Cam and Andrew.

Towards a Killer Robot Army: Part 1

Posted October 11th, 2008 on Bespoke

Cognitive Systems has thus far been quite good; I love the Computer Science classes, and the philosophical issues we discuss are intriguing. But it wasn’t until this year that we really started down the path which most interests me: the construction of the remorseless mechanical slave-soldiers which will one day form the cruel, unfeeling fist of my tyrannical oligarchy. It should be stated that we have started at the very beginning of that path; the robots we are currently building in COGS 300 are several generations removed from the final (diabolical) product. But it is a start none-the-less.

The labs are really great. We’re in groups of three and we have two or three weeks to complete the robot, after which we compete with the other groups. Those rankings determines the order in which groups pick debate topics, so it gets quite competitive.

For the first lab, we had to build a bot which would find the light-source an approach it…. Fairly elementary stuff; in the end most of us built some form of modified Braitenberg machine. Ours compared the values between it’s two light sensors to determine the direction to turn. Most of the challenges came from the Lego mindstorms hardware itself: one of our motors ran at different rate from the other, which caused the robot to list slightly to the left. In the end this was the difference, and we were slightly behind the second place team. I cried bitter tears to an unfeeling sky.

At the end we strapped my cellphone to the robot to get a “terminator view”. It promptly crashed into a chair and fell over.

(click video to play)

For the second lab we were to navigate a maze taped out on some bristol board. The frustrating aspect was that we were only allowed to use internal representation: the bots could have no external sensors. So it became a frustrating affair of trying to perfect the commands. Of course, random variations meant that we could never get it perfect. We minimized these with our design; a high gear-ratio and slow speed meant we were better able to track the distance it was turning with the one rotational sensor we were allowed.

Our first run was perfect; we didn’t touch the line once. Of course, I wasn’t filming for that. This is our second run, which wasn’t nearly as impressive. We won the challenge on the merits of our first attempt.

Thanks to my group members, Cam and Andrew.

From Paper-Weight to Media Server

Posted September 12th, 2008 on Bespoke

This is something I’ve always wanted to try, and now have the opportunity! My parents had an old relic of the Pentium 3 epoch, which I was able to appropriate in the name of Science. The Dell Dimension 4100 really is an old battle axe, and I’ve given the old bird a new lease on life by turning it into a Media Server. My plan is to use it for a webserver eventually as well.

Our house has not one but two high-speed cable connections, which we divide between the upstairs and downstairs people. The various routers and modems are displayed in a pleasing techno-shrine to the electro-gods in our living room:

The problem is that the shrine is downstairs, while our convenient nook-come-server-room is upstairs. Of course, being a dinosaur, the server doesn’t have a wireless card. I considered messing around with adding one, but decided it was easier to run a cable up the stairs, so after a trip to NCIX.com, to pick up a 500gb HD, 100ft CAT cable and network switch, the route was established. Of course, being a house not entirely designed around networking, some elegant solutions were in order:

The nook really is perfect for the job at hand, almost as if it was built for the purpose. The server looks quite at home, especially with careful cable management and hardware organization:

So I now have ‘er up and running, and it feels delicious. I am using NoMachine NX , a great piece of software which was used in the lab in Edinburgh. It makes remote desktops simple. A great feature is that sessions can remain open when you close the window, meaning you can have programs continuing to run when you log off. Running Ubuntu and having a window open displaying a session with another Ubuntu box creates delicious recursive fantasies. I’m wondering how many of these things I could next before the whole network crashed and/or the pentium 3 set on fire…

This greate guide explained how to stream videos over the LAN with VLC, so I can now enjoy my entire media collection without taking up HD space on my laptop! The best benefit is that I can leave it torrenting constantly without noise in my room!

Next step will be to get the web-serving up and running!

Storing and Retrieving java Bitset in MySQL database

Posted July 25th, 2008 on Bespoke

I had one of those frustrating days where you spend hours and hours searching around for what should be a simple coding solution, to no avail. Finally I was able to patch together enough disparate knowledge to achieve my goal: namely, storing and retrieving a java BitSet on a MySQL database. Below is the solution, which I hope might help any other unfortunate souls looking for this answer.

I’m using the java.sql package along with jdbcConnector. Updates are done using the update functions of ResultSet. I will store the BitSets as Blobs (hehe – Blobs!), but in order to do this they have to be converted first to byte arrays, and then the byte arrays must be converted to Blobs. Once you have a Blob it can be stored on the table with resultSet.updateBlob() or some other method. Conversely, to retrieve the BitSet you must first convert the Blob to a byte array and then to a BitSet.

The methods toByteArray and fromByteArray are purloined from here.

BitSet to Blob (For writing to database):


private static Blob bitsetToBlob(BitSet myBitSet) {
    byte[] byteArray = toByteArray(myBitSet);
    Blob blob = con.createBlob(); //con is your database connection created with DriverManager.getConnection();
    blob.setBytes(1, MACCSarr);

    return blob;
}

private static byte[] toByteArray(BitSet bits) {
    byte[] bytes = new byte[bits.length()/8+1];
    for (int i=0; i<bits.length(); i++) {
        if (bits.get(i)) {
            bytes[bytes.length-i/8-1] |= 1<<(i%8);
        }
    }
    return bytes;
}

Now that you have a Blob you can update the table with resultSet.updateBlob() or some other method.
Blob to BitSet (for reading from database):


private static BitSet blobToBitSet(Blob myBlob) {
    byte[] bytes = blob.getBytes(1, (int)blob.length());
    BitSet bitSet = fromByteArray(bytes);

    return bitSet;
}

private static BitSet fromByteArray(byte[] bytes) {
    BitSet bits = new BitSet();
    for (int i=0; i<bytes.length*8; i++) {
        if ((bytes[bytes.length-i/8-1]&(1<<(i%8))) > 0) {
            bits.set(i);
        }
    }
    return bits;
}

Hope that helps. If I can save just one person from the frustration I just went through, then I have done a good thing.