Wolfram|Alpha

Posted in Software, Web on June 8th, 2009 by Doug – 2 Comments

I’ve spent a fair amount of time using Wolfram|Alpha recently. Here are my impressions.

The weeks leading up to the launch of Wolfram|Alpha have created incredible amounts of hype surrounding the project. The creator, Stephen Wolfram, has not helped to lessen this hype. Even the project’s description sets a rather lofty goal

Wolfram|Alpha’s long-term goal is to make all systematic knowledge immediately computable and accessible to everyone.

The comparions to Google were immediate and plentiful. After all, Google describes their goal as making all the world’s information searchable and catalogued. However, Alpha is not Google. Here the term computable can nearly be thought of as making all of the data formatted in such a way that the computer can read, evaluate, and manipulate the data. A great example of this is to do something like produce a graph for a data set over time (such as population). Google, on the other hand, is best used as a catalogue that ultimately points you to data. It does very little of it’s own analysis other than what is required to build their index.

Alpha is a calculator, only a calculator that can understand natural language, access the relevant data and present it in a form that is easily understood by humans. As such, it’s an interesting tool. Alpha’s ability to display data on a webpage is particularly impressive. Take a look at what you get when you query Alpha about the International Space Station.

Alpha displaying the current location of the ISS.

Alpha displaying the current location of the ISS.

The results are impressive when Alpha knows how to handle your input. In the cases where it doesn’t (and there are many) leave much to be desired. For instance, searching for general terms does not yield much beyond basic information, and sometimes that information is only related to the core concept. Consider the next image where I searched for Computer Science. The result is no information other than some queries that Alpha can display results for.

cs

Alpha's results for the term "Computer Science."

If you are interested in results that are analytic or mathematical in nature, Alpha might be a really good resource. For encyclopedia like results, Wikipedia still rules the day. Of course, Alpha is not positioning itself to compete with Google or Wikipedia, but I think the comparisons are fair along the lines that all the mentioned services intend to provide information to people seeking information.

This brings me to the question, “exactly how useful is Wolfram|Alpha?” I though this might be the type of question that Alpha itself could help me answer. I started by searching for information about the United States’ current population.

US Population Results.

US Population Results.

These results are good and quite useful. Now, I wanted to get some information about education in the US.

education_usaNot as much as I’d like to see, and the numbers are a bit out of date. Here I’m a little dissappointed. I’d assume that certainly raw numeric data such as this could be computed upon to reach more conclusions. In my next search, I try something a little more advanced: let’s try to calculate the current portion of the population that has obtained a college degree. This should be a simple calculation, given the data. The term for this value is “education attainment,” something that I learned through a Google search that lead to the US Census Bureau’s page.

edu_attainmentNothing. Ok, maybe I need to be more verbose and use terms that I know already lead to data. So, I searched for “education usa, population usa.”

edu_comparisonWhat? The calculation isn’t even complete. Education enrollment has no result, even though I’ve already found those numbers before. Even if all the numbers had appeared, no useful calculation or comparison has been done.

The number that I’m actually searching for is 28%. This represents the portion of the population over the age of 25 that has obtained at least Bachelor’s degree. This information comes from a Census Bureau page that turned up in a Google search. It would have saved me a lot of time if Alpha had produced this number. Given the data that it has access too (US Census Bureau is listed as one of Alpha’s resources), the calculation is fairly straight forward as well.

The reason I’m searching for this number, 28%, is that I believe that the majority of people interested in the types of results Alpha can provide likely have a background in mathematics or science. Most of the impressive results are related to these fields. This is not hard to understand when you realize that “computable” human knowledge is much more likely to come from these fields than say, literature. Compare the results when searching for Poisson distribution to Gulliver’s Travels.

Search: "poisson distribution"

Search: "poisson distribution"

Search: "Gulliver's Travels"

Search: "Gulliver's Travels"

I think you can see the vast difference in the quality of the result returned. Now, back to that 28% number. I’d assume that people with a background in math or science are likely to have a college degree. Of the people with a college degree, only a portion will have studied math and/or science, but let’s just keep using the 28% number. What is 28% of the US population? Alpha can answer that.

college_popThe number: 85.63 million people. I’d conclude that this is a high estimate of the number of people who might ever be interested in the Alpha project. I think the majority of the population might type a query, get a graph, determine this is not what they want and move on.

People who are interested in these types of results will likely want to dig in deeper, as I tried to do above. I think at that point they will become disappointed in Alpha’s inability to understand what information they are trying to elicit. Further more, if they are able to obtain just the data they need, I wonder of what use it is other than as a cure for their own curiosity? It is difficult to impossible to determine where, exactly, Alpha got the source information it needed to reach its conclusions. Even if this source information was available, the user still has no idea exactly what equations, algorithms, or process was used on the data. Without this information, I find it difficult to believe that anyone would be able to use Alpha’s results in any sort of document or report.

I think a fundamental improvement to Alpha would be to return precise source information that ultimate points to the raw data along with a bit of Mathematica (or other) code that performs the calculation. Then, the user could take this information, verify its validity, modify it as necessary, and incorporate it into their own work. Until this is possible, I’m afraid Alpha will only be used for curiosity’s sake.

Despite these criticisms, I find Alpha to be a fascinating project, if only for it’s data presentation and cataloguing engine. The project is in an early stage, so perhaps the flaws I see are simply due to this. Perhaps the desired features are already in the pipeline. I hope that they are, because I’m a fan of what Wolfram is trying to do here. If they can improve the project to the point where these points are no longer an issue, I can see Alpha becoming the world changing product what Wolfram would have you believe it is. Until then however, you might be better off searching somewhere else.

This Stack goes to Infinity

Posted in Programming on May 10th, 2009 by Doug – Be the first to comment

In the official CPython implementation, the Python interpreter mixes the state of Python’s stack with the state of the underlying C code’s stack. Under normal operation, this design choice has little impact, but if one looks more closely there are some major drawbacks.

One such drawback is that Python’s stack depth is now limited by the maximum stack depth supported by the underlying CPython process. This might not seem so bad until you consider that one function call in Python may cause many C functions to be pushed onto the stack.

A solution to this problem is to separate Python’s stack from C’s stack. This is exactly what Stackless Python does. Now, Python’s stack size is only limited by the amount of memory in the machine.

I’ve implemented the Sieve of Eratosthenes example from last time using the tasklets and channels that Stackless provides. I haven’t really written any Stackless code until now, but I feel this is a decent translation.

import stackless

def ints(n, ch):
    while 1:
        ch.send(n)
        n = n + 1

def filter_mul(n, i, o):
    while 1:
        num = i.receive()
        if (num % n):
            o.send(num)

def sieve(i, o):
    while 1:
        head = i.receive()
        o.send(head)
        n = stackless.channel()
        stackless.tasklet(filter_mul)(head, i, n)
        i = n

def printer(i):
    while 1:
        print i.receive()

def gather(i, o, n):
    l = []
    for x in range(int(n)):
        l.append(i.receive())
    o.send(l)

def main():
    intch = stackless.channel()
    intt = stackless.tasklet(ints)(2, intch)
    sieveo = stackless.channel()
    sievet = stackless.tasklet(sieve)(intch, sieveo)
    #printert = stackless.tasklet(printer)(sieveo)
    gathero = stackless.channel()
    gathert = stackless.tasklet(gather)(sieveo, gathero, 1e4)
    print len(gathero.receive())

    stackless.run()

if __name__ == "__main__":
    main()

Running this results in decent performance:

--( 9:22 : $)-- ~/local/stackless/bin/python --version
Python 2.6.2
--( 9:24 : $)-- time ~/local/stackless/bin/python sless.py
10000

real    0m47.369s
user    0m46.123s
sys     0m0.076s

I tried running the same example to find the millionth prime, but after hours of execution time I killed the process. While the memory consumption was reasonable, the execution time suffered from, what I believe to be, much overhead from object creation and switching between tasklets. Still, it’s encouraging that this version didn’t fail due to stack depth problems.

Stackless Python is an interesting experiment, and I wish the features implemented here were part of the official Python release. If you haven’t heard of Stackless before, I encourage you to read about what it offers to Python and why this can be useful beyond the toy example I have here.

To Infinity

Posted in Programming on May 5th, 2009 by Doug – 6 Comments

In the programming language Scheme the concept of infinite lists can be implemented using streams. A stream is really nothing more than a cons cell that contains a value in the cdr field and a pointer to some code to generate the cdr field. While simple in design, the concept is powerful, allowing the programmer to create infinite sequences by simply defining a function to generate the next value in the sequence.

Python supports a similar concept in the form of generator functions. The implementation here is a little more complex, but it can be thought of as a function that returns a result, but maintains its current state for later execution.

Both can be used to write rather elegant code. Take for instance the Sieve of Eratosthenes. This can be implemented in Scheme like so (borrowed from SICP):

(define (integers-starting-from n)
(cons-stream n (integers-starting-from (+ n 1))))

(define (divisible? x y) (= (remainder x y) 0))

(define (sieve stream)
(cons-stream
(stream-car stream)
(sieve (stream-filter
(lambda (x)
(not (divisible? x (stream-car stream))))
(stream-cdr stream)))))

(define primes (sieve (integers-starting-from 2)))

In this code, a new stream is created to filter out the unwanted multiples from the previous stream.

Similarly, it can be expressed in Python using generators:

def integers(n):
while 1:
yield n
n = n + 1

def filter_mul(n, s):
for i in s:
if (i % n):
yield i

def sieve(stream):
while 1:
head = stream.next()
yield head
stream = filter_mul(head, stream)

The expression is elegant and theoretically allows us to represent an infinite list that is evaluated lazily. However, the reality strays from the theory. While this code works just fine to find small primes, it breaks for larger values:

;Loading "sieve.scm"... done

1 ]=> (stream-ref primes 1000)

;Aborting!: out of memory
>>> from sieve import *
>>> primes = sieve(integers(2))
>>> [primes.next() for x in range(1000)]
RuntimeError: maximum recursion depth exceeded

In both cases, we’ve exhausted the language implementation’s ability to hold the data structures necessary to represent the lazy lists. The result is elegant code that fail rather quickly. Now, we could always buy ourselves a little more execution time by increasing the heap for Scheme and the recursion depth for Python, but that will only take us so far.

So, that leaves me wondering: is there a similarly elegant way to write code of this sort that won’t expire so quickly? To clarify, I’d like to keep using streams and generators, but find some way to limit the overhead incurred from the creation of so many filters (1 for each prime!) without resorting to a different algorithm.

Update:
I want to address the alternative primes calculation listed in SICP:

(define primes
(cons-stream
2
(stream-filter prime? (integers-starting-from 3))))

(define (prime? n)
(define (iter ps)
(cond ((> (square (stream-car ps)) n) true)
((divisible? n (stream-car ps)) false)
(else (iter (stream-cdr ps)))))
(iter primes))

This procedure is still recursive (primes uses prime? which uses primes), but has much less memory overhead — I managed to generate primes much larger than the 1000th without hitting memory limits. However, this solution strikes me as not quite the same in spirit as the first. Here, a given integer is tested by comparing it to the current list of generated primes while the first definition is building up a function to generate the next prime without having to explicitly scan the list of the already generated primes.