April 28, 2008

Weekend update to Newsbrew

A couple of weeks ago, I blogged about a Google App Engine news aggregator I wrote called Newsbrew (original post). A few days later, Google announced the availability of their RESTful feed API. Since the trickiest part of Newsbrew was the aggregation code, I decided to refactor the application to use Google's new service.

Newsbrew still stores feed and post data, but rather than retrieving and parsing the feeds myself (a surprisingly complex and error-prone process), Newsbrew now uses Google's REST feed API. All data, regardless of the underlying syndication format, is returned in a nice normalized JSON format.
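To give a rough idea of what consuming a normalized JSON feed looks like, here is a minimal Python sketch. It is not Newsbrew's actual code. The payload shape and field names (`responseStatus`, `responseData`, `entries`, `publishedDate`) and the `extract_posts` helper are assumptions made up for illustration, and the sample data is invented, so check the real API documentation before relying on any of it.

```python
import json

# Hypothetical sample payload. The field names are assumptions for this
# sketch and are not taken from the post.
SAMPLE_RESPONSE = """
{
  "responseStatus": 200,
  "responseData": {
    "feed": {
      "title": "Example Blog",
      "link": "http://example.com/",
      "entries": [
        {"title": "First post",
         "link": "http://example.com/1",
         "publishedDate": "Mon, 28 Apr 2008 11:32:00 -0700"},
        {"title": "Second post",
         "link": "http://example.com/2",
         "publishedDate": "Fri, 18 Apr 2008 11:16:00 -0700"}
      ]
    }
  }
}
"""


def extract_posts(raw_json):
    """Turn one normalized JSON feed response into a flat list of post dicts."""
    payload = json.loads(raw_json)
    if payload.get("responseStatus") != 200:
        raise ValueError("feed service reported an error")
    feed = payload["responseData"]["feed"]
    return [
        {
            "feed_title": feed["title"],
            "title": entry["title"],
            "link": entry["link"],
            # Not every entry is guaranteed to carry a date, so use .get().
            "published": entry.get("publishedDate"),
        }
        for entry in feed.get("entries", [])
    ]


posts = extract_posts(SAMPLE_RESPONSE)
for post in posts:
    print(post["feed_title"], "|", post["title"], "|", post["link"])
```

The point is that the same few lines handle every feed, whatever syndication format it was originally published in, because the format differences are resolved before the data arrives.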

With just a few hours of work, I was able to make Newsbrew much more robust, and because I no longer have code for parsing nine different formats of RSS and Atom, Newsbrew is much less error prone, and the code base is simpler.

I also added Ajaxian to the blogroll at the request of Dion Almaer. If you know of any other sites I should add, let me know.

Posted by cantrell at 11:32 AM.

April 18, 2008

Newsbrew is now a Google Application

I wrote a news aggregator a couple of years ago called Newsbrew which I primarily used for my own news-reading needs. I think I took it down when I got tired of paying for the server, and worrying about keeping it up.

After Google launched the Google App Engine, I decided to take a little break from Flex and AIR and rewrite Newsbrew in Python to get a good feel for the GAE experience. You can see the current beta version here.

Overall, I was very pleased with GAE. It took me about five days to write this version of Newsbrew, but that included learning Python, Django, and everything about GAE. The application is fairly comprehensive, consisting of a user interface, aggregation service, and a secure administrator section. Unfortunately, I didn't get to all the features and bug fixes that I wanted, but the app still seems to work reasonably well.

It's still going to be a little while before we're writing real-world production apps on GAE as it still has several rough patches, bugs, and missing functionality. But it's very clear where Google is going with this, and there's no doubt that GAE is a very powerful concept and platform. I'm certainly going to keep my eye on GAE, and use it as much as I can.

Posted by cantrell at 11:16 AM.

November 7, 2006

Firefox 2 Live Titles

I'm sure most Firefox users are familiar with Live Bookmarks by now -- bookmarks that point to RSS feeds, and update themselves as the feed is updated. Firefox 2 has now introduced Live Titles -- page titles that also update themselves. If you're using Firefox 2 and you want to see an example, go to woot.com and bookmark it. Rather than the typical name text field, you'll see a combo box that lets you enter a static title, or choose one of woot's two Live Titles, each of which is a brief description of the item currently being sold. Makes perfect sense for both woot and woot fans.

Live Titles are actually "microsummaries" which are easily implemented using a link tag with a rel value of "microsummary". For more information, check out this page on the Mozilla wiki. You can also find a list of microsummary-enabled sites here. And finally, check out the Firefox 2 Release Notes for more information on what's new in Firefox 2.
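As a small illustration of how a page advertises a microsummary, the Python sketch below scans some HTML for link tags whose rel value includes "microsummary" and collects their href values. The sample page, its URL paths, and the `MicrosummaryFinder` class are all made up for this example. It only shows the discovery step, not how Firefox itself processes microsummaries.

```python
from html.parser import HTMLParser


class MicrosummaryFinder(HTMLParser):
    """Collects the href of every <link rel="microsummary"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated values, so split before testing.
        rel_values = (attr.get("rel") or "").lower().split()
        if "microsummary" in rel_values and attr.get("href"):
            self.hrefs.append(attr["href"])


# Hypothetical page; the paths below are invented for this sketch.
SAMPLE_PAGE = """
<html><head>
<title>Example Store</title>
<link rel="stylesheet" href="/style.css">
<link rel="microsummary" href="/microsummary/todays-item.xml">
</head><body></body></html>
"""

finder = MicrosummaryFinder()
finder.feed(SAMPLE_PAGE)
microsummary_links = finder.hrefs
print(microsummary_links)
```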

Posted by cantrell at 10:28 AM.

January 31, 2006

Programmatically Determining a Site's Language

I was having a conversation with a couple guys the other day about data aggregation, and the topic of language came up. They wanted to know how you can programmatically determine what language a site is written in (language as in spoken and written language, not computer language). Off the top of my head, I guessed one could uncover clues in the site's HTTP headers, character encoding, or by geocoding the site's IP address.

It turns out to be a harder problem to solve than I initially thought. HTTP headers are really no more help than a site's character encoding, which really isn't much help at all since UTF-8 can encode pretty much any language there is. And geocoding an IP address is really nothing more than a hint, for all the regular reasons geocoding IP addresses doesn't always work, and for the additional reason that a server being in a particular country doesn't really tell you anything about the language the sites on the server are written in (I used to live in Japan, but never posted a single thing in Japanese).

I did a little research, and it looks like folks like Google use very complex techniques for determining a site's language like comparing characters and words against known sets of characters and words in a database. This seems like a reasonable approach, but not one that I could implement in a reasonable amount of time (like a couple of hours), so I did what I always do when faced with a very complex problem: I looked for an obvious and simple solution.

What I eventually decided was that the sites out there with the most content, and content which is updated most frequently (and which are therefore the most interesting sites to index), almost always tell you what language they are written in through their RSS or Atom feeds. Of course, a research paper isn't likely to have an RSS feed, but most news sites and just about all blogs certainly do.

I tested the theory by writing a Ruby script that crawls sites and their feeds looking for things like xml:lang attributes and other language-related tags. I ran the finished product against a sample of 50 non-English blogs from MXNA and determined that the technique is about 60% accurate. Not great, but not too bad, either. I also rediscovered a lesson I'd already learned many times over when writing aggregators, which is that you should never trust data you don't control since all but one of the sites that my script got wrong actually lied and claimed to be a language it wasn't (in every case, they claimed to be English rather than the language they were actually written in). How do I explain the fact that about 40% of blogs seem to lie about their language? I'm sure it's an innocent mistake. Most people don't really know much about how RSS and Atom work, and just trust their blogging software to do the right thing. Even if they write the software themselves, they probably don't really know what all the RSS/Atom tags actually mean. RSS is sort of the new HTML: as long as it mostly works, it's good enough for most people.
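The original script is in Ruby and relies on regular expressions. As a rough stand-in, here is a minimal Python sketch of the same idea using an XML parser instead: look for an xml:lang attribute or a language element in a feed and report whatever the feed declares. The `declared_language` function, the lookup order, and the sample feeds are assumptions for illustration, not a port of the real script, and the crawling step is left out entirely.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_language(feed_xml):
    """Return the language a feed claims to be in, or None if it says nothing."""
    root = ET.fromstring(feed_xml)

    # 1. xml:lang on the root element (common in Atom feeds).
    lang = root.get(XML_LANG)
    if lang and lang.strip():
        return lang.strip()

    # 2. Any element whose local name is "language", which covers the RSS
    #    <language> element and namespaced variants such as dc:language.
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "language" and element.text and element.text.strip():
            return element.text.strip()

    # 3. As a last resort, xml:lang on any nested element.
    for element in root.iter():
        lang = element.get(XML_LANG)
        if lang and lang.strip():
            return lang.strip()

    return None


# Hypothetical sample feeds, invented for this sketch.
RSS_SAMPLE = """<rss version="2.0"><channel>
<title>Example</title><language>ja</language>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="de-DE">
<title>Beispiel</title>
</feed>"""

NO_LANGUAGE_SAMPLE = """<rss version="2.0"><channel>
<title>Silent</title>
</channel></rss>"""

print(declared_language(RSS_SAMPLE))
print(declared_language(ATOM_SAMPLE))
print(declared_language(NO_LANGUAGE_SAMPLE))
```

Note that this only reports what the feed claims. As described above, that claim is often wrong (usually defaulting to English), so the result is best treated as a hint rather than a fact.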

Like I said, the script works by crawling sites and their RSS or Atom feeds. If you're interested in the source, you can grab it here. I also threw together a CGI wrapper for it, so you can test it out online yourself here. If you're new to Ruby and don't feel like decoding all the regular expressions, here is a brief description of how it actually works.

Some lessons learned: