Contextual Logging in Log4j

Just caught an interesting post coming through JavaBlogs regarding Atlassian’s approach to providing additional context when logging errors in Confluence.

It brought something to my attention that I previously knew nothing about: Log4j mapped diagnostic contexts (MDCs), not to be confused with nested diagnostic contexts (NDCs). See the log4j wiki entry on NDCvsMDC.

Evidently I live in a cave (or perhaps have been using commons-logging for too long), but basically an MDC is a thread-local structure exposing a map-like interface that, depending on your logging pattern, can be included in your logged messages. Particularly useful stuff if you’re developing a multi-threaded system (be it Swing, J2EE or some other inherently multi-threaded framework) and want to provide useful debugging information.
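A minimal sketch of the idea (the class, logger and "user" key below are arbitrary names of my own, not anything from the Atlassian post):

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

public class LoginAction {
    private static final Logger log = Logger.getLogger(LoginAction.class);

    public void login(String username) {
        MDC.put("user", username);      // stored in a per-thread map
        try {
            log.info("login attempt");  // a layout pattern containing %X{user} will include the value
        } finally {
            MDC.remove("user");         // clean up so the value doesn't leak onto pooled threads
        }
    }
}

With a PatternLayout conversion pattern along the lines of %d [%t] %-5p %X{user} %m%n, every message logged on that thread picks up the user value automatically.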

UltraLightClient has an entry in their code community providing a logging example using MDC and Log4j.

Nice and simple.

Google Gears: What I was waiting for

A while back I blogged about how nice it would be to have offline support built into Google Reader. At the time, I anticipated some form of integration with Google Desktop.

Fortunately, the folks at Google smiled down on me (and many others) today and provided the functionality, in part to show off their new Gears framework. Basically, developers can use Google Gears to easily build offline support into their web applications. Sure, there are existing ways to do this, but coming from Google, Gears is sure to get a lot of traction.

The Google Gears framework consists of a smallish (700K) browser plugin for IE and Firefox. Client applications (including but not limited to Google Reader) communicate with it using a series of new APIs accessible via JavaScript.

The impact on Google Reader is quite good IMO. Beyond the obvious benefit of being able to read offline, the fact that you’re limited to the 2000 most recent posts provides a reasonably diverse selection of content, making it easier to stay up to date across many feeds instead of just the popular ones.

For those interested, SQLite is used as the backing datastore. There’s also an API for a thread-pool-like structure (or WorkerPool, as Google calls it) allowing you to easily manage long-running tasks.

The extension is available under the non-viral BSD license. Ajaxian.com has a couple of related stories, including an interview with Brad Neuberg of the Dojo framework.

Pretty cool.

The Anatomy of a Screen Scraper: Part 1

This is the first part of a series detailing my experiences writing a basic html screen scraper using a combination of bash, python and structured grep to retrieve, massage and present data. The results can be seen at onesecondshy.com. Code will eventually be published there as well.

First off, motivation. The motivation for this little project was to mine the data from gastips.com and present it more directly (and with fewer advertisements). GasTips.com is a grassroots site allowing individuals to submit gas prices on a community-by-community basis. Data is presented in well-formed tables accompanied by a plethora of Google advertisements.

In an ideal world, we would have APIs that provide data in easily consumable formats. However, as long as there is value attached to data, there’ll be a desire to keep it private. The following is not meant as a step-by-step guide to scraping data and graphing it in python. Instead, my goal is simply to explain one approach to solving what is a common problem.

Step 1: Data Retrieval

Nothing too fancy here. A simple call to wget to download raw html.
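Something like the following does the job (the URL here is just a placeholder, not the actual gastips.com address):

wget -q -O raw.html "http://www.example.com/some/listing/page"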

Part 2 of this series will move beyond this rather simplistic approach and cover data retrieval and aggregation using a web spider.

Step 2: Massaging

Once you’ve got raw html, the next logical step is to reduce it to a format amenable to manipulation… namely csv. Sgrep (or Structured Grep), a tool for searching and indexing html (amongst other things), easily gets the job done. Take the following input.html:

<html><body><table summary="XXX"><tr><td>ABC</td></tr></table></body></html>

Running this sgrep command against it:

sgrep -g html 'stag("TABLE") containing (attribute("SUMMARY") containing "XXX") .. etag("TABLE")' input.html

yields the following result:

<table summary="XXX"><tr><td>ABC</td></tr></table>

You could easily take it further and retrieve only the contents of a particular table column or row. When parsing the gastips.com data, I used sgrep to get the data into a row-delimited form before running a simple python script that converted it to csv (a sketch of that script follows the example output below). The row-delimited form was as follows (where *’s denote a column header):

*Header1*
*Header2*                       
Row1Column1
Row1Column2
Row2Column1
...

The final csv output was:

Header1,Header2,
Row1Column1,Row1Column2,
Row2Column1,Row2Column2,
...
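As for that python script, the code isn’t published yet, so here’s a minimal sketch of one way the row-delimited-to-csv conversion could look (the details are illustrative, not the actual script):

import sys

# read the row-delimited form from stdin; lines wrapped in *'s are column headers
lines = [line.strip() for line in sys.stdin if line.strip()]
headers = [l.strip('*') for l in lines if l.startswith('*')]
values = [l for l in lines if not l.startswith('*')]

# emit csv: header line first, then the values chunked into rows of len(headers) columns
print(','.join(headers) + ',')
for i in range(0, len(values), len(headers)):
    print(','.join(values[i:i + len(headers)]) + ',')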

Step 3: Presentation

Sticking with the python theme established in the previous step, I investigated a few python graphing and charting libraries. I settled on Matplotlib, which is vastly overkill for my needs but a fun challenge nonetheless.

Sixty lines of hacky python later, I had something that could parse a csv file and plot the resulting data.

As always, the devil is in the details. Plotting a simple x,y graph is trivial; it’s slightly more difficult to create something with some semblance of polish.
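To give a flavour of the plotting step, here’s a stripped-down sketch. The column layout, file names and labels below are illustrative assumptions, not the actual 60 lines:

import csv
import matplotlib
matplotlib.use('Agg')              # render straight to a file; no display required
import matplotlib.pyplot as plt

stations, prices = [], []
with open('gasprices.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)         # first row holds the column headers
    for row in reader:
        if len(row) >= 2 and row[1]:
            stations.append(row[0])
            prices.append(float(row[1]))

plt.bar(range(len(prices)), prices)
plt.xticks(range(len(stations)), stations, rotation=45, ha='right')
plt.ylabel(headers[1])
plt.title('Gas prices')
plt.tight_layout()
plt.savefig('gasprices.png')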

Step 4: Tying It All Together

The last step in all this is to create a suitable front-end. I chose a simple for loop in bash, roughly along these lines (the file and script names are placeholders for whatever wraps each step):

for url in $(cat urls.txt); do
    wget -q -O raw.html "$url"            # 1. download raw html
    ./massage.sh raw.html > data.csv      # 2. massage it (html -> csv, sgrep + python)
    ./plot.py data.csv                    # 3. plot it (csv -> html)
done

Making a conscious distinction between crawling (downloading), indexing (massaging) and presenting (plotting) allows increased opportunity for parallel operation. Once you have the raw data, you can execute multiple indexing and plotting operations at once. It’s a more scalable alternative to the common monolithic approach that bundles data retrieval, transformation and presentation into a single all-in-one package.
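As a hypothetical illustration, with the raw pages already on disk, GNU xargs can fan the later stages out across several processes (process_page.sh here stands in for whatever wraps steps 2 and 3 for a single page):

ls raw/*.html | xargs -n 1 -P 4 ./process_page.sh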

That’s it for Part 1. As mentioned earlier, you can see the results at onesecondshy.com. The plots will actually mean something should you live on Vancouver Island. Future plans include writing a web spider and aggregating data over a longer period of time (than the week provided by gastips.com), amongst other things.