Saturday, January 5, 2008

50 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW

Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.

I've had two interesting realizations from the Retrosheet Baseball data vs. Hourly Weather information mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is how the hard parts of this process is actually the stupidest part... there's four steps in doing an interesting visualization of open data. In order of steps as well as decreasing difficulty and decreasing stupidity:

  • Bring the data from behind its bureaucratic barriers
  • Unlock it into a universal format
  • Process and digest the data
  • Actually explore, visualize and share the data

The hardest and least justifiable steps are the first two, a problem we have to fix.[Edit: this is why I'm starting infochimp.org]

Here's a longer description of how I did the baseball games / weather data mashup.

Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.

To suck in the Hourly Weather Data files, you have to click through a 4-screen web form process to prepare a query. Although it sends the final form submission as a POST query, the backend script does accept a GET url (you know, where the data is sent in the URL form.pl?param=val&param2=val&submit=yay instead of in the HTTP request). There's an excellent POST to GET bookmarklet that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a twirl for screenscraping webpages or just trying to understand what's going on behind the curtain.

Now I need to know what queries to generate. First I needed the location of each major league baseball stadium: Brian Foy posted a Google Earth index of Major League Stadiums, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. These simple routines just pull in the XML files and create a data structure (hashes and arrays of hashes) that mirror the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?

Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:

USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456

To break this apart, you just specify an 'unpack' format string. 'A' means an (8-bit) ASCII character; 'x' means a junk character:

A6    xA5   xA29                          xA2xA2xA2xA4  xxA6    xA7     xA6
The result is an array holding each interesting (non-'x') field. The Perl code snippet:
    # Flat file format
    my $fmt    = "A6x    A5x   A29x                          A2xA2xA2xA4xx  A6x  A7x   A6";
    my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
    # Pull in each line
    for my $line () {
        next if length($line) < 79; chomp $line;
        # Unpack flat record
        my @flat = unpack($fmt, $line);
        # Process raw record 
        ...
    }

I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, we have since ~1945 for Hourly and since ~1900 for Daily).

Then I score each station by (Proximity and Amount-of-Date), and select the five best stations for each stadium.

Now, I could of course use Perl to generate the POST request using the HTTP modules, but it was simpler to mindlessly just control click on a dozen links at a time and then answer each form. and spit out an HTML file with a big matrix of URLs for each station, for a subset of years. P (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)

I also use perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit back out as XML.

The python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance) then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and the CSV module is a tad nicer.

The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved from any efficiency concerns) it's easy to see forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure for a full baseball season?" "Effectiveness of pitchers against gametime temperature, stratified by age of pitcher or inning?" "Batting average on fly balls vs. ground balls against % of total cloud cover?". It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and Earnings of Harley-Davidson vs. average household income, unemployment and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, Attendance at Baseball Games, and Sales of Fast Food vs. Consumer Satisfaction Index, national Suicide Rate, and Persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and Consumer Price Index? Amazon.com sales rank, # mentions on Technorati blogs and # of mentions in mainstream media vs. time?

The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert into a universal semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged and freely available information.

Labels: , , , , , , , , , , , , , , , , , ,

Wednesday, December 5, 2007

Moving from Perl to Python with XML and Templating

Mr. XKCD is correct in this. (My friend Dr. Larsson has been saying this all along). As I'm moving from data munging to data working-with, I've been moving from perl to python. Recommended:
  • lxml is a beautiful interface for dealing with XML in Python. You get XPath and validation and namespaces and all that hooha but you don't have to think hard and you don't have to write SAX stream parsers or walk a DOM path. You just say crap like
    from lxml    import etree   
    from urllib2 import urlopen
    # Load file
    uri   = "http://vizsage.com/apps/baseball/results/parkinfo/parkinfo-all.xml"
    parks = etree.ElementTree(file=urlopen(uri))
    # for each park (<park> tag anywhere in document)
    for (idx, park) in enumerate(parks.xpath('//park')): 
      # dump its id, time of service and name (@attr is XPath for 'corresponding attribute')
      print ' -- '.join(
        [ s+': '+','.join(park.xpath('@'+s)) 
          for s in ('parkID', 'beg', 'end', 'games', 'name',) 
        ])
    
    and you get this in return
    parkID: MIL01 -- beg: 1878-05-14 -- end: 1878-09-14 -- games: 25   -- name: Milwaukee Base-Ball Grounds
    parkID: MIL02 -- beg: 1884-09-27 -- end: 1885-09-25 -- games: 14   -- name: Wright Street Grounds
    parkID: MIL03 -- beg: 1891-09-10 -- end: 1891-10-04 -- games: 20   -- name: Borchert Field
    parkID: MIL04 -- beg: 1901-05-03 -- end: 1901-09-12 -- games: 70   -- name: Lloyd Street Grounds
    parkID: MIL05 -- beg: 1953-04-14 -- end: 2000-09-28 -- games: 3484 -- name: County Stadium
    parkID: MIL06 -- beg: 2001-04-06 -- end: NULL       -- games: 486  -- name: Miller Park
  • lxml.objectify is the replacement for perl's XML::Simple we've all been looking for. You just say gimme and it pulls in an XML file as the corresponding do-what-I-mean data structure (identical elements become arrays, tree leaves become atoms, tree structures become maps).
  • Kid Templating is a great solution for XML transmogrifying, and I think I like it much better than XSLT. It looks perfect for your "Anything => XML" purposes, which is the hard part. I suppose XSLT can do the "XML => anything" tasks but those always look like stunts; the whole point of XML is that "Turn XML into whatever" tasks are easy, especially given a simple API like lxml or lxml.objectify.

Labels: , , , , , , , ,

Wednesday, July 25, 2007

Emacs modes for Flex

  • Emacs modes for Flex:
    • XML: nXML-mode for Emacs from James Clark Using Emacs for XML documents
    • Actionscript: actionscript-mode.el for editing actionscript files in emacs.
    • At least right now it seems you want this xml mode and this actionscript-mode.el.
    • Then, add
      (setq auto-mode-alist (append (list
       '("\\.as\\'"   . actionscript-mode)
       '("\\.\\(xml\\|xsl\\|rng\\|xhtml\\|mxml\\)\\'" . nxml-mode)
       ;; add more modes here
       ) auto-mode-alist))
      
      ;;
      ;; ------------------ Magic for XML Mode ----------------
      ;;
      
      (setq nxml-mode-hook
          '(lambda ()
       (setq tab-width        2
             indent-tabs-mode nil)
             (set-variable 'nxml-child-indent     2)
             (set-variable 'nxml-attribute-indent 2)
             ))
      
    • You can use M-x customize-group RET nxml-highlighting-faces RET to fix your colors the way you like 'em.
  • Setting up asdoc to work within Flex Builder:
    • First, install ant support (Ant is an offshoot of apache and is like Makefile only more betterer.)
    • Then set up a build.xml in your docs/ directory to build the documentation set.
    • I had to modify mine a bit: I added <property name="Templates.dir" location="${FlexSDK.dir}/asdoc/templates/"/> <arg line='-templates-path ${Templates.dir}'/>
    • I also linked the flex home to a no-funny-characters dir:
          ln -s "/Applications/Applications/Adobe Flex Builder 2" /work/ProgramStores/Flex
         cd /work/ProgramStores/Flex
         ln -s "Flex SDK 2" FlexSDK
      Then I exported the location for the asdoc file:
          export FLEX_HOME=/work/ProgramStores/Flex/FlexSDK
      or else I got the error message:
          Exception in thread "main" java.lang.NoClassDefFoundError: Flex

Labels: , , , , , , ,

Tuesday, July 24, 2007

Adobe Flex and Custom Namespace / manifest.xml

Create a file "manifest.xml" and add the following:
<?xml version="1.0"?>
<?componentpackage>
<?!--
URI http://vizsage.com/vzg
namespace gg
package com. vizsage
-->
<component id="widget1" class="com.vizsage.controls.widget1"/>
<component id="widget2" class="com. vizsage.controls.widget2"/>
<?/componentpackage>
Notes:
  • The class part should give the full path (with / turned into .) to the corresponding .as files.
  • You don't need one <component/> for each .as file, just one for each component.
  • The comment part, like most comments (and many goggles), does nothing: all you need is a <component/> for each widget.
  • You don't have to follow the tld.domain.clevername.widgetname format, but it's what all the cool kids are doing. Just make sure the dotted path matches your files' path.
  • The dotted path and the namespace URL don't have anything to do with each other.
  • In fact, the namespace URL is completely made up: it doesn't have to exist; the compiler doesn't look for it; hell, adobe's URL doesn't even exist. It's just a tag for uniquely identifying a namespace. All that matters is that the namespace in your compiler flags and your mxml files match up.

If you use Flex Builder, go into the library project's properties, into the "Library Compiler" field -- add the namespace and manifest.xml into the respective fields. If you use the standalone package, you'll have to add an option for the Component Compiler

-include-namespaces="http://vizsage.com/vzg"  -namespace "http://vizsage.com/vzg" manifest.xml

  • You need to include the namespace and define it
  • The -namespace flag takes two arguments (a namespace and a manifest.xml)
  • The URI here has to match the ns:URL in your .mxml file.
Now your .mxml files (which can be anywhere, and not in that project) start off like
<?xml version="1.0" encoding="utf-8"?>
<mx:Application
xmlns:mx="http://www.adobe.com/2006/mxml"
xmlns:gg="http://vizsage.com/vzg"
layout="absolute" width="100%" height="100%"
viewSourceURL="srcview/index.html">
<?!-- ... your mxml file ... -->
  • Make damn sure the xmlns URI matches what you used before. I spent 30 minutes figuring out that http://www.vizsage.com/vzg and http://vizsage.com/vzg weren't the same thing.
  • In Flex Builder 2, you need to get your project's properties, go into "Flex Build Path", then the "Library Path" pane, and "Add SWC" (the one you built with your custom components).
  • For the command-line tools, add a flag
    -library-path+=/abs/olute/path/to/library.swc
    Make sure that's a += there.
  • Either way, applications (as opposed to libraries) don't need any compiler flags or manifest.xml nothing. The library uniquely identifies itself within a namespace, and provides files in the right .com.foo.bar hierarchy. When your .mxml file (asserts a namespace) and (includes the file) everything turns out right.
For more about namespaces see here, with one caveat: I think you're better off using the -load-config+= trick (to just tack on your changes) than hacking stuff into the main flex-config.xml file.

Labels: , , , , , , , ,