Saturday, January 5, 2008

50 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW

Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.

I've had two interesting realizations from the Retrosheet Baseball data vs. Hourly Weather information mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is how the hard parts of this process is actually the stupidest part... there's four steps in doing an interesting visualization of open data. In order of steps as well as decreasing difficulty and decreasing stupidity:

  • Bring the data from behind its bureaucratic barriers
  • Unlock it into a universal format
  • Process and digest the data
  • Actually explore, visualize and share the data

The hardest and least justifiable steps are the first two, a problem we have to fix.[Edit: this is why I'm starting infochimp.org]

Here's a longer description of how I did the baseball games / weather data mashup.

Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.

To suck in the Hourly Weather Data files, you have to click through a 4-screen web form process to prepare a query. Although it sends the final form submission as a POST query, the backend script does accept a GET url (you know, where the data is sent in the URL form.pl?param=val&param2=val&submit=yay instead of in the HTTP request). There's an excellent POST to GET bookmarklet that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a twirl for screenscraping webpages or just trying to understand what's going on behind the curtain.

Now I need to know what queries to generate. First I needed the location of each major league baseball stadium: Brian Foy posted a Google Earth index of Major League Stadiums, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. These simple routines just pull in the XML files and create a data structure (hashes and arrays of hashes) that mirror the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?

Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:

USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456

To break this apart, you just specify an 'unpack' format string. 'A' means an (8-bit) ASCII character; 'x' means a junk character:

A6    xA5   xA29                          xA2xA2xA2xA4  xxA6    xA7     xA6
The result is an array holding each interesting (non-'x') field. The Perl code snippet:
    # Flat file format
    my $fmt    = "A6x    A5x   A29x                          A2xA2xA2xA4xx  A6x  A7x   A6";
    my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
    # Pull in each line
    for my $line () {
        next if length($line) < 79; chomp $line;
        # Unpack flat record
        my @flat = unpack($fmt, $line);
        # Process raw record 
        ...
    }

I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, we have since ~1945 for Hourly and since ~1900 for Daily).

Then I score each station by (Proximity and Amount-of-Date), and select the five best stations for each stadium.

Now, I could of course use Perl to generate the POST request using the HTTP modules, but it was simpler to mindlessly just control click on a dozen links at a time and then answer each form. and spit out an HTML file with a big matrix of URLs for each station, for a subset of years. P (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)

I also use perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit back out as XML.

The python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance) then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and the CSV module is a tad nicer.

The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved from any efficiency concerns) it's easy to see forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure for a full baseball season?" "Effectiveness of pitchers against gametime temperature, stratified by age of pitcher or inning?" "Batting average on fly balls vs. ground balls against % of total cloud cover?". It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and Earnings of Harley-Davidson vs. average household income, unemployment and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, Attendance at Baseball Games, and Sales of Fast Food vs. Consumer Satisfaction Index, national Suicide Rate, and Persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and Consumer Price Index? Amazon.com sales rank, # mentions on Technorati blogs and # of mentions in mainstream media vs. time?

The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert into a universal semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged and freely available information.

Labels: , , , , , , , , , , , , , , , , , ,

Friday, October 26, 2007

Hourly Weather data for each Retrosheet game

I noticed some suspect entries for game conditions in the eventfiles and realized I could not only fix it but add a pretty useful dimension to the retrosheet collection. The National Climate Data Center makes available "Global Hourly Surface Data" -- several dozen physical and observational characterizations of the current weather, taken hourly. This data goes back to the forties and sometimes to the start of the century.

Please enjoy this preliminary dataset giving the hourly weather data for each game in Fenway since 1957: http://vizsage.com/apps/baseball/results/weather/

(open the WeatherData-BOS07.* file of your choice) I don't have all the data in hand yet, but I thought I'd get your thoughts and see if anyone would like to help with some of the drudge work.

I'm excited about doing some fun things with the data, like see knuckleball effectiveness vs. humidity or elderly pitchers vs. temperature. Combined with the MLB gameday pitch trajectory info you could do physics "experiments": show the break distance of all curveballs vs. atmospheric pressure.

Email me back if you're interested or with comments.

-----------------------
DATA FIELDS AVAILABLE
-----------------------


The fields I've spit out are

-- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag from
the cwgame output.
- temp deg C
The temperature of the air in degrees Celsius.
- press_atmos HPa
The atmospheric pressure at the observation point.
- press_sealvl HPa

The air pressure relative to Mean Sea Level (MSL).
- press_altim HPa
The pressure value to which an aircraft altimeter is set so that it
will indicate the altitude relative to mean sea level of an aircraft
on the ground at the location for which the value was determined.
- press_chg_3hr_del HPa
The absolute value of the quantity of change in atmospheric pressure
measured at the beginning and end of a three hour period.
- press_chg_3hr_obs --

The code that denotes the characteristics of an
ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of
three hours.
- wind_dir deg
The angle, measured in a clockwise direction, between true north and
the direction from which the wind is blowing.
- wind_obs --
The code that denotes the character of the WIND-OBSERVATION.
- wind_speed m/s

The rate of horizontal travel of air past a fixed point.
- wind_gust_speed m/s
The rate of speed of a wind gust.
- cloud_cover_low (frac)
The code that represents the fraction of the celestial dome covered
by all low clouds present. If no low clouds are present; the code
denotes the fraction covered by all middle level clouds present.
- vis_dist m
The horizontal distance at which an object can be seen and identified.

- sunshine_time min
The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3 --
The code that denotes a specific type of weather observed manually.
- groundcond --

The code that denotes a type of Ground condition
- precip_hist_contin bool
The code that denotes whether precipitation is continuous (true) or
intermittent (false).
- precip_lq1_depth mm
The depth of LIQUID-PRECIPITATION that is measured at the time of an
observation. Unit:Millimeters
- precip_lq1_period hours
The quantity of time over which the LIQUID-PRECIPITATION was measured.

---------- WHAT I DID ----------

I used Brian Foy's Google Earth index of Major League Stadiums:
http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml and the NCDC ISH-HISTORY file (gives locations for each weather station) ftp://ftp.ncdc.noaa.gov/pub/data/inventories/
to find the closest station with continuous data. (Turns out I could have saved a ton of trouble by just using the nearest airport -- in almost every case it was the best match.)

Then I pulled down data sets from http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505 (If you're interested in replicating any of this I have a script that sends a GET url to help automate the weather data collection.) The last step is to match games with stadiums with locations, and dates and times with hourly observations.

I could be clever and subtle and use the start time and game duration to grab only the hours of gameplay, but instead I just pull in the records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm for night games. I suppose I'll fix it to see if a game overhangs midnight and get the post-12am data for those only.

----------------------- WHAT YOU CAN DO TO HELP -----------------------

Geolocation for the rest of the stadiums

Inspect the data for consistency and correctness

If you have access to a computer at a .edu or .k12.us, or fancy GIS data, help me grab the rest of the weather files.

Email me if you'd like to help.

Labels: , , , , , , , , , , , ,

Retrosheet Eventfile Inconsistencies II

I've found a few more inconsistencies and minor inaccuracies in the retrosheet event files and game logs.

I made a diff (applied using the 'patch' tool) to mechanically recreate these corrections: http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff

I pulled these out by whipping up a few simple scripts (one-liners, mostly) that extracts all unique values for each event file field. For example, the only values for the "info,pitches" field are 'count, 'none' and 'pitches' -- just as promised in the documentation. The "info,temp" field, however, has not only normal temperatures ("78", or "104", or "0" for [unknown]) but also spurious values of '670' and '700' (wrong), '8/7' (ill-formed) and '' (differs with the format documentation).

I'll posting all the dubious entries (event files version 2007 Sep 23) I find at http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html as comments.

==================== Incorrect Data ====================

In 1993MIL.EVA:
info,start,spieb001,"Bim| Spiers",1,9,4
should be
info,start,spieb001,"Bill Spiers",1,9,4

These temperatures need fixing:
1988MON.EVN,info,temp,670
1988MON.EVN,info,temp,700
1964NYA.EVA,info,temp,8/7

I looked at a few suspiciously short games (< 60 minutes):
This should be 1:58, according to the NYT box score:

http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C\
8585F9
1958BOS.EVA,info,timeofgame,58
These two are correct:
1971BAL.EVA,info,timeofgame,48 BAL197107300 -- Game called due to rain
1976BOS.EVA,info,timeofgame,57 BOS197609100 -- Game called due to rain
Another thing to look at would be suspicious game length/number of
outs ratio, but I haven't done this yet.

I also checked a few games with attendance below 1000, but these seem
to be very cold or rescheduled days. I'll taka a peak sometime soon at
"game attendance less than two and a half standard deviations from
that year's average attendance" to see what sticks out. (I also
peeked at 2.5+ above -- those look like bandwagon game)

==================== Badly Formatted ====================

These are probably correct but just ill-formatted:
1959CHN.EVN,info,timeofgame,0158
2001PIT.EVN,info,attendance, 34915
1962BOS.EVA,info,daynight,day,
1966ATL.EVN,info,howscored,"park"
1966HOU.EVN,info,howscored,"park"
1970CHA.EVA:data,er,roung101,4#
1958PIT.EVN:data,er,wills102,1y

In these files, the "howscored" field is spelled "howentered":
1990BOS.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990HOU.EVN,info,howentered,game
1990HOU.EVN,info,howentered,game
1990LAN.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990PIT.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SLN.EVN,info,howentered,game
1990TEX.EVA,info,howentered,game
1990TEX.EVA,info,howentered,game

There are no "info,edittime" records -- is this purposeful?

==================== Inconsistent with Documentation ====================

In the 2003TBA.EVA file, the umpires are given by name and not by ID.

These are supposed to use 0 as the unknown value but in a few places
use a blank.
1990NYA.EVA,info,temp,
1978ATL.EVN,info,attendance,
1978NYA.EVA,info,attendance,
1979SDN.EVN,info,attendance,
2000PIT.EVN:info,windspeed,

There are some "info,ump[...],(None)" fields, and there are some
"info,ump[...]," fields. Does one indicate "unknown" and the other
indicate "none"? Or is this a formatting inconsistency?

These files have a bunch of "info,windspeed,unknown" fields (the dox
say "An unknown windspeed is indicated by -1."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN
1970ATL.EVN 1970HOU.EVN
These files have an "info,temp,unknown" field (the dox say "An unknown
temp is indicated by 0."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN 1970ATL.EVN
1970HOU.EVN 1990NYA.EVA

These lines have trailing spaces, which is harmless but still
shouldn't be there:
1958CHA.EVA:info,save,
1957BOS.EVA:com,"xwas a lot of action. Had this game been played
today, it no doubt"
1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
1958CLE.EVA:com,"$ Strong wind to left; cool"
1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
1958NYA.EVA:com,"$Total attendance: 13323"
1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"

Here are all the well-formed windspeed values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24
25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87
What are the units on these? If this is in MPH, 39 is Gale force
("Difficult to walk against wind. Twigs and small branches blown off
trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64
is ("Trees uprooted, structural damage likely").

Here are games with windspeeds over 40:
id,CHA197408270|windspeed,67
id,MIN198008190|windspeed,87
id,TOR198208030|windspeed,68
id,CHN198307042|windspeed,74
id,TOR198307270|windspeed,87
id,LAN199006050|windspeed,78
id,DET199506160|windspeed,87
id,CLE199609141|windspeed,69
id,COL199606150|windspeed,59
id,DET199704300|windspeed,66
id,TEX200104220|windspeed,40
id,SLN200610010|windspeed,60

The SLN200610010 event file gives a wind speed of 60mph (from baseball-reference and ESPN), but a) that's crazy and b) the weather report from that day doesn't confirm it:

http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req\ _city=NA&req_state=NA&req_statename=NA Which gives 83F, 9mph SSW wind, clear

See also my next message, about getting weather data for each game.

The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other value is the wind speed" but I think it should be "WindSpeed: -1 Unknown other value is the wind speed in miles per hour".

Labels: , , , , , , , , , , ,