Wednesday, January 16, 2008

The power of a good visualization

I just found a program called GrandPerspective that presents your disk usage as an interactive treemap (see pic on right). Helping web nerds save hard drive space isn't finding hidden heart defects or keeping planes in the air, but I was struck by how well this program demonstrates the power of intelligent data exploration tools. Here are the Tufte criteria for information presentation:
Documentary · Comparative · Causal · Explanatory · Quantified · Multivariate · Exploratory · Skeptical
Each box is a file, and each top-level directory takes a continuous rectangular portion of the view. Scanning a 350GB disk with a /lot/ of tiny files (5+ million for just the far top left corner, the MLB gameday dataset) took < 5 minutes. You may highlight any box in a segment and navigate "down" to make that segment fill the screen, and may choose to color files by location, depth, name or extension (exploratory, multivariate).
The giant orange box in the top left was 15GB of pure junk -- apparently a CGI script generating some page I was screenscraping went crazy and sent me 15GB of junk data, the same line repeated almost billions of times. I had /no/ idea it was sitting there. That dataset was supposed to be huge, so I had never drilled into the directory beyond my standard du -sc | sort -n on the containing directory. The picture, however, showed at a glance what a table of numbers dramatically failed to show: that the directory consumed twice as much space as it should. The simple metaphor of diskspace=area and the whole-disk view (explanatory, documentary) highlighted something important I'd never noticed.
The giant cluster in the bottom right corner is a huge (~51GB) collection of video ephemera I only kinda cared about. I planned, someday, to sort them -- but for that effort and 51GB of disk, it was clearly not worth it. By enforcing comparisons, the data display made me reconsider the value vs. resource consumption of that project and make a sounder decision. In all, I freed up almost 100GB and put a few bucks in the author's tip jar. Joe Bob says Check it out. (Similar programs exist for Linux (Baobab) and Windows (WinDirStat) too.)


Monday, January 14, 2008

The 2007 Feltron Annual Report

The 2007 Feltron Annual Report is available now. In a series of elegant infographics, see the ambit of places he walked to in Brooklyn and Manhattan, review how many albums Mr. Feltron bought in the year (12 CDs, 1 LP and 98 download tracks), and how often he visited bars in October (6 times; he made 57 total bar visits in the year, down 39% from 2006). My print copy is on the way. (last year's report). Metadata is the new Eyeballs (which is the old Interaction).


Sunday, January 6, 2008

Owning my Metadata

Dear Lazyweb,

I'd like someone to invent a 'Metadata reclaimer': a program to screenscrape all my amazon ratings, flickr tags, facebook posts, etc.

I try, as far as possible, to only use apps that let me keep ownership of my metadata. As our friend pud has remarked, all successful internet enterprises share the same business model: either

  • People pay to Enter Data into your Database (eBay, Google AdWords, Flickr, Second Life, World of Warcraft, IMDB pro, Craigslist), or less defensible,
  • People Enter Data into Your Database For Free while Other People Pay to Get it Out (rapidshare, iTunes Music Store, Pud's Internal Memos; with youtube, myspace, epinions etc viewers pay with the tenuous currency of their ad brain).

There's nothing wrong with that; all these companies levelled their playing field in some fundamental and important way. (Well, nothing wrong unless you're the loathsome gracenote.com (formerly cddb), who turned an open community-generated resource into a closed database, without even the courtesy of a copy to fork from.)

But it's fair to ask that I be able to export my copy of the data I've added to their business asset, and to do so easily.

Sites that play well with others:
  • my del.icio.us tags and bookmarks
  • my bloglines/google reader feeds
  • my librarything.com everything
  • my last.fm history
  • my iTunes playcounts, tags and ratings: mostly, I think?
  • Firefox bookmarks and history

Sites with an 'I gave up my metadata and all I got was this stupid webpage' policy:

  • facebook posts, friends, photos, everything
  • flickr tags &c
  • amazon recommendations
  • Google calendar mostly no (at least, the last time I tried to sync my address books it was a Giant Pain in the Ass: nothing was durably id'd and recurring events were semantically incorrect. Yes, I'd love to have 96 separate entries for my Grandmother's birthday!)
  • eBay bids, purchases, ratings
  • Blogger: Blogs, yes if you remote host your site. However, you can't even /list/ the blogger comments you've made, let alone export them.
  • I believe Myspace's engineers can't even spell XML
(I could be wrong about any of these except the last one).

I'm picturing something with a plugin architecture -- the main app handles the screenscraping, authentication, form submission, web crawling and file export details; the plugin supplies URL wildcards and regexp's the data back into semantic structure. With XML export, a motivated plugin author or well-itched user could supply a decent XSLT stylesheet to represent that metadata in a useful local fashion (and with helpful links back to the main site). It would be useful to have plugins (trivial) and stylesheets (no more or less so) even for sites like Last.fm and Library Thing that Do The Right Thing by granting transparent access to your metadata.
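To make the division of labor concrete, here is a minimal sketch of that plugin idea. Everything in it is hypothetical: the Plugin class, the example site, its URL and its HTML are invented for illustration, and the fetching, authentication and crawling that the main app would own are left out, so the page arrives as an already-downloaded string.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Plugin:
    """Site-specific knowledge only: which URLs it handles and how to parse them."""
    site: str
    url_glob: str          # URL wildcard the plugin claims
    record_re: re.Pattern  # regexp with named groups, one match per record


def extract(plugin, url, html):
    """Main-app side: apply a plugin's regexp to an already-fetched page."""
    if not fnmatchcase(url, plugin.url_glob):
        return []
    return [m.groupdict() for m in plugin.record_re.finditer(html)]


def to_xml(plugin, records):
    """Serialize the records as XML so a stylesheet can take over from here."""
    root = ET.Element("metadata", site=plugin.site)
    for rec in records:
        ET.SubElement(root, "item", rec)
    return ET.tostring(root, encoding="unicode")


# Hypothetical plugin and page, for illustration only.
ratings = Plugin(
    site="example-shop",
    url_glob="https://shop.example.com/ratings*",
    record_re=re.compile(
        r'<li data-id="(?P<id>\w+)" data-stars="(?P<stars>\d)">(?P<title>[^<]+)</li>'
    ),
)
page = (
    '<ul><li data-id="b001" data-stars="4">Visual Display</li>'
    '<li data-id="b002" data-stars="5">Envisioning</li></ul>'
)
records = extract(ratings, "https://shop.example.com/ratings?page=1", page)
print(to_xml(ratings, records))
```

The plugin carries no code of its own, just a wildcard and a regexp, which is what would keep plugins trivial to write.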

Much of this may exist in some form or another; for example the Aperture/iPhoto plugin will apparently sync your flickr and iPhoto tags, and embed the result into the app database. But going from XML => app is more flexible -- and possibly easier -- than the other way 'round.

I one off'ed this a while back for my Amazon ratings, but I just saw where I'd gone from ~350 to ~650 'things rated' since then. I'm hoping the LazyWeb has solved my problem, since I'm not sure where I put those scripts. (Ironic, considering my previous post.)


Saturday, January 5, 2008

50 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW

Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.

I've had two interesting realizations from the Retrosheet Baseball data vs. Hourly Weather information mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is that the hardest part of this process is actually the stupidest part. There are four steps in doing an interesting visualization of open data, listed here in order of execution, which is also the order of decreasing difficulty and decreasing stupidity:

  • Bring the data from behind its bureaucratic barriers
  • Unlock it into a universal format
  • Process and digest the data
  • Actually explore, visualize and share the data

The hardest and least justifiable steps are the first two, a problem we have to fix. [Edit: this is why I'm starting infochimp.org]

Here's a longer description of how I did the baseball games / weather data mashup.

Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.

To suck in the Hourly Weather Data files, you have to click through a 4-screen web form process to prepare a query. Although it sends the final form submission as a POST query, the backend script does accept a GET url (you know, where the data is sent in the URL form.pl?param=val&param2=val&submit=yay instead of in the HTTP request). There's an excellent POST to GET bookmarklet that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a twirl for screenscraping webpages or just trying to understand what's going on behind the curtain.
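A minimal sketch of the POST-to-GET trick in Python: take the fields the form would have POSTed and encode them into the URL instead. The host name is a placeholder and the parameter names are the made-up ones from the paragraph above, not the real weather form's fields.

```python
from urllib.parse import urlencode


def form_to_get_url(action_url, fields):
    """Build the GET equivalent of a form submission.

    Whether the backend script accepts it is something you only
    find out by trying.
    """
    return action_url + "?" + urlencode(fields)


url = form_to_get_url(
    "http://example.com/form.pl",  # placeholder host
    {"param": "val", "param2": "val", "submit": "yay"},
)
print(url)
```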

Now I need to know what queries to generate. First I needed the location of each major league baseball stadium: Brian Foy posted a Google Earth index of Major League Stadiums, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. These simple routines just pull in the XML files and create a data structure (hashes and arrays of hashes) that mirrors the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?

Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:

USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456

To break this apart, you just specify an 'unpack' format string. 'A' followed by a count means a field of that many (8-bit) ASCII characters; 'x' means skip one junk character:

A6    xA5   xA29                          xA2xA2xA2xA4  xxA6    xA7     xA6
The result is an array holding each interesting (non-'x') field. The Perl code snippet:
    # Flat file format
    my $fmt    = "A6x    A5x   A29x                          A2xA2xA2xA4xx  A6x  A7x   A6";
    my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
    # Pull in each line (from the files named on the command line, or STDIN)
    while (my $line = <>) {
        chomp $line;
        # Skip headers and short lines: a full record is 79 characters
        next if length($line) < 79;
        # Unpack flat record
        my @flat = unpack($fmt, $line);
        # Process raw record
        ...
    }
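
For comparison, here is roughly the same unpacking in Python, using the standard struct module as a stand-in for Perl's unpack. This is a sketch under the column widths shown above, and the sample line is rebuilt from the JAN MAYEN entry. One difference to be aware of: Perl's 'A' strips only trailing spaces, while strip() here removes them on both sides.

```python
import struct

# Same layout as the Perl template: 's' is a fixed-width byte field,
# 'x' skips one byte. Whitespace between the codes is ignored.
FMT = "6s x 5s x 29s x 2s x 2s x 2s x 4s xx 6s x 7s x 6s"
FIELDS = ["id_USAF", "id_WBAN", "name", "region", "country",
          "state", "callsign", "lat", "lng", "elev"]
RECORD_LEN = struct.calcsize(FMT)  # 79 characters


def parse_station(line):
    """Return one station record as a dict, or None for short lines."""
    line = line.rstrip("\r\n")
    if len(line) < RECORD_LEN:
        return None
    raw = struct.unpack(FMT, line[:RECORD_LEN].encode("ascii"))
    return {name: value.decode("ascii").strip()
            for name, value in zip(FIELDS, raw)}


# The sample entry from above, padded back out to its fixed widths.
sample = ("010010 99999 " + "JAN MAYEN".ljust(29)
          + " NO JN    ENJA  +70933 -008667 +00090")
station = parse_station(sample)
print(station["name"], station["lat"], station["lng"])
```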

I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, we have data since ~1945 for Hourly and since ~1900 for Daily).

Then I score each station by (Proximity and Amount-of-Data), and select the five best stations for each stadium.
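
A sketch of what such a scoring step could look like. The actual scoring formula isn't given here, so the one below (years of data divided by one plus the distance in km) is purely an assumption, as are the station records and the approximate stadium coordinates.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def best_stations(stadium, stations, n=5):
    """Rank stations by an assumed score: more data is better, farther is worse."""
    def score(st):
        dist = haversine_km(stadium["lat"], stadium["lon"], st["lat"], st["lon"])
        return st["years_of_data"] / (1.0 + dist)
    return sorted(stations, key=score, reverse=True)[:n]


# Illustrative values only: rough stadium coordinates, invented stations.
stadium = {"name": "Fenway Park", "lat": 42.3467, "lon": -71.0972}
stations = [
    {"id": "A", "lat": 42.36, "lon": -71.01, "years_of_data": 60},
    {"id": "B", "lat": 41.00, "lon": -73.00, "years_of_data": 100},
    {"id": "C", "lat": 42.35, "lon": -71.10, "years_of_data": 2},
]
print([st["id"] for st in best_stations(stadium, stations, n=2)])
```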

Now, I could of course use Perl to generate the POST request using the HTTP modules, but it was simpler to spit out an HTML file with a big matrix of URLs for each station, for a subset of years, and then mindlessly control-click on a dozen links at a time and answer each form. (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)

I also use perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit back out as XML.

The python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance) then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and the CSV module is a tad nicer.
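
A minimal sketch of that cleanup step. The field names and sentinel values below are assumptions for illustration (the real files define one all-nines sentinel per field width), and writing \N relies on MySQL's default LOAD DATA INFILE behavior of reading an unquoted \N as NULL.

```python
import csv
import io

# Hypothetical in-band NULL markers, one per field.
NULL_SENTINELS = {"temp": "9999", "press": "99999"}


def clean_record(rec):
    """Replace in-band NULL sentinels with real None values."""
    return {key: (None if value == NULL_SENTINELS.get(key) else value)
            for key, value in rec.items()}


def write_csv(records, fieldnames, out):
    """Write records as CSV, spelling None as \\N for LOAD DATA INFILE."""
    writer = csv.writer(out, lineterminator="\n")
    for rec in records:
        writer.writerow([r"\N" if rec[f] is None else rec[f] for f in fieldnames])


raw_records = [
    {"station": "BOS", "temp": "21.5", "press": "99999"},
    {"station": "BOS", "temp": "9999", "press": "1013.2"},
]
buf = io.StringIO()
write_csv([clean_record(r) for r in raw_records], ["station", "temp", "press"], buf)
print(buf.getvalue(), end="")
```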

The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved from any efficiency concerns) it's easy to see forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure for a full baseball season?" "Effectiveness of pitchers against gametime temperature, stratified by age of pitcher or inning?" "Batting average on fly balls vs. ground balls against % of total cloud cover?". It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and Earnings of Harley-Davidson vs. average household income, unemployment and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, Attendance at Baseball Games, and Sales of Fast Food vs. Consumer Satisfaction Index, national Suicide Rate, and Persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and Consumer Price Index? Amazon.com sales rank, # mentions on Technorati blogs and # of mentions in mainstream media vs. time?

The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert into a universal semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged and freely available information.


Thursday, December 13, 2007

Old-School Shop Guide

I rediscovered this super-compact reference-and-tool-and-measuring device while looking for a tool. It is jam-packed with handy information for anyone doing things mechanical or woodworking. I got this from a family friend of a family friend -- I bought their lathe after her husband, an avid (and skilled) woodworker, had passed away. She wanted his tools to go to someone who would love them and use them, which was me, which I do. The lathe is good, but I've discovered after the fact that the throwins were the best part. The chisels are *top notch*, but still pale in comparison to getting "his old woodworking magazines." This turned out to be almost every issue of Fine Woodworking magazine, beginning in its first year of publication; somewhere in the stack was this nifty Shop Guide.
I think this thing is so neat -- so much information in such a small space. My own mechanical data reference table (more here) has more numbers but less intrinsic functionality ... What's really neat about this shop guide is how they used the shape of the guide itself as a tool. Print this onto heavy cardstock and punch brother punch with care... enjoy!


Leveraging the Bittorrent Underground for semantic data and media

I just ran across a pretty interesting site called coverbrowser.com, which uses a variety of image APIs to pull in comic book, game, book, music, movie and other cover art. (Read the technical details here).

It reminded me of an idea I had a while back but which I will never get around to implementing --- maybe you will, or for all I know someone's already been doing it for years. (Sidenote: I've had some people express interest in this, and have worked out some parts of it, but just don't have the time to complete it right now. If you'd like to help develop it, get in touch.)

Many of the movie and music torrents on the, ahem, "Unauthorized Evaluation Copy" bittorrent sites contain hi-res scans of their cover art, and all of the major bittorrent sites maintain topic-specific RSS feeds.

As long as the torrent indexes the files individually (and not as an opaque .zip or .rar) -- and most do index individually -- you can target specific files within the torrent. I don't know whether you could chop all the large-file-size copyright-problematic files that you don't want out of the torrent, or whether you'd have to hack Azureus or another bittorrent client (instructing it to get only *.{png,gif,jpg,jpeg,bmp,tiff,tif} or what have you). Either way, you would then only be pushing out the bandwidth required to grab the photos and not the accompanying multi-megabyte file, and you would only be getting the information to which you assumedly have fair use rights.

So you'd set up a daemon process that would

  • watch the Movies and the Music RSS feeds off whichever or all of the sites,
  • identify albums whose cover art you lack,
  • pull in the bittorrent,
  • but download only the cover art
  • and perhaps also process any of the accompanying semantic data
You might have to get yourself a seedbox to make this work, but they're not unaffordable.

I think this would lead to a large stream of incoming cover art for music and other media files, complete with a reasonable amount of semantic information.

There's probably a lot of other crowdsourced semantic data flowing through the underground, if someone actually created such a torrenting robot. (And yes, I feel yucky using "crowdsourced" and "semantic data" in the same sentence).


Friday, October 26, 2007

Hourly Weather data for each Retrosheet game

I noticed some suspect entries for game conditions in the eventfiles and realized I could not only fix them but add a pretty useful dimension to the retrosheet collection. The National Climatic Data Center makes available "Global Hourly Surface Data" -- several dozen physical and observational characterizations of the current weather, taken hourly. This data goes back to the forties and sometimes to the start of the century.

Please enjoy this preliminary dataset giving the hourly weather data for each game in Fenway since 1957: http://vizsage.com/apps/baseball/results/weather/

(open the WeatherData-BOS07.* file of your choice) I don't have all the data in hand yet, but I thought I'd get your thoughts and see if anyone would like to help with some of the drudge work.

I'm excited about doing some fun things with the data, like seeing knuckleball effectiveness vs. humidity or elderly pitchers vs. temperature. Combined with the MLB gameday pitch trajectory info you could do physics "experiments": show the break distance of all curveballs vs. atmospheric pressure.

Email me back if you're interested or with comments.

---------- DATA FIELDS AVAILABLE ----------

The fields I've spit out are

- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag
From the cwgame output.
- temp deg C
The temperature of the air in degrees Celsius.
- press_atmos hPa
The atmospheric pressure at the observation point.
- press_sealvl hPa
The air pressure relative to Mean Sea Level (MSL).
- press_altim hPa
The pressure value to which an aircraft altimeter is set so that it
will indicate the altitude relative to mean sea level of an aircraft
on the ground at the location for which the value was determined.
- press_chg_3hr_del hPa
The absolute value of the quantity of change in atmospheric pressure
measured at the beginning and end of a three hour period.
- press_chg_3hr_obs --
The code that denotes the characteristics of an
ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of
three hours.
- wind_dir deg
The angle, measured in a clockwise direction, between true north and
the direction from which the wind is blowing.
- wind_obs --
The code that denotes the character of the WIND-OBSERVATION.
- wind_speed m/s
The rate of horizontal travel of air past a fixed point.
- wind_gust_speed m/s
The rate of speed of a wind gust.
- cloud_cover_low (frac)
The code that represents the fraction of the celestial dome covered
by all low clouds present. If no low clouds are present, the code
denotes the fraction covered by all middle level clouds present.
- vis_dist m
The horizontal distance at which an object can be seen and identified.
- sunshine_time min
The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3 --
The code that denotes a specific type of weather observed manually.
- groundcond --
The code that denotes a type of ground condition.
- precip_hist_contin bool
The code that denotes whether precipitation is continuous (true) or
intermittent (false).
- precip_lq1_depth mm
The depth of LIQUID-PRECIPITATION that is measured at the time of an
observation. Unit: millimeters.
- precip_lq1_period hours
The quantity of time over which the LIQUID-PRECIPITATION was measured.

---------- WHAT I DID ----------

I used Brian Foy's Google Earth index of Major League Stadiums:
http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml and the NCDC ISH-HISTORY file (gives locations for each weather station) ftp://ftp.ncdc.noaa.gov/pub/data/inventories/
to find the closest station with continuous data. (Turns out I could have saved a ton of trouble by just using the nearest airport -- in almost every case it was the best match.)

Then I pulled down data sets from http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505 (If you're interested in replicating any of this I have a script that sends a GET url to help automate the weather data collection.) The last step is to match games with stadiums with locations, and dates and times with hourly observations.

I could be clever and subtle and use the start time and game duration to grab only the hours of gameplay, but instead I just pull in the records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm for night games. I suppose I'll fix it to see if a game overhangs midnight and get the post-12am data for those only.
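
The not-clever version described above can be sketched like this. The observation format (a list of (datetime, record) pairs) is an assumption made for illustration, and "to 11:59pm" is implemented as "through the end of the game's calendar day", so games that run past midnight are cut off, as noted.

```python
from datetime import date, datetime, time

# Window start by daynight flag; both windows run to the end of the day.
WINDOW_START = {"D": time(10, 0), "N": time(17, 0)}


def observations_for_game(game_date, daynight, observations):
    """Keep the hourly observations that fall inside the game's window."""
    start = WINDOW_START[daynight]
    return [(when, ob) for when, ob in observations
            if when.date() == game_date and when.time() >= start]


# Invented hourly observations for a single day.
game_day = date(2007, 7, 4)
hourly = [(datetime(2007, 7, 4, hour, 0), {"temp": 20 + hour % 5})
          for hour in range(24)]
day_obs = observations_for_game(game_day, "D", hourly)
night_obs = observations_for_game(game_day, "N", hourly)
print(len(day_obs), len(night_obs))
```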

----------------------- WHAT YOU CAN DO TO HELP -----------------------

Geolocation for the rest of the stadiums

Inspect the data for consistency and correctness

If you have access to a computer at a .edu or .k12.us, or fancy GIS data, help me grab the rest of the weather files.

Email me if you'd like to help.


Retrosheet Eventfile Inconsistencies II

I've found a few more inconsistencies and minor inaccuracies in the retrosheet event files and game logs.

I made a diff (applied using the 'patch' tool) to mechanically recreate these corrections: http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff

I pulled these out by whipping up a few simple scripts (one-liners, mostly) that extract all unique values for each event file field. For example, the only values for the "info,pitches" field are 'count', 'none' and 'pitches' -- just as promised in the documentation. The "info,temp" field, however, has not only normal temperatures ("78", or "104", or "0" for [unknown]) but also spurious values of '670' and '700' (wrong), '8/7' (ill-formed) and '' (differs with the format documentation).
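
The one-liners themselves aren't shown here, so as a rough Python equivalent (an assumed reimplementation, not the scripts actually used): collect the set of distinct values seen for each "info" field and eyeball whatever looks out of place.

```python
from collections import defaultdict


def unique_info_values(lines):
    """Map each info field name to the set of distinct values seen for it."""
    values = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\r\n").split(",", 2)
        if len(parts) == 3 and parts[0] == "info":
            values[parts[1]].add(parts[2])
    return values


# A few event-file style lines, including the odd temperatures noted above.
sample_lines = [
    "id,BOS195804150",
    "info,temp,78",
    "info,temp,670",
    "info,temp,8/7",
    "info,temp,",
    "info,pitches,none",
    "info,pitches,pitches",
]
seen = unique_info_values(sample_lines)
for field in sorted(seen):
    print(field, sorted(seen[field]))
```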

I'll be posting all the dubious entries (event files version 2007 Sep 23) I find at http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html as comments.

==================== Incorrect Data ====================

In 1993MIL.EVA:
info,start,spieb001,"Bim| Spiers",1,9,4
should be
info,start,spieb001,"Bill Spiers",1,9,4

These temperatures need fixing:
1988MON.EVN,info,temp,670
1988MON.EVN,info,temp,700
1964NYA.EVA,info,temp,8/7

I looked at a few suspiciously short games (< 60 minutes):
This should be 1:58, according to the NYT box score:

http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C8585F9
1958BOS.EVA,info,timeofgame,58
These two are correct:
1971BAL.EVA,info,timeofgame,48 BAL197107300 -- Game called due to rain
1976BOS.EVA,info,timeofgame,57 BOS197609100 -- Game called due to rain
Another thing to look at would be a suspicious ratio of game length to
number of outs, but I haven't done this yet.

I also checked a few games with attendance below 1000, but these seem
to be very cold or rescheduled days. I'll take a peek sometime soon at
"game attendance more than two and a half standard deviations below
that year's average attendance" to see what sticks out. (I also
peeked at 2.5+ above -- those look like bandwagon games.)

==================== Badly Formatted ====================

These are probably correct but just ill-formatted:
1959CHN.EVN,info,timeofgame,0158
2001PIT.EVN,info,attendance, 34915
1962BOS.EVA,info,daynight,day,
1966ATL.EVN,info,howscored,"park"
1966HOU.EVN,info,howscored,"park"
1970CHA.EVA:data,er,roung101,4#
1958PIT.EVN:data,er,wills102,1y

In these files, the "howscored" field is spelled "howentered":
1990BOS.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990HOU.EVN,info,howentered,game
1990HOU.EVN,info,howentered,game
1990LAN.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990PIT.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SLN.EVN,info,howentered,game
1990TEX.EVA,info,howentered,game
1990TEX.EVA,info,howentered,game

There are no "info,edittime" records -- is this purposeful?

==================== Inconsistent with Documentation ====================

In the 2003TBA.EVA file, the umpires are given by name and not by ID.

These are supposed to use 0 as the unknown value but in a few places
use a blank.
1990NYA.EVA,info,temp,
1978ATL.EVN,info,attendance,
1978NYA.EVA,info,attendance,
1979SDN.EVN,info,attendance,
2000PIT.EVN:info,windspeed,

There are some "info,ump[...],(None)" fields, and there are some
"info,ump[...]," fields. Does one indicate "unknown" and the other
indicate "none"? Or is this a formatting inconsistency?

These files have a bunch of "info,windspeed,unknown" fields (the dox
say "An unknown windspeed is indicated by -1."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN
1970ATL.EVN 1970HOU.EVN
These files have an "info,temp,unknown" field (the dox say "An unknown
temp is indicated by 0."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN 1970ATL.EVN
1970HOU.EVN 1990NYA.EVA

These lines have trailing spaces, which is harmless but still
shouldn't be there:
1958CHA.EVA:info,save,
1957BOS.EVA:com,"xwas a lot of action. Had this game been played today, it no doubt"
1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
1958CLE.EVA:com,"$ Strong wind to left; cool"
1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
1958NYA.EVA:com,"$Total attendance: 13323"
1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"

Here are all the well-formed windspeed values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87
What are the units on these? If this is in MPH, 39 is Gale force
("Difficult to walk against wind. Twigs and small branches blown off
trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64
is Violent Storm (Beaufort force 11).

Here are games with windspeeds over 40:
id,CHA197408270|windspeed,67
id,MIN198008190|windspeed,87
id,TOR198208030|windspeed,68
id,CHN198307042|windspeed,74
id,TOR198307270|windspeed,87
id,LAN199006050|windspeed,78
id,DET199506160|windspeed,87
id,CLE199609141|windspeed,69
id,COL199606150|windspeed,59
id,DET199704300|windspeed,66
id,TEX200104220|windspeed,40
id,SLN200610010|windspeed,60

The SLN200610010 event file gives a wind speed of 60mph (from baseball-reference and ESPN), but a) that's crazy and b) the weather report from that day doesn't confirm it:

http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA
which gives 83F, 9mph SSW wind, clear.

See also my next message, about getting weather data for each game.

The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other value is the wind speed" but I think it should be "WindSpeed: -1 Unknown, other value is the wind speed in miles per hour".


Wednesday, October 17, 2007

Retrosheet Eventfile Inconsistencies

Here are a few inconsistent records in the retrosheet.org event files of 2007 Sep 23. I'm using chadwick and not the retrosheet DOS utils, but I think I've traced all of these back to the original event files.
Weird Attendance in gamelog GL1941.TXT:
  WS1194107220 (WS1 vs DET) has '1500 e' as its attendance
Weird Start Time in eventfiles:
Many start times lack an AM or PM. I assume the mapping of times is as follows:
   daynight  start_time   24hr Time
   D or N    0            Unknown
   D         1000..1259   1000h to 1259h
   D         100..459     1300h to 1659h
   N         500..1150    1700h to 2350h
In that case, here are some weird start times reported by cwgame:
  - Negative start time:
      2003 D 0  -195 SEA 2003 04 15        SEA200304150    info,starttime,-2:05PM   info,daynight,day
  - No daynight flag:
      1998 D 0   506 LAN 1998 08 30        LAN199808300    info,starttime,5:06      -- no daynight --
  - Plainly inconsistent daynight flag:
      1985 D 1   605 CIN 1985 06 21        CIN198506211    info,starttime,6:05PM    info,daynight,day
      1960 N 0   135 BOS 1960 04 19        BOS196004190    info,starttime,1:35PM    info,daynight,night
  - Second half of a double header, listed as a day game despite 5pm or later start:
      1966 D 2   507 BAL 1966 10 02        BAL196610022    info,starttime,5:07PM    info,daynight,day
      2001 D 2   500 PHI 2001 05 27        PHI200105272    info,starttime,5:00PM    info,daynight,day
      2001 D 2   519 PIT 2001 06 03        PIT200106032    info,starttime,5:19PM    info,daynight,day
      2001 D 2   625 MIN 2001 05 26        MIN200105262    info,starttime,6:25PM    info,daynight,day
      2001 D 2   719 CHA 2001 09 04        CHA200109042    info,starttime,7:19PM    info,daynight,day
      2001 D 2   738 CHN 2001 08 20        CHN200108202    info,starttime,7:38PM    info,daynight,day
      2001 D 2   752 PIT 2001 09 03        PIT200109032    info,starttime,7:52PM    info,daynight,day
      2001 D 2   753 SLN 2001 08 03        SLN200108032    info,starttime,7:53PM    info,daynight,day
  - Start times that appear to be after midnight (this could be correct):
      1996 N 1    35 CIN 1996 06 25        CIN199606251    info,starttime,0:35      info,daynight,night
      1998 N 0   105 LAN 1998 06 13        LAN199806130    info,starttime,1:05      info,daynight,night
      1966 N 2  1207 BAL 1966 06 08        BAL196606082    info,starttime,12:07AM   info,daynight,night
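
The assumed start-time mapping from the table above can be written as a small Python function. This is only a sketch of that assumption (with night starts of 500..1150 read as PM, i.e. 1700h to 2350h); anything that falls outside the table raises an error, which is one way to surface the weird records listed here.

```python
def to_24h(daynight, start_time):
    """Convert a cwgame-style start_time (e.g. 135, 705) to 24-hour HHMM.

    Returns None for an unknown start time (0) and raises ValueError
    for combinations the assumed mapping does not cover.
    """
    if start_time == 0:
        return None
    if daynight == "D" and 1000 <= start_time <= 1259:
        return start_time            # late morning or just after noon
    if daynight == "D" and 100 <= start_time <= 459:
        return start_time + 1200     # afternoon
    if daynight == "N" and 500 <= start_time <= 1150:
        return start_time + 1200     # evening
    raise ValueError(f"suspicious start time: {daynight} {start_time}")


print(to_24h("D", 135), to_24h("N", 705), to_24h("D", 1005))
```

Run against the examples above, a day game with a 605 start or a night game with a 135 start ends up in the ValueError branch.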
 
These eventfile games have more than one "info,daynight" record
  ATL197004150    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197004160    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197005260    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006191    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006192    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006200    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006210    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007031    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007032    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007050    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009220    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009230    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009240    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009250    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009260    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009270    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197006220    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008031    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008032    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008040    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009010    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009110    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009130    info,starttime,0:00PM   info,daynight,day       info,daynight,night
This eventfile game is missing an "info,daynight" record:
  LAN199808300    info,starttime,5:06
File Structure in eventfile 2001HOU.EVN:
  2001HOU.EVN lacks a trailing newline (unix commands hate this).
Here are the unix commands I used to dump all that info. Sorry for the one-linerism.
# How many have a negative starttime?
grep 'info,starttime,-' *.EV*

# How many have missing or extra "info,daynight" fields?
# -- pull out the info, daynight and starttime records in order
# -- slurp the whole file as one giant string with internal linebreaks;
# -- split each stretch following an id,XXXX record into one line
# -- dump lines that have none or more than one daynight record
  cat *.EV* | egrep '^(id,|info,daynight|info,starttime)' | \
    perl -e '$_ = join(" ",<>); s/[\r\n]+/!!!/g; @games= (split /id,/, $_);
      shift @games;
      for $game (@games) {
          $game =~ s/!!!/\t/g; print "$game\n" if (($game !~ m/daynight/) || ($game =~ m/daynight.*daynight/));
      }'
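
The missing trailing newline reported above for 2001HOU.EVN can be found with a check along these lines. This is a minimal sketch, not the command originally used; the function name lacks_trailing_newline is made up for illustration.

```shell
# Print the name of every non-empty file whose last byte is not a newline.
# lacks_trailing_newline is a hypothetical helper name.
lacks_trailing_newline() {
    for f in "$@"; do
        # Command substitution strips a trailing newline, so a non-empty
        # result means the final byte is something other than "\n".
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage (run in the directory holding the event files):
#   lacks_trailing_newline *.EV*
```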

# How many have a start_time and daynight_flag that disagree?
# -- use cwgame to pull off the gameID,start_time,daynight_flag records;
#    put it into a temporary file    
# -- Use a big stupid regex to find
#    . start_time that is >  500 and marked day
#    . start_time that is <  500 and marked night 
#    . start_time that is > 1200 and marked night 
#    . start_time that is <  100 
#    . start_time that is negative
( for ((year=1957;$year<=2006;year++)) ; do \
     for teamfile in ${year}*.[Ee][Vv]* ; do \
     cwgame -y $year -f '0-0,4-4,6-6' $teamfile 2>/dev/null ; \
     done; \
  done ) > /tmp/starttimeIDs.txt
cat /tmp/starttimeIDs.txt | \
  perl -ne '(m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",(12\d\d|[1234]\d\d|\d\d|[1-9]|-\d+),"(N)"/ ||
    m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",((?:5|6|7)\d\d|.*-.*|\d\d|[1-9]),"(D)"/)    &&
    printf "%s %s %5d %s %s %s %s\n", $7, $5, $6, $1, $2, $3, $4;' | sort
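
For anyone who would rather not decode the regex, here is a rough awk sketch of the same day/night sanity checks. It assumes each line of the temporary file looks like "GAMEID",start_time,"D" or "N", which is inferred from the regex above rather than from the cwgame documentation. The function name flag_suspect_starts is made up, and the thresholds follow the comment list above. Unlike the regex, it also flags a start time of exactly 0, and it prints the game ID whole instead of splitting it into team and date.

```shell
# Flag games whose start time looks inconsistent with the day/night flag.
# Reads lines like "PIT200109032",752,"D" from the named files or stdin.
# flag_suspect_starts is a hypothetical name; the input format is assumed.
flag_suspect_starts() {
    awk -F, '{
        gsub(/"/, "")                 # strip quotes; awk re-splits the fields
        id = $1; t = $2 + 0; dn = $3
        suspect = 0
        if (t < 100)                                 suspect = 1  # negative or before 1:00
        else if (dn == "D" && t >= 500 && t <= 799)  suspect = 1  # 5:00-7:59 marked day
        else if (dn == "N" && t <= 499)              suspect = 1  # 1:00-4:59 marked night
        else if (dn == "N" && t >= 1200 && t <= 1299) suspect = 1 # 12:xx marked night
        if (suspect) print dn, t, id
    }' "$@"
}

# Usage:
#   flag_suspect_starts /tmp/starttimeIDs.txt | sort
```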


Friday, September 7, 2007

Rules of thumb for Rack Leave in Scrabble

This isn't exactly within the ambit of this blog, but at least it's about data. While I should have been doing work, I instead made an awesome spreadsheet to find rules of thumb for the best Scrabble rack leaves. (Rules of thumb below; tables here: http://vizsage.com/other/scrabble/RackLeaveRules.html)

The computer program Quackle is one of the strongest Scrabble players in the world. It uses the following 'Superleave' valuation:

To find "synergies" and "anti-synergies" (dysphoria?), I calculated the marginal valuation for each combo. Basically, how much of the value of each two-leave is explained by the valuations of its component one-leaves, and so on for longer leaves? For example,
- From S (7.35) and M (0.08), the joint valuation of MS is 7.44, a marginal gain of about 0.01: the joint valuation is almost entirely from S&M. (<-- will lead to interesting google hits). This combination has no synergy.
- From Q (-9.0) and U (-5.1), the joint valuation of QU is 0.2, a marginal gain of 14.3. This is by far the largest synergy; next is ZO at +3.2.
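
The arithmetic behind those two examples is just the joint valuation minus the sum of the single-tile valuations. A small sketch, using the rounded values quoted in this post (so the last digit may not match the spreadsheet); marginal_gain is a made-up name, not anything from Quackle:

```shell
# Marginal gain of a two-tile leave: joint valuation minus the sum of the
# two single-tile valuations. marginal_gain is a hypothetical helper name.
marginal_gain() {
    # $1, $2 = single-tile valuations; $3 = joint two-tile valuation
    awk -v a="$1" -v b="$2" -v joint="$3" \
        'BEGIN { printf "%.2f\n", joint - (a + b) }'
}

marginal_gain 7.35 0.08 7.44    # M + S versus MS
marginal_gain -9.0 -5.1 0.2     # Q + U versus QU
```

A gain near zero means the pair is worth about what its letters are worth separately; a large positive gain marks a synergy, a large negative one an anti-synergy.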

I also played with three-letter synergies -- 3-leave valuations marginally different from the most explanatory 2-leave.

General Lessons:

  • Get a feel for the 1-leave list, and then learn these:
    • Synergy: QU OZ JU CH GN WY IN DE JK ER EV GIN JKY JKU ERS KWY HWY ?IN EST JOW ?AL ?EL ?IL IST
    • Anti-Synergy: BP CG FP MV PV CW CQ QS SX LQ BV SZ QR BC CZ VZ MQ RX GQ + most things with blank BPV CGQ BCG LQR FPV LNQ SVZ CMQ CLQ BCV BNV KTV LMQ GKT CFV GMQ FSV LNR DGT
  • Worth keeping with a blank: The letters in "Lei an orc DTM" + the following digrams IN AL IL EL CI AN ER EN AC AR IT NO QU ET DE CO AT OR LO GN OT AM DI CE IM IR DO MO GI AB AG
  • double letters are bad (duh), except FF, which is good.
See the spreadsheet at http://vizsage.com/other/data/superleaves.xls. Don't go betting the house on these results....

Tables (including 1-tile-leave values) are available here.


Monday, August 27, 2007

Subway Geography and Geometry

I've written an applet that lets you reimagine the geography of the greater Washington, DC area with "distance" measured by subway travel time, by subway travel cost, or as the standard clarified subway wall map would deform it. This was in large part inspired by Oskar Karlin's beautifully rendered Isochronic Elephant-Castle map of the London Underground and the interactive tube mapplet from Tom Carden. Subway Maps of the world all on the same scale is pretty interesting, as is this directory of remixed London Underground maps. There are a few interesting images on wiki commons, like this geographical map within this gallery. Also, you can download the image files (very large, register with each other) from Wikipedia:
