Thursday, December 13, 2007

Leveraging the Bittorrent Underground for semantic data and media

I just ran across a pretty interesting site called coverbrowser.com, which uses a variety of image APIs to pull in comic book, game, book, music, movie and other cover art. (Read the technical details here).

It reminded me of an idea I had a while back but which I will never get around to implementing --- maybe you will, or for all I know someone's already been doing it for years. (Sidenote: I've had some people express interest in this, and have worked out some parts of it, but just don't have the time to complete it right now. If you'd like to help develop it, get in touch.)

Many of the movie and music torrents on the, ahem, "Unauthorized Evaluation Copy" bittorrent sites contain hi-res scans of their cover art, and all of the major bittorrent sites maintain topic-specific RSS feeds.

As long as the torrent indexes the files individually (and not as an opaque .zip or .rar) -- and most do index individually -- you can target specific files within the torrent. I don't know whether you could chop all the large, copyright-problematic files that you don't want out of the torrent, or whether you'd have to hack Azureus or another bittorrent client (instructing it to get only *.{png,gif,jpg,jpeg,bmp,tif,tiff} or what have you). Either way, you would only be spending the bandwidth required to grab the images and not the accompanying multi-megabyte files, and you would only be getting the material to which you presumably have fair use rights.

So you'd set up a daemon process that would

  • watch the Movies and the Music RSS feeds off whichever or all of the sites,
  • identify albums whose cover art you lack,
  • pull in the bittorrent,
  • but download only the cover art
  • and perhaps also process any of the accompanying semantic data
You might have to get yourself a seedbox to make this work, but they're not unaffordable.
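
As a rough sketch of the "download only the cover art" step, here is how the daemon might pick out image files once it has a torrent's file list in hand. Everything here is hypothetical: the function name, the extension set, and the example paths are illustrative, and handing the chosen indices to an actual bittorrent client is left out entirely.

```python
from pathlib import PurePosixPath

# Extensions treated as cover art. This set is an assumption for illustration.
IMAGE_EXTENSIONS = {".png", ".gif", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def select_cover_art(files):
    """Return (index, path) pairs for the image files in a torrent's file list.

    `files` is a list of relative paths as they would appear in the torrent
    metadata. The indices are what you would tell a client to fetch.
    """
    wanted = []
    for index, path in enumerate(files):
        # Compare case-insensitively so "front.JPG" matches ".jpg".
        if PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS:
            wanted.append((index, path))
    return wanted


# Hypothetical file list for a single album torrent.
example_files = [
    "Some Album/01 - Track One.mp3",
    "Some Album/02 - Track Two.mp3",
    "Some Album/Scans/front.JPG",
    "Some Album/Scans/back.png",
    "Some Album/info.nfo",
]
selected = select_cover_art(example_files)
for index, path in selected:
    print(index, path)
```

Filtering on the file list alone means the daemon can decide whether a torrent is worth touching before it downloads anything beyond the metadata.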

I think this would lead to a large stream of incoming cover art for music and other media files, complete with a reasonable amount of semantic information.

There's probably a lot of other crowdsourced semantic data flowing through the underground, waiting for someone to actually create such a torrenting robot. (And yes, I feel yucky using "crowdsourced" and "semantic data" in the same sentence.)


Friday, October 26, 2007

Hourly Weather data for each Retrosheet game

I noticed some suspect entries for game conditions in the eventfiles and realized I could not only fix them but add a pretty useful dimension to the Retrosheet collection. The National Climatic Data Center (NCDC) makes available "Global Hourly Surface Data" -- several dozen physical and observational characterizations of the current weather, taken hourly. This data goes back to the 1940s and sometimes to the start of the 20th century.

Please enjoy this preliminary dataset giving the hourly weather data for each game in Fenway since 1957: http://vizsage.com/apps/baseball/results/weather/

(Open the WeatherData-BOS07.* file of your choice.) I don't have all the data in hand yet, but I thought I'd get your thoughts and see if anyone would like to help with some of the drudge work.

I'm excited about doing some fun things with the data, like seeing knuckleball effectiveness vs. humidity or elderly pitchers vs. temperature. Combined with the MLB Gameday pitch trajectory info, you could do physics "experiments": show the break distance of all curveballs vs. atmospheric pressure.

Email me back if you're interested or with comments.

-----------------------
DATA FIELDS AVAILABLE
-----------------------


The fields I've spit out are

- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag
These come from the cwgame output.
- temp deg C
The temperature of the air in degrees Celsius.
- press_atmos hPa
The atmospheric pressure at the observation point.
- press_sealvl hPa
The air pressure relative to Mean Sea Level (MSL).
- press_altim hPa
The pressure value to which an aircraft altimeter is set so that it
will indicate the altitude relative to mean sea level of an aircraft
on the ground at the location for which the value was determined.
- press_chg_3hr_del hPa
The absolute value of the quantity of change in atmospheric pressure
measured at the beginning and end of a three hour period.
- press_chg_3hr_obs --
The code that denotes the characteristics of an
ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of
three hours.
- wind_dir deg
The angle, measured in a clockwise direction, between true north and
the direction from which the wind is blowing.
- wind_obs --
The code that denotes the character of the WIND-OBSERVATION.
- wind_speed m/s
The rate of horizontal travel of air past a fixed point.
- wind_gust_speed m/s
The rate of speed of a wind gust.
- cloud_cover_low (frac)
The code that represents the fraction of the celestial dome covered
by all low clouds present. If no low clouds are present, the code
denotes the fraction covered by all middle level clouds present.
- vis_dist m
The horizontal distance at which an object can be seen and identified.
- sunshine_time min
The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3 --
The code that denotes a specific type of weather observed manually.
- groundcond --
The code that denotes a type of ground condition.
- precip_hist_contin bool
The code that denotes whether precipitation is continuous (true) or
intermittent (false).
- precip_lq1_depth mm
The depth of LIQUID-PRECIPITATION that is measured at the time of an
observation. Unit: Millimeters
- precip_lq1_period hours
The quantity of time over which the LIQUID-PRECIPITATION was measured.

---------- WHAT I DID ----------

I used Brian Foy's Google Earth index of Major League Stadiums:
http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml and the NCDC ISH-HISTORY file (gives locations for each weather station) ftp://ftp.ncdc.noaa.gov/pub/data/inventories/
to find the closest station with continuous data. (Turns out I could have saved a ton of trouble by just using the nearest airport -- in almost every case it was the best match.)
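
For anyone replicating the closest-station step, a minimal sketch using the haversine great-circle distance is below. It is not the script I actually ran: the station IDs are made up, the coordinates are approximate, and it assumes you have already parsed the stadium and station locations into latitude/longitude pairs. It also ignores the "continuous data" requirement, which you would have to check separately.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stadium_lat, stadium_lon, stations):
    """Return the station dict closest to the stadium."""
    return min(
        stations,
        key=lambda s: haversine_km(stadium_lat, stadium_lon, s["lat"], s["lon"]),
    )


# Approximate coordinates for Fenway Park.
fenway = (42.3467, -71.0972)

# Hypothetical station records; IDs are invented, coordinates are approximate.
stations = [
    {"id": "STATION-A", "lat": 42.3606, "lon": -71.0097},  # near Logan airport
    {"id": "STATION-B", "lat": 41.7000, "lon": -71.4000},  # farther south
]

closest = nearest_station(fenway[0], fenway[1], stations)
distance = haversine_km(fenway[0], fenway[1], closest["lat"], closest["lon"])
print(closest["id"], round(distance, 1), "km")
```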

Then I pulled down data sets from http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505 (If you're interested in replicating any of this I have a script that sends a GET url to help automate the weather data collection.) The last step is to match games with stadiums with locations, and dates and times with hourly observations.

I could be clever and subtle and use the start time and game duration to grab only the hours of gameplay, but instead I just pull in the records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm for night games. I suppose I'll fix it to see if a game overhangs midnight and get the post-12am data for those only.
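
The crude windowing described above can be sketched as follows. This is an illustration rather than my actual code: the record layout (a dict with a "when" timestamp) is invented, and the timestamps are assumed to already be in the stadium's local time.

```python
from datetime import date, datetime, time

# Fixed windows from the description above: day games and night games.
DAY_WINDOW = (time(10, 0), time(23, 59))
NIGHT_WINDOW = (time(17, 0), time(23, 59))


def observations_for_game(game_date, is_day_game, observations):
    """Keep the observations on game_date that fall inside the game's window.

    Each observation is a dict with a "when" datetime (assumed local time).
    Games that run past midnight are not handled here.
    """
    start, end = DAY_WINDOW if is_day_game else NIGHT_WINDOW
    kept = []
    for obs in observations:
        when = obs["when"]
        # Truncate to the minute so 23:59:30 still counts as inside the window.
        clock = when.time().replace(second=0, microsecond=0)
        if when.date() == game_date and start <= clock <= end:
            kept.append(obs)
    return kept


# Hypothetical hourly observations for one date, temperatures made up.
game_day = date(2007, 7, 4)
hourly = [
    {"when": datetime(2007, 7, 4, hour, 0), "temp": 20.0 + hour / 10}
    for hour in range(24)
]

day_obs = observations_for_game(game_day, True, hourly)
night_obs = observations_for_game(game_day, False, hourly)
print(len(day_obs), len(night_obs))
```

Handling the games that overhang midnight would mean also pulling the early hours of the following date, which this sketch deliberately leaves out.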

----------------------- WHAT YOU CAN DO TO HELP -----------------------

Geolocation for the rest of the stadiums

Inspect the data for consistency and correctness

If you have access to a computer at a .edu or .k12.us, or fancy GIS data, help me grab the rest of the weather files.

Email me if you'd like to help.
