tag:blogger.com,1999:blog-42016138028716428892008-02-01T12:27:58.865-06:00VizsageflipBlogger28125tag:blogger.com,1999:blog-4201613802871642889.post-83504760707124508862008-01-27T19:42:00.000-06:002008-01-27T20:15:47.258-06:00Rails Lessons Learned the Hard Way
<p>Things I've learned the hard way in Rails:</p>
<ul>
<li>Views render <strong>before</strong> layouts, not the other way round. Set an instance variable in app/views/monkeys/show.html.erb and it will be defined in app/views/layouts/monkey.html.erb, but not vice versa.
<ul>
<li>set instance vars in view<br />
<code>@foo_val = find_foo_val</code>
</li>
<li>pass variables to partials using<br />
<code><%= render :partial => "root/license", :locals => { :foo => @foo_val } -%></code>
</li>
<li>use the instance var freely in the layout; it will take the value defined in the view</li>
</ul>
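<p>A minimal plain-Ruby sketch of why this works, using the standard-library ERB rather than Rails itself. It assumes only that both templates are evaluated against the same object and that the view is rendered first; <code>FakeViewContext</code> and the two template strings are hypothetical stand-ins, not Rails internals.</p>

```ruby
require 'erb'

# Hypothetical stand-in for the Rails view context: every template is
# evaluated against this one object, so templates share its instance variables.
class FakeViewContext
  def render(template)
    # binding captures self (for @ivars) and any block given (for yield)
    ERB.new(template).result(binding)
  end
end

view   = '<% @foo_val = "set in the view" %><p>view body</p>'
layout = '<title><%= @foo_val %></title><%= yield %>'

context = FakeViewContext.new
body = context.render(view)              # the view runs first ...
page = context.render(layout) { body }   # ... then the layout wraps its output
puts page
```

<p>Swap the two <code>render</code> calls and the layout sees <code>@foo_val</code> as nil, which is the "not vice versa" half of the rule.</p>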
<li>Dump an object for bobo debugging through the console or log:<br />
<code>$stderr.puts tag_list.to_yaml</code>
</li>
<li>In a migration, a unique index enforces uniqueness only at the database level. As far as I can tell, <code>:unique => true</code> is an option for the index, not for the column, so put it on <code>add_index</code>; if you also want a uniqueness validation from Rails, add <code>validates_uniqueness_of :name</code> to the model:<br />
<pre><code>
create_table :monkeys do |t|
  # column options only: no :unique here
  t.string :name, :default => "", :null => false
end
# :unique belongs on the index
add_index :monkeys, [:name], :name => :name, :unique => true
</code></pre>
</li><li>If you scaffold a User or other object with private data, MAKE SURE you <a href="http://blog.wolfman.com/articles/2007/06/26/rest-scaffold_resource-security-warning">strip out fields you don't want a user setting or viewing</a>:<br />
<ul>
<li>Set attr_accessible, which controls data coming *in* -- prevents someone setting an attribute by stuffing in a form value. </li>
<li>In each view (.html.erb &c) and render method (to_xml), strip out fields you don't want anyone to see using the <code>:only => [:ok_to_see, :this_too]</code> parameter.</li>
<li>Set filter_parameter_logging, which controls what goes into your logs. (Logs should of course be outside the public purview, but 'Defense in Depth' is ever our creed.)</li>
</ul>
Using the restful-authentication generator as an example:<br />
<ul>
<li>In the model, whitelist fields the user is allowed to set (this excludes things like confirmation code or usergroup):<br />
<code>attr_accessible :login, :email, :password, :password_confirmation</code></li>
<li>In the controller file, whitelist only the fields you wish to xml serialize:<br />
<code>format.xml { render :xml => @user.to_xml(:only => [:first_name, :last_name]) }</code></li>
<li>Obviously, in show.html.erb and edit.html.erb, strip out fields that shouldn't be seen.</li>
<li>In the controller (typically ApplicationController), blacklist fields from the logs:<br />
<code>filter_parameter_logging :password, :salt, "activation-code"</code>
</li></ul></li>
<li>I won't even tell you how often this happens to me: If you edit or install code in a plugin, <strong>restart the server</strong>.
</li>
</ul><div class="blogger-post-footer"><script type="text/javascript"
src="http://vizsage.com/assets/adsense-728x90.js">
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script></div>fliptag:blogger.com,1999:blog-4201613802871642889.post-53660617236710710752008-01-27T18:34:00.000-06:002008-01-27T20:06:16.943-06:00Parsing Names with Honorifics<p>In <a href="http://railscasts.com/episodes/16">Railscast #16</a>, Ryan Bates goes over Virtual Attributes in Rails, using the standard example of storing first and last names but getting/setting full names. He uses the following simple snippet:</p>
<pre><code>
def full_name=(name)
split = name.split(' ', 2)
self.first_name = split.first
self.last_name = split.last
end</code></pre>
<p>Which -- given that the focus was on virtual attributes -- is fine for explanation. However, that snippet will fail on names like "Franklin Delano Roosevelt" (last name of "Delano Roosevelt"). Here's a method which our 32d President will like better:</p>
<pre><code>
# Squeeze each run of characters that are not letters, hyphens or commas into one space.
def clean(n, re = /[^[:alpha:]\-,]+/)
  n.gsub(re, ' ').strip
end
# Returns [first_name, last_name] (or '' if there isn't any).
# Leading/trailing spaces ignored.
def first_last_from_name(n)
parts = clean(n).split(' ')
[parts.slice(0..-2).join(' '), parts.last]
end
names = [
"Bill! Merkin,PhD.",
"Jim Thurston Howell III ",
"Charo",
"Heywood Jablowmie",
"Sergei Rodriguez-Ivanoviv",
"Polly Romanesq. ",
" ",
"",
]
p names.map { |n| first_last_from_name n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], ["", "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], ["", nil], ["", nil]]
</code></pre>
<p>A <a href="http://www.regular-expressions.info/tutorial.html">regex</a> is more extensible, and makes more sense for Perl refugees like me.</p>
<pre><code>
# Returns [first_name, last_name] (or nil if there isn't any).
# Leading/trailing spaces ignored.
def first_last_from_name_re(n)
n = clean(n);
(n =~ / /) ? (n.scan(/(.*)\s+(\S+)$/).first) : [nil, n]
end
p names.map { |n| first_last_from_name_re n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], [nil, "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], [nil, ""], [nil, ""]]
</code></pre>
<p>However, as someone who can't check in at the automatic kiosks in airports because -- no joke -- the credit card thinks my last name is "IV", I like this version better.</p>
<pre><code>
# Returns [first_name, last_name, appendix]
# (first name and appendix are nil if there isn't any).
# Leading/trailing spaces ignored.
#
def first_last_appendix_from_name_re(n, appendix_re = nil)
n = clean(n)
appendix_re ||= %q((I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?))
if (n !~ / /) then
[nil, n, nil] # with no spaces return n as last name
else
n.scan(
/\A(.*?)\s+ # everything up to the last name
(\S+?) # last name is last stretch of non-whitespace
(?: # But! there may be an appendix. Look for an optional group
(?:,\s*|\s+) # that is set off by a comma or spaces
#{appendix_re} # and that matches any of our standard honorifics.
)? # but if not, don't worry about it.
\Z/ix).first # scan gives array of arrays; \A..\Z guarantees exactly one match
end
end
p names.map { |n| first_last_appendix_from_name_re n }
# => [["Bill", "Merkin", "PhD"], ["Jim Thurston", "Howell", "III"], [nil, "Charo", nil], ["Heywood", "Jablowmie", nil], ["Sergei", "Rodriguez-Ivanoviv", nil], ["Polly", "Romanesq", nil], [nil, "", nil], [nil, "", nil]]
</code></pre>
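<p>To tie this back to the virtual-attribute setter from the Railscast, here is a minimal sketch of a <code>full_name=</code> that treats everything but the final word as the first name. <code>Person</code> is a hypothetical Struct standing in for an ActiveRecord model, and the character class used for cleaning is an assumption in the spirit of <code>clean</code> above.</p>

```ruby
# Hypothetical stand-in for an ActiveRecord model; only the two name fields matter here.
Person = Struct.new(:first_name, :last_name) do
  # Assumed cleaning rule: squeeze anything that is not a letter, hyphen or comma into one space.
  def self.clean(name)
    name.gsub(/[^[:alpha:]\-,]+/, ' ').strip
  end

  # Virtual attribute setter: the last word is the last name, the rest is the first name.
  def full_name=(name)
    parts = self.class.clean(name).split(' ')
    self.first_name = parts[0..-2].join(' ')
    self.last_name  = parts.last
  end

  # Virtual attribute getter: join whichever parts are present.
  def full_name
    [first_name, last_name].reject { |s| s.nil? || s.empty? }.join(' ')
  end
end

fdr = Person.new
fdr.full_name = "Franklin Delano Roosevelt"
p [fdr.first_name, fdr.last_name]   # ["Franklin Delano", "Roosevelt"]
```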
<p>All three versions might make Japanese (and other "FamilyName GivenNames" cultures) sad.</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-31296875502688330352008-01-22T21:15:00.000-06:002008-01-22T21:28:02.677-06:00Copyright Disputes are usually Failures of Imagination<p>Hasbro is <a href="http://www.huffingtonpost.com/2008/01/11/hasbro-tries-to-shut-down_n_81176.html">trying to shut down Scrabulous</a>, a successful online Scrabble game -- perhaps the most successful Facebook app to date.
</p><p>
On the one hand, I think that Hasbro is completely within their rights: it's a clear infringement.
</p><p>
On the other hand, it's a departure from form (for a long time they've licensed gray-market implementations), and a failure of imagination that doesn't account for important subtleties in software engineering and social networks.
</p><p>
On the software engineering end, all of the interesting computer Scrabble implementations I know of were created independently and *then* brought into the fold, to both parties' mutual benefit. Hasbro is a board game company: it doesn't, and shouldn't, employ the kind of brilliant independent software engineers who create a new entry in the scrabble ecosystem. The other thing to note is that Scrabulous solves some difficult problems in a way no previous product has.
</p><p>
Here's a brief history of the important scrabble programs I know of. The first ones let you play against a computer; this requires a powerful artificial intelligence (AI) engine and an unobtrusive interface. (The hard part is the AI; note that Scrabulous was written in Flash, a very constrained programming environment.) Maven was the first Scrabble program that played at an expert level (at one time it was the best scrabble player in the world). Though developed independently, it was purchased by Hasbro (or their licensee) and adopted as the AI agent in the official Hasbro Scrabble software. I don't believe that the official software has been updated for some years; it was Windows-only, and the <a href="http://www.hasbro.com/games/adult-games/scrabble/home.cfm?page=Products/catalog&sort=displayname">official scrabble site</a> has no link to it. <a href="http://web.archive.org/web/20021212213306/http://www.doe.carleton.ca/%7Ejac/scrab.html">ACbot</a>, another early implementation, was independently developed by James A Cherry and could play at a low-expert level. A current offering is <a href="http://web.mit.edu/jasonkb/www/quackle/">Quackle</a>, a free scrabble robot developed by a student at MIT. Its AI engine is extremely strong (also one of the best players in the world) and its front end, while /quite/ rough, is usable and works on Windows/OSX/Linux. All of these programs were written outside Hasbro's aegis. They were developed by experts in computer artificial intelligence and game theory and are far superior to anything that was or could be developed in-house by a board game company.
</p><p>
Another approach lets you play against a person using the network in real time. One of the first was MarlDOoM -- a primitive (text only, pre-web technology) free online scrabble bulletin board. It was developed by <a href="http://www.math.toronto.edu/jjchew">John Chew</a>, who at the time was simply a scrabble enthusiast but is now on the official Nat'l Scrabble Association's dictionary committee and the webmaster for their site -- I believe that implementing MarlDOoM helped bring this about. There are modern programs and websites that are officially licensed and let you compete remotely. However, their price or subscription fee exceeds the cost of the physical version, and they require that *both* parties pay for the game, which the physical version does not.
</p><p>
A third approach is 'scrabble by mail' -- one move every day or so, with as much or as little time commitment and deliberation as you care to devote. If there are licensed products that allow this I'm not aware of them.
</p><p>
In all, here's what you'd like a compelling software version of a board game to offer:
<ol>
<li>Play from any computer, anywhere; simple to acquire, install and use.</li><li>Reasonable price compared to the physical game</li>
<li>Skill level:
<ul>
<li>Enjoyable for an expert player</li>
<li>Enjoyable for a casual player</li>
<li>A casual player and a strong player may enjoy a game where their focus is on socializing and not gameplay</li></ul></li>
<li>Time commitment:
<ul>
<li>Play for 10 minutes at a time -- a quick diversion.</li>
<li>Play for an hour at a time -- a leisure activity</li>
<li>Play without having to meet at the same time</li></ul></li>
<li>Social play:
<ul>
<li>Play remotely against a friend, in real time (complete a game at one sitting)</li>
<li>Play remotely in a "Chess by Mail" context: make a move every day or so, when you have time.</li></ul></li>
<li>Competitive play:
<ul><li>Compete remotely against a skill-matched stranger, in real time or move-a-day</li>
<li>Track durable competitive rankings </li>
<li>Tie those ratings to a reputation system to prevent gaming the rating mechanism.</li></ul></li></ol>
</p><p>
None of the licensed programs or sites, as far as I know, cost less than the one-time-only, one-person-plays price of a physical Hasbro scrabble. Scrabulous is free, requires only a browser, and is available from any computer anywhere. It provides a simple experience that my computer-incompetent mom can enjoy. (As far as she knows, facebook *is* a scrabble program.)
</p><p>
Scrabulous is the first solution that enables me (an intermediate tournament-level player) to play remotely against any of my casual-level friends -- friends who would never pay for, or seek out, or regularly visit, a scrabble-only site. My friend Jen lives in Shanghai -- no previous approach that I'm aware of lets me play on my lunch break against her on her lunch break. None let me *easily* discover when a casual friend is on: all require that you go to their sandbox when you want to play, and that all the people you'd like to compete with patronize the same sandbox. None of them let me jump in / jump out for a quick 10-minute timewaster. Since Scrabulous/Facebook is part of a compelling portal, it's natural to check in and meet friends; it understands my social network; and the play-by-turns feature lets me scale the time commitment and schedule.
</p><p>
No previous approach effectively prevents a cheater from manipulating his rating. However, in Facebook you are a person: you have friends, you have a name, you are part of a community. It's still feasible to be a troll or a sock-puppet or to use any of the other strategies to game or disrupt a community rating, but there are barriers and consequences for doing so.
</p><p>
If Hasbro shuts down -- rather than licenses -- Scrabulous it will be a business failure. They should be ecstatic that people are integrating scrabble into their social lives, and should see a modest halo effect in board game sales. The revenue stream from Scrabulous' share of Facebook advertising is, I believe, quite significant -- enough for Hasbro and Scrabulous to both enjoy while keeping the game free.
</p><p>
More importantly, Social Network research consistently highlights the importance of "Network Effects" in technology adoption (http://en.wikipedia.org/wiki/Metcalfe%27s_law). There are many, many social games on Facebook, and if Scrabulous is taken down the large body of casual users will move to another entry in this niche. After all, these games are only interesting if your friends also play. Any Hasbro implementation must not only match the quality of Scrabulous' implementation, but must build a network of friends who select it for their social gameplay arena -- and they must build that network against the ill-will that will accrue from shutting Scrabulous down.
</p><p>
Creating a software program (and more importantly a community) like the one Scrabulous has is HARD: look at all of the previous attempts that have failed to get millions of people to play online. It's hard because there are subtle and serious software engineering challenges, and it's hard because there are subtle and serious community building challenges. If Hasbro shuts down the Scrabulous guys there's no reason to think Hasbro will be able to reproduce their success.</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-9057223035421464742008-01-16T19:22:00.001-06:002008-01-16T20:09:18.859-06:00The power of a good visualization<div style="text-align: left;"><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_KypMAWXENa4/R46aSQy0NqI/AAAAAAAAACw/3wOs1kBNlFA/s1600-h/work-all-diskusage.jpg"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://bp1.blogger.com/_KypMAWXENa4/R46aSQy0NqI/AAAAAAAAACw/3wOs1kBNlFA/s320/work-all-diskusage.jpg" alt="" id="BLOGGER_PHOTO_ID_5156228261922223778" border="0" /></a>I just found a program called <a href="http://grandperspectiv.sourceforge.net/">Grand Perspective</a> that presents your disk usage as an interactive treemap (see pic on right). Helping web nerds save hard drive space isn't finding hidden heart defects or keeping planes in the air, but I was struck by how well this program demonstrates the power of intelligent data exploration tools. Here are the <a href="http://www.sciam.com/article.cfm?chanID=sa006&colID=13&articleID=00033494-443B-1237-81CB83414B7FFE9F">Tufte criteria</a> for information presentation:
<blockquote><strong>Documentary · Comparative · Causal · Explanatory · Quantified · Multivariate · Exploratory · Skeptical</strong></blockquote>Each box is a file, and each top-level directory takes a continuous rectangular portion of the view. Scanning a 350GB disk with a /lot/ of tiny files (5+ million for just the far top left corner, the MLB gameday dataset) took < 5 minutes. You may highlight any box in a segment and navigate "down" to make that segment fill the screen, and may choose to color files by location, depth, name or extension (exploratory, multivariate).
</div>The giant orange box in the top left was 15GB of pure junk -- apparently a CGI-script generating some page I was screenscraping went crazy and sent me 15GB of junk data, the same line repeated almost billions of times. I had /no/ idea it was sitting there. That dataset was supposed to be huge, so I had never drilled into the directory beyond my standard <tt>du -sc | sort -n</tt> on the containing directory. The picture, however, showed at a glance what a table of numbers dramatically failed to do: that the directory consumed twice as much as it should. The <span style="font-weight: bold;">simple metaphor</span> of diskspace=area and the <span style="font-weight: bold;">whole-disk view</span> (explanatory, documentary) - highlighted something important I'd never noticed.
The giant cluster in the bottom right corner is a huge (~51GB) collection of video ephemera I only kinda cared about. I planned, someday, to sort them -- but for that effort and 51GB usage, it was clearly not worth it. By <span style="font-weight: bold;">enforcing comparisons</span>, the data display made me reconsider the value vs. resource consumption of that project and make a more sound decision.
In all, I freed up almost 100GB and put a few bucks in the author's tip jar. Joe Bob says <a href="http://grandperspectiv.sourceforge.net/">Check it out</a>. (<a href="http://lifehacker.com/software/disk-space/geek-to-live--visualize-your-hard-drive-usage-219058.php">Similar programs</a> exist for Linux (Baobab) and Windows (WinDirStat) too.)
fliptag:blogger.com,1999:blog-4201613802871642889.post-7013490010202164562008-01-14T20:08:00.000-06:002008-01-14T20:16:37.589-06:00The 2007 Feltron Annual Report<a href="http://feltron.com/index.php?/content/2007_annual_report/">The 2007 Feltron Annual Report</a> is available now. In a series of elegant infographics, see the ambit of places he walked to in Brooklyn and Manhattan, review how many albums Mr. Feltron bought in the year (12 CDs, 1 LP and 98 download tracks), and how often he visited bars in October (6 times; he made 57 total bar visits in the year, down 39% from 2006). My print copy is on the way. (<a href="http://feltron.com/06report_index.html">last year's report</a>).
Metadata is the new Eyeballs (which is the old Interaction).
fliptag:blogger.com,1999:blog-4201613802871642889.post-83053776489521519722008-01-08T12:48:00.001-06:002008-01-27T19:40:56.893-06:00More things I wish someone else will write<p>More random software ideas:</p><ul>
<li>Google search, restricted to find bug reports only. You'd crawl usenet, sourceforge/google code, debian etc. build farms, open issue trackers, mailing list archives and blogs; extract things in 'pre' tags, and look for repeated stanzas (these indicate where a bug was pasted in).</li>
<li>NTP server along the lines of the <a href="http://davidseah.com/blog/comments/a-chindogu-clock-for-procrastinators/#commentstart">procrastinator's clock</a>, that would dither the time (by extending or delaying each second) so the clock is fast by up to a set amount, and never slow. You'd have to be careful with rsync, server logs, kerberos/cookie session stores/other authentication... or maybe just use it at the app level, if your clocks will use NTP themselves.</li>
</ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-70488438569093448562008-01-07T09:56:00.000-06:002008-01-07T09:58:01.676-06:00Reference Cards<p>Here are some pretty <a href="http://vizsage.com/other/cheatsheets/">reference cards</a> I made a while back:</p><ul>
<li><strong><a href="http://vizsage.com/other/cheatsheets/ScaleLandmarks.pdf">Scale Landmarks</a></strong>: What's something you're familiar with that is about 10 nm big? How do the speed of continental drift, a raindrop, a champion sprinter and an SR-71A Blackbird compare? What is the range between the least massive (electron) and most massive (universe) objects science can describe?</li>
<li><strong><a href="http://vizsage.com/other/cheatsheets/PeriodicTable.pdf">Periodic Table</a></strong></li>
<li><strong><a href="http://vizsage.com/other/cheatsheets/PeriodicTableFlat.pdf">Periodic Table, Flat</a></strong> -- material properties as a table and not as Mendeleev puts it.</li>
<li><strong><a href="http://vizsage.com/other/cheatsheets/MechanicalInfo-Fasteners.pdf">Mechanical, Geometric and Material properties of Screws, Bolts and Fasteners</a></strong> -
probably the most useful among these, this gives thread geometry, decimal inch/screw/metric equivalents, mechanical strengths, torque ratings and more. Super handy for machining or general shop use.</li>
<li>A similar table for <strong><a href="http://vizsage.com/other/cheatsheets/MechanicalInfo-ANHardware.pdf">AN Hardware</a></strong> (milspec fasteners used in airplanes, racecars and hot rods).</li>
<li><strong><a href="http://vizsage.com/other/cheatsheets/MechanicalInfo-DecimalEquivalents.pdf">A flat table of decimal equivalents</a></strong>: decimal and fractional inch, metric, and standard (US) screw sizes.</li>
<li><strong><a href="http://vizsage.com/other/cheatsheets/MechanicalInfo-ASCIIChart.pdf">ASCII Chart</a></strong> - easily index up hex, octal, ascii, symbol font/latin font/DOS font values for characters.</li>
</ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-5599559232587792832008-01-06T00:10:00.001-06:002008-01-06T21:51:25.636-06:00Owning my Metadata<p>Dear Lazyweb,</p><p>
I'd like someone to invent a 'Metadata reclaimer': a program to screenscrape all my amazon ratings, flickr tags, facebook posts, etc.</p><p>
I try, as far as possible, to only use apps that let me keep ownership of my metadata. As our friend <a href="http://pud.com">pud</a> has remarked, all successful internet enterprises share the same business model: either <ul><li><strong>People pay to Enter Data into your Database</strong> (eBay, Google AdWords, Flickr, Second Life, World of Warcraft, IMDB pro, Craigslist), or less defensible,</li><li><strong>People Enter Data into Your Database For Free while Other People Pay to Get it Out</strong> (rapidshare, iTunes Music Store, Pud's Internal Memos; with youtube, myspace, epinions etc viewers pay with the tenuous currency of their ad brain).</li></ul></p><p>
There's nothing wrong with that; all these companies levelled their playing field in some fundamental and important way. (Well, nothing wrong unless you're the loathsome gracenote.com (formerly cddb), who turned an open community-generated resource into a closed database, without even the courtesy of a copy to fork from.) </p><p>But it's fair to ask that I be able to export <strong>my</strong> copy of the data I've added to their business asset, and to do so easily.</p>
<p>Sites that play well with others:</p><ul><li>my del.icio.us tags and bookmarks</li><li>my bloglines/google reader feeds</li><li>my librarything.com everything</li><li>my last.fm history</li><li>my iTunes playcounts, tags and ratings: mostly, I think?</li><li>Firefox bookmarks and history</li></ul>
<p>Sites with an 'I gave up my metadata and all I got was this stupid webpage' policy:</p><ul><li>facebook posts, friends, photos, everything</li><li>flickr tags &c</li><li>amazon recommendations</li><li>Google calendar mostly no (at least, the last time I tried to sync my address books it was a Giant Pain in the Ass: nothing was durably id'd and recurring events were semantically incorrect. Yes, I'd love to have 96 separate entries for my Grandmother's birthday!)</li><li>eBay bids, purchases, ratings</li><li>Blogger: Blogs, yes if you remote host your site. However, you can't even /list/ the blogger comments you've made, let alone export them.</li><li>I believe Myspace's engineers can't even spell XML</li></ul><p>(I could be wrong about any of these except the last one).</p><p>
I'm picturing something with a plugin architecture -- the main app handles the screenscraping, authentication, form submission, web crawling and file export details; the plugin supplies URL wildcards and regexp's the data back into semantic structure. With XML export, a motivated plugin author or well-itched user could supply a decent XSLT stylesheet to represent that metadata in a useful local fashion (and with helpful links back to the main site). It would be useful to have plugins (trivial) and stylesheets (no more or less so) even for sites like Last.fm and Library Thing that Do The Right Thing by granting transparent access to your metadata.</p><p>
Much of this may exist in some form or another; for example the <a href="http://connectedflow.com/flickrexport/">Aperture/iPhoto plugin</a> will apparently sync your flickr and iPhoto tags, and embed the result into the app database. But going from XML => app is more flexible -- and possibly easier -- than the other way 'round.</p><p>
I <a href="http://mrflip.com/resources/Ratings.html">one off'ed this a while back for my Amazon ratings</a>, but I just saw where I'd gone from ~350 to ~650 'things rated' since then. I'm hoping the LazyWeb has solved my problem, since I'm not sure where I put those scripts. (Ironic, considering my previous post.)</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-38769976157673945702008-01-05T12:38:00.000-06:002008-01-06T21:59:18.607-06:0050 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW<p><em><small>Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.</small></em></p>
<p>I've had two interesting realizations from the <a href="http://vizsage.com/blog/2007/10/hourly-weather-data-for-each-retrosheet.html">Retrosheet Baseball data vs. Hourly Weather information</a> mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is that the hardest parts of this process are also the stupidest parts. There are four steps in doing an interesting visualization of open data. In order of the steps, which is also the order of decreasing difficulty and decreasing stupidity:</p><ul>
<li>Bring the data from behind its bureaucratic barriers</li>
<li>Unlock it into a universal format</li>
<li>Process and digest the data</li>
<li>Actually explore, visualize and share the data</li></ul><p>
The hardest and least justifiable steps are the first two, a problem we have to fix. [Edit: this is why I'm <a href="http://infochimp.org">starting infochimp.org</a>] </p><p>
Here's a longer description of how I did the baseball games / weather data mashup.
</p><p>
Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.</p><p>
To suck in the Hourly Weather Data files, you have to <a href="http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505">click through a 4-screen web form process</a> to prepare a query. Although it sends the final form submission as a POST query, the backend script does accept a GET url (you know, where the data is sent in the URL, as in form.pl?param=val&param2=val&submit=yay, instead of in the body of the HTTP request). There's an excellent <a href="https://www.squarefree.com/bookmarklets/forms.html">POST to GET bookmarklet</a> that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a twirl for screenscraping webpages or just trying to understand what's going on behind the curtain.</p><p>
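For a scripted version of the POST-to-GET trick, a minimal sketch in Ruby follows. The endpoint and field names are hypothetical, borrowed from the form.pl example above; the real NOAA form has its own parameters, and nothing guarantees its backend accepts them this way.</p>

```ruby
require 'uri'

# Hypothetical endpoint and form fields, for illustration only.
base   = "http://example.com/form.pl"
params = { "param" => "val", "param2" => "val", "submit" => "yay" }

# encode_www_form percent-encodes keys and values and joins the pairs with '&'.
get_url = "#{base}?#{URI.encode_www_form(params)}"
puts get_url
```

<p>Building the query string with <code>URI.encode_www_form</code> rather than by hand means values containing spaces or ampersands still survive the trip.</p><p>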
Now I need to know what queries to generate. First I needed the location of each major league baseball stadium: Brian Foy posted a <a href="http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml">Google Earth index of Major League Stadiums</a>, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. These simple routines just pull in the XML files and create a data structure (hashes and arrays of hashes) that mirrors the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?</p><p>
Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:<pre>
USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456</pre></p><p>
To break this apart, you just specify an 'unpack' format string. 'A' means an (8-bit) ASCII character; 'x' means a junk character:<pre>
A6 xA5 xA29 xA2xA2xA2xA4 xxA6 xA7 xA6
</pre> The result is an array holding each interesting (non-'x') field. The Perl code snippet:<pre> # Flat file format
my $fmt    = "A6x A5x A29x A2xA2xA2xA4xx A6x A7x A6";
my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
# Pull in each line
for my $line (<WSTNSFLAT>) {
    next if length($line) < 79; chomp $line;
    # Unpack flat record
    my @flat = unpack($fmt, $line);
    # Process raw record
    ...
}
</pre> </p><p>I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, we have since ~1945 for Hourly and since ~1900 for Daily).</p><p>
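The unpack step above carries over almost unchanged to Ruby, whose <code>String#unpack</code> understands the same 'A' and 'x' directives. This is only a sketch in Ruby rather than the Perl actually used: the sample row is rebuilt from the Jan Mayen entry with the field widths given above, and reading lat/lng as thousandths of a degree is an assumption based on the sample values.</p>

```ruby
# Template and field names follow the Perl snippet above.
FMT    = "A6x A5x A29x A2xA2xA2xA4xx A6x A7x A6"
FIELDS = %w[id_USAF id_WBAN name region country state callsign lat lng elev]

# Rebuild the 79-column sample row so the column positions are explicit.
line = format("%-6s %-5s %-29s %-2s %-2s %-2s %-4s  %-6s %-7s %-6s",
              "010010", "99999", "JAN MAYEN", "NO", "JN", "", "ENJA",
              "+70933", "-008667", "+00090")

# 'A' strips trailing spaces, so blank fields come back as empty strings.
record = FIELDS.zip(line.unpack(FMT)).to_h
p record["name"]
p record["callsign"]

# Assumption: lat/lng are stored in thousandths of a degree.
lat = record["lat"].to_i / 1000.0
lng = record["lng"].to_i / 1000.0
p [lat, lng]
```

<p>Zipping the unpacked array against the field names gives a hash per station, which is easier to pass around than positional arrays.</p><p>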
Then I score each station by (Proximity and Amount-of-Data), and select the five best stations for each stadium.</p><p>
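The post doesn't give the scoring rule, so what follows is only one plausible sketch in Ruby: great-circle distance via the haversine formula, with a made-up score of years-of-data divided by distance. The <code>station_score</code> formula, the hash keys and every coordinate below are hypothetical.</p>

```ruby
# Great-circle distance in km between two lat/lng points (haversine formula).
def haversine_km(lat1, lng1, lat2, lng2)
  rad  = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlng = (lng2 - lng1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlng / 2)**2
  2 * 6371.0 * Math.asin(Math.sqrt(a))
end

# Hypothetical scoring rule: more years of data is better, more distance is worse.
def station_score(stadium, station)
  km = haversine_km(stadium[:lat], stadium[:lng], station[:lat], station[:lng])
  station[:years] / (1.0 + km)
end

# Keep the `count` highest-scoring stations, best first.
def best_stations(stadium, stations, count = 5)
  stations.max_by(count) { |s| station_score(stadium, s) }
end

# Made-up coordinates and coverage, for illustration only.
stadium  = { :name => "Example Park", :lat => 41.95, :lng => -87.66 }
stations = [
  { :id => "near_short", :lat => 41.96, :lng => -87.65, :years => 10 },
  { :id => "near_long",  :lat => 41.98, :lng => -87.90, :years => 60 },
  { :id => "far_long",   :lat => 44.00, :lng => -90.00, :years => 60 },
]
p best_stations(stadium, stations, 2).map { |s| s[:id] }
```

<p>How to trade distance against coverage is a judgment call; the division here is just the simplest thing that ranks a nearby station with decades of data above a distant one.</p><p>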
So I had Perl spit out an HTML file with a big matrix of URLs for each station, for a subset of years. I could of course have used Perl to generate the POST requests using the HTTP modules, but it was simpler to just mindlessly control-click on a dozen links at a time and then answer each form. (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)</p><p>
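A link matrix of that kind takes only a few lines to generate. The sketch below is in Python rather than Perl, and both the base URL and the query parameter names (station, year) are placeholders; the real NCDC request format is not reproduced here.

```python
from html import escape
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real NCDC URL.
BASE_URL = "http://example.com/weather"


def link_matrix(station_ids, years):
    """Build an HTML table: one row per station, one link per year."""
    rows = []
    for station in station_ids:
        cells = []
        for year in years:
            # Hypothetical parameter names, for illustration only.
            query = urlencode({"station": station, "year": year})
            href = escape(BASE_URL + "?" + query)
            cells.append('<td><a href="%s">%s</a></td>' % (href, year))
        rows.append("<tr><th>%s</th>%s</tr>" % (escape(station), "".join(cells)))
    return "<table>\n%s\n</table>" % "\n".join(rows)
```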
I also use perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit back out as XML.</p><p>
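The same cleanup can be sketched in Python with the standard library. The input layout here is an assumption (a flat export with one row element per record and one child element per field); the real MySQL Query Browser element names may differ, so row_tag, item_tag and root_tag are all hypothetical parameters.

```python
import xml.etree.ElementTree as ET


def rows_to_attributes(flat_xml, row_tag="row", item_tag="park", root_tag="parks"):
    """Turn <row><field>value</field>...</row> records into
    <park field="value" .../> elements under a single root."""
    source = ET.fromstring(flat_xml)
    root = ET.Element(root_tag)
    for row in source.iter(row_tag):
        # Each child element becomes an attribute named after its tag.
        attrs = {child.tag: (child.text or "").strip() for child in row}
        ET.SubElement(root, item_tag, attrs)
    return ET.tostring(root, encoding="unicode")
```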
The python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance) then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and the CSV module is a tad nicer. </p><p>
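A minimal sketch of that NULL handling, assuming the all-nines convention described above and the MySQL habit of reading an unquoted \N as NULL in LOAD DATA INFILE. The helper names are made up for illustration.

```python
import csv


def null_if_sentinel(value, width):
    """Map an all-9s field of the given width (e.g. '99999') to None."""
    value = value.strip()
    return None if value == "9" * width else value


def write_rows(rows, out):
    """Write rows as CSV, emitting \\N for None so MySQL loads a real NULL."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([r"\N" if v is None else v for v in row])
```

With the default minimal quoting, the \N marker is written unquoted, which is what lets the loader tell it apart from a literal string.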
The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved from any efficiency concerns) it's easy to see forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure for a full baseball season?" "Effectiveness of pitchers against gametime temperature, stratified by age of pitcher or inning?" "Batting average on fly balls vs. ground balls against % of total cloud cover?". It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and Earnings of Harley-Davidson vs. average household income, unemployment and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, Attendance at Baseball Games, and Sales of Fast Food vs. Consumer Satisfaction Index, national Suicide Rate, and Persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and Consumer Price Index? Amazon.com sales rank, # mentions on Technorati blogs and # of mentions in mainstream media vs. time?</p><p>
The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert into a universal semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged and freely available information.</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-42090403392480913422008-01-03T22:36:00.000-06:002008-01-06T21:28:01.232-06:00Time Machine is neat-o, but I want a Time and Space Machine for my files<p>I've long wished for a versioned home directory, but the svn/ish seem too heavyweight, and it's nice to have a live copy and not an opaque DB-ball. The right answer is the stunningly elegant <a href="http://arstechnica.com/reviews/os/mac-os-x-10-5.ars/14">Time Machine</a>. The idea <a href="http://www.mikerubel.org/computers/rsync_snapshots/">isn't</a> <a href="http://rsnapshot.org/">actually</a> <a href="http://users.softlab.ece.ntua.gr/~ttsiod/backup.html">new</a>, and it can be approximated <a href="http://code.google.com/p/flyback/">across platforms</a> and <a href="http://www.macosxhints.com/article.php?story=20071220105635147">remotely</a> using standard Unix tools. Now you just need a landing spot.</p><p>
Some remote backup solutions have come to the fore lately. $Zero/year gets you 2GB remote backup from <a href="https://mozy.com/?code=T7LAHN">Mozy</a>. $60/year buys unlimited backup space at <a href="https://mozy.com/?code=T7LAHN">Mozy</a>, where 'unlimited' means the ~40 GB/month you'll see by leaving your pipe fully saturated 24/7. A price between free and $100/year gets you the more intriguing <a href="http://www.crashplan.com/">CrashPlan</a>. These are slick and easy. If you don't know what ssh is, you want one of these. If you do have ssh and you don't set your mom's computer up with the free Mozy account then you're a bad person. </p><p>
For the uber-uber-nerds, use what I'm using: an $85/year <a href="http://www.bluehost.com/track/mrflipco/text1/">bluehost account</a>. You may think of it as a 600GB ssh'able rsync'able remote backup host that, by the way, can also act as a webserver. You can install svn (as I have) to do versioning over svn+ssh. Two years of bluehost costs the same as a 500GB hard drive+cheap enclosure, and they regularly increase your diskspace allowance at the same monthly price. <small>[Note: the preceding bluehost link will give me a kickback if you sign up through it. Hit bluehost.com directly if that rubs you the wrong way.]</small></p><p>
As described below, all my various project shards get sync'ed back to my desktop PC. The desktop then pushes my files out to a bluehost account via rsync, versioned with an <a href="http://blog.interlinked.org/tutorials/rsync_time_machine.html">rsync-as-poor-man's-time machine</a> script. This gives me a live, versioned backup, accessible to me from anywhere by ssh, on Bluehost's offsite, secure, RAID-UPS-and-diesel generator protected colocation, and at the end of a fat pipe. After about a week or so for the initial ~50GB backup to roll in, daily incrementals will take an hour or two each day (bandwidth choked to 25 kBps). When I leave town for a week next month I'll start pushing my music collection into its own unversioned directory.</p><p>
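The core of the hard-link snapshot trick is rsync's --link-dest option: unchanged files in the new snapshot become hard links into the previous one, so each snapshot looks complete but only changed files cost disk space. The sketch below only builds the command line; it is not the linked script, and the host, paths and snapshot naming scheme are all hypothetical.

```python
from datetime import datetime


def snapshot_command(src, dest_root, host, previous=None,
                     bwlimit_kbps=25, now=None):
    """Build an rsync argument list for one timestamped snapshot."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%dT%H-%M-%S")
    cmd = ["rsync", "-az", "--delete", "--bwlimit=%d" % bwlimit_kbps]
    if previous:
        # Hard-link unchanged files against the previous snapshot
        # (the path is interpreted on the receiving side).
        cmd.append("--link-dest=%s/%s" % (dest_root, previous))
    # Trailing slash on the source: copy the directory's contents.
    cmd += [src.rstrip("/") + "/", "%s:%s/%s" % (host, dest_root, stamp)]
    return cmd
```

The resulting list can be handed to subprocess.run from a cron-driven script once the previous snapshot name is known.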
This is only for the stuff I've created or can't replace: not for system software and not for music/movies/media (apart from my iTunes.xml, inbox, and various bookmarks/pref's/stickiesDB/MySQLCachedQueries folders). Unlike most people, I don't worry too much about backing up my system software. Maybe I'm damaged from my Windows upbringing, but just reinstall from scratch if your OS gets hosed (I keep install disks and images around). The nascent defect may be present in the restore; the accumulated OS cruft certainly is. You're already kinda screwed; better to take a certain two days and finish with a clean system than to fight a flaky restore and then spend those two days again. Yes, you are allowed to come point and laugh if this happens to me.</p><p>
Beyond the backup, I have various levels of defense-in-depth for my data -- data that is created, changes daily and is essential for my economic well-being has four or more levels of redundancy. Data that is intransigently huge but can be sourced elsewhere has no redundancy. (There's no reason to back up my processed wikipedia dump, for instance: only the scripts that process it.)
Right now my fileverse is spread across <ul><li>Desktop computer ~ 1TB</li><li>Flivo (my homebrew Tivo computer) ~0.5TB</li><li>School account ~3GB</li><li>sourceforge account</li><li>Four webservers holding sites I operate or caretake ~ 1-5GB each</li><li>GMail account ~4GB</li><li>Flickr account ~6GB</li><li>iPhone, Google Calendar, Yahoo Address book, Plaxo</li><li>blogger/twitter/facebook/etc</li></ul></p>
<p>This is organized as<ul><li>All the stuff I'm currently messing with is in a 'now' folder. This is what sync's to my school account, and every time I go on a trip I burn a DVD of the now folder to toss in my backpack. (You never know when you'll want a file, and it enforces an occasional hardcopy backup).</li><li>Within the now folder, the stuff I develop for work is versioned with svn. I house a private repository on my bluehost account and connect over svn+ssh. I'm reasonably good about checking in every few hours or when I shift conceptual gears.</li><li>Most of the 'now' folder I keep in sloppy sync with the school account. ('sloppy' because I sync when I think of it and not through a cron script as I should).</li><li>Each year I move everything that isn't under current work out of the now folder and into an 'archive/YEAR' folder; there it sits and changes almost never.</li><li>GMail holds all my mail (sync'ed with IMAP)</li><li>Flickr holds an incomplete and poorly correlated segment of my photos (they own my metadata, and yes that bugs me).</li><li>iPhone sync handles the address book; gcal is still quite difficult.</li><li>The rest of the little metacontent is trapped, meh.</li><li>Each webserver's content is replicated to the desktop. For small changes I'll sometimes diddle the file on the live server and then sync back later (tsk tsk); for heavier work I proceed locally and then deploy.</li></ul></p><p>
The usage breaks down like this:<pre>
== Work space -- changes ~ daily ==
   2 GB  vizsage project software        Desktop, svn, school, vizsage, bkup
   9 GB  infochimp site & working data   Desktop, svn, school, infochimp, bkup
== Work resources -- changes ~ weekly ==
 100 GB  'huge' datasets                 Desktop, infochimp
  60 GB  live local DB of datasets       Desktop
  30 GB  infochimp website DB            infochimp, bkup
== Slowly changing -- changes ~ semiyearly ==
   3 GB  Other projects, docs, stuff     Desktop, bkup
   4 GB  Archive - doesn't change        Desktop, bkup
         (~ 300 MB / year for 11 years)
   3 GB  Library (prefs,caches,etc)      Desktop
== Metacontent -- changes daily-weekly ==
  12 GB  Photos                          Desktop, Flickr
   3 GB  Mail                            Desktop, GMail, bkup
   ~ MB  iPhone/Addr Book/Calendar       Desktop, iPhone, GCal, Yahoo AddrBk
   ~ MB  this blogger blog               vizsage, Blogger
== Websites ==
   1 GB  website1                        Desktop, website1, bkup
   2 GB  website2                        Desktop, website2, bkup
   6 GB  website3                        Desktop, website3, bkup
== Media ==
many GB  music                           Desktop, some on iPhone, some on DVD
many GB  recorded tv shows, movies, etc  Desktop, data DVD
== System Software ==
some GB  OS & installed programs         on each machine
</pre></p>
<p>I'd like to live in a world where I wouldn't have to worry about how these are partitioned across machines. Changes made to 'website1', say, or to a project in 'now', would lazily propagate to each interested shard as well as to the remote time-machineish versioned backup. At any time I could force an immediate sync, whether to deploy a change, to repair a mistake, or to satiate an OCD twinge, if I don't want to wait for automatic synchronization.</p><p>
I'm actually pretty close to having this out of a MacGyvered patchwork of rsync, svn, time machine, IMAP/Aperture+Flickr and distributed file systems, all enforced by cron. I'm planning to soon waste a weekend buttoning up my sync scripts, getting everything to run daily and being superattentive in case I screw it up. </p><p>
But it sure would be nifty to augment (a cross-platform) Time Machine into a Time and Space Machine. I'd see an overview of my distributed fileverse (versioned in time, distributed in shards according to how I use it), and I could delegate various live realizations, svn/diff-versioned backups or hard-link-versioned backups to each local or remote instance. No single machine would necessarily hold the entire fileverse: note that a few things up there don't propagate back to my main desktop. And hopefully the whole thing would have polished Apple Fit And Finish instead of Mad Max Homebrew Itworksithinkihope.</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-49367907710490232632007-12-13T09:51:00.001-06:002008-01-06T22:01:17.831-06:00Old-School Shop Guide<div style="float: right; margin-left: 10px; margin-bottom: 10px;"> <a href="http://www.flickr.com/photos/mrflip/2108973686/" title="photo sharing"><img src="http://farm3.static.flickr.com/2078/2108973686_88944a07fd_m.jpg" alt="" style="border: solid 2px #000000;" /></a> <br /> <span style="font-size: 0.9em; margin-top: 0px;"> <a href="http://www.flickr.com/photos/mrflip/2108973686/">ShopGuideFront300dpi.png</a></span></div>I rediscovered this super-compact reference-and-tool-and-measuring device while looking for a tool. It is jam-packed with handy information for anyone doing things mechanical or woodworking.
I got this from a family friend of a family friend -- I bought their lathe after her husband, an avid (and skilled) woodworker, had passed away. She wanted his tools to go to someone who would love them and use them, which was me, which I do. The lathe is good, but I've discovered after the fact that the throw-ins were the best part. The chisels are *top notch*, but still pale in comparison to getting "his old woodworking magazines." This turned out to be almost every issue of Fine Woodworking magazine, beginning in its first year of publication; somewhere in the stack was this nifty Shop Guide. <br clear="all" />
<div style="float: right; margin-left: 10px; margin-bottom: 10px;"> <a href="http://www.flickr.com/photos/mrflip/2108971426/" title="photo sharing"><img src="http://farm3.static.flickr.com/2033/2108971426_2f506c2d01_m.jpg" alt="" style="border: solid 2px #000000;" /></a> <br /> <span style="font-size: 0.9em; margin-top: 0px;"> <a href="http://www.flickr.com/photos/mrflip/2108971426/">ShopGuideBack300dpi.png</a></span></div>I think this thing is so neat -- so much information in such a small space. My <a href="http://vizsage.com/other/flipopedia/MechanicalInfo.pdf">own mechanical data reference table</a> (<a href="http://vizsage.com/other/flipopedia">more here</a>) has more numbers but less intrinsic functionality ... What's really neat about this shop guide is how they used the shape of the guide itself as a tool.
Print this onto heavy cardstock and punch brother punch with care... enjoy!<br clear="all" />
fliptag:blogger.com,1999:blog-4201613802871642889.post-86736470174533353252007-12-13T00:33:00.000-06:002008-01-07T15:29:27.974-06:00Leveraging the Bittorrent Underground for semantic data and media<p>I just ran across a pretty interesting site called <a href="http://www.coverbrowser.com">coverbrowser.com</a>, which uses a variety of image APIs to pull in comic book, game, book, music, movie and other cover art. (Read the <a href="http://blogoscoped.com/archive/2006-10-09-n22.html">technical details here</a>).</p><p>
It reminded me of an idea I had a while back but which I will never get around to implementing --- maybe you will, or for all I know someone's already been doing it for years. (Sidenote: I've had some people express interest in this, and have worked out some parts of it, but just don't have the time to complete it right now. If you'd like to help develop it get in touch).</p><p>
Many of the movie and music torrents on the, ahem, "Unauthorized Evaluation Copy" bittorrent sites contain hi-res scans of their cover art, and all of the major bittorrent sites maintain topic-specific RSS feeds. </p><p>
As long as the torrent indexes the files individually (and not as an opaque .zip or .rar) -- and most do index individually -- you can target specific files within the torrent. I don't know whether you could chop all the large-file-size copyright-problematic files that you don't want out of the torrent, or whether you'd have to hack Azureus or another bittorrent client (instructing it to get only *.{png,gif,jpg,jpeg,bmp,tiff,tif} or what have you). Either way, you would then only be pushing out the bandwidth required to grab the photos and not the accompanying multi-megabyte file, and you would only be getting the information to which you presumably have fair use rights. </p><p>
So you'd set up a daemon process that would </p><ul>
<li>watch the Movies and the Music RSS feeds off whichever or all of the sites,
</li><li>identify albums whose cover art you lack,
</li><li>pull in the bittorrent,
</li><li>but download only the cover art
</li><li>and perhaps also process any of the accompanying semantic data</li>
</ul>You might have to get yourself a <a href="http://en.wikipedia.org/wiki/Seedbox">seedbox</a> to make this work, but they're not unaffordable. </p><p>
I think this would lead to a large stream of incoming cover art for music and other media files, complete with a reasonable amount of semantic information.</p><p>
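The "download only the cover art" step boils down to filtering the torrent's file list by extension before asking the client for anything. A minimal Python sketch, assuming you already have the list of file paths from the torrent's index (how you get it, and how you tell the client which files to fetch, depends on the client and is not shown):

```python
from pathlib import PurePosixPath

# Extensions treated as cover art; adjust to taste.
IMAGE_SUFFIXES = {".png", ".gif", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def cover_art_files(paths):
    """Keep only the paths whose extension looks like an image."""
    return [p for p in paths
            if PurePosixPath(p).suffix.lower() in IMAGE_SUFFIXES]
```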
There's probably a lot of other crowdsourced semantic data flowing through the underground, if someone actually created such a torrenting robot. (And yes, I feel yucky using "crowdsourced" and "semantic data" in the same sentence).</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-54928721934690941862007-12-05T16:17:00.000-06:002008-01-07T04:16:38.070-06:00Moving from Perl to Python with XML and TemplatingMr. XKCD is <a href="http://xkcd.com/353/">correct in this</a>. (My friend <a href="http://larssono.com/">Dr. Larsson</a> has been saying this all along).
As I'm moving from data munging to data working-with, I've been moving from perl to python.
Recommended:<ul>
<li><a href="http://codespeak.net/lxml">lxml</a> is a beautiful interface for dealing with XML in Python. You get XPath and validation and namespaces and all that hooha but you don't have to think hard and you don't have to write SAX stream parsers or walk a DOM path. You just say crap like<pre>
from lxml import etree
from urllib.request import urlopen  # Python 3 home of urlopen

# Load file
uri = "http://vizsage.com/apps/baseball/results/parkinfo/parkinfo-all.xml"
parks = etree.ElementTree(file=urlopen(uri))

# for each park (<park> tag anywhere in document)
for park in parks.xpath('//park'):
    # dump its id, time of service and name (@attr is XPath for 'corresponding attribute')
    print(' -- '.join(
        s + ': ' + ','.join(park.xpath('@' + s))
        for s in ('parkID', 'beg', 'end', 'games', 'name')
    ))
</pre>and you get this in return<pre>
parkID: MIL01 -- beg: 1878-05-14 -- end: 1878-09-14 -- games: 25 -- name: Milwaukee Base-Ball Grounds
parkID: MIL02 -- beg: 1884-09-27 -- end: 1885-09-25 -- games: 14 -- name: Wright Street Grounds
parkID: MIL03 -- beg: 1891-09-10 -- end: 1891-10-04 -- games: 20 -- name: Borchert Field
parkID: MIL04 -- beg: 1901-05-03 -- end: 1901-09-12 -- games: 70 -- name: Lloyd Street Grounds
parkID: MIL05 -- beg: 1953-04-14 -- end: 2000-09-28 -- games: 3484 -- name: County Stadium
parkID: MIL06 -- beg: 2001-04-06 -- end: NULL -- games: 486 -- name: Miller Park</pre></li>
<li><a href="http://codespeak.net/lxml/objectify.html">lxml.objectify</a> is the replacement for <a href="http://search.cpan.org/perldoc/XML::Simple">perl</a>'s <a href="http://www.mclean.net.nz/cpan/">XML::Simple</a> we've all been looking for. You just say gimme and it pulls in an XML file as the corresponding do-what-I-mean data structure (identical elements become arrays, tree leaves become atoms, tree structures become maps).</li>
<li><a href="http://www.kid-templating.org/">Kid Templating</a> is a great solution for XML transmogrifying, and I think I like it much better than XSLT. It looks perfect for your "Anything => XML" purposes, which is the hard part. I suppose XSLT can do the "XML => anything" tasks but those always look like stunts; the whole point of XML is that "Turn XML into whatever" tasks are easy, especially given a simple API like lxml or lxml.objectify.</li>
</ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-57144270847292045112007-10-26T18:03:00.001-05:002008-01-07T04:15:54.760-06:00Hourly Weather data for each Retrosheet gameI noticed some suspect entries for game conditions in the eventfiles
and realized I could not only fix it but add a pretty useful dimension
to the retrosheet collection. The National Climatic Data Center makes available "Global Hourly Surface Data" -- several dozen physical and
observational characterizations of the current weather, taken hourly.
This data goes back to the forties and sometimes to the start of the
century.<p>
Please enjoy this preliminary dataset giving the hourly weather data
for each game in Fenway since 1957:
<a href="http://vizsage.com/apps/baseball/results/weather/">http://vizsage.com/apps/baseball/results/weather/</a><p>
(open the WeatherData-BOS07.* file of your choice)
I don't have all the data in hand yet, but I thought I'd get your
thoughts and see if anyone would like to help with some of the drudge
work.<p>
I'm excited about doing some fun things with the data, like see
knuckleball effectiveness vs. humidity or elderly pitchers vs.
temperature. Combined with the MLB gameday pitch trajectory info you could do physics "experiments": show the break distance of all
curveballs vs. atmospheric pressure.<p>
Email me back if you're interested or with comments.<p>
<Pre>
-----------------------
DATA FIELDS AVAILABLE
-----------------------
The fields I've spit out are
-- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag from
the cwgame output.
- temp deg C
The temperature of the air in degrees Celsius.
- press_atmos HPa
The atmospheric pressure at the observation point.
- press_sealvl HPa
The air pressure relative to Mean Sea Level (MSL).
- press_altim HPa
The pressure value to which an aircraft altimeter is set so that it
will indicate the altitude relative to mean sea level of an aircraft
on the ground at the location for which the value was determined.
- press_chg_3hr_del HPa
The absolute value of the quantity of change in atmospheric pressure
measured at the beginning and end of a three hour period.
- press_chg_3hr_obs --
The code that denotes the characteristics of an
ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of
three hours.
- wind_dir deg
The angle, measured in a clockwise direction, between true north and
the direction from which the wind is blowing.
- wind_obs --
The code that denotes the character of the WIND-OBSERVATION.
- wind_speed m/s
The rate of horizontal travel of air past a fixed point.
- wind_gust_speed m/s
The rate of speed of a wind gust.
- cloud_cover_low (frac)
The code that represents the fraction of the celestial dome covered
by all low clouds present. If no low clouds are present; the code
denotes the fraction covered by all middle level clouds present.
- vis_dist m
The horizontal distance at which an object can be seen and identified.
- sunshine_time min
The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3 --
The code that denotes a specific type of weather observed manually.
- groundcond --
The code that denotes a type of Ground condition
- precip_hist_contin bool
The code that denotes whether precipitation is continuous (true) or
intermittent (false).
- precip_lq1_depth mm
The depth of LIQUID-PRECIPITATION that is measured at the time of an
observation. Unit:Millimeters
- precip_lq1_period hours
The quantity of time over which the LIQUID-PRECIPITATION was measured.
</pre>
----------
WHAT I DID
----------<p>
I used Brian Foy's Google Earth index of Major League Stadiums:<br/>
<a href="http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml">http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml</a>
and the NCDC ISH-HISTORY file (gives locations for each weather station)
<a href="ftp://ftp.ncdc.noaa.gov/pub/data/inventories/">ftp://ftp.ncdc.noaa.gov/pub/data/inventories/</a><br/>
to find the closest station with continuous data. (Turns out I could
have saved a ton of trouble by just using the nearest airport -- in
almost every case it was the best match.)<p>
Then I pulled down data sets from
<a href="http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505">http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505</a>
(If you're interested in replicating any of this I have a script that
sends a GET url to help automate the weather data collection.) The
last step is to match games with stadiums with locations, and dates
and times with hourly observations.<p>
I could be clever and subtle and use the start time and game duration
to grab only the hours of gameplay, but instead I just pull in the
records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm
for night games. I suppose I'll fix it to see if a game overhangs
midnight and get the post-12am data for those only.<p>
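The crude time window above is easy to state in code. A minimal Python sketch, assuming each hourly observation has already been parsed into a datetime in the stadium's local time (in_game_window is a made-up name; games that run past midnight are not handled, matching the current behavior):

```python
from datetime import date, datetime, time

DAY_START = time(10, 0)    # day games: 10:00am through 11:59pm
NIGHT_START = time(17, 0)  # night games: 5:00pm through 11:59pm


def in_game_window(obs, game_date, day_game):
    """True if the observation falls in the window kept for this game."""
    start = DAY_START if day_game else NIGHT_START
    # The window always runs to the end of the day, so a date match
    # plus a start-time check is enough.
    return obs.date() == game_date and obs.time() >= start
```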
-----------------------
WHAT YOU CAN DO TO HELP
-----------------------<p>
Geolocation for the rest of the stadiums<p>
Inspect the data for consistency and correctness<p>
If you have access to a computer at a .edu or .k12.us, or fancy GIS
data, help me grab the rest of the weather files.<p>
Email me if you'd like to help.<p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-18430189271719370512007-10-26T18:01:00.000-05:002008-01-07T04:14:43.336-06:00Retrosheet Eventfile Inconsistencies II<p>I've found a few more inconsistencies and minor inaccuracies in the
retrosheet event files and game logs.</p><p>
I made a diff
(applied using the 'patch' tool) to mechanically recreate these
corrections:
<a href="http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff">http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff</a></p><p>
I pulled these out by whipping up a few simple scripts (one-liners,
mostly) that extract all unique values for each event file field.
For example, the only values for the "info,pitches" field are 'count',
'none' and 'pitches' -- just as promised in the documentation. The
"info,temp" field, however, has not only normal temperatures ("78", or
"104", or "0" for [unknown]) but also spurious values of '670' and
'700' (wrong), '8/7' (ill-formed) and '' (differs with the format
documentation).</p><p>
I'll be posting all the dubious entries (event files version 2007 Sep 23)
I find at
http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html
as comments.</p><p>
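The unique-values check described above can be sketched in a few lines of Python. This is not the original one-liners, just the same idea: gather every distinct value seen for each "info" field, then flag temperatures that are not plain integers in a plausible range. The 0 to 120 range is an assumption (0 being the documented "unknown" marker), and the function names are made up.

```python
import csv
from collections import defaultdict


def unique_info_values(lines):
    """Map each info field name to the set of values seen for it."""
    values = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) >= 3 and row[0] == "info":
            values[row[1]].add(row[2])
    return values


def suspicious_temps(temps, low=0, high=120):
    """Return values that are blank, non-numeric or outside [low, high]."""
    return {t for t in temps
            if not t.isdigit() or not low <= int(t) <= high}
```

Feeding it the lines of an event file (any iterable of strings works) gives a quick inventory of what each field actually contains.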
==================== Incorrect Data ====================</p><p>
<pre>
In 1993MIL.EVA:
info,start,spieb001,"Bim| Spiers",1,9,4
should be
info,start,spieb001,"Bill Spiers",1,9,4
These temperatures need fixing:
1988MON.EVN,info,temp,670
1988MON.EVN,info,temp,700
1964NYA.EVA,info,temp,8/7
I looked at a few suspiciously short games (< 60 minutes):
This should be 1:58, according to the NYT box score:
http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C8585F9
1958BOS.EVA,info,timeofgame,58
These two are correct:
1971BAL.EVA,info,timeofgame,48 BAL197107300 -- Game called due to rain
1976BOS.EVA,info,timeofgame,57 BOS197609100 -- Game called due to rain
Another thing to look at would be suspicious game length/number of
outs ratio, but I haven't done this yet.
I also checked a few games with attendance below 1000, but these seem
to be very cold or rescheduled days. I'll take a peek sometime soon at
"game attendance more than two and a half standard deviations below
that year's average attendance" to see what sticks out. (I also
peeked at 2.5+ above -- those look like bandwagon games.)
</pre>
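The attendance check mentioned above is a plain z-score filter. A minimal Python sketch, assuming you have one season's attendance as a dict from game ID to headcount (the function name and the choice of population standard deviation are my assumptions):

```python
from statistics import mean, pstdev


def attendance_outliers(attendance_by_game, threshold=2.5):
    """Return {game_id: z_score} for games at least `threshold`
    standard deviations above or below the season average."""
    counts = list(attendance_by_game.values())
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return {}  # every game had the same attendance
    scores = {g: (n - mu) / sigma for g, n in attendance_by_game.items()}
    return {g: z for g, z in scores.items() if abs(z) >= threshold}
```

Negative scores are the suspiciously empty games; positive ones are the packed houses.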
<p>
==================== Badly Formatted ====================</p>
<pre>
These are probably correct but just ill-formatted:
1959CHN.EVN,info,timeofgame,0158
2001PIT.EVN,info,attendance, 34915
1962BOS.EVA,info,daynight,day,
1966ATL.EVN,info,howscored,"park"
1966HOU.EVN,info,howscored,"park"
1970CHA.EVA:data,er,roung101,4#
1958PIT.EVN:data,er,wills102,1y
In these files, the "howscored" field is spelled "howentered":
1990BOS.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990HOU.EVN,info,howentered,game
1990HOU.EVN,info,howentered,game
1990LAN.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990PIT.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SLN.EVN,info,howentered,game
1990TEX.EVA,info,howentered,game
1990TEX.EVA,info,howentered,game
There are no "info,edittime" records -- is this purposeful?
</pre>
<p>==================== Inconsistent with Documentation ====================</p>
<pre>
In the 2003TBA.EVA file, the umpires are given by name and not by ID.
These are supposed to use 0 as the unknown value but in a few places
use a blank.
1990NYA.EVA,info,temp,
1978ATL.EVN,info,attendance,
1978NYA.EVA,info,attendance,
1979SDN.EVN,info,attendance,
2000PIT.EVN:info,windspeed,
There are some "info,ump[...],(None)" fields, and there are some
"info,ump[...]," fields. Does one indicate "unknown" and the other
indicate "none"? Or is this a formatting inconsistency?
These files have a bunch of "info,windspeed,unknown" fields (the dox
say "An unknown windspeed is indicated by -1."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN
1970ATL.EVN 1970HOU.EVN
These files have an "info,temp,unknown" field (the dox say "An unknown
temp is indicated by 0."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN 1970ATL.EVN
1970HOU.EVN 1990NYA.EVA
These lines have trailing spaces, which is harmless but still
shouldn't be there:
1958CHA.EVA:info,save,
1957BOS.EVA:com,"xwas a lot of action. Had this game been played
today, it no doubt"
1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
1958CLE.EVA:com,"$ Strong wind to left; cool"
1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
1958NYA.EVA:com,"$Total attendance: 13323"
1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"
Here are all the well-formed windspeed values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87
What are the units on these? If this is in MPH, 39 is Gale force
("Difficult to walk against wind. Twigs and small branches blown off
trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64
is Violent Storm ("Widespread damage").
Here are games with windspeeds of 40 or more:
id,CHA197408270|windspeed,67
id,MIN198008190|windspeed,87
id,TOR198208030|windspeed,68
id,CHN198307042|windspeed,74
id,TOR198307270|windspeed,87
id,LAN199006050|windspeed,78
id,DET199506160|windspeed,87
id,CLE199609141|windspeed,69
id,COL199606150|windspeed,59
id,DET199704300|windspeed,66
id,TEX200104220|windspeed,40
id,SLN200610010|windspeed,60
</pre>
<p>
The SLN200610010 event file gives a wind speed of 60mph (from
baseball-reference and ESPN),
but a) that's crazy and b) the weather report from that day doesn't
confirm it:</p><p>
http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA
Which gives 83F, 9mph SSW wind, clear</p><p>
See also my next message, about getting weather data for each game.</p><p>
The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other
value is the wind speed" but I think it should be "WindSpeed: -1
Unknown other value is the wind speed in miles per hour".</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-37106015850605802202007-10-26T04:33:00.000-05:002007-10-26T04:55:51.615-05:00The Asdrubal Carrera Hall of Fame<p>Inspired by one of <a href="http://awfulannouncing.blogspot.com/">Tim McCarver</a>'s <a href="http://shutuptimmccarver.com/">flights of fancy</a> during the ALCS, I present <a href="http://vizsage.com/apps/baseball/results/UniqueFirstNames.xml">The Asdrubal Carrera Hall of Fame</a>, open to anyone in unique possession of a particular first name among Major League baseball players: <a href="http://vizsage.com/apps/baseball/results/UniqueFirstNames.xml">LIST</a>.</p>
<p>You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith?</p>
<p>For some dinnertime fun over the holidays, discuss the relative merits of Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, and Buttercup Dickerson. (Unfortunately, 12 other players keep Rusty Kuntz off this list.)</p>
<p>Other stunningly yclept combatants include Ambiorix Burgos, Alamazoo Jennings, Welcome Gaston, Chicken Hawks, Sixto Lezcano, Wheezer Dell, Yam Yaryan, Yo-Yo Davalillo, Admiral Schlei, Boss Schmidt, Brick Smith, Brickyard Kennedy, Broadway Jones, Cannonball Titcomb, Baby Doll Jacobson, Sweetbreads Bailey, Zaza Harvey, Bubbles Hargrave, Pickles Dillhoefer, Double Joe Dwyer, Cowboy Jones, Coot Veal, Mul Holland, Live Oak Taylor, Skyrocket Smith, Kaiser Wilhelm, Kewpie Pennington, Possum Whitted, Snooks Dowd, and Mox McQuery.</p>
<p>See <a href="http://vizsage.com/apps/baseball/results/UniqueFirstNames.xml">the list</a> for links to each player's Baseball Reference page. Nerds may additionally view the generating MySQL query <a href="http://vizsage.com/apps/baseball/results/UniqueFirstNames.sql.txt">here</a>.</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-52983381029712720162007-10-17T17:01:00.000-05:002008-01-07T04:11:45.560-06:00Retrosheet Eventfile InconsistenciesHere are a few inconsistent records in the <a href="http://retrosheet.org">retrosheet.org</a> event files of 2007 Sep 23. I'm using chadwick and not the retrosheet DOS utils, but I think I've traced all of these back to the original event files.
Weird Attendance in gamelog GL1941.TXT:<pre> WS1194107220 (WS1 vs DET) has '1500 e' as its attendance</pre>Weird Start Time in eventfiles:
Many start times lack an AM or PM. I assume start times map to 24-hour times as follows:<pre> daynight  start_time  24hr Time
 D or N    0           Unknown
 D         1000..1259  1000h to 1259h
 D         100..459    1300h to 1659h
 N         500..1150   1700h to 2350h</pre> In that case, here are some weird start times reported by cwgame:<pre> - Negative start time:
2003 D 0 -195 SEA 2003 04 15 SEA200304150 info,starttime,-2:05PM info,daynight,day
- No daynight flag:
1998 D 0 506 LAN 1998 08 30 LAN199808300 info,starttime,5:06 -- no daynight --
- Plainly inconsistent daynight flag:
1985 D 1 605 CIN 1985 06 21 CIN198506211 info,starttime,6:05PM info,daynight,day
1960 N 0 135 BOS 1960 04 19 BOS196004190 info,starttime,1:35PM info,daynight,night
- Second half of a double header, listed as a day game despite 5pm or later start:
1966 D 2 507 BAL 1966 10 02 BAL196610022 info,starttime,5:07PM info,daynight,day
2001 D 2 500 PHI 2001 05 27 PHI200105272 info,starttime,5:00PM info,daynight,day
2001 D 2 519 PIT 2001 06 03 PIT200106032 info,starttime,5:19PM info,daynight,day
2001 D 2 625 MIN 2001 05 26 MIN200105262 info,starttime,6:25PM info,daynight,day
2001 D 2 719 CHA 2001 09 04 CHA200109042 info,starttime,7:19PM info,daynight,day
2001 D 2 738 CHN 2001 08 20 CHN200108202 info,starttime,7:38PM info,daynight,day
2001 D 2 752 PIT 2001 09 03 PIT200109032 info,starttime,7:52PM info,daynight,day
2001 D 2 753 SLN 2001 08 03 SLN200108032 info,starttime,7:53PM info,daynight,day
- Start times that appear to be after midnight (this could be correct):
1996 N 1 35 CIN 1996 06 25 CIN199606251 info,starttime,0:35 info,daynight,night
1998 N 0 105 LAN 1998 06 13 LAN199806130 info,starttime,1:05 info,daynight,night
1966 N 2 1207 BAL 1966 06 08 BAL196606082 info,starttime,12:07AM info,daynight,night
</pre>These eventfile games have more than one "info,daynight" record<pre> ATL197004150 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197004160 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197005260 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197006191 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197006192 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197006200 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197006210 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197007031 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197007032 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197007050 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009220 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009230 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009240 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009250 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009260 info,starttime,0:00PM info,daynight,day info,daynight,night
ATL197009270 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197006220 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197008031 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197008032 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197008040 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197009010 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197009110 info,starttime,0:00PM info,daynight,day info,daynight,night
HOU197009130 info,starttime,0:00PM info,daynight,day info,daynight,night</pre>This eventfile game is missing an "info,daynight" record:<pre> LAN199808300 info,starttime,5:06</pre>File Structure in eventfile 2001HOU.EVN:<pre> 2001HOU.EVN lacks a trailing newline (unix commands hate this).</pre>Here are the unix commands I used to dump all that info. Sorry for the one-linerism.<pre># How many have a negative starttime?
grep 'info,starttime,-' *.EV*
# How many have missing or extra "info,daynight" fields?
# -- pull out the info, daynight and starttime records in order
# -- slurp the whole file as one giant string with internal linebreaks;
# -- split each stretch following an id,XXXX record into one line
# -- dump lines that have none or more than one daynight record
cat *.EV* | egrep '^(id,|info,daynight|info,starttime)' | \
perl -e '$_ = join(" ",<>); s/[\r\n]+/!!!/g; @games= (split /id,/, $_);
shift @games;
for $game (@games) {
$game =~ s/!!!/\t/g; print "$game\n" if (($game !~ m/daynight/) || ($game =~ m/daynight.*daynight/));
}'
# How many have a start_time and daynight_flag that disagree?
# -- use cwgame to pull off the gameID,start_time,daynight_flag records;
# put it into a temporary file
# -- Use a big stupid regex to find
# . start_time that is > 500 and marked day
# . start_time that is < 500 and marked night
# . start_time that is > 1200 and marked night
# . start_time that is < 100
# . start_time that is negative
( for ((year=1957;$year<=2006;year++)) ; do \
for teamfile in ${year}*.[Ee][Vv]* ; do \
cwgame -y $year -f '0-0,4-4,6-6' $teamfile 2>/dev/null ; \
done; \
done ) > /tmp/starttimeIDs.txt
cat /tmp/starttimeIDs.txt | \
perl -ne '(m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",(12\d\d|[1234]\d\d|\d\d|[1-9]|-\d+),"(N)"/ ||
m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",((?:5|6|7)\d\d|.*-.*|\d\d|[1-9]),"(D)"/) &&
printf "%s %s %5d %s %s %s %s\n", $7, $5, $6, $1, $2, $3, $4;' | sort
</pre>
fliptag:blogger.com,1999:blog-4201613802871642889.post-87962295946045583292007-09-07T16:02:00.000-05:002007-09-07T17:13:21.544-05:00Rules of thumb for Rack Leave in ScrabbleThis isn't exactly within the ambit of this blog but at least it's about data.
While I should have been doing work, I instead made an awesome
spreadsheet to find rules of thumb for what the best Scrabble rack
leaves are. (rules of thumb below, tables here: <a href="http://vizsage.com/other/scrabble/RackLeaveRules.html">http://vizsage.com/other/scrabble/RackLeaveRules.html</a>)
The computer program <a href="http://web.mit.edu/jasonkb/www/quackle/">Quackle</a> is one of the strongest scrabble players in the world. It uses the following 'Superleave' valuation: <ul> <li><a href="http://web.mit.edu/~jasonkb/Public/scrabble/superleaves">http://web.mit.edu/~jasonkb/Public/scrabble/superleaves</a><br/>(warning: huge-assed file)</li> <li><a href="http://vizsage.com/other/data/superleaves.xls">http://vizsage.com/other/data/superleaves.xls</a><br/>(The above, sorted by value, only up-to-4 leaves)</li></ul>
To find "synergies" and "anti-synergies" (dysphoria?), I calculated the marginal valuation for each combo: the joint valuation minus the sum of the valuations of its parts. Basically, how much of the value for each two-leave is explained by the valuation of the component one-leaves, etc? For example, <br/> - From S (7.35) and M (0.08), the joint valuation of MS is 7.44, a marginal gain of 0.01: the joint valuation is almost entirely from S&M. (<-- will lead to interesting google hits). This combination has no synergy.<br/> - From Q (-9.0) and U (-5.1), the joint valuation of QU is 0.2, a marginal gain of 14.3. This is by far the largest synergy; next is ZO at +3.2.<p>
I also played with three-letter synergies -- 3-leave valuations marginally different from the most explanatory 2-leave.
General Lessons:<ul><li>Get a feel for the 1-leave list, and then learn these:<ul> <li> Synergy:
QU OZ JU CH GN WY IN DE JK ER EV
GIN JKY JKU ERS KWY HWY ?IN EST JOW ?AL ?EL ?IL IST</li><li>Anti-Synergy:
BP CG FP MV PV CW CQ QS SX LQ
BV SZ QR BC CZ VZ MQ RX GQ + most things with blank
BPV CGQ BCG LQR FPV LNQ SVZ CMQ CLQ BCV BNV KTV
LMQ GKT CFV GMQ FSV LNR DGT</li></ul></li><li> Worth keeping with a blank:
The letters in "Lei an orc DTM" + the following digrams
IN AL IL EL CI AN ER EN AC AR IT NO
QU ET DE CO AT OR LO GN OT AM DI CE
IM IR DO MO GI AB AG</li><li>Double letters are bad (duh), except FF, which is good.</li></ul>See the spreadsheet at <a href="http://vizsage.com/other/data/superleaves.xls">http://vizsage.com/other/data/superleaves.xls</a>. Don't go betting the house on these results....</p>
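<p>The marginal-valuation arithmetic can be sketched in a few lines. This is only an illustration using the rounded values quoted in this post (the full Quackle table carries more digits), and the function name <code>synergy</code> is mine, not Quackle's.</p>

```python
def synergy(joint, *singles):
    """Marginal valuation: joint leave value minus the sum of its parts."""
    return round(joint - sum(singles), 2)


# Values as quoted in the post (rounded).
print("MS:", synergy(7.44, 7.35, 0.08))   # S and M: essentially no synergy
print("QU:", synergy(0.2, -9.0, -5.1))    # Q and U: large positive synergy
```

<p>A result near zero means the combination is explained by its component letters; a large positive or negative result is a synergy or anti-synergy worth memorizing.</p>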
Tables (including 1-tile-leave values) are available <a href="http://vizsage.com/other/scrabble/RackLeaveRules.html">here</a>.
fliptag:blogger.com,1999:blog-4201613802871642889.post-91781604235121006902007-09-01T02:49:00.000-05:002007-09-04T13:46:19.078-05:00as3mathlib (formerly WIS math libraries)<p>I've just imported the WIS mathematics library -- an excellent collection of mathematics routines -- onto <a href="http://code.google.com/p/as3mathlib/">Google Code</a>. (You'll find the <a href="http://members.shaw.ca/flashprogramming/wisASLibrary/wis/index.html">Actionscript 2 version of the library</a> at its original site)</p><p>This library carries a BSD-ish license and includes support for
</p><ul><li>Geometric Objects and Intersection calculations</li><li> Integral and Differential equation calculations</li><li>Bezier, Quadric, Polynomial, Complex, Vector and Matrix calculations</li><li> Symbolic expression parsing </li></ul><p>I'm converting the library to Actionscript 3 from Actionscript 2 as time and necessity allow. (That's converting as in getting it to work, and converting as in getting it to be object/pattern oriented). Right now it builds without errors and with only a few warnings, but I haven't applied any of the unit tests or checked it for correctness or compatibility.</p><p>If you see the value of updating this well-thought-out collection of functions, please get in touch and I will add you as a developer. The code is quite modular: it will be straightforward to take modest chunks and get them working independently. I wrote to the original author and maintainer, who responded "By all means, continue in the evolution/integration of my library to support AS3" -- but please let me know of any other efforts to update this code, or if a similar or superior math library exists, so that I don't waste my time :).
</p><p>Email me [flip at the mrflip with the dot and the com] or comment on this post if you'd like to pitch in!</p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-7745739652066182252007-08-27T10:06:00.000-05:002008-01-07T04:11:00.703-06:00Subway Geography and GeometryI've written an applet that lets you <a href="http://vizsage.com/apps/subzero/">reimagine the geography of the greater Washington, DC</a> area with "distance" measured by subway travel time, by subway travel cost, or as the standard clarified subway wall map would deform it.
<img src="http://vizsage.com/apps/subzero/assets/screenshots/time-kingst-700.jpg" />
This was in large part inspired by Oskar Karlin's beautifully rendered <a href="http://www.oskarlin.com/2005/11/29/time-travel">Isochronic Elephant-Castle map of the London Underground</a> and the <a href="http://www.tom-carden.co.uk/p5/tube_map_travel_times/applet/">interactive tube mapplet</a> from Tom Carden.
<a href="http://www.fakeisthenewreal.org/subway/" rel="nofollow">Subway Maps of the world all on the same scale</a> is pretty interesting, as is this directory of <a href="http://ni.chol.as/media/sillytube.html" rel="nofollow">remixed London Underground maps</a>. There are a few interesting images on Wikimedia Commons, like <a href="http://commons.wikimedia.org/wiki/Image:NYC_subway_simplified_map.png">this geographical map</a> within this <a href="http://commons.wikimedia.org/wiki/New_York_City_Subway">gallery</a>.
Also, you can download the image files (very large, and registered with each other) from Wikimedia Commons:
<ul> <li><a href="http://commons.wikimedia.org/wiki/Image:DC_Area_Road_Map_With_FontSubset.svg">Greater Washington, DC Area: Road Map</a></li> <li><a href="http://commons.wikimedia.org/wiki/Image:WashingtonDCTopoMap.jpg">Greater Washington, DC Area: Topographic Map</a></li> <li><a href="http://commons.wikimedia.org/wiki/Image:WashingtonDCAerialPhoto_2590x2000.jpg">Greater Washington, DC Area: Aerial Photo</a></li> </ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-9951169068460193402007-08-24T15:08:00.000-05:002007-08-25T03:57:45.290-05:00Patches to the AS3 Cookbook Code<p>
The <a href="http://www.oreilly.com/catalog/actscpt3ckbk/">Actionscript 3 Cookbook</a> is a very helpful reference, and the example code that came with it has many good examples. Unfortunately there's a modicum of bitrot in the code: compiler warnings and errors when compiling under strict mode.
</p><p>Here is a <a href="http://www.blogger.com/files/AS3CB-patch-2007-08-24.diff">patch</a> against the version of the code I downloaded on 2007-08-24:</p><blockquote><a href="http://vizsage.com/blog/files/AS3CB-patch-2007-08-24.diff">files/AS3CB-patch-2007-08-24.diff</a></blockquote><p></p>Apply it <a href="http://vizsage.com/blog/2007/08/how-to-make-patch-using-diff.html">thusly</a>.
<p></p>
fliptag:blogger.com,1999:blog-4201613802871642889.post-47735205129623517712007-08-24T10:44:00.000-05:002007-09-01T02:31:42.001-05:00How to make a patch using diffI always forget the command to use, and google is strangely devoid of helpful/correct advice. Therefore I'm posting this for my own and future generations' reference.
<h3>How to Generate a Patch from Standalone Code</h3>For non-(svn|cvs),
<ul><li>You should have a directory holding the original source (e.g. "dir-orig") and a directory holding the modified source (e.g. "dir").</li><li>Obviously, don't modify dir-orig (that is, it should match the author's). If you don't trust yourself, do a <span style="font-family:monospace;">chmod -R a-w dir-orig</span> to recursively mark the directory read-only.</li><li>Generate the patch by going to the parent directory (holding <code>dir-orig </code>and <code> dir</code>) and running the command<blockquote><code>diff -Nuwr dir-orig dir > /tmp/my-happy-patch.diff</code></blockquote><ul><li>dir and dir-orig should be paths to the dirs in question, obviously.</li><li><code>-N</code> includes newly added files in the patch (treats absent files as empty files)</li><li><code>-u</code> creates a "unified" diff -- it's human-readable and works well with patch</li><li><code>-w</code> ignores whitespace, which is polite if your (clearly superior) formatting policy differs from the original.</li><li><code>-r</code> recursively descends the source tree.</li></ul></li><li>Other helpful options:</li><ul><li><code>-p</code> (applied to C or C++ code) shows the function the new code appears in. Only use this with C or C++ code (i.e. YMMV)
</li><li><code>-u6</code> (or any number following the -u) gives that many lines of context (the default is 3, which should be fine for code that isn't changing like the star of a Tootsie stage performance)</li><li><code>-x ".??*"</code> ignores .DS_Store, Eclipse and other hidden-file turds, if you have those. Emacsen should add <code>-x "*~"</code>.</li></ul><li>Sanity-check the change
<code>less /tmp/my-happy-patch.diff</code></li></ul><h3>How to Generate a Patch from Subversion or CVS</h3>To generate a patch from cvs or svn, (advice horked from <a href="http://www.xsmiles.org/participate.html">X-Smiles.org</a>):
<ul><li>Make sure you are synchronized with the latest sources:
<code>$ cvs update src</code>
(or wherever your changes are; use a directory that spans all the changed modules or the trunk directory.)
</li><li>Sanity-check the change:
<code>$ cvs diff src</code>
</li><li>Generate the patch (replace -Nuw with whatever you decided works for you from the options above; svn already recurses, so -r isn't needed).
<code>$ svn diff --diff-cmd=diff -x-Nuw src > /tmp/my-happy-patch.diff</code></li><li>(Rather than include <code>-x ""</code> args, you should be adding turd files to your <code>~/.subversion/config</code>, for instance
<code>global-ignores = *.o *.lo *.la #*# .*.rej *.rej .*~ *~ .#* .DS_Store</code>
or however you pronounce that in a <code>~/.cvsrc</code> or <code>.cvsignore</code>.)</li><li>Sanity-check the patch
<code>less /tmp/my-happy-patch.diff</code></li></ul>(By the way, if your <a href="http://www.google.com/search?hl=en&q=windows+sucks&btnG=Search">broken OS</a> lacks a command line, you might be able to <a href="http://cygwin.com/">add one</a>).
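<p>If you have never looked inside a unified diff, here is a small sketch of what the <code>-u</code> format looks like. It uses Python's standard <code>difflib</code> module rather than the <code>diff</code> command itself, and the file names and contents are made up for illustration.</p>

```python
import difflib

# Made-up "before" and "after" contents of a single file.
original = ["alpha\n", "beta\n", "gamma\n"]
modified = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]

# n=3 asks for three lines of context, the same default that diff -u uses.
patch = list(difflib.unified_diff(
    original,
    modified,
    fromfile="dir-orig/example.txt",
    tofile="dir/example.txt",
    n=3,
))

print("".join(patch), end="")
```

<p>Lines starting with <code>-</code> come from the original, lines starting with <code>+</code> from the modified copy, and unmarked lines are the surrounding context that lets <code>patch</code> find the right spot.</p>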
<h3>How to use a Patch</h3>To apply such a patch,
<ul><li>Download the patch and save it somewhere intelligent.</li><li>If you're <span style="font-style: italic;">not</span> using version control, <span style="font-weight: bold;">!!make a copy of the source tree!!</span>
</li><li>Change directories <span style="font-style: italic;">into</span> that new folder --the one holding the unmolested (or svn trunk) code.</li><li>Sanity check: run the command
<code>cat </code><code>/tmp/my-happy-patch.diff | </code><code>patch -p1 --dry-run
</code></li><li>If you see results like
<code>patching file foo/bar/my-happy-file.as</code>
...
you're good to go:
<code>cat </code><code>/tmp/my-happy-patch.diff | </code><code>patch -p1</code></li><li>Pitfalls:</li><ul><li>If you get a patch taken from <span style="font-style: italic;">within</span> the modified directory, change -p1 to -p0.</li><li>If you get a patch with the original and modified dirs reversed, add a --reverse flag.
</li></ul></ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-56183345013073360302007-08-20T12:09:00.000-05:002007-08-24T14:39:32.990-05:00Flex Demo: Matrix Math (and an error in the Actionscript docs)I'm working on something that uses (an algorithm similar to) texture mapping, for which I want to precalculate the .invert() of a whole bunch of .transform.matrix objects. I'll post something about that in the next coupla days.
Meanwhile, I found something perplexing in the Actionscript documentation. It's possible that I am just a dope, so please point out any error in my reasoning.
As you may know, you can represent any arbitrary combination of 2-D scalings, skews, rotations and translations using standard matrix operations. The <a href="http://livedocs.adobe.com/flex/2/langref/flash/geom/Matrix.html">Actionscript docs for the Matrix class</a> mention this in passing, but have elements .b and .c switched: it should be
<blockquote><img style="width: 72px; height: 65px;" src="http://vizsage.com/demos/matrixmathdemo/theory/genericMatrix.png" alt="right: [ [a,c,tx] [b,d,ty] [0,0,1] ]" border="0" /> and not <img style="width: 72px; height: 65px;" src="http://vizsage.com/demos/matrixmathdemo/theory/genericMatrix-wrong.png" alt="wrong: [ [a,b,tx] [c,d,ty] [0,0,1] ]" border="0" />.</blockquote>
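<p>To make the layout concrete, here is a small sketch in plain Python (not ActionScript) that multiplies the matrix, laid out the way this post argues is correct, by the column vector (x, y, 1). The helper names <code>as_rows</code> and <code>apply_matrix</code> are made up for illustration; they are not part of the Flash API.</p>

```python
def as_rows(a, b, c, d, tx, ty):
    """Lay out the six coefficients as [ [a,c,tx] [b,d,ty] [0,0,1] ]."""
    return [[a, c, tx],
            [b, d, ty],
            [0, 0, 1]]


def apply_matrix(matrix, x, y):
    """Multiply the 3x3 matrix by the column vector (x, y, 1)."""
    vec = (x, y, 1)
    out = [sum(row[i] * vec[i] for i in range(3)) for row in matrix]
    return out[0], out[1]


# Distinct primes make it easy to see which coefficient lands where.
m = as_rows(a=2, b=3, c=5, d=7, tx=11, ty=13)
print(apply_matrix(m, 1, 1))  # x' = a*x + c*y + tx, y' = b*x + d*y + ty
```

<p>With this layout, x' = a*x + c*y + tx and y' = b*x + d*y + ty. Swapping b and c in the layout would swap which coefficient multiplies y in x' and which multiplies x in y'.</p>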
I whipped up a <a href="http://vizsage.com/demos/matrixmathdemo/MatrixMathDemo.html">MatrixMathDemo</a> in flex to demonstrate the issue:
<ul><li><a href="http://vizsage.com/demos/matrixmathdemo/MatrixMathDemo.html">Demo</a></li><li><a href="http://vizsage.com/demos/matrixmathdemo/srcview/index.html">Source</a></li><li><a href="http://vizsage.com/demos/matrixmathdemo/docs/index.html">Docs</a></li></ul>and a writeup on <a href="http://vizsage.com/demos/matrixmathdemo/theory/MatrixMath.pdf">Mathematical matrices and the actionscript Matrix transformations</a> [PDF]. The 2nd, 3rd and 4th tabs compare the Matrix methods concat(), invert() and (deltaT/t)ransformPoint() respectively with an explicit calculation of the corresponding Matrix operation: they show that in fact the documentation has <span style="font-style: italic;">b</span> and<span style="font-style: italic;"> c</span> switched. (The code is <a href="http://vizsage.com/license/Visage-Deed-BY.html">free to reuse or modify</a> (but give credit) in case that's useful.)
Some references:
<ul><li>This <a href="http://www.senocular.com/flash/tutorials/transformmatrix/">flash-specific tutorial at senocular.com</a> is good, though the matrices are transposed from what is typically presented.
</li><li>The posts on the flashcoders list <a href="http://www.mail-archive.com/flashcoders@chattyfig.figleaf.com/msg11331.html">A little matrix.invert() mystery</a> and <a href="http://www.mail-archive.com/flashcoders@chattyfig.figleaf.com/msg11401.html">followup</a> are explained by <span style="font-style: italic;">b</span> and<span style="font-style: italic;"> c</span> being switched; they link to <a href="http://kiroukou.media-box.net/blog/mes-recherches-sur-flash/62-classe-matrix-de-flash8-eronnee.html">a post on a French blog</a> that is helpful if you parlez.
</li><li>To brush up on matrix mathematics, please see the <a href="http://mccammon.ucsd.edu/%7Eadcock/matrixfaq.html#Q41">Matrix and Quaternion FAQ</a> or <a href="http://en.wikipedia.org/wiki/Matrix_%28mathematics%29">Wikipedia</a>. </li></ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-11726390745957252672007-07-25T04:51:00.000-05:002007-08-24T14:33:47.087-05:00Emacs modes for Flex<ul><li>Emacs modes for Flex:</li><ul><li>XML:
<a href="http://www.oreillynet.com/mac/blog/2003/10/a_new_xmlediting_mode_for_emac.html">nXML-mode for Emacs from James Clark</a>
<a href="http://www.ibm.com/developerworks/xml/library/x-emacs/">Using Emacs for XML documents</a>
</li><li>Actionscript:
<a href="http://blog.pettomato.com/?cat=7">actionscript-mode.el</a> for editing actionscript files in emacs.
</li><li>At least right now it seems you want this <a href="http://www.thaiopensource.com/download/nxml-mode-20041004.tar.gz">xml mode</a> and this <a href="http://blog.pettomato.com/content/actionscript-mode.el">actionscript-mode.el</a>.</li><li>Then, add
<blockquote><pre>(setq auto-mode-alist (append (list
'("\\.as\\'" . actionscript-mode)
'("\\.\\(xml\\|xsl\\|rng\\|xhtml\\|mxml\\)\\'" . nxml-mode)
;; add more modes here
) auto-mode-alist))
;;
;; ------------------ Magic for XML Mode ----------------
;;
;; add-hook appends to the hook instead of clobbering it the way setq would
(add-hook 'nxml-mode-hook
          (lambda ()
            (setq tab-width 2
                  indent-tabs-mode nil)
            (setq nxml-child-indent 2
                  nxml-attribute-indent 2)))
</pre></blockquote></li><li>You can use <code> M-x customize-group RET nxml-highlighting-faces RET</code> to fix your colors the way you like 'em.
</li></ul><li>Setting up asdoc to work within Flex Builder:</li><ul><li><a href="http://www.peterelst.com/blog/2006/09/03/flex-builder-2-ant-support/">First, install ant support</a> (Ant is <a href="http://ant.apache.org/">an offshoot of apache</a> and is like Makefile only more betterer.)</li><li>Then set up a <a href="http://blog.bittube.com/2006/08/15/ant-buildxml-for-asdocs-generation/">build.xml</a> in your docs/ directory to build the documentation set.
</li><li>I had to modify mine a bit: I added
<code><property name="Templates.dir" location="${FlexSDK.dir}/asdoc/templates/"/>
<arg line='-templates-path ${Templates.dir}'/>
</code></li><li>I also linked the flex home to a no-funny-characters dir:
<pre> ln -s "/Applications/Applications/Adobe Flex Builder 2" /work/ProgramStores/Flex
cd /work/ProgramStores/Flex
ln -s "Flex SDK 2" FlexSDK</pre>Then I exported the location for the asdoc file:
<pre> export FLEX_HOME=/work/ProgramStores/Flex/FlexSDK</pre> or else I got the error message:
<pre> Exception in thread "main" java.lang.NoClassDefFoundError: Flex</pre>
</li></ul></ul>
fliptag:blogger.com,1999:blog-4201613802871642889.post-60390943620991345862007-07-24T00:54:00.000-05:002007-08-25T03:53:30.248-05:00Adobe Flex and Custom Namespace / manifest.xmlCreate a file "manifest.xml" and add the following:
<blockquote><pre><?xml version="1.0"?>
<componentPackage>
<!--
  URI        http://vizsage.com/vzg
  namespace  gg
  package    com.vizsage
-->
  <component id="widget1" class="com.vizsage.controls.widget1"/>
  <component id="widget2" class="com.vizsage.controls.widget2"/>
</componentPackage>
</pre></blockquote>Notes:
<ul><li>The class part should give the full path (with / turned into .) to the corresponding .as files.
</li><li>You don't need one <component/> for each .as file, just one for each component.
</li><li>The comment part, like most comments (and many goggles), does nothing: all you need is a <component/> for each widget.
</li><li>You don't have to follow the tld.domain.clevername.widgetname format, but it's what all the cool kids are doing. Just make sure the dotted path matches your files' path.</li><li>The dotted path and the namespace URL don't have anything to do with each other.</li><li>In fact, the namespace URL is <span style="font-style: italic;">completely made up</span>: it doesn't have to exist; the compiler doesn't look for it; hell, adobe's URL doesn't even <a href="http://www.adobe.com/2006/mxml">exist</a>. It's just a tag for uniquely identifying a namespace. All that matters is that the namespace in your compiler flags and your mxml files match up.</li></ul><p>If you use Flex Builder, go into the library project's properties, into the "Library Compiler" pane -- add the namespace and manifest.xml into the respective fields. If you use the standalone package, you'll have to add an option for the <span style="text-decoration: underline;">Component Compiler</span> <blockquote><pre>-include-namespaces="http://vizsage.com/vzg" -namespace "http://vizsage.com/vzg" manifest.xml</pre></blockquote></p><ul><li>You need to include the namespace <span style="font-style: italic;">and</span> define it</li><li>The <code>-namespace</code> flag takes <span style="font-style: italic;">two</span> arguments (a namespace and a manifest.xml)</li><li>The URI here has to match the xmlns URI in your .mxml file.</li></ul>Now your .mxml files (which can be anywhere, and not in that project) start off like<blockquote><pre><?xml version="1.0" encoding="utf-8"?>
<mx:Application
xmlns:mx="http://www.adobe.com/2006/mxml"
xmlns:gg="http://vizsage.com/vzg"
layout="absolute" width="100%" height="100%"
viewSourceURL="srcview/index.html">
<!-- ... your mxml file ... -->
</pre></blockquote><ul><li>Make damn sure the xmlns URI matches what you used before. I spent 30 minutes figuring out that <code>http://www.vizsage.com/vzg</code> and <code>http://vizsage.com/vzg</code> weren't the same thing.
</li><li>In Flex Builder 2, you need to get your project's properties, go into "Flex Build Path", then the "Library Path" pane, and "Add SWC" (the one you built with your custom components).
</li><li>For the command-line tools, add a flag <pre>-library-path+=/abs/olute/path/to/library.swc</pre>Make sure that's a += there.</li><li>Either way, applications (as opposed to libraries) don't need any compiler flags or a manifest.xml. The library uniquely identifies itself within a namespace, and provides files in the right com.foo.bar hierarchy. When your .mxml file (asserts a namespace) and (includes the file) everything turns out right.
</li></ul>For more <a href="http://blog.flashgen.com/2007/07/04/manifests-namespaces-and-flex-builder-2/">about namespaces see here</a>, with one caveat: I think you're better off using the <code><a href="http://livedocs.adobe.com/flex/201/html/wwhelp/wwhimpl/common/html/wwhelp.htm?context=LiveDocs_Book_Parts&file=compilers_123_09.html">-load-config+=</a> </code>trick (to just tack on your changes) than hacking stuff into the main flex-config.xml file.
flip