Sunday, January 27, 2008

Parsing Names with Honorifics

In Railscast #16, Ryan Bates goes over Virtual Attributes in Rails, using the standard example of storing first and last names but getting/setting full names. He uses the following simple snippet:


def full_name=(name)
  split = name.split(' ', 2)
  self.first_name = split.first
  self.last_name = split.last
end

Which -- given that the focus was on virtual attributes -- is fine for explanation. However, that snippet misfires on names like "Franklin Delano Roosevelt": splitting on the first space yields a first name of "Franklin" and a last name of "Delano Roosevelt". Here's a method which our 32nd President will like better:


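# Replace each run of characters that aren't letters, hyphens, or commas
# (whitespace included) with a single space. Commas are kept so that
# "Merkin,PhD" survives for the comma-aware version further down.
def clean(n, re = /[^[:alpha:]\-,]+/)
  n.gsub(re, ' ').strip
end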

# Returns [first_name, last_name]. first_name is '' for a one-word name;
# last_name is nil for a blank name.
# Leading/trailing spaces ignored.
def first_last_from_name(n)
    parts    = clean(n).split(' ')
    [parts.slice(0..-2).join(' '), parts.last]
end

names = [
    "Bill! Merkin,PhD.",
    "Jim               Thurston Howell III   ",
    "Charo", 
    "Heywood Jablowmie",
    "Sergei Rodriguez-Ivanoviv",
    "Polly Romanesq. ",
    "   ", 
    "",
    ]
p names.map { |n| first_last_from_name n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], ["", "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], ["", nil], ["", nil]]

A regex is more extensible, and makes more sense for Perl refugees like me.


# Returns [first_name, last_name] (first_name is nil if there isn't one).
# Leading/trailing spaces ignored.
def first_last_from_name_re(n)
    n = clean(n)
    (n =~ / /) ? n.scan(/(.*)\s+(\S+)$/).first : [nil, n]
end

p names.map { |n| first_last_from_name_re n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], [nil, "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], [nil, ""], [nil, ""]]

However, as someone who can't check in at the automatic kiosks in airports because -- no joke -- the credit card thinks my last name is "IV", I like this version better.


# Returns [first_name, last_name, appendix]
# (first name and appendix are nil if there isn't any).
# Leading/trailing spaces ignored.
# Pass appendix_re to override the default list of recognized appendices.
def first_last_appendix_from_name_re(n, appendix_re = nil)
    n = clean(n)
    appendix_re ||= %q((I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?))
    if (n !~ / /) then
        [nil, n, nil]           # with no spaces return n as last name
    else
        n.scan(
          /\A(.*?)\s+           # everything up to the last name
           (\S+?)               # last name is last stretch of non-whitespace
           (?:                  # But! there may be an appendix.  Look for an optional group
             (?:,\s*|\s+)       #   that is set off by a comma or spaces
             #{appendix_re}     #   and that matches any of our standard honorifics.
             )?                 # but if not, don't worry about it.
           \Z/ix).first         # scan gives array of arrays; \A..\Z guarantees exactly one match
    end
end

p names.map { |n| first_last_appendix_from_name_re n }
# => [["Bill", "Merkin", "PhD"], ["Jim Thurston", "Howell", "III"], [nil, "Charo", nil], ["Heywood", "Jablowmie", nil], ["Sergei", "Rodriguez-Ivanoviv", nil], ["Polly", "Romanesq", nil], [nil, "", nil], [nil, "", nil]]

All three versions might make Japanese (and other "FamilyName GivenNames" cultures) sad.
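
To close the loop with the virtual attribute that started all this, here is a rough sketch of how the last version might sit behind full_name=. It uses a plain Struct as a stand-in for a Rails model, and the suffix field, the Person name, and the NAME_SUFFIX / NAME_RE constants are all my own assumptions for illustration, not anything from the Railscast. It still assumes given names come first.

```ruby
# Assumed suffix list, mirroring the appendix pattern from the post.
NAME_SUFFIX = '(I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?)'

# first names, last name, optional suffix set off by a comma or spaces.
NAME_RE = /\A(.*?)\s+(\S+?)(?:(?:,\s*|\s+)#{NAME_SUFFIX})?\z/i

# Hypothetical stand-in for a model with three stored columns.
Person = Struct.new(:first_name, :last_name, :suffix) do
  # Virtual attribute setter: parse once, then fill the stored fields.
  def full_name=(name)
    # Collapse anything that isn't a letter, hyphen, or comma into one space.
    cleaned = name.gsub(/[^[:alpha:]\-,]+/, ' ').strip
    if cleaned.include?(' ')
      self.first_name, self.last_name, self.suffix =
        cleaned.match(NAME_RE).captures
    else
      # A single word (or nothing at all) is treated as the last name.
      self.first_name, self.last_name, self.suffix = nil, cleaned, nil
    end
  end

  # Virtual attribute getter: rebuild a display name from the stored parts.
  def full_name
    [first_name, last_name, suffix].compact.join(' ')
  end
end

fdr = Person.new
fdr.full_name = "Franklin Delano Roosevelt"
p [fdr.first_name, fdr.last_name, fdr.suffix]
# => ["Franklin Delano", "Roosevelt", nil]
```

In a real model the setter would assign to database columns instead of Struct members, but the parsing would be the same.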


Tuesday, July 24, 2007

How to use exuberant CTAGS with ActionScript and Flex

How to fix CTAGS to work with ActionScript, from the vim-taglist project:
  • ActionScript: Add the following lines to the $HOME/.ctags or $HOME/ctags.conf file:
    --langdef=actionscript
    --langmap=actionscript:.as
    --regex-actionscript=/^[ \t]*[(private| public|static) ( \t)]*function[ \t]+([A-Za-z0-9_]+)[ \t]*\(/\1/f, function, functions/
    --regex-actionscript=/^[ \t]*[(public) ( \t)]*function[ \t]+(set|get) [ \t]+([A-Za-z0-9_]+)[ \t]*\(/\1 \2/p,property, properties/
    --regex-actionscript=/^[ \t]*[(private| public|static) ( \t)]*var[ \t]+([A-Za-z0-9_]+)[ \t]*/\1/v,variable, variables/
    --regex-actionscript=/.*\.prototype \.([A-Za-z0-9 ]+)=([ \t]?)function( [ \t]?)*\(/\1/ f,function, functions/
    --regex-actionscript=/^[ \t]*class[ \t]+([A-Za-z0-9_]+)[ \t]*/\1/c,class, classes/
    
    Add the following lines to the ~/.vimrc or $HOME\_vimrc file:
    " actionscript language
    let tlist_actionscript_settings = 'actionscript;c:class;f:method;p:property;v:variable'
    
I actually just add these lines to my maintenance Makefile:
CTAGLANGS = --langdef=actionscript \
--langmap=actionscript:.as \
--regex-actionscript='/^[ \t]*[(private| public|static) ( \t)]*function[ \t]+([A-Za-z0-9_]+)[ \t]*\(/\1/f, function, functions/' \
--regex-actionscript='/^[ \t]*[(public) ( \t)]*function[ \t]+(set|get) [ \t]+([A-Za-z0-9_]+)[ \t]*\(/\1 \2/p,property, properties/' \
--regex-actionscript='/^[ \t]*[(private| public|static) ( \t)]*var[ \t]+([A-Za-z0-9_]+)[ \t]*/\1/v,variable, variables/' \
--regex-actionscript='/.*\.prototype \.([A-Za-z0-9 ]+)=([ \t]?)function( [ \t]?)*\(/\1/f,function, functions/' \
--regex-actionscript='/^[ \t]*class[ \t]+([A-Za-z0-9_]+)[ \t]*/\1/c,class, classes/'

.PHONY: ctags
ctags:
	-rm -f TAGS
	find . -name "*.as" -or -name "*.mxml" | ctags -eL - $(CTAGLANGS)
Take off the -e (ctags -L - $(CTAGLANGS)) if you're one of those vi users (I'm being polite because, after all, it's from them that this tip comes).
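
If you want a feel for what the first --regex-actionscript pattern will pick up before running ctags, you can poke at it from Ruby. This is only an approximation: Ruby's regex engine is not the one ctags uses, and the ActionScript lines below are made up for the check. FUNCTION_RE and tag_name are names I've invented here.

```ruby
# The function pattern from the config above, transcribed as a Ruby Regexp.
# Note that [(private| public|static) ( \t)] is a character class, not an
# alternation: it matches any run of those individual characters.
FUNCTION_RE = Regexp.new(
  '^[ \t]*[(private| public|static) ( \t)]*function[ \t]+([A-Za-z0-9_]+)[ \t]*\('
)

# Returns the captured function name (what \1 would be), or nil if no match.
def tag_name(line)
  m = FUNCTION_RE.match(line)
  m && m[1]
end

# Hypothetical ActionScript source lines.
samples = [
  "    public function getName():String {",
  "private static function helper_1 (x:int):void {",
  "    var count:int = 0;",
  "protected function reset():void {",
]

samples.each { |line| p [line.strip, tag_name(line)] }
```

In Ruby, at least, the first two lines yield getName and helper_1, while the var line and the protected line yield nil. The latter is because "o" is not among the characters in that modifier class, so it may be worth checking whether your own protected or override methods show up in the tags file.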
