Friday, March 30, 2012

 

Some fun with Ruby 1.9(.3) and string encoding

Okay, I should probably start by directing you here: http://blog.grayproductions.net/articles/understanding_m17n. If you really want to get your hands dirty with character encoding in Ruby, read up (by the way, m17n is short for multilingualization: "m", then 17 letters, then "n").

Anyway, I absorbed some percentage of that, but was still a little surprised by some of Ruby 1.9's default behavior, in particular what happens when you concatenate or interpolate a mix of US-ASCII and UTF-8 encoded strings. Combining such strings sometimes gives you a US-ASCII string and sometimes a UTF-8 string, depending on whether the UTF-8 piece actually contains any multibyte characters:

> irb
1.9.3p125 :001 > foo = "foo"
=> "foo"
1.9.3p125 :002 > bar = "bar"
=> "bar"
1.9.3p125 :003 > baz = "báz"
=> "báz"
1.9.3p125 :004 > foo.encoding.name
=> "UTF-8"
1.9.3p125 :005 > bar.encoding.name
=> "UTF-8"
1.9.3p125 :006 > baz.encoding.name
=> "UTF-8"
1.9.3p125 :007 > foobar1 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{bar}"
=> "foobarbar"
1.9.3p125 :008 > foobar2 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{baz}"
=> "foobarbáz"
1.9.3p125 :009 > foobar1.encoding.name
=> "US-ASCII"
1.9.3p125 :010 > foobar2.encoding.name
=> "UTF-8"
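
The rule behind this appears to be encoding compatibility: when both strings contain only 7-bit ASCII characters, Ruby keeps the encoding of the left-hand string, and as soon as one side holds a multibyte character, that side's encoding wins. Here is a minimal sketch of that, using String#+ and Encoding.compatible? rather than interpolation (interpolation may not behave identically on every Ruby version, so the sketch sticks to plain concatenation). The variable names and the combined_encoding helper are made up for the example:

```ruby
# Sketch: which encoding does Ruby pick when mixing US-ASCII and UTF-8?
# All names here (ascii, plain, accented, combined_encoding) are illustrative.

ascii    = "foo".dup.force_encoding(Encoding::US_ASCII)
plain    = "bar".encode(Encoding::UTF_8)   # UTF-8, but 7-bit characters only
accented = "b\u00E1z"                      # UTF-8 with one multibyte character

# Returns the encoding of the concatenation a + b.
def combined_encoding(a, b)
  (a + b).encoding
end

puts plain.ascii_only?                      # true
puts accented.ascii_only?                   # false
puts Encoding.compatible?(ascii, plain)     # US-ASCII
puts Encoding.compatible?(ascii, accented)  # UTF-8
puts combined_encoding(ascii, plain)        # US-ASCII
puts combined_encoding(ascii, accented)     # UTF-8
```

So the result's encoding depends on the content of the strings, not just on their encoding tags, which is why the same helper can hand back different encodings on different inputs.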

As a result, at Goodreads, we had to do some monkey-patching, because we were getting US-ASCII strings back from some Rails helper code (pluralize(), number_with_delimiter()) as well as from some Ruby built-in classes (to_s() on NilClass, Float and Fixnum, join() on Array). There must be a better way, but for now we've got a force-utf8 monkey patch file with stuff like this:

module ActionView
  module Helpers
    module NumberHelper
      def number_with_delimiter_with_force_utf8(*args)
        number_with_delimiter_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
      end
      alias_method_chain :number_with_delimiter, :force_utf8
    end
  end
end

# bunch of to_s that need fixing...maybe see if there's a [Class1, Class2].each way of
# doing this that's a little DRYer...
class Array
  def join_with_force_utf8(*args)
    join_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :join, :force_utf8
end

class Fixnum
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end

class Float
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end

class NilClass
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end
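
For what it's worth, the [Class1, Class2].each idea from the comment in that file could look something like the following. It is only a sketch under a few assumptions: it uses Module#prepend (Ruby 2.0+) instead of alias_method_chain so that it runs without Rails, it patches Integer because Fixnum was folded into Integer in Ruby 2.4, and the module name ForceUtf8ToS is made up for the example.

```ruby
# Hypothetical module: wraps to_s and retags the result as UTF-8.
module ForceUtf8ToS
  def to_s(*args)
    str = super
    # Some built-ins (e.g. nil.to_s on newer Rubies) return a frozen string,
    # so copy it before changing the encoding tag.
    str = str.dup if str.frozen?
    str.force_encoding(Encoding::UTF_8)
  end
end

# One loop instead of one near-identical class body per built-in.
[Integer, Float, NilClass].each do |klass|
  klass.send(:prepend, ForceUtf8ToS)
end

puts 42.to_s.encoding    # UTF-8
puts 1.5.to_s.encoding   # UTF-8
puts nil.to_s.encoding   # UTF-8
```

This is still a global monkey patch, so the usual caveats apply; it just removes the copy-and-paste.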

Note: we also started marking a bunch of our code files with the magic comment (more here), because that seems to be the most effective way to get Ruby to default the string literals in those files to UTF-8 (there are a few other options, but this has been the easiest and most effective). Being an emacser, I tend to propagate the # -*- coding: utf-8 -*- form...
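
For reference, a file marked this way looks like the sketch below. On 1.9 the default source encoding is US-ASCII, so the comment on the first line is what makes the literals UTF-8; Ruby 2.0 and later assume UTF-8 source anyway. The helper names source_encoding and literal_encoding are made up for the example.

```ruby
# -*- coding: utf-8 -*-
# The magic comment has to be on the first line of the file
# (or the second, if the first is a shebang line).

# Hypothetical helper: the source encoding Ruby is using for this file.
def source_encoding
  __ENCODING__
end

# Hypothetical helper: the encoding of a plain string literal in this file.
def literal_encoding
  "plain literal".encoding
end

puts source_encoding    # source encoding of this file
puts literal_encoding   # plain literals take the source encoding
```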
