Friday, March 30, 2012
Some fun with ruby 1.9(.3) and string encoding..
Okay, I should probably start by directing you here: http://blog.grayproductions.net/articles/understanding_m17n If you really wanna get your hands dirty with character encoding in ruby, read up (by the way, m17n = multilingualization: "m", 17 letters, "n").
Anyway, so I absorbed some percentage of that, but was a little surprised by some of ruby 1.9's default behavior, in particular what happens when you do string concatenation/interpolation with a mix of US-ASCII and UTF-8 encoded strings. If you combine two such strings, the result is sometimes a US-ASCII-encoded string and sometimes a UTF-8-encoded string, depending on whether or not the UTF-8 substring actually contains any multibyte chars:
> irb
1.9.3p125 :001 > foo = "foo"
=> "foo"
1.9.3p125 :002 > bar = "bar"
=> "bar"
1.9.3p125 :003 > baz = "báz"
=> "báz"
1.9.3p125 :004 > foo.encoding.name
=> "UTF-8"
1.9.3p125 :005 > bar.encoding.name
=> "UTF-8"
1.9.3p125 :006 > baz.encoding.name
=> "UTF-8"
1.9.3p125 :007 > foobar1 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{bar}"
=> "foobarbar"
1.9.3p125 :008 > foobar2 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{baz}"
=> "foobarbáz"
1.9.3p125 :009 > foobar1.encoding.name
=> "US-ASCII"
1.9.3p125 :010 > foobar2.encoding.name
=> "UTF-8"
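For what it's worth, the same rule can be poked at without interpolation. Here's a minimal sketch (the variable names are mine, and it uses String#+ and Encoding.compatible? rather than the interpolation from the session above): when both sides are ASCII-only the left-hand encoding wins, while a multibyte char on the right pulls the result over to UTF-8.

```ruby
# -*- coding: utf-8 -*-
# Sketch only: variable names are illustrative, not from the irb session.
ascii    = "foo".dup.force_encoding(Encoding::US_ASCII)
plain    = "bar".dup.force_encoding(Encoding::UTF_8)  # tagged UTF-8, but 7-bit chars only
accented = "b\u00E1z"                                  # "báz": UTF-8 with a multibyte char

# Encoding.compatible? reports the encoding a combination would end up with.
compat_plain    = Encoding.compatible?(ascii, plain)
compat_accented = Encoding.compatible?(ascii, accented)

ascii_only = ascii + plain     # both operands are 7-bit: left-hand encoding is kept
mixed      = ascii + accented  # right operand has a multibyte char: UTF-8 wins

puts ascii_only.encoding  # US-ASCII
puts mixed.encoding       # UTF-8
```

So the encoding of the result depends on the contents of the strings, not just on how they are tagged.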
As a result, at Goodreads, we had to do some monkey-patching as we were getting some US-ASCII strings back from some rails helper code (pluralize(), number_with_delimiter()) as well as some ruby built-in classes (to_s() from NilClass, Float, Fixnum, Array). There must be a better way, but we've now got this force-utf8 monkey patch file with stuff like this:
module ActionView
  module Helpers
    module NumberHelper
      def number_with_delimiter_with_force_utf8(*args)
        number_with_delimiter_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
      end
      alias_method_chain :number_with_delimiter, :force_utf8
    end
  end
end

# bunch of to_s that need fixing...maybe see if there's a [Class1, Class2].each way of
# doing this that's a little DRYer...
class Array
  def join_with_force_utf8(*args)
    join_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :join, :force_utf8
end

class Fixnum
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end

class Float
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end

class NilClass
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end
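On the DRYer idea from the comment above, here is one possible sketch, assuming the only goal is to re-tag to_s output as UTF-8. It is an illustration, not the patch file described above: it uses plain alias_method instead of alias_method_chain so it runs without ActiveSupport, it uses 1.class so that it picks up Fixnum on 1.9 and Integer on newer rubies, and it dups before force_encoding because newer rubies return a frozen string from nil.to_s.

```ruby
# -*- coding: utf-8 -*-
# Sketch only: loops over the classes instead of repeating the same patch.
# 1.class is Fixnum on ruby 1.9 and Integer on ruby 2.4+.
[1.class, Float, NilClass].each do |klass|
  klass.class_eval do
    # Keep the original implementation reachable under a new name.
    alias_method :to_s_without_force_utf8, :to_s

    define_method(:to_s_with_force_utf8) do |*args|
      # dup first: some rubies hand back a frozen string (e.g. nil.to_s).
      to_s_without_force_utf8(*args).dup.force_encoding(Encoding::UTF_8)
    end

    # Swap the wrapper in as the public to_s.
    alias_method :to_s, :to_s_with_force_utf8
  end
end

puts 42.to_s.encoding
puts 1.5.to_s.encoding
puts nil.to_s.encoding
```

Arguments are still passed through, so something like 255.to_s(16) keeps working as before.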
Note: we also started marking a bunch of our code files with the magic comment (more here) because that seems to be the most effective way to force ruby to default new strings to UTF-8 encoding (there are a few other options, but this has been easiest and most effective). Being an emacser, I tend to propagate the
# -*- coding: utf-8 -*-
form...
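A quick way to check that the magic comment is doing its job is to look at __ENCODING__ and at the encoding of a fresh string literal in the same file. A tiny sketch (the variable names are made up; as far as I know, on ruby 2.0 and later UTF-8 is already the default source encoding, so the comment mostly matters on 1.9):

```ruby
# -*- coding: utf-8 -*-
# The magic comment has to be on the first line of the file
# (or the second, if the first is a shebang line).
source_encoding  = __ENCODING__    # encoding ruby assumes for this source file
literal_encoding = "foo".encoding  # new literals pick up the source encoding

puts source_encoding
puts literal_encoding
```

With the comment in place, both should report UTF-8.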