bmpercy: March 2012

Friday, March 30, 2012

Some fun with ruby 1.9(.3) and string encoding..

Okay, I should probably start by directing you here: http://blog.grayproductions.net/articles/understanding_m17n If you really wanna get your hands dirty with character encoding in ruby, read up (by the way, m17n = multilingualization).

Anyway, I absorbed some percentage of that, but was still a little surprised by some of the default behavior of ruby 1.9, in particular what happens when you do string concatenation/interpolation with a mix of US-ASCII and UTF-8 encoded strings. If you combine such strings, the result is sometimes a US-ASCII-encoded string and sometimes a UTF-8-encoded string, depending on whether or not the UTF-8 pieces contain any multibyte chars! Here foobar1 is built only from ASCII-only pieces and comes out US-ASCII, while foobar2 includes "báz" and comes out UTF-8:


> irb
1.9.3p125 :001 > foo = "foo"
 => "foo" 
1.9.3p125 :002 > bar = "bar"
 => "bar" 
1.9.3p125 :003 > baz = "báz"
 => "báz" 
1.9.3p125 :004 > foo.encoding.name
 => "UTF-8" 
1.9.3p125 :005 > bar.encoding.name
 => "UTF-8" 
1.9.3p125 :006 > baz.encoding.name
 => "UTF-8" 
1.9.3p125 :007 > foobar1 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{bar}"
 => "foobarbar" 
1.9.3p125 :008 > foobar2 = "#{foo.force_encoding(Encoding::US_ASCII)}#{bar}#{baz}"
 => "foobarbáz" 
1.9.3p125 :009 > foobar1.encoding.name
 => "US-ASCII" 
1.9.3p125 :010 > foobar2.encoding.name
 => "UTF-8"
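
If you want to poke at the rule outside of irb, here is a minimal sketch. It uses String#+ and Encoding.compatible? rather than interpolation (my substitution, not what the session above does), and the variable names are made up for illustration. The multibyte string is written with a \u escape so the snippet does not depend on the source file's encoding.

```ruby
# Sketch only: String#+ and Encoding.compatible? stand in for interpolation here.
ascii_foo  = "foo".encode(Encoding::US_ASCII)  # ASCII-only, tagged US-ASCII
utf8_plain = "bar".encode(Encoding::UTF_8)     # ASCII-only, tagged UTF-8
utf8_multi = "b\u00E1z"                        # "báz": UTF-8 with a multibyte char

# ascii_only? says whether a string holds nothing but 7-bit characters.
puts utf8_plain.ascii_only?   # => true
puts utf8_multi.ascii_only?   # => false

# Encoding.compatible? returns the encoding a combination would end up with.
puts Encoding.compatible?(ascii_foo, utf8_plain).name   # => US-ASCII
puts Encoding.compatible?(ascii_foo, utf8_multi).name   # => UTF-8

# Concatenation follows the same rule.
plain_sum = ascii_foo + utf8_plain
multi_sum = ascii_foo + utf8_multi
puts plain_sum.encoding.name  # => US-ASCII
puts multi_sum.encoding.name  # => UTF-8
```

So when every piece is ASCII-only, the left-hand string's encoding wins, and as soon as a multibyte char shows up the result has to be UTF-8.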

As a result, at Goodreads, we had to do some monkey-patching, since we were getting US-ASCII strings back from some rails helper code (pluralize(), number_with_delimiter()) as well as from some ruby built-in classes (to_s() on NilClass, Float and Fixnum, join() on Array). There must be a better way, but we've now got this force-utf8 monkey patch file with stuff like this:


module ActionView
  module Helpers
    module NumberHelper
      def number_with_delimiter_with_force_utf8(*args)
        number_with_delimiter_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
      end
      alias_method_chain :number_with_delimiter, :force_utf8
    end
  end
end

# bunch of to_s that need fixing...maybe see if there's a [Class1, Class2].each way of
# doing this that's a little DRYer...
class Array
  def join_with_force_utf8(*args)
    join_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :join, :force_utf8
end
class Fixnum
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end
class Float
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end
class NilClass
  def to_s_with_force_utf8(*args)
    to_s_without_force_utf8(*args).force_encoding(Encoding::UTF_8)
  end
  alias_method_chain :to_s, :force_utf8
end
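
On the "little DRYer" idea from the comment above, one possible sketch looks like this. It is not what we actually run: force_utf8_on is a hypothetical helper name, and it uses plain alias_method plus define_method instead of alias_method_chain so that it works without ActiveSupport loaded.

```ruby
# Hypothetical helper, not part of Rails or of our real patch file.
# Wraps klass#method_name so that its String result is tagged as UTF-8.
def force_utf8_on(klass, method_name)
  original = "#{method_name}_without_force_utf8"
  klass.class_eval do
    alias_method original, method_name
    define_method(method_name) do |*args|
      result = send(original, *args)
      # Assumption: some Ruby versions hand back a frozen string here,
      # so copy before retagging instead of mutating it in place.
      result = result.dup if result.frozen?
      result.force_encoding(Encoding::UTF_8)
    end
  end
end

# One list instead of one class body per patch. On 1.9.3, Fixnum => :to_s
# would be handled the same way; it is left out here because that constant
# does not exist on current rubies.
[[Array, :join], [Float, :to_s], [NilClass, :to_s]].each do |klass, method_name|
  force_utf8_on(klass, method_name)
end

puts [1, 2].join(",").encoding.name  # => UTF-8
puts 1.5.to_s.encoding.name          # => UTF-8
puts nil.to_s.encoding.name          # => UTF-8
```

The trade-off is the same as with the longhand version: it only retags the bytes as UTF-8, which is safe for these methods because what they return is plain ASCII anyway.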

Note: we also started marking a bunch of our code files with the magic comment (more here), because that seems to be the most effective way to get ruby to default the string literals in a file to UTF-8 encoding (there are a few other options, but this has been the easiest and most effective). Being an emacser, I tend to propagate the # -*- coding: utf-8 -*- form...
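
For reference, a tiny sketch of what a marked file looks like. The comment has to be the first line of the file (or the second, if the first is a shebang), and the label variable is just a made-up example.

```ruby
# -*- coding: utf-8 -*-
# The line above sets this file's source encoding.

label = "total"  # hypothetical literal, only here to inspect its encoding

# __ENCODING__ reports the source encoding of the current file.
puts __ENCODING__.name

# A plain string literal picks up the file's source encoding.
puts label.encoding.name
puts label.encoding == __ENCODING__
```

With the magic comment in place, both names print as UTF-8 and the last line prints true.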

posted by brainpercy @ 10:20 PM
