Saturday, December 31, 2011

Pretty disappointed with Ruby 1.9. I'm new to Ruby.
I was surprised to find that Ruby is neither UTF-8 native nor multilingual-conscious. On top of that, the Ruby WG decided to adopt a non-Unicode-native approach to multilingual scripting. That hit me as well, because most of the text files I need to handle are UTF-8 encoded and cover more than 20 European and Asian languages.

To avoid encoding errors, I have to add several redundant and ugly mantras to the head of every Ruby 1.9 script:

# -*- coding: utf-8 -*-
# Ruby 1.9.3p0 script (the coding line above must be the first line to take effect)
# $KCODE is no longer effective in 1.9!

Encoding.default_external = "utf-8" # set external encoding
Encoding.default_internal = "utf-8" # set internal encoding

Note that the line "# -*- coding: utf-8 -*-" is required in 1.9 multilingual scripting.
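
A minimal sketch of how these settings could be checked, assuming Ruby 1.9 or later. The helper name `encoding_report` is hypothetical and exists only for this illustration:

```ruby
# -*- coding: utf-8 -*-
# Set both default encodings, as in the header above.
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Hypothetical helper: summarizes how Ruby sees a string.
def encoding_report(text)
  {
    :encoding => text.encoding.name,   # encoding attached to the string
    :valid    => text.valid_encoding?, # true if the bytes are well-formed
    :chars    => text.length,          # length in characters
    :bytes    => text.bytesize         # length in bytes
  }
end

p encoding_report("日本語")
```

With the header in place, the string literal should be tagged as UTF-8 and counted as three characters stored in nine bytes.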

- Oniguruma, the regular expression engine of Ruby 1.9, does not handle patterns of undetermined length inside a lookbehind:

  • (?<=something) -- works
  • (?<=something.+) -- does not work
This hit me too. Lookahead and lookbehind are among the features I use most often in my work. Hidemaru and the .NET Framework both support variable-length patterns inside a lookbehind, so I wonder why Oniguruma does not.
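
One possible workaround, sketched under the assumption that only the text after the prefix is needed: match the variable-length prefix as an ordinary part of the pattern and read the capture group instead of relying on a lookbehind. The constant and method names here are hypothetical:

```ruby
# -*- coding: utf-8 -*-
# Fixed-length lookbehind: the engine accepts this.
FIXED_LOOKBEHIND = /(?<=key=)\w+/

# Variable-length prefix: matched as normal pattern text, with the
# wanted part placed in capture group 1 instead of using a lookbehind.
PREFIX_WITH_CAPTURE = /key\s*=\s*(\w+)/

# Hypothetical helper: returns the word after "key =", or nil if absent.
def value_after_key(line)
  m = PREFIX_WITH_CAPTURE.match(line)
  m && m[1]
end

puts "key=abc"[FIXED_LOOKBEHIND]
puts value_after_key("key   =   abc")
```

The trade-off is that the whole match now includes the prefix, so code that relied on the match position or on `gsub` replacing only the tail has to be adjusted to work with the group.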

On the other hand, I am glad that Oniguruma supports "character code classes" (Unicode character properties), just like the .NET Framework. Collecting every language's delimiters and punctuation marks from the various Unicode code charts by hand is tedious and complicated. Character code classes dramatically shrink that list. \p{ } is the placeholder syntax for the classes:
  • \p{Lu} -- represents any uppercase letter, including Cyrillic and Greek.
  • \p{P} -- represents any punctuation character, including .,:[].
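
A short sketch of how the two classes above might be used, assuming UTF-8 source and input. The method names are hypothetical:

```ruby
# -*- coding: utf-8 -*-
# Hypothetical helper: collects every uppercase letter, whatever the script.
def uppercase_letters(text)
  text.scan(/\p{Lu}/)
end

# Hypothetical helper: splits on runs of punctuation and whitespace,
# without listing each language's punctuation marks by hand.
def split_on_punctuation(text)
  text.split(/[\p{P}\s]+/)
end

p uppercase_letters("Привет Ελλάδα Hello")
p split_on_punctuation("東京、大阪。")
p split_on_punctuation("one, two; three")
```

The single class \p{P} covers the Japanese comma and full stop as well as the ASCII comma and semicolon, which is what keeps the pattern short.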