The text-icu package provides Haskell bindings for the ICU library. The library author notes that there is some memory overhead, because a Haskell Text value has to be copied into a fixed memory area before it can be passed to ICU through the FFI. text-icu manages these buffers automatically, so you don’t need to mess around with Ptr and raw heap memory allocation, but this copying overhead is to be expected.

kyagrd@kyahp:~/cscs/text-icu-ko$ ghci -XOverloadedStrings
GHCi, version 7.4.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude Data.Text.ICU> :m + Data.Text.ICU.Normalize
Prelude Data.Text.ICU Data.Text.ICU.Normalize>

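Under NFKD (compatibility decomposition) and NFKC (compatibility composition), Hangul compatibility jamo are mapped to conjoining jamo: consonants become chosung (initial-consonant) jamo and vowels become jungsung (medial-vowel) jamo. A syllable without a final consonant can therefore be equated with its spelling in compatibility jamo, as follows: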

…> (normalize NFD "나",normalize NFD "ㄴㅏ")
…> (normalize NFKD "나",normalize NFKD "ㄴㅏ")
…> normalize NFKD "나" == normalize NFKD "ㄴㅏ"

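However, the Unicode standard does not seem to define any relation between jongsung (final-consonant) jamo and compatibility jamo, so the ICU library cannot be expected to provide such a facility. A compatibility consonant always normalizes to a chosung jamo, even when it is meant as a final consonant.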

…> ("난","ㄴㅏㄴ")
…> (normalize NFD "난",normalize NFD "ㄴㅏㄴ")
…> (normalize NFKD "난",normalize NFKD "ㄴㅏㄴ")
…> normalize NFKD "난" == normalize NFKD "ㄴㅏㄴ"
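
Because chosung and jongsung jamo look alike on screen, the difference between the two normalized strings is easier to see as code points. The helper below is a minimal sketch that uses only base; the name codePoints is made up for this post and is not part of text-icu.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- Hypothetical helper (not part of text-icu): render every character
-- of a String as a "U+XXXX" label so look-alike jamo can be told apart.
codePoints :: String -> [String]
codePoints = map label
  where
    label c = "U+" ++ pad (map toUpper (showHex (ord c) ""))
    -- Left-pad with zeros to at least four hex digits.
    pad h = replicate (4 - length h) '0' ++ h
```

For the conjoining sequence chosung nieun, jungsung a, jongsung nieun, codePoints "\x1102\x1161\x11AB" gives ["U+1102","U+1161","U+11AB"]. With text-icu loaded, applying it to Data.Text.unpack (normalize NFKD "난") should give that same list, whereas the compatibility spelling "ㄴㅏㄴ" should end in U+1102 instead of U+11AB, which is why the comparison fails.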

So, in order to define an operator such that "난" =:= "ㄴㅏㄴ" is True, one has to implement it oneself, referring to the Hangul-related Unicode code charts.
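
One possible way to do this, sketched here under some assumptions, is to go in the opposite direction from ICU: instead of normalizing to conjoining jamo, expand every precomposed syllable into compatibility jamo and compare the results. The arithmetic is the standard Hangul syllable decomposition (syllables start at U+AC00, with 21 vowels and 28 final slots per initial). The names toCompatJamo, initials and finals are made up for this sketch, it works on String rather than Text, and it does not split compound finals such as ㄳ into ㄱㅅ or handle conjoining jamo from the U+1100 block.

```haskell
import Data.Char (chr, ord)

-- Compatibility jamo for the 19 initial consonants, in chosung order.
initials :: String
initials = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

-- Compatibility jamo for the 27 final consonants, in jongsung order.
finals :: String
finals = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"

-- Expand each precomposed syllable (U+AC00..U+D7A3) into compatibility
-- jamo; every other character is passed through unchanged.
toCompatJamo :: String -> String
toCompatJamo = concatMap expand
  where
    expand c
      | s < 0 || s >= 11172 = [c]
      | otherwise =
          (initials !! l) : chr (0x314F + v) : [finals !! (t - 1) | t > 0]
      where
        s = ord c - 0xAC00          -- offset into the syllable block
        l = s `div` 588             -- initial index (588 = 21 * 28)
        v = (s `mod` 588) `div` 28  -- vowel index; U+314F is ㅏ
        t = s `mod` 28              -- final index, 0 means no final

-- Two strings are equal if they spell the same compatibility jamo.
infix 4 =:=
(=:=) :: String -> String -> Bool
a =:= b = toCompatJamo a == toCompatJamo b
```

With this file loaded in GHCi, "난" =:= "ㄴㅏㄴ" evaluates to True, and so does "나" =:= "ㄴㅏ". Comparing in compatibility jamo loses the initial/final distinction, which is exactly what this operator wants; a Text version could wrap the same function with Data.Text.pack and Data.Text.unpack.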


#langdev channel at Ozinger IRC network

Decomposing a UTF-8 Hangul string into first/middle/last letters (jamo)