Latinizing Vietnamese Text

June 21st, 2007

This issue came up as I was working on adding SEF (search-engine friendly) words to internal URLs on www.newhanoian.com. Adding human-readable text to URLs is not only good for Google - I find it helpful for me when I'm looking at URLs on mouseover or in the Location Bar dropdown.

If I put the Vietnamese words Bia Hơi - Bia Tươi into a SEF URL, however, I get the following mess: bia-h%C6%A1i-bia-t%C6%B0%C6%A1i.
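
The mess is the percent-encoded form of the UTF-8 bytes behind each non-ASCII character. A minimal sketch that reproduces it with PHP's rawurlencode(), assuming the file is saved as UTF-8 and that the words have already been lowercased and hyphenated (the $slug variable is only there for illustration):

```php
<?php
// Assumes this file is saved as UTF-8.
// Hypothetical slug: lowercased, with the separator already reduced to hyphens.
$slug = 'bia-hơi-bia-tươi';

// rawurlencode() percent-encodes every byte outside the unreserved ASCII set,
// so each two-byte UTF-8 character turns into two %XX escapes.
$encoded = rawurlencode($slug);

echo $encoded, "\n"; // bia-h%C6%A1i-bia-t%C6%B0%C6%A1i
```

Here ơ is the byte pair C6 A1 and ư is C6 B0, which is why three accented letters cost eighteen extra characters in the URL.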

I don't know whether this is friendly to Google or not, but it's not easy on my eyes at all!

Luckily, Vietnamese is still quite comprehensible when it's reduced to Latin characters only, so I decided to transform the accented characters into their recognizable Latin counterparts - ắ and its relatives become a, đ becomes d, è becomes e, and so on. I was surprised to learn that there are 186 possible character / diacritic combinations in written Vietnamese!

My first stab at this used iconv, unsuccessfully. The following code is supposed to transliterate into Latin automatically: $word = iconv('UTF-8', 'US-ASCII//TRANSLIT', $word); -- but it doesn't work well at all for Vietnamese text. It chokes completely on the bia example above.
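
What //TRANSLIT produces depends on the iconv implementation PHP was built against, so it is worth inspecting the result rather than trusting it. A rough sketch of such a check, where try_iconv_translit() is a hypothetical helper name and not part of PHP:

```php
<?php
// Hypothetical helper: attempt iconv transliteration, report failure as null.
function try_iconv_translit($word) {
    // @ silences the notice or warning iconv() raises when it cannot convert.
    // On failure iconv() returns false (older PHP versions could instead
    // return the string cut off at the first unconvertible character).
    $result = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $word);
    return ($result === false) ? null : $result;
}

$out = try_iconv_translit('Bia Hơi - Bia Tươi');

// The outcome varies by platform: null, '?' placeholders, or a shortened
// string are all possible, so no single expected output is shown here.
var_dump($out);
```

That unpredictability is the practical argument for a hand-written mapping.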

After a little poking around on Viet Unicode and learning some interesting things about the Vietnamese alphabet and its collation, I managed to come up with the complete set of Vietnamese characters:

aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆ
fFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTu
UùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ
... and from this I came up with a set of regular expressions for latinizing Vietnamese text, which I've expressed as a static utility method:
	public static function romanize_vn($string) {
		// a
		$string = preg_replace('/[àảãáạăằẳẵắặâầẩẫấậ]/u', 'a', $string);
		$string = preg_replace('/[ÀẢÃÁẠĂẰẲẴẮẶÂẦẨẪẤẬ]/u', 'A', $string);
		// e
		$string = preg_replace('/[èẻẽéẹêềểễếệ]/u', 'e', $string);
		$string = preg_replace('/[ÈẺẼÉẸÊỀỂỄẾỆ]/u', 'E', $string);
		// i
		$string = preg_replace('/[ìỉĩíị]/u', 'i', $string);
		$string = preg_replace('/[ÌỈĨÍỊ]/u', 'I', $string);
		// o
		$string = preg_replace('/[òỏõóọôồổỗốộơờởỡớợ]/u', 'o', $string);
		$string = preg_replace('/[ÒỎÕÓỌÔỒỔỖỐỘƠỜỞỠỚỢ]/u', 'O', $string);
		// u
		$string = preg_replace('/[ùủũúụưừửữứự]/u', 'u', $string);
		$string = preg_replace('/[ÙỦŨÚỤƯỪỬỮỨỰ]/u', 'U', $string);
		// y
		$string = preg_replace('/[ỳỷỹýỵ]/u', 'y', $string);
		$string = preg_replace('/[ỲỶỸÝỴ]/u', 'Y', $string);
		// d
		$string = preg_replace('/[đ]/u', 'd', $string);
		$string = preg_replace('/[Đ]/u', 'D', $string);
		return $string;
	}
... note the '/u' switch on the pattern to tell PHP to interpret it as Unicode (UTF-8). Otherwise each character class is treated as a set of single bytes, and you'll be matching fragments of your multi-byte characters and turning the string into garbage.
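
A small sketch of the difference, assuming the file and the input are both UTF-8 (the variable names are just for illustration):

```php
<?php
// Assumes UTF-8 source and input. 'đ' is stored as the two bytes C4 91.
$word = 'đi';

// With /u the class contains one character, so 'đ' is replaced once.
$with_u = preg_replace('/[đ]/u', 'd', $word);

// Without /u the class contains two separate bytes, and each byte of 'đ'
// is matched and replaced on its own.
$without_u = preg_replace('/[đ]/', 'd', $word);

echo $with_u, "\n";    // di
echo $without_u, "\n"; // ddi
```

The byte-wise version can also corrupt characters that merely share a byte with the class: 'Đ' is C4 90, so the same pattern without /u would replace its first byte and leave a stray 90 behind.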

So far it seems to work quite well. I'm sure all those preg_replace calls are not particularly fast, but they're not obviously slow either.
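
If the regular expressions ever did become a bottleneck, one alternative is a single strtr() pass over a prebuilt lookup table. The sketch below is not the method used above: it is a swapped-in technique, and build_vn_map(), romanize_vn_strtr() and sef_slug() are hypothetical names. It assumes UTF-8 input with precomposed characters, and whether it is actually faster would need measuring.

```php
<?php
// Hypothetical helpers; assumes UTF-8 source and precomposed (NFC) input.
function build_vn_map() {
    $groups = array(
        'a' => 'àảãáạăằẳẵắặâầẩẫấậ', 'A' => 'ÀẢÃÁẠĂẰẲẴẮẶÂẦẨẪẤẬ',
        'e' => 'èẻẽéẹêềểễếệ',       'E' => 'ÈẺẼÉẸÊỀỂỄẾỆ',
        'i' => 'ìỉĩíị',             'I' => 'ÌỈĨÍỊ',
        'o' => 'òỏõóọôồổỗốộơờởỡớợ', 'O' => 'ÒỎÕÓỌÔỒỔỖỐỘƠỜỞỠỚỢ',
        'u' => 'ùủũúụưừửữứự',       'U' => 'ÙỦŨÚỤƯỪỬỮỨỰ',
        'y' => 'ỳỷỹýỵ',             'Y' => 'ỲỶỸÝỴ',
        'd' => 'đ',                 'D' => 'Đ',
    );
    $map = array();
    foreach ($groups as $latin => $chars) {
        // Split the UTF-8 string into individual characters.
        foreach (preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $map[$char] = $latin;
        }
    }
    return $map;
}

function romanize_vn_strtr($string) {
    static $map = null; // built once, reused on later calls
    if ($map === null) {
        $map = build_vn_map();
    }
    // The array form of strtr() replaces every key with its value in one pass.
    return strtr($string, $map);
}

function sef_slug($title) {
    $ascii = strtolower(romanize_vn_strtr($title));
    // Collapse each run of non-alphanumeric characters into a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($slug, '-');
}

echo romanize_vn_strtr('Bia Hơi - Bia Tươi'), "\n"; // Bia Hoi - Bia Tuoi
echo sef_slug('Bia Hơi - Bia Tươi'), "\n";          // bia-hoi-bia-tuoi
```

The slug step brings things back to the original goal: the URL fragment is now plain ASCII, so nothing in it needs percent-encoding.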

4 Responses to “Latinizing Vietnamese Text”

  1. Progfou Says:

    About this sentence: " note the '/u' switch the pattern to tell PHP to interpret it as unicode. "

    In fact it's not PHP but Perl here. The preg_* functions call the Perl-Compatible Regular Expressions (PCRE) library linked into PHP.

    It's sad to say, but PHP is really bad at managing Unicode, and full Unicode compliance is only planned for PHP version... 7!

    What's the current version in this year of 2007? Only 5... And when did Unicode start? In 1991... So maybe we'll get it in the next decade... ;-)

    You're quite right to turn to Ruby/Rails! For my part, I'm looking at Python...

    Cheers, J.C.

  2. Unsarce Says:

    heh. not bad!

  3. Unsarce Says:

    hm... good one..

  4. Pesprierm Says:

    mm. really like it..
