Home    All Articles    About    carlos@bueno.org    RSS

Accent-Folding for Auto­complete

23 Feb­rua­ry 2010

An­oth­er genera­tion of tech­nology has pas­sed and Uni­code sup­port is al­most every­where. The next step is to write software that is not just “in­ter­nationalized” but truly multi­lin­gu­al. In this ar­ticle we will skip through a bit of his­to­ry and theo­ry, then il­lustrate a neat hack cal­led accent-folding. Accent-folding has its li­mita­tions but it can help make some im­por­tant yet over­looked user in­terac­tions work bet­t­er.
Il­lustra­tion by Kevin Cor­nell
(Original­ly ap­peared on A List Apart.)

A com­mon as­sump­tion about in­ter­nationaliza­tion is that every user fits into a single loc­ale like “En­glish, Uni­ted States” or “French, Fran­ce.” It’s a han­gov­er from the PC days when just gett­ing the com­put­er to dis­play the right squigg­ly bits was a big deal. One byte equaled one charact­er, no ex­cep­tions, and you could only load one lan­guage’s al­phabet at a time. This was fine be­cause it was bet­t­er than noth­ing, and be­cause users spent most of their time with docu­ments they or their co­work­ers pro­duced them­selves.

Today users deal with data from every­where, in multi­ple lan­guages and loc­ales, all the time. The loc­ale I pre­f­er is only loose­ly cor­related with the loc­ales I ex­pect applica­tions to pro­cess.

Con­sid­er this address book:

If I com­pose a new mes­sage and type “lo” in the To: field, what should happ­en? In many applica­tions only Lorena will show up. These applica­tions “sup­port Uni­code,” in the sense that they don’t cor­rupt or barf on it, but that’s all. Screenshot of Microsoft Entourage's address book autosuggest, which does not fold accented characters.
Fig 1. Hey En­tourage, where are my con­tacts?

This pro­blem is not just in address books. Think about in­boxes, soci­al book­marks, com­ment feeds, users who speak multi­ple lan­guages, users in in­ter­net cafés in foreign co­unt­ries, even URLs. Look at the jour­nal­ist Rys­zard Kapuściński and how dif­ferent web­sites han­dle his name:

There is no ex­cuse for your software to play dumb when the user types “cafe” in­stead of “café.”

Áçčềñṭ-Ḟøłɖǐṅg

In specific applica­tions of search that favor re­call over pre­cis­ion, such as our address book ex­am­ple, á, a, å, and â can be treated as equivalent. Ac­cents (a.k.a di­ac­rit­ical marks) are pro­nun­cia­tion hints that don’t af­fect the tex­tu­al mean­ing. En­ter­ing them can be cum­ber­some, es­pecial­ly on mobile de­vices. Example of a YUI Autosuggest widget with accent-folding added.
Fig 2. accent-folding in an auto­sug­gest wid­get

An accent-folding func­tion es­sential­ly maps Uni­code charact­ers to ASCII equivalents. An­yw­here you apply case-folding, you should con­sid­er accent-folding, and for ex­act­ly the same rea­sons. With accent-folding, it doesn’t matt­er wheth­er users search for cafe, café or even çåFé; the re­sults will be the same.

Be aware that there are a mill­ion caveats to ac­cent rules. You will al­most cer­tain­ly get it wrong for some­body, some­where. Near­ly every al­phabet has a few extra-special marks that do af­fect mean­ing, and, of co­ur­se, non-Western al­phabets have com­plete­ly dif­ferent rules.

A minor gotcha is the Uni­code "full­width" Roman al­phabet. These are fixed-width vers­ions of plain ASCII charact­ers de­sig­ned to line up with Chinese/­Japanese/­Korean charact­ers (e.g., "1979年8月15日"). They re­side in 0xff00 to 0xff5e and should be treated as equivalent to their ASCII co­un­terparts.

Hey man, I’m only here for the copy/paste

I’ve post­ed more com­plete ex­am­ples on Git­Hub, but for il­lustra­tion, here’s a basic accent-folder in Javascript:

var accentMap = {
  'á':'a', 'é':'e', 'í':'i','ó':'o','ú':'u'
};

function accent_fold (s) {
  if (!s) { return ''; }
  var ret = '';
  for (var i = 0; i < s.length; i++) {
    ret += accent_map[s.charAt(i)] || s.charAt(i);
  }
  return ret;
};

Re­gular Ex­press­ions

Re­gular ex­press­ions are very tri­cky to make accent-aware. Notice that in Fig. 2 only the un­ac­cented en­t­ries are in bold type. The pro­blem is that the Uni­code charact­er layout does not lend it­self to pat­terns that cut ac­ross lan­guages. The pro­p­er regex for “lo” would be some­th­ing in­sane like:

[LlĹ弾ĻļḶḷḸḹḼḽḺḻŁłŁłĿŀȽƚɫ][OoÓóÒòŎŏÔôỐốỒồỖöȪȫŐőÕõṌȭȮȯǾǿ...ǬǭŌ]

Never, never do this. As of this writ­ing, few re­gular ex­press­ion en­gines sup­port shortcuts for Uni­code charact­er clas­ses. PCRE and Java seem to be in the van­guard. You pro­bab­ly should­n’t push it. In­stead, try highlight­ing an accent-folded vers­ion of the str­ing, and then use those charact­er posi­tions to highlight the origin­al, like so:

// accent_folded_hilite("Fulanilo López", 'lo')
//   --> "Fulani<b>lo</b> <b>Ló</b>pez"
//
function accent_folded_hilite(str, q) {
  var str_folded = accent_fold(str).toLowerCase().replace(/[<>]+/g, '');
  var q_folded = accent_fold(q).toLowerCase().replace(/[<>]+/g, '');

  // Create an intermediate string with hilite hints
  // Example: fulani<lo> <lo>pez
  var re = new RegExp(q_folded, 'g');
  var hilite_hints = str_folded.replace(re, '<'+q_folded+'>');

  // Index pointer for the original string
  var spos = 0;
  // Accumulator for our final string
  var highlighted = '';

  // Walk down the original string and the hilite hint
  // string in parallel. When you encounter a < or > hint,
  // append the opening / closing tag in our final string.
  // If the current char is not a hint, append the equiv.
  // char from the original string to our final string and
  // advance the original string's pointer.
  for (var i = 0; i< hilite_hints.length; i++) {
    var c = str.charAt(spos);
    var h = hilite_hints.charAt(i);
    if (h === '<') {
      highlighted += '<b>';
    } else if (h === '>') {
      highlighted += '</b>';
    } else {
      spos += 1;
      highlighted += c;
    }
  }
  return highlighted;
}

The pre­vi­ous ex­am­ple is pro­bab­ly too simplis­tic for pro­duc­tion code. You can’t highlight multi­ple terms, for ex­am­ple. Some speci­al charact­ers might ex­pand to two charact­ers, such as “æ” --> “ae” which will screw up spos. It also strips out angle-brackets (<>) in the origin­al str­ing. But it’s good en­ough for a first pass.

Accent-folding in YUI Auto­complete

YUI's Auto­complete li­bra­ry has many hooks and opt­ions to play with. Today we’ll look at two over­ride­able met­hods: filterResults() and formatMatch(). The filterResults met­hod al­lows you to write your own match­ing func­tion. The formatMatch met­hod al­lows you to chan­ge the HTML of an entry in the list of sug­gested matches. You can also download a com­plete, work­ing ex­am­ple with all of the sour­ce code.

<!-- this is important to tell javascript to treat
the strings as UTF-8 -->
<meta http-equiv="content-type" content="text/html;charset=utf-8" />

<!-- YUI stylesheets -->
<link rel="stylesheet" type="text/css"
 href="http://yui.yahooapis.com/2.7.0/build/fonts/fonts-min.css" />
<link rel="stylesheet" type="text/css"
 href="http://yui.yahooapis.com/2.7.0/build/autocomplete/assets/skins/sam/autocomplete.css" />

<!-- YUI libraries: events, datasource and autocomplete -->
<script type="text/javascript"
 src="http://yui.yahooapis.com/2.7.0/build/yahoo-dom-event/yahoo-dom-event.js"></script>
<script type="text/javascript"
 src="http://yui.yahooapis.com/2.7.0/build/datasource/datasource-min.js"></script>
<script type="text/javascript"
 src="http://yui.yahooapis.com/2.7.0/build/autocomplete/autocomplete-min.js"></script>

<!-- contains accent_fold() and accent_folded_hilite() -->
<script type="text/javascript" src="accent-fold.js"></script>

<!-- Give <body> the YUI "skin" -->
<body class="yui-skin-sam">
  <b>To:</b>
  <div style="width:25em">

    <!-- Our to: field -->
    <input id="to" type="text" />

    <!-- An empty <div> to contain the autocomplete -->
    <div id="ac"></div>

  </div>
</body>

<script>

// Our static address book as a list of hash tables
var addressBook = [
  {name:'Fulanito López', email:'fulanito@example.com'},
  {name:'Erik Lørgensen', email:'erik@example.com'},
  {name:'Lorena Smith',   email:'lorena@example.com'},
  {name:'James Lö',       email:'james@example.com'}
];

/*
Iterate our address book and add a new field to each
row called "search." This contains an accent-folded
version of the "name" field.
*/
for (var i = 0; i< addressBook.length; i++) {
  addressBook[i]['search'] = accent_fold(addressBook[i]['name']);
}

// Create a YUI datasource object from our raw address book
var datasource = new YAHOO.util.LocalDataSource(addressBook);

/*
A datasource is tabular, but our array of hash tables has no
concept of column order. So explicitly tell the datasource
what order to put the columns in.
*/
datasource.responseSchema = {fields : ["email", "name", "search"]};

/*
Instantiate the autocomplete widget with a reference to the
input field, the empty div, and the datasource object.
*/
var autocomp = new YAHOO.widget.AutoComplete("to", "ac", datasource);

// Allow multiple entries by specifying space
// and comma as delimiters
autocomp.delimChar = [","," "];

/*
Add a new filterResults() method to the autocomplete object:
Iterate over the datasource and search for q inside the
"search" field. This method is called each time the user
types a new character into the input field.
*/
autocomp.filterResults = function(q, entries, resultObj, cb) {
    var matches = [];
    var re = new RegExp('\\b'+accent_fold(q), 'i');
    for (var i = 0; i < entries.length; i++) {
        if (re.test(entries[i]['search'])) {
            matches.push(entries[i]);
        }
    }
    resultObj.results = matches;
    return resultObj;
};

/*
Add a new formatResult() method. It is called on each result
returned from filterResults(). It outputs a pretty HTML
representation of the match. In this method we run the
accent-folded highlight function over the name and email.
*/
autocomp.formatResult = function (entry, q, match) {
    var name = accent_folded_hilite(entry[1], q);
    var email = accent_folded_hilite(entry[0], q);
    return name + ' <' + email + '>';
};

//fin
</script>

About those mill­ion caveats...

This accent-folding trick works primari­ly for Wes­tern European text, but it won’t work for all of it. It ex­ploits specific quirks of the lan­guage fami­ly and the li­mited pro­blem domains of our ex­am­ples, where it’s bet­t­er to get more re­sults than no re­sults. In Ger­man, Ü should pro­bab­ly map to Ue in­stead of just U. A French per­son search­ing the web for thé (tea) would be upset if flooded with ir­relevant En­glish text.

You can only push a sim­ple charact­er map so far. It would be very tri­cky to re­con­cile the Roman “Marc Chagall” with the Cyril­lic “Марк Шагал” or Heb­rew “מאַרק שאַגאַל.” There are very in­terest­ing similarit­ies in the charact­ers but a mag­ical context-free two-way mapp­ing is pro­bab­ly not pos­sible.

On top of all that there is an­oth­er pro­blem: One lan­guage can have more than one writ­ing sys­tem. Trans­litera­tion is writ­ing a lan­guage in a dif­ferent al­phabet. It’s not quite the same as trans­crip­tion, which maps sounds as in “hola, que paso” --> “oh-la, keh pah-so.” Trans­litera­tions try to map the writt­en sym­bols to an­oth­er al­phabet, ideal­ly in a way that’s re­ver­sible.

These four sen­t­ences all say “Childr­en like to watch televis­ion” in Japanese:

For cen­tu­ries peo­ple have been in­vent­ing ways to write dif­ferent lan­guages with whatev­er keyboards or type­sett­ing they had avail­able. So even if the user reads only one lan­guage, they might do so in multi­ple trans­litera­tion schemes. Some schemes are log­ical and academic, but often they are messy or­ganic th­ings that de­pend on re­gion­al ac­cent and his­tor­ical cruft. The com­put­er era kic­ked off a new ex­plos­ion of sys­tems as peo­ple lear­ned to chat and send em­ails in plain ASCII.

There is a lot of prior work on this pro­blem and two ready paths you can choose: The right way and the maybe-good-enough way. Neith­er have the simplic­ity of our naïve hash table, but they will be less dis­ap­point­ing for your users in gener­al applica­tions.

The first is In­ter­nation­al Com­ponents for Uni­code (ICU), a pro­ject that originated in the early ninet­ies at Taligent. It aims to be a com­plete, language-aware trans­litera­tion, Uni­code, for­matt­ing, every­th­ing li­bra­ry. It’s big, it’s C++/Java, and it re­quires con­tex­tu­al know­ledge of the in­puts and out­puts to work.

The second is Uni­decode, a context-free trans­litera­tion li­bra­ry avail­able for Perl and Pyt­hon. It tries to trans­literate all Uni­code charact­ers to a basic Latin al­phabet. It makes lit­tle at­tempt to be re­ver­sible, language-specific, or even general­ly cor­rect; it’s a quick-and-dirty hack that non­et­heless is pre­tty com­prehen­sive.

Accent-folding in the right places saves your users time and makes your software smart­er. The strict seg­rega­tion of lan­guages and loc­ales is par­tial­ly an ar­tifact of tech­nology li­mita­tions which no long­er hold. It’s up to you how far you want to take th­ings, but even a lit­tle bit of ef­fort goes a long way. Good luck!


Italian Trans­la­tion