Matching names, titles or products from different systems for similarity

Everybody has slightly different way of naming things. For example, I like to name my music as "Artist - Title" while others prefer "(Artist) Title" or even "Title - Artist".

When it comes to matching product names or models, it gets even worse as some lists will include hyphens in the product names while other lists exclude them.

The "Kingston 128MB PC133" ram stick has the model name "KVR133X64C3Q-128", as some lists will exclude the "-" and display it as "KVR133X64C3Q128".

Sometimes it gets even more ridiculous. I had to match products from my workplace, our internal products database to an external feed which had product names such as "Sony 26 KLHW26 Commercial LCD TV Hotel Model, 1366x768 Resolution, 1300:1, HDMI, Component, S-Video, Composite (KLH-W26)".

I haven't been able to find any one hit wonder method to get this working, and its quite CPU intensive for larger catalogues of information.

The methods are used together in an attempt to find the closest match, with more precise or efficient methods at the top of the stack since they hit more often.

Examples will be written in PHP since its fairly simple to translate into other languages.

function find_matching_product() {
  if ($id = method1()) { return $id; }
  if ($id = method2()) { return $id; }
  // ...
  return null;
}

Prior to performing any matches, it is a good idea to lowercase all the names and strip any non-alphanumerical characters. This will simplify your problems a great deal.

Intel Core 2 Quad Q9550, 2.83GHz, Quad Core, Socket LGA775, 95W TDP, 1333MHz FSB, 2x6MB L2 Cache, Boxed, Yorkfield (BX80569Q9550)

Should become:

intel core 2 quad q9550 283ghz quad core socket lga775 95w tdp 1333mhz fsb 2x6mb l2 cache boxed yorkfield bx80569q9550

Using PHP and regex, it should look something like this:

/**
* Strips any characters other than numbers, letters or spaces from a string.
*/
function strip_string($string) {
  return preg_replace('/[^\da-z ]/', "", $string);
}

Another point to take note of is the use of the fake space (ascii value 160) as a seperator. Replace it with spaces asap!

Method 1: Strip back the longer title

This is a fairly quick and simple one. Trim back the excess on names (separated by spaces) until there is a match. Works best until there are less than 4 words, then it has a fairly high false positive rate.

Name A: intel core 2 quad q9550
Name B: intel core 2 quad q9550 283ghz quad core socket lga775 95w tdp 1333mhz fsb 2x6mb l2 cache boxed yorkfield bx80569q9550

The red writing is to be truncated until there is a match of "intel core 2 quad q9550".

The following function is passed this PHP array, which acts like a dictionary in other languages. The filtered name of the product is used as the key and the ID is the value.

$cache = array(
  'intel core 2 quad q9550' => 12345,
  'dell inspiron 15' => 12346,
  'microsoft 600 keyboard' => 12347,
);

It is also passed the name of the item to match.

function get_matching_product(&$cache, $name) {
  $stripped = strip_string(strtolower($name));

  // Break up the name at spaces and convert it into an array.
  $fragments = explode(' ', $stripped);

  while (count($fragments) > 2) {
    // Convert the array back into a string
    $joint = implode(' ', $fragments);

    // If there is a match for the exact name
    if (!empty($cache[$joint])) {
      return $cache[$joint];
    }

    // Pop the last item off the array
    array_pop($fragments);
  }

  return NULL;
}

Method 2: Exact Model Matching

If you are fortunate for the incoming product data to include a model number somewhere, use it!

function get_matching_product(&$cache, $model) {
  $stripped = strip_string(strtolower($model));

  // Dont bother if its less than 4 characters long. Too many false positives.
  if (strlen($stripped) < 4) {
    return null;
  }

  // Search our cache
  foreach ($cache as $key => $id) {
    // Find model number string within name
    if (strpos($key, $model) !== FALSE) {
      return $id;
    }
  }

  return null;
}

Method 3: Fragment count matching

If we cant find an exact match, we should attempt to do a best effort match rating it against success threshold such as 75%.

In this method we:

  • Split the clean test name and clean cached name by spaces
  • Intersect the arrays/dictionaries
  • Count the number of intersected values
  • Count the total number of fragments in the cached name array
  • Check the matching ratio and determine a pass/fail criteria

This method will match something like

edimax br6574n wireless 80211bgn broadband router with 4 gigabit ports switch

with something like

edimax nmax wireless 80211n gigabit broadband router br6574n

Even though the fragments are in a different order, they can still be contained within the other name. Counting up the coloured matches (edimax, br6574n, wireless, broadband, router and gigabit), we find there are 6 matches from the intersection out of the possible 8 fragments, which is exactly 75%.

function get_matching_product(&$cache, $model) {
  $stripped = strip_string(strtolower($model));
  $fragments = explode(' ', $stripped);

  // Search our cache
  foreach ($cache as $key => $id) {
    $title = explode(' ', $key);
    $matches = array_intersect($title, $fragments);

    // Count
    $m = count($matches);
    $t = count($title);

    // Match at least 75% of the original title
    if (($m > 1) && (($m / $t) * 100) >= 75) {
      return $id;
    }
  }

  return null;
}

Method 4: Levenshtein distance

This algorithm processes the number of insertions, replacements and deletions required to make one string the same as another and returns the "distance" from one string to another. The distance is an indication on how different the strings are.

Calculating the difference between two words can be a costly process, so I didn't bother with this method.

If you are interested, the PHP function is built into levenshtein(). Another function of interest is similar_text().

Success Rate?

Using the first 3 methods explained, we've matched approximately 2,000 of our products out of 4,000 in the cache against 15,000 external products.

The methods used here are fairly simple but cooked up without much regard to efficiency.

If you have more success with another method, let me know. It'd be interesting to see what other methods there are.

 
Copyright © Twig's Tech Tips
Theme by BloggerThemes & TopWPThemes Sponsored by iBlogtoBlog