Everybody has a slightly different way of naming things. For example, I like to name my music as "Artist - Title", while others prefer "(Artist) Title" or even "Title - Artist".
When it comes to matching product names or models, it gets even worse as some lists will include hyphens in the product names while other lists exclude them.
For example, the "Kingston 128MB PC133" RAM stick has the model name "KVR133X64C3Q-128", but some lists drop the "-" and display it as "KVR133X64C3Q128".
Sometimes it gets even more ridiculous. I had to match products from our internal product database at my workplace to an external feed which had product names such as "Sony 26 KLHW26 Commercial LCD TV Hotel Model, 1366x768 Resolution, 1300:1, HDMI, Component, S-Video, Composite (KLH-W26)".
I haven't been able to find any one-hit-wonder method to get this working, and it's quite CPU intensive for larger catalogues of information.
Instead, several methods are used together in an attempt to find the closest match, with the more precise or efficient methods at the top of the stack since they hit more often.
Examples are written in PHP since it's fairly simple to translate into other languages.
    function find_matching_product() {
        if ($id = method1()) { return $id; }
        if ($id = method2()) { return $id; }

        return null;
    }
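The same "first hit wins" idea can also be written as a loop over a list of matchers, which makes it easier to reorder or add methods. This is only a sketch: find_first_match() and the two stand-in matchers are hypothetical names made up for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: try each matcher in order and return the first hit.
function find_first_match(array $matchers, string $name) {
    foreach ($matchers as $matcher) {
        $id = $matcher($name);
        if ($id !== null) {
            return $id;
        }
    }
    return null;
}

// Stand-in matchers for illustration only; real ones would query the cache.
$exact = function (string $name) {
    return $name === 'dell inspiron 15' ? 12346 : null;
};
$fallback = function (string $name) {
    return strpos($name, 'q9550') !== false ? 12345 : null;
};

$matchers = [$exact, $fallback];
var_dump(find_first_match($matchers, 'intel core 2 quad q9550')); // int(12345)
```

Comparing against null rather than relying on truthiness means a product ID of 0 would still count as a hit.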
Prior to performing any matches, it is a good idea to lowercase all the names and strip any characters that are not letters, digits or spaces. This will simplify your problems a great deal.
Intel Core 2 Quad Q9550, 2.83GHz, Quad Core, Socket LGA775, 95W TDP, 1333MHz FSB, 2x6MB L2 Cache, Boxed, Yorkfield (BX80569Q9550)
Should become:
intel core 2 quad q9550 283ghz quad core socket lga775 95w tdp 1333mhz fsb 2x6mb l2 cache boxed yorkfield bx80569q9550
Using PHP and regex, it should look something like this:
    // Keep only digits, lowercase letters and spaces.
    // The input must already be lowercased (see strtolower()).
    function strip_string($string) {
        return preg_replace('/[^\da-z ]/', "", $string);
    }
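One thing to watch with this pattern is that it only keeps lowercase letters, so the order of the two steps matters. The short example below repeats strip_string() so it runs on its own, and uses a shortened version of the Intel name as input.

```php
<?php
function strip_string($string) {
    return preg_replace('/[^\da-z ]/', "", $string);
}

$raw = 'Intel Core 2 Quad Q9550, 2.83GHz';

// Stripping without lowercasing first silently deletes the capital letters.
echo strip_string($raw), "\n";             // ntel ore 2 uad 9550 283z

// Lowercasing first gives the intended result.
echo strip_string(strtolower($raw)), "\n"; // intel core 2 quad q9550 283ghz
```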
Another point to take note of is the use of the non-breaking space (character code 160) as a separator. Replace it with regular spaces as early as possible!
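A minimal sketch of that replacement, assuming the incoming names are UTF-8 encoded (in Latin-1 data the same character is the single byte "\xA0" instead). The helper name normalise_spaces() is made up for this example, and the "\u{...}" escape needs PHP 7 or later. It also collapses repeated spaces, since stripping punctuation from something like "Artist - Title" leaves a double space that would otherwise produce an empty fragment when the name is split.

```php
<?php
// Hypothetical helper; assumes $text is UTF-8 encoded.
function normalise_spaces(string $text): string {
    // "\u{00A0}" is the non-breaking space (code point 160) in UTF-8.
    $text = str_replace("\u{00A0}", ' ', $text);
    // Collapse runs of spaces so explode(' ', ...) yields no empty fragments.
    $text = preg_replace('/ +/', ' ', $text);
    return trim($text);
}

echo normalise_spaces("Sony\u{00A0}26  KLHW26 "), "\n"; // Sony 26 KLHW26
```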
Method 1: Strip back the longer title
This is a fairly quick and simple one. Trim words off the end of the longer name (split on spaces) until there is a match. It works well while at least 4 words remain; below that, the false positive rate is fairly high.
Name A: intel core 2 quad q9550
Name B: intel core 2 quad q9550 283ghz quad core socket lga775 95w tdp 1333mhz fsb 2x6mb l2 cache boxed yorkfield bx80569q9550
The extra words at the end of Name B (everything after "q9550") are trimmed off one at a time until what remains matches "intel core 2 quad q9550".
The following function is passed this PHP array, which acts like a dictionary in other languages. The filtered name of the product is used as the key and the ID is the value.
    $cache = array(
        'intel core 2 quad q9550' => 12345,
        'dell inspiron 15' => 12346,
        'microsoft 600 keyboard' => 12347,
    );
It is also passed the name of the item to match.
    function get_matching_product(&$cache, $name) {
        $stripped = strip_string(strtolower($name));

        // Split the cleaned name into words
        $fragments = explode(' ', $stripped);

        while (count($fragments) > 2) {
            // Rebuild the name from the words that are left
            $joint = implode(' ', $fragments);

            // Return the ID if the shortened name is in the cache
            if (!empty($cache[$joint])) {
                return $cache[$joint];
            }

            // No match yet, so drop the last word and try again
            array_pop($fragments);
        }

        return NULL;
    }
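To make the trimming easier to picture, the sketch below lists the names that get looked up, longest first. The helper truncation_candidates() is a made-up name for illustration, and the input is assumed to be lowercased and stripped already.

```php
<?php
// Hypothetical helper: list the prefixes Method 1 would look up, longest first.
function truncation_candidates(string $cleanName, int $minWords = 3): array {
    $fragments = explode(' ', $cleanName);
    $candidates = [];
    while (count($fragments) >= $minWords) {
        $candidates[] = implode(' ', $fragments);
        array_pop($fragments);
    }
    return $candidates;
}

$cache = ['intel core 2 quad q9550' => 12345];
$name  = 'intel core 2 quad q9550 283ghz quad core';

foreach (truncation_candidates($name) as $candidate) {
    if (isset($cache[$candidate])) {
        echo "matched '$candidate' => {$cache[$candidate]}\n";
        break;
    }
}
// matched 'intel core 2 quad q9550' => 12345
```

Raising $minWords is one way to trade fewer matches for fewer false positives on very short names.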
Method 2: Exact Model Matching
If you are fortunate enough that the incoming product data includes a model number somewhere, use it!
    function get_matching_product(&$cache, $model) {
        $stripped = strip_string(strtolower($model));

        // Very short model numbers match far too many names
        if (strlen($stripped) < 4) {
            return null;
        }

        foreach ($cache as $key => $id) {
            // Search for the cleaned model number, since the cache keys are cleaned too
            if (strpos($key, $stripped) !== FALSE) {
                return $id;
            }
        }

        return null;
    }
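As a quick illustration of why the model number is cleaned before searching, here is a self-contained sketch using the Kingston model from earlier. The function names, cache contents and IDs are all made up for the example.

```php
<?php
// Hypothetical cleaner mirroring the lowercase-then-strip step.
function clean_model(string $model): string {
    return preg_replace('/[^\da-z ]/', '', strtolower($model));
}

// Hypothetical cache: cleaned product name => product ID (IDs are made up).
$cache = [
    'kingston 128mb pc133 kvr133x64c3q128' => 20001,
    'intel core 2 quad q9550 bx80569q9550' => 20002,
];

function find_by_model(array $cache, string $model) {
    $needle = clean_model($model);
    if (strlen($needle) < 4) {
        return null; // too short to be a reliable model number
    }
    foreach ($cache as $key => $id) {
        if (strpos($key, $needle) !== false) {
            return $id;
        }
    }
    return null;
}

var_dump(find_by_model($cache, 'KVR133X64C3Q-128')); // int(20001)
var_dump(find_by_model($cache, 'KVR133X64C3Q128'));  // int(20001)
```

Both spellings of the model number land on the same product because the hyphen is removed before the search.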
Method 3: Fragment count matching
If we can't find an exact match, we should attempt a best-effort match and rate it against a success threshold such as 75%.
In this method we:
- Split the clean test name and clean cached name by spaces
- Intersect the arrays/dictionaries
- Count the number of intersected values
- Count the total number of fragments in the cached name array
- Calculate the matching ratio and pass or fail it against the threshold
This method will match something like
edimax br6574n wireless 80211bgn broadband router with 4 gigabit ports switch
with something like
edimax nmax wireless 80211n gigabit broadband router br6574n
Even though the fragments are in a different order, they can still be contained within the other name. Counting up the fragments that appear in both names (edimax, br6574n, wireless, broadband, router and gigabit), we find there are 6 matches from the intersection out of the 8 fragments in the second (cached) name, which is exactly 75%.
    function get_matching_product(&$cache, $model) {
        $stripped = strip_string(strtolower($model));
        $fragments = explode(' ', $stripped);

        foreach ($cache as $key => $id) {
            $title = explode(' ', $key);
            $matches = array_intersect($title, $fragments);

            // Matched fragments versus total fragments in the cached name
            $m = count($matches);
            $t = count($title);

            // Require more than one match and at least a 75% ratio
            if (($m > 1) && (($m / $t) * 100) >= 75) {
                return $id;
            }
        }

        return null;
    }
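The Edimax example can be checked by hand with the same calls. This standalone sketch uses the two names exactly as written above and treats the second one as the cached name.

```php
<?php
// Names taken from the example above, already lowercased and stripped.
$incoming = 'edimax br6574n wireless 80211bgn broadband router with 4 gigabit ports switch';
$cached   = 'edimax nmax wireless 80211n gigabit broadband router br6574n';

$fragments = explode(' ', $incoming);
$title     = explode(' ', $cached);

// array_intersect keeps the values of $title that also appear in $fragments.
$matches = array_values(array_intersect($title, $fragments));

$m = count($matches);      // shared fragments
$t = count($title);        // fragments in the cached name
$ratio = ($m / $t) * 100;  // percentage of the cached name that was matched

echo implode(', ', $matches), "\n"; // edimax, wireless, gigabit, broadband, router, br6574n
echo "$m / $t = $ratio%\n";         // 6 / 8 = 75%
```

Note that the ratio is measured against the cached name, so the comparison is not symmetric: dividing by the 11 fragments of the incoming name instead would give roughly 54.5% and fail the threshold.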
Method 4: Levenshtein distance
This algorithm computes the number of insertions, replacements and deletions required to turn one string into another and returns that count as the "distance" between the two strings. The distance is an indication of how different the strings are.
Calculating the difference between two strings can be a costly process, so I didn't bother with this method.
If you are interested, PHP has this built in as levenshtein(). Another function of interest is similar_text().
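For anyone who wants to experiment, here is a rough sketch of how levenshtein() could be used to pick the nearest cached name. It was not part of the matching described in this article; closest_key() is a hypothetical helper, and it assumes PHP 8 or names of at most 255 characters, because older PHP versions return -1 for longer strings.

```php
<?php
// Hypothetical helper: return the cache key with the smallest edit distance.
function closest_key(array $cache, string $cleanName): ?string {
    $bestKey = null;
    $bestDistance = PHP_INT_MAX;
    foreach (array_keys($cache) as $key) {
        $distance = levenshtein($cleanName, (string) $key);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $bestKey = (string) $key;
        }
    }
    return $bestKey;
}

$cache = [
    'intel core 2 quad q9550' => 12345,
    'dell inspiron 15'        => 12346,
    'microsoft 600 keyboard'  => 12347,
];

echo levenshtein('dell inspirion 15', 'dell inspiron 15'), "\n"; // 1
echo closest_key($cache, 'dell inspirion 15'), "\n";             // dell inspiron 15
```

In practice a maximum acceptable distance would also be needed, otherwise every name matches something. The loop compares the input against every key, which is where the cost mentioned above comes from.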
Success Rate?
Using the first 3 methods explained, we've matched approximately 2,000 of our products out of 4,000 in the cache against 15,000 external products.
The methods used here are fairly simple and were cooked up without much regard for efficiency.
If you have more success with another method, let me know. It'd be interesting to see what other methods there are.