Database-Box: 💢Using Fuzzy Matching in Oracle Database 23c💢

💢Using Fuzzy Matching

Alireza Kamrani

09/14/2024

Data quality issues plague even the most meticulously maintained databases. Typos, misspellings, and phonetic variations can create duplicate records, hindering analysis and decision-making.

Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance.

My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.

Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.

Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:

smythe -> smithe -> smith

And from "smote" to "smith" in two steps:

smote -> smite -> smith

So the two have the same distance as strings, but as words, they're significantly different.

If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.

What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.

Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.

Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie).

That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary.

That is a lot faster and produces much better results.

Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell.

If that's the case, then the phonetic algorithms are definitely the way to go.

Fortunately, Oracle Database 23c delivers two powerful tools for fuzzy string matching: FUZZY_MATCH and PHONIC_ENCODE.

This article delves into these operators, exploring their potential and providing practical code examples to unlock their power.

Fuzzy Matching in Action: FUZZY_MATCH

Imagine searching for customers named “kamrani” but encountering variations like “kamrani” or “kamranian" or "kamrany".

Here’s where FUZZY_MATCH shines.

It calculates the similarity between two strings using various algorithms, returning a score indicating their closeness.

Higher scores represent greater similarity.

Here’s an example:

SELECT customer_id, name, FUZZY_MATCH('SOUNDEX', name, 'Kamrani') AS match_score

FROM customers;

OUTPUT:

customer_id | name | match_score

----------- | ------------- | -----------

1 | Kamrani | 100

2 | Kamraniy | 80

3 | Kamranii | 90

```

Here, I used the SOUNDEX algorithm, which encodes names based on pronunciation.

Other algorithms available include LEVENSHTEIN (edit distance) and JARO_Winkler (similarity measure).

Phoning it In: PHONIC_ENCODE

Sometimes, variations arise due to pronunciation differences, not spelling errors.

In these cases, PHONIC_ENCODE is your ally.

It converts strings into a phonetic representation, focusing on sound, not character sequence.

For instance, “Kami” and “Cami” might have different spellings but share the same phonetic code, and "kitten" and "sitten", in strings that have small differences in spelling,

allowing you to identify potential duplicates:

SELECT customer_id, name, PHONIC_ENCODE(name) AS phonetic_code

FROM customers;

OUTPUT:

customer_id | name | phonetic_code

----------- | ------------- | -------------

1 | Michael | MKL

2 | Michal | MKL

3 | Micheal | MKL

4 | Chris | KRS

5 | Kris | KRS

```

By comparing phonetic codes, you can efficiently uncover near-duplicate records based on pronunciation similarity.

PL/SQL Support

While FUZZY_MATCH and PHONIC_ENCODE are powerful data quality operators, direct assignment within PL/SQL blocks isn’t currently possible.

DECLARE

my_name VARCHAR2(50);

BEGIN

-- Attempting direct assignment (doesn't work)

my_name := FUZZY_MATCH('SOUNDEX', 'Michael', 'Michal');

END;

...

PLS-00201: identifier 'FUZZY_MATCH' must be declared

...

Fortunately, we can leverage the SELECT … INTO construct to retrieve the desired output from the operator and store it in a PL/SQL variable:

DECLARE

my_name VARCHAR2(50);

BEGIN

-- Select the match score and store it in the variable

SELECT FUZZY_MATCH('SOUNDEX', 'Michael', 'Michal') INTO my_name

FROM DUAL;

END;

Practical Applications

Deduplication: Identify and merge near-duplicate customer records, product entries, or any other textual data.
Data cleansing: Correct typos and misspellings, improving data accuracy and consistency.
Fuzzy search: Enable flexible search functionalities, accommodating spelling variations in queries.

Conclusion

By embracing FUZZY_MATCH and PHONIC_ENCODE, you empower your Oracle Database 23c to handle imperfect data with agility and precision. Explore these tools to enhance data quality, streamline data management, and gain valuable insights from your information assets.

For more info:

Documentation: Data Quality Operators

Alireza Kamrani

09/14/2024

Database-Box

Saturday, September 14, 2024

💢Using Fuzzy Matching in Oracle Database 23c💢

No comments:

Post a Comment

Oracle Hot Block Concept, Detection and Resolving scenarios

Report Abuse

Labels