May 7
Aneta Klosek
Levenshtein, Jaro-Winkler, and the Art of Finding the Name You Almost Missed
The Fuzzy Matching Trade-Off: Why the Algorithm Behind Your Sanctions Screening Matters More Than You Think
Sanctions screening sounds straightforward until you try to do it with real data.
The theory is simple: compare a name against a list, flag what matches, review what's flagged. In practice, name data is messy in ways that make exact matching almost immediately useless. "Mohammed," "Muhammad," and "Mohamad" are not rare edge cases; they're everyday examples of how the same person's name can appear in a dozen different forms, depending on transliteration standards, regional conventions, or whoever entered the data.
So organisations turn to fuzzy matching. And that's where things get interesting and, quite honestly, complicated.
Fuzzy Matching Isn't One Thing
When people say a system uses "fuzzy matching," they're often describing a black box. But fuzzy matching is a category, not a solution. The algorithm underneath makes a significant difference in what your system actually catches.
Two algorithms come up repeatedly in sanctions screening:
Levenshtein Distance
Levenshtein distance counts the minimum number of edits (insertions, deletions, and substitutions) needed to turn one string into another. It's intuitive and widely used. If two names differ by one or two characters, the edit distance is low and the system flags them as similar. The limitation is that it treats all edits equally, regardless of where they fall in the name or what they mean.
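To make the edit-counting concrete, here is a minimal sketch of the classic dynamic-programming computation (a textbook implementation, not the one any particular screening vendor uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the edit distance between the current prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # delete from a
                            curr[j - 1] + 1,    # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("Mohammed", "Muhammad"))  # 2: two substitutions
```

Note that the two substitutions here (o→u, e→a) cost exactly as much as two edits anywhere else in the string: the algorithm has no notion of one part of a name being more significant than another.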
Jaro-Winkler Similarity
Jaro-Winkler similarity works differently. It gives extra weight to matching prefixes and handles transpositions more gracefully. In practice, this means it tends to perform better on short strings like personal names, where the start of a name carries more identifying weight than the end.
To make this concrete: "Hussain" versus "Husasin" involves a transposition. Levenshtein sees two edits; Jaro-Winkler is more forgiving because the structure is clearly similar. "Mohamed Ali" versus "Mohamed Aly" — Jaro-Winkler rewards the matching prefix more heavily. These aren't hypothetical differences. They shape which names your system surfaces and which ones it quietly lets through.
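The standard Jaro-Winkler computation can be sketched as follows (a reference implementation of the published formula, with the usual scaling factor p = 0.1 and a prefix bonus capped at four characters; production systems often add further tuning on top):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: match count, length, and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters "match" if equal and within this sliding window.
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that are out of order (transpositions).
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("Hussain", "Husasin"), 3))  # ≈ 0.967
```

Despite Levenshtein counting two edits for "Hussain"/"Husasin", Jaro-Winkler scores the pair at roughly 0.97: the transposition costs little, and the shared "Hus" prefix earns a bonus.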
The Threshold Problem
Choosing an algorithm is only half the decision. The similarity threshold, the score above which a name is flagged, is just as consequential.
Set the threshold too low, and you generate false positives: your team spends time reviewing names that aren't genuine matches. That's a workload problem, and it's visible. People complain about it.
Set the threshold too high, and you get false negatives: real matches that never make it into a review queue. That problem is invisible until it surfaces in an audit or a regulatory inquiry.
This asymmetry creates a structural bias in how systems get tuned. Operational pain is immediate and measurable. Detection gaps are theoretical, right up until they aren't. Organisations optimise for the pain they can see, which means they often underweight the risk they can't.
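A toy example of the effect, using Python's stdlib difflib as a stand-in scorer and hypothetical names (real systems would score with Levenshtein- or Jaro-Winkler-based measures, and would calibrate against actual data):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in similarity score in [0, 1]; the specific scorer
    # doesn't matter for illustrating the threshold trade-off.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

watchlist_name = "Mohammed Hussain"   # hypothetical list entry
candidate = "Muhammad Husain"         # plausible real-world variant

score = similarity(watchlist_name, candidate)
for threshold in (0.80, 0.90, 0.95):
    flagged = score >= threshold
    print(f"threshold={threshold:.2f}  score={score:.3f}  flagged={flagged}")
```

The same genuine variant is flagged at 0.80 but silently passes at 0.90 and 0.95: nothing in the day-to-day workload signals that the stricter setting is dropping true matches.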
How Most Systems End Up the Way They Are
In most organisations, the matching algorithm and threshold weren't actively chosen. They were inherited.
A vendor selected a default approach. A system integrator implemented it during deployment. The logic was never revisited because it appeared to work. Over time, it became part of the operational fabric: accepted, unchallenged, and largely invisible to the people responsible for the output it produces.
This isn't negligence. It's the natural result of how enterprise software gets deployed and maintained. But it does mean that the decisions shaping your sanctions exposure were probably made by someone who was thinking about implementation speed, not your specific risk appetite.
What a Better Approach Looks Like
The goal isn't to find the single "best" algorithm and apply it universally. It's to build a matching strategy that reflects how your data actually behaves and what risks matter most to your organisation.
In practice, that usually means:
- Combining approaches rather than relying on a single algorithm. Different methods surface different things, and layering them reduces the chance of systematic blind spots.
- Calibrating thresholds against real data, not vendor defaults. What counts as "similar enough" should be informed by examples from your own screening environment — including known true positives you can test against.
- Segmenting by context. Names from different regions, languages, or scripts behave differently. A single threshold applied globally will be miscalibrated for at least part of your data.
- Reviewing performance over time. Risk profiles change, data quality changes, and regulatory expectations evolve. A configuration that was reasonable two years ago may not be today.
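The first two points can be sketched together: run several scorers side by side, each with its own calibrated threshold, and flag a name if any of them fires. Everything here is illustrative — the scorer choices, the thresholds, and the names are assumptions, not a recommendation for specific values:

```python
from difflib import SequenceMatcher

def levenshtein(a: str, b: str) -> int:
    # Compact row-by-row dynamic programming for edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    # Normalise edit distance to a 0..1 similarity score.
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest

def ratio_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical per-scorer thresholds; in practice these would be
# calibrated against known true positives from your own data.
SCORERS = [(edit_similarity, 0.85), (ratio_similarity, 0.90)]

def screen(candidate: str, watchlist: list[str]) -> list[str]:
    """Flag a watchlist entry if ANY scorer clears its threshold."""
    hits = []
    for entry in watchlist:
        a, b = candidate.lower(), entry.lower()
        if any(score(a, b) >= thr for score, thr in SCORERS):
            hits.append(entry)
    return hits

print(screen("Mohamed Aly", ["Mohamed Ali", "Maria Lopez"]))
```

Layering like this is deliberately permissive: a name only needs to clear one scorer to reach a reviewer, which trades some extra workload for fewer systematic blind spots.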
Most importantly: someone in your organisation needs to own this. Not just the system, but the logic behind it. Understanding why your screening produces the results it does, and being able to explain that to an auditor or regulator, is increasingly part of what good compliance governance looks like.
The Bottom Line
Fuzzy matching isn't a technical detail that can safely be delegated to a vendor and forgotten. It's a compliance decision with real regulatory consequences. The algorithm you use, and the threshold you set, determine which names get escalated and which ones disappear. In sanctions screening, what you miss matters more than what you catch.
The industry has put enormous effort into escalation workflows, regulatory reporting, and audit trails. The matching logic that feeds all of that deserves the same scrutiny.
If you don't know which algorithm your system is using, or why the threshold is set where it is, that's worth finding out.
Aithea GmbH | Seidlstr. 5 | 80335 München | Germany
Registered at District Court Munich HRB 302338
VAT ID DE454846466 | nanoacademy@ai-thea.com
Copyright AITHEA © 2025