FuzzyWuzzy is a String Matching Python Library

Some of my notes about the Python library FuzzyWuzzy.


FuzzyWuzzy uses the Levenshtein distance to calculate the differences between sequences, wrapped in a simple-to-use package.
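To make the idea of Levenshtein distance concrete, here is a minimal sketch of the textbook dynamic-programming version. It is only an illustration: the function name `levenshtein` is my own, and the library's actual scoring differs in detail, since it reports a 0 to 100 similarity ratio rather than a raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('caffè espresso', 'caffe espresso'))
```

A single accented character counts as one substitution, which is why near-identical strings still score highly.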

Alex Volkov gave a great talk on the library at last night's PythonToronto meeting.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

I'm interested in using it to match strings that contain Latin accents.

fuzz.ratio('caffè espresso', 'caffe espresso')

fuzz.partial_ratio('caffè espresso', 'caffe espresso')
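If accents should not count as differences at all, one option is to strip them before comparing. The sketch below uses only the standard library's `unicodedata` module. The helper name `strip_accents` is my own and is not part of FuzzyWuzzy.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep everything else.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('caffè espresso'))
```

Both strings could be passed through a helper like this before being handed to `fuzz.ratio`, so that 'caffè' and 'caffe' compare as identical.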

It could also help with oddly formatted company names during a duplicate check.

fuzz.token_sort_ratio('01234567 Ontario Inc - Company Name', 'Company Name (01234567 Ontario Inc)')

fuzz.token_sort_ratio('01234567 Ontario Inc - Company Name', 'Company Name (01234567 Ont Inc)')

fuzz.partial_ratio('ABC Corp.', 'ABC Corporation')

fuzz.partial_ratio('ABC Corp.', 'ABC Inc.')

Generally the above wouldn't matter, as you would strip the business entity suffix (Corp., Inc., and so on) before the duplicate check. The comparison would then look more like the one below.

fuzz.partial_ratio('Dell Canada', 'Dell')
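Stripping the business entity before comparing could look roughly like this. It is a sketch under assumptions: the suffix list is a small hypothetical sample rather than a complete list of legal entity types, and `strip_entity` is a name I made up for illustration.

```python
import re

# Hypothetical, incomplete sample of entity suffixes. Longer forms come
# first so that 'corporation' is tried before 'corp'.
ENTITY_SUFFIXES = ('incorporated', 'corporation', 'limited', 'corp', 'inc', 'ltd')

# Match one trailing suffix, with an optional period, preceded by
# whitespace or a comma.
_SUFFIX_RE = re.compile(
    r'[\s,]+(?:' + '|'.join(ENTITY_SUFFIXES) + r')\.?$',
    re.IGNORECASE,
)


def strip_entity(name: str) -> str:
    """Remove a single trailing business entity suffix, if present."""
    return _SUFFIX_RE.sub('', name.strip())


print(strip_entity('ABC Corp.'))
print(strip_entity('ABC Corporation'))
```

With the suffix gone, 'ABC Corp.' and 'ABC Inc.' both reduce to 'ABC', so the fuzzy comparison only sees the part of the name that matters.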

It would also be interesting to use it for matching the provinces and territories of Canada.

provinces = ["Ontario", "Quebec", "Nova Scotia", "New Brunswick", "Manitoba",
              "British Columbia", "Prince Edward Island", "Saskatchewan",
              "Alberta", "Newfoundland and Labrador", "Northwest Territories",
              "Yukon", "Nunavut"]

process.extractOne('newfoundland', provinces)
('Newfoundland and Labrador', 90)

process.extractOne('Québec', provinces) # I'm sorry Francophones but it seems `process` doesn't work with accents.
('Quebec', 91)

process.extractOne('On', provinces)
('Ontario', 90)

process.extractOne('PEI', provinces)
('Prince Edward Island', 60)

process.extract('BC', provinces, limit=5)
[('Quebec', 50),
 ('Nova Scotia', 45),
 ('New Brunswick', 45),
 ('Manitoba', 45),
 ('British Columbia', 45)]

It's pretty bad at matching postal abbreviations, but that would probably be better handled by a separate function.
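A separate function for abbreviations could be as simple as an exact dictionary lookup that runs before any fuzzy matching. The sketch below uses the standard two-letter Canada Post codes. The names `POSTAL_ABBREVIATIONS` and `expand_abbreviation` are my own.

```python
# Two-letter Canada Post abbreviations for the provinces and territories.
POSTAL_ABBREVIATIONS = {
    'AB': 'Alberta',
    'BC': 'British Columbia',
    'MB': 'Manitoba',
    'NB': 'New Brunswick',
    'NL': 'Newfoundland and Labrador',
    'NS': 'Nova Scotia',
    'NT': 'Northwest Territories',
    'NU': 'Nunavut',
    'ON': 'Ontario',
    'PE': 'Prince Edward Island',
    'QC': 'Quebec',
    'SK': 'Saskatchewan',
    'YT': 'Yukon',
}


def expand_abbreviation(text):
    """Return the full name for a postal abbreviation, or None if unknown."""
    # Normalize input such as ' b.c. ' to 'BC' before the lookup.
    key = text.strip().replace('.', '').upper()
    return POSTAL_ABBREVIATIONS.get(key)


print(expand_abbreviation('BC'))
```

When the lookup returns `None`, the input can fall through to `process.extractOne` as before. Informal forms such as 'PEI' are not official codes, so they would need their own entries or would be left to the fuzzy match.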