Finding Likely Duplicate Strings
I have an existing database of contact information for contacts at specified offices across the country (a "lead" list, if you will). The database contains fields such as first name, last name, etc. To refresh it with current information, I did some manual research and data logging and compiled a new, separate data set of current contact information for contacts at the same offices.
When updating the existing database with the new data, I've noticed that I create "duplicate" contact records quite often. The updating algorithm simply looks for an exact match between the contact's name in the new data set and the contact's name in the existing database. The algorithm concludes that "Gregory Smith" is not in the database because there is no exact match, but on closer inspection "Gregory" IS already in the database as "Greg Smith".
Instead of manually looking through the database as I update the data and "de-duping" things myself, I was wondering whether there is a MATLAB function that can compare two strings and return how likely it is that they are the same. For example, the computer would flag "Gregory Smith" when the database already contains "Greg Smith". Having the computer do this kind of preprocessing would save a lot of time. Any help would be greatly appreciated. Thanks.
1 Comment
Zachary Messaglia
on 7 May 2018
Were you able to solve this?
Answers (1)
Jan
on 12 Mar 2014
0 votes
A good first step is to search the File Exchange.
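To illustrate the kind of string comparison asked about, here is a minimal sketch in base MATLAB that turns the Levenshtein edit distance into a similarity score between 0 and 1. It is not taken from any File Exchange submission. The function name `nameSimilarity`, the lowercasing and trimming step, and the normalization by the longer string's length are all assumptions made for this example.

```matlab
function score = nameSimilarity(a, b)
% nameSimilarity  Hypothetical helper (not a built-in MATLAB function).
% Returns a similarity in [0, 1] based on the Levenshtein edit distance;
% 1 means the two names are identical after lowercasing and trimming.
a = lower(strtrim(char(a)));
b = lower(strtrim(char(b)));
m = numel(a);
n = numel(b);
if max(m, n) == 0
    score = 1;              % two empty names are treated as identical
    return
end
% D(i+1, j+1) holds the edit distance between a(1:i) and b(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = (0:m)';           % turning a(1:i) into '' takes i deletions
D(1, :) = 0:n;              % turning '' into b(1:j) takes j insertions
for i = 1:m
    for j = 1:n
        cost = double(a(i) ~= b(j));    % 0 if the characters match, else 1
        del = D(i, j + 1) + 1;          % delete a(i)
        ins = D(i + 1, j) + 1;          % insert b(j)
        sub = D(i, j) + cost;           % substitute or keep
        D(i + 1, j + 1) = min([del, ins, sub]);
    end
end
% Normalize by the longer name so the score does not depend on length.
score = 1 - D(m + 1, n + 1) / max(m, n);
end
```

Calling `nameSimilarity('Gregory Smith', 'Greg Smith')` gives 1 - 3/13, roughly 0.77, because three deletions turn one name into the other. A cutoff around 0.7 would flag this pair for manual review, but that value is only a guess and should be tuned on real data. Edit distance does not recognize nicknames that share few letters (for example "Bill" and "William"), so those would need a separate lookup table.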