Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity within or across data sources. It is also known as data matching: the goal is to decide whether records that look different actually represent the same entity. The task is closely related to entity alignment, which focuses on matching entities between knowledge bases. Entity resolution is an essential step toward a holistic view of multiple data sources, which analysts need in order to work with the data effectively. Its applications are broad, particularly for federal and public sector data sets related to health, transportation, finance, law enforcement, and counterterrorism.
Regardless of industry, most companies work with data that is scattered across multiple systems. Entity resolution helps organizations make better use of this information to enrich their understanding of customers, make sound onboarding decisions, and minimize risk. There are several techniques for resolving entities. The simplest is to check whether a name and a unique identifier (e.g. a Social Security Number) both match.
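The identifier-based check can be sketched in a few lines of Python. This is a minimal illustration rather than a production matcher: the record layout, the field names `name` and `ssn`, and the helper functions `normalize` and `is_exact_match` are assumptions made for this example, and the identifier values are made up.

```python
def normalize(value: str) -> str:
    """Lowercase the value and keep only letters and digits,
    so that '123-45-6789' and '123456789' compare as equal."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def is_exact_match(rec_a: dict, rec_b: dict) -> bool:
    """Treat two records as the same entity only if both the name
    and the unique identifier agree after normalization."""
    return (
        normalize(rec_a["name"]) == normalize(rec_b["name"])
        and normalize(rec_a["ssn"]) == normalize(rec_b["ssn"])
    )


# Hypothetical records from two systems (values are invented).
crm_record = {"name": "Jane Doe", "ssn": "123-45-6789"}
billing_record = {"name": "jane doe ", "ssn": "123456789"}
other_record = {"name": "Jane Doe", "ssn": "987-65-4321"}

same_person = is_exact_match(crm_record, billing_record)
different_person = is_exact_match(crm_record, other_record)
```

Here `same_person` is true because the two records differ only in formatting, while `different_person` is false because the identifiers disagree even though the names agree.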
However, when such unique identifiers are not available, other attributes have to be analyzed. The challenge is complicated by the fact that manually entered data can appear in different formats and is subject to data-entry errors. Scammers can also create multiple false identities or misrepresent certain information to avoid detection. By bringing disparate information together, entity resolution can help thwart this type of fraud by flagging the suspicious patterns it leaves behind. Entity resolution is probabilistic, so no matter how much data you have or how sophisticated your algorithm is, some ambiguity can remain when matching records. Companies use entity resolution to connect different data sources with clean data, detect non-obvious relationships across data silos, and obtain a unified view of their data. This post explores some basic approaches to resolving entities using one tool built for the job, the Python Dedupe library.
Dedupe is a library that uses machine learning to quickly deduplicate and resolve entities in structured data.
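To make the probabilistic idea concrete before relying on a library, here is a minimal sketch of matching without a unique identifier, using only Python's standard `difflib` module. It is not how Dedupe works internally. The field names, the weights, and the 0.8 threshold are arbitrary assumptions chosen for illustration, whereas a library like Dedupe learns comparable parameters from labeled examples.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and decision threshold, for illustration only.
WEIGHTS = {"name": 0.5, "address": 0.3, "city": 0.2}
THRESHOLD = 0.8


def field_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings,
    ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_probable_match(rec_a: dict, rec_b: dict) -> bool:
    """Flag two records as likely the same entity if the score
    reaches the threshold; borderline scores remain ambiguous."""
    return match_score(rec_a, rec_b) >= THRESHOLD


# Invented records: the first two differ by a typo and an abbreviation.
rec_1 = {"name": "Jonathan Smith", "address": "12 Main Street", "city": "Springfield"}
rec_2 = {"name": "Jonathon Smith", "address": "12 Main St.", "city": "Springfield"}
rec_3 = {"name": "Maria Garcia", "address": "98 Oak Avenue", "city": "Portland"}

likely_same = is_probable_match(rec_1, rec_2)
likely_different = is_probable_match(rec_1, rec_3)
```

With these inputs, `rec_1` and `rec_2` score above the threshold despite the spelling and formatting differences, and `rec_1` and `rec_3` do not. The choice of weights and threshold drives the trade-off between missed matches and false matches, which is why hand-tuned rules like these tend to be replaced by learned models.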