Why Your Data Integration Graph Needs Entity Resolution
Entities are the key foundation for building the right data integration graphs. However, the way those entities are resolved is also critical.
What are the benefits of a graph data model?
Graph has historically been a niche data management technology. Typically, graph databases were used for single applications with a data model well suited to graph. In recent years they have also been used for small-scale analytics, applying a limited set of traditional graph techniques such as centrality metrics. However, graph technologies are increasingly seeing enterprise-scale adoption.
Things are changing. Gartner has a strategic planning assumption that by 2025, graph technologies will be used in fully 80% of data and analytics innovations – up from 10% this year. One of the major drivers is that flexible graph data models offer a natural way to interact with connected data, enabling insights and decisions. However, to get real value from graphs, organizations need to solve the challenge that data is siloed across many applications.
Organizations have traditionally tried to do this by forcing data into enterprise data warehouse schemas or structured data lake layers, but these approaches:
Take a long time to design
Are hard to adapt in an agile way as data from new parts of the organization needs to be incorporated.
The right graph data models can enable key data to be onboarded, connected and used, while maintaining clear lineage back to source data.
The difference between context and graph data models
Graph data models are also ideal for representing the complex relationships that are present in the real world.
Many questions aren’t just about who your direct customer is and how they’re interacting with you. They’re also about that customer’s relationships, what they share and how they interact with others – and then about those parties’ relationships in turn, as far out as is useful. Business customers are a great example of this – they often have complex, deep hierarchies, informal relationships through shared directors, supply chains and so on.
But, particularly when integrating data from disparate sources, there is a big difference between simply having data in graph form, and having useful context for analytics, insight and decisions.
Context starts with entities
Firstly, what is an entity in a database?
Context starts with building meaningful graphs that join data up via entities. Entities are representations of people, companies, addresses and many other types of thing in the real world.
However, when you try to load a typical enterprise’s data into a graph database, it won’t join up, because enterprises generally operate as silos. Source data in each silo will have been captured separately by different teams, in different systems. Each will use different identifiers for referencing entities, so information about each real-world entity will be split into islands.
Sometimes, earlier attempts will have been made to unify records, or data will come from systems that have been tightly integrated. In that case, you can use the explicit cross-system IDs to create links between records as they are loaded. But these are rarely complete – they might only apply to a handful of systems within a given area of the enterprise.
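As a rough illustration (in Python, with invented record and field names), linking on load via an explicit shared identifier is essentially just grouping records by that ID:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records from two source systems; "crm_id" stands in for an explicit
# cross-system identifier that only some records carry.
records = [
    {"record_id": "retail-001", "name": "A. Banasiewicz", "crm_id": "C123"},
    {"record_id": "retail-002", "name": "J. Smith", "crm_id": None},
    {"record_id": "sme-001", "name": "Antoinette Banasiewicz", "crm_id": "C123"},
]

# Group records by the explicit identifier where it is present...
by_crm_id = defaultdict(list)
for rec in records:
    if rec["crm_id"]:
        by_crm_id[rec["crm_id"]].append(rec["record_id"])

# ...and emit an edge for every pair of records that share one.
edges = [pair for ids in by_crm_id.values() for pair in combinations(ids, 2)]
print(edges)  # [('retail-001', 'sme-001')] - records without an ID stay isolated
```

Records that never carry a shared identifier simply stay unconnected, which is exactly the sparseness problem described below.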
A bank might have data relating to two customers coming from various CRM and product-specific systems across two divisions, for example a small business bank and a retail bank. Linking records based on the explicit identifiers, for a set of companies and individuals that are closely related in the real world, might look something like this:
In this example, business accounts are siloed. There is some linkage via transactions between two customers, and another customer is completely isolated. But overall, the graph is sparsely linked and lacks context for analysis, data science, or operational decision-making. The bank needs something more complete and more consistent – something that more closely represents the real world.
Finding the real-world entities in your data
To get a more complete view, it’s important to look beyond the explicit IDs at the additional data within the records. Someone eyeballing the data might guess that Antoinette Banasiewicz, born 12/03/80, is the same person as Nettie Banasiewicz, 48 Second Avenue – and when they find a third customer record for Antoinette Banasiewicz, Appt. 2, 48 Second Av., they can confidently link all three. Conversely, a member of staff might also know not to link John Smith, born 1/1/1960, with another record relating to John Smith, born 1/1/60 – because that DOB is a default value in the customer system, and because John Smith is a common name in the UK.
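A toy sketch of that kind of judgement, with hypothetical field names, default values and name lists, might look like this – the point being that the guards a human applies (ignore default dates of birth, discount very common names) have to be made explicit:

```python
# Hypothetical guards mirroring the human judgement above: a shared date of birth
# only counts if it isn't a known system default, and a matching name counts for
# less when it is a very common one.
DEFAULT_DOBS = {"1960-01-01"}
COMMON_NAMES = {"john smith"}
NICKNAMES = {"nettie": "antoinette"}

def normalise_name(name: str) -> str:
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def match_score(a: dict, b: dict) -> int:
    score = 0
    if normalise_name(a["name"]) == normalise_name(b["name"]):
        score += 1 if normalise_name(a["name"]) in COMMON_NAMES else 2
    if a["dob"] == b["dob"] and a["dob"] not in DEFAULT_DOBS:
        score += 2
    return score

print(match_score(
    {"name": "Antoinette Banasiewicz", "dob": "1980-03-12"},
    {"name": "Nettie Banasiewicz", "dob": "1980-03-12"},
))  # 4 - strong evidence these are the same person
print(match_score(
    {"name": "John Smith", "dob": "1960-01-01"},
    {"name": "John Smith", "dob": "1960-01-01"},
))  # 1 - weak: common name, and the shared DOB is a system default
```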
Anyone reconciling data in this way must also deal with inconsistencies in how data is structured, represented and captured. One data source might contain a handful of key attributes, another an overlapping but different set. An address might be represented one way on one system and another way on another – and the actual data values will often disagree. At that point it becomes necessary to specify rules for which values to trust – system A over system B, the most frequently seen value, the most recent record, a combination of all the records, or other variations.
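A minimal sketch of such a trust rule – source priority first, most recent record as the tie-breaker, with invented source and field names – might be:

```python
from datetime import date

# Conflicting values for one attribute, gathered from several source records
# (sources and dates are invented).
candidate_addresses = [
    {"value": "48 Second Avenue", "source": "system_b", "updated": date(2021, 6, 1)},
    {"value": "Appt. 2, 48 Second Av.", "source": "system_a", "updated": date(2020, 3, 15)},
]

# Trust rule: prefer system A over system B, break ties with the most recent record.
SOURCE_PRIORITY = {"system_a": 0, "system_b": 1}

def survive(candidates):
    best = min(
        candidates,
        key=lambda c: (SOURCE_PRIORITY.get(c["source"], 99), -c["updated"].toordinal()),
    )
    return best["value"]

print(survive(candidate_addresses))  # "Appt. 2, 48 Second Av."
```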
The value of Entity Resolution
All this is necessary because traditional technologies like Master Data Management (MDM) generally haven’t managed to deliver a complete and current view of key entities like ‘customer’ that is referenced from or reliably distributed to every system.
Luckily, there is a category of product that does, and it’s called Entity Resolution. It parses, cleans and normalizes data and uses sophisticated Machine Learning and AI models to infer all the different ways of reliably identifying an entity. It clusters together records relating to each entity; compiles a set of attributes for each entity; and finally creates a set of labelled links between entities and source records. It’s dramatically more effective than the traditional record-to-record matching typically used by MDM systems.
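Commercial Entity Resolution does this with trained models at enterprise scale; purely to illustrate the overall shape (normalise, match, cluster, compile attributes, link back to sources), a deliberately naive sketch might look like this:

```python
from collections import defaultdict

records = [
    {"id": "retail-001", "name": "Antoinette Banasiewicz", "dob": "1980-03-12"},
    {"id": "retail-002", "name": "Nettie Banasiewicz", "dob": "1980-03-12"},
    {"id": "sme-001", "name": "ANTOINETTE BANASIEWICZ", "dob": None},
]

# 1. Parse / normalise (real products apply far richer cleansing, nickname
#    handling, address parsing and so on).
def normalise(rec):
    return {**rec, "name": rec["name"].strip().lower()}

# 2. Decide which record pairs match - a naive stand-in for learned match models.
def is_match(a, b):
    same_surname = a["name"].split()[-1] == b["name"].split()[-1]
    same_dob = bool(a["dob"]) and a["dob"] == b["dob"]
    return (a["name"] == b["name"]) or (same_surname and same_dob)

# 3. Cluster matching records into entities with a tiny union-find.
parent = {r["id"]: r["id"] for r in records}
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

norm = [normalise(r) for r in records]
for i, a in enumerate(norm):
    for b in norm[i + 1:]:
        if is_match(a, b):
            parent[find(a["id"])] = find(b["id"])

entities = defaultdict(list)
for rec in norm:
    entities[find(rec["id"])].append(rec)

# 4. Compile attributes per entity and keep labelled links back to source records.
for entity_id, members in entities.items():
    attributes = {
        "names": sorted({m["name"] for m in members}),
        "dob": next((m["dob"] for m in members if m["dob"]), None),
    }
    links = [(entity_id, m["id"], "RESOLVED_FROM") for m in members]
    print(attributes, links)
```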
Rather than trying to link all the source records directly to each other, you can add new entity nodes (shown below in blue), which act as a nexus for linking real-world data together. High quality Entity Resolution means you can link not only your own data, but also high value external data such as corporate registry information that in the past would have been difficult to match reliably. The result is:
Entities created using effective Entity Resolution provide a much richer context, with many additional links. Retaining links to source data allows provenance to be understood and ensures important data isn’t lost. At Quantexa we use hierarchical documents to accurately represent related source data and refer to this approach as a document-entity model. The “full graph” linking all documents and entities is an entity graph.
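As a rough sketch of that shape (using networkx purely for illustration – the node kinds and edge labels here are invented, not Quantexa’s actual model), source records remain in the graph as document nodes, with resolved entity nodes linking them together:

```python
import networkx as nx

g = nx.MultiDiGraph()

# Document nodes: the original source records, kept intact for lineage.
g.add_node("doc:retail-001", kind="document", source="retail_crm")
g.add_node("doc:sme-001", kind="document", source="sme_banking")
g.add_node("doc:registry-555", kind="document", source="corporate_registry")

# Entity nodes produced by Entity Resolution act as the nexus between documents.
g.add_node("ent:antoinette", kind="entity", type="individual")
g.add_node("ent:logicspace", kind="entity", type="business")

# Labelled links from each entity back to the source documents it was resolved from.
g.add_edge("ent:antoinette", "doc:retail-001", label="RESOLVED_FROM")
g.add_edge("ent:antoinette", "doc:sme-001", label="RESOLVED_FROM")
g.add_edge("ent:logicspace", "doc:registry-555", label="RESOLVED_FROM")

# Relationships recorded inside a document become labelled edges to other entities.
g.add_edge("doc:registry-555", "ent:antoinette", label="NAMES_AS_DIRECTOR")

print(g.number_of_nodes(), g.number_of_edges())  # 5 nodes, 4 edges
```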
However, this full graph can be too detailed for easy consumption by users or systems. Transforming (or “projecting”) the graph by grouping together nodes and trimming edges according to rules means that simpler views can be formed. At Quantexa we call this a perspective. Users switch interactively between different perspectives focusing on different types of node and relationship, or to the document-entity view for the richest picture.
A good example of a perspective is an ‘entity-to-entity view’:
This sacrifices detail, but allows us to easily see the wider context:
Antoinette, a retail customer, has been a guarantor for a loan to a business customer she is a director of, LogicSpace Ltd.
Antoinette lives with and sends/receives money from another retail bank customer Jamie. He is a director of AJ Coworking, the parent company of LogicSpace.
Understanding this context required resolving sometimes duplicated data across retail banking, small business banking and external corporate registry sources.
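Continuing the illustrative networkx sketch above, an entity-to-entity perspective can be derived by collapsing the document nodes and joining the entities they connect – a simple stand-in for the rule-driven projections described earlier:

```python
from itertools import combinations
import networkx as nx

def entity_to_entity(doc_entity_graph: nx.MultiDiGraph) -> nx.Graph:
    """Collapse document nodes: two entities become directly linked whenever
    they are both connected to the same document."""
    projection = nx.Graph()
    undirected = doc_entity_graph.to_undirected()
    for node, data in doc_entity_graph.nodes(data=True):
        if data.get("kind") != "document":
            continue
        linked_entities = [
            n for n in undirected.neighbors(node)
            if doc_entity_graph.nodes[n].get("kind") == "entity"
        ]
        # Trim the document node and join the entities it connected.
        for a, b in combinations(linked_entities, 2):
            projection.add_edge(a, b, via=node)
    return projection

# Applied to the document-entity graph "g" from the earlier sketch, this yields a
# single projected edge linking ent:antoinette and ent:logicspace via doc:registry-555:
# entity_to_entity(g).edges(data=True)
```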
Why use graph for integrated Entity Resolution?
Entities are the key foundation for building the right data integration graphs. However, the way those entities are resolved is also critical – this process needs to be accurate even in the face of data quality issues, scalable enough to handle enterprise-wide data, secure, and easy to transition into production for multiple use cases.