Data Quality and the Importance of Reliable Data
Working with unreliable data, particularly user-entered data, can leave us open to bias, which has a more significant impact than mere poor model accuracy.
What is data quality?
Data quality is the measurement of data in terms of accuracy, completeness, consistency, validity, uniqueness and timeliness. These are known as data quality characteristics and enable businesses and organizations to quantify and better manage their data.
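These characteristics are measurable in practice. Below is a minimal sketch of quantifying two of them – completeness and uniqueness – on a toy list of records; the field names and records are illustrative, not from any real dataset.

```python
# Toy records for illustration only.
records = [
    {"name": "Acme Ltd", "country": "GB", "city": "London"},
    {"name": "Acme Ltd", "country": "GB", "city": "London"},   # duplicate
    {"name": "Beta GmbH", "country": "DE", "city": None},      # missing city
    {"name": "Gamma SpA", "country": "IT", "city": "Milan"},
]

def completeness(records, field):
    """Fraction of records where `field` is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def uniqueness(records):
    """Fraction of records that are not exact duplicates of an earlier record."""
    seen = set()
    unique = 0
    for r in records:
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique += 1
    return unique / len(records)

print(completeness(records, "city"))  # 0.75 -- one record lacks a city
print(uniqueness(records))            # 0.75 -- one record is a duplicate
```

Scores like these give a simple baseline for tracking whether a dataset is improving or degrading over time.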
Why is data quality important with machine learning?
Machine learning (ML) is one of the biggest value-adds for businesses, and at the heart of every machine learning challenge is more than just complicated code; it’s the input data. A machine learning model is only as good as the data that goes into it, which is why, in the world of supervised learning, data scientists rely on a set of accurately labelled data points.
Every data scientist’s dream is to work with a trustworthy, quality data source where the target classes are easily distinguishable. For instance, if you are building a classifier to distinguish between photos of cocker spaniels and poodles, then an input dataset of images that have been certified by breeders would be ideal.
If this wasn’t available, you might search the internet for photos of the different breeds, but this could be subject to user entry error and result in some images being mislabelled. There could also be images of certain dogs that don’t quite belong to either breed, like a cockapoo, which could end up getting incorrectly labelled.
Challenges to reliable data
Similar data challenges arose when building the name/business model, a Quantexa project described in an earlier blog post. The primary source of business names was a highly regarded – and clean – company registry; however, this was supplemented with other data sources based on user input. Here are a few of the challenges we encountered:
User entered data
Everyone has, at some point, been led astray by a mobile maps app. Much of the data in these apps will be user-entered, and if the information is enough to fool a human, a machine will certainly struggle.
When determining which country a business is based in, users make some common errors: country codes are often wrong (one Croatian business was labelled as being in Costa Rica because it was given the “CR” country code), and countries that share a border are also subject to confusion (Ireland and the UK being a particularly common pair). To overcome this, writing preprocessing steps that cross-check details such as the city listed in the address can produce a more reliable dataset from the outset.
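One such preprocessing step can be sketched as follows. This is a hedged illustration, not production code: `CITY_COUNTRY` is a stand-in for a real gazetteer lookup, and the records are made up.

```python
# Illustrative city-to-country lookup; a real system would use a gazetteer.
CITY_COUNTRY = {
    "zagreb": "HR",   # Croatia -- often mistyped as "CR" (Costa Rica)
    "dublin": "IE",
    "belfast": "GB",
}

def check_country(record):
    """Return the record, flagged when its city and country code disagree."""
    expected = CITY_COUNTRY.get(record["city"].lower())
    if expected is not None and expected != record["country"]:
        return {**record, "suspect": True, "suggested_country": expected}
    return {**record, "suspect": False}

print(check_country({"city": "Zagreb", "country": "CR"}))
# -> flagged as suspect, with "HR" suggested in place of "CR"
```

Flagged records can then be corrected or excluded before training, rather than silently contaminating the model.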
The ‘Italian restaurant problem’
In many countries, it is common to find organizations named after individuals. In the U.S., many self-employed medical professionals run a business registered under their own name. Often, their medical qualifications are included as a name suffix to make identifying them easier.
This phenomenon is particularly common in Italy, where small businesses – often restaurants or cafés – are named after an individual. These names have no obvious characteristics (such as suffixes for qualifications) to help a model detect that they refer to a business without additional information. This type of data confounding has come to be known internally as “the Italian restaurant problem” – although it is not exclusive to Italy: the same pattern appears in Italian restaurants and other small businesses across the globe.
The ideal solution to a challenge such as this would be to filter these examples out of the training set, but this is difficult to do without using the model we are trying to build in the first place. Alternatives include building a model using data from jurisdictions where the ‘Italian restaurant problem’ is less common and applying this as a first pass.
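The first-pass idea can be sketched as below. This is purely illustrative: `first_pass_score` stands in for a hypothetical model trained on jurisdictions where person-named businesses are rare, faked here with a keyword heuristic, and the records are invented.

```python
def first_pass_score(name):
    """Hypothetical first-pass model: probability that `name` is a business.

    Faked with a keyword heuristic purely for illustration."""
    business_tokens = {"ltd", "llc", "gmbh", "restaurant", "trattoria"}
    tokens = set(name.lower().split())
    return 0.9 if tokens & business_tokens else 0.5

def filter_training_set(records, low=0.4, high=0.6):
    """Drop records the first-pass model is unsure about (crossover risk)."""
    return [r for r in records if not (low < first_pass_score(r["name"]) < high)]

data = [
    {"name": "Mario Rossi Trattoria", "label": "business"},
    {"name": "Mario Rossi", "label": "individual"},  # ambiguous -- dropped
]
print([r["name"] for r in filter_training_set(data)])
# -> only the unambiguous record survives
```

The confidence band (`low`, `high`) controls how aggressively ambiguous examples are removed before the main model is trained.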
The ‘Italian restaurant problem’ is a specific example of “class-confounding,” or “class-crossover,” where the same data points (or feature combinations) can exist in multiple target classes. When building certain models, such as logistic regression, this can cause the training algorithm to struggle to converge, and so must be dealt with carefully. In any model, it imposes an irreducible error floor, causing at least some degree of underfitting.
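The error floor is easy to see on a toy example: when the same feature vector appears under both labels, no classifier can beat predicting the majority label for that vector. The features and counts below are made up for illustration.

```python
from collections import Counter

# (feature_vector, label) pairs; ("mario", "rossi") appears in BOTH classes.
samples = (
    [(("mario", "rossi"), "business")] * 3
    + [(("mario", "rossi"), "individual")] * 2
    + [(("acme", "ltd"), "business")] * 5
)

def best_possible_accuracy(samples):
    """Bayes-optimal accuracy: for each feature vector, predict its majority label."""
    label_counts = {}
    for features, label in samples:
        label_counts.setdefault(features, Counter())[label] += 1
    correct = sum(max(counts.values()) for counts in label_counts.values())
    return correct / len(samples)

print(best_possible_accuracy(samples))  # 0.8 -- the 2 crossover points are unavoidable errors
```

No amount of extra training can push accuracy past this ceiling; only additional features (extra context) that separate the two classes can.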
Redacted datapoints
There are many businesses and people commonly referred to by their initials. Consider the initials “HP” – this could be referring to a large printer brand, a brown sauce, or a student at Hogwarts: two of these entities are businesses, and one is an individual.
Here is another example of class crossover that requires careful consideration: some data points could exist in either class, even if that has not been observed in the training set. “HP” might appear only in the business dataset, but it should still be treated as a potential class-confounder.
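One way to surface such cases is to flag initials-only names up front, regardless of which class they were observed in. The regex and the name list below are illustrative assumptions, not part of the original model.

```python
import re

# Matches 1-4 uppercase initials, each optionally followed by a dot and space.
INITIALS_PATTERN = re.compile(r"^(?:[A-Z]\.?\s?){1,4}$")

def is_potential_confounder(name):
    """True when a name is just initials, so it could belong to either class."""
    return bool(INITIALS_PATTERN.match(name.strip()))

for name in ["HP", "J.K.", "Acme Ltd"]:
    print(name, is_potential_confounder(name))
# "HP" and "J.K." are flagged; "Acme Ltd" is not
```

Flagged names can then be routed for extra context (such as address or registry data) rather than trusted to a single-field classifier.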
Under-represented classes
Another challenge intricately linked to that of class crossover is the existence of data points that have been systematically mislabelled or missed out of the training data altogether.
The history behind the ‘Italian restaurant problem’ stems from a population of enterprising, gastronomic immigrants who set up businesses in their own names. Other datasets carry their own histories, which can cause similar misclassifications.
For example, within financial crime, there exists an array of possible offenses. Some of these have been effectively detected – providing highly valuable modelling data – yet other offenses, despite being of equal or in some instances greater concern, have yet to be found, or proven.
Modelling a single financial crime class will actively down-weight characteristics indicative of the hidden offenses, and only highlight activity that is already being widely addressed. It is therefore important to understand exactly what the target class contains, and to be open to a full breadth of methods including, but not limited to, supervised learning.
Managing unreliable data
As we have seen, unreliable data – particularly user-entered data – leaves us open to bias, which has a more significant impact than mere poor model accuracy. The ‘Italian restaurant problem’ removes an element of control over how the model makes predictions, making bias more likely to creep in: we might end up disproportionately labelling Italian individuals as restaurants, and vice versa.
In these circumstances, the inclusion of additional context becomes key. Applying extra information to weed out unreliable data points in the training set, and to fix problematic ones, makes models more reliable and leads to better quality data.