Pretty much every company is coming to realize the value of its data (thanks @infonomics - the book that taught me this was coming), so what's next?
A lot of companies hit the quality wall. This is where there's so much unregistered or not fully controlled data that its quality becomes questionable purely because one can't identify the original source. As privacy regulations home in on traceability and purpose (after beginning with more content-based restrictions), companies need to step up their quality audit routine. This has to start with good people! Many companies like to jump to technology or reporting solutions, but the companies that make the fastest progress have great team members first!
When searching for someone with data quality principles, keep these qualities in mind:
(1) Attention to detail
(2) Zoomability (zoom between microscopic and big picture thinking)
(3) Interest in data
Then look for technical prowess and logical, creative, communicative thinking!
After you've got the right person for the job, begin by listing your external data deliverables, ordered by the highest-value "product".
Side note: this could be services or real estate or even people - this person should be asking: what does my company value (or consistently put its money into), and why?
Once you have a list of "ordered valuable products" (see my other blog about how everything starts with a list), then it's best to figure out how the data is leaving the building.
Data leaving the building could be as basic as email text and attachments, or as complicated as "double blind black box as a service" type delivery systems. This information helps scope how big a data quality team needs to be.
The first place to put a data quality check is right there as it leaves the virtual "door". In smaller organizations, inserting this check could mean going to your "data person" and saying you would like to add a go/no go step before it leaves the door, subject to these specific rules.
Don't know what rules those should be? Skip ahead one paragraph! For those of you in mid-size organizations, inserting a data quality check may mean integrating yourself into the IT, Data (Analysts, Scientists) or Engineering functions within your organization. For large organizations, there are likely product-specific data quality teams - if you're not already the product owner/manager - go find them!
So what rules should one put in a data quality check? Of course, it depends on what's in your data, but how do you even start to think about it? The first question to ask oneself is: how do I know it's right? Secret baked-in assumptions typically mean that if you pull too hard or too long on a thread, soon the quality sweater is all gone! It's why I came up with BreaktheSystem: I pull and adjust until the thread becomes a diamond corset. Custom-fitted and nearly impenetrable!
So you know the data is right, huh? How? Did you just ask whoever you found earlier? Is it because it comes directly from the source? Excellent! How do they collect their data and how do they check it? It's likely whatever answer you got came from someone who didn't fully know, so ask for proof. What proof, you ask? Well, now you're in the circular expression of some people's hell, also known as data quality.
So focus on:
(1) Change in important fields - is that expected?
(2) Where do the inputs come from? How often do they change, and does that reconcile with (1)?
(3) Are there any empty fields - is that expected?
(4) Are there any columns that are all 0 or NA? If so, do you need them in there, and why? If they're there for a customer, is the customer paying extra to keep them? Can this be a win/win where we stop storing them and cut costs? The customer may replicate on their end or really question whether they need that zeroed column and why - this may garner goodwill with your customer or not, so beware!
(5) Is any data slowing your process down? If so, that's a great place to either start looking to limit data or maybe apply some machine learning or known AI algorithms to see if they can speed up processing.
(6) A generic (or zoomed-out) principle answers the common question: what tool should one use, or which is best? If you're truly 100% in control of choosing the tool, then anything I write here will be out of date as soon as I post this. So, realistically, the answer is the easiest tool that answers your questions and integrates into the delivery process in a repeatable way.
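To make rules (1), (3), and (4) above a bit more concrete, here is a minimal sketch of a go/no go step in plain Python. Everything named in it is a hypothetical stand-in for illustration: the function `quality_report`, the column names (`id`, `price`, `region`, `discount`), and the choice to treat `None`, an empty string, and the text "NA" as empty. Your own rules will depend on what's in your data.

```python
# Illustrative sketch only: names, columns, and rules are assumptions.
EMPTY_MARKERS = (None, "", "NA")


def is_empty(value):
    """Treat None, empty strings, and the text 'NA' as empty."""
    return value in EMPTY_MARKERS


def quality_report(rows, previous_rows, key, important_fields):
    """Return a list of findings; an empty list means 'go'."""
    findings = []

    # Rule (1): flag changes in important fields versus the last delivery.
    previous_by_key = {row[key]: row for row in previous_rows}
    for row in rows:
        old = previous_by_key.get(row[key])
        if old is None:
            continue  # new record, nothing to compare against
        for field in important_fields:
            if row.get(field) != old.get(field):
                findings.append(
                    f"{key}={row[key]}: {field} changed "
                    f"from {old.get(field)!r} to {row.get(field)!r}"
                )

    columns = sorted({col for row in rows for col in row})
    for col in columns:
        values = [row.get(col) for row in rows]
        if all(is_empty(v) or v == 0 for v in values):
            # Rule (4): the whole column is 0 or NA.
            findings.append(f"column {col!r} is entirely 0 or NA")
        else:
            # Rule (3): some, but not all, values are empty.
            empties = sum(is_empty(v) for v in values)
            if empties:
                findings.append(f"column {col!r} has {empties} empty value(s)")
    return findings


previous = [
    {"id": 1, "price": 10.0, "region": "EU", "discount": 0},
    {"id": 2, "price": 20.0, "region": "US", "discount": 0},
]
current = [
    {"id": 1, "price": 10.0, "region": "EU", "discount": 0},
    {"id": 2, "price": 25.0, "region": "", "discount": 0},
]

findings = quality_report(current, previous, key="id", important_fields=["price"])
decision = "no go" if findings else "go"
print(decision)
for finding in findings:
    print(" -", finding)
```

A finding is not automatically an error. Each one is a prompt for the "is that expected?" question, and a person still makes the call before the data leaves the door.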
That's my mini #tedtalk on data quality. Maybe I'll dive into a Privacy Impact Assessment next!