Before you can analyze data, you have to collect it. And collecting data is easy, right? Perhaps, at least until you do try to analyze it and you find out your data is actually just a big pile of mush. Without rules, strategy, and discipline you are better off collecting nothing at all. But it doesn’t have to be like this.
It is beyond the scope of a blog post to advise on how to structure your data, but I can offer some anecdotes drawn from real-life experience to at least point you in the right direction and highlight some of the problems you can face.
The single biggest problem in data collection is allowing free-form data entry. The second biggest may well be not allowing free-form data entry, which makes this a complicated problem. Some compromise is required, along with a thorough understanding of what you actually plan on doing with the data.
A very simple example comes with the entry of names and addresses. Let’s say I enter some information about John Smith, who lives at 32 Anywhere Lane. Then let’s say you enter some information about Jay Smith, who lives at 32 Anywhere Ln. Of course, this is the same person. John goes by the nickname Jay: I asked him for his legal name (John), and you asked him for his name (Jay, as he is called by his friends). That name mismatch is the first problem. The second is that I wrote out the word “Lane” and you used the common abbreviation “Ln.” Same thing, but not to a computer search. As far as the computer is concerned, these are two very different people.
The problem of Lane vs. Ln can be solved by a set of rules that translate one to the other, which standardizes the data. Or we can force people to select the type of road from a pull-down, although that presents a problem when we run into oddball road descriptions, and pull-downs are usually slower to use than just typing. We have a bigger problem with Jay and John. Perhaps we have a list of common nicknames, or we can decide that these two people, having the same address (which we have now corrected), might be the same and at least flag the pair for examination. We might be able to get fancy and refer to any number of databases that include names and addresses and find out that John, Cindy, and Tom are the only residents at 32 Anywhere Lane, so chances are even higher that John is Jay, unless Jay is actually John’s brother who has moved in with him. Unless we can pick up the nickname from some other database, the best we can do is flag this as a possible match. But until it is fixed, we have two people in our database when there is really only one in real life.
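To make that a little more concrete, here is a rough sketch in Python of the two steps: standardize the address with a small rule table, then flag records that share a last name and a standardized address. The suffix table, the field names, and both function names are made up for illustration, and a real suffix table would be far longer.

```python
import re

# Hypothetical rule table for illustration; a real one would be much longer
# and would have to cope with ambiguous cases such as "St" meaning "Saint".
STREET_SUFFIXES = {"ln": "lane", "st": "street", "rd": "road", "ave": "avenue"}


def normalize_address(address):
    """Lowercase, drop punctuation, and expand known suffix abbreviations."""
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(STREET_SUFFIXES.get(word, word) for word in words)


def possible_duplicates(records):
    """Return pairs of records sharing a last name and a normalized address.

    The pairs are only candidates for a human to examine, not confirmed matches.
    """
    seen = {}
    flagged = []
    for record in records:
        key = (record["last"].lower(), normalize_address(record["address"]))
        for earlier in seen.get(key, []):
            flagged.append((earlier, record))
        seen.setdefault(key, []).append(record)
    return flagged


records = [
    {"first": "John", "last": "Smith", "address": "32 Anywhere Lane"},
    {"first": "Jay", "last": "Smith", "address": "32 Anywhere Ln."},
]

for earlier, later in possible_duplicates(records):
    print("Possible match:", earlier["first"], "and", later["first"])
```

Notice that the sketch only flags John and Jay as a pair worth a look. It makes no attempt to decide whether Jay is John or John’s brother.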
Things get really fun when we look at other information like job titles. I once examined a list of job titles for people who purchased a certain product. There were really only about 10 different job functions at most, and those names were the most common, but after I removed those 10 common titles I was left with over 400 other titles to sort through. And these 400 titles were basically all variants on the same 10 job functions. Those 10 titles, by the way, also represented different management levels, so really there were only about 4 distinct titles plus a few different management levels. This can present a real problem if you haven’t done your homework up front.
For fields like title (or pretty much any other attribute) you can either let the user enter free-form text or force them to select from a predefined list. The former can present a real problem, but the entries will be largely accurate. I say “largely” because here, too, someone can enter “Business Development Manager” once and then later enter “Biz Dev Mgr”. Perhaps you can create rules to resolve these two, but abbreviations can be even more cryptic and difficult to decipher, especially when an abbreviation can resolve to several different things. And that’s the easy part. What do you do when you get a title that reads “Chief Evangelical Officer” or “Content Scrum Director”? Note: these are both real titles…
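Here is a minimal sketch in Python of what such title rules might look like. The abbreviation table and the list of known titles are invented for illustration. Anything the rules cannot resolve to a known title is marked for review rather than guessed at.

```python
# Hypothetical abbreviation rules. Note the ambiguity problem: "dev" is mapped
# to "development" here, but it could just as easily mean "developer".
TITLE_ABBREVIATIONS = {
    "biz": "business",
    "dev": "development",
    "mgr": "manager",
    "dir": "director",
}

# Hypothetical list of titles we already know how to analyze.
KNOWN_TITLES = {"business development manager", "marketing manager"}


def normalize_title(title):
    """Lowercase the title, drop periods, and expand known abbreviations."""
    words = title.lower().replace(".", "").split()
    return " ".join(TITLE_ABBREVIATIONS.get(word, word) for word in words)


def classify_title(title):
    """Return (normalized_title, needs_review)."""
    normalized = normalize_title(title)
    return normalized, normalized not in KNOWN_TITLES


for raw in ["Business Development Manager", "Biz Dev Mgr", "Chief Evangelical Officer"]:
    print(raw, "->", classify_title(raw))
```

The first two entries resolve to the same title, while “Chief Evangelical Officer” comes back flagged for a human to sort out.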
These problems all go away when you force the user to select from a list, but that also creates new problems. There is a definite loss of resolution in the information because you are forcing the user to select an attribute that is a “best fit” rather than one that is accurate. An “Asia/PAC Field Marketing Manager” might have to settle for “Field Marketing Manager” or even “Marketing Manager”. Both choices lose information, and that might be fine if you want to analyze data for all marketing managers, but what happens when you want to parse for people supporting Asia/PAC? This person might be based in CA, not in Asia, and with this loss of granularity in the title you have now lost the ability to correctly sort that entry.
A second problem happens when you need to combine data from two different sources that used different selection lists. What happens when you acquire a new company and you track level as “People Manager” vs. “Individual Contributor” but they track “Manager”, “Director”, and “Vice President”? “People Manager” tells you nothing about what level the person is, and “Manager”, “Director”, and even “Vice President” do not always indicate that the person is a people manager. The combined data is essentially useless for this attribute.
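A small Python sketch shows why the merge fails. It assumes a combined record with two hypothetical fields, "manages_people" and "rank", and fills in only what each source list can actually tell us, leaving the rest as None.

```python
def from_our_list(level):
    """Our list says whether someone manages people, but nothing about rank."""
    return {"manages_people": level == "People Manager", "rank": None}


def from_their_list(level):
    """Their list gives a rank, but says nothing about direct reports."""
    return {"manages_people": None, "rank": level}


combined = [
    from_our_list("People Manager"),
    from_our_list("Individual Contributor"),
    from_their_list("Manager"),
    from_their_list("Director"),
    from_their_list("Vice President"),
]

# Keep only the rows where both fields are known after the merge.
fully_known = [row for row in combined if None not in row.values()]
print(len(fully_known))  # prints 0
```

Every row is missing one field or the other, so no analysis that needs both fields can use the combined data.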
There is no right answer to all of this, but I hope these examples highlight the importance of planning ahead. What data you collect and how you collect it should reflect what you plan on doing with it in the future. Think of all the types of analysis you might run on it at any point, and then design your data collection plan. The time to do this is before you start collecting data, not when you have 10,000 entries and find out that all that data is actually useless for what you want to do.
One final thought: the more data you collect, the safer you are later. If, for example, you ask every question you can think of, and capture free-form entries while also making people pick from lists, you will have the safest data for future unanticipated uses. But you also need to think about your users’ tolerance for entering data. If you are at a trade show and are trying to grab data from someone, they are only going to be willing to spend so much time answering questions before they time out. The alternative is to settle for the data that the trade show itself collects when you scan a badge, but here you will certainly have the problem of their fields only loosely matching your fields. What fun!