Different datasets have different potential for analysis; they are more or less amenable to having particular tools and techniques applied to bring out, or define, their underlying structure. But how can we tell, or at least guess, when we first cast eyes on a dataset, what analysis potential lies within?
Preliminary Dataset Assessment
The richness of a dataset – how much valuable, interesting, juicy information is held within – isn’t necessarily obvious. More data isn’t automatically better. We might have a gigantic dataset, filled with many data points, which is nonetheless fairly inert from an analysis point of view. For example, perhaps the dataset consists of the names and postal codes of everyone in Canada, plus their shoe sizes. Although we might be able to use this dataset for something interesting (e.g. a shoe advertisement mail campaign), from an analysis point of view it isn’t immediately that exciting.
Conversely, we might have a very small dataset that has a lot of analysis potential. Consider, for example, a dataset containing information about Canadian universities, which includes some statistics for each university along with information about a variety of student quality of life measures and life outcomes of students who have attended the university. The analysis of this dataset could be very interesting, even though the number of data points is relatively small because the total number of universities in Canada is itself relatively small.
Some questions we might ask in order to evaluate the overall analysis potential of a dataset:
- How many objects are being considered (i.e. do we have data on a lot of objects? On a few objects but over a lot of time?)
- Are we looking at a population of objects or a sample of that population?
- Are there a lot of different types (fields) of information collected about each object?
- What is the granularity of the information?
- Is there relatively nuanced information in each field?
- Is there relatively diverse information in each field?
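As a rough first pass at the questions above, we could profile each field: how many records have a value, how many distinct values appear, and how dominant the most common value is. This is only a minimal sketch using the Python standard library. The `profile_fields` helper and the toy records are hypothetical, invented purely for illustration.

```python
from collections import Counter

def profile_fields(records):
    """Summarize each field: filled count, distinct values, share of the top value."""
    fields = sorted({name for record in records for name in record})
    summary = {}
    for name in fields:
        # Ignore missing values so they do not count as a category of their own.
        values = [record[name] for record in records if record.get(name) is not None]
        counts = Counter(values)
        top_share = max(counts.values()) / len(values) if values else 0.0
        summary[name] = {
            "filled": len(values),
            "distinct": len(counts),
            "top_share": round(top_share, 2),
        }
    return summary

# Hypothetical toy records, not real data.
records = [
    {"name": "A", "postal_code": "K1A 0B1", "shoe_size": 9},
    {"name": "B", "postal_code": "K1A 0B1", "shoe_size": 9},
    {"name": "C", "postal_code": "M5V 2T6", "shoe_size": 10},
    {"name": "D", "postal_code": "V6B 1A1", "shoe_size": None},
]

profile = profile_fields(records)
for field, stats in profile.items():
    print(field, stats)
```

A field with very few distinct values, or one where a single value dominates, probably carries little nuance. A field that is mostly empty may not be worth much either.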
I should hasten to add that if it isn’t a particularly rich dataset, that doesn’t mean it’s analytically worthless. There are likely still some interesting and valuable basic analytics options that can be applied (e.g. counting the number of dwellings in each area based on postal code, determining shoe size distribution in Canada, which could then be used to inform stocking decisions). But in such a case, applying sophisticated data mining techniques may be overkill.
On the other hand, if we’ve determined that we may be dealing with a dataset that has a lot of analysis potential, how can we go on to get more specific about possible analysis techniques that could be applied to the data?
Let’s consider a number of popular analysis categories in turn:
Time Series Analysis
Time series analysis involves tracking a change in an object (or objects) over time (as measured at particular moments in time). The goal is to try to discover a relationship (connection) between the change in the object and the passage of time. To determine if a dataset is amenable to this type of analysis the number one question is: Does the data track changes in an object property (or an aggregate property of a group of objects) over time? Proceed only if the answer is yes.
For time series analysis, the amount of data available is also very important. Are there enough data points over the range of time, with high enough granularity, to make patterns over time detectable? Are there enough data points to extract the underlying pattern from the noise? Ten data points over ten years likely won’t cut it, no matter how many objects we have that data for.
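One quick way to check coverage and granularity is to count the observations for an object, measure the span of time they cover, and look at the typical gap between consecutive observations. The sketch below assumes each observation carries a date. The `time_coverage` helper and the yearly dates are hypothetical.

```python
import statistics
from datetime import date

def time_coverage(dates):
    """Return the number of observations, the span in days, and the median gap in days."""
    ordered = sorted(dates)
    # Gap, in days, between each observation and the next one.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    span_days = (ordered[-1] - ordered[0]).days if ordered else 0
    median_gap = statistics.median(gaps) if gaps else None
    return {"points": len(ordered), "span_days": span_days, "median_gap_days": median_gap}

# Hypothetical example: one observation per year for ten years.
yearly = [date(year, 1, 1) for year in range(2010, 2020)]
coverage = time_coverage(yearly)
print(coverage)
```

Ten points with a median gap of about a year is exactly the thin kind of series described above. Any pattern shorter than a year would be invisible at that granularity.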
Classification and Categorization
Classification and categorization involve the binning and labelling of objects. Arguably, they also involve, indirectly, establishing a relationship between the objects put into each bin.
Questions we might ask to assess the classification potential of data:
- Are there any data fields in the dataset that are categorical in nature? Or that can be made categorical in nature?
- Does the application of these categories require some kind of judgement or discernment, or are they simply obvious labels?
- If we had data about new objects of the same kinds coming in, how useful would it be to quickly and automatically categorize these objects?
- If no categories currently exist in this dataset, could we come up with some interesting categories by combining this dataset with another dataset?
- Are there enough objects in the population to make training possible and auto-classification genuinely feasible?
- Can we instead generate interesting categories simply by doing relatively straightforward calculations on existing fields?
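To illustrate the last question, here is a minimal sketch of generating categories with a straightforward calculation on an existing numeric field. The cut points, labels and enrolment figures are all made up for the example, and `bin_value` is a hypothetical helper.

```python
from bisect import bisect_right

def bin_value(value, boundaries, labels):
    """Map a number to a label.

    boundaries are ascending cut points and labels must have one more
    entry than boundaries. A value equal to a cut point goes to the upper bin.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("labels must have exactly one more entry than boundaries")
    return labels[bisect_right(boundaries, value)]

# Hypothetical enrolment figures and cut points, for illustration only.
boundaries = [5000, 20000]
labels = ["small", "medium", "large"]
enrolments = {"U1": 4000, "U2": 5000, "U3": 35000}

categories = {name: bin_value(count, boundaries, labels) for name, count in enrolments.items()}
print(categories)
```

Labels produced this way are simply obvious ones, in the sense of the second question above. They are useful for grouping and reporting, but there is nothing for a classifier to learn.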
Regression Analysis, Multivariate Analysis
In these types of analysis we try to discern whether relationships exist between objects (or object properties) and, if so, describe the nature of those relationships by means of mathematical equations.
Questions we might ask to assess the regression or multivariate analysis potential of data:
- Are the data fields largely numeric?
- Are there a fairly large number of data points?
- Is this a sample of data? Are we interested in drawing conclusions about the population as a whole?
- Are we interested in making predictions about some aspect of a type of object based on our knowledge of another aspect of that object or other related objects?
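As a small illustration of describing a relationship with an equation, the sketch below fits a straight line to two numeric fields using ordinary least squares and reports the Pearson correlation. It covers only the simplest case of one predictor and one outcome. The `fit_line` helper and the numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus the Pearson correlation r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both fields must vary for a line and a correlation to be defined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical values for two numeric fields.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}, r = {r:.2f}")
```

With only five points, a fit like this says very little about a wider population, which is why the number of data points matters so much here.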
Clustering
Clustering is a very simple structuring method, where objects are divided or put into groups (sometimes the groups are then further divided or combined) based on measures of similarities between the objects.
Questions we might ask to assess the clustering potential of data:
- Are there only a small number of objects involved? Unlike with the previous types of analysis, a small number of objects doesn’t rule clustering out, because you can still get interesting clustering results with relatively small datasets.
- Are there lots of fields that are categorical or, more broadly, non-numeric?
- Does it seem like there might be some surprising connections or similarities between objects that aren’t immediately obvious, and could be interesting?
- If we divided the collection of objects into subgroups, could we do something useful with these subgroups?
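To make "measures of similarity" concrete for categorical fields, here is a minimal sketch. It scores two objects by the fraction of fields on which they agree (a simple matching measure), then puts objects in the same group whenever a chain of sufficiently similar pairs connects them (single-linkage grouping). This is just one of many possible clustering approaches. The helpers, the 0.6 threshold and the toy objects are all hypothetical.

```python
from itertools import combinations

def matching_similarity(a, b, fields):
    """Fraction of the listed fields on which two objects hold the same value."""
    return sum(a[f] == b[f] for f in fields) / len(fields)

def group_by_similarity(objects, fields, threshold):
    """Single-linkage grouping: objects joined by a chain of similar pairs share a group."""
    parent = {name: name for name in objects}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for left, right in combinations(objects, 2):
        if matching_similarity(objects[left], objects[right], fields) >= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for name in objects:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical objects described by categorical fields only.
objects = {
    "U1": {"setting": "urban", "language": "English", "size": "large"},
    "U2": {"setting": "urban", "language": "English", "size": "medium"},
    "U3": {"setting": "rural", "language": "French", "size": "small"},
    "U4": {"setting": "rural", "language": "French", "size": "medium"},
}
fields = ["setting", "language", "size"]

groups = group_by_similarity(objects, fields, threshold=0.6)
print(groups)
```

Even with four objects the grouping is readable. The threshold is a judgement call, so it is worth trying a few values to see how stable the groups are.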
Asking and answering questions like the ones above can give you a preliminary idea of the analyses it might be worth running on a particular dataset. It can also help you to give the owner of the dataset some initial information about what they might expect in terms of results.
Of course, a great deal of the analysis potential of a dataset also depends on how clean and valid the data is, so evaluating that is an important next step in any data analysis project. Still, doing a preliminary assessment of the dataset beforehand can often set expectations appropriately and get an analysis project headed in the right direction from the get-go.