Learning to Eyeball Data

Different datasets have different potential for analysis; they are more or less amenable to having particular tools and techniques applied to bring out, or define, their underlying structure. But how can we tell, or at least guess, when we first cast eyes on a dataset, what analysis potential lies within?

Preliminary Dataset Assessment

The richness of a dataset – how much valuable, interesting, juicy information is held within – isn’t necessarily obvious. More data isn’t automatically better. We might have a gigantic dataset, filled with many datapoints, which is nonetheless fairly inert from an analysis point of view. For example, perhaps the dataset consists of all of the names and postal codes of everyone in Canada, plus their shoe sizes. Although we might be able to use this dataset for something interesting (e.g. a shoe advertisement mail campaign), from an analysis point of view the dataset isn’t immediately that exciting.

Conversely, we might have a very small dataset that has a lot of analysis potential. Consider, for example, a dataset containing information about Canadian universities, which includes some statistics for each university along with information about a variety of student quality of life measures and life outcomes of students who have attended the university. The analysis of this dataset could be very interesting, even though the number of data points is relatively small, simply because the total number of universities in Canada is itself small.

Some questions we might ask in order to evaluate the overall analysis potential of a dataset:

  • How many objects are being considered (i.e. do we have data on a lot of objects? On a few objects but over a lot of time?)
  • Are we looking at a population of objects or a sample of that population?
  • Are there a lot of different types (fields) of information collected about each object?
  • What is the granularity of the information?
  • Is there relatively nuanced information in each field?
  • Is there relatively diverse information in each field?

I should hasten to add that if it isn’t a particularly rich dataset, that doesn’t mean it’s analytically worthless. There are likely still some interesting and valuable basic analytics options that can be applied (e.g. counting the number of dwellings in each area based on postal code, determining shoe size distribution in Canada, which could then be used to inform stocking decisions). But in such a case, applying sophisticated data mining techniques may be overkill.

On the other hand, if we’ve determined that we may be dealing with a dataset that has a lot of analysis potential, how can we go on to get more specific about possible analysis techniques that could be applied to the data?

Let’s consider a number of popular analysis categories in turn:

Time Series Analysis

Time series analysis involves tracking a change in an object (or objects) over time (as measured at particular moments in time). The goal is to try to discover a relationship (connection) between the change in the object and the passage of time. To determine if a dataset is amenable to this type of analysis the number one question is: Does the data track changes in an object property (or an aggregate property of a group of objects) over time? Proceed only if the answer is yes.

For time series analysis, the amount of data available is also very important. Are there enough data points over the range of time, with high enough granularity, to make patterns over time detectable? Are there enough data points to extract the underlying pattern from the noise? 10 data points over 10 years likely won’t cut it, no matter how many objects we have that data for.
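As a quick illustration of the “enough data points” concern, here is a minimal sketch (with invented monthly figures) that fits a least-squares line to a short series, to see whether any trend stands out from the noise:

```python
# A minimal trend check for a time series: fit a straight line by
# ordinary least squares and report the slope per time step.
# The monthly_sales figures below are invented for illustration.

def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through
    the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

monthly_sales = [100, 104, 103, 110, 112, 115, 119, 118, 124, 127]
slope, intercept = linear_trend(monthly_sales)
print(f"trend: {slope:+.2f} units per month")
```

With only ten points, a slope like this is at best suggestive; the same calculation on a longer, finer-grained series is what makes a trend genuinely detectable.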

Classification and Categorization

Classification and categorization involve the binning and labelling of objects. Arguably, they also involve, indirectly, establishing a relationship between the objects put into each bin.

Questions we might ask to assess the classification potential of data:

  • Are there any data fields in the dataset that are categorical in nature? Or that can be made categorical in nature?
  • Does the application of these categories require some kind of judgement or discernment, or are they simply obvious labels?
  • If we had data about new objects of the same kinds coming in, how useful would it be to quickly and automatically categorize these objects?
  • If no categories currently exist in this dataset, could we come up with some interesting categories by combining this dataset with another dataset?
  • Are there enough objects in the population to make training possible and auto-classification genuinely feasible?
  • Can we instead generate interesting categories simply by doing relatively straightforward calculations on existing fields?
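The last question above – generating categories through straightforward calculations on existing fields – can be sketched in a few lines of Python. The university records and the size thresholds here are invented for illustration, not a standard classification:

```python
# Deriving a categorical field from a numeric one: each record gets a
# "size" label computed from its enrolment. Data and cutoffs invented.

universities = [
    {"name": "U1", "enrolment": 45000},
    {"name": "U2", "enrolment": 12000},
    {"name": "U3", "enrolment": 3000},
]

def size_category(enrolment):
    if enrolment >= 30000:
        return "large"
    if enrolment >= 10000:
        return "medium"
    return "small"

for u in universities:
    u["size"] = size_category(u["enrolment"])

print([(u["name"], u["size"]) for u in universities])
```

A derived category like this needs no training data at all, which is often a sign that full machine-learning classification would be overkill.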

Regression Analysis, Multivariate Analysis

In these types of analysis we are discerning whether or not relationships exist between objects (or object properties) and, if so, describing the nature of that relationship by means of mathematical equations.

Questions we might ask to assess the regression or multivariate analysis potential of data:

  • Are the data fields largely numeric?
  • Are there a fairly large number of data points?
  • Is this a sample of data? Are we interested in drawing conclusions about the population as a whole?
  • Are we interested in making predictions about some aspect of a type of object based on our knowledge of another aspect of that object or other related objects?
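Before committing to a full regression model, a common first step is simply checking whether two numeric fields move together. Here is a from-scratch sketch of the Pearson correlation coefficient, using invented paired data:

```python
# A first check for a linear relationship between two numeric fields:
# the Pearson correlation coefficient, computed from scratch.
# The paired study_hours/exam_scores data is invented for illustration.

import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

study_hours = [2, 4, 5, 7, 8, 10]
exam_scores = [55, 60, 62, 70, 74, 80]
r = pearson_r(study_hours, exam_scores)
print(f"r = {r:.3f}")
```

An r close to +1 or -1 suggests a regression model is worth fitting; an r near 0 suggests no linear relationship (though a non-linear one may still exist).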

Clustering

Clustering is a very simple structuring method, where objects are divided or put into groups (sometimes the groups are then further divided or combined) based on measures of similarity between the objects.

Questions we might ask to assess the clustering potential of data:

  • Are there only a small number of objects involved? Unlike the previous types of analysis, this suggests that clustering might be a good option, because you can still get interesting clustering results with relatively small datasets.
  • Are there lots of fields that are categorical or, more broadly, non-numeric?
  • Does it seem like there might be some surprising connections or similarities between objects that aren’t immediately obvious, and could be interesting?
  • If we divided the collection of objects into subgroups, could we do something useful with these subgroups?
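To make the “grouping by similarity” idea concrete, here is a deliberately simple one-pass clustering sketch. It is far cruder than k-means or hierarchical clustering, and the 2-D points and distance threshold are invented, but it shows the basic mechanism:

```python
# A minimal clustering sketch: greedy one-pass grouping of 2-D points.
# A point joins the first cluster whose centre is within a distance
# threshold; otherwise it starts a new cluster. Points are invented.

import math

def greedy_cluster(points, threshold):
    clusters = []  # each cluster is a list of points
    for p in points:
        for c in clusters:
            cx = sum(q[0] for q in c) / len(c)
            cy = sum(q[1] for q in c) / len(c)
            if math.dist(p, (cx, cy)) <= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
clusters = greedy_cluster(pts, threshold=3.0)
print(len(clusters), "clusters")
```

Even this toy version illustrates why clustering works on small datasets: five points are enough to produce two meaningful groups.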

To Conclude…

Asking and answering questions like the ones above can give you a preliminary idea of the analyses it might be worth running on a particular dataset. It can also help you to give the owner of the dataset some initial information about what they might expect in terms of results.

Of course, a great deal of the analysis potential for a dataset also depends on how clean and valid the data is, so evaluating that is an important next step in any data analysis project, but doing a preliminary assessment of the dataset beforehand can often set expectations appropriately and get an analysis project headed in the right direction from the get go.

Some Basic Data Science Questions

In this blog article I take a quick shot at answering three basic data science questions:

  • What is data analysis?
  • What is modeling (i.e. computer modeling, mathematical modeling, simulations)?
  • Where does data come from?

What is data analysis?

When we do data analysis, we perform mathematical and logical operations on a collection of data (generally called a dataset).

The results of these operations allow us to draw conclusions about the objects, systems or processes that are generating the data (what the data is about).

Analysis can also allow us to structure objects in useful ways. For example, it might allow us to:

  • group similar objects together
  • classify objects into particular categories
  • make predictions about the current and future behaviors or properties of these objects or similar objects.

This might in turn produce useful new data objects or structures:

  • decision trees that can then be used to make decisions
  • networks that can be used to trace connections and understand links between objects
  • logical or mathematical statements that describe something interesting about the objects and their relationship to other objects
  • systems models that can make predictions about the current or future behavior of objects and systems
  • new categories, classifications or groupings of objects.

What is modeling?

Modelling is the act of creating models. But what is a model? And what is it good for?

There are a number of different definitions of a model, but I favor Grier’s approach: a model is a structure (physical or virtual) with useful similarities to something else that is of interest. This real world ‘something’ might be an object (or type of object), a system or a process. Modellers usually refer to this as the target (or target system). It’s the part of the world that we want to learn about and understand better.

Models are created using information and data about the target system. This information determines the structure of the model. The modeler must also decide how to properly relate the model to the target.

Once a model is created we can use it to predict or learn about the behavior of the target system. We can also use models to ask ‘what if’ questions – e.g. “If I were to change this aspect of my target system, how would its overall behavior likely change as a result?”

Where does data come from?

Both data analysis and modeling need data to work with. But where does data come from?

Data consists of observations or information about objects, systems or processes of interest. Data may be collected by hand, automatically by computer, or some combination of the two (e.g. entered by hand into computer-based forms).

Once collected, it may reside in files (e.g. comma or tab delimited text files), spreadsheets (e.g. Excel) or databases (e.g. MySQL, Microsoft Access, Oracle, SQL Server).
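For the delimited-file case, Python’s standard csv module is a typical starting point. This sketch uses an in-memory file with invented columns so it runs on its own; in practice you would pass a real file handle:

```python
# Reading comma-delimited data into a list of dicts with the standard
# csv module. The columns and values here are invented; in practice
# the io.StringIO would be replaced by open("some_file.csv").

import csv
import io

data = io.StringIO("name,province,enrolment\nU1,ON,45000\nU2,BC,12000\n")
rows = list(csv.DictReader(data))
print(rows[0]["province"])
```

Note that csv.DictReader returns every field as a string; converting numeric columns (like enrolment) is left to the analyst.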

Data may also exist in the form of documents and websites. This is referred to as unstructured data (or in the case of websites, semi-structured data). Unstructured data can be analyzed to give it more structure, at which point it may become semi-structured or structured data. The resulting data can then be stored in a database for further analysis, or used to create a model of the target system.
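Giving structure to unstructured text often starts with pattern extraction. Here is a sketch that pulls fields out of free-form lines with a regular expression; the line format and field names are invented for illustration:

```python
# Turning unstructured text into structured records: a regular
# expression extracts named fields from each line. Lines are invented.

import re

raw_lines = [
    "Order 1001 shipped to Toronto on 2021-03-05",
    "Order 1002 shipped to Vancouver on 2021-03-07",
]

pattern = re.compile(
    r"Order (?P<order_id>\d+) shipped to (?P<city>\w+)"
    r" on (?P<date>\d{4}-\d{2}-\d{2})"
)

records = []
for line in raw_lines:
    m = pattern.search(line)
    if m:
        records.append(m.groupdict())

print(records)
```

The resulting list of dicts is structured data in exactly the sense described above: it could be loaded into a database table and analyzed like any other dataset.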

Welcome to the Sysabee Blog

As principal of Sysabee, I’d like to welcome you to the Sysabee data science blog.

In addition to working on data science and systems modeling projects, it’s nice to step back and write down a few thoughts on the techniques, strategies and practices that go into this work: both the tried and true and also the new and emerging ideas, tools and trends.

I hope you enjoy the articles here, and find them useful. If you have a particular topic you’d like to see discussed, don’t hesitate to get in touch – I’m always interested in learning about what other people are interested in.