Data Mining: Working with ‘Wild’ Data

Prior to the IT revolution, data was collected, usually laboriously and carefully, by hand. This didn’t mean that it was entirely free of errors or problems. But the sheer difficulty of gathering it, combined with the fact that it was usually collected in relatively small quantities, with its end use planned well in advance, meant that there was a fairly small jump between data collection and data analysis. As well, because data was perhaps most frequently collected in a scientific context, it was not unusual for many aspects of the system, process or objects generating the data to already be well understood, with their states carefully controlled and managed during data collection.

These days, outside of the scientific research context, such controlled, ideal conditions are not as typical. As a result, data generated and collected ‘in the wild’, perhaps by sensors, computer applications or web forms, is often more than a little rough around the edges. It may be collected willy nilly, in vast quantities, with the equivalent of rocks, twigs and leaves mixed in with the good data. It may also be stored for years in old, musty, strangely designed databases, with very little in the way of labels or maps to illuminate the data structure.

Because of this, getting data from its starting state, and location, to the point where analysis can be performed, as well as determining what analyses can legitimately be performed, may be substantial tasks in and of themselves.

Polishing the dataset

To appreciate some of the challenges on the data collection front, it’s helpful to consider the end goal first. From an analysis point of view, the ideal dataset would be one where there was:

  • metadata for each data field describing the meaning, expected format and data type, and intended values or range of the data in the field
  • an up-to-date data model describing the relationship between the set of data fields that make up a record, the relationship between the different data tables, as well as a description of the relationship of both of these to the system responsible for generating the data
  • information provided on the origins and collection strategy for the data
  • information provided that either describes, or at least enables, an assessment of the level of data precision and data accuracy
  • an assessment and summary of the range and type of data in each field, compared with the expected range and type of data
  • identification of missing data and a consistent strategy applied for denoting different types of missing data in the database
  • identification of data collection or data entry errors, which would then be identified and corrected or managed in some systematic fashion
  • storage of the cleaned dataset in a database (likely not the original one) that is readily and directly accessible by the data analysis tools being used to carry out the analysis
  • structuring of the data in the cleaned dataset, both with respect to format and database structure, in a manner that is appropriate for the intended analysis
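
Several of these criteria can be checked mechanically once even rough field metadata exists. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the `FieldSpec` structure, the `shoe_size` field, its expected range and the missing-value markers are all hypothetical and chosen purely for illustration. It summarizes a single field by counting missing, wrongly typed and out-of-range values.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field metadata; the names and ranges are illustrative assumptions.
@dataclass
class FieldSpec:
    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None

# Assumed markers for missing values; a real database may use others.
MISSING_MARKERS = (None, "", "NA")

def profile_field(records, spec):
    """Count missing, wrongly typed, out-of-range and acceptable values for one field."""
    summary = {"missing": 0, "wrong_type": 0, "out_of_range": 0, "ok": 0}
    for record in records:
        value = record.get(spec.name)  # an absent key is treated as missing
        if value in MISSING_MARKERS:
            summary["missing"] += 1
        elif not isinstance(value, spec.dtype):
            summary["wrong_type"] += 1
        elif (spec.min_value is not None and value < spec.min_value) or (
            spec.max_value is not None and value > spec.max_value
        ):
            summary["out_of_range"] += 1
        else:
            summary["ok"] += 1
    return summary

records = [
    {"shoe_size": 9.5},     # acceptable
    {"shoe_size": "nine"},  # wrong type
    {"shoe_size": None},    # explicitly missing
    {"shoe_size": 42.0},    # outside the assumed range
    {},                     # field absent
]
spec = FieldSpec("shoe_size", float, min_value=1.0, max_value=20.0)
summary = profile_field(records, spec)
print(summary)
```

A summary like this does not clean anything by itself, but it gives a first measure of how far a field is from its expected state.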

If a dataset that met all of these criteria were delivered to me, I would be in data analysis heaven! That said, I have enough experience to know that expecting this to be the starting state of the data is not realistic for most analysis projects. Indeed, an important starting point for a project is assessing where the dataset is now, and comparing that with where it needs to get to in order to be analysis ready – and then determining how that will happen.

Assessing and Evaluating the Dataset

On top of this, it’s important to determine the extent to which the data reflects the current or past state of the system of interest and also how this, and the dataset itself, are likely to change over time. To do this we need to understand if the dataset is intended to:

  • act as a sample in order to draw conclusions about a larger population or if, rather, it represents some aspects of the entire population of interest
  • act as a snapshot in time of a particular system or set of objects, which in the future may themselves either remain the same or change over time
  • continue to grow and be added to following the analysis, with the new data being incorporated into the analysis in some way
  • illustrate something in the context of a before-and-after scenario, with plans to change the system, process or objects, and then gather additional data reflecting this new state, which can be usefully compared with the old one

All of these possibilities will substantially influence the choice of appropriate data analysis techniques, and also determine which conclusions about the system (its past, present and future states) can usefully be drawn from an analysis of the currently available data.

    Data Analysis Systems: Good Design Matters

    In a recent blog post I presented a gloss of the components that go into making a dynamic data analysis system. Although the high level picture I presented there is fairly straightforward, in practice the design and functional requirements of each of the parts require a fair amount of attention.

    Here I’ll provide a few quick follow-up notes on some of the system design considerations that need to come into play.

    • Data Collection: The data collection components of the system must be designed to collect the right kinds of data, in the right format, at the right level of detail, in a way that ensures high quality data that can be analyzed in useful ways. Also very importantly, the data collection user interface, if there is one, must be carefully designed to allow users to easily provide high quality data.
    • Data Storage: The database must be designed with a solid underlying data model that understands and properly formalizes the structure, relationships and properties of the objects for which data is being collected, in such a way that the desired analysis can be performed. The database must also be designed with sufficient functionality and efficiency to support the analysis operations being carried out on the dataset.
    • Data Restructuring and Analysis: The data analysis component of the system must be designed to take into consideration the accuracy of the data, the way the data is representing the objects behind the data and what analysis results will be useful and informative to the end-consumers of the analysis.
    • Data Presentation and Visualization: The data and analysis presentation interface must be designed to clearly, accurately and effectively display the results of the analysis. From a functional requirements point of view, it must be able to deliver and display up-to-date results of the analysis in a timely fashion, based on the requirements of the end-user.

    From these considerations alone, it should be fairly apparent that designing and implementing a successful dynamic data analysis system will almost always be a group effort, requiring experts and experienced practitioners from several different domains. This can add to the scope of the project, but from my perspective it’s also what makes this work fun and compelling – working together to effectively build something cool and useful.

    Data Analysis Systems: A Gloss

    You’ll often hear people saying that data science is a team effort. And I think this is very true. Many different types of expertise are required to successfully carry out data science projects – there’s the software and software systems part, the data analysis part, the domain expertise part, the interface and visualization components… All of these are required for a successful project, along with the communications glue, both technical and human, that makes them stick successfully together.

    That said, although all of these pieces are going to be incorporated in some way into any data science project, there is certainly an interaction with scale here.

    Some data science projects will be relatively small and static. This doesn’t necessarily make them any less valuable, but it does reduce the technical requirements. They will typically involve a single, already existing, dataset (perhaps contained in an Excel spreadsheet or text file), which can be uploaded into an analytics package and analyzed essentially all at once (although there will no doubt be some back and forth between the analysis and domain specific team members here). The results themselves can then be reported in a relatively comprehensive and exhaustive fashion – for example, in the form of a report describing and visualizing the results of the analysis. To the researchers out there, this should sound suspiciously like writing a journal article.

    Dynamic datasets

    Other data science projects are going to require a bit more support in the way of an underlying data analysis system. In particular, projects that involve what in this blog article I’ll refer to as dynamic datasets – ones where data is being added, updated and changed over time – will benefit from a data analysis system that is less manual, more automated, and a bit more technically sophisticated.

    In this case, in creating such a system, what we’re essentially doing is building a pipeline through which our data can flow and be transformed into useful output along the way. Even with dynamic datasets there can be considerable variability in the scope and technical requirements needed, but it’s fairly safe to say that all of these systems will need to have certain core components present in some form or another.

    So what are the pieces in this pipeline?

    Data Collection: The data analysis pipeline starts with data collection. Data might be collected by computer programs (e.g. ones that keep track of user and other computer behaviours), through sensors set up in the environment (e.g. weather sensors that measure if it is raining and how much rain has fallen), or by being manually entered through user interfaces (e.g. data collected through forms on a website or web app).

    Data Storage: Once collected, all of this data must be stored somewhere, and this is where database software and systems come into play. The programs, sensors or forms gather the data, and then connect to the database (handwaving aside the technical details here) and add that data into the database. Once in the database, other programs can then come along and work with the data stored there.

    Data Evaluation and Data Analysis: Data cleaning and data evaluation programs might review the stored data to determine its quality and, potentially, correct issues that are detected. Then, analysis software can carry out predefined analyses on the data. As you can see from this description, these components of the data analysis system will often make changes or additions to the database, based on the results of their work on the dataset.

    Data Results and Visualization: Finally, the results of the analysis must be presented as useful output to the consumers of the analysis. Since, in this case, we’re talking about a dynamic dataset, the results themselves will need to be presented in a dynamic fashion, and kept up to date as new data is added to the dataset. For example, the data might be made available via a web application that takes current analysis results either directly from the analysis program or from analysis results stored in the database, and then presents an up-to-date picture of these analysis results every time a web page is loaded.
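
    The four stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production design: the rainfall readings, the `rainfall` table, its columns and every function name are hypothetical, and an in-memory SQLite database stands in for whatever database a real system would use.

```python
import sqlite3

def collect():
    """Stand-in for sensors or web forms: returns (day, rainfall in mm) readings."""
    return [("2024-01-01", 0.0), ("2024-01-02", 12.5), ("2024-01-03", -1.0)]

def store(conn, readings):
    """Data storage: add the readings to the database, not yet evaluated."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rainfall (day TEXT PRIMARY KEY, mm REAL, valid INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO rainfall (day, mm, valid) VALUES (?, ?, NULL)", readings
    )
    conn.commit()

def evaluate(conn):
    """Data evaluation: flag impossible (negative) readings rather than deleting them."""
    conn.execute("UPDATE rainfall SET valid = (mm >= 0)")
    conn.commit()

def analyze(conn):
    """Data analysis: a predefined summary over the readings that passed evaluation."""
    count, total = conn.execute(
        "SELECT COUNT(*), SUM(mm) FROM rainfall WHERE valid = 1"
    ).fetchone()
    return {"valid_days": count, "total_mm": total}

def present(result):
    """Presentation: turn the analysis result into output for the end-consumer."""
    return f"{result['valid_days']} valid days, {result['total_mm']:.1f} mm of rain in total"

conn = sqlite3.connect(":memory:")
store(conn, collect())
evaluate(conn)
report = present(analyze(conn))
print(report)
```

    Note that the evaluation step writes its verdict back into the database, which mirrors the point made above that these components often change or add to the stored data. Rerunning `analyze` and `present` after new readings arrive is what keeps the output up to date.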

    To conclude…

    Although at Sysabee we focus on the data analysis piece, we also know that it’s critically important for that piece to ‘play nice’ with all of the other components in the system. A major enabler of this is good system design. This blog article is already getting a little long, so I’ll take that topic up in a follow-up post.

    Learning to Eyeball Data

    Different datasets have different potential for analysis; they are more or less amenable to having particular tools and techniques applied to bring out, or define, their underlying structure. But how can we tell, or at least guess, when we first cast eyes on a dataset, what analysis potential lies within?

    Preliminary Dataset Assessment

    The richness of a dataset – how much valuable, interesting, juicy information is held within – isn’t necessarily obvious. More data isn’t automatically better. We might have a gigantic dataset, filled with many datapoints, which is nonetheless fairly inert from an analysis point of view. For example, perhaps the dataset consists of all of the names and postal codes of everyone in Canada, plus their shoe sizes. Although we might be able to use this dataset for something interesting (e.g. a shoe advertisement mail campaign), from an analysis point of view the dataset isn’t immediately that exciting.

    Conversely, we might have a very small dataset that has a lot of analysis potential. Consider, for example, a dataset containing information about Canadian universities, which includes some statistics for each university along with information about a variety of student quality of life measures and life outcomes of students who have attended the university. The analysis of this dataset could be very interesting, even though the number of data points is relatively small because the total number of universities in Canada is itself relatively small.

    Some questions we might ask in order to evaluate the overall analysis potential of a dataset:

    • How many objects are being considered (i.e. do we have data on a lot of objects? On a few objects but over a lot of time?)
    • Are we looking at a population of objects or a sample of that population?
    • Are there a lot of different types (fields) of information collected about each object?
    • What is the granularity of the information?
    • Is there relatively nuanced information in each field?
    • Is there relatively diverse information in each field?
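
    Some of these questions can be given a rough first answer by counting. The sketch below is one possible approach, assuming the dataset fits in memory as a list of dictionaries; the `richness_summary` function and the three sample records are hypothetical. It reports the number of objects, the fields present, and the number of distinct values per field as a crude proxy for diversity.

```python
def richness_summary(records):
    """Count objects, list the fields seen, and count distinct non-missing values per field."""
    fields = sorted({name for record in records for name in record})
    distinct = {
        field: len({record[field] for record in records if record.get(field) is not None})
        for field in fields
    }
    return {"objects": len(records), "fields": fields, "distinct_values": distinct}

# Illustrative records echoing the names, postal codes and shoe sizes example.
people = [
    {"name": "Ana", "postal_code": "K1A 0B1", "shoe_size": 8},
    {"name": "Ben", "postal_code": "K1A 0B1", "shoe_size": 10},
    {"name": "Cy", "postal_code": "V6B 4Y8", "shoe_size": 8},
]
summary = richness_summary(people)
print(summary)
```

    Counts like these say nothing about granularity or nuance, which still call for a human look at the values themselves.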

    I should hasten to add that if it isn’t a particularly rich dataset, that doesn’t mean it’s analytically worthless. There are likely still some interesting and valuable basic analytics options that can be applied (e.g. counting the number of dwellings in each area based on postal code, determining shoe size distribution in Canada, which could then be used to inform stocking decisions). But in such a case, applying sophisticated data mining techniques may be overkill.

    On the other hand, if we’ve determined that we may be dealing with a dataset that has a lot of analysis potential, how can we go on to get more specific about possible analysis techniques that could be applied to the data?

    Let’s consider a number of popular analysis categories in turn:

    Time Series Analysis

    Time series analysis involves tracking a change in an object (or objects) over time (as measured at particular moments in time). The goal is to try to discover a relationship (connection) between the change in the object and the passage of time. To determine if a dataset is amenable to this type of analysis the number one question is: Does the data track changes in an object property (or an aggregate property of a group of objects) over time? Proceed only if the answer is yes.

    For time series analysis, the amount of data available is also very important. Are there enough data points over the range of time, with high enough granularity, to make patterns over time detectable? Are there enough data points to extract the underlying pattern from the noise? Ten data points over 10 years likely won’t cut it, no matter how many objects we have that data for.
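
    A quick way to put numbers on these questions is to look at how many observations there are, how long a period they span, and how far apart they typically sit. The following sketch assumes the observation dates are available as Python `date` objects and that there are at least two of them; the `time_coverage` function and the ten yearly dates are hypothetical.

```python
from datetime import date
from statistics import median

def time_coverage(dates):
    """Summarize the number of points, the span covered and the typical gap between points.

    Assumes at least two dates are supplied.
    """
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "points": len(ordered),
        "span_days": (ordered[-1] - ordered[0]).days,
        "median_gap_days": median(gaps),
    }

# Ten yearly observations: very coarse granularity for most purposes.
yearly = [date(2010 + i, 1, 1) for i in range(10)]
coverage = time_coverage(yearly)
print(coverage)
```

    What counts as enough points, or a small enough gap, depends on the time scale of the pattern being sought, so the thresholds are left to the analyst.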

    Classification and Categorization

    Classification and categorization involve the binning and labelling of objects. Arguably, they also involve, indirectly, establishing a relationship between the objects put into each bin.

    Questions we might ask to assess the classification potential of data:

    • Are there any data fields in the dataset that are categorical in nature? Or that can be made categorical in nature?
    • Does the application of these categories require some kind of judgement or discernment, or are they simply obvious labels?
    • If we had data about new objects of the same kinds coming in, how useful would it be to quickly and automatically categorize these objects?
    • If no categories currently exist in this dataset, could we come up with some interesting categories by combining this dataset with another dataset?
    • Are there enough objects in the population to make training possible and auto-classification genuinely feasible?
    • Can we instead generate interesting categories simply by doing relatively straightforward calculations on existing fields?
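
    The last question, generating categories through straightforward calculations on existing fields, can be illustrated with simple binning. This is a sketch only: the `enrolment` field, the cut-off values and the university records are invented for illustration. Counting the members of each resulting category also gives a first hint as to whether there are enough objects per category to make training feasible.

```python
from collections import Counter

def size_category(enrolment):
    """Bin a numeric field into labels; the cut-offs are arbitrary assumptions."""
    if enrolment < 5_000:
        return "small"
    if enrolment < 20_000:
        return "medium"
    return "large"

# Hypothetical records; the names and enrolment figures are made up.
universities = [
    {"name": "University A", "enrolment": 3_200},
    {"name": "University B", "enrolment": 18_500},
    {"name": "University C", "enrolment": 41_000},
    {"name": "University D", "enrolment": 4_100},
]

for university in universities:
    university["size"] = size_category(university["enrolment"])

# How many objects fall into each derived category?
category_counts = Counter(university["size"] for university in universities)
print(category_counts)
```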

    Regression Analysis, Multivariate Analysis

    In these types of analysis we are discerning whether or not relationships exist between objects (or object properties) and, if yes, describing the nature of that relationship by means of mathematical equations.

    Questions we might ask to assess the regression or multivariate analysis potential of data:

    • Are the data fields largely numeric?
    • Are there a fairly large number of data points?
    • Is this a sample of data? Are we interested in drawing conclusions about the population as a whole?
    • Are we interested in making predictions about some aspect of a type of object based on our knowledge of another aspect of that object or other related objects?
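
    As a concrete instance of describing a relationship by means of an equation, the sketch below fits a straight line to two numeric fields using ordinary least squares. It covers only the simplest case (one predictor, one outcome), and the `fit_line` function and the sample values are illustrative assumptions rather than a recommended workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined if all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Predict the outcome for a new object from its known property.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```

    With real data the points will not sit exactly on a line, and the number of data points and the spread of the predictor largely determine how far the fitted equation can be trusted.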


    Clustering

    Clustering is a very simple structuring method, where objects are divided or put into groups (sometimes the groups are then further divided or combined) based on measures of similarities between the objects.

    Questions we might ask to assess the clustering potential of data:

    • Are there only a small number of objects involved? In contrast to the previous types of analysis, this suggests that clustering might be a good option, because you can still get interesting clustering results with relatively small datasets.
    • Are there lots of fields that are categorical or, more broadly, non-numeric?
    • Does it seem like there might be some surprising connections or similarities between objects that aren’t immediately obvious, and could be interesting?
    • If we divided the collection of objects into subgroups, could we do something useful with these subgroups?
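
    To make the idea of grouping by similarity concrete, here is a deliberately simple sketch for a small set of objects with categorical fields. It is a greedy, single-pass grouping against the first member of each group, not a standard algorithm such as k-means or hierarchical clustering, and its result depends on the order of the objects. The similarity measure (share of matching fields), the threshold and the sample records are all assumptions made for illustration.

```python
def similarity(a, b, fields):
    """Share of the given fields on which two objects have the same value."""
    return sum(a[field] == b[field] for field in fields) / len(fields)

def group_by_similarity(objects, fields, threshold=0.6):
    """Greedy grouping: each object joins the first group whose first member is similar enough."""
    groups = []
    for obj in objects:
        for group in groups:
            if similarity(obj, group[0], fields) >= threshold:
                group.append(obj)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([obj])
    return groups

# Hypothetical objects described only by categorical fields.
universities = [
    {"name": "A", "region": "West", "focus": "research", "language": "English"},
    {"name": "B", "region": "West", "focus": "research", "language": "French"},
    {"name": "C", "region": "East", "focus": "teaching", "language": "French"},
    {"name": "D", "region": "East", "focus": "teaching", "language": "English"},
]
groups = group_by_similarity(universities, ["region", "focus", "language"])
group_names = [[obj["name"] for obj in group] for group in groups]
print(group_names)
```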

    To Conclude…

    Asking and answering questions like the ones above can give you a preliminary idea of the analyses it might be worth running on a particular dataset. It can also help you to give the owner of the dataset some initial information about what they might expect in terms of results.

    Of course, a great deal of the analysis potential for a dataset also depends on how clean and valid the data is, so evaluating that is an important next step in any data analysis project. But doing a preliminary assessment of the dataset beforehand can often set expectations appropriately and get an analysis project headed in the right direction from the get-go.

    Some Basic Data Science Questions

    In this blog article I take a quick shot at answering three basic data science questions:

    • What is data analysis?
    • What is modeling (i.e. computer modeling, mathematical modeling, simulations)?
    • Where does data come from?

    What is data analysis?

    When we do data analysis, we perform mathematical and logical operations on a collection of data (generally called a dataset).

    The results of these operations allow us to draw conclusions about the objects, systems or processes that are generating the data (what the data is about).

    Analysis can also allow us to structure objects in useful ways. For example, it might allow us to:

    • group similar objects together
    • classify objects into particular categories
    • make predictions about the current and future behaviors or properties of these objects or similar objects.

    This might in turn produce useful new data objects or structures:

    • decision trees that can then be used to make decisions
    • networks that can be used to trace connections and understand links between objects
    • logical or mathematical statements that describe something interesting about the objects and their relationship to other objects
    • systems models that can make predictions about the current or future behavior of objects and systems
    • new categories, classifications or groupings of objects.

    What is modeling?

    Modeling is the act of creating models. But what is a model? And what is it good for?

    There are a number of different definitions of a model, but I favor Grier’s approach: a model is a structure (physical or virtual) with useful similarities to something else that is of interest. This real world ‘something’ might be an object (or type of object), a system or a process. Modelers usually refer to this as the target (or target system). It’s the part of the world that we want to learn about and understand better.

    Models are created using information and data about the target system. This information determines the structure of the model. The modeler must also decide how to properly relate the model to the target.

    Once a model is created we can use it to predict or learn about the behavior of the target system. We can also use models to ask ‘what if’ questions, i.e. “If I were to change this aspect of my target system, how would its overall behavior likely change as a result?”

    Where does data come from?

    Both data analysis and modeling need data to work with. But where does data come from?

    Data consists of observations or information about objects, systems or processes of interest. Data may be collected by hand, automatically by computer, or some combination of the two (e.g. entered by hand into computer-based forms).

    Once collected, it may reside in files (e.g. comma or tab delimited text files), spreadsheets (e.g. Excel) or databases (e.g. MySQL, Microsoft Access, Oracle, SQL Server).

    Data may also exist in the form of documents and websites. This is referred to as unstructured data (or in the case of websites, semi-structured data). Unstructured data can be analyzed to give it more structure, at which point it may become semi-structured or structured data. The resulting data can then be stored in a database for further analysis, or used to create a model of the target system.
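
    As a small illustration of giving unstructured text more structure, the sketch below pulls records out of free text with a regular expression. The sentence format, the field names and the pattern are assumptions invented for this example; real documents are rarely this regular and usually need more robust extraction.

```python
import re

# Made-up free text standing in for a document.
text = (
    "Order 1001 shipped to K1A 0B1 on 2024-03-05. "
    "Order 1002 shipped to V6B 4Y8 on 2024-03-07."
)

# Named groups become the field names of the structured records.
pattern = re.compile(
    r"Order (?P<order_id>\d+) shipped to "
    r"(?P<postal_code>[A-Z]\d[A-Z] \d[A-Z]\d) on "
    r"(?P<date>\d{4}-\d{2}-\d{2})"
)

# Each match yields one record, ready to be loaded into a table.
extracted = [match.groupdict() for match in pattern.finditer(text)]
print(extracted)
```

    Every extracted value is still a string at this point; converting types and validating ranges would be the next step before storing the records in a database.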

    Welcome to the Sysabee Blog

    As principal of Sysabee, I’d like to welcome you to the Sysabee data science blog.

    In addition to working on data science and systems modeling projects, it’s nice to step back and write down a few thoughts on the techniques, strategies and practices that go into this work: both the tried and true and also the new and emerging ideas, tools and trends.

    I hope you enjoy the articles here, and find them useful. If you have a particular topic you’d like to see discussed, don’t hesitate to get in touch; I’m always interested in learning about what other people are interested in.