Data Mining: Working with ‘Wild’ Data

Prior to the IT revolution, data was collected, usually laboriously and carefully, by hand. This didn’t mean that it was entirely error or problem free. But the sheer difficulty of gathering it, combined with the fact that it was usually collected in relatively small quantities, with its end use planned well in advance, meant that there was a fairly small jump between data collection and data analysis. As well, because data was perhaps most frequently collected in a scientific context, it was not unusual for many aspects of the system, process or objects generating the data to already be well understood, with their states carefully controlled and managed during data collection.

These days, outside of the scientific research context, such controlled, ideal conditions are not as typical. As a result, data generated and collected ‘in the wild’, perhaps by sensors, computer applications or web forms, is often more than a little rough around the edges. It may be collected willy-nilly, in vast quantities, with the equivalent of rocks, twigs and leaves mixed in with the good data. It may also be stored for years in old, musty, strangely designed databases, with very little in the way of labels or maps to illuminate the data structure.

Because of this, getting data from its starting state, and location, to the point where analysis can be performed, as well as determining what analyses can legitimately be performed, may be substantial tasks in and of themselves.

Polishing the dataset

To appreciate some of the challenges on the data collection front, it’s helpful to consider the end goal first. From an analysis point of view, the ideal dataset would be one where there was:

  • metadata for each data field describing the meaning, expected format and data type, and intended values or range of the data in the field
  • an up-to-date data model describing the relationship between the set of data fields that make up a record, the relationship between the different data tables, as well as a description of the relationship of both of these to the system responsible for generating the data
  • information provided on the origins and collection strategy for the data
  • information provided that either describes, or at least enables, an assessment of the level of data precision and data accuracy
  • an assessment and summary of the range and type of data in each field, compared with the expected range and type of data (a sketch of one way to automate this kind of check appears just after this list)
  • identification of missing data and a consistent strategy applied for denoting different types of missing data in the database
  • identification of data collection or data entry errors, which would then be corrected or managed in some systematic fashion
  • storage of the cleaned dataset in a database (likely not the original one) that is readily and directly accessible by the data analysis tools being used to carry out the analysis
  • structuring of the data in the cleaned dataset, both with respect to format and database structure, in a manner that is appropriate for the intended analysis
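As a very rough illustration of what checking a couple of these criteria can look like in practice, here is a minimal sketch using Python and pandas. The file name, field names, expected types and ranges are all invented for the example; in a real project they would come from the dataset’s metadata and data model rather than being hard-coded.

```python
import pandas as pd

# Hypothetical expectations for two fields; in a real project these would be
# drawn from the metadata and data model, not hard-coded like this.
EXPECTED = {
    "temperature_c": {"dtype": "float64", "min": -50.0, "max": 60.0},
    "site_id": {"dtype": "object"},
}

def profile_field(df: pd.DataFrame, field: str, spec: dict) -> dict:
    """Compare one field's actual contents against its expected type and range."""
    col = df[field]
    report = {
        "field": field,
        "dtype_ok": str(col.dtype) == spec["dtype"],
        "missing": int(col.isna().sum()),
    }
    if "min" in spec:
        report["out_of_range"] = int(((col < spec["min"]) | (col > spec["max"])).sum())
    return report

df = pd.read_csv("readings.csv")  # placeholder file name
for field, spec in EXPECTED.items():
    print(profile_field(df, field, spec))
```

Even a simple summary like this makes it much easier to judge how far a dataset currently sits from the ‘analysis ready’ state described above.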

If a dataset that met all of these criteria were delivered to me, I would be in data analysis heaven! That said, I have enough experience to know that expecting this to be the starting state of the data is not realistic for most analysis projects. Indeed, an important starting point for a project is assessing where the dataset is now, and comparing that with where it needs to get to in order to be analysis ready – and then determining how that will happen.

Assessing and Evaluating the Dataset

On top of this, it’s important to determine the extent to which the data reflects the current or past state of the system of interest and also how this, and the dataset itself, are likely to change over time. To do this we need to understand if the dataset is intended to:

  • act as a sample in order to draw conclusions about a larger population or if, rather, it represents some aspects of the entire population of interest
  • act as a snapshot in time of a particular system or set of objects, which in the future may themselves either remain the same or change over time
  • continue to grow and be added to following the analysis, with the new data being incorporated into the analysis in some way
  • illustrate something in the context of a before-and-after scenario, with plans to change the system, process or objects, and then gather additional data reflecting this new state, which can be usefully compared with the old one

All of these possibilities will substantially influence the choice of appropriate data analysis techniques, and also determine which conclusions about the system (its past, present and future states) can usefully be drawn from an analysis of the currently available data.

    Data Analysis Systems: Good Design Matters

    In a recent blog post I presented a gloss of the components that go into making a dynamic data analysis system. Although the high-level picture I presented there is fairly straightforward, in practice the design and functional requirements of each of the parts require a fair amount of attention.

    Here I’ll provide a few quick follow-up notes on some of the system design considerations that need to come into play.

    • Data Collection: The data collection components of the system must be designed to collect the right kinds of data, in the right format, at the right level of detail, in a way that ensures high quality data that can be analyzed in useful ways. Also very importantly, the data collection user interface, if there is one, must be carefully designed to allow users to easily provide high quality data.
    • Data Storage: The database must be designed with a solid underlying data model that captures and properly formalizes the structure, relationships and properties of the objects for which data is being collected, in such a way that the desired analysis can be performed (a minimal schema sketch appears just after this list). The database must also be designed with sufficient functionality and efficiency to support the analysis operations being carried out on the dataset.
    • Data Restructuring and Analysis: The data analysis component of the system must be designed to take into consideration the accuracy of the data, the way the data represents the objects behind it, and what analysis results will be useful and informative to the end-consumers of the analysis.
    • Data Presentation and Visualization: The data and analysis presentation interface must be designed to clearly, accurately and effectively display the results of the analysis. From a functional requirements point of view, it must be able to deliver and display up-to-date results of the analysis in a timely fashion, based on the requirements of the end-user.
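To make the data storage point a little more concrete, here is a minimal sketch of what formalizing structure and constraints in a data model can look like, using SQLite from Python’s standard library. The tables and columns (weather sites and rainfall readings) are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect("measurements.db")  # placeholder database file

# The schema formalizes what a valid record looks like: types, required
# fields, a foreign key tying each reading to a known site, and a sanity
# check on the measured value.
conn.executescript("""
CREATE TABLE IF NOT EXISTS site (
    site_id   TEXT PRIMARY KEY,
    latitude  REAL NOT NULL,
    longitude REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS reading (
    reading_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id     TEXT NOT NULL REFERENCES site(site_id),
    recorded_at TEXT NOT NULL,                 -- ISO 8601 timestamp
    rainfall_mm REAL CHECK (rainfall_mm >= 0)  -- negative rainfall is a data error
);
""")
conn.commit()
```

Constraints like these catch many data collection and data entry problems at storage time, before they ever reach the analysis stage.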

    From these considerations alone, it should be fairly apparent that designing and implementing a successful dynamic data analysis system will almost always be a group effort, requiring experts and experienced practitioners from several different domains. This can add to the scope of the project, but from my perspective it’s also what makes this work fun and compelling – working together to effectively build something cool and useful.

    Data Analysis Systems: A Gloss

    You’ll often hear people saying that data science is a team effort. And I think this is very true. Many different types of expertise are required to successfully carry out data science projects – there’s the software and software systems part, the data analysis part, the domain expertise part, the interface and visualization components… All of these are required for a successful project, along with the communications glue, both technical and human, that makes them stick successfully together.

    That said, although all of these pieces are going to be incorporated in some way into any data science project, there is certainly an interaction with scale here.

    Some data science projects will be relatively small and static. This doesn’t necessarily make them any less valuable, but it does reduce the technical requirements. They will typically involve a single, already existing, dataset (perhaps contained in an Excel spreadsheet or text file), which can be uploaded into an analytics package and analyzed essentially all at once (although there will no doubt be some back and forth between the analysis and domain specific team members here). The results themselves can then be reported in a relatively comprehensive and exhaustive fashion – for example, in the form of a report describing and visualizing the results of the analysis. To the researchers out there, this should sound suspiciously like writing a journal article.
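In code, a small static project of this kind might amount to little more than the following sketch (pandas, with a placeholder file name); the interesting work lies in interpreting and writing up what comes back.

```python
import pandas as pd

# Load the single existing dataset once...
df = pd.read_csv("project_data.csv")  # placeholder; could equally be an Excel export

# ...and summarize it for the report.
print(df.shape)                    # number of records and fields
print(df.dtypes)                   # inferred type of each field
print(df.describe(include="all"))  # quick statistical summary of every column
```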

    Dynamic datasets

    Other data science projects are going to require a bit more support in the way of an underlying data analysis system. In particular, projects that involve what in this blog article I’ll refer to as dynamic datasets – ones where data is being added, updated and changed over time – will benefit from a data analysis system that is less manual, more automated, and a bit more technically sophisticated.

    In this case, in creating such a system, what we’re essentially doing is building a pipeline through which our data can flow and be transformed into useful output along the way. Even with dynamic datasets there can be considerable variability in the scope and technical requirements needed, but it’s fairly safe to say that all of these systems will need to have certain core components present in some form or another.

    So what are the pieces in this pipeline?

    Data Collection: The data analysis pipeline starts with data collection. Data might be collected by computer programs (e.g. ones that keep track of user and other computer behaviours), through sensors set up in the environment (e.g. weather sensors that measure if it is raining and how much rain has fallen), or by being manually entered through user interfaces (e.g. data collected through forms on a website or web app).
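As a toy example of the collection step, here is a sketch of a small collector that packages one (simulated) sensor measurement together with the context needed to interpret it later. The sensor, site identifier and field names are all hypothetical.

```python
import json
import random
from datetime import datetime, timezone

def read_rain_gauge() -> float:
    """Stand-in for talking to real hardware; returns rainfall in mm."""
    return round(random.uniform(0.0, 5.0), 1)

def collect_reading(site_id: str) -> dict:
    """Bundle one measurement with a timestamp and the site it came from."""
    return {
        "site_id": site_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "rainfall_mm": read_rain_gauge(),
    }

print(json.dumps(collect_reading("site-001")))
```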

    Data Storage: Once collected, all of this data must be stored somewhere, and this is where database software and systems come into play. The programs, sensors or forms gather the data, and then connect to the database (handwaving aside the technical details here) and add that data to it. Once in the database, other programs can then come along and work with the data stored there.
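Continuing the toy example, the storage step might look something like this sketch, again using SQLite from Python’s standard library and assuming a reading table with these columns already exists.

```python
import sqlite3

def store_reading(db_path: str, reading: dict) -> None:
    """Connect to the database and add one collected reading to it."""
    with sqlite3.connect(db_path) as conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO reading (site_id, recorded_at, rainfall_mm) VALUES (?, ?, ?)",
            (reading["site_id"], reading["recorded_at"], reading["rainfall_mm"]),
        )

store_reading("measurements.db", {
    "site_id": "site-001",
    "recorded_at": "2024-06-01T12:00:00+00:00",
    "rainfall_mm": 0.4,
})
```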

    Data Evaluation and Data Analysis: Data cleaning and data evaluation programs might review the stored data to determine its quality and, potentially, correct issues that are detected. Then, analysis software can carry out predefined analyses on the data. As you can see from this description, these components of the data analysis system will often make changes or additions to the database, based on the results of their work on the dataset.
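A sketch of what this step might look like for the same toy dataset: flag suspect values rather than silently deleting them, then write a predefined summary back to the database for later use. The suspect column and the daily_rainfall summary table are assumptions made for the example.

```python
import sqlite3

def evaluate_and_analyze(db_path: str) -> None:
    """Flag suspect readings, then store a simple predefined summary."""
    with sqlite3.connect(db_path) as conn:
        # Data evaluation: mark physically impossible values (negative rainfall).
        conn.execute("UPDATE reading SET suspect = 1 WHERE rainfall_mm < 0")

        # Data analysis: recompute a daily summary from the clean readings and
        # write it back so other programs can work with the result.
        conn.execute("DROP TABLE IF EXISTS daily_rainfall")
        conn.execute("""
            CREATE TABLE daily_rainfall AS
            SELECT site_id,
                   date(recorded_at) AS day,
                   SUM(rainfall_mm)  AS total_mm
            FROM reading
            WHERE suspect = 0
            GROUP BY site_id, day
        """)

evaluate_and_analyze("measurements.db")
```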

    Data Results and Visualization: Finally, the results of the analysis must be presented as useful output to the consumers of the analysis. Since, in this case, we’re talking about a dynamic dataset, the results themselves will need to be presented in a dynamic fashion, and kept up to date as new data is added to the dataset. For example, the data might be made available via a web application that takes current analysis results either directly from the analysis program or from analysis results stored in the database, and then presents an up-to-date picture of these analysis results every time a web page is loaded.
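And, finally, a sketch of the presentation step: a tiny web endpoint (using Flask here, though any web framework would do) that reads the stored analysis results and returns an up-to-date picture every time it is requested. The database file and table names follow the toy example above.

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/daily-rainfall")
def daily_rainfall():
    """Return the latest stored analysis results on every request."""
    with sqlite3.connect("measurements.db") as conn:
        rows = conn.execute(
            "SELECT site_id, day, total_mm FROM daily_rainfall ORDER BY day"
        ).fetchall()
    return jsonify([{"site_id": s, "day": d, "total_mm": t} for s, d, t in rows])

if __name__ == "__main__":
    app.run(debug=True)
```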

    To conclude…

    Although at Sysabee we focus on the data analysis piece, we also know that it’s critically important for that piece to ‘play nice’ with all of the other components in the system. A major enabler of this is good system design. This blog article is already getting a little long, so I’ll take that topic up in a follow-up post.