Data Mining: Working with ‘Wild’ Data

Prior to the IT revolution, data was collected, usually laboriously and carefully, by hand. This didn’t mean that it was entirely error- or problem-free. But the sheer difficulty of gathering it, combined with the fact that it was usually collected in relatively small quantities, with its end use planned well in advance, meant that there was a fairly small jump between data collection and data analysis. As well, because data was perhaps most frequently collected in a scientific context, it was not unusual for many aspects of the system, process or objects generating the data to already be well understood, with their states carefully controlled and managed during data collection.

These days, outside of the scientific research context, such controlled, ideal conditions are not as typical. As a result, data generated and collected ‘in the wild’, perhaps by sensors, computer applications or web forms, is often more than a little rough around the edges. It may be collected willy-nilly, in vast quantities, with the equivalent of rocks, twigs and leaves mixed in with the good data. It may also be stored for years in old, musty, strangely designed databases, with very little in the way of labels or maps to illuminate the data structure.

Because of this, getting data from its starting state and location to the point where analysis can be performed, as well as determining which analyses can legitimately be performed, may be substantial tasks in and of themselves.

Polishing the dataset

To appreciate some of the challenges on the data collection front, it’s helpful to consider the end goal first. From an analysis point of view, the ideal dataset would be one with:

  • metadata for each data field describing the meaning, expected format and data type, and intended values or range of the data in the field
  • an up-to-date data model describing how the data fields that make up a record relate to one another, how the different data tables relate to each other, and how both relate to the system responsible for generating the data
  • information provided on the origins and collection strategy for the data
  • information provided that either describes, or at least enables, an assessment of the level of data precision and data accuracy
  • an assessment and summary of the range and type of data in each field, compared with the expected range and type of data (see the profiling sketch after this list)
  • identification of missing data and a consistent strategy applied for denoting different types of missing data in the database
  • identification of data collection or data entry errors, which would then be corrected or managed in some systematic fashion
  • storage of the cleaned dataset in a database (likely not the original one) that is readily and directly accessible by the data analysis tools being used to carry out the analysis
  • structuring of the data in the cleaned dataset, both with respect to format and database structure, in a manner that is appropriate for the intended analysis
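
As a concrete illustration of what a first pass at several of these checks might look like, here is a minimal profiling sketch in Python with pandas. The field names, expected types, ranges and nullability are entirely made up for illustration; in a real project these expectations would come from the metadata and data model described above.

```python
import pandas as pd

# Hypothetical expectations for each field: dtype, allowed range and whether
# missing values are acceptable. In practice these would come from the
# dataset's metadata and data model.
EXPECTATIONS = {
    "sensor_id": {"dtype": "int64", "min": 1, "max": 500, "nullable": False},
    "temp_c": {"dtype": "float64", "min": -40.0, "max": 60.0, "nullable": True},
    "reading_at": {"dtype": "datetime64[ns]", "min": None, "max": None, "nullable": False},
}

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each expected field: actual dtype, missing count and out-of-range count."""
    rows = []
    for col, spec in EXPECTATIONS.items():
        if col not in df.columns:
            rows.append({"field": col, "status": "missing column"})
            continue
        s = df[col]
        out_of_range = 0
        if spec["min"] is not None:
            out_of_range += int((s < spec["min"]).sum())
        if spec["max"] is not None:
            out_of_range += int((s > spec["max"]).sum())
        rows.append({
            "field": col,
            "status": "present",
            "expected_dtype": spec["dtype"],
            "actual_dtype": str(s.dtype),
            "missing": int(s.isna().sum()),
            "missing_allowed": spec["nullable"],
            "out_of_range": out_of_range,
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Toy data standing in for 'wild' sensor readings pulled from an old database.
    df = pd.DataFrame({
        "sensor_id": [1, 2, 999],            # 999 falls outside the expected 1-500 range
        "temp_c": [21.5, None, 85.0],        # one missing value, one implausibly high reading
        "reading_at": pd.to_datetime(["2021-01-01", "2021-01-02", None]),
    })
    print(profile(df))
```

A report like this doesn’t clean anything on its own, but it gives an early, field-by-field picture of how far the dataset is from the ideal state sketched in the list above.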

If a dataset that met all of these criteria were delivered to me, I would be in data analysis heaven! That said, I have enough experience to know that expecting this to be the starting state of the data is not realistic for most analysis projects. Indeed, an important starting point for a project is assessing where the dataset is now, and comparing that with where it needs to get to in order to be analysis ready – and then determining how that will happen.

Assessing and Evaluating the Dataset

On top of this, it’s important to determine the extent to which the data reflects the current or past state of the system of interest, and also how this, and the dataset itself, are likely to change over time. To do this, we need to understand whether the dataset is intended to:

  • act as a sample from which to draw conclusions about a larger population or, rather, represent some aspects of the entire population of interest
  • act as a snapshot in time of a particular system or set of objects, which in the future may themselves either remain the same or change over time
  • continue to grow and be added to following the analysis, with the new data being incorporated into the analysis in some way
  • illustrate something in the context of a before-and-after scenario, with plans to change the system, process or objects, and then gather additional data reflecting this new state, which can be usefully compared with the old one

All of these possibilities will substantially influence the choice of appropriate data analysis techniques, and also determine which conclusions about the system (its past, present and future states) can usefully be drawn from an analysis of the currently available data.
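
To make the first of these concrete: the same column of numbers is treated quite differently depending on whether it is regarded as the entire population of interest or as a sample drawn from a larger one. A minimal sketch, with made-up readings, of what that difference looks like in practice:

```python
import math
import statistics

# Hypothetical readings; the numbers are invented purely for illustration.
readings = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5]

# If the dataset IS the population of interest, the mean is simply a fact about it.
population_mean = statistics.mean(readings)

# If the dataset is a sample from a larger population, the mean is only an estimate
# and should be reported with some indication of uncertainty. Here: a rough 95%
# interval using the normal approximation; a small sample like this would more
# properly use a t-interval.
sample_mean = statistics.mean(readings)
sample_sd = statistics.stdev(readings)           # sample standard deviation (n - 1)
std_error = sample_sd / math.sqrt(len(readings))
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"Treated as the population: mean = {population_mean:.2f}")
print(f"Treated as a sample:       mean = {sample_mean:.2f}, approx. 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```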
