In this blog article I take a quick shot at answering three basic data science questions:
- What is data analysis?
- What is modeling (i.e. computer modeling, mathematical modeling, simulations)?
- Where does data come from?
What is data analysis?
When we do data analysis, we perform mathematical and logical operations on a collection of data (generally called a dataset).
The results of these operations allow us to draw conclusions about the objects, systems or processes that are generating the data (what the data is about).
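To make this a little more concrete, here is a minimal sketch of a mathematical operation followed by a logical one, using only Python's standard library. The dataset (`readings`) and the 1.5-standard-deviation cutoff are made up for illustration; they are assumptions, not a recommended rule.

```python
from statistics import mean, stdev

# Hypothetical dataset: temperature readings (degrees Celsius) from one sensor.
readings = [18.5, 19.0, 21.5, 20.0, 35.0, 19.5]

# Mathematical operations: summarize the dataset.
average = mean(readings)
spread = stdev(readings)  # sample standard deviation

# Logical operation: flag readings far from the average.
# The 1.5 cutoff is an arbitrary choice for this sketch.
outliers = [r for r in readings if abs(r - average) > 1.5 * spread]

print(average, outliers)
```

From a result like this we might conclude that something unusual happened to the sensor (or its surroundings) when the flagged reading was taken.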
Analysis can also allow us to structure objects in useful ways. For example, it might allow us to:
- group similar objects together
- classify objects into particular categories
- make predictions about the current and future behaviors or properties of these objects or similar objects.
This might in turn produce useful new data objects or structures:
- decision trees that can then be used to make decisions
- networks that can be used to trace connections and understand links between objects
- logical or mathematical statements that describe something interesting about the objects and their relationship to other objects
- systems models that can make predictions about the current or future behavior of objects and systems
- new categories, classifications or groupings of objects.
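As a rough illustration of classifying objects and grouping similar ones together, the sketch below sorts a few objects into categories. The `animals` data, the `size_class` function and its thresholds are all hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical objects: (name, weight in kg).
animals = [
    ("cat", 4.0),
    ("dog", 20.0),
    ("mouse", 0.02),
    ("horse", 500.0),
    ("rabbit", 2.0),
]

def size_class(weight_kg):
    """Classify an object into a category using made-up thresholds."""
    if weight_kg < 1:
        return "small"
    if weight_kg < 50:
        return "medium"
    return "large"

# Group objects that fall into the same category.
groups = defaultdict(list)
for name, weight in animals:
    groups[size_class(weight)].append(name)

print(dict(groups))
```

The resulting `groups` dictionary is itself a new data structure: a grouping of the objects that did not exist in the original dataset.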
What is modeling?
Modeling is the act of creating models. But what is a model? And what is it good for?
There are a number of different definitions of a model, but I favor Grier’s approach: a model is a structure (physical or virtual) with useful similarities to something else that is of interest. This real-world ‘something’ might be an object (or type of object), a system or a process. Modelers usually refer to this as the target (or target system). It’s the part of the world that we want to learn about and understand better.
Models are created using information and data about the target system. This information determines the structure of the model. The modeler must also decide how to properly relate the model to the target.
Once a model is created, we can use it to predict or learn about the behavior of the target system. We can also use models to ask ‘what if’ questions, for example: “If I were to change this aspect of my target system, how would its overall behavior likely change as a result?”
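A ‘what if’ question can be sketched with a very small toy model. Here the target system is an imaginary population that grows by a fixed fraction each year; the function name `simulate_population`, the starting size and the growth rates are all assumptions made up for this example.

```python
def simulate_population(initial, growth_rate, years):
    """Toy model: the population grows by a fixed fraction each year."""
    population = initial
    history = [population]
    for _ in range(years):
        population = population * (1 + growth_rate)
        history.append(population)
    return history

# Baseline prediction: 2% growth per year for 10 years.
baseline = simulate_population(1000, 0.02, 10)

# 'What if' the growth rate were 3% instead?
what_if = simulate_population(1000, 0.03, 10)

# How much does the predicted final population change?
difference = what_if[-1] - baseline[-1]
print(round(difference, 1))
```

Changing one aspect of the model (the growth rate) and comparing the two runs is the whole trick; real models are usually far more detailed, but the pattern is the same.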
Where does data come from?
Both data analysis and modeling need data to work with. But where does data come from?
Data consists of observations or information about objects, systems or processes of interest. Data may be collected by hand, automatically by computer, or by some combination of the two (e.g. entered by hand into computer-based forms).
Once collected, it may reside in files (e.g. comma- or tab-delimited text files), spreadsheets (e.g. Excel) or databases (e.g. MySQL, Microsoft Access, Oracle, SQL Server).
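For instance, a comma-delimited text file can be read with Python's built-in `csv` module. In this sketch the file contents are held in a string (`raw_text`) so the example is self-contained; the column names and values are invented.

```python
import csv
import io

# Stand-in for the contents of a comma-delimited file (hypothetical data).
raw_text = "name,age\nAda,36\nGrace,45\n"

# DictReader uses the first line as column names.
reader = csv.DictReader(io.StringIO(raw_text))

# Values arrive as text, so convert the age column to integers.
rows = [{"name": r["name"], "age": int(r["age"])} for r in reader]

print(rows)
```

For a tab-delimited file, passing `delimiter="\t"` to `csv.DictReader` should be the only change needed.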
Data may also exist in the form of documents and websites. This is referred to as unstructured data (or in the case of websites, semi-structured data). Unstructured data can be analyzed to give it more structure, at which point it may become semi-structured or structured data. The resulting data can then be stored in a database for further analysis, or used to create a model of the target system.
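One simple way to give unstructured text more structure is to pull out the pieces that follow a recognizable pattern and store them in a database table. The sketch below does this with a regular expression and an in-memory SQLite database. The `document` text, the pattern and the `shipments` table are all hypothetical; real documents are rarely this tidy.

```python
import re
import sqlite3

# Hypothetical unstructured text.
document = (
    "Order 1001 was shipped on 2021-03-04. "
    "Order 1002 was shipped on 2021-03-09."
)

# Extract (order id, date) pairs that match a fixed sentence pattern.
pattern = re.compile(r"Order (\d+) was shipped on (\d{4}-\d{2}-\d{2})")
records = [(int(order_id), date) for order_id, date in pattern.findall(document)]

# Store the now-structured records in a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (order_id INTEGER, shipped_on TEXT)")
conn.executemany("INSERT INTO shipments VALUES (?, ?)", records)
stored = conn.execute(
    "SELECT order_id, shipped_on FROM shipments ORDER BY order_id"
).fetchall()
conn.close()

print(stored)
```

Once the records are in a table like this, they can be queried and analyzed in the same way as any other structured data.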