You’ll often hear people say that data science is a team effort, and I think this is very true. Many different types of expertise are required to successfully carry out data science projects – there’s the software and software systems part, the data analysis part, the domain expertise part, the interface and visualization components… All of these are required for a successful project, along with the communications glue, both technical and human, that makes them all stick together.
That said, although all of these pieces are going to be incorporated in some way into any data science project, how much weight each one carries depends a great deal on the scale of the project.
Some data science projects will be relatively small and static. This doesn’t necessarily make them any less valuable, but it does reduce the technical requirements. They will typically involve a single, already existing dataset (perhaps contained in an Excel spreadsheet or text file), which can be loaded into an analytics package and analyzed essentially all at once (although there will no doubt be some back and forth between the analysts and the domain-specific team members here). The results themselves can then be reported in a relatively comprehensive and exhaustive fashion – for example, in the form of a report describing and visualizing the results of the analysis. To the researchers out there, this should sound suspiciously like writing a journal article.
Other data science projects are going to require a bit more support in the way of an underlying data analysis system. In particular, projects that involve what in this blog article I’ll refer to as dynamic datasets – ones where data is being added, updated and changed over time – will benefit from a data analysis system that is less manual, more automated, and a bit more technically sophisticated.
In this case, in creating such a system, what we’re essentially doing is building a pipeline through which our data can flow and be transformed into useful output along the way. Even with dynamic datasets there can be considerable variability in the scope and technical requirements needed, but it’s fairly safe to say that all of these systems will need to have certain core components present in some form or another.
So what are the pieces in this pipeline?
Data Collection: The data analysis pipeline starts with data collection. Data might be collected by computer programs (e.g. ones that keep track of user and other computer behaviours), through sensors set up in the environment (e.g. weather sensors that measure if it is raining and how much rain has fallen), or by being manually entered through user interfaces (e.g. data collected through forms on a website or web app).
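To make the sensor case a little more concrete, here’s a minimal Python sketch of what a collection step might look like. The rain gauge here is entirely hypothetical – a random number stands in for the real hardware query – and the record format is just an illustrative assumption:

```python
import random
from datetime import datetime, timezone

def read_rain_gauge():
    """Return a rain reading in millimetres.

    A real deployment would query the sensor hardware here;
    a random value stands in for that (hypothetical) device.
    """
    return round(random.uniform(0.0, 5.0), 2)

def collect_reading():
    """Package one sensor reading together with a UTC timestamp."""
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "rain_mm": read_rain_gauge(),
    }

reading = collect_reading()
```

Whatever the source – program, sensor, or web form – the collection step ends up producing small, timestamped records like this one, ready to be handed off to storage.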
Data Storage: Once collected, all of this data must be stored somewhere, and this is where database software and systems come into play. The programs, sensors or forms gather the data, and then connect to the database (handwaving aside the technical details here) and add that data into the database. Once in the database, other programs can then come along and work with the data stored there.
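Unpacking that handwaving just a little, here’s a small Python sketch of the storage step using SQLite. An in-memory database stands in for the persistent database server a production system would actually connect to, and the table layout is an assumption for illustration:

```python
import sqlite3

# In-memory database for illustration; a real pipeline would
# connect to a persistent server (PostgreSQL, MySQL, etc.).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE readings (
           collected_at TEXT NOT NULL,
           rain_mm      REAL NOT NULL
       )"""
)

# A collection program would insert each new record as it arrives.
conn.execute(
    "INSERT INTO readings (collected_at, rain_mm) VALUES (?, ?)",
    ("2024-01-01T00:00:00+00:00", 2.5),
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The key point is the separation of concerns: the collectors only know how to insert, while the analysis programs later on only need to know how to query.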
Data Evaluation and Data Analysis: Data cleaning and data evaluation programs might review the stored data to determine its quality and, potentially, correct issues that are detected. Then, analysis software can carry out predefined analyses on the data. As you can see from this description, these components of the data analysis system will often make changes or additions to the database, based on the results of their work on the dataset.
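As a sketch of what a cleaning pass might look like in Python – with the caveat that the plausibility threshold and the row format are assumptions made up for this example:

```python
def clean_readings(rows, max_plausible=100.0):
    """Split rows into clean and flagged readings.

    A negative or implausibly large rain value is assumed to be
    a sensor glitch; it is flagged for review rather than
    silently dropped.
    """
    clean, flagged = [], []
    for row in rows:
        if 0.0 <= row["rain_mm"] <= max_plausible:
            clean.append(row)
        else:
            flagged.append(row)
    return clean, flagged

sample = [{"rain_mm": 2.5}, {"rain_mm": -1.0}]
clean, flagged = clean_readings(sample)
```

In a real system, the flagged rows (and any corrections) would be written back to the database, which is exactly the kind of change-making behaviour described above.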
Data Results and Visualization: Finally, the results of the analysis must be presented as useful output to the consumers of the analysis. Since, in this case, we’re talking about a dynamic dataset, the results themselves will need to be presented in a dynamic fashion, and kept up to date as new data is added to the dataset. For example, the data might be made available via a web application that takes current analysis results either directly from the analysis program or from analysis results stored in the database, and then presents an up-to-date picture of these analysis results every time a web page is loaded.
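The essence of that “up to date on every page load” idea can be sketched in a few lines of Python. Here a render function pulls fresh rows each time it is called; the `get_readings` callable is a stand-in for whatever query the web application would actually run against the database:

```python
import statistics

def render_dashboard(get_readings):
    """Build an up-to-date text summary each time it is called.

    Because the rows are fetched inside the function, every
    render reflects whatever data has arrived since the last one.
    """
    values = [r["rain_mm"] for r in get_readings()]
    if not values:
        return "No data yet."
    return (
        f"Readings: {len(values)}\n"
        f"Mean rainfall: {statistics.mean(values):.2f} mm"
    )

summary = render_dashboard(lambda: [{"rain_mm": 1.0}, {"rain_mm": 3.0}])
```

A web framework would wrap something like this in a request handler, but the design choice is the same: compute (or fetch) the results at request time rather than baking them into a static report.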
Although at Sysabee we focus on the data analysis piece, we also know that it’s critically important for that piece to ‘play nice’ with all of the other components in the system. A major enabler of this is good system design. This blog article is already getting a little long, so I’ll take that topic up in a follow-up post.