Recently I have read a number of articles about Big Data Scientists. Actually, they have been about Scientists, Analysts, Researchers, etc. Nobody knows what to call them, so I’ll use the term “Scientists.”

And similarly, no one seems to agree on where they come from (or where they should come from). The arguments range from IT, Business Analysts and Statisticians to Social Scientists. As I read these debates, my gut reaction was to believe the best data scientists will come from the social sciences.

Dr. Mark Hayward is Director of the Population Research Center at University of Texas.

The fact that my background is in sociology had nothing to do with that conclusion. However, rather than write a blog post confirming my gut reaction, I decided to talk to one of the best social science researchers I know, Dr. Mark Hayward from the University of Texas. Mark is Director of the Population Research Center at UT. He is also a professor of Sociology and a Centennial Commission Professor of Liberal Arts. I went to graduate school with Mark, and the man knows his research.

We began talking about Big Data Scientists and where the best analysts will come from. The first thing that surprised me was that Mark is dealing with Big Data on a daily basis. As Mark and his colleagues study the effects of life-course exposures on health and morbidity, they are dealing with massive volumes of data. Imagine a large longitudinal database addressing health indicators. These researchers have ongoing survey data, married with ongoing environmental or macro-level data, coupled with biogenetic data. When Mark described the types of datasets he is dealing with, I thought he was going to confirm my admittedly biased conclusion that social scientists, especially sociologists, were going to provide the next generation of data scientists.

I was dead wrong (well, partially). Mark’s position is that social scientists are somewhat flawed in their ability to fully analyze the complexities of Big Data. His rationale is that social scientists are trained to approach analysis with a hypothesis and to rely solely on the acceptance or rejection of that hypothesis. While this approach serves all of us well in most of our research efforts, it doesn’t help us get at the nuances of Big Data Analysis. I should also note that there is probably a bit of a stigma in academic circles attached to the alternative, exploring data without a prior hypothesis, which is typically referred to as “data dredging.”

Imagine Big Data as a large cloud of data. Mark argues that there are a variety of “hot spots” in that data cloud. And remember that this data is not linear: there is a lot of collinearity, or coherence, in the data. Think of someone with a PhD: on average, they are going to have a higher income, live in a better environment, have better access to health care, etc. These variables don’t vary independently but adhere to each other in the data cloud.

As analysts, Mark suggests that we need to look at the myriad of hot spots in the data cloud and let those hot spots guide the evolution of our hypotheses and findings. What is it that makes the hot spots different? How do we explain the various behaviors or outcomes within each hot spot rather than across hot spots? In fact, when we look across all hot spots, we’re likely to completely miss what’s going on.
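To make the "within versus across hot spots" point concrete, here is a minimal sketch in Python using made-up data. Everything in it is an assumption for illustration and not drawn from Mark's research: the two hot spots "A" and "B", their centers, the within-cluster slope of -0.8, and the helper names `ols_slope` and `make_hot_spot`. It also assumes we already know which hot spot each observation belongs to; in real data, finding the hot spots is a large part of the work.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def make_hot_spot(rng, center_x, center_y, n=500):
    """Hypothetical cluster in which y falls as x rises *within* the cluster."""
    xs = [center_x + rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [center_y - 0.8 * (x - center_x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Two made-up hot spots; B sits higher than A on both variables.
hot_spots = {
    "A": make_hot_spot(rng, 0.0, 0.0),
    "B": make_hot_spot(rng, 5.0, 5.0),
}

# Slope estimated separately inside each hot spot.
within = {name: ols_slope(xs, ys) for name, (xs, ys) in hot_spots.items()}

# Slope estimated across all hot spots at once, ignoring membership.
all_x = [x for xs, _ in hot_spots.values() for x in xs]
all_y = [y for _, ys in hot_spots.values() for y in ys]
pooled = ols_slope(all_x, all_y)

for name, s in within.items():
    print(f"slope within hot spot {name}: {s:+.2f}")
print(f"slope across all hot spots: {pooled:+.2f}")
```

In this toy setup the relationship is negative inside each hot spot, yet the pooled estimate comes out positive, because the difference between the two hot spots swamps what is happening inside them. That is one simple way an analysis across all hot spots can miss what is going on.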

My final question to Mark was – OK, so where do these data scientists come from? His response was “all of the disciplines.” Mark believes that no single discipline today is preparing the next generation of scientists to tackle Big Data. Rather, it will be collaboration among a variety of skills and disciplines that will tease out the insights. “If I were to build a Big Data Scientist curriculum today, I would include experts from the following fields:”

  • Social scientists – who understand human behavior
  • Business researchers – who understand the application of Bayesian approaches to consumer behavior
  • Statisticians – at the forefront of evolving techniques like Random Forest
  • Information technologists – who can deal with the complexities of database integration and manipulation

So while my hypothesis wasn’t totally validated, it wasn’t totally rejected either. Sociologists and demographers will play a role in the evolving discipline of Big Data Science, just not on their own.