Dean of Big Data - CTO, IoT and Analytics at Hitachi Vantara

William Schmarzo


Lessons in Becoming an Effective Data Scientist

I was recently a guest lecturer at the University of California Berkeley Extension in San Francisco. On a lovely Saturday afternoon, the classroom was crowded with students of all ages learning the tools of the modern economy. The craftspeople of the “Analytics Revolution” were busy learning new skills and tools that will prepare them for this Brave New World of analytics. I was blown away by their dedication!

As we teach the next generation, it’s important that we focus more on capabilities and less so on skills. What I mean is “learning TensorFlow” isn’t nearly as important as “learning how to learn TensorFlow.”

We need to make sure that we teach concepts and methodologies along with the tools. We should teach the “What” and “Why” as well as the “How” so we don’t put our students in a situation where they “can’t see the forest for the trees.”

This brings me to a recent article, “What IBM Looks for in a Data Scientist.” The list of skills it outlines is very useful, especially for someone pursuing such a career:

  1. Training as a scientist with an MS or PhD.
  2. Expertise in machine learning and statistics with an emphasis on decision optimization.
  3. Expertise in R, Python or Scala.
  4. Ability to transform and manage large data sets.
  5. Proven ability to apply the skills above to real-world business problems.
  6. Ability to evaluate model performance and tune it accordingly.

Unfortunately, this is a tactical list, not a strategic list. In fact, some of the points are too granular and too focused on “how” versus “why.” For example, on point #3, it’s more important to know how to program than it is to know a specific language. It’s more important to learn the concepts and approaches behind effective programming than it is to learn the tools themselves. The minute you think you’re an expert at R or Python or Scala, along comes Julia. It’s important to develop transferable skills rather than having to re-educate yourself each time a new tool arrives.

In a world driven by the rapid introduction and adoption of open source tools and frameworks (like TensorFlow for machine learning), expertise in a tool is fleeting. However, mastery of the concepts and approaches for which those tools are used is critical because being a data scientist is more than just a bag of skills. The best data scientists are about outcomes and results.

Data Science DEPP Engagement Process

Our data science team at Dell EMC uses a methodology called DEPP that guides the collaboration with the business stakeholders through the following stages:

  • Descriptive Analytics to clearly understand what happened and how the business is measuring success.
  • Exploratory Analytics to understand the financial, business and operational drivers behind what happened.
  • Predictive Analytics to transition the business stakeholder mindset to focus on predicting what is likely to happen.
  • Prescriptive Analytics to identify actions or recommendations based upon the measures of business success and the Predictive Analytics.

The DEPP Methodology is an agile and iterative process that continues to evolve in scope and complexity as our clients mature in their advanced analytics capabilities (see Figure 1).

Figure 1: Dell EMC DEPP Data Science Collaborative Methodology

Importance of Humility

The first skill that I look for when engaging with or hiring a data scientist is humility. I look for the ability to listen to and engage with others who may not seem as smart as they are. And as you can see from our DEPP methodology, humility is the key to driving collaboration between the business stakeholders (who will never understand data science to the level that a data scientist does) and the data scientists (who will never understand the business to the level that the business stakeholders do).

Humility is critical to our DEPP methodology because you can’t learn what’s important for the business if you aren’t willing to acknowledge that you might not know everything.

Humility is one of the secrets to effective collaboration. Nowhere does business/data science collaboration play a more important role than in hypothesis development.

A hypothesis is a formal statement that presents the expected relationship between an independent and dependent variable. (Creswell, 1994)

If you get the hypothesis, or the metrics against which you are going to measure success, wrong, then everything the data scientist does to support that hypothesis doesn’t matter. Worse, you are not only likely to achieve suboptimal results; you could achieve the wrong results altogether.

For example, in the healthcare industry, we are seeing the disastrous effects of the wrong metrics (see the blog “Unintended Consequences of the Wrong Measures” for more details). Instead of using “Patient Satisfaction” as the metric against which to measure doctor and hospital effectiveness (which is leading to unintended consequences), the healthcare industry may benefit from a more holistic metric against which to measure success; one example is a “Quality and Effectiveness of Care” score combined with a “Readmissions” score and a “Hospital Acquired Infections” score.

Being off in your hypothesis by just one degree can be disastrous. For example, if you were flying from San Francisco to Washington, D.C. and were off by a mere one degree at takeoff, you’d end up on the other side of Baltimore, 42.6 miles away (“Impact of A Mere One-Degree Difference”).
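The arithmetic behind that one-degree figure is easy to check. The sketch below assumes a rough 2,440-mile San Francisco-to-Washington flight distance (an assumed round number, not a figure from the original source) and applies the small-angle drift formula:

```python
import math

# Assumed distance; the exact SF -> Washington, D.C. mileage varies by route.
distance_miles = 2440
heading_error_deg = 1.0      # constant heading error at takeoff

# How far off course you land after flying the full distance
# with that heading error: distance * tan(error).
drift_miles = distance_miles * math.tan(math.radians(heading_error_deg))

print(round(drift_miles, 1))  # roughly 42.6 miles
```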

Figure 2: Ramifications of being off 1 degree

 

Get the hypothesis wrong, even by one degree, and the results could be wrong or even disastrous (say, if you have tickets to watch the Washington Redskins play football and not the Baltimore Ravens).

Type I / Type II Errors

Being humble also means conceding when you may be wrong, particularly with analytic models that may not always deliver the right predictions or outcomes. In that case, a solid understanding of the business or organizational costs of Type I (False Positive) and Type II (False Negative) errors is important. Understanding the business and organizational ramifications of such errors requires close collaboration with the business stakeholders (see Figure 3).

Figure 3: Understanding Type I Errors and Type II Errors

See the blog “Understanding Type I and Type II Errors” for more details.
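The asymmetry between the two error types is ultimately a business question, not a statistical one. As a minimal sketch, assuming hypothetical error counts and per-error dollar costs agreed on with the business stakeholders:

```python
# All counts and costs below are hypothetical, for illustration only.
false_positives = 120   # Type I: model flagged a problem that wasn't there
false_negatives = 15    # Type II: model missed a real problem

cost_per_false_positive = 50      # e.g., a wasted follow-up inspection
cost_per_false_negative = 5000    # e.g., an undetected equipment failure

total_cost = (false_positives * cost_per_false_positive
              + false_negatives * cost_per_false_negative)

print(total_cost)  # 81000 -- the rarer Type II errors dominate the cost
```

A model tuned purely for accuracy would treat both error types the same; pricing them with the business often changes which model, or which decision threshold, is actually “best.”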

Summary

In my classes, I focus on the “What” and “Why” versus spending too much time on the “How”. I want my students to have a framework that enables them to understand how the different technologies, techniques and tools can be more effectively used.

I’m not teaching my students data science; I’m teaching them how to learn data science. It is an important distinction that can be humbling, but it results in a more detail-oriented student who wishes not only to become a data scientist, but to become an effective data scientist. As teachers, it is important that we know the difference.

The post Lessons in Becoming an Effective Data Scientist appeared first on InFocus Blog | Dell EMC Services.


More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Hitachi Vantara as CTO, IoT and Analytics.

Previously, as a CTO within Dell EMC’s 2,000+ person consulting organization, he worked with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow, where he teaches the “Big Data MBA” course. Bill also recently completed a research paper on “Determining The Economic Value of Data.” Onalytica recently ranked Bill as the #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.