Data Scientists, ASPIRE BIG! 18 skills of the Full Workflow Data Scientist

Authors: Madhu Gopinathan (MakeMyTrip), Varun Modi (InMobi), Avi Patchava (InMobi)

Data Science has fast become a rich, awe-inspiring field of expertise.

There are multiple areas in which to excel and build subject matter expertise. At InMobi and MakeMyTrip, we learned that Data science teams strongly benefit from having a cadre of Full Workflow Data Scientists – analogous to what software engineers call Full Stack Engineers.

These are individuals who have developed a skill set that enables them to architect, advise, apply algorithms, and take action across all activities necessary to obtain maximum value from applying a model in a business environment.

Eric Colson of Stitchfix notes that Data Science generalists are preferred to specialists when the uncertainty in the environment is high, e.g., in a start-up context or in new markets. As a generalist, the Full Workflow Data Scientist adapts better to change than singular specialists.

This is not a skill set that is necessarily required for every Data Scientist. But, for those with appetite, and enough core training and experience across business, algorithms and engineering, the ‘Full Workflow Data Scientist’ is a worthy ambition – it will make them indispensable, high-impact individuals.

Madhu, Varun and Avi have enumerated a full set of 18 relevant skill areas, intended to serve as a compass for aspiring young data scientists. We are excited to see many more such individuals develop in the Indian and Bangalore ecosystem. Data scientists, ASPIRE BIG!

The 18 skills of The Full Workflow Data Scientist

There are 5 types of skill following the workflow of a Data Scientist: Understanding the problem; Building the dataset; Developing the model; Putting into production; Achieving sustainable impact.

Each of the 18 specific skills is categorised as Core, Intermediate, or Advanced. There are: 6 Core skills, 7 Intermediate skills, 5 Advanced skills.

Here is the full list. Details, descriptions and useful links follow afterwards.

A) UNDERSTANDING THE PROBLEM

1.   Develop deep understanding of the business context (CORE SKILL)

2.   Frame the business problem into a Data Science or modelling problem (INTERMEDIATE SKILL)

3.   Develop a use case roadmap for a problem area or capability for the business (ADVANCED SKILL)

B) BUILDING THE DATASET

4.   Extract data from multiple sources (CORE SKILL)

5. Build the dataset (using data-joins) to solve the problem at hand (CORE SKILL)

6. Run Data exploration to understand relationships and patterns within the data (INTERMEDIATE SKILL)

7.   Develop data visualisation to represent and be able to demonstrate the relationships identified from data exploration (ADVANCED SKILL)

C) DEVELOPING THE MODEL

8. Gain proficiency with executing the Model => Train => Evaluate => Refine cycle (CORE SKILL)

9. Refine and deepen understanding of the algorithmic and inferential aspects of Statistical analysis (INTERMEDIATE SKILL)

10. Evaluate new algorithms from latest research and develop intuition about the problems for which they are likely to improve the state of the practice  (ADVANCED SKILL)

D) PUTTING INTO PRODUCTION

11. Build training pipelines for the production environment (CORE SKILL)

12. Provide inputs for design, quality assurance parameters and support implementation for the model in online environment (INTERMEDIATE SKILL)

13. Provide inputs and determine infra requirements and infra management for model deployment (INTERMEDIATE SKILL)

14. Lead debugging of data pipelines and model behaviour in production environment (ADVANCED SKILL)

E) ACHIEVING SUSTAINABLE IMPACT

15. Develop dashboards to enable easy tracking and communication of model impact (CORE SKILL)

16. Support and influence design of the business metrics to track model business impact (INTERMEDIATE SKILL)

17.  Develop and execute on a plan for continuous iteration and refinement of a new model (INTERMEDIATE SKILL)

18. Develop the story and communicate the story to all stakeholders on what has been achieved with a new use case (ADVANCED SKILL)

Details, description and useful links follow.

A) UNDERSTANDING THE PROBLEM

CORE SKILL:

1.   Develop deep understanding of the business context

A Data Scientist at a business, or organisation, has a job in order to help the organisation achieve its objectives. The Data Scientist must ensure she is working on problem areas that matter for the business. The Data Scientist is the expert on model-building; it will be their responsibility (among others) to ensure that the model being built is truly going to deliver value and impact for the business.

To do this, the Data Scientist must invest time up-front in any problem to fully understand the business context and situation. They should be able to answer questions such as:

  • How does this business (unit) operate?
  • What are the most pressing problems or concerns for this business to solve?
  • What are the forces at work in driving success for this business?
  • Why is the business problem that is being framed so relevant to the business right now?
  • What is the timeline? Who will the beneficiaries be? What form will value come in e.g., revenue up-side, team productivity, or cost-savings, or new skills that enable innovation and growth?

Thus, the Data Scientist must be an active and effective listener as they speak to new stakeholders to understand the context. They must be able to capture key points that could be relevant for thinking and planning of their models. They must be able to synthesise the essence of the problem(s) they are trying to solve. They also must be able to probe and investigate, as an interviewer of their key stakeholders, to fully build the picture and relevant details. Sometimes, they might choose to shadow key stakeholders  (i.e. potential users of the model) to better understand business processes and how the model will be relevant.

As an example, when InMobi’s Data science team started working with creatives (advertising images), the lead data Scientist – Arpan Maheshwari – started the process by shadowing eight different creative artists at InMobi, over the course of 2 weeks. He also conducted deep interviews to better understand how these artists operate, and how they add value in their work.

Relevant references:

  1. http://www.business-science.io/business/2018/06/19/business-science-problem-framework.html
  2. https://www.youtube.com/watch?v=ppCsvSeRZcI&t=273s

INTERMEDIATE SKILL:

2.   Frame the business problem into a Data Science or modelling problem

A business problem, by itself, does not immediately give clarity on how a model needs to be used. The Data scientist will need to consider what value is needed to support the business problem – sometimes they can only offer a component of the solution – and how a model will drive this value. There might be various ways to frame the model problem; some might make more sense than others, and importantly some will be more feasible than others.

For example, Avi was once working with a sales team at a major B2C player where there were relatively long sales cycles for the product in question, and yet the sales consultants would receive ~15 new leads per day. The business problem was very open: how to make sales consultants more effective and ultimately drive an uplift in sales.

The Data science team had to frame that business challenge into a viable model that would add value for sales consultants. The team built a model to score different leads and provide these scores – in 3 simple buckets – so sales consultants could immediately see what their top priority leads would be, and thus where they needed to spend maximum energy. That decision to provide the probability scores in 3 simple buckets was driven by an understanding of how the sales consultants would be able to practically apply the results of the model.

Relevant links and references:

https://www.kdnuggets.com/2016/03/data-science-process.html

https://medium.com/@jameschen_78678/solve-business-problems-with-data-science-155534b1995d

ADVANCED SKILL:

3.   Develop a use case roadmap for a problem area or capability for the business

When a Data scientist is supporting a business, or a unit of the business, or a capability within the business, there needs to be a longer-term plan to ensure the area gets the maximum benefit from applying data science and ML-AI. Some Data Scientists might work from use case to use case, as the demands of the business or product team require. However, Data science teams can be much more productive by having a longer-term view (e.g. 2-4 quarters ahead) of what problems will be addressed to fully build out the capability. This ensures better planning and management of high lead-time tasks, such as logging of new variables and preparation of large datasets.

Thus, a Data scientist can be a more effective team member in a multi-disciplinary team by co-developing the “use case roadmap” of what problems will be addressed in what order. This can be as simple as a clear plan of use cases, against a timeline, and ensuring that the use cases ‘stack’ and build upon each other, as they are delivered. The plan should also take account of high lead time activities which should happen in advance – e.g., new data field logging, PRD specification, dataset preparation –  to ensure that the Data Scientist can be most productive as quickly as possible when the problem is picked up.

As an example, InMobi’s Data science team for anti-fraud products  – led by Dr. Farhat Habib and emerging full workflow data scientist Vishesh Sharma – has developed a comprehensive view on roughly 20 areas where models can be applied in the workflow beginning with ad-requests to actual transactions being delivered for our advertising clients. These are prioritised based on the needs of the business, feature availability, and depending on how fraud behaviour is also evolving in response to our strengthening fraud-prevention system – an ongoing battlefield.

Relevant links and references:

  1. https://www.datasciencecentral.com/profiles/blogs/3-stages-of-creating-smart?xg_source=activity

B) BUILDING THE DATASET

CORE SKILL:

4.   Extract data from multiple sources

This is typically the most time-consuming step for a Data Scientist (particularly for newer problems), so it needs to be managed well. The following skills help a Data Scientist to reduce the overall experiment time significantly:

The input data can vary in size from a few megabytes to terabytes, hence hands-on knowledge of tools/libraries for processing both smaller datasets (e.g., Python pandas, R data.table) and larger datasets (e.g., Spark, Pig) can help a Data Scientist optimally solve different problems depending on the scale of the data.

While command line tools, such as grep/sed/awk, can prove useful to explore the data structure and schema for data stored in text format, interpreter-based tools such as Spark, Python, Pig, SQL, PSQL and Hive can be used to explore data from other storage systems (SQL or NoSQL) or storage formats (Parquet, ORC). And for various NLP (Natural Language Processing) use cases, comfort with web scraping tools such as Selenium and BeautifulSoup in Python can be helpful for collecting more data than is available in existing datasets.

In the case of multiple sources, modularizing the code for extracting data from each source is very helpful in validation, debugging and re-use of the data pipeline.
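
As an illustration of this modular pattern, here is a minimal sketch with one extraction function per source; the paths, table names and SparkSession setup are hypothetical, with pandas used for a small-data path and Spark for the large one.

```python
# Sketch of modular data extraction: one function per source.
# File paths and table names here are hypothetical placeholders.
import pandas as pd
from pyspark.sql import SparkSession

def extract_clicks_sample(path="clicks_sample.csv"):
    """Small extract: load a CSV sample into pandas for quick exploration."""
    return pd.read_csv(path, parse_dates=["event_time"])

def extract_clicks_full(spark, path="hdfs:///logs/clicks/"):
    """Large extract: read the full Parquet dataset with Spark."""
    return spark.read.parquet(path)

def extract_user_profiles(spark, table="warehouse.user_profiles"):
    """Separate, reusable extractor for a Hive table of user metadata."""
    return spark.table(table)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("extract-demo").getOrCreate()
    clicks = extract_clicks_full(spark)
    users = extract_user_profiles(spark)
    # Each extractor can now be validated, debugged and re-used independently.
    print(clicks.count(), users.count())
```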

Depending on model requirements and infra limitations, a Data scientist may need to sample the data over time or feature-space. Understanding the relationship between the sampling approach used and the final business use case is very important [Refer to Skills 1 and 2 above]. For example, in a dataset involving age data, age groups with low frequency in the data might get very little representation in a random sample, which might not be optimal for a business use case weighting each age group equally (as opposed to a stratified sample).
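
A minimal sketch of that difference, using pandas on synthetic data: a plain random sample under-represents the rare age group, while a stratified sample draws equally from every group (column names and proportions are illustrative).

```python
# Random vs. stratified sampling by age group with pandas (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_group": rng.choice(["18-24", "25-34", "35-44", "65+"],
                            size=100_000, p=[0.45, 0.35, 0.18, 0.02]),
    "clicked": rng.integers(0, 2, size=100_000),
})

# Plain random sample: the rare "65+" group may be badly under-represented.
random_sample = df.sample(frac=0.01, random_state=0)

# Stratified sample: draw the same number of rows from every age group.
stratified_sample = (
    df.groupby("age_group", group_keys=False)
      .apply(lambda g: g.sample(n=250, random_state=0))
)

print(random_sample["age_group"].value_counts())
print(stratified_sample["age_group"].value_counts())
```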

When solving business problems, the input data sources might have various business and engineering stakeholders. For reducing the initial overhead of extracting the data, often several rounds of discussion, debate and clarifications are needed with the business and engineering owners around the following points:

  • Known issues in the data related to data hygiene, missing data and backfills for the duration of the data-pull
  • Nature of existing aggregations and relationship with upstream sources
  • Already existing scripts, tools or pipelines to create the required dataset
  • Privacy or legal issues in access and use of any of the source datasets / features
  • Align your assumptions about the nature of data with those of owners
  • Check if the preliminary descriptive statistics match with what is being reported at the business end

For example, one of the very first problems that came to InMobi’s Data science team when it was founded several years ago was to create user segments for ad-targeting. This involved processing user response data, and associated meta-data, for more than one billion users over an extended period of time. The data involved was in terabytes and was spread across a variety of sources ranging from raw logs on Hadoop to Hive and PostgreSQL DBs.

The initial attempt to generate the final dataset in a single workflow was error prone, not just due to the scale and complexity of the pipeline, but because a good number of the input sources had never been used in a data science context before and had multiple issues related to hygiene, consistency and freshness. Multiple iterations with the product and engineering owners, and modularising the workflow, helped build a scalable pipeline which is used even today for user-related modelling problems.

Relevant links and references:

  1. https://www.mastersindatascience.org/data-Scientist-skills/hadoop/
  2. https://www.slideshare.net/Hadoop_Summit/t-1205p230-cstella
  3. https://www.slideshare.net/databricks/building-robust-etl-pipelines-with-apache-spark
  4. https://spark.apache.org/docs/2.2.0/ml-features.html

CORE SKILL:

5. Build the dataset (using data-joins) to solve the problem at hand

This is typically one of the most error prone and computation heavy steps in dataset preparation. There are several specific elements to this skill that help conduct this step as error-free as possible:

  • Clear understanding of the nature of joins required across various data sources, depending on the relationship between join keys (one-to-one, many-to-one, many-to-many), can help prevent overcounting or undercounting errors downstream
  • Matching aggregates after joins to those in commonly used business reporting tools, which can result in quick validation of the pipeline
  • Understanding of NLP-based descriptors or other fuzzy-join techniques for joining when the join key is not well defined. For example, joining on address strings (for physical home address) across different sources
  • Effectively using partitioning techniques (in particular for skewed datasets), sequencing joins correctly, and using replicated joins for larger datasets, all for computational efficiency. Knowledge of various hashing techniques can also help here

For example, to share a challenge we have faced at InMobi: several problem areas require us to predict metrics for a user’s response to mobile ads, such as ‘Conversion Rate’ – computed as Total Conversions / Total Clicks. Building the training data for these problems typically involves joining data from multiple raw streams. In multiple experiments we have run, overprediction or underprediction bias in the evaluation phase of the model was traced back to over-counting of clicks or conversion events. We discovered this was due to duplicate records present in one or more of the data streams for the join-key that had been selected.
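
A minimal sketch of the kind of guard that catches this, assuming PySpark: it flags duplicate join keys and de-duplicates before joining (the column names and toy rows are illustrative only).

```python
# Guard against duplicate join keys before joining click and conversion streams.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-check").getOrCreate()

clicks = spark.createDataFrame(
    [("req1", 1), ("req2", 1), ("req2", 1)],        # "req2" is duplicated
    ["request_id", "click"])
conversions = spark.createDataFrame(
    [("req2", 1)], ["request_id", "conversion"])

# 1) Detect duplicates on the intended join key.
dupes = clicks.groupBy("request_id").count().filter(F.col("count") > 1)
print("duplicate join keys:", dupes.count())

# 2) De-duplicate before joining so conversions are not over-counted.
clicks_dedup = clicks.dropDuplicates(["request_id"])
joined = clicks_dedup.join(conversions, on="request_id", how="left")
joined.show()
```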

Relevant links and references:

  1. https://databricks.com/session/optimizing-apache-spark-sql-joins
  2. http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/

INTERMEDIATE SKILL:

6. Run Data exploration to understand relationships and patterns within the data

This skill enables a Data Scientist to understand the relationships and patterns in the data for data cleaning, feature engineering and model development. Effective application of this step ultimately supports genuine improvements in the final accuracy and relevance of a model. Each exploratory step should be performed to answer a specific question or validate assumptions about the nature of the data, which should further result in more exploratory steps, or leads and ideas, for feature engineering and model development.

Techniques like univariate analysis (i.e. checking feature coverage and skew), bivariate analysis (i.e. relationship with the target variable) and multivariate analysis (i.e. relationships among features) all are relevant. Basic statistical concepts such as correlation, covariance, mean, median and IQR should be applied to derive insights to inform feature selection decisions. In some cases the problems involve a large number of features and a number of sub-problems, such as building different models for each country. In these cases, ‘automated approaches’ from information theory can be used, such as Mutual Information or Information Gain Ratio for classification problems, and regression analysis (e.g., using R-squared measures).
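
As a small illustration of such an ‘automated approach’, the sketch below ranks features by mutual information with the target using scikit-learn on a synthetic classification dataset.

```python
# Rank features by mutual information with the target (synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=5_000, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
ranked = sorted(enumerate(mi), key=lambda kv: kv[1], reverse=True)
for idx, score in ranked[:5]:
    print(f"feature_{idx}: MI = {score:.3f}")
```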

In most business problems the data is noisy and some form of data cleaning creates significant step-ups in model performance. It may involve multiple iterations with feedback from subsequent model selection and evaluation steps. Techniques based on anomaly detection (hypothesis tests, clustering) as well as prior context about hygiene and coverage of features, are relevant to support data cleaning activities.

For example, one of the most common problems in user response prediction is of high-dimensional data and yet a small amount of learning data per feature value. For Ad Networks handling billions of requests per day, the total dataset size increases exponentially with more features being added to the training data. Applying feature selection techniques such as Mutual Information or Information Gain Ratio can reduce both the feature set and dataset size. This allows the Data Scientist to iterate faster in the feature engineering and modeling phases. We often have to reduce the cardinality of many features by clustering feature values supervised on the target variable, or by building latent factors.

Relevant links and references:

  1. https://www.kdnuggets.com/2017/04/value-exploratory-data-analysis.html
  2. https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
  3. https://medium.com/district-data-labs/data-exploration-with-python-part-1-643fda933479

ADVANCED SKILL:

7.   Develop data visualisation to represent and be able to demonstrate the relationships identified from data exploration

This skill is intertwined with data exploration and data cleaning. Visualisations have two major functional goals for a Data Scientist:

  • To support insights for further feature engineering and modeling steps
  • To include the visualisations as part of the final story of the model [See Skill-18 on ‘Developing the story and communication’]

There are many good references on what are good and bad visualisations (see the Reference links below). In essence a good visualisation should be able to clearly answer the questions asked in the exploratory step by placing maximum emphasis on the relevant information.

Depending on the nature of the problem, a Data scientist would apply different visualisation techniques. For understanding relationships between images of different characters in a digit recognition problem, data can be plotted in lower dimensions after applying dimensionality reduction techniques such as PCA or t-SNE. On the other hand, for figuring out the number of past days to consider in a forecasting problem, ACF or PACF plots are used. Though R (ggplot2) and Python (matplotlib, seaborn) provide capabilities to generate a decent breadth of visualisations, effectively applying visualisations to relevant problems to support the storytelling is a skill the Data Scientist builds from practice in solving more and newer problems, and seeing what does and does not work in communicating with stakeholders.
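
As a minimal example of the first technique mentioned above, this sketch projects the scikit-learn digits dataset into two dimensions with t-SNE and plots it with matplotlib.

```python
# Visualise the digits dataset in 2D with t-SNE to inspect class relationships.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
embedding = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1],
                      c=digits.target, cmap="tab10", s=8)
plt.legend(*scatter.legend_elements(), title="digit", loc="best", fontsize=7)
plt.title("t-SNE projection of the digits dataset")
plt.tight_layout()
plt.show()
```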

Building on our previous example, one effective technique to reduce the cardinality of features is clustering the values based on latent factors. These latent factors can be learnt using various techniques such as matrix factorization or neural-network embeddings. At InMobi, we seek to visualize the nature of the clusters we build to further our understanding. This not only validates model correctness for external stakeholders but can also help extract valuable business insights.

Relevant links and references:

  1. https://www.analyticsvidhya.com/blog/2018/01/collection-data-visualizations-you-must-see/
  2. https://www.kdnuggets.com/2018/08/interpreting-data-set.html
  3. https://www.kdnuggets.com/2017/03/what-makes-good-data-visualization.html
  4. https://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap3_data_exploration.ppt

C) DEVELOPING THE MODEL

CORE SKILL:

8. Gain proficiency with executing the Model => Train => Evaluate => Refine cycle

This is typically the very first activity that comes to mind when one thinks of applying machine learning to a problem. Naturally we cannot do full justice to all the steps and nuances involved in this activity. However, as a broad introduction, we think of this activity as having the following major components for a Data Scientist:

  • Choose a model evaluation metric that is appropriate for the problem. This will help a Data Scientist to quantify the performance of a model and to help select between competing models
  • Divide the data into train, validation and test sets. Build a baseline model, i.e. a simple strategy for prediction. Compute the evaluation metric on the train and validation sets using the baseline model. This sets the first target to beat using a more sophisticated model
  • Train a more sophisticated model with the training set and then assess its performance on the validation set. A learning curve shows how the evaluation metric changes as a function of the amount of training data. Plot learning curves of a model using the train and validation sets to assess whether the model suffers from high bias or high variance in its errors.
  • High bias can cause underfitting, i.e. the model is not learning as much of the signal in the data as it could
  • High variance can cause overfitting, i.e. the model is fitting the noise in the data and therefore may not generalize to unseen data
  • Decide the next steps based on these cases:
    • Model suffers from high bias, i.e. the training set error is close to the validation set error and you are not satisfied with the performance: try feature engineering, or build a more complex model
    • Model suffers from high variance, i.e. the performance on the training set is satisfactory, but there is a large gap when compared to the validation set: try obtaining more training data, or use a simpler model
    • Performance on the validation set is satisfactory: evaluate the model on the test set; if satisfactory, then proceed to put the model in production

For example, when working on a binary classification problem, there are different evaluation metrics such as accuracy, precision, recall or F1-score that can be used. The choice of the metric depends on the problem at hand. While accuracy might be the simplest metric to work with, it is not an appropriate metric when building a model to predict whether a credit card transaction is fraudulent. Consider a credit card company with 1% fraud rate: a baseline model that predicts all transactions as “not fraud” will achieve 99% accuracy, but this defeats the purpose of having an intelligent model to prevent costly frauds.
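
As a minimal illustration of this point, the sketch below uses scikit-learn to compare a “predict the majority class” baseline with a logistic regression on a synthetic dataset with roughly 1% positives; accuracy looks excellent for both, while precision and recall expose the baseline’s uselessness. The dataset and models are illustrative, not a real fraud setup.

```python
# Accuracy vs. precision/recall on an imbalanced (~1% positive) problem.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99],
                           flip_y=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

for name, model in [("baseline", DummyClassifier(strategy="most_frequent")),
                    ("logistic", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(name,
          f"accuracy={accuracy_score(y_val, pred):.3f}",
          f"precision={precision_score(y_val, pred, zero_division=0):.3f}",
          f"recall={recall_score(y_val, pred, zero_division=0):.3f}")
```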

Relevant links and references:

  1. Learning curves, Lecture by Andrew Ng, Machine Learning, Coursera
  2. http://scikit-learn.org/stable/modules/model_evaluation.html
  3. http://scikit-learn.org/stable/modules/learning_curve.html
  4. Overfitting: The most important scientific problem you’ve never heard of (chapter 5), The Signal and the Noise, Nate Silver

INTERMEDIATE SKILL:  

9. Refine and deepen understanding of the algorithmic and inferential aspects of Statistical analysis

A Data scientist strives to improve her understanding of the most widely used machine learning algorithms. She pushes to understand how and why they work and, more importantly, when they do not work and why.

Consider the simplest algorithm: averaging a set of numbers to compute the mean. For assessing the accuracy of the mean, we can use another algorithm to compute the standard error. Thus, the standard error provides an inference about the mean – the output of the averaging algorithm. In general, one can view an algorithm as what a Data scientist does, while inference justifies why she does it.

For example, consider K-means, one of the simplest and most scalable clustering algorithms. Given the desired number of clusters, the standard version of K-means assigns every point to the nearest cluster centroid, as measured by Euclidean distance. The goal of the algorithm is to minimize the within-cluster variance in each cluster. This works best when the clusters are spherical, i.e. have the same variance in all directions.

However, if one is dealing with a messy real world dataset in which variance is not the same in all directions, then K-means is likely not to work well. Furthermore, as the number of dimensions increases, the Euclidean distance – which the standard version of K-means uses to assign points to a cluster – loses its meaning as a concept of ‘length’ or ‘distance’ between points. These are examples of the level of statistical understanding that enables a Data Scientist to make sensible modelling decisions, achieve better results and draw the correct inferences from her modelling results.
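
As a minimal illustration (in the spirit of reference [5] below), the sketch assumes scikit-learn and applies K-means to anisotropically stretched blobs, where the spherical-cluster assumption is violated and the recovered clusters visibly disagree with the true groups.

```python
# K-means on anisotropically stretched blobs: the equal-variance, spherical
# cluster assumption breaks and the recovered clusters look wrong.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=1_500, centers=3, random_state=170)
# Stretch the data so clusters are elongated rather than spherical.
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aniso)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y_true, s=8)
axes[0].set_title("True groups")
axes[1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=labels, s=8)
axes[1].set_title("K-means assignment")
plt.tight_layout()
plt.show()
```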

Relevant links and references:

  1. Computer Age Statistical Inference, Efron and Hastie
  2. A Few Useful Things to Know about Machine Learning, Pedro Domingos
  3. On the Surprising Behavior of Distance Metrics in High Dimensional Space, Aggarwal, Hinneburg and Keim
  4. When K-Means Clustering Fails: Alternatives for Segmenting Noisy Data
  5. Demonstration of K-means assumptions

ADVANCED LEVEL:

10. Evaluate new algorithms from latest research and develop intuition about the problems for which they are likely to improve the state of the practice

Machine learning has seen extraordinary progress in the last several years. Many of the exciting recent results are due to Deep Learning (DL) techniques, and there has been an avalanche of papers published in top conferences showing that a new DL-based method beats previous methods on a given task or benchmark.

While these methods have yielded impressive empirical results, they are currently difficult to analyze theoretically – the challenge of the proverbial ‘black box’ algorithm. Thus, empirical rigor is very important especially from the point of view of a Data Scientist applying such techniques to her specific problems.

For example, an ablation study (which involves removing some “feature” of the model or algorithm, and seeing how that affects performance) on various encoder-decoder style networks with attention demonstrated that one could perform as well or better using just the attention module [1]. For more tips on improving empirical rigor, please see [2]. See the talk [3] about the importance and usefulness of developing intuition towards solving your Data science problems.

Relevant links and references:

  1. Attention is All You Need, Vaswani et al.
  2. Winner’s Curse? On Pace, Progress and Empirical Rigor, Sculley et al.
  3. https://www.youtube.com/watch?v=ppCsvSeRZcI&t=273s

D) PUTTING INTO PRODUCTION

CORE SKILL:

11. Build training pipelines for the production environment

When taking a model into production, the importance of training pipelines cannot be overstated. They need to be optimised so that a modified pipeline can be redeployed fast, as well as to ensure the robustness of the pipeline to failures.

Most business use cases will involve multiple offline experimentation and online deployment iterations to truly perfect a model. At every iteration, high engineering overhead for deploying these changes leads to delayed feedback, which in turn increases the pain and challenge of reaching a final model. A Data Scientist needs to work with her counterpart Engineering teams to standardize components of the training pipeline, so any modifications (inclusion and removal of pipeline components) can be translated to automated online deployments without further engineering cycles.

Typically, the offline experimentation process ignores minority cases in both training and evaluation, in a bid to get results faster that apply to the majority of cases/data. While violations of these assumptions in the model training phase might cause pipeline failures or unexpected results, a failure to handle such edge cases in production can cause failure of dependent systems and might result in serious revenue consequences for the business.

A Data scientist needs to define automated tests / assertions to prevent these failures, valid fallback actions when failures occur, and alerting/reporting strategies for both training and real-time pipelines. Various other training statistics, such as the model parameters and convergence information for each training run, should be tracked and reported to support further debugging.
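
As one possible shape for such checks, the sketch below shows simple pre-training assertions with a fallback to the last good model; the column names, thresholds and helper functions are hypothetical, not InMobi’s actual pipeline.

```python
# Lightweight sanity assertions run before a (re)training job.
# Column names and thresholds are illustrative only.
import pandas as pd

def validate_training_data(df: pd.DataFrame) -> None:
    assert len(df) > 10_000, "training set unexpectedly small"
    assert df["label"].isin([0, 1]).all(), "labels outside {0, 1}"
    assert df["label"].mean() > 0.001, "positive rate suspiciously low"
    null_rate = df["feature_ctr"].isna().mean()
    assert null_rate < 0.05, f"feature_ctr null rate too high: {null_rate:.2%}"

def train_with_fallback(df, train_fn, load_previous_model):
    """Fall back to the last good model if validation or training fails."""
    try:
        validate_training_data(df)
        return train_fn(df)
    except AssertionError as err:
        print(f"training skipped, alerting on-call: {err}")
        return load_previous_model()
```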

For example, user response prediction at InMobi involves a variety of sub-problems, each of which has its own set of experiments spanning across a variety of platforms such as Spark, Tensorflow and Vowpal Wabbit. Having separate engineering efforts for deploying each experiment for real-time feedback is infeasible. To solve for this, both Data science and Engineering teams at InMobi collaborated to build scalable model serving and experimentation frameworks. The Data Scientists were not only involved in coming up with the design and functional specs, but also in implementation, code reviews and testing process.

Relevant links and references:

  1. https://medium.com/@vikati/the-rise-of-the-model-servers-9395522b6c58
  2. https://databricks.com/session/mleap-productionize-data-science-workflows-using-spark
  3. https://www.tensorflow.org/serving/
  4. https://spark.apache.org/docs/latest/ml-pipeline.html
  5. http://clipper.ai/

INTERMEDIATE SKILL:

12. Provide inputs for design, quality assurance parameters and support implementation for the model in online environment

Engineering Machine Learning pipelines requires prior context about the transformations and the models used, particularly in cases when the final implementation might be modified for performance improvements. As model training is a stochastic process, its QA process requires a strong understanding of the expected behavior of the model, which can change in a production deployment due to different data pipelines being consumed.

A Data scientist needs to work closely with their engineering team by reviewing implementations of online components, defining QA tests to catch deviations from expected model behavior (e.g., using statistical techniques) and running automated consistency checks on predictions from the training pipeline and real-time deployment. This will require the Data Scientist to have a working knowledge of the platform/language used for productionisation, along with the performance overhead associated with various models and preprocessing steps. For real-time cases, due to SLA restrictions, various optimisations such as caching predictions or moving to simpler models might be required.

For example, in the Ad-tech industry the real-time data ingestion might be owned by the ad-serving team, whilst ownership of the offline data used for training lies with the data warehousing team. This can cause predictions to differ or fail depending on the nature of the differences. One observed example of such discrepancies is a commonly used feature like ‘screen dimension’. If used as a categorical feature and converted to a string before input to the model pipeline, “5” and “5.0” will be treated as different categories by the model. Depending on how the value is logged in real-time and offline, model predictions or results will differ from what was expected. The Data scientist needs to drive consistency checks for such cases before her model goes live.
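
A minimal sketch of the kind of consistency check described here, assuming Python on both paths; the normalisation rule and function names are illustrative only.

```python
# Normalise a categorical feature consistently across offline and online
# paths, then assert that predictions from the two paths agree.
def normalise_screen_dim(value) -> str:
    """Map '5', '5.0' and 5.0 to the same category string."""
    return f"{float(value):g}"

assert normalise_screen_dim("5") == normalise_screen_dim("5.0") == "5"

def check_consistency(offline_preds, online_preds, tolerance=1e-6):
    """Return keys whose scores differ between training pipeline and live service."""
    return [
        k for k in offline_preds
        if k not in online_preds
        or abs(offline_preds[k] - online_preds[k]) > tolerance
    ]

# Example: identical inputs should give identical scores through both paths.
print(check_consistency({"req1": 0.12, "req2": 0.34},
                        {"req1": 0.12, "req2": 0.34}))
```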

Relevant links and references:

  1. https://hydrosphere.io/blog/deploying-machine-learning-pipelines-to-production/

INTERMEDIATE SKILL:

13. Provide inputs and determine infra requirements and infra management for model deployment

Infra requirements need to be estimated for both training and prediction (real-time or batch) deployments. For the training part, this becomes crucial when the problem involves processing hundreds of gigabytes to terabytes of data, or hyperparameter tuning for models involving a large number of parameters.

Given the trade-off between model training time and the infra required, the Data Scientist will need to understand business SLA requirements and suggest appropriate actions. This may involve getting more infrastructure, suggesting optimizations such as moving hyperparameter search to a less frequent job, or decreasing the complexity of the model. Some prior experience with this step might help the Data Scientist discard certain models based on constraints on training time and infra limitations in a production setup. This can prevent situations where the experiment has to be performed again with a different model, just because the model used in the current experiment did not meet production requirements.

Even in the case of real-time deployments, a model might require certain system resources depending on the number of parameters, the features in the model, or meta-data lookups. For each model being considered, the Data Scientist’s inputs are needed to estimate and reduce the resources required to productionalise the model.

For example, when using Spark MLlib, tree-based models, such as Random Forests or Gradient Boosted Trees, can be much more expensive to train compared to simpler regression models. Even though they might have higher error reduction, for model training phases involving large data and very frequent model updates, simpler models such as Logistic or Linear Regression with pre-engineered features might be preferred.
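
A rough sketch of how one might compare the training cost of the two model families in Spark MLlib on a tiny synthetic DataFrame; the data, parameters and timings are purely illustrative and say nothing about real workloads.

```python
# Compare training time of LogisticRegression vs. RandomForest in Spark MLlib.
import random
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

spark = SparkSession.builder.appName("training-cost").getOrCreate()
rows = [(random.random(), random.random(), float(random.randint(0, 1)))
        for _ in range(20_000)]
df = spark.createDataFrame(rows, ["f1", "f2", "label"])
df = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

for name, estimator in [("logistic", LogisticRegression()),
                        ("random_forest", RandomForestClassifier(numTrees=100))]:
    start = time.time()
    estimator.fit(df)
    print(f"{name}: trained in {time.time() - start:.1f}s")
```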

Relevant links and references:

  1. https://github.com/szilard/benchm-ml

ADVANCED SKILL:

14. Lead debugging of data pipelines and model behaviour in production environment

Many times the results from the offline and online pipelines differ significantly. This can be due to a bug in the data pipelines, such as different feature values in the offline and online worlds, or a second-order effect of taking actions based on model predictions.

When faced with unexpected behaviour of the pipeline, it is hard to attribute the cause beforehand, hence the Data Scientist involved needs to work closely with the relevant engineering and business stakeholders to debug and fix the cause.

For example, in the performance advertising world, underpredicting on certain ads might cause the model to bid lower than the minimum bid (called the floor), which can result in a further loss in the number of ads shown, increasing error and creating potential fallout situations. A Data Scientist generating predictions for the bidder will need clarity on the whole bidding process to debug such issues.

Relevant links and references:

  1. https://www.practicalai.io/how-to-debug-and-diagnose-machine-learning-problems/

E) ACHIEVING SUSTAINABLE IMPACT

CORE SKILL:

15. Develop dashboards to enable easy tracking and communication of model impact

Once you have metrics, you have to decide the best way to represent them for easy consumption by all stakeholders, and to support the team’s ability to get quick insights on a regular basis.

Dashboards can broadly support several purposes: help you organise and collect several metrics; develop visualisations; crisply compare different models.

The Data Scientist should take the lead in developing the dashboards because it will give her complete clarity on the source of the numbers – ensuring they are consistent with how the models are trained. Developing dashboards will also strengthen the Data Scientist’s ability to understand how to communicate to business Stakeholders, and what visualisations can be deployed. The Data Scientist will also be the first consumer of these dashboards to track how their model is doing so they should doubly ensure that the dashboards are 100% useful for their purpose. See the reference links [1, 2] for tips on good dashboard design for Data science.

Relevant links and references:

  1. https://blog.dataiku.com/2016/03/10/dashboards_and_data_viz
  2. https://www.datasciencecentral.com/profiles/blogs/10-features-all-dashboards-should-have

INTERMEDIATE SKILL:

16. Support and influence design of the business metrics to track model business impact

Once a model goes into production, we need to be able to understand how it is performing and also to confirm the impact that it is having. Both the Data science team and business leaders will want to ensure that the time invested in a particular project has indeed delivered impact for the business – and if not, how this can be addressed or what lessons need to be learned.

Thus, well before a model goes live, the team needs to be aligned on what metrics will be tracked to confirm the model is working and making a difference for the business. This activity is not solely the responsibility of the Data Scientist but ideally happens via a dialogue between Data science, product and business leaders. It needs to happen well before the model goes into production, because there needs to be planning, and often development work, to ensure the data that these metrics require is indeed logged and available in data systems.

For example, at InMobi we typically track 2 specific major metrics, and 5 supplementary metrics, for the prediction models we deploy to run campaigns. These help us directly evaluate if the model is achieving success in terms of scale for the campaign and profitability of running the campaign. These metrics are separate from the model performance metrics – such as error metrics.

Relevant links and references:

https://www.analyticsvidhya.com/blog/2018/02/using-data-master-the-science-in-data-science/

http://www.mlyearning.org/

INTERMEDIATE SKILL:

17.  Develop and execute on a plan for continuous iteration and refinement of a new model

When developing a new model, employ an offline evaluation method with proxy metrics that correlate highly with the business metrics defined earlier. This will help to weed out poor models and iterate faster. Once a Data scientist is satisfied with the model’s performance as judged by the offline evaluation method, typically, A/B testing is used to compare the performance of the new model with an appropriate baseline model. One must make sure that the sample size needed to achieve statistical significance has been estimated in order to draw valid conclusions from the A/B test.
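
As a minimal sketch of the sample-size estimation step, assuming statsmodels and an illustrative baseline conversion rate and lift:

```python
# Estimate the per-variant sample size needed to detect a lift in conversion
# rate from 2.0% to 2.2% with 80% power at a 5% significance level.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.022, 0.020)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, power=0.80, alpha=0.05, alternative="two-sided")
print(f"~{n_per_variant:,.0f} users per variant")
```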

Instead of relying on the average business metric (say average conversion), one should “look inside the average” by segmenting the data using dimensions (e.g., Android users vs. iOS users) and compare the performance of the model with respect to the baseline on these segments. This will give actionable information to refine the model.
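
A minimal sketch of such a segmented comparison with pandas, on synthetic data with illustrative column names:

```python
# "Look inside the average": conversion rate per segment, model vs. baseline.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "variant": rng.choice(["baseline", "new_model"], size=20_000),
    "platform": rng.choice(["android", "ios"], size=20_000),
    "converted": rng.integers(0, 2, size=20_000),
})

by_segment = (df.groupby(["platform", "variant"])["converted"]
                .mean()
                .unstack("variant"))
print(by_segment)  # per-platform conversion rates, side by side
```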

Tools that make it easy to conduct dimensional analysis can help look inside the data faster. This can improve the productivity of Data Science teams and make it easier to conduct joint analyses with product teams so that the teams can be on the same page.

For example, Airbnb redesigned their search page. Qualitative studies indicated that the new design was much better when compared to the existing design. They decided to measure the actual impact using A/B testing. The experimental results showed that there was no business impact! They decided to look inside the data and found that the new design was performing better on all browsers except Internet Explorer (IE). This was because the implementation of the new design had a bug in older versions of IE. After fixing this bug, performance on IE was similar to other browsers. Even though this example is about testing a design change, similar principles apply for model deployment as well.

Relevant links and references:

  1. https://hbr.org/2014/12/yes-ab-testing-is-still-necessary
  2. https://hbr.org/2002/11/the-flaw-of-averages
  3. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
  4. https://www.evanmiller.org/how-not-to-run-an-ab-test.html
  5. https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7

ADVANCED SKILL:

18. Develop the story and communicate the story to all stakeholders on what has been achieved with a new use case

A Data Scientist needs to communicate the value of her new model using a story that different stakeholders can appreciate. It is easy to fall prey to the curse of knowledge: using technical jargon and discussing a model in abstract terms. However, a much more effective strategy is to tell a story that is simple, unexpected, concrete, and credible [See 2, 3].

Use concrete images to convey ideas rather than abstract concepts as this will increase the probability of success. Metaphors can help to map abstract ideas to more concrete concepts.

For example, suppose that a Data scientist needs to communicate to her leadership team the following survey results of 23,000 employees from various organizations [2]:

  • 37% of the employees have a clear understanding of what their organization is trying to achieve and why
  • 20% of the employees are enthusiastic about their organization’s goals
  • 20% of the employees clearly understand how their tasks relate to the organization’s goals
  • 15% feel that their organization fully enables them to execute key goals
  • 20% fully trust their organization

If these are presented as bare statistics, the leadership team most likely will think that most employees are dissatisfied with the organizations they work for.

Let us now consider how the audience will perceive this idea if a metaphor is used to map the 23,000 employees to 11 players of a soccer team.

  • Only 4 of the 11 players on the field know the goal post to target.
  • Only 2 of the 11 care about hitting a goal
  • Only 2 of the 11 know about their position of play and know what they’re supposed to do
  • All but 2 of the 11 are in some way playing against their team members rather than the opponent

This metaphor brings vivid imagery of an utterly incompetent soccer team to the minds of listeners and increases the odds that they will firmly grasp the idea that organisations should operate like well-aligned soccer teams, but in reality they don’t.

Relevant links and references:

  1. Metaphors We Live By, Lakoff and Johnson
  2. Made to Stick: Why Some Ideas Survive and Others Die, Heath and Heath
  3. https://hbr.org/2006/12/the-curse-of-knowledge
  4. https://www.gsb.stanford.edu/faculty-research/books/made-stick-why-some-ideas-survive-others-die
  5. https://hbr.org/2013/06/battle-tested-tips-for-effecti
  6. https://ecorner.stanford.edu/in-brief/use-metaphor-to-communicate-ideas/
