5 Most Common Mistakes Data Scientists Make – And How to Avoid Them
With data scientist consistently ranked among the top careers by LinkedIn and Glassdoor, it should be no surprise that the number of people pursuing this path is growing rapidly. Meanwhile, the field itself is still relatively new and difficult to define. Companies strive to make data-informed decisions, but the reliability of those decisions depends on data scientists following best practices that are often only vaguely defined.
Like any professional group, data scientists are capable of errors in practice and judgment; however, when significant business risks are at play, these mistakes should be minimized.
Here is a collection of common mistakes made by data scientists and how to avoid them.
1. Poor documentation
While the pressure to produce actionable insights and results is enormous, data scientists may overlook the need to document their steps properly. This creates significant problems later on. If a client or a superior requests minor adjustments to the code or algorithm, it can be challenging or even impossible to retrace the steps and recall how something came to fruition in the first place.
Additionally, if another team member is tasked with running the program and there’s no clear documentation, they may not be able to decipher the process. Poor documentation risks losing the ability to maintain and build on the product to its fullest potential. Technical and analytic debt accumulates easily in the data science space, which is a key reason Google researchers labeled machine learning “The High-Interest Credit Card of Technical Debt.”
Solution: Take the extra time to document steps in a standardized manner. Even if there’s a rush to produce code, return to the project later to document each step. The result is a more sustainable and efficient process overall.
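A lightweight convention that helps: give every pipeline step a docstring and inline comments that record the reasoning, not just the mechanics. Below is a minimal sketch in Python; the function, file, and column names are hypothetical:

    import pandas as pd

    def prepare_orders(path: str) -> pd.DataFrame:
        """Load and clean the raw orders extract.

        Steps:
          1. Parse order_date as a datetime.
          2. Drop rows missing a customer_id.
          3. Deduplicate on order_id, keeping the most recent record.
        """
        df = pd.read_csv(path, parse_dates=["order_date"])
        df = df.dropna(subset=["customer_id"])
        # Earlier duplicates are stale exports; keep only the latest per order.
        df = df.sort_values("order_date").drop_duplicates("order_id", keep="last")
        return df

Months later, the docstring answers “why were these rows dropped?” without anyone re-reading the entire pipeline.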
2. Overconfidence in data
Consulting firm Gartner has estimated that as many as 85% of data science projects end in failure. The striking figure sparked significant speculation and uproar within the community about the root cause.
According to Gartner, the first pitfall is insufficient time spent preparing data. Without a sufficient volume of high-quality data, results won’t be reliable, and unreliable results lead to poorly informed decisions.
Solution: Although it’s not the most glamorous part of the job, it’s often said that 80% of a data scientist’s time goes to gathering and cleaning data. The highly visible competitions that draw newcomers to the field do little to emphasize this reality. To increase a project’s potential for success, data scientists should spend adequate time on data preparation and on verifying data quality.
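A handful of automated sanity checks can catch quality problems before they reach a model. Here is a minimal sketch in pandas; the file and column names (raw_orders.csv, order_id, amount) are hypothetical:

    import pandas as pd

    df = pd.read_csv("raw_orders.csv")

    # Summarize basic quality signals before any modeling begins.
    print({
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    })

    # Fail fast if key assumptions about the data are violated.
    assert df["order_id"].is_unique, "order_id should uniquely identify rows"
    assert (df["amount"] >= 0).all(), "amount should be non-negative"

Checks like these are cheap to write and pay for themselves the first time an upstream export changes shape.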
3. Inadequate visuals
Data science is about solving problems. The inherent challenge is perhaps the most gratifying part of the job; however, the job is not complete until insights are shared with stakeholders.
Data visualization is often perceived as a secondary or supplemental skill in data science, and it receives less time and attention than activities like model building. The result is talented data scientists who struggle to gain buy-in, and a corresponding rise in miscommunication. Using visuals to convey information is a key component of a project’s success and should be treated with equal gravity.
Solution: Invest time in learning the visualization tools of the trade. While it’s unreasonable to expect data scientists to be superhuman (witness terms like “ninja,” “guru,” and “rockstar” in job descriptions), it is squarely within their remit to share insights effectively using visuals, just as professionals in sales, marketing, and operations do.
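As a small illustration of the point, a clear title, labeled units, and a little decluttering often do more for stakeholder comprehension than an elaborate chart. A sketch using matplotlib, with made-up numbers:

    import matplotlib.pyplot as plt

    # Illustrative figures only; real values would come from the analysis.
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    revenue = [1.2, 1.4, 1.3, 1.8, 2.1, 2.4]  # millions of dollars

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(months, revenue, marker="o")
    ax.set_title("Monthly revenue, first half (illustrative)")
    ax.set_ylabel("Revenue ($M)")
    for side in ("top", "right"):  # remove chart clutter
        ax.spines[side].set_visible(False)
    fig.tight_layout()
    plt.show()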
4. Overcomplicating model explanations
Predictive models strive for accuracy; however, it’s equally essential to understand how an algorithm arrives at its answers. Businesses and clients must be able to explain the logic to stakeholders, and a model that makes predictions no one can explain is of little use.
In certain circumstances, deep neural networks may be difficult or impossible to explain, but the bulk of commercial AI applications don’t require deep architectures with many hidden layers. The saying often attributed to Albert Einstein, “If you can't explain it simply, you don't understand it well enough,” is a useful litmus test for ensuring that data science solutions can be understood by the business stakeholders who charter and rely on them.
Solution: Don’t dive into the deep end unless a project requires it. To keep projects scalable and manageable, start with simple, intuitive models whose explanations come almost for free. If a project grows in scope and complexity, closely document the reasoning behind each feature.
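One way to keep explanations simple: start from an interpretable baseline such as a standardized logistic regression, where the coefficients double as a rough feature ranking. A sketch using scikit-learn on a public dataset (the dataset and model choice are illustrative, not a prescription):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # An interpretable baseline: scale the features, then fit a linear model.
    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)

    # With standardized inputs, coefficient magnitude gives a rough importance ranking.
    coefs = model.named_steps["logisticregression"].coef_[0]
    top_five = sorted(zip(X.columns, coefs), key=lambda t: abs(t[1]), reverse=True)[:5]
    for name, weight in top_five:
        print(f"{name:>25s}  {weight:+.2f}")

Each weight can be read to a stakeholder as “holding everything else fixed, this feature pushes the prediction in this direction,” which is a far easier conversation than walking through a deep network.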
5. Interpreting correlation as causation
Josh Wills, who led data engineering at Slack, garnered attention when he tweeted that a data scientist is a "person who is better at statistics than any software engineer and better at software engineering than any statistician."
In reviewing data, it’s a common mistake to take correlation for causation, a bias known as false causality. When two variables move together, it does not follow that one causes the other; a third, unobserved factor may be driving both.
Solution: Before drawing conclusions, carefully evaluate which overlooked factors might be influencing the data, or whether the relationship is simply coincidence. Data scientists may not be statisticians by trade, but they must check for these common biases in order to avoid mistakes.
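The classic illustration: ice cream sales and sunburn cases rise together because hot weather drives both. The short simulation below, written in Python with made-up data, shows how easily a hidden factor can manufacture a strong correlation between two variables that have no causal link:

    import numpy as np

    rng = np.random.default_rng(42)

    # A hidden third factor (think: temperature) drives both observed series.
    confounder = rng.normal(size=10_000)
    ice_cream_sales = confounder + rng.normal(scale=0.5, size=10_000)
    sunburn_cases = confounder + rng.normal(scale=0.5, size=10_000)

    # The two series correlate strongly even though neither causes the other.
    r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
    print(f"correlation: {r:.2f}")  # comes out around 0.8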
Improving best practices in a developing field
Expectations of data scientists are both high and broad. As in other fields such as programming, this breadth gives rise to niche specializations within the profession. It’s not uncommon for contemporary data scientists to wear multiple hats as they navigate largely uncharted territory. The best way to ensure success is to maintain a focus on quality, stick to the fundamentals, and continually evolve best practices.
In an emerging field, mistakes are inevitable, and critics will be the first to announce them; however, an increasing number of enterprises are successfully using data science to achieve valuable results.