The Hard Truth about Data Science

nalbadia2
Dec 21, 2021

Updated: Feb 12, 2022

One of the life-changing decisions you have probably felt uneasy about is which career path to follow. You must have asked yourself: what happens if I choose this path and it turns out not to interest me at all, or if I only realize that a couple of years in? In this article, I want to focus on choosing the path of a data scientist, on the side of data science that is not well known to newcomers, and on what data and data science mean outside the scientific realm.

Dilemma of Choice

With the surge in popularity that Harvard Business Review contributed to in 2012, when it labeled data scientist “the sexiest job of the 21st century”, businesses started looking for data scientists to employ (even when they sometimes didn’t need them). Consequently, ambitious students joined this wave of demand by choosing this path.

If you search Google now for “Why should I learn data science”, you will find multiple reasons that boil down to these: becoming good at problem-solving, having a lucrative career path, or meeting very high market demand. These reasons are too broad, not exclusive to data science, and never guaranteed, and there may well be better alternatives. Yet they are repeated everywhere while missing one main point: people will never be great at something unless they are fully devoted to it. People who popularize data science unknowingly mask some of the challenges you must overcome to be successful. Hence the title of this article.

Concealed Side

Every field has its difficult side. Let’s elaborate on the predicaments and challenges that data scientists might face but that are not usually well known.

Reading, Reading, and Reading

Not many people enjoy reading every day, and some of them are newcomers to data science. Data science is about reading books, academic literature, articles, and so on. To come up with truly valuable ideas that improve your output, you must read a great deal. Following data scientists on social media, subscribing to research organizations’ email lists (my favorite is DeepAI), and staying up to date are a must; your eyes must be everywhere. Most of what you think about is a byproduct of knowledge you have been exposed to, so make sure you have an abundance of it.

Furthermore, you have strong backup when trying to detect and fix programming errors: exceptions are raised, the program crashes, the output is clearly wrong, etc. Not so with “Theoretical Bugs”. These bugs are very good at hiding, and you will not catch them unless you are a dedicated reader who understands much of the inner workings of what you are trying to apply. Theoretical Bugs are sometimes detected after days, weeks, or months, or never, and in the meantime the model’s true quality is nowhere near what has been reported.

Living Under Uncertainty

Imagine working on a project for a whole month and then throwing it all away. How would that make you feel? Many people cannot accept failure and never let go. They fall into a spiral of bad performance or repeated attempts to revive a machine learning project that is already a lost cause. Data science is uncertain, and it always will be; that’s why it is distinguished by the word science. Managers must understand this uncertainty as well. To lead a data science project that is unique and valuable, you have to accept failure and be the first person to support the team, because failure is not easy to digest. To account for the risk of failure (for AI projects), I have briefly summarized some points that boost the probability of success or at least mitigate the cost of failure:

Switch off your data science jargon, and accurately define and communicate the business requirements.
Research heavily to define the algorithmic approaches and the model-quality KPI that align with business needs (e.g. “Based on these references, we’re confident in setting accuracy above 85% as the KPI for use case X”).
Be clear with stakeholders about requirements and KPIs. Communicate exactly what the quality metric means (further information in the Communication section).
Choose 3-5 fallback approaches in case the first approach fails, and make sure your timeline is buffered for them.
Fail fast, and let go if there is no hope of delivering value or of pushing the deadline.

Communication

You must have heard the phrase “Explain it like I’m 5” before; data science communication is all about this. Translating extreme complexity into simplicity is the hardest skill for data scientists to improve: the better you get, the more complexity you face, and the harder it becomes. Here are a few cases where proper (AI-specific) communication is a must:

Project Initiation: Convincing stakeholders to initiate a project requires that they grasp what the end goal is. You need to simulate what it will look like and always attach it to a business value. For example, if your main goal is to directly support a decision-making process in a certain industry, your presentation should focus on simulating a decision-making scenario that the data science project helps with.
Limitations: Limitations are unknown to stakeholders but well studied by data scientists. They must be clarified from the beginning and documented by focusing on the cannots. For example: “The project cannot do X”.
Timeline: The project timeline should align with the project’s value, and a proper Work Breakdown Structure must be prepared and communicated throughout the project’s life.
Performance Report and Continuous Monitoring: You should have communicated your model’s KPI beforehand, and sometimes you have to bring examples, because people perceive numbers differently. 85% accuracy might sound great to someone, but once it is shown with an example, the same person may see it as garbage! (I usually like flipping the quality metric by saying, for example, that we will make 15 “mistakes” out of 100 “predictions” instead of saying 85% accuracy.) Also, mistakes can happen when the model is monitored in production, and you always have to be ready to offer a proper defense or a proper retrospective when presented with them. One thing that is unfortunately missing from most data science curricula is interpretability. You need to know why the model predicted an “Apple” instead of an “Orange”, and this is where the conundrum peaks! Some projects are critical, and every prediction carries a burden of responsibility, so account for the need for interpretability if the project calls for it.
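
The flipped metric mentioned in the last point is simple arithmetic, and a tiny helper can keep the wording consistent across reports. The following is only a minimal Python sketch: the function names mistakes_per_batch and describe are hypothetical, and it assumes accuracy is given as a fraction between 0 and 1.

```python
def mistakes_per_batch(accuracy: float, predictions: int = 100) -> int:
    """Restate an accuracy score as the expected number of wrong predictions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    # Error rate times batch size, rounded to a whole number of mistakes.
    return round((1.0 - accuracy) * predictions)


def describe(accuracy: float, predictions: int = 100) -> str:
    """Build a stakeholder-friendly sentence (the wording is illustrative only)."""
    wrong = mistakes_per_batch(accuracy, predictions)
    return f"We expect about {wrong} mistakes out of every {predictions} predictions."


print(describe(0.85))
```

For an accuracy of 0.85 this yields 15 expected mistakes per 100 predictions, which is the same information as “85% accuracy” but framed around what the stakeholder will actually experience.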

Bright Side

Allow me to coat this field with fascination, using my own definitions and sacrificing some of the scientific jargon.

“Data” in a Different Dimension

Data is our way of representing the real world around us in a slightly different format from the one we’re used to. It is a way to share information with others more accurately, and a method that lets us play with this information easily using a machine. It’s a technique for convincing others with evidence, and a way to capture moments and occurrences of real-life events for later use. Your five senses can be considered data channels to your brain, just as you can consider your phone’s camera its sense of sight and its microphone its sense of hearing. Each type of computer has such channelling mechanisms through which it can receive different data in different formats. What then? The data will sit there without any use. Here comes data science!

“Data Science” in a Different Dimension

Data science is an inter-disciplinary field that uses scientific methods, statistics, mathematics, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. (V. Dhar)

Let’s throw that away for a bit and go with a simpler overview. We mentioned earlier that data is just a representation of the real world (texts, sounds, images, numbers, etc.), but on its own this representation has no value. Data science transforms it into another representation that people can relate to; it adds value and information, turning the vague data flowing around us into things that are easily understood. After that, it affects our decisions: it makes us realize things we didn’t know before, it changes our actions, and it can even be used to predict what will happen if an action is changed. It might also tell us things we could not have known without studying them, or, if we already knew them, tell us faster and with more evidence. Imagine that you spend some amount of money every day. Wouldn’t it be useful to see where that money goes each month, broken down by type of spending? And if you could estimate next month’s budget, you would know in advance whether you need to borrow some money from a friend next month or reduce how much you spend every day.
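
The spending example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the expense records are made up, the function names are hypothetical, and the “estimate” is nothing more than the average of past monthly totals, which is the simplest possible baseline rather than a real forecasting method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical expense records: (ISO date, type of spending, amount).
expenses = [
    ("2022-01-03", "food", 12.0),
    ("2022-01-15", "transport", 30.0),
    ("2022-01-20", "food", 18.0),
    ("2022-02-02", "food", 20.0),
    ("2022-02-10", "transport", 25.0),
    ("2022-03-05", "food", 25.0),
    ("2022-03-18", "transport", 35.0),
]


def monthly_totals_by_category(rows):
    """Sum the amounts per month ("YYYY-MM") and per type of spending."""
    totals = defaultdict(lambda: defaultdict(float))
    for date, category, amount in rows:
        month = date[:7]  # the first 7 characters of an ISO date are "YYYY-MM"
        totals[month][category] += amount
    return {month: dict(categories) for month, categories in totals.items()}


def naive_next_month_estimate(rows):
    """Estimate next month's spending as the average of past monthly totals."""
    totals = monthly_totals_by_category(rows)
    return mean(sum(categories.values()) for categories in totals.values())


for month, categories in sorted(monthly_totals_by_category(expenses).items()):
    print(month, categories)
print("Naive estimate for next month:", naive_next_month_estimate(expenses))
```

Even this crude baseline turns a pile of receipts into two things a person can act on: where the money went, and roughly how much to set aside next month.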

Why Learn Data Science?

“Why Learn Data Science?” is an interesting question… or (Questions Alert!) is it? How interesting is it? Why is it interesting? And for whom exactly is it interesting? How many people find it interesting? How many people find it boring? Can I compare how interesting that question is with other questions? But wait, how can I represent the concept of “interesting”? Also, can I predict the number of people who will be interested in that question this year and in the coming year? Can I predict whether a person will be interested in that question before I ask?

Can I (Brainstorming Alert!) answer these questions just by seeing how many people searched for that question on Google? Or how many people clicked on websites that answer it? Or could I publish a survey that includes that exact question alongside related ones, then publish the survey without that question and try to predict whether a person would answer “I am interested in that question” based on their other answers? Or can I just count the number of junior data scientists in a region at a certain time?

Data science will give you the ability to ask questions about anything you see, read, or listen to in your everyday life, whether it is as simple as the question above or as hard as the Large Hadron Collider problem. It will make you capable of thinking of multiple approaches to overcome problems or answer questions. It will turn you into an analytical thinker; it will change how you make decisions, how you receive factual claims from people, and how you assess how truthful those claims are. It will provide you with a logical, analytical framework that tells you when to accept a claim, reject a claim, or stay neutral.

Data science is more of a lifestyle and a philosophy than just a career.
