Why is data science not an exact science?

Organisations are adopting data science to get better answers to the complex questions facing their business. However, these answers are not absolute. Why?

Managers have traditionally viewed the world in terms of specific, hard numbers. This, however, is a black-and-white view compared with what data science can offer them. Instead of a single number, such as 40%, data science offers a probabilistic result that pairs the estimate with a confidence level and a margin of error. (The underlying statistical calculations are, of course, much more complex.) This may seem complicated, but such a combined output helps non-technical people make decisions because it shows how far the number can be trusted.
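To make the idea concrete, here is a minimal Python sketch of how a single figure such as 40% can be reported together with a margin of error. The survey counts (200 "yes" answers out of 500) are invented for illustration, and the normal-approximation interval is just one simple method among several, not the approach of any particular tool.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion.

    Returns the point estimate and the margin of error.
    """
    p_hat = successes / n
    # z-value that leaves (1 - confidence) split across both tails
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Hypothetical survey: 200 of 500 customers answered "yes".
estimate, margin = proportion_interval(200, 500)
print(f"{estimate:.0%} +/- {margin:.1%} at 95% confidence")
```

With a smaller sample the same 40% would come with a wider margin, which is exactly the trade-off a bare number hides.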

Think more critically about the numbers used in your decisions

You need to understand that the predictions offered by data science are only probabilities, not absolute "truths". By understanding the trade-offs behind each number, you can compare the options more accurately. As a result, you will have more meaningful and valuable discussions with data experts.

When working with data, you use statistics to model the real world. However, there is no guarantee that a statistical model describes exactly what is actually happening. You can assume a certain probability distribution, but the world does not necessarily follow it.
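The gap between an assumed distribution and reality can be sketched in a few lines of Python. Everything here is hypothetical: the "real world" is simulated as exponentially distributed waiting times, and the analyst is assumed to fit a normal distribution to them. The point is only that the assumed model can badly misjudge how often extreme values occur.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)

# Hypothetical "real world": waiting times that are exponential, not normal.
observed = [random.expovariate(1.0) for _ in range(10_000)]

# The assumed model: a normal distribution with the same mean and spread.
model = NormalDist(mean(observed), stdev(observed))

threshold = 4.0
predicted_tail = 1 - model.cdf(threshold)  # what the model expects
actual_tail = sum(x > threshold for x in observed) / len(observed)  # what happened

print(f"model: P(x > {threshold}) = {predicted_tail:.4f}")
print(f"data:  P(x > {threshold}) = {actual_tail:.4f}")
```

In this simulation the fitted model matches the mean and standard deviation of the data, yet it underestimates how often large values occur by a wide margin.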

In fact, there are several reasons why data science is not an exact science:

Data

You may or may not have all the data you need to answer your question and even if you do, there may be data quality issues that could cause bias or other unwanted results. According to Gartner, "Poor data quality destroys business value" and costs organisations an average of $15 million a year.

If some necessary data is missing, the results will be inaccurate because the data you do have is not exactly what you are trying to measure. You may be able to obtain data from an external source, but keep in mind that third-party data may also suffer from quality issues. A current example is COVID-19 data, which is recorded and reported differently by different sources.
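A small simulation can show how missing data distorts a result. The sketch below is purely illustrative: the satisfaction scores are randomly generated, and the response pattern (happier customers answer the survey more often) is an assumption made for the example.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical population: satisfaction scores from 1 to 10.
population = [random.randint(1, 10) for _ in range(20_000)]

# Assumed response pattern: a customer with score s answers with probability s / 10,
# so dissatisfied customers are under-represented in the collected data.
responses = [s for s in population if random.random() < s / 10]

true_mean = mean(population)
observed_mean = mean(responses)

print(f"true average score:     {true_mean:.2f}")
print(f"average from responses: {observed_mean:.2f}")
print(f"share of data missing:  {1 - len(responses) / len(population):.0%}")
```

Nothing in the collected responses is wrong as such; the bias comes entirely from which records are absent.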

Questions

It is said that if you want better answers, you should ask better questions. Better questions come from data scientists working with domain experts to frame the problem. Other things to consider include assumptions, available resources, limitations, goals, potential risks, potential benefits, success metrics, and the form of the question.

Expectations

Data science is sometimes seen as a panacea or magic potion. In fact, it is neither.

Data science and machine learning have significant limitations. Turning a real-world problem into a purely mathematical one loses a lot of information, because the problem has to be simplified in order to focus on its key aspects.

Context

A model can work very well in one context and fail miserably in another. It is important to make clear that a model only applies under a given set of circumstances. These are its boundary conditions, and if they are not met, the assumptions are no longer valid and the model needs to be revised.
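The following Python sketch illustrates the point with an invented example: a straight-line model fitted to sales data for low advertising spend, then applied far outside that range. The saturating `true_response` function and all figures are hypothetical and stand in for a real process the analyst cannot observe directly.

```python
from statistics import mean


def true_response(spend):
    """Hypothetical ground truth: sales level off as ad spend grows."""
    return 100 * spend / (spend + 10)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Training data only covers spend between 0.5 and 5.
train_x = [0.5 * i for i in range(1, 11)]
a, b = fit_line(train_x, [true_response(x) for x in train_x])


def predict(spend):
    return a + b * spend


for spend in (3, 50):  # inside and far outside the training range
    error = abs(predict(spend) - true_response(spend))
    print(f"spend={spend}: predicted {predict(spend):.1f}, "
          f"actual {true_response(spend):.1f}, error {error:.1f}")
```

Within the range it was fitted on, the line is a reasonable approximation; well beyond it, the same model is wrong by a large margin even though nothing about the model itself has changed.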

Even within the same use case, a prediction model may be inaccurate. For example, a customer churn model based on historical data could give more weight to recent purchases than to older ones, or vice versa. The first thing that comes to mind is to build a prediction on the data you already have; however, building a traffic prediction model on your existing data means you discount the future data that you will collect.

Labels

Image recognition begins with labelled data, such as photos that are tagged "cat", "dog" and so on. However, labelling all content is not so easy. For example, the right label may vary with the cultural standards and norms of different countries. Much depends on the conditions and on how the labels were initially assigned.

Similarly, if a neural network is meant to predict the type of images coming from a mobile phone and has been trained on songs and photos from an iOS device, it will not be able to predict the same type of content coming from an Android device, and vice versa.

Many open source neural networks for facial recognition have been tuned to a specific data set. If we try to use such a network in real situations on real cameras, it performs poorly because the images coming from the new domain are slightly different; the network cannot process them properly and accuracy drops. Unfortunately, it is difficult to predict in which domains a model will work well. There are no estimates or formulas to help scientists find the best one.
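The accuracy drop on a new domain can be imitated with a toy example. The Python sketch below is not a neural network: it uses a deliberately tiny classifier (a single threshold on one made-up feature) and simulated data, with the assumption that the new domain shifts the feature values by a fixed offset.

```python
import random
from statistics import mean

random.seed(42)


def sample(n, offset):
    """Hypothetical 1-D feature: class 0 centred at offset, class 1 at offset + 2."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        data.append((random.gauss(offset + 2 * label, 0.5), label))
    return data


def train_threshold(data):
    """Tiny 'model': the midpoint between the two class means."""
    m0 = mean(x for x, y in data if y == 0)
    m1 = mean(x for x, y in data if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, data):
    """Share of items where 'feature above threshold' matches the label."""
    return mean(int((x > threshold) == bool(y)) for x, y in data)


source = sample(2000, offset=0.0)  # domain the model was tuned on
target = sample(2000, offset=1.5)  # same classes, slightly shifted inputs

threshold = train_threshold(source)
acc_source = accuracy(threshold, source)
acc_target = accuracy(threshold, target)

print(f"accuracy on original domain: {acc_source:.1%}")
print(f"accuracy on new domain:      {acc_target:.1%}")
```

The classes and the model are identical in both runs; only the input distribution moved, which is enough to push accuracy down sharply.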

 

-bb-

Article source: InformationWeek, the website of the leading American magazine InformationWeek, devoted to modern technology and business