Your Prediction Gets As Good As Your Data

Posted by Omni-Space on 2017-07-11

In the past, I have often seen software engineers and data scientists assume that they can keep increasing their prediction accuracy by improving their machine learning algorithm. Here, I want to approach the classification problem from a different angle: I suggest that data scientists analyze the distribution of their data to measure how much information it contains. This approach gives an upper bound on how far the accuracy of a predictive algorithm can be improved, and helps make sure optimization efforts are not wasted.

Information and Entropy

In information theory, mathematicians have developed useful measures such as entropy to compute the information level in data. Let's think of a biased coin with a head probability of 1%. If one flips this coin, she gains more information when she sees a head (i.e. the rare event) than when she sees a tail (i.e. the more likely event). One can formulate the information level of a random event as the negative logarithm of its probability:

$$I(x) = -\log_2 P(x)$$
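As a minimal sketch of this idea, the snippet below computes the information of the head and tail events for the 1% coin. The function name `self_information` is our own choice for illustration, and base-2 logarithms are assumed so that the result is in bits.

```python
import math

def self_information(p: float) -> float:
    """Information, in bits, of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) equals -log2(p); the rarer the event, the larger the value.
    return math.log2(1.0 / p)

p_head = 0.01
info_head = self_information(p_head)        # the rare event
info_tail = self_information(1.0 - p_head)  # the likely event
print(f"head: {info_head:.4f} bits, tail: {info_tail:.4f} bits")
```

Seeing a head carries several bits of information, while seeing a tail carries only a small fraction of a bit.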

The negative logarithm captures the described intuition: the rarer the event, the more information it carries. Mathematicians also formulated another measure, called entropy, which captures the average information in a random process, in bits. The entropy formula for a discrete random variable is:

$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$

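For the first example, let's assume we have a coin with P(H)=0% and P(T)=100%. Using the convention that 0 · log2(0) = 0, we can compute the entropy of the coin as follows:

$$H = -(0 \cdot \log_2 0 + 1 \cdot \log_2 1) = 0 \text{ bits}$$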

For the second example, let's consider a coin where P(H)=1% and P(T)=1-P(H)=99%. Plugging in the numbers, one finds that the entropy of such a coin is:

$$H = -(0.01 \log_2 0.01 + 0.99 \log_2 0.99) \approx 0.081 \text{ bits}$$

Finally, if the coin has P(H) = P(T) = 0.5 (i.e. a fair coin), its entropy is calculated as follows:

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$
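The three coin examples can be checked numerically. The following is a small sketch rather than a reference implementation: the helper name `entropy` and the dictionary of coins are our own, and zero-probability outcomes are skipped under the usual convention that they contribute nothing.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped (convention: 0 * log2(0) = 0).
    # log2(1/p) is used instead of -log2(p); the two are equal.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

coins = {
    "always tails": [0.0, 1.0],
    "1% heads": [0.01, 0.99],
    "fair": [0.5, 0.5],
}
for name, pmf in coins.items():
    print(f"{name}: {entropy(pmf):.4f} bits")
```

The same function works for any discrete distribution, not only coins, as long as the probabilities sum to one.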

Entropy and Predictability

So, what do these examples tell us? If we have a coin with a head probability of zero, the coin's entropy is zero, meaning that the average information in the coin is zero. This makes sense because flipping such a coin always comes up tails. Thus, the prediction accuracy is 100%. In other words, when the entropy is zero, we have maximum predictability.

In the second example, the head probability is not zero but still very close to zero, which again makes the coin very predictable, with a low entropy.

Finally, in the last example we have a 50/50 chance of seeing head/tail events, which maximizes the entropy and consequently minimizes the predictability. In other words, one can show that a fair coin has the maximum entropy of 1 bit, making the prediction as good as a random guess.
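To make the inverse relationship concrete, here is a rough sketch that prints the entropy of a coin next to the accuracy of the best possible guess. It assumes the predictor has no information other than the coin's bias, so the best strategy is to always predict the more likely side. The function names are hypothetical and chosen for this illustration only.

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin with head probability p."""
    return sum(q * math.log2(1.0 / q) for q in (p, 1.0 - p) if q > 0)

def best_guess_accuracy(p):
    """Accuracy of always predicting the more likely side of the coin.

    Assumption: no other features are available to the predictor.
    """
    return max(p, 1.0 - p)

for p in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(f"P(H)={p:.2f}  entropy={binary_entropy(p):.3f} bits  "
          f"best accuracy={best_guess_accuracy(p):.0%}")
```

As P(H) moves from 0 toward 0.5, the entropy rises toward 1 bit while the best achievable accuracy falls toward 50%.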

Kullback–Leibler Divergence

As a last example, we show how we can borrow ideas from information theory to measure the distance between two probability distributions. Let's assume we are modeling two random processes by their pmfs: P(.) and Q(.). One can employ the entropy measure to compute the distance between the two pmfs as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$

The above distance function is known as the KL divergence, which measures the distance of the Q distribution from P. Note that it is not symmetric: swapping P and Q generally gives a different value. The KL divergence can be very useful in various applications such as NLP problems where we want to measure the distance between the distributions of two documents (e.g. modelled as bags of words).
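The bag-of-words use case could look roughly like the sketch below. The two toy documents, the helper names, and the additive smoothing are all assumptions made for illustration. Smoothing is included only so that no word has zero probability under Q, which would make the divergence infinite.

```python
import math
from collections import Counter

def bag_of_words_pmf(text, vocabulary, smoothing=1.0):
    """Word pmf over a shared vocabulary, with additive smoothing to avoid zeros."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}

def kl_divergence(p, q):
    """D_KL(P || Q), in bits, for two pmfs defined over the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

# Hypothetical toy documents.
doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log"
vocab = sorted(set(doc_a.split()) | set(doc_b.split()))

p = bag_of_words_pmf(doc_a, vocab)
q = bag_of_words_pmf(doc_b, vocab)
print(f"D_KL(P||Q) = {kl_divergence(p, q):.4f} bits")
print(f"D_KL(P||P) = {kl_divergence(p, p):.4f} bits")
```

The divergence of a distribution from itself is zero, and it grows as the word distributions of the two documents drift apart.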

Wrap-up

In this post, we showed that entropy from information theory provides a way to measure how much information exists in a given dataset. We also highlighted the inverse relationship between entropy and predictability. This shows that one can use entropy to calculate an upper bound on the accuracy of the prediction problem at hand.

Source: http://www.aioptify.com/informationbound.php
