Transform entropy
Python implementation of the Transform entropy section of the Overview
Sections

Definitions

Model entropy

Example - a weather forecast
Definitions
Let $T$ be a one functional transform, $T \in \mathcal{T}_{U,\mathrm{f},1}$, having underlying variables $V = \mathrm{und}(T)$. Let $A$ be a histogram, $A \in \mathcal{A}$, in the underlying variables, $\mathrm{vars}(A) = V$, having size $z = \mathrm{size}(A) > 0$. The underlying volume is $v = |V^{\mathrm{C}}|$. The derived volume is $w = |T^{-1}|$.
Consider the deck of cards example,
def lluu(ll):
    return listsSystem([(v,sset(ww)) for (v,ww) in ll])
[suit,rank] = map(VarStr, ["suit","rank"])
[hearts,clubs,diamonds,spades] = map(ValStr, ["hearts","clubs","diamonds","spades"])
[jack,queen,king,ace] = map(ValStr, ["J","Q","K","A"])
uu = lluu([
(suit, [hearts,clubs,diamonds,spades]),
(rank, [jack,queen,king,ace] + list(map(ValInt,range(2,10+1))))])
vv = sset([suit, rank])
uu
# {(rank, {A, J, K, Q, 2, 3, 4, 5, 6, 7, 8, 9, 10}), (suit, {clubs, diamonds, hearts, spades})}
vv
# {rank, suit}
aa = unit(cart(uu,vv))
rpln(aall(aa))
# ({(rank, A), (suit, clubs)}, 1 % 1)
# ({(rank, A), (suit, diamonds)}, 1 % 1)
# ({(rank, A), (suit, hearts)}, 1 % 1)
# ({(rank, A), (suit, spades)}, 1 % 1)
# ({(rank, J), (suit, clubs)}, 1 % 1)
# ({(rank, J), (suit, diamonds)}, 1 % 1)
# ...
# ({(rank, 9), (suit, hearts)}, 1 % 1)
# ({(rank, 9), (suit, spades)}, 1 % 1)
# ({(rank, 10), (suit, clubs)}, 1 % 1)
# ({(rank, 10), (suit, diamonds)}, 1 % 1)
# ({(rank, 10), (suit, hearts)}, 1 % 1)
# ({(rank, 10), (suit, spades)}, 1 % 1)
Also consider a game of cards which has a special deck such that spades and clubs are pip cards and hearts and diamonds are face cards. The suit and the rank are no longer independent,
bb = unit(sset(
[llss([(suit,s),(rank,r)]) for s in [spades,clubs] for r in [ace] + list(map(ValInt,range(2,10+1)))] +
[llss([(suit,s),(rank,r)]) for s in [hearts,diamonds] for r in [jack,queen,king]]))
rpln(aall(bb))
# ({(rank, A), (suit, clubs)}, 1 % 1)
# ({(rank, A), (suit, spades)}, 1 % 1)
# ({(rank, J), (suit, diamonds)}, 1 % 1)
# ...
# ({(rank, 9), (suit, spades)}, 1 % 1)
# ({(rank, 10), (suit, clubs)}, 1 % 1)
# ({(rank, 10), (suit, spades)}, 1 % 1)
Consider the transform relating the suit to the colour,
colour = VarStr("colour")
red = ValStr("red")
black = ValStr("black")
xx = llaa([(llss([(suit, u),(colour, w)]),1) for (u,w) in [(hearts, red), (clubs, black), (diamonds, red), (spades, black)]])
rpln(aall(xx))
# ({(colour, black), (suit, clubs)}, 1 % 1)
# ({(colour, black), (suit, spades)}, 1 % 1)
# ({(colour, red), (suit, diamonds)}, 1 % 1)
# ({(colour, red), (suit, hearts)}, 1 % 1)
ww = sset([colour])
kk = vars(xx) - ww
tt = trans(xx,ww)
ttaa(tt) == xx
# True
und(tt) == kk
# True
der(tt) == ww
# True
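As a quick check of the transform's action, applying $T$ to a single-state histogram maps the suit to its colour. For example,
qq = unit(sset([llss([(suit,hearts)])]))
rpln(aall(tmul(qq,tt)))
# ({(colour, red)}, 1 % 1)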
In order to compare the sized derived entropies of the two decks, we shall add together two special decks, $B + B$, to have the same size as the whole deck, $A$,
size(aa)
# 52 % 1
size(bb)
# 26 % 1
bb = mul(scalar(2),unit(sset(
[llss([(suit,s),(rank,r)]) for s in [spades,clubs] for r in [ace] + list(map(ValInt,range(2,10+1)))] +
[llss([(suit,s),(rank,r)]) for s in [hearts,diamonds] for r in [jack,queen,king]])))
size(bb)
# 52 % 1
rpln(aall(bb))
# ({(rank, A), (suit, clubs)}, 2 % 1)
# ({(rank, A), (suit, spades)}, 2 % 1)
# ({(rank, J), (suit, diamonds)}, 2 % 1)
# ...
# ({(rank, 9), (suit, spades)}, 2 % 1)
# ({(rank, 10), (suit, clubs)}, 2 % 1)
# ({(rank, 10), (suit, spades)}, 2 % 1)
rpln(aall(tmul(aa,tt)))
# ({(colour, black)}, 26 % 1)
# ({(colour, red)}, 26 % 1)
rpln(aall(tmul(bb,tt)))
# ({(colour, black)}, 40 % 1)
# ({(colour, red)}, 12 % 1)
The derived entropy or component size entropy is \[ \begin{eqnarray} \mathrm{entropy}(A * T) &:=& -\sum_{(R,\cdot) \in T^{-1}} (\hat{A} * T)_R \times \ln~(\hat{A} * T)_R \end{eqnarray} \]
ent = histogramsEntropy
ent(tmul(aa,tt))
# 0.6931471805599453
ent(tmul(bb,tt))
# 0.5402041423888608
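The same values can be recovered directly from the definition. The following sketch is an addition for illustration; it assumes that log from the math module is in scope, as elsewhere in this session, and that the rational counts returned by size and aall convert with float,
def entDirect(aa):
    # -sum of p * log p over the effective states, where p is the normalised count
    z = float(size(aa))
    return -sum(float(q)/z * log(float(q)/z) for (_,q) in aall(aa) if q > 0)
entDirect(tmul(aa,tt))
# 0.6931471805599453
abs(entDirect(tmul(bb,tt)) - ent(tmul(bb,tt))) < 1e-9
# True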
The derived entropy is positive and less than or equal to the logarithm of the derived volume, $0 \leq \mathrm{entropy}(A * T) \leq \ln w$,
w = len(states(ared(xx,der(tt))))
w
# 2
log(w)
# 0.6931471805599453
ent(tmul(aa,tt)) <= log(w)
# True
ent(tmul(bb,tt)) <= log(w)
# True
Complementary to the derived entropy is the expected component entropy, \[ \begin{eqnarray} \mathrm{entropyComponent}(A,T) &:=& \sum_{(R,C) \in T^{-1}} (\hat{A} * T)_R \times \mathrm{entropy}(A * C)\\ &=&\sum_{(R,\cdot) \in T^{-1}} (\hat{A} * T)_R \times \mathrm{entropy}(\{R\}^{\mathrm{U}} * T^{\odot A}) \end{eqnarray} \]
transformsHistogramsEntropyComponent :: Transform -> Histogram -> Double
For example,
def cent(aa,tt):
    return transformsHistogramsEntropyComponent(tt,aa)
cent(aa,tt)
# 3.2580965380214835
cent(bb,tt)
# 2.7178923956326213
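Because $T$ is functional, the components partition the effective states of $A$, so by the grouping property of entropy the expected component entropy is the histogram entropy minus the derived entropy, $\mathrm{entropyComponent}(A,T) = \mathrm{entropy}(A) - \mathrm{entropy}(A * T)$. A quick check, allowing for floating point rounding,
abs(ent(aa) - ent(tmul(aa,tt)) - cent(aa,tt)) < 1e-9
# True
abs(ent(bb) - ent(tmul(bb,tt)) - cent(bb,tt)) < 1e-9
# True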
The cartesian derived entropy or component cardinality entropy is \[ \begin{eqnarray} \mathrm{entropy}(V^{\mathrm{C}} * T) &:=& -\sum_{(R,\cdot) \in T^{-1}} (\hat{V}^{\mathrm{C}} * T)_R \times \ln~(\hat{V}^{\mathrm{C}} * T)_R \end{eqnarray} \]
vvc = unit(cart(uu,vv))
ent(tmul(vvc,tt))
# 0.6931471805599453
In the case of the whole deck of cards, the histogram is cartesian, $A = V^{\mathrm{C}}$, so the component cardinality entropy equals the derived entropy, $V^{\mathrm{C}} * T = A * T$,
ent(tmul(vvc,tt)) == ent(tmul(aa,tt))
# True
The cartesian derived entropy is positive and less than or equal to the logarithm of the derived volume, $0 \leq \mathrm{entropy}(V^{\mathrm{C}} * T) \leq \ln w$,
ent(tmul(vvc,tt)) <= log(w)
# True
The derived and cartesian derived sum entropy or component size cardinality sum entropy is \[ \begin{eqnarray} \mathrm{entropy}(A * T) + \mathrm{entropy}(V^{\mathrm{C}} * T) \end{eqnarray} \]
ent(tmul(aa,tt)) + ent(tmul(vvc,tt))
# 1.3862943611198906
ent(tmul(bb,tt)) + ent(tmul(vvc,tt))
# 1.2333513229488062
The component size cardinality cross entropy is the negative derived histogram expected normalised cartesian derived count logarithm, \[ \begin{eqnarray} \mathrm{entropyCross}(A * T,V^{\mathrm{C}} * T) &:=& -\sum_{(R,\cdot) \in T^{-1}} (\hat{A} * T)_R \times \ln~(\hat{V}^{\mathrm{C}} * T)_R \end{eqnarray} \]
histogramsHistogramsEntropyCross :: Histogram -> Histogram -> Double
For example,
crent = histogramsHistogramsEntropyCross
crent(tmul(aa,tt),tmul(vvc,tt))
# 0.6931471805599453
crent(tmul(bb,tt),tmul(vvc,tt))
# 0.6931471805599453
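Both values equal $\ln 2$: the cartesian derived is uniform over the two colours, so every $(\hat{V}^{\mathrm{C}} * T)_R = 1/2$ and the cross entropy is $\ln 2$ whatever the deck,
abs(crent(tmul(bb,tt),tmul(vvc,tt)) - log(2)) < 1e-9
# True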
The component size cardinality cross entropy is greater than or equal to the derived entropy, $\mathrm{entropyCross}(A * T,V^{\mathrm{C}} * T) \geq \mathrm{entropy}(A * T)$,
crent(tmul(aa,tt),tmul(vvc,tt)) >= ent(tmul(aa,tt))
# True
crent(tmul(bb,tt),tmul(vvc,tt)) >= ent(tmul(bb,tt))
# True
The component cardinality size cross entropy is the negative cartesian derived expected normalised derived histogram count logarithm, \[ \begin{eqnarray} \mathrm{entropyCross}(V^{\mathrm{C}} * T,A * T) &:=& -\sum_{(R,\cdot) \in T^{-1}} (\hat{V}^{\mathrm{C}} * T)_R \times \ln~(\hat{A} * T)_R \end{eqnarray} \]
crent(tmul(vvc,tt),tmul(aa,tt))
# 0.6931471805599453
crent(tmul(vvc,tt),tmul(bb,tt))
# 0.864350666630459
The component cardinality size cross entropy is greater than or equal to the cartesian derived entropy, $\mathrm{entropyCross}(V^{\mathrm{C}} * T,A * T) \geq \mathrm{entropy}(V^{\mathrm{C}} * T)$,
crent(tmul(vvc,tt),tmul(aa,tt)) >= ent(tmul(vvc,tt))
# True
crent(tmul(vvc,tt),tmul(bb,tt)) >= ent(tmul(vvc,tt))
# True
The component size cardinality sum cross entropy is \[ \begin{eqnarray} \mathrm{entropy}(A * T + V^{\mathrm{C}} * T) \end{eqnarray} \]
ent(add(tmul(aa,tt),tmul(vvc,tt)))
# 0.6931471805599453
ent(add(tmul(bb,tt),tmul(vvc,tt)))
# 0.6564535237245771
The component size cardinality sum cross entropy is positive and less than or equal to the logarithm of the derived volume, $0 \leq \mathrm{entropy}(A * T + V^{\mathrm{C}} * T) \leq \ln w$,
ent(add(tmul(aa,tt),tmul(vvc,tt))) <= log(w)
# True
ent(add(tmul(bb,tt),tmul(vvc,tt))) <= log(w)
# True
In all cases the cross entropy is maximised when high size components are low cardinality components, $(\hat{A} * T)_R \gg (\hat{V}^{\mathrm{C}} * T)_R$ or $\mathrm{size}(A * C)/z \gg |C|/v$, and low size components are high cardinality components, $(\hat{A} * T)_R \ll (\hat{V}^{\mathrm{C}} * T)_R$ or $\mathrm{size}(A * C)/z \ll |C|/v$, where $(R,C) \in T^{-1}$. To show this consider another transform $T'$,
tt1 = trans(cdaa([[1,1,1],[1,2,2],[1,3,2],[2,1,2],[2,2,1],[2,3,2],[3,1,2],[3,2,2],[3,3,1]]), sset([VarInt(3)]))
rpln(aall(ttaa(tt1)))
# ({(1, 1), (2, 1), (3, 1)}, 1 % 1)
# ({(1, 1), (2, 2), (3, 2)}, 1 % 1)
# ({(1, 1), (2, 3), (3, 2)}, 1 % 1)
# ({(1, 2), (2, 1), (3, 2)}, 1 % 1)
# ({(1, 2), (2, 2), (3, 1)}, 1 % 1)
# ({(1, 2), (2, 3), (3, 2)}, 1 % 1)
# ({(1, 3), (2, 1), (3, 2)}, 1 % 1)
# ({(1, 3), (2, 2), (3, 2)}, 1 % 1)
# ({(1, 3), (2, 3), (3, 1)}, 1 % 1)
Let $A'$ be a scaled regular diagonal histogram plus a scaled regular cartesian histogram,
aa1 = resize(9,add(norm(regdiag(3,2)),norm(regcart(3,2))))
rpln(aall(aa1))
# ({(1, 1), (2, 1)}, 2 % 1)
# ({(1, 1), (2, 2)}, 1 % 2)
# ({(1, 1), (2, 3)}, 1 % 2)
# ({(1, 2), (2, 1)}, 1 % 2)
# ({(1, 2), (2, 2)}, 2 % 1)
# ({(1, 2), (2, 3)}, 1 % 2)
# ({(1, 3), (2, 1)}, 1 % 2)
# ({(1, 3), (2, 2)}, 1 % 2)
# ({(1, 3), (2, 3)}, 2 % 1)
vvc1 = regcart(3,2)
rpln(aall(vvc1))
# ({(1, 1), (2, 1)}, 1 % 1)
# ({(1, 1), (2, 2)}, 1 % 1)
# ({(1, 1), (2, 3)}, 1 % 1)
# ({(1, 2), (2, 1)}, 1 % 1)
# ({(1, 2), (2, 2)}, 1 % 1)
# ({(1, 2), (2, 3)}, 1 % 1)
# ({(1, 3), (2, 1)}, 1 % 1)
# ({(1, 3), (2, 2)}, 1 % 1)
# ({(1, 3), (2, 3)}, 1 % 1)
rpln(aall(tmul(aa1,tt1)))
# ({(3, 1)}, 6 % 1)
# ({(3, 2)}, 3 % 1)
rpln(aall(tmul(vvc1,tt1)))
# ({(3, 1)}, 3 % 1)
# ({(3, 2)}, 6 % 1)
The derived entropy equals the cartesian derived entropy,
ent(tmul(aa1,tt1))
# 0.6365141682948128
ent(tmul(vvc1,tt1))
# 0.6365141682948128
but the cross entropy is greater than either,
ent(add(tmul(aa1,tt1),tmul(vvc1,tt1)))
# 0.6931471805599453
crent(tmul(aa1,tt1),tmul(vvc1,tt1))
# 0.8675632284814613
crent(tmul(vvc1,tt1),tmul(aa1,tt1))
# 0.8675632284814613
The cross entropy is minimised when the normalised derived histogram equals the normalised cartesian derived, $\hat{A} * T = \hat{V}^{\mathrm{C}} * T$ or $\forall (R,C) \in T^{-1}~(\mathrm{size}(A * C)/z = |C|/v)$. In this case the cross entropy equals the corresponding component entropy,
ent(add(tmul(vvc1,tt1),tmul(vvc1,tt1)))
# 0.6365141682948128
crent(tmul(vvc1,tt1),tmul(vvc1,tt1))
# 0.6365141682948128
The component size cardinality relative entropy is the component size cardinality cross entropy minus the component size entropy, \[ \begin{eqnarray} \mathrm{entropyRelative}(A * T,V^{\mathrm{C}} * T) &:=& \sum_{(R,\cdot) \in T^{-1}} (\hat{A} * T)_R \times \ln\frac{(\hat{A} * T)_R}{(\hat{V}^{\mathrm{C}} * T)_R}\\ &=& \mathrm{entropyCross}(A * T,V^{\mathrm{C}} * T)~-~\mathrm{entropy}(A * T) \end{eqnarray} \] The component size cardinality relative entropy is positive, $\mathrm{entropyRelative}(A * T,V^{\mathrm{C}} * T) \geq 0$,
crent(tmul(aa,tt),tmul(vvc,tt)) - ent(tmul(aa,tt))
# 0.0
crent(tmul(bb,tt),tmul(vvc,tt)) - ent(tmul(bb,tt))
# 0.15294303817108446
crent(tmul(aa1,tt1),tmul(vvc1,tt1)) - ent(tmul(aa1,tt1))
# 0.2310490601866485
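The component size cardinality relative entropy is just the Kullback-Leibler divergence between the normalised derived and the normalised cartesian derived. For the special deck it can be reproduced with plain floats, taking $p = (40/52, 12/52)$ against $q = (1/2, 1/2)$,
p, q = (40/52, 12/52), (1/2, 1/2)
kl = sum(x * log(x/y) for (x,y) in zip(p,q))
abs(kl - (crent(tmul(bb,tt),tmul(vvc,tt)) - ent(tmul(bb,tt)))) < 1e-9
# True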
The component cardinality size relative entropy is the component cardinality size cross entropy minus the component cardinality entropy, \[ \begin{eqnarray} \mathrm{entropyRelative}(V^{\mathrm{C}} * T,A * T) &:=& \sum_{(R,\cdot) \in T^{-1}} (\hat{V}^{\mathrm{C}} * T)_R \times \ln\frac{(\hat{V}^{\mathrm{C}} * T)_R}{(\hat{A} * T)_R}\\ &=& \mathrm{entropyCross}(V^{\mathrm{C}} * T,A * T)~-~\mathrm{entropy}(V^{\mathrm{C}} * T) \end{eqnarray} \] The component cardinality size relative entropy is positive, $\mathrm{entropyRelative}(V^{\mathrm{C}} * T,A * T) \geq 0$,
crent(tmul(vvc,tt),tmul(aa,tt)) - ent(tmul(vvc,tt))
# 0.0
crent(tmul(vvc,tt),tmul(bb,tt)) - ent(tmul(vvc,tt))
# 0.17120348607051372
crent(tmul(vvc1,tt1),tmul(aa1,tt1)) - ent(tmul(vvc1,tt1))
# 0.2310490601866485
The size-volume scaled component size cardinality sum relative entropy is the size-volume scaled component size cardinality sum cross entropy minus the size-volume scaled component size cardinality sum entropy, \[ \begin{eqnarray} (z+v) \times \mathrm{entropy}(A * T + V^{\mathrm{C}} * T)~-~z \times \mathrm{entropy}(A * T)~-~v \times \mathrm{entropy}(V^{\mathrm{C}} * T) \end{eqnarray} \] The size-volume scaled component size cardinality sum relative entropy is positive and less than the size-volume scaled logarithm of the derived volume, $(z+v) \ln w$,
z = size(aa)
v = vol(uu,vv)
(z+v) * ent(add(tmul(aa,tt),tmul(vvc,tt))) - z * ent(tmul(aa,tt)) - v * ent(tmul(vvc,tt))
# 0.0
(z+v) * ent(add(tmul(bb,tt),tmul(vvc,tt))) - z * ent(tmul(bb,tt)) - v * ent(tmul(vvc,tt))
# 4.136897674018108
(z+v) * log(w)
# 72.0873067782343
z1 = 9
v1 = 9
w1 = 2
(z1+v1) * ent(add(tmul(aa1,tt1),tmul(vvc1,tt1))) - z1 * ent(tmul(aa1,tt1)) - v1 * ent(tmul(vvc1,tt1))
# 1.0193942207723854
(z1+v1) * log(w1)
# 12.476649250079015
(z1+v1) * ent(add(tmul(vvc1,tt1),tmul(vvc1,tt1))) - z1 * ent(tmul(vvc1,tt1)) - v1 * ent(tmul(vvc1,tt1))
# 0.0
In all cases the relative entropy is maximised when (a) the cross entropy is maximised and (b) the component entropy is minimised. That is, the relative entropy is maximised when both (i) the component size entropy, $\mathrm{entropy}(A * T)$, and (ii) the component cardinality entropy, $\mathrm{entropy}(V^{\mathrm{C}} * T)$, are low, but low in different ways so that the component size cardinality sum cross entropy, $\mathrm{entropy}(A * T + V^{\mathrm{C}} * T)$, is high.
Model entropy
Let histogram $A$ have a set of variables $V = \mathrm{vars}(A)$ which is partitioned into query variables $K \subset V$ and label variables $V \setminus K$. Let $T \in \mathcal{T}_{U,\mathrm{f},1}$ be a one functional transform having underlying variables equal to the query variables, $\mathrm{und}(T) = K$. As shown above, given a query state $Q \in K^{\mathrm{CS}}$ that is effective in the sample derived, $R \in (A * T)^{\mathrm{FS}}$ where $\{R\} = (\{Q\}^{\mathrm{U}} * T)^{\mathrm{FS}}$, the probability histogram for the label is \[ \begin{eqnarray} \{Q\}^{\mathrm{U}} * T * T^{\odot A}~\%~(V \setminus K) &\in& \mathcal{A} \cap \mathcal{P} \end{eqnarray} \] In the deck of cards example, the model of the colours of the suits does not tell us anything about the rank given the suit in the case where the histogram is the entire deck,
qq = unit(sset([llss([(suit,clubs)])]))
vk = vv - kk
rpln(aall(norm(ared(mul(mul(tmul(qq,tt),xx),aa),vk))))
# ({(rank, A)}, 1 % 13)
# ({(rank, J)}, 1 % 13)
# ({(rank, K)}, 1 % 13)
# ({(rank, Q)}, 1 % 13)
# ({(rank, 2)}, 1 % 13)
# ({(rank, 3)}, 1 % 13)
# ({(rank, 4)}, 1 % 13)
# ({(rank, 5)}, 1 % 13)
# ({(rank, 6)}, 1 % 13)
# ({(rank, 7)}, 1 % 13)
# ({(rank, 8)}, 1 % 13)
# ({(rank, 9)}, 1 % 13)
# ({(rank, 10)}, 1 % 13)
So the entropy is high,
ent(ared(mul(mul(tmul(qq,tt),xx),aa),vk))
# 2.5649493574615376
In the case of the special deck, however, our model aligns the suit to the rank via colour, so a query on clubs is always a pip card,
rpln(aall(norm(ared(mul(mul(tmul(qq,tt),xx),bb),vk))))
# ({(rank, A)}, 1 % 10)
# ({(rank, 2)}, 1 % 10)
# ({(rank, 3)}, 1 % 10)
# ({(rank, 4)}, 1 % 10)
# ({(rank, 5)}, 1 % 10)
# ({(rank, 6)}, 1 % 10)
# ({(rank, 7)}, 1 % 10)
# ({(rank, 8)}, 1 % 10)
# ({(rank, 9)}, 1 % 10)
# ({(rank, 10)}, 1 % 10)
and the entropy is lower,
ent(ared(mul(mul(tmul(qq,tt),xx),bb),vk))
# 2.3025850929940455
Similarly, a query on hearts is always a face card,
qq = unit(sset([llss([(suit,hearts)])]))
rpln(aall(norm(ared(mul(mul(tmul(qq,tt),xx),bb),vk))))
# ({(rank, J)}, 1 % 3)
# ({(rank, K)}, 1 % 3)
# ({(rank, Q)}, 1 % 3)
which has still lower entropy,
ent(ared(mul(mul(tmul(qq,tt),xx),bb),vk))
# 1.0986122886681096
If the normalised histogram, $\hat{A} \in \mathcal{A} \cap \mathcal{P}$, is treated as a probability function of a single-state query, the scaled expected entropy of the modelled transformed conditional product, or scaled label entropy, is \[ \begin{eqnarray} &&\sum_{(R,C) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(A * C~\%~(V \setminus K))\\ &=&\sum_{(R,\cdot) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(\{R\}^{\mathrm{U}} * T^{\odot A}~\%~(V \setminus K)) \end{eqnarray} \]
setVarsTransformsHistogramsEntropyLabel :: Set.Set Variable -> Transform -> Histogram -> Double
For example,
def tlent(kk,aa,tt):
    return setVarsTransformsHistogramsEntropyLabel(kk,tt,aa)
tlent(sset(),aa,tt)
# 169.42101997711714
tlent(kk,aa,tt)
# 133.37736658799994
tlent(sset(),bb,tt)
# 141.3304045728963
tlent(kk,bb,tt)
# 105.28675118377913
This is similar to the definition of the scaled expected component entropy, above, \[ \begin{eqnarray} z \times \mathrm{entropyComponent}(A,T) &:=& \sum_{(R,C) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(A * C)\\ &=&\sum_{(R,\cdot) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(\{R\}^{\mathrm{U}} * T^{\odot A}) \end{eqnarray} \] but now the component is reduced to the label variables, $V \setminus K$,
def cent(aa,tt):
    return transformsHistogramsEntropyComponent(tt,aa)
z = size(aa)
z * cent(aa,tt)
# 169.42101997711714
z * cent(bb,tt)
# 141.3304045728963
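In particular, with an empty query variable set the reduction $\%~(V \setminus K)$ keeps all of the variables, so the scaled label entropy coincides with the scaled expected component entropy; a quick check, allowing for floating point rounding,
abs(tlent(sset(),aa,tt) - z * cent(aa,tt)) < 1e-9
# True
abs(tlent(sset(),bb,tt) - z * cent(bb,tt)) < 1e-9
# True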
The label entropy may be contrasted with the alignment between the derived variables, $W$, and the label variables, $V \setminus K$, \[ \begin{eqnarray} \mathrm{algn}(A * \mathrm{his}(T)~\%~(W \cup V \setminus K)) \end{eqnarray} \]
algn(ared(mul(aa,ttaa(tt)),ww|vk))
# 0.0
algn(ared(mul(bb,ttaa(tt)),ww|vk))
# 17.15244186319102
The alignment varies against the scaled label entropy or scaled query conditional entropy. Let $B = A * \mathrm{his}(T)~\%~(W \cup V \setminus K)$, \[ \begin{eqnarray} &&\mathrm{algn}(A * \mathrm{his}(T)~\%~(W \cup V \setminus K)) \\ &&\hspace{5em}=\mathrm{algn}(B) \\ &&\hspace{5em}\approx z \times \mathrm{entropy}(B^{\mathrm{X}}) - z \times \mathrm{entropy}(B) \\ &&\hspace{5em}\sim z \times \mathrm{entropy}(B\%W) + z \times \mathrm{entropy}(B\%(V \setminus K)) - z \times \mathrm{entropy}(B) \\ &&\hspace{5em}\sim -(z \times \mathrm{entropy}(B) - z \times \mathrm{entropy}(B\%W)) \\ &&\hspace{5em}= -\sum_{R \in (B\%W)^{\mathrm{FS}}} (B\%W)_R \times \mathrm{entropy}(B * \{R\}^{\mathrm{U}}~\%~(V \setminus K))\\ &&\hspace{5em}= -\sum_{(R,C) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(A * C~\%~(V \setminus K)) \end{eqnarray} \] The label entropy may also be compared to the slice entropy, which is the sum of the sized entropies of the contingent slices reduced to the label variables, $V \setminus K$, \[ \sum_{R \in (A\%K)^{\mathrm{FS}}} (A\%K)_R \times \mathrm{entropy}(A * \{R\}^{\mathrm{U}}~\%~(V \setminus K)) \]
def lent(kk,aa):
    return size(aa) * (ent(aa) - ent(ared(aa,sset(kk))))
lent(sset(),aa)
# 205.46467336623434
lent(kk,aa)
# 133.37736658800003
lent(sset(),bb)
# 169.42101997711714
lent(kk,bb)
# 105.28675118377922
In the case where the relation between the derived variables and the label variables is functional or causal, \[ \begin{eqnarray} \mathrm{split}(W,(A * \mathrm{his}(T)~\%~(W \cup V \setminus K))^{\mathrm{FS}}) &\in& W^{\mathrm{CS}} \to (V \setminus K)^{\mathrm{CS}} \end{eqnarray} \] the label entropy is zero, \[ \begin{eqnarray} \sum_{(R,C) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(A * C~\%~(V \setminus K)) &=& 0 \end{eqnarray} \] This would be the case, for example, for a deck consisting of 26 aces of spades and 26 queens of hearts,
cc = mul(scalar(26),unit(sset([
llss([(suit,spades),(rank,ace)]),
llss([(suit,hearts),(rank,queen)])])))
rpln(aall(cc))
# ({(rank, A), (suit, spades)}, 26 % 1)
# ({(rank, Q), (suit, hearts)}, 26 % 1)
rpln(aall(tmul(cc,tt)))
# ({(colour, black)}, 26 % 1)
# ({(colour, red)}, 26 % 1)
rpln(aall(ared(mul(cc,ttaa(tt)),ww|vk)))
# ({(colour, black), (rank, A)}, 26 % 1)
# ({(colour, red), (rank, Q)}, 26 % 1)
ssplit = setVarsSetStatesSplit
rpln(ssplit(ww,states(ared(mul(cc,ttaa(tt)),ww|vk))))
# ({(colour, black)}, {(rank, A)})
# ({(colour, red)}, {(rank, Q)})
tlent(kk,cc,tt)
# 0.0
algn(ared(mul(cc,ttaa(tt)),ww|vk))
# 32.31474810951032
Now the model predicts the rank given the suit,
qq = unit(sset([llss([(suit,clubs)])]))
rpln(aall(norm(ared(mul(mul(tmul(qq,tt),xx),cc),vk))))
# ({(rank, A)}, 1 % 1)
qq = unit(sset([llss([(suit,hearts)])]))
rpln(aall(norm(ared(mul(mul(tmul(qq,tt),xx),cc),vk))))
# ({(rank, Q)}, 1 % 1)
So label entropy is a measure of the ambiguity in the relation between the derived variables and the label variables. Negative label entropy may be viewed as the degree to which the derived variables of the model predict the label variables. In the cases of low label entropy, or high causality, the derived variables and the label variables are correlated and therefore aligned, $\mathrm{algn}(A * \mathrm{his}(T)~\%~(W \cup V \setminus K)) > 0$. In these cases the derived histogram tends to the diagonal, $\mathrm{algn}(A * T) > 0$.
Example - a weather forecast
Some of the concepts above regarding transform entropy can be demonstrated with the sample of some weather measurements created in States, histories and histograms,
def lluu(ll):
    return listsSystem([(v,sset(ww)) for (v,ww) in ll])
def llhh(vv,ev):
    return listsHistory([(IdInt(i), llss(zip(vv,ll))) for (i,ll) in ev])
def red(aa,ll):
    return setVarsHistogramsReduce(sset(ll),aa)
def ssplit(ll,aa):
    return setVarsSetStatesSplit(sset(ll),states(aa))
def aarr(aa):
    return [(ss,float(q)) for (ss,q) in aall(aa)]
def lltt(kk,ww,qq):
    return trans(unit(sset([llss(zip(kk + ww,ll)) for ll in qq])),sset(ww))
def query(qq,tt,aa,ll):
    return norm(red(mul(mul(tmul(qq,tt),ttaa(tt)),aa),ll))
[pressure,cloud,wind,rain] = map(VarStr,["pressure","cloud","wind","rain"])
[low,medium,high,none,light,heavy,strong] = map(ValStr,["low","medium","high","none","light","heavy","strong"])
uu = lluu([
(pressure, [low,medium,high]),
(cloud, [none,light,heavy]),
(wind, [none,light,strong]),
(rain, [none,light,heavy])])
vv = uvars(uu)
hh = llhh([pressure,cloud,wind,rain],[
(1,[high,none,none,none]),
(2,[medium,light,none,light]),
(3,[high,none,light,none]),
(4,[low,heavy,strong,heavy]),
(5,[low,none,light,light]),
(6,[medium,none,light,light]),
(7,[low,heavy,light,heavy]),
(8,[high,none,light,none]),
(9,[medium,light,strong,heavy]),
(10,[medium,light,light,light]),
(11,[high,light,light,heavy]),
(12,[medium,none,none,none]),
(13,[medium,light,none,none]),
(14,[high,light,strong,light]),
(15,[medium,none,light,light]),
(16,[low,heavy,strong,heavy]),
(17,[low,heavy,light,heavy]),
(18,[high,none,none,none]),
(19,[low,light,none,light]),
(20,[high,none,none,none])])
aa = hhaa(hh)
uu
# {(cloud, {heavy, light, none}), (pressure, {high, low, medium}), (rain, {heavy, light, none}), (wind, {light, none, strong})}
vv
# {cloud, pressure, rain, wind}
rpln(aall(aa))
# ({(cloud, heavy), (pressure, low), (rain, heavy), (wind, light)}, 2 % 1)
# ({(cloud, heavy), (pressure, low), (rain, heavy), (wind, strong)}, 2 % 1)
# ({(cloud, light), (pressure, high), (rain, heavy), (wind, light)}, 1 % 1)
# ({(cloud, light), (pressure, high), (rain, light), (wind, strong)}, 1 % 1)
# ({(cloud, light), (pressure, low), (rain, light), (wind, none)}, 1 % 1)
# ({(cloud, light), (pressure, medium), (rain, heavy), (wind, strong)}, 1 % 1)
# ({(cloud, light), (pressure, medium), (rain, light), (wind, light)}, 1 % 1)
# ({(cloud, light), (pressure, medium), (rain, light), (wind, none)}, 1 % 1)
# ({(cloud, light), (pressure, medium), (rain, none), (wind, none)}, 1 % 1)
# ({(cloud, none), (pressure, high), (rain, none), (wind, light)}, 2 % 1)
# ({(cloud, none), (pressure, high), (rain, none), (wind, none)}, 3 % 1)
# ({(cloud, none), (pressure, low), (rain, light), (wind, light)}, 1 % 1)
# ({(cloud, none), (pressure, medium), (rain, light), (wind, light)}, 2 % 1)
# ({(cloud, none), (pressure, medium), (rain, none), (wind, none)}, 1 % 1)
size(aa)
# 20 % 1
We considered the case where we wish to predict the rain given the pressure, cloud and wind in Transforms, by creating a transform which relates cloud and wind,
cloud_and_wind = VarStr("cloud_and_wind")
tt = lltt([cloud,wind],[cloud_and_wind],[
[none, none, none],
[none, light, light],
[none, strong, light],
[light, none, light],
[light, light, light],
[light, strong, light],
[heavy, none, strong],
[heavy, light, strong],
[heavy, strong, strong]])
The derived, $A * T$, is
rpln(aall(tmul(aa,tt)))
# ({(cloud_and_wind, light)}, 12 % 1)
# ({(cloud_and_wind, none)}, 4 % 1)
# ({(cloud_and_wind, strong)}, 4 % 1)
rpln(aarr(norm(tmul(aa,tt))))
# ({(cloud_and_wind, light)}, 0.6)
# ({(cloud_and_wind, none)}, 0.2)
# ({(cloud_and_wind, strong)}, 0.2)
The derived entropy, $\mathrm{entropy}(A * T)$, is
ent = histogramsEntropy
ent(tmul(aa,tt))
# 0.9502705392332347
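This agrees with applying the definition directly to the normalised derived probabilities listed above, allowing for floating point rounding,
h = -(0.6*log(0.6) + 0.2*log(0.2) + 0.2*log(0.2))
abs(h - ent(tmul(aa,tt))) < 1e-9
# True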
The derived entropy is positive and less than or equal to the logarithm of the derived volume, $0 \leq \mathrm{entropy}(A * T) \leq \ln w$,
w = 3
log(w)
# 1.0986122886681098
Complementary to the derived entropy is the expected component entropy, $\mathrm{entropyComponent}(A,T)$,
cent = transformsHistogramsEntropyComponent
cent(tt,aa)
# 1.603411018796562
The cartesian derived, $V^{\mathrm{C}} * T$, is
vvc = unit(cart(uu,vv))
size(vvc)
# 81 % 1
rpln(aall(tmul(vvc,tt)))
# ({(cloud_and_wind, light)}, 45 % 1)
# ({(cloud_and_wind, none)}, 9 % 1)
# ({(cloud_and_wind, strong)}, 27 % 1)
rpln(aarr(norm(tmul(vvc,tt))))
# ({(cloud_and_wind, light)}, 0.5555555555555556)
# ({(cloud_and_wind, none)}, 0.1111111111111111)
# ({(cloud_and_wind, strong)}, 0.3333333333333333)
The cartesian derived entropy, $\mathrm{entropy}(V^{\mathrm{C}} * T)$, is
ent(tmul(vvc,tt))
# 0.9368883075390159
The component size cardinality cross entropy, $\mathrm{entropyCross}(A * T,V^{\mathrm{C}} * T)$, is
crent = histogramsHistogramsEntropyCross
crent(tmul(aa,tt),tmul(vvc,tt))
# 1.0118393721421373
crent(tmul(aa,tt),tmul(vvc,tt)) >= ent(tmul(aa,tt))
# True
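Again this can be reproduced from the normalised counts listed above, weighting the sample probabilities by the logarithms of the cartesian probabilities,
hc = -(0.6*log(45/81) + 0.2*log(9/81) + 0.2*log(27/81))
abs(hc - crent(tmul(aa,tt),tmul(vvc,tt))) < 1e-9
# True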
The component cardinality size cross entropy, $\mathrm{entropyCross}(V^{\mathrm{C}} * T,A * T)$, is
crent(tmul(vvc,tt),tmul(aa,tt))
# 0.9990977520629283
crent(tmul(vvc,tt),tmul(aa,tt)) >= ent(tmul(vvc,tt))
# True
The sum of the derived and cartesian derived, $A * T + V^{\mathrm{C}} * T$, is
rpln(aall(add(tmul(aa,tt),tmul(vvc,tt))))
# ({(cloud_and_wind, light)}, 57 % 1)
# ({(cloud_and_wind, none)}, 13 % 1)
# ({(cloud_and_wind, strong)}, 31 % 1)
rpln(aarr(norm(add(tmul(aa,tt),tmul(vvc,tt)))))
# ({(cloud_and_wind, light)}, 0.5643564356435643)
# ({(cloud_and_wind, none)}, 0.12871287128712872)
# ({(cloud_and_wind, strong)}, 0.3069306930693069)
The component size cardinality sum cross entropy, $\mathrm{entropy}(A * T + V^{\mathrm{C}} * T)$, is
ent(add(tmul(aa,tt),tmul(vvc,tt)))
# 0.9492604450332509
ent(add(tmul(aa,tt),tmul(vvc,tt))) <= log(w)
# True
The component size cardinality relative entropy, $\mathrm{entropyRelative}(A * T,V^{\mathrm{C}} * T)$, is the component size cardinality cross entropy minus the component size entropy, $\mathrm{entropyCross}(A * T,V^{\mathrm{C}} * T)~-~\mathrm{entropy}(A * T)$,
crent(tmul(aa,tt),tmul(vvc,tt)) - ent(tmul(aa,tt))
# 0.06156883290890258
The component cardinality size relative entropy, $\mathrm{entropyRelative}(V^{\mathrm{C}} * T,A * T)$, is the component cardinality size cross entropy minus the component cardinality entropy, $\mathrm{entropyCross}(V^{\mathrm{C}} * T,A * T)~-~\mathrm{entropy}(V^{\mathrm{C}} * T)$,
crent(tmul(vvc,tt),tmul(aa,tt)) - ent(tmul(vvc,tt))
# 0.062209444523912416
The size-volume scaled component size cardinality sum relative entropy is the size-volume scaled component size cardinality sum cross entropy minus the size-volume scaled component size cardinality sum entropy, \[ \begin{eqnarray} (z+v) \times \mathrm{entropy}(A * T + V^{\mathrm{C}} * T) - z \times \mathrm{entropy}(A * T) - v \times \mathrm{entropy}(V^{\mathrm{C}} * T) \end{eqnarray} \]
z = size(aa)
v = vol(uu,vv)
(z+v) * ent(add(tmul(aa,tt),tmul(vvc,tt))) - z * ent(tmul(aa,tt)) - v * ent(tmul(vvc,tt))
# 0.9819412530333693
(z+v) * log(w)
# 110.95984115547908
Define the abbreviation rent for the size-volume scaled component size cardinality sum relative entropy,
def rent(aa,bb):
    a = size(aa)
    b = size(bb)
    return (a+b) * ent(add(aa,bb)) - a * ent(aa) - b * ent(bb)
rent(tmul(aa,tt),tmul(vvc,tt))
# 0.9819412530333693
It was shown that the alignment between cloud_and_wind and rain is greater than the alignments between any of cloud, wind or pressure and rain,
algn(red(aa,[pressure,rain]))
# 4.278766678519384
algn(red(aa,[cloud,rain]))
# 6.4150379630063465
algn(red(aa,[wind,rain]))
# 3.930131313218345
algn(red(mul(aa,ttaa(tt)),[cloud_and_wind,rain]))
# 6.743705969634357
Define the abbreviation tlalgn for the alignment of the derived variables and the label variables,
def ared(aa,vv):
    return setVarsHistogramsReduce(vv,aa)
def tlalgn(tt,aa,ll):
    return algn(ared(mul(aa,ttaa(tt)),der(tt)|sset(ll)))
tlalgn(tt,aa,[rain])
# 6.743705969634357
The alignments are all zero for a cartesian sample,
algn(vvc)
# 0.0
algn(tmul(vvc,tt))
# 0.0
and for the independent and formal,
algn(ind(aa))
# 0.0
algn(tmul(ind(aa),tt))
# 0.0
In the case of medium pressure, heavy cloud and light winds, the forecast for rain is heavy,
qq1 = hhaa(llhh([pressure,cloud,wind],[(1,[medium,heavy,light])]))
rpln(aarr(query(qq1,tt,aa,[rain])))
# ({(rain, heavy)}, 1.0)
So the entropy for this query is zero,
ent(query(qq1,tt,aa,[rain]))
# -0.0
Compare this to the cartesian where all outcomes are equally probable,
rpln(aarr(query(qq1,tt,vvc,[rain])))
# ({(rain, heavy)}, 0.3333333333333333)
# ({(rain, light)}, 0.3333333333333333)
# ({(rain, none)}, 0.3333333333333333)
ent(query(qq1,tt,vvc,[rain]))
# 1.0986122886681096
For some queries the model is ambiguous. For example, when the pressure is low, but there is no cloud and winds are light, the forecast is usually for light rain, but not always,
qq2 = hhaa(llhh([pressure,cloud,wind],[(1,[low,none,light])]))
rpln(aarr(query(qq2,tt,aa,[rain])))
# ({(rain, heavy)}, 0.16666666666666666)
# ({(rain, light)}, 0.5833333333333334)
# ({(rain, none)}, 0.25)
In this case the entropy is higher,
ent(query(qq2,tt,aa,[rain]))
# 0.9596147939120492
but still lower than for the cartesian,
ent(query(qq2,tt,vvc,[rain]))
# 1.0986122886681096
If the normalised histogram, $\hat{A} \in \mathcal{A} \cap \mathcal{P}$, is treated as a probability function of a single-state query, the scaled label entropy is \[ \begin{eqnarray} \sum_{(R,C) \in T^{-1}} (A * T)_R \times \mathrm{entropy}(A * C~\%~(V \setminus K)) \end{eqnarray} \]
def tlent(tt,aa,ll):
    return setVarsTransformsHistogramsEntropyLabel(vars(aa)-sset(ll),tt,aa)
tlent(tt,aa,[rain])
# 11.51537752694459
An idea of the scale of the label entropy can be obtained from the cartesian,
z/v * tlent(tt,vvc,[rain])
# 21.97224577336219
This is similar to the definition of the scaled expected component entropy, $z \times \mathrm{entropyComponent}(A,T)$,
z * cent(tt,aa)
# 32.06822037593124
z * cent(tt,vvc)
# 69.15121694266844
The label entropy may be contrasted with the alignment between the derived variables, $W$, and the label variables, $V \setminus K$, \[ \mathrm{algn}(A * \mathrm{his}(T)~\%~(W \cup V \setminus K)) \]
algn(red(mul(aa,ttaa(tt)),[cloud_and_wind,rain]))
# 6.743705969634357
or
tlalgn(tt,aa,[rain])
# 6.743705969634357
This may be compared to the diagonalised for an idea of scale,
algn(resize(size(aa),regdiag(3,2)))
# 15.413144235093519
The label entropy may also be compared to the slice entropy, which is the sum of the sized entropies of the contingent slices reduced to the label variables, $V \setminus K$, \[ \sum_{R \in (A\%K)^{\mathrm{FS}}} (A\%K)_R \times \mathrm{entropy}(A * \{R\}^{\mathrm{U}}~\%~(V \setminus K)) \]
def lent(aa,ll):
    return size(aa) * (ent(aa) - ent(ared(aa,vars(aa)-sset(ll))))
lent(aa,[rain])
# 1.3862943611198908
z/v * lent(vvc,[rain])
# 21.972245773362047
That is, the model label entropy is much higher than the sample label entropy, but model queries may be applied to ineffective sample states.
Now let us compare the entropy properties of several models. First redefine the cloud_and_wind model as $T_{\mathrm{cw}}$,
ttcw = tt
Now consider a model $T_{\mathrm{c}}$ which consists of a literal reframe of the cloud variable,
cloud2 = VarStr("cloud2")
ttc = lltt([cloud],[cloud2],[
[none, none],
[light, light],
[heavy, heavy]])
rpln(aarr(norm(tmul(aa,ttc))))
# ({(cloud2, heavy)}, 0.2)
# ({(cloud2, light)}, 0.35)
# ({(cloud2, none)}, 0.45)
ent(tmul(aa,ttc))
# 1.0486537893593546
So the simpler model, $T_{\mathrm{c}}$, has higher derived entropy than $T_{\mathrm{cw}}$.
Consider the relative entropy,
rent(tmul(aa,ttc),tmul(vvc,ttc))
# 0.8099580712542576
Now consider the alignment between the derived variable and the label variable,
tlalgn(ttc,aa,[rain])
# 6.4150379630063465
algn(red(aa,[cloud,rain]))
# 6.4150379630063465
So the simpler model, $T_{\mathrm{c}}$, has both lower relative entropy and lower label alignment than $T_{\mathrm{cw}}$.
Now consider queries on the model,
qq1 = hhaa(llhh([pressure,cloud,wind],[(1,[medium,heavy,light])]))
rpln(aarr(query(qq1,ttc,aa,[rain])))
# ({(rain, heavy)}, 1.0)
qq2 = hhaa(llhh([pressure,cloud,wind],[(1,[low,none,light])]))
rpln(aarr(query(qq2,ttc,aa,[rain])))
# ({(rain, light)}, 0.3333333333333333)
# ({(rain, none)}, 0.6666666666666666)
tlent(ttc,aa,[rain])
# 12.418526752441055
So the simpler model, $T_{\mathrm{c}}$, has higher label entropy than $T_{\mathrm{cw}}$. In short, the simpler model, $T_{\mathrm{c}}$, is generally a worse predictor of label than $T_{\mathrm{cw}}$.
Consider whether a better predictor of the rain can be made by constructing a transform $T_{\mathrm{cp}}$ that relates cloud and pressure,
algn(red(aa,[pressure,cloud]))
# 4.6232784937782885
cloud_and_pressure = VarStr("cloud_and_pressure")
ttcp = lltt([cloud,pressure],[cloud_and_pressure],[
[none, high, none],
[none, medium, light],
[none, low, light],
[light, high, light],
[light, medium, light],
[light, low, light],
[heavy, high, strong],
[heavy, medium, strong],
[heavy, low, strong]])
rpln(aarr(norm(tmul(aa,ttcp))))
# ({(cloud_and_pressure, light)}, 0.55)
# ({(cloud_and_pressure, none)}, 0.25)
# ({(cloud_and_pressure, strong)}, 0.2)
ent(tmul(aa,ttcp))
# 0.9972715231823841
So the new model, $T_{\mathrm{cp}}$, has higher derived entropy than $T_{\mathrm{cw}}$, but not as high as $T_{\mathrm{c}}$.
Consider the relative entropy,
rent(tmul(aa,ttcp),tmul(vvc,ttcp))
# 1.4736881918377236
Now consider the alignment between the derived variable and the label variable,
tlalgn(ttcp,aa,[rain])
# 8.020893995655356
So the new model, $T_{\mathrm{cp}}$, has both higher relative entropy and higher label alignment than $T_{\mathrm{cw}}$, even though its derived entropy is also higher.
Now consider queries on the model,
rpln(aarr(query(qq1,ttcp,aa,[rain])))
# ({(rain, heavy)}, 1.0)
rpln(aarr(query(qq2,ttcp,aa,[rain])))
# ({(rain, heavy)}, 0.18181818181818182)
# ({(rain, light)}, 0.6363636363636364)
# ({(rain, none)}, 0.18181818181818182)
tlent(ttcp,aa,[rain])
# 9.982888235155102
So the new model, $T_{\mathrm{cp}}$, has lower label entropy than $T_{\mathrm{cw}}$. In short, the new model, $T_{\mathrm{cp}}$, is generally a better predictor of label than $T_{\mathrm{cw}}$.
To summarise,
[ent(tmul(aa,tt)) for tt in [ttc, ttcw, ttcp]]
# [1.0486537893593546, 0.9502705392332347, 0.9972715231823841]
[cent(tt,aa) for tt in [ttc, ttcw, ttcp]]
# [1.5050277686704419, 1.603411018796562, 1.5564100348474128]
[rent(tmul(aa,tt),tmul(vvc,tt)) for tt in [ttc, ttcw, ttcp]]
# [0.8099580712542576, 0.9819412530333693, 1.4736881918377236]
[tlalgn(tt,aa,[rain]) for tt in [ttc, ttcw, ttcp]]
# [6.4150379630063465, 6.743705969634357, 8.020893995655356]
[tlent(tt,aa,[rain]) for tt in [ttc, ttcw, ttcp]]
# [12.418526752441055, 11.51537752694459, 9.982888235155102]
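For readability, the same comparison can be tabulated with the abbreviations defined above; the rounded figures are those of the lists just shown,
for name, t in [("ttc",ttc), ("ttcw",ttcw), ("ttcp",ttcp)]:
    print(name,
          round(ent(tmul(aa,t)),4),               # derived entropy
          round(cent(t,aa),4),                    # expected component entropy
          round(rent(tmul(aa,t),tmul(vvc,t)),4),  # sum relative entropy
          round(tlalgn(t,aa,[rain]),4),           # label alignment
          round(tlent(t,aa,[rain]),4))            # scaled label entropy
# ttc 1.0487 1.505 0.81 6.415 12.4185
# ttcw 0.9503 1.6034 0.9819 6.7437 11.5154
# ttcp 0.9973 1.5564 1.4737 8.0209 9.9829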
The weather forecast example continues in Functional definition sets.