AMES - House Prices

Sections

Introduction

Properties of the sample

Predicting sale price without modelling

Induced modelling of sale price

Introduction

The Ames Housing dataset describes the sale of individual residential property in Ames, Iowa from 2006 to 2010. It was compiled by Dean De Cock for use in data science education. Full details of the dataset are in Kaggle Data Set - House Prices: Advanced Regression Techniques.

The dataset contains 1460 events of 80 variables including SalePrice. There is also a test dataset containing 1459 events of 79 variables excluding SalePrice.

A brief description of each variable can be found in the data description file that accompanies the dataset.

We shall analyse this dataset using the AMESPy repository, which depends on the AlignmentRepaPy repository. The AlignmentRepaPy repository is a fast Python implementation of some of the practicable inducers described in the paper. The code in this section can be executed by copying and pasting it into a Python interpreter (see README). Also see the Introduction in Notation.

Properties of the sample

First load the training sample $A_{\mathrm{tr}}$ and the test sample $A_{\mathrm{te}}$,

from AMESDev import *

(uu,aatr,aate) = amesIO()
vv = uvars(uu) - sset([VarStr("Id")])
vvl = sset([VarStr("SalePrice")])
vvk = vv - vvl

size(aatr)
# 1460 % 1

size(aate)
# 1459 % 1

len(vv)
80

The system is $U$. The sample substrate variables are $V = \mathrm{vars}(A_{\mathrm{tr}}) \setminus \{\mathrm{Id}\}$, the label variables are $V_{\mathrm{l}} = \{\mathrm{SalePrice}\}$, and the query variables form the remainder, $V_{\mathrm{k}} = V \setminus V_{\mathrm{l}}$.

Now create a joint sample on the query variables $A = A_{\mathrm{tr}}\%V_{\mathrm{k}} + A_{\mathrm{te}}\%V_{\mathrm{k}}$,

aa = add(red(aatr,vvk),red(aate,vvk))

size(aa)
# 2919 % 1

So $\mathrm{vars}(A) = V_{\mathrm{k}}$.
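The reduction and addition used to build the joint sample can be pictured with plain Python counters. The following is only a sketch and not the AlignmentRepaPy implementation: it assumes a histogram can be modelled as a `Counter` keyed by frozensets of (variable, value) pairs, and the helper names `reduce_hist` and `add_hist` and the toy records are hypothetical.

```python
from collections import Counter

def reduce_hist(hist, variables):
    """Project each state onto `variables`, summing the counts of states that coincide."""
    out = Counter()
    for state, count in hist.items():
        out[frozenset((v, x) for (v, x) in state if v in variables)] += count
    return out

def add_hist(a, b):
    """Add two histograms state by state."""
    return a + b

# Toy training and test samples (hypothetical values, not taken from the dataset).
train = Counter({
    frozenset({("Street", "Pave"), ("CentralAir", "Y"), ("SalePrice", 200000)}): 1,
    frozenset({("Street", "Pave"), ("CentralAir", "N"), ("SalePrice", 120000)}): 1,
})
test = Counter({
    frozenset({("Street", "Pave"), ("CentralAir", "Y")}): 1,
})

query_vars = {"Street", "CentralAir"}
joint = add_hist(reduce_hist(train, query_vars), reduce_hist(test, query_vars))
joint_size = sum(joint.values())
```

In this toy case the joint sample has three events, two of which share the same query state once the label variable has been reduced away.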

The variable valencies are $\{(w,|U_w|) : w \in V\}$,

rpln(sset([(vol(uu,sset([w])),w) for w in vv]))
# (2, CentralAir)
# (2, Street)
# (3, Alley)
# ...
# (10, OverallQual)
# (10, SaleType)
# (12, MoSold)
# (14, PoolArea)
# (14, TotRmsAbvGrd)
# (16, Exterior1st)
# (16, MSSubClass)
# (17, Exterior2nd)
# (25, Neighborhood)
# (31, 3SsnPorch)
# (36, LowQualFinSF)
# (38, MiscVal)
# (61, YearRemodAdd)
# (104, GarageYrBlt)
# (118, YearBuilt)
# (121, ScreenPorch)
# (129, LotFrontage)
# (183, EnclosedPorch)
# (252, OpenPorchSF)
# (273, BsmtFinSF2)
# (379, WoodDeckSF)
# (445, MasVnrArea)
# (604, GarageArea)
# (635, 2ndFlrSF)
# (663, SalePrice)
# (992, BsmtFinSF1)
# (1059, TotalBsmtSF)
# (1083, 1stFlrSF)
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)

In order to construct tuples with more than one variable, the valencies of some of the variables with ordered values can be reframed into buckets. Module AMESDev has a function isOrd that determines which variables can be bucketed,

rpln(sset([(vol(uu,sset([w])),w) for w in vv if isOrd(uu,w)]))
# (3, HalfBath)
# (4, BsmtHalfBath)
# ...
# (14, PoolArea)
# (14, TotRmsAbvGrd)
# (16, MSSubClass)
# (31, 3SsnPorch)
# ...
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)

rpln(sset([(u,w) for w in vv for u in [vol(uu,sset([w]))] if isOrd(uu,w) if u > 16]))
# (31, 3SsnPorch)
# (36, LowQualFinSF)
# (38, MiscVal)
# (61, YearRemodAdd)
# (104, GarageYrBlt)
# (118, YearBuilt)
# (121, ScreenPorch)
# (129, LotFrontage)
# (183, EnclosedPorch)
# (252, OpenPorchSF)
# (273, BsmtFinSF2)
# (379, WoodDeckSF)
# (445, MasVnrArea)
# (604, GarageArea)
# (635, 2ndFlrSF)
# (663, SalePrice)
# (992, BsmtFinSF1)
# (1059, TotalBsmtSF)
# (1083, 1stFlrSF)
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)

vvo = sset([w for w in vv for u in [vol(uu,sset([w]))] if isOrd(uu,w) if u > 16])

rpln(aall(red(aa,sset([VarStr("3SsnPorch")]))))
# ({(3SsnPorch, 0)}, 2882 % 1)
# ({(3SsnPorch, 23)}, 1 % 1)
# ({(3SsnPorch, 86)}, 1 % 1)
# ...
# ({(3SsnPorch, 360)}, 1 % 1)
# ({(3SsnPorch, 407)}, 1 % 1)
# ({(3SsnPorch, 508)}, 1 % 1)

rpln(aall(red(aa,sset([VarStr("LotArea")]))))
# ({(LotArea, 1300)}, 1 % 1)
# ({(LotArea, 1470)}, 1 % 1)
# ({(LotArea, 1476)}, 1 % 1)
# ...
# ({(LotArea, 159000)}, 1 % 1)
# ({(LotArea, 164660)}, 1 % 1)
# ({(LotArea, 215245)}, 1 % 1)

Let us determine which variables treat ValStr "null" as a special case,

rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValStr("null"))])]))] for bb in [mul(red(aa,sset([w])),rr)] if size(bb) > 0]))
# (1 % 1, BsmtFinSF1)
# (1 % 1, BsmtFinSF2)
# (1 % 1, BsmtUnfSF)
# (1 % 1, GarageArea)
# (1 % 1, TotalBsmtSF)
# (23 % 1, MasVnrArea)
# (159 % 1, GarageYrBlt)
# (486 % 1, LotFrontage)

rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValStr("null"))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 0]))
# (8 % 1, MasVnrArea)
# (81 % 1, GarageYrBlt)
# (259 % 1, LotFrontage)

Let us determine which variables treat ValInt 0 as a special case,

rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aa,sset([w])),rr)] if size(bb) > 200]))
# (241 % 1, BsmtUnfSF)
# (929 % 1, BsmtFinSF1)
# (1298 % 1, OpenPorchSF)
# (1523 % 1, WoodDeckSF)
# (1668 % 1, 2ndFlrSF)
# (1738 % 1, MasVnrArea)
# (2460 % 1, EnclosedPorch)
# (2571 % 1, BsmtFinSF2)
# (2663 % 1, ScreenPorch)
# (2816 % 1, MiscVal)
# (2879 % 1, LowQualFinSF)
# (2882 % 1, 3SsnPorch)

rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 100]))
# (118 % 1, BsmtUnfSF)
# (467 % 1, BsmtFinSF1)
# (656 % 1, OpenPorchSF)
# (761 % 1, WoodDeckSF)
# (829 % 1, 2ndFlrSF)
# (861 % 1, MasVnrArea)
# (1252 % 1, EnclosedPorch)
# (1293 % 1, BsmtFinSF2)
# (1344 % 1, ScreenPorch)
# (1408 % 1, MiscVal)
# (1434 % 1, LowQualFinSF)
# (1436 % 1, 3SsnPorch)

vvoz = sset([w for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 100])

len(vvo)
22

len(vvoz)
12

There are 22 orderable variables, of which 12 treat ValInt 0 as a special case.
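The search for special values above amounts to counting, for each variable, how many events carry a given value. A minimal sketch in plain Python follows. It assumes the sample is held as a dictionary of columns, which is not how AMESDev stores it, and the function `special_value_counts` and the column names are hypothetical.

```python
from collections import Counter

def special_value_counts(columns, special):
    """Return {variable: count} for the variables in which `special` occurs at least once."""
    counts = {}
    for name, values in columns.items():
        n = Counter(values)[special]
        if n > 0:
            counts[name] = n
    return counts

# Toy columns (hypothetical names and values).
columns = {
    "PorchArea": [0, 0, 0, 120, 0, 45],
    "LivingArea": [1200, 950, 1710, 1500, 2100, 880],
    "Frontage": ["null", 65, 80, "null", 70, 60],
}
zeros = special_value_counts(columns, 0)
nulls = special_value_counts(columns, "null")
```

A variable whose zero count is a large fraction of the sample, such as the toy `PorchArea`, is a candidate for keeping zero as its own value and bucketing only the non-zero remainder.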

Now let us reframe to valencies of 20,

xx = sdict()

for v in vvk & (vvo - vvoz):
	xx[v] = (VarStr(str(v)+"B"),bucket(20,aa,v))

xx[VarStr("SalePrice")] = (VarStr("SalePrice"+"B"),bucket(20,aatr,VarStr("SalePrice")))

for v in vvk & vvoz:
	rr = unit(sset([llss([(v,ValInt(0))])]))
	bb = mul(red(aa,sset([v])),rr)
	aa1 = trim(sub(red(aa,sset([v])),bb))
	xx[v] = (VarStr(str(v)+"B"),bucket(20,aa1,v))

aab = reframeb(aa,xx)

aatrb = reframeb(aatr,xx)

aateb = reframeb(aate,xx)

uub = uunion(sys(aab),uunion(sys(aatrb),sys(aateb)))
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl

rpln(sset([(vol(uub,sset([w])),w) for w in vvb]))
# (2, CentralAir)
# (2, Street)
# (3, Alley)
# (3, HalfBath)
# (3, LandSlope)
# (3, PavedDrive)
# (3, Utilities)
# (4, BsmtHalfBath)
# ...
# (14, TotRmsAbvGrd)
# (16, Exterior1st)
# (16, MSSubClass)
# (17, Exterior2nd)
# (18, LotFrontageB)
# (19, YearRemodAddB)
# ...
# (23, ScreenPorchB)
# (25, Neighborhood)
# (31, 3SsnPorchB)

rpln([(ss,q) for w in vvbk for (ss,q) in aall(red(aab,sset([w])))])

rpln([(ss,q) for w in vvbk for (ss,q) in aall(red(aatrb,sset([w])))])

The bucketed system is $U_{\mathrm{b}}$. The bucketed joint sample is $A_{\mathrm{b}}$, the bucketed training sample is $A_{\mathrm{trb}}$ and the bucketed test sample is $A_{\mathrm{teb}}$. The bucketed sample substrate variables are $V_{\mathrm{b}}$, the bucketed label variables are $V_{\mathrm{bl}} = \{\mathrm{SalePriceB}\}$, and the bucketed query variables are $V_{\mathrm{bk}}$.
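The idea of reframing an ordered variable into a limited number of buckets can be sketched as follows. This is an illustration only: it assumes equal-frequency bucketing with each bucket named by its upper bound, which may differ from what `bucket` and `reframeb` in AMESDev actually do, and the names `bucket_bounds` and `to_bucket` are hypothetical. The input list must not be empty.

```python
def bucket_bounds(values, n):
    """Return at most n ascending upper bounds so that buckets hold roughly equal counts."""
    ordered = sorted(values)
    size = len(ordered)
    bounds = []
    for i in range(1, n + 1):
        upper = ordered[(i * size - 1) // n]
        # Skip repeated bounds, which arise when many events share a value.
        if not bounds or upper > bounds[-1]:
            bounds.append(upper)
    return bounds

def to_bucket(value, bounds):
    """Map a value to the smallest upper bound that is >= value (or to the last bound)."""
    for upper in bounds:
        if value <= upper:
            return upper
    return bounds[-1]

# Plain bucketing of an ordered variable into four buckets.
quartiles = bucket_bounds(list(range(1, 101)), 4)

# Zero treated as a special case: bucket only the non-zero values,
# then keep zero as a bucket of its own.
porch = [0, 0, 0, 0, 10, 20, 30, 40]
porch_bounds = [0] + bucket_bounds([v for v in porch if v != 0], 2)
porch_bucketed = [to_bucket(v, porch_bounds) for v in porch]
```

Because repeated bounds are dropped and a special value adds a bucket of its own, the resulting valency need not equal the requested number of buckets exactly.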

For convenience, the bucketing is encapsulated in amesBucketedIO in AMESDev,

from AMESDev import *

(uub,aab,aatrb,aateb) = amesBucketedIO(20)
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl

The geometric mean bucketed valency of the substrate, $|V_{\mathrm{b}}^{\mathrm{C}}|^{1/|V_{\mathrm{b}}|}$, is,

exp(log(vol(uub,vvb))/len(vvb))
8.421852632661576

The label variable dimension, $|V_{\mathrm{bl}}|$, is,

len(vvbl)
1

The label bucketed variable volume, $|V_{\mathrm{bl}}^{\mathrm{C}}|$, is,

vol(uub,vvbl)
20

The query variable dimension, $|V_{\mathrm{bk}}|$, is,

len(vvbk)
79

The geometric mean query bucketed valency, $|V_{\mathrm{bk}}^{\mathrm{C}}|^{1/|V_{\mathrm{bk}}|}$, is,

exp(log(vol(uub,vvbk))/len(vvbk))
8.330151968320083
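The geometric mean valency is the volume, i.e. the product of the valencies, raised to the power of one over the number of variables. A minimal sketch in plain Python, with hypothetical function names and toy valencies rather than the system `uub`:

```python
from math import exp, log, prod

def volume(valencies):
    """Volume of a set of variables: the product of their valencies."""
    return prod(valencies)

def geometric_mean_valency(valencies):
    """Geometric mean of the valencies, computed in log space to avoid huge products."""
    return exp(sum(log(u) for u in valencies) / len(valencies))

toy = [2, 8]
toy_volume = volume(toy)               # 2 * 8
toy_mean = geometric_mean_valency(toy)  # square root of 16
```

Working in log space is the same approach as `exp(log(vol(...))/len(...))` in the interpreter session, but it sums the logarithms instead of forming the product first.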

The bucketed sample size, $\mathrm{size}(A_{\mathrm{b}})$, is

size(aab)
# 2919 % 1

Nearly all effective states correspond to exactly one event,

size(eff(aab))
# 2916 % 1

The bucketed training sample size, $\mathrm{size}(A_{\mathrm{trb}})$, is

size(aatrb)
# 1460 % 1

All bucketed effective states correspond to exactly one event, $A_{\mathrm{trb}} = A_{\mathrm{trb}}^{\mathrm{F}}$,

size(eff(aatrb))
# 1460 % 1
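The comparison of the sample size with the size of the effective histogram can be illustrated with a toy histogram. This sketch assumes that the effective histogram replaces every positive count by one, so that its size is the number of distinct states. The `Counter` representation and the name `effective` are hypothetical and are not the library's `eff`.

```python
from collections import Counter

def effective(hist):
    """Replace every positive count by 1, keeping only the states that occur."""
    return Counter({state: 1 for state, count in hist.items() if count > 0})

# Toy histogram: four events over three distinct states.
hist = Counter({("a", 1): 2, ("a", 2): 1, ("b", 1): 1})
size = sum(hist.values())                      # number of events
n_effective = sum(effective(hist).values())    # number of distinct states
```

When the two numbers are equal, as for the bucketed training sample, every state occurs exactly once. When the effective size is slightly smaller, as for the joint sample, a few states are shared by more than one event.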

Now consider how highly aligned variables might be grouped together. See Entropy and alignment. First consider pairs in the substrate, $V_{\mathrm{b}}$, \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%\{w,x\}),~w,~x) : w,x \in V_{\mathrm{b}},~w < x\} \]

rpln(reversed(list(sset([(algn(red(aatrb,sset([w,x]))),w,x) for w in vvb for x in vvb if w < x]))))
# (2465.5152987646425, GarageYrBltB, YearBuiltB)
# (2152.5485832484987, Exterior1st, Exterior2nd)
# (1978.2349802971114, YearBuiltB, YearRemodAddB)
# (1858.3580587963286, 1stFlrSFB, TotalBsmtSFB)
# (1724.7876321508177, GarageYrBltB, YearRemodAddB)
# (1599.6055353637776, HouseStyle, MSSubClass)
# (1568.0764166986735, 1stFlrSFB, GrLivAreaB)
# (1324.8418737030024, Neighborhood, YearBuiltB)
# (1142.4421276400722, GarageAreaB, GarageCars)
# (1058.2972140759603, GarageYrBltB, Neighborhood)
# (1016.1760916141229, 2ndFlrSFB, MSSubClass)
# (1002.6330777071075, BsmtFinSF1B, BsmtFinType1)
# (1001.7919977485235, 2ndFlrSFB, HouseStyle)
# (999.2949685620983, GrLivAreaB, TotalBsmtSFB)
# (997.9601406145066, FireplaceQu, Fireplaces)
# (959.3978900891489, MasVnrAreaB, MasVnrType)
# (856.1859130216717, Foundation, YearBuiltB)
# (837.2461935016695, MSSubClass, YearBuiltB)
# (835.5545116214507, MSSubClass, Neighborhood)
# (819.6253930916721, Neighborhood, YearRemodAddB)
# ...

We can see that some of the variables that are in highly aligned pairs are also in other highly aligned pairs, e.g. YearBuiltB or Neighborhood. This suggests that we should also consider tuple dimensions greater than two.
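To make the pairwise alignments more concrete, here is a minimal sketch for a pair of variables. It assumes that alignment is the sum of the log-gamma of each count plus one in the sample histogram, minus the same sum over its independent histogram (the product of the marginals divided by the size), as described in Entropy and alignment. The function name `alignment` and the toy data are hypothetical, and the library's `algn` should be treated as the reference.

```python
from collections import Counter
from math import lgamma

def alignment(pairs):
    """Alignment of a list of (w, x) value pairs under the assumed definition."""
    z = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    actual = sum(lgamma(c + 1) for c in joint.values())
    # Independent histogram: product of the marginal counts scaled back to size z.
    independent = sum(lgamma(left[a] * right[b] / z + 1) for a in left for b in right)
    return actual - independent

# Two toy samples of size 8 over two binary variables.
independent_pairs = [(0, 0)] * 2 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 2
diagonal_pairs = [(0, 0)] * 4 + [(1, 1)] * 4

a_independent = alignment(independent_pairs)
a_diagonal = alignment(diagonal_pairs)
```

Under this definition a sample that equals its independent has zero alignment, while a sample in which one variable determines the other has a positive alignment.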

Now consider using the tupler to group together highly aligned variables in the substrate, $V_{\mathrm{b}}$. Note that for performance reasons we must first construct a HistoryRepa from the sample histogram, $A_{\mathrm{trb}}$. See History and HistoryRepa.

First consider the tuple dimension by choosing a volume limit, xmax,

8.330151968320083 ** 3
578.041172320828

8.421852632661576 ** 3
597.341809663622

25*31
775

2*2*3*3*3*3*3
972

size(aatrb)
# 1460 % 1

size(aab)
# 2919 % 1

2*2*3*3*3*3*3*4
3888

8.330151968320083 ** 4
4815.170809378394

8.421852632661576 ** 4
5030.724692314405
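The arithmetic above asks how many variables of average valency fit in a tuple before its volume exceeds a chosen limit. A small sketch of that calculation follows, assuming the limit is compared with a power of the geometric mean valency. The function name `max_tuple_dimension` is hypothetical.

```python
def max_tuple_dimension(mean_valency, xmax):
    """Largest k such that mean_valency ** k does not exceed the volume limit xmax."""
    k = 0
    while mean_valency ** (k + 1) <= xmax:
        k += 1
    return k

# Geometric mean query valency from above, with the training sample size as the limit.
k_train = max_tuple_dimension(8.330151968320083, 1460)
# A larger limit admits one more variable of average valency.
k_larger = max_tuple_dimension(8.330151968320083, 5000)
```

With a limit equal to the training sample size, tuples of about three average variables are admitted. Tuples made of low valency variables, such as those in the product `2*2*3*3*3*3*3`, can of course be larger.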

Now create a shuffled sample, $A_{\mathrm{trbr}}$,

hhtrb = hrhrred(aahr(uub,aatrb),vvb)

hhtrbr = historyRepasShuffle_u(hhtrb,1)

hrsize(hhtrbr)
1460
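The purpose of the shuffle is to obtain a sample with the same marginal distributions but with the relationships between variables destroyed. A minimal sketch follows, assuming that the shuffle permutes the values of each variable independently across events, which is an assumption about `historyRepasShuffle_u` rather than a description of it. The function name `shuffle_columns` and the toy rows are hypothetical.

```python
import random

def shuffle_columns(rows, seed):
    """Permute each column independently, preserving every column's multiset of values."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(r) for r in zip(*columns)]

# Toy sample in which the second variable is a function of the first.
rows = [(i, i % 3) for i in range(12)]
shuffled = shuffle_columns(rows, 1)
```

The size and the per-variable counts are unchanged, so any difference in alignment between the sample and its shuffle reflects the relationships between the variables rather than their marginals.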

Now optimise the shuffle content alignment with the tuple set builder, $I_{P,U,\mathrm{B,ns,me}}$, \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}},~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]

def buildtuprr(xmax,omax,bmax,uu,vv,xx,xxrr):
    return reversed(list(sset([(algn(rraa(uu,hrred(xx,kk))) - algn(rraa(uu,hrred(xxrr,kk))), kk) for ((kk,_),_) in parametersSystemsBuilderTupleNoSumlayerMultiEffectiveRepa_ui(xmax,omax,bmax,1,uu,vv,fudEmpty(),xx,hrhx(xx),xxrr,hrhx(xxrr))[0]])))

rpln(buildtuprr(1460,10,10,uub,vvb,hhtrb,hhtrbr))
# (2289.22205538067, {GarageYrBltB, YearBuiltB})
# (2287.0736209675033, {GarageYrBltB, Utilities, YearBuiltB})
# (2281.3123335781984, {GarageYrBltB, Street, YearBuiltB})
# (2268.490561832467, {BldgType, CentralAir, HouseStyle, MSSubClass})
# (2266.4793952001714, {GarageYrBltB, PavedDrive, YearBuiltB})
# (2263.6360419083876, {CentralAir, GarageYrBltB, YearBuiltB})
# (2230.8809389216776, {Alley, GarageYrBltB, YearBuiltB})
# (2219.1688869176696, {BldgType, HouseStyle, MSSubClass})
# (2217.2589976365557, {BldgType, HouseStyle, MSSubClass, Street})
# (2204.396018461853, {GarageYrBltB, LandSlope, YearBuiltB})

Now optimise again having removed the top tuple from the substrate, \[ Q_1~=~\{\mathrm{GarageYrBltB},~\mathrm{YearBuiltB}\} \] and \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}} \setminus Q_1,~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]

qq1 = sset([VarStr(s) for s in ["GarageYrBltB","YearBuiltB"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1,hhtrb,hhtrbr))
# (2268.490561832467, {BldgType, CentralAir, HouseStyle, MSSubClass})
# (2219.1688869176696, {BldgType, HouseStyle, MSSubClass})
# (2217.2589976365557, {BldgType, HouseStyle, MSSubClass, Street})
# (2188.1493956203217, {ExterQual, Exterior1st, Exterior2nd})
# (2173.3056986406436, {Exterior1st, Exterior2nd, HeatingQC})
# (2168.0794578894715, {BsmtQual, Exterior1st, Exterior2nd})
# (2148.7255972894404, {CentralAir, Exterior1st, Exterior2nd})
# (2142.8877269264362, {CentralAir, Exterior1st, Exterior2nd, Street})
# (2132.5055547558827, {Exterior1st, Exterior2nd, FullBath})
# (2125.3962435527606, {Exterior1st, Exterior2nd, PoolQC})

Now optimise again having removed the top two tuples from the substrate, \[ Q_2~=~\{\mathrm{BldgType},~\mathrm{CentralAir},~\mathrm{HouseStyle},~\mathrm{MSSubClass},~\mathrm{Street}\} \] and \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}} \setminus Q_1 \setminus Q_2,~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]

qq2 = sset([VarStr(s) for s in ["BldgType","CentralAir","HouseStyle","MSSubClass","Street"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2,hhtrb,hhtrbr))
# (2188.1493956203217, {ExterQual, Exterior1st, Exterior2nd})
# (2173.3056986406436, {Exterior1st, Exterior2nd, HeatingQC})
# (2168.0794578894715, {BsmtQual, Exterior1st, Exterior2nd})
# (2132.5055547558827, {Exterior1st, Exterior2nd, FullBath})
# (2125.3962435527606, {Exterior1st, Exterior2nd, PoolQC})
# (2123.4159165046276, {Exterior1st, Exterior2nd})
# (2121.5205221457086, {Exterior1st, Exterior2nd, Utilities})
# (2118.647962265917, {Exterior1st, Exterior2nd, GarageFinish})
# (2111.4491188041366, {Exterior1st, Exterior2nd, KitchenQual})
# (2110.495612404761, {Exterior1st, Exterior2nd, KitchenAbvGr})

Then continue in the same manner,

qq3 = sset([VarStr(s) for s in ["ExterQual","Exterior1st","Exterior2nd"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3,hhtrb,hhtrbr))
# (1694.5493452101775, {1stFlrSFB, TotalBsmtSFB, Utilities})
# (1693.1630508490578, {1stFlrSFB, TotalBsmtSFB})
# (1620.8665578986738, {1stFlrSFB, LandSlope, TotalBsmtSFB})
# (1617.1756900313953, {1stFlrSFB, PavedDrive, TotalBsmtSFB})
# (1611.3552018153305, {1stFlrSFB, Alley, TotalBsmtSFB})
# (1491.5366482964555, {1stFlrSFB, HalfBath, TotalBsmtSFB})
# (1473.1977583285745, {1stFlrSFB, GrLivAreaB, HalfBath})
# (1412.9034649265147, {GarageAreaB, GarageCars, GarageFinish})
# (1402.9331016819592, {1stFlrSFB, GrLivAreaB})
# (1401.4667646131659, {1stFlrSFB, GrLivAreaB, Utilities})

qq4 = sset([VarStr(s) for s in ["1stFlrSFB","TotalBsmtSFB","Utilities"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4,hhtrb,hhtrbr))
# (1412.9034649265147, {GarageAreaB, GarageCars, GarageFinish})
# (1331.0976721076222, {GarageAreaB, GarageCars, GarageType})
# (1323.8535812808213, {GarageAreaB, GarageCars, GarageQual})
# (1310.561464900003, {GarageAreaB, GarageCars, GarageCond})
# (1309.2152777779943, {BsmtFinSF1B, BsmtFinType1, BsmtFullBath})
# (1285.4814209909187, {BsmtQual, GarageAreaB, GarageCars})
# (1271.7784622620688, {FullBath, GarageAreaB, GarageCars})
# (1224.471189313343, {Foundation, GarageAreaB, GarageCars})
# (1223.657567001395, {GarageAreaB, GarageCars, KitchenQual})
# (1214.9020691181922, {BsmtQual, Foundation, Neighborhood})

qq5 = sset([VarStr(s) for s in ["GarageAreaB","GarageCars","GarageFinish","GarageType","GarageQual","GarageCond"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5,hhtrb,hhtrbr))
# (1453.4771115474682, {BsmtQual, FireplaceQu, Fireplaces, Foundation})
# (1363.822442537943, {BsmtQual, FireplaceQu, Fireplaces, KitchenQual})
# (1333.0081210204607, {BsmtFinType1, BsmtQual, FireplaceQu, Fireplaces})
# (1319.9957903960371, {BsmtQual, FireplaceQu, Fireplaces, FullBath})
# (1309.2152777779943, {BsmtFinSF1B, BsmtFinType1, BsmtFullBath})
# (1227.7053704690293, {BsmtExposure, BsmtQual, FireplaceQu, Fireplaces})
# (1214.9020691181922, {BsmtQual, Foundation, Neighborhood})
# (1188.545222003212, {BsmtCond, BsmtQual, FireplaceQu, Fireplaces})
# (1187.2707174927878, {BsmtQual, FireplaceQu, Fireplaces, HeatingQC})
# (1186.9910298796954, {BsmtFinSF1B, BsmtFinType1, BsmtQual})

qq6 = sset([VarStr(s) for s in ["BsmtQual","FireplaceQu","Fireplaces","Foundation","KitchenQual","FullBath","BsmtFinType1","BsmtFullBath","BsmtExposure","BsmtCond"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5-qq6,hhtrb,hhtrbr))
# (1021.5872634666134, {MasVnrAreaB, MasVnrType, OverallQual})
# (1008.192002264328, {LandContour, LandSlope, MasVnrAreaB, MasVnrType})
# (985.2892683218874, {HalfBath, KitchenAbvGr, MasVnrAreaB, MasVnrType})
# (975.6594546625283, {MasVnrAreaB, MasVnrType, SaleCondition})
# (974.0412841880125, {MasVnrAreaB, MasVnrType, SaleType})
# (971.2954032113157, {HalfBath, MasVnrAreaB, MasVnrType})
# (967.3208590480517, {HalfBath, MasVnrAreaB, MasVnrType, PoolQC})
# (963.0704190077258, {BsmtHalfBath, HalfBath, MasVnrAreaB, MasVnrType})
# (958.3037608474738, {HalfBath, LandSlope, MasVnrAreaB, MasVnrType})
# (953.7850012379358, {MasVnrAreaB, MasVnrType, RoofStyle})

qq7 = sset([VarStr(s) for s in ["MasVnrAreaB","MasVnrType","OverallQual","LandContour","LandSlope","SaleCondition","BsmtHalfBath","HalfBath","KitchenAbvGr","PoolQC","RoofStyle","SaleType"]])

rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7,hhtrb,hhtrbr))
# (746.2103566256701, {Alley, MSZoning, Neighborhood, PavedDrive})
# (746.0276233365403, {HeatingQC, MSZoning, Neighborhood})
# (732.7649052052839, {GrLivAreaB, TotRmsAbvGrd})
# (731.0542770424454, {Alley, GrLivAreaB, TotRmsAbvGrd})
# (726.4489731521944, {GrLivAreaB, PavedDrive, TotRmsAbvGrd})
# (720.8607196839353, {MSZoning, Neighborhood, OverallCond})
# (680.9994890584201, {Neighborhood, PavedDrive, YearRemodAddB})
# (676.152201640848, {BedroomAbvGr, MSZoning, Neighborhood})
# (663.0878461206503, {Alley, MSZoning, Neighborhood})
# (658.3548208268667, {Neighborhood, YearRemodAddB})

len(vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7)
39

After this selection of 7 tuples there are 39 less closely aligned variables remaining.

That is, there is a possible partition of the substrate as follows, $\bigcup\{Q_1,~Q_2,~Q_3,~Q_4,~Q_5,~Q_6,~Q_7,~V_{\mathrm{b}} \setminus (Q_1 \cup Q_2 \cup Q_3 \cup Q_4 \cup Q_5 \cup Q_6 \cup Q_7)\} = V_{\mathrm{b}}$,

qq1 
# {GarageYrBltB, YearBuiltB}

qq2
# {BldgType, CentralAir, HouseStyle, MSSubClass, Street}

qq3
# {ExterQual, Exterior1st, Exterior2nd}

qq4
# {1stFlrSFB, TotalBsmtSFB, Utilities}

qq5
# {GarageAreaB, GarageCars, GarageCond, GarageFinish, GarageQual, GarageType}

qq6
# {BsmtCond, BsmtExposure, BsmtFinType1, BsmtFullBath, BsmtQual, FireplaceQu, Fireplaces, Foundation, FullBath, KitchenQual}

qq7
# {BsmtHalfBath, HalfBath, KitchenAbvGr, LandContour, LandSlope, MasVnrAreaB, MasVnrType, OverallQual, PoolQC, RoofStyle, SaleCondition, SaleType}

vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7
# {2ndFlrSFB, 3SsnPorchB, Alley, BedroomAbvGr, BsmtFinSF1B, BsmtFinSF2B, BsmtFinType2, BsmtUnfSFB, Condition1, Condition2, Electrical, EnclosedPorchB, ExterCond, Fence, Functional, GrLivAreaB, Heating, HeatingQC, LotAreaB, LotConfig, LotFrontageB, LotShape, LowQualFinSFB, MSZoning, MiscFeature, MiscValB, MoSold, Neighborhood, OpenPorchSFB, OverallCond, PavedDrive, PoolArea, RoofMatl, SalePriceB, ScreenPorchB, TotRmsAbvGrd, WoodDeckSFB, YearRemodAddB, YrSold}
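As a quick consistency check on the grouping, the seven tuples listed above can be tested for overlap using plain Python sets. The variable names are copied from the output above. The helper `pairwise_disjoint` is hypothetical, and the substrate size of 80 is taken from `len(vv)` earlier.

```python
from itertools import combinations

blocks = [
    {"GarageYrBltB", "YearBuiltB"},
    {"BldgType", "CentralAir", "HouseStyle", "MSSubClass", "Street"},
    {"ExterQual", "Exterior1st", "Exterior2nd"},
    {"1stFlrSFB", "TotalBsmtSFB", "Utilities"},
    {"GarageAreaB", "GarageCars", "GarageCond", "GarageFinish", "GarageQual",
     "GarageType"},
    {"BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFullBath", "BsmtQual",
     "FireplaceQu", "Fireplaces", "Foundation", "FullBath", "KitchenQual"},
    {"BsmtHalfBath", "HalfBath", "KitchenAbvGr", "LandContour", "LandSlope",
     "MasVnrAreaB", "MasVnrType", "OverallQual", "PoolQC", "RoofStyle",
     "SaleCondition", "SaleType"},
]

def pairwise_disjoint(sets):
    """True if no two of the sets share an element."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

grouped = set().union(*blocks)
disjoint = pairwise_disjoint(blocks)
remaining = 80 - len(grouped)  # variables left outside the seven tuples
```

The seven tuples are disjoint and cover 41 variables, which agrees with the 39 remaining variables reported above.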

Predicting sale price without modelling

The sample query variables predict sale price. That is, there is a functional or causal relationship between the query variables and the label variables, $(A_{\mathrm{trb}}\%V_{\mathrm{bk}})^{\mathrm{FS}} \to (A_{\mathrm{trb}}\%V_{\mathrm{bl}})^{\mathrm{FS}}$. So the label entropy or query conditional entropy is zero. See Entropy and alignment. The label entropy is \[ \begin{eqnarray} \mathrm{lent}(A,W,L)~:=~\mathrm{entropy}(A~\%~(W \cup L)) - \mathrm{entropy}(A~\%~W) \end{eqnarray} \]

def lent(aa,ww,vvl):
    return ent(red(aa,ww|vvl)) - ent(red(aa,ww))

Then $\mathrm{lent}(A_{\mathrm{trb}},V_{\mathrm{bk}},V_{\mathrm{bl}}) = 0$,

lent(aatrb,vvbk,vvbl)
0.0
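The label entropy can be reproduced on toy data with the standard library alone. The sketch below assumes that `ent` is the entropy of the normalised histogram in natural logarithms, and it represents a sample simply as a list of (query state, label state) pairs. The names `entropy` and `label_entropy` are hypothetical.

```python
from collections import Counter
from math import log

def entropy(items):
    """Entropy, in nats, of the empirical distribution of `items`."""
    counts = Counter(items)
    z = sum(counts.values())
    return -sum((c / z) * log(c / z) for c in counts.values())

def label_entropy(pairs):
    """entropy(query and label) - entropy(query) for a list of (query, label) pairs."""
    return entropy(pairs) - entropy([q for q, _ in pairs])

# The label is a function of the query: label entropy is zero.
functional = [(1, "a"), (1, "a"), (2, "b"), (2, "b")]
# The label is independent of the query: label entropy equals the entropy of the label.
independent = [(1, "a"), (1, "b"), (2, "a"), (2, "b")]

lent_functional = label_entropy(functional)
lent_independent = label_entropy(independent)
```

A label entropy of zero therefore means only that no query state in the sample is seen with two different labels, which is bound to happen when every query state occurs exactly once.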

We can determine which of the query variables has the least conditional entropy, \[ \begin{eqnarray} \{(\mathrm{lent}(A_{\mathrm{trb}},\{w\},V_{\mathrm{bl}}),~w) : w \in V_{\mathrm{bk}}\} \end{eqnarray} \]

rpln(sset([(lent(aatrb,sset([w]),vvbl),w) for w in vvbk]))
# (2.3688094014030585, Neighborhood)
# (2.401775936638896, OverallQual)
# (2.438375883488403, GrLivAreaB)
# (2.505387552265462, GarageAreaB)
# (2.5141550725111537, TotalBsmtSFB)
# (2.5286425658331457, YearBuiltB)
# (2.575379618536724, GarageYrBltB)
# (2.5772799825200803, 1stFlrSFB)
# (2.6072098566399333, GarageCars)
# (2.626737288605211, YearRemodAddB)
# (2.633387515941217, MSSubClass)
# (2.652820702365173, BsmtQual)
# (2.660976881941984, 2ndFlrSFB)
# (2.662266429504159, ExterQual)
# ...
# (2.9763251921097145, LandSlope)
# (2.980341704620559, PoolArea)
# (2.9844981605029184, PoolQC)
# (2.9889645372889797, Street)
# (2.9928053874878207, Utilities)

This may be compared to the entropy of the label variables, $\mathrm{entropy}(A_{\mathrm{trb}}\%V_{\mathrm{bl}})$,

ent(red(aatrb,vvbl))
2.9948072760546887

Utilities has the highest conditional entropy, barely below the entropy of the label itself, and so is the least predictive of sale price. Neighborhood has the least conditional entropy, and so is the most predictive single variable. Its label entropy is $\mathrm{lent}(A_{\mathrm{trb}},\{\mathrm{Neighborhood}\},V_{\mathrm{bl}})$,

vNeighborhood = VarStr("Neighborhood")

lent(aatrb,sset([vNeighborhood]),vvbl)
2.3688094014030585

Let us reduce the sample, $A_{\mathrm{trb}}~\%~(\{\mathrm{Neighborhood}\} \cup V_{\mathrm{bl}})$, to see the relationship,

rpln(aall(red(aatrb,sset([vNeighborhood])|vvbl)))
# ({(Neighborhood, Blmngtn), (SalePriceB, 163000)}, 2 % 1)
# ({(Neighborhood, Blmngtn), (SalePriceB, 172500)}, 2 % 1)
# ({(Neighborhood, Blmngtn), (SalePriceB, 179200)}, 3 % 1)
# ...
# ({(Neighborhood, Veenker), (SalePriceB, 278000)}, 1 % 1)
# ({(Neighborhood, Veenker), (SalePriceB, 326000)}, 2 % 1)
# ({(Neighborhood, Veenker), (SalePriceB, 755000)}, 1 % 1)

rpln(ssplit(vvbk,states(red(aatrb,sset([vNeighborhood])|vvbl))))
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 163000)})
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 172500)})
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 179200)})
# ...
# ({(Neighborhood, Veenker)}, {(SalePriceB, 278000)})
# ({(Neighborhood, Veenker)}, {(SalePriceB, 326000)})
# ({(Neighborhood, Veenker)}, {(SalePriceB, 755000)})

We can determine minimum subsets of the query variables that are causal or predictive by using the repa conditional entropy tuple set builder. We shall also calculate the shuffle content derived alignment and the size-volume-sized-shuffle relative entropy. \[ \{(\mathrm{lent}(A_{\mathrm{trb}},M,V_{\mathrm{bl}}),~M) : M \in \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trb}},\mathrm{L}}))\} \]

def buildcondrr(vvl,aa,kmax,omax,qmax):
    return sset([(b,a) for (a,b) in parametersBuilderConditionalVarsRepa(kmax,omax,qmax,vvl,aa).items()])

(kmax,omax,qmax) = (1, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (2.3688094014030727, {Neighborhood})
# (2.4017759366388978, {OverallQual})
# (2.4383758834884066, {GrLivAreaB})
# (2.5053875522654745, {GarageAreaB})
# (2.5141550725111603, {TotalBsmtSFB})
# (2.5286425658331524, {YearBuiltB})
# (2.5753796185367293, {GarageYrBltB})
# (2.577279982520081, {1stFlrSFB})
# (2.607209856639932, {GarageCars})
# (2.626737288605217, {YearRemodAddB})

Let us sort by shuffle content derived alignment descending. Let $L = \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trb}},\mathrm{L}}))$. Then calculate \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%X)-\mathrm{algn}(A_{\mathrm{trbr}}\%X),~X) : (e,X) \in L\} \]

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (0.0, {YearRemodAddB})
# (0.0, {YearBuiltB})
# (0.0, {TotalBsmtSFB})
# (0.0, {OverallQual})
# (0.0, {Neighborhood})
# (0.0, {GrLivAreaB})
# (0.0, {GarageYrBltB})
# (0.0, {GarageCars})
# (0.0, {GarageAreaB})
# (0.0, {1stFlrSFB})

and by size-volume-sized-shuffle relative entropy descending, \[ \{(\mathrm{rent}(A_{\mathrm{trb}}~\%~X,~Z_X * \hat{A}_{\mathrm{trbr}}~\%~X),~X) : (e,X) \in L\} \] where $Z_X = \mathrm{scalar}(|X^{\mathrm{C}}|)$,

def vsize(uu,xx,aa):
    return resize(vol(uu,xx),aa)

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (3.694822225952521e-13, {GrLivAreaB})
# (1.7763568394002505e-13, {GarageAreaB})
# (1.7763568394002505e-13, {1stFlrSFB})
# (1.2789769243681803e-13, {GarageYrBltB})
# (2.842170943040401e-14, {YearBuiltB})
# (2.6645352591003757e-15, {GarageCars})
# (-2.2382096176443156e-13, {OverallQual})
# (-3.765876499528531e-13, {TotalBsmtSFB})
# (-3.979039320256561e-13, {Neighborhood})
# (-6.323830348264892e-13, {YearRemodAddB})

Choose a tuple $X$ with the maximum relative entropy,

xx = sset([VarStr(s) for s in ["GrLivAreaB"]])

len(xx)
1

The label entropy, $\mathrm{lent}(A_{\mathrm{trb}},X,V_{\mathrm{bl}})$, is,

lent(aatrb,xx,vvbl)
2.438375883488403

This tuple has a volume of $|X^{\mathrm{C}}| = 21$,

vol(uub,xx)
21

Now consider the query effectiveness against the test set, $\mathrm{size}(A_{\mathrm{teb}} * (A_{\mathrm{trb}}\%X)^{\mathrm{F}})$,

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1459 % 1

So, for the mono-variate tuple, there exists a prediction for every event in the test set.
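Query effectiveness counts the test events whose state on the chosen tuple also occurs in the training sample. A minimal sketch in plain Python follows, assuming events are held as dictionaries. The function `effective_queries` and the toy bucket values are hypothetical.

```python
def effective_queries(train_rows, test_rows, variables):
    """Count the test events whose state on `variables` was seen in training."""
    seen = {tuple(row[v] for v in variables) for row in train_rows}
    return sum(1 for row in test_rows if tuple(row[v] for v in variables) in seen)

# Toy bucketed events (hypothetical values).
train_rows = [
    {"GrLivAreaB": 1200, "LotAreaB": 8000},
    {"GrLivAreaB": 1500, "LotAreaB": 9500},
]
test_rows = [
    {"GrLivAreaB": 1200, "LotAreaB": 9500},
    {"GrLivAreaB": 1800, "LotAreaB": 8000},
]

one_var = effective_queries(train_rows, test_rows, ["GrLivAreaB"])
two_var = effective_queries(train_rows, test_rows, ["GrLivAreaB", "LotAreaB"])
```

Adding variables to the tuple can only keep or reduce the number of effective queries, because a test state on the larger tuple must match a training event on every variable at once.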

(kmax,omax,qmax) = (2, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (1.17259972145533, {GarageYrBltB, GrLivAreaB})
# (1.1763717756588035, {GrLivAreaB, YearBuiltB})
# (1.1825182247882395, {BsmtUnfSFB, GarageAreaB})
# (1.1902468919522748, {GarageAreaB, GrLivAreaB})
# (1.1963275646526252, {1stFlrSFB, GarageYrBltB})
# (1.2007710029111403, {GarageYrBltB, TotalBsmtSFB})
# (1.2044059887997465, {BsmtUnfSFB, GrLivAreaB})
# (1.2080664696090002, {1stFlrSFB, YearBuiltB})
# (1.218109353181501, {BsmtUnfSFB, LotAreaB})
# (1.218337507644863, {1stFlrSFB, GarageAreaB})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (235.08640048183634, {GarageAreaB, GrLivAreaB})
# (205.02701940728434, {GrLivAreaB, YearBuiltB})
# (199.00732245944096, {1stFlrSFB, GarageAreaB})
# (196.09625918630286, {1stFlrSFB, YearBuiltB})
# (172.6125934542572, {GarageYrBltB, TotalBsmtSFB})
# (170.12391234916913, {BsmtUnfSFB, GrLivAreaB})
# (153.73732911898333, {1stFlrSFB, GarageYrBltB})
# (147.96457141870314, {GarageYrBltB, GrLivAreaB})
# (62.39292498274949, {BsmtUnfSFB, GarageAreaB})
# (56.5385461499186, {BsmtUnfSFB, LotAreaB})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (195.89974557002188, {GarageAreaB, GrLivAreaB})
# (173.08992727414716, {1stFlrSFB, YearBuiltB})
# (168.78883116031693, {BsmtUnfSFB, GrLivAreaB})
# (167.82327612434983, {GrLivAreaB, YearBuiltB})
# (164.30059739055696, {1stFlrSFB, GarageAreaB})
# (160.0849264746389, {BsmtUnfSFB, GarageAreaB})
# (158.6702662871935, {GarageYrBltB, TotalBsmtSFB})
# (155.68492715470074, {1stFlrSFB, GarageYrBltB})
# (153.5325087147944, {GarageYrBltB, GrLivAreaB})
# (149.44578750669734, {BsmtUnfSFB, LotAreaB})

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])

len(xx)
2

lent(aatrb,xx,vvbl)
1.2044059887997252

vol(uub,xx)
462

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1

1459 - 1395
64

In the case of the bi-variate tuple $\{\mathrm{BsmtUnfSFB},~\mathrm{GrLivAreaB}\}$, which has the third highest relative entropy of the ten listed, the query on the test set is ineffective for 64 events.

(kmax,omax,qmax) = (3, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (0.1583902593862545, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (0.16321608158264933, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (0.17534692586874456, {BsmtUnfSFB, GarageYrBltB, LotAreaB})
# (0.17551868584768204, {GarageAreaB, GrLivAreaB, LotAreaB})
# (0.18785610344571868, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (0.19170918418881655, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (0.1920984001311803, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (0.1925880641418205, {BsmtUnfSFB, LotAreaB, YearRemodAddB})
# (0.19796856294348153, {1stFlrSFB, GarageAreaB, LotAreaB})
# (0.2007710037546273, {BsmtUnfSFB, LotAreaB, YearBuiltB})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (130.3754279569223, {1stFlrSFB, GarageAreaB, LotAreaB})
# (124.43940524792288, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (115.96748840137627, {GarageAreaB, GrLivAreaB, LotAreaB})
# (113.42933723706949, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (102.01380411810794, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (100.30833097059985, {BsmtUnfSFB, LotAreaB, YearBuiltB})
# (92.29286671992077, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (78.63867541400259, {BsmtUnfSFB, LotAreaB, YearRemodAddB})
# (73.51335432366454, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (72.3502035138589, {BsmtUnfSFB, GarageYrBltB, LotAreaB})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (3703.571392190337, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (3661.5241122377047, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (3658.9316126172052, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (3658.1801888467744, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (3636.9144274921127, {GarageAreaB, GrLivAreaB, LotAreaB})
# (3599.7471796129685, {BsmtUnfSFB, LotAreaB, YearBuiltB})
# (3559.347827119549, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (3558.9595478408883, {BsmtUnfSFB, GarageYrBltB, LotAreaB})
# (3519.053657775017, {1stFlrSFB, GarageAreaB, LotAreaB})
# (3509.535390459998, {BsmtUnfSFB, LotAreaB, YearRemodAddB})

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","LotAreaB"]])

len(xx)
3

lent(aatrb,xx,vvbl)
0.1632160815826591

vol(uub,xx)
9702

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 369 % 1

In the case of the tri-variate tuple $\{\mathrm{BsmtUnfSFB},~\mathrm{GrLivAreaB},~\mathrm{LotAreaB}\}$, which has the third highest relative entropy of the ten listed, the query on the test set is effective for only 369 events.

xx = sset([VarStr(s) for s in ["1stFlrSFB","GarageAreaB","LotAreaB"]])

len(xx)
3

lent(aatrb,xx,vvbl)
0.19796856294348153

vol(uub,xx)
9261

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 436 % 1

The tri-variate tuple with the highest content alignment is effective for only 436 events.

(kmax,omax,qmax) = (4, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (0.017264363040668584, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (0.021653557329598172, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.025045822967731723, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (0.02517147370072692, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (0.02658646719956348, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.028611151303957527, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.0290169524086199, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})
# (0.029142603141615098, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (0.029422753513280497, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (0.029435017256659535, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (37.233006349854804, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (25.412008122784982, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})
# (21.37090807508173, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (18.950539946431263, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (17.969710693419643, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (17.446462549655052, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (17.26414099286103, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (15.772486116083314, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})
# (12.124428656489613, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (11.613603032723745, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (8586.179143302841, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (7750.455680972198, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (7696.60521695565, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})
# (7679.156435784535, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (7662.92102955794, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (7655.871371792047, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (7641.2282313114265, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (7639.854904487263, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (7518.573606714141, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (7504.227315635071, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GarageAreaB","GrLivAreaB","LotAreaB"]])

len(xx)
4

lent(aatrb,xx,vvbl)
0.029422753513281386

vol(uub,xx)
203742

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 65 % 1

(kmax,omax,qmax) = (5, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (0.0037980667427950365, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0037980667427950365, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0047475834284931295, {BsmtUnfSFB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0047475834284931295, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.005105972568058448, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.005697100114192111, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (0.005697100114192999, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.006055489253756541, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (0.006055489253757429, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (12281.642656929791, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (12281.642656926066, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (12273.219033710659, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (10183.691543019377, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (10141.754584021866, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (10095.169639604632, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (9996.724686695263, {BsmtUnfSFB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (9963.535168956965, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (9949.84899427509, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (9918.384667370003, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, YrSold})

xx = sset([VarStr(s) for s in ["1stFlrSFB","BsmtFinSF1B","GarageYrBltB","LotAreaB","MoSold"]])

len(xx)
5

lent(aatrb,xx,vvbl)
0.005105972568058448

vol(uub,xx)
2444904

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 18 % 1

(kmax,omax,qmax) = (6, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (0.0018990333713970742, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BedroomAbvGr, BsmtFinSF1B, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0018990333713970742, {BedroomAbvGr, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, TotalBsmtSFB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, OpenPorchSFB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, MoSold, OpenPorchSFB, YearRemodAddB, YrSold})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (14549.764237865806, {BsmtUnfSFB, GrLivAreaB, MoSold, OpenPorchSFB, YearRemodAddB, YrSold})
# (14501.807925760746, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14501.807925760746, {BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14491.875180616975, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (14470.93934185803, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, OpenPorchSFB, YrSold})
# (14433.893293440342, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14422.87513013184, {BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, TotalBsmtSFB, YrSold})
# (14403.117766991258, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (13193.704952552915, {BedroomAbvGr, BsmtFinSF1B, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (12989.3541607894, {BedroomAbvGr, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","MoSold","OpenPorchSFB","YearRemodAddB","YrSold"]])

len(xx)
6

lent(aatrb,xx,vvbl)
0.0018990333713979624

vol(uub,xx)
11586960

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 19 % 1

(kmax,omax,qmax) = (7, 60, 10)

ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)

rpln(ll)
# (0.0, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, YearRemodAddB, YrSold})
# (0.0, {BsmtExposure, BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtExposure, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtFinSF1B, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtFinType1, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtUnfSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, LotAreaB, MasVnrAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, LotAreaB, MoSold, WoodDeckSFB, YearRemodAddB, YrSold})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (17683.828986644745, {1stFlrSFB, BedroomAbvGr, BsmtExposure, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (17537.708243489265, {1stFlrSFB, BedroomAbvGr, LotAreaB, MoSold, WoodDeckSFB, YearRemodAddB, YrSold})
# (17537.708243370056, {1stFlrSFB, BedroomAbvGr, BsmtUnfSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (17537.708243370056, {1stFlrSFB, BedroomAbvGr, BsmtFinSF1B, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (17513.683985829353, {1stFlrSFB, BedroomAbvGr, LotAreaB, MasVnrAreaB, MoSold, YearRemodAddB, YrSold})
# (17469.78956949711, {1stFlrSFB, BedroomAbvGr, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16851.51015740633, {BsmtExposure, BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16851.51015740633, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16615.06543546915, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, YearRemodAddB, YrSold})
# (15844.106867194176, {1stFlrSFB, BedroomAbvGr, BsmtFinType1, LotAreaB, MoSold, YearRemodAddB, YrSold})

So the minimum dimension of a causal tuple, that is, one with zero label entropy, is 7. Of the causal tuples, choose a tuple $X$ with the maximum relative entropy,

xx = sset([VarStr(s) for s in ["BsmtExposure","BsmtUnfSFB","GarageAreaB","LotAreaB","MoSold","YearRemodAddB","YrSold"]])

len(xx)
7

lent(aatrb,xx,vvbl)
0.0

vol(uub,xx)
55301400

The tuple, $X$, has a volume of 55,301,400 but classifies the sample into only $|(A_{\mathrm{trb}}~\%~(X \cup V_{\mathrm{bl}}))^{\mathrm{F}}| = |(A_{\mathrm{trb}}\%X)^{\mathrm{F}}| = 1460$ effective states or slices,

rpln(aall(red(aatrb,xx|vvbl)))
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 3182), (MoSold, 5), (SalePriceB, 88000), (YearRemodAddB, 1974), (YrSold, 2010)}, 1 % 1)
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 3182), (MoSold, 12), (SalePriceB, 88000), (YearRemodAddB, 1970), (YrSold, 2007)}, 1 % 1)
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 12150), (MoSold, 5), (SalePriceB, 124000), (YearRemodAddB, 1993), (YrSold, 2007)}, 1 % 1)
# ...
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 844), (LotAreaB, 12150), (MoSold, 3), (SalePriceB, 326000), (YearRemodAddB, 2006), (YrSold, 2007)}, 1 % 1)
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 844), (LotAreaB, 14175), (MoSold, 7), (SalePriceB, 755000), (YearRemodAddB, 2009), (YrSold, 2009)}, 1 % 1)
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 1488), (LotAreaB, 13005), (MoSold, 8), (SalePriceB, 278000), (YearRemodAddB, 2009), (YrSold, 2009)}, 1 % 1)

size(eff(red(aatrb,xx|vvbl)))
# 1460 % 1

This, however, equals the size of the bucketed training sample, so each effective state contains exactly one event. So, even though the relative entropy is the highest of the causal 7-tuples, which implies a robust or likely model, it is doubtful that there is sufficient size in each component to make the tuple very query effective. This can be seen by considering the query effectiveness of the test set, $\mathrm{size}(A_{\mathrm{teb}} * (A_{\mathrm{trb}}\%X)^{\mathrm{F}})$,

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 7 % 1

Of course, if a query fails with a model of 7 variables we can retry with the less likely model of 6 variables, and so on until a prediction is made.
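The retry strategy can be sketched as a chain of lookups ordered from the largest tuple to the smallest. This is a minimal illustration only. The function `predict_with_fallback`, the lookup tables and the query are hypothetical, and the predicted labels are invented rather than taken from the sample.

```python
def predict_with_fallback(models, query):
    """Try each model in order and return the first prediction found.

    `models` is a list of (tuple_vars, lookup) pairs, where `lookup` maps a
    state on tuple_vars to a predicted label. Returns None if every model fails.
    """
    for tuple_vars, lookup in models:
        state = tuple(query[v] for v in tuple_vars)
        if state in lookup:
            return lookup[state], tuple_vars
    return None

# Hypothetical lookups, ordered from the largest tuple to the smallest.
models = [
    (("LotAreaB", "MoSold", "YrSold"), {(3182, 5, 2010): 88000}),
    (("LotAreaB", "MoSold"), {(3182, 5): 88000, (12150, 5): 124000}),
    (("LotAreaB",), {(3182,): 88000, (12150,): 124000, (14175,): 326000}),
]

query = {"LotAreaB": 12150, "MoSold": 5, "YrSold": 2008}
result = predict_with_fallback(models, query)
miss = predict_with_fallback(models, {"LotAreaB": 1, "MoSold": 1, "YrSold": 1})
```

Here the 3-variable lookup has no matching state, so the prediction comes from the 2-variable model. A query that matches no model returns `None`.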

Instead of determining minimum subsets of the query variables that are causal or predictive by using the conditional entropy tuple set builder, consider the conditional entropy fud decomper. The resultant decomposition consists of singleton fuds of self partition transforms of smaller tuples. In this way a set of paths of different tuples for different slices can reduce the label entropy,

def decompercondrr(ll,uu,aa,kmax,omax,fmax):
    return parametersSystemsHistoryRepasDecomperConditionalFmaxRepa(kmax,omax,fmax,uu,ll,aa)

(kmax,omax) = (1,5)

(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,31)

dfund(df)
# {1stFlrSFB, BsmtFinSF1B, BsmtUnfSFB, GarageAreaB, GarageYrBltB, GrLivAreaB, LotAreaB, MasVnrAreaB, MoSold, Neighborhood, TotalBsmtSFB}

len(dfund(df))
11

rpln(treesPaths(funcsTreesMap(lambda xx:(fder(xx[1]),fund(xx[1])),dfzz(df))))
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<20,1>,1>}, {GarageAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<23,1>,1>}, {GarageYrBltB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<24,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<26,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<27,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<28,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<29,1>,1>}, {LotAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<3,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<4,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<5,1>,1>}, {BsmtUnfSFB}), ({<<22,1>,1>}, {1stFlrSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<6,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<7,1>,1>}, {BsmtUnfSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<8,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<9,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<10,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<11,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<12,1>,1>}, {MasVnrAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<13,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<14,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<15,1>,1>}, {BsmtUnfSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<16,1>,1>}, {GarageAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<17,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<18,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<19,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<21,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<25,1>,1>}, {MoSold})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<30,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<31,1>,1>}, {BsmtFinSF1B})]
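The idea that different slices are split on different variables can be sketched with a greedy procedure in plain Python. This is an assumption-laden illustration, not the algorithm of the decomper in AlignmentRepaPy, which may choose slices and tuples differently. The names `decompose` and `label_entropy` and the toy events are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def entropy(counts):
    """Shannon entropy in nats of a collection of counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def label_entropy(events, var, label):
    """Entropy of the label given var: ent(var, label) - ent(var)."""
    joint = Counter((e[var], e[label]) for e in events)
    marginal = Counter(e[var] for e in events)
    return entropy(joint.values()) - entropy(marginal.values())

def decompose(events, query_vars, label, fmax):
    """Greedy sketch: split the largest impure slice on its best single variable."""
    leaves = [((), events)]  # each leaf is (path of (var, value) pairs, slice)
    for _ in range(fmax):
        impure = [(p, s) for (p, s) in leaves
                  if len({e[label] for e in s}) > 1]
        if not impure:
            break
        path, slice_ = max(impure, key=lambda ps: len(ps[1]))
        used = {v for (v, _) in path}
        candidates = [v for v in query_vars if v not in used]
        if not candidates:
            break
        best = min(candidates, key=lambda v: label_entropy(slice_, v, label))
        parts = defaultdict(list)
        for e in slice_:
            parts[e[best]].append(e)
        leaves.remove((path, slice_))
        leaves.extend((path + ((best, val),), part)
                      for val, part in parts.items())
    return leaves

# Toy events: within N=a the label follows G, within N=b it follows T.
events = [
    {"N": "a", "G": 1, "T": 1, "P": "lo"},  {"N": "a", "G": 1, "T": 2, "P": "lo"},
    {"N": "a", "G": 2, "T": 1, "P": "mid"}, {"N": "a", "G": 2, "T": 2, "P": "mid"},
    {"N": "b", "G": 1, "T": 1, "P": "hi"},  {"N": "b", "G": 2, "T": 1, "P": "hi"},
    {"N": "b", "G": 1, "T": 2, "P": "top"}, {"N": "b", "G": 2, "T": 2, "P": "top"},
]
leaves = decompose(events, ["N", "G", "T"], "P", 31)
path_vars = {tuple(v for v, _ in path) for path, _ in leaves}
```

In the toy data the root splits on `N`, after which one slice is resolved by `G` and the other by `T`, so no single pair of variables has to cover the whole sample.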

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub1 = uunion(uub,fsys(ff))

hhtrbb = hrfmul(uub1,ff,hhtrb)

def hrlent(uu,hh,ww,vvl):
    return ent(hhaa(hrhh(uu,hrhrred(hh,ww|vvl)))) - ent(hhaa(hrhh(uu,hrhrred(hh,ww))))

hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.0078399275847598

hhteb = hrhrred(aahr(uub,aateb),vvbk)

hhtebb = hrfmul(uub1,ff,hhteb)

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1352 % 1

We can see that the query effectiveness and label entropy are similar to the 2-tuple case above.
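The helper `hrlent` defined above is a difference of two entropies. Restated in the notation of this section, with $W$ the tuple of query or derived variables (this only transcribes the code and adds no new definition), \[ \mathrm{lent}(A,W,V_{\mathrm{bl}}) = \mathrm{ent}(A~\%~(W \cup V_{\mathrm{bl}})) - \mathrm{ent}(A~\%~W) \] It lies between zero, where every effective state of $W$ carries a single label, and $\mathrm{ent}(A~\%~V_{\mathrm{bl}})$, where $W$ says nothing about the label.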

(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,63)

len(dfund(df))
15

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub1 = uunion(uub,fsys(ff))

hhtrbb = hrfmul(uub1,ff,hhtrb)

hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.7144285840050593

hhtebb = hrfmul(uub1,ff,hhteb)

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1243 % 1

(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,127)

len(dfund(df))
21

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub1 = uunion(uub,fsys(ff))

hhtrbb = hrfmul(uub1,ff,hhtrb)

hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.38921596419687177

hhtebb = hrfmul(uub1,ff,hhteb)

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1064 % 1

(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,255)

len(dfund(df))
28

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub1 = uunion(uub,fsys(ff))

hhtrbb = hrfmul(uub1,ff,hhtrb)

hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.1053963521125576

hhtebb = hrfmul(uub1,ff,hhteb)

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 864 % 1

The 255-fud has higher query effectiveness and lower label entropy than the 3-tuple case above.

(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,511)

len(dfund(df))
34

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub1 = uunion(uub,fsys(ff))

hhtrbb = hrfmul(uub1,ff,hhtrb)

hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.7763568394002505e-15

hhtebb = hrfmul(uub1,ff,hhteb)

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 769 % 1

The 511-fud is causal, like the 7-tuple, and more query effective, but still only for around half of the test sample.
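The runs above can be tabulated to make the trade-off explicit. The short script below only restates the figures already printed in this section (label entropy on the training sample and effective test events for each `fmax`) and divides by the test sample size of 1459.

```python
test_size = 1459

# (fmax, label entropy on the training sample, effective test events)
runs = [
    (31, 1.0078399275847598, 1352),
    (63, 0.7144285840050593, 1243),
    (127, 0.38921596419687177, 1064),
    (255, 0.1053963521125576, 864),
    (511, 1.7763568394002505e-15, 769),
]

# Fraction of the test sample for which each model can make a prediction.
coverage = {fmax: eff / test_size for fmax, _, eff in runs}

for fmax, lent, eff in runs:
    print(f"fmax={fmax:3d}  label entropy={lent:.4f}  effective={eff / test_size:.1%}")
```

As `fmax` grows the label entropy falls towards zero while the coverage of the test sample falls from about 93% to about 53%.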

To conclude, the choice between models consisting of only substrate variables is a trade-off between model likelihood and accuracy/effectiveness given the sample size and substrate valencies.

Induced modelling of sale price

Consider an unsupervised induced model $D$ on the query variables, $V_{\mathrm{bk}}$, which exclude sale price. We shall analyse this model, $D$, to find a smaller submodel that predicts the label variables, $V_{\mathrm{bl}}$, or sale price. That is, we shall search in the decomposition fud for a submodel that optimises conditional entropy.

Here the induced model is created by the limited-nodes highest-layer excluded-self maximum-roll-by-derived-dimension fud decomper, $(\cdot,D) = I_{P,U_{\mathrm{b}},\mathrm{D,F,mm,xs,d,f}}((V_{\mathrm{bk}},A_{\mathrm{trb}}))$.

There is an example of model induction in the AMESPy repository.

First consider the fud decomposition AMES_model1.json (see Model induction),

from AMESDev import *

(uub,aab,aatrb,aateb) = amesBucketedIO(20)
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl

df = persistentsDecompFud_u(json.load(open('./AMES_model1.json','r')))

uub1 = uunion(uub,fsys(dfff(df)))

len(uvars(uub1))
354

Let us examine the tree of the fud decomposition, \[ \begin{eqnarray} \{\{(S,~\mathrm{und}(F),~\mathrm{der}(F)) : (S,F) \in L\} : L \in \mathrm{paths}(D)\} \end{eqnarray} \]

rpln(treesPaths(funcsTreesMap(lambda xx:(xx[0],fund(xx[1]),fder(xx[1])),dfzz(df))))
...

The decomposition tree contains 15 nodes with fud variables as follows, \[ \begin{eqnarray} \{\{\mathrm{fid}(F) : (\cdot,F) \in L\} : L \in \mathrm{paths}(D)\} \end{eqnarray} \]

def fid(ff):
    return variablesVariableFud(fder(ff)[0])

rpln(treesSubPaths(funcsTreesMap(lambda xx:fid(xx[1]),dfzz(df))))
# [1]
# [1, 2]
# [1, 2, 7]
# [1, 2, 7, 13]
# [1, 2, 10]
# [1, 3]
# [1, 3, 5]
# [1, 3, 5, 12]
# [1, 3, 5, 14]
# [1, 3, 5, 15]
# [1, 3, 6]
# [1, 3, 9]
# [1, 4]
# [1, 4, 8]
# [1, 4, 11]
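The sub-paths printed above determine the shape of the tree. As a cross-check, the sketch below rebuilds the tree as a plain dict of child lists read off that output and enumerates its sub-paths. The names `children` and `sub_paths` are illustrative and are not library functions.

```python
# Child lists read off the sub-paths printed above.
children = {
    1: [2, 3, 4],
    2: [7, 10],
    3: [5, 6, 9],
    4: [8, 11],
    5: [12, 14, 15],
    7: [13],
}

def sub_paths(node, prefix=()):
    """Yield every path from the root to each node, depth first."""
    path = prefix + (node,)
    yield list(path)
    for child in children.get(node, []):
        yield from sub_paths(child, path)

paths = list(sub_paths(1))
leaf_paths = [p for p in paths if p[-1] not in children]
depth = max(len(p) for p in paths)
```

There is one sub-path per node, so the tree has 15 nodes, of which 9 are leaves, and the longest path has 4 fuds.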

Now consider the summed alignment and the summed alignment valency-density, $\mathrm{summation}(U_{\mathrm{b}1},D,A_{\mathrm{b}})$,

hhb = hrhrred(aahr(uub,aab),vvb)

(wmax,lmax,xmax,omax,bmax,mmax,umax,pmax,fmax,mult,seed) = (2919, 8, 2919, 20, (20*3), 3, 1459, 1, 15, 7, 5)

summation(mult,seed,uub1,df,hhb)
(23920.38686712143, 10661.48762745227)

\[ \begin{eqnarray} \{(\mathrm{fid}(F),~z_C,~a) : ((S,F),(z_C,(a,a_{\mathrm{d}}))) \in \mathrm{nodes}(\mathrm{sumtree}(U_{\mathrm{b}1},D,A_{\mathrm{b}}))\} \end{eqnarray} \]

sumtree = systemsDecompFudsHistoryRepasTreeAlignmentContentShuffleSummation_u

rpln(sorted([(fid(ff),zc,a) for ((ss,ff),(zc,(a,ad))) in sumtree(mult,seed,uub1,df,hhb).items()]))
# (1, 2919, 12657.721873016915)
# (2, 1084, 2529.6565315872417)
# (3, 981, 2809.7648124861553)
# (4, 686, 1658.96614868414)
# (5, 432, 1095.8853558730061)
# (6, 312, 662.7996448348581)
# (7, 256, 465.3573699577162)
# (8, 224, 480.4560693360156)
# (9, 176, 424.64359529323536)
# (10, 120, 179.02315566112912)
# (11, 118, 225.63355887734627)
# (12, 117, 178.25089693539442)
# (13, 103, 229.9593097628051)
# (14, 90, 184.9645923908373)
# (15, 80, 137.30395242463544)

We can see that the root fud has the highest slice size and shuffle content derived alignment, while the leaf fuds have small slice sizes and shuffle content derived alignments.

The bare model is a fud decomposition. As noted in Conversion to fud, the tree of a fud decomposition is sometimes unwieldy, so consider the fud decomposition fud, $F = D^{\mathrm{F}} \in \mathcal{F}$, (see Practicable fud decomposition fud),

ff = systemsDecompFudsNullablePracticable(uub1,df,1)

uub2 = uunion(uub,fsys(ff))

len(uvars(uub2))
518

The model, $F$, has 139 derived variables, $W_F = \mathrm{der}(F)$, and a large derived volume, $|W_F^{\mathrm{C}}|$,

len(fder(ff))
139

fder(ff)
# {<<1,n>,1>, <<1,n>,2>,...,<<15,n>,11>}

vol(uub2,fder(ff))
1081689568857469337014175565968052911323215747000367892356501629566976

The model has 50 underlying variables, $V_F = \mathrm{und}(F)$,

len(fund(ff))
50

vvbk - fund(ff)
# {3SsnPorchB, BsmtFinSF2B, BsmtFinType2, Condition1, Condition2, Electrical, EnclosedPorchB, Functional, GarageAreaB, GarageCars, Heating, LotAreaB, LotConfig, LotFrontageB, LotShape, LowQualFinSFB, MasVnrAreaB, MiscValB, MoSold, Neighborhood, OpenPorchSFB, OverallCond, OverallQual, PoolArea, RoofMatl, SaleType, ScreenPorchB, WoodDeckSFB, YrSold}

That is, a substantial part of the substrate is ignored by the model.

The underlying volume, $|V_F^{\mathrm{C}}|$, is

vol(uub,fund(ff))
214786071852303726298697564160000000000000

The derived entropy, $\mathrm{entropy}(A_{\mathrm{b}} * F)$, is

aab1 = hhaa(hrhh(uub2,hrhrred(hrfmul(uub2,ff,hhb),fder(ff))))

ent(aab1)
4.6043656843985366

This may be compared to the logarithm of the derived volume, $\ln |W_F^{\mathrm{C}}|$,

w = vol(uub2,fder(ff))

log(w)
158.9568956509107

So derived entropy is quite low. This is because there are only 243 effective derived states,

size(eff(aab1))
# 243 % 1

rpln([c for (ss,c) in aall(aab1)])
# 1 % 1
# 1 % 1
# 2 % 1
# 1 % 1
# 4 % 1
# ...
# 62 % 1
# 50 % 1
# 1 % 1
# 3 % 1
# 2 % 1
# 1 % 1
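The gap between the derived entropy and the logarithm of the derived volume can be checked with the standard library alone. The sketch below uses only the figures printed above. The one added fact is that entropy in nats cannot exceed the logarithm of the number of effective states.

```python
from math import log

derived_volume = 1081689568857469337014175565968052911323215747000367892356501629566976
effective_states = 243
derived_entropy = 4.6043656843985366

# Ceiling on the entropy if every derived state could occur.
log_volume = log(derived_volume)

# Ceiling on the entropy given that only 243 derived states are effective.
log_effective = log(effective_states)
```

The derived entropy of about 4.60 sits below $\ln 243 \approx 5.49$, which in turn is far below $\ln |W_F^{\mathrm{C}}| \approx 158.96$.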

Now apply the model to the sample. Let $A_{\mathrm{trbb}} = A_{\mathrm{trb}}~\%~V_{\mathrm{b}} * \prod\mathrm{his}(F)$,

hhtrb = hrhrred(aahr(uub,aatrb),vvb)

hhtrbb = hrfmul(uub2,ff,hhtrb)

hrsize(hhtrbb)
1460

hhteb = hrhrred(aahr(uub,aateb),vvbk)

hhtebb = hrfmul(uub2,ff,hhteb)

hrsize(hhtebb)
1459

rpln(aall(hhaa(hrhh(uub2,hrhrred(hhtrbb,fder(ff)|vvbl)))))
# ({(SalePriceB, 88000), (<<1,n>,1>, 0),...,(<<15,n>,11>, null)}, 1 % 1)
# ...

size(eff(hhaa(hrhh(uub2,hrhrred(hhtrbb,fder(ff)|vvbl)))))
# 722 % 1

The model’s label entropy or query conditional entropy is less than that of Neighborhood, \[ \mathrm{lent}(A_{\mathrm{trbb}},W_F,V_{\mathrm{bl}}) < \mathrm{lent}(A_{\mathrm{trbb}},\{\mathrm{Neighborhood}\},V_{\mathrm{bl}}) < \mathrm{ent}(A_{\mathrm{trbb}}\%V_{\mathrm{bl}}) \]

def hrlent(uu,hh,ww,vvl):
    return ent(hhaa(hrhh(uu,hrhrred(hh,ww|vvl)))) - ent(hhaa(hrhh(uu,hrhrred(hh,ww))))

hrlent(uub2,hhtrbb,fder(ff),vvbl)
1.7888987910580774

lent(aatrb,sset([VarStr("Neighborhood")]),vvbl)
2.3688094014030585

ent(red(aatrb,vvbl))
2.9948072760546887

That is, the model is more predictive of sale price than Neighborhood.

rpln(sset([(hrlent(uub2,hhtrbb,sset([w]),vvbl),w) for w in fder(ff)]))
# (2.713881719423317, <<1,n>,2>)
# (2.7241739949478916, <<1,n>,3>)
# (2.7391770885661, <<3,n>,5>)
# ...
# (2.818326655730092, <<2,n>,4>)
# (2.819116050902175, <<2,n>,1>)
# (2.8204246647147992, <<2,n>,8>)
# (2.873366670608093, <<5,n>,6>)
# (2.8737196824165805, <<6,n>,7>)
# ...
# (2.9591678742238914, <<15,n>,9>)
# (2.9591678742238914, <<15,n>,10>)
# (2.960553324175016, <<8,n>,4>)
# ...
# (2.9670421547242567, <<11,n>,3>)
# (2.9670421547242567, <<11,n>,4>)
# (2.968514119368421, <<10,n>,2>)
# (2.968514119368421, <<10,n>,5>)
# (2.968514119368421, <<10,n>,8>)

We can see that the derived variables nearest the root fud tend to have the lowest label entropy. None have zero label entropy by themselves. Consider derived variable <<1,n>,2> in the root fud,

w1n2 = stringsVariable("<<1,n>,2>")

fund(fdep(ff,sset([w1n2])))
# {BsmtQual, GarageYrBltB, YearRemodAddB}

hrlent(uub2,hhtrbb,sset([w1n2]),vvbl)
2.713881719423317

rpln(aall(hhaa(hrhh(uub2,hrhrred(hhtrbb,sset([w1n2])|vvbl)))))
# ({(SalePriceB, 88000), (<<1,n>,2>, 0)}, 60 % 1)
# ({(SalePriceB, 88000), (<<1,n>,2>, 2)}, 15 % 1)
# ({(SalePriceB, 106250), (<<1,n>,2>, 0)}, 56 % 1)
# ({(SalePriceB, 106250), (<<1,n>,2>, 2)}, 15 % 1)
# ...
# ({(SalePriceB, 326000), (<<1,n>,2>, 0)}, 4 % 1)
# ({(SalePriceB, 326000), (<<1,n>,2>, 1)}, 57 % 1)
# ({(SalePriceB, 326000), (<<1,n>,2>, 2)}, 11 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 0)}, 4 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 1)}, 64 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 2)}, 5 % 1)

Now consider the label entropy for all of the fud variables, not just the fud derived variables. We can determine minimum subsets of the fud variables that are causal or predictive by using the repa conditional entropy tuple set builder to minimise the conditional entropy, \[ \{(\mathrm{lent}(A_{\mathrm{trbb}},M,V_{\mathrm{bl}}),~M) : M \in \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trbb}},\mathrm{L}}))\} \]

def buildcondrr(vvl,aa,kmax,omax,qmax):
    return sset([(b,a) for (a,b) in parametersBuilderConditionalVarsRepa(kmax,omax,qmax,vvl,aa).items()])

(kmax,omax,qmax) = (1, 30, 30)

ll = buildcondrr(vvbl,hhtrbb,kmax,omax,qmax)

rpln(ll)
# (2.3688094014030727, {Neighborhood})
# (2.4017759366388978, {OverallQual})
# (2.4383758834884066, {GrLivAreaB})
# ...
# (2.669795100904241, {<<1,1>,3>})
# (2.6829848274547423, {<<7,1>,2>})
# (2.6857543503089576, {BsmtFinSF1B})
# (2.6909428641116366, {OpenPorchSFB})
# (2.6998273379134146, {<<1,1>,2>})
# (2.703002915017576, {LotAreaB})
# (2.7083055009674, {GarageFinish})
# (2.712732624759936, {MasVnrAreaB})
# (2.713881719423318, {<<1,n>,2>})
# ...
# (2.7187679919428582, {<<1,1>,7>})
# (2.719031983537922, {FullBath})
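The selection of the `qmax` tuples with the lowest label entropy can be sketched by brute force in plain Python. Note the assumptions: this sketch scores every tuple of exactly `kmax` variables exhaustively, whereas the repa builder performs a limited search controlled by `omax`. The names `best_tuples` and `label_entropy` and the toy events are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def entropy(counts):
    """Shannon entropy in nats of a collection of counts."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def label_entropy(events, tuple_vars, label):
    """ent(tuple_vars + label) - ent(tuple_vars), as in hrlent."""
    joint = Counter(tuple(e[v] for v in tuple_vars) + (e[label],) for e in events)
    marginal = Counter(tuple(e[v] for v in tuple_vars) for e in events)
    return entropy(joint.values()) - entropy(marginal.values())

def best_tuples(events, query_vars, label, kmax, qmax):
    """Return the qmax tuples of kmax variables with the lowest label entropy."""
    scored = [(label_entropy(events, xx, label), xx)
              for xx in combinations(sorted(query_vars), kmax)]
    return sorted(scored)[:qmax]

# Toy events: the label P is the exclusive-or of A and B, and C is noise.
events = [{"A": a, "B": b, "C": c, "P": a ^ b}
          for a in (0, 1) for b in (0, 1) for c in (0, 1)]

singles = best_tuples(events, ["A", "B", "C"], "P", 1, 3)
pairs = best_tuples(events, ["A", "B", "C"], "P", 2, 1)
```

In the toy data no single variable reduces the label entropy, but the pair `("A", "B")` is causal, which mirrors the way the label entropy above only approaches zero as the tuple dimension increases.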

Let us sort by shuffle content derived alignment descending. Let $L = \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trbb}},\mathrm{L}}))$. Then calculate \[ \{(\mathrm{algn}(A_{\mathrm{trbb}}\%X)-\mathrm{algn}(A_{\mathrm{trbrb}}\%X),~X) : (e,X) \in L\} \] where $A_{\mathrm{trbrb}} = A_{\mathrm{trbr}}~\%~V_{\mathrm{b}} * \prod\mathrm{his}(F)$,

hhtrbr = historyRepasShuffle_u(hhtrb,1)

hhtrbrb = hrfmul(uub2,ff,hhtrbr)

hrsize(hhtrbrb)
1460
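The shuffled sample serves as a baseline in which the variables are made independent while each keeps its own distribution. A minimal sketch of that idea follows, assuming a shuffle means permuting each variable's column independently. Whether `historyRepasShuffle_u` is implemented in exactly this way is an assumption, and `shuffle_columns` is a hypothetical name.

```python
import random

def shuffle_columns(events, seed):
    """Permute each variable's column independently, keeping its marginal."""
    rng = random.Random(seed)
    variables = list(events[0])
    columns = {}
    for v in variables:
        column = [e[v] for e in events]
        rng.shuffle(column)  # independent permutation for this variable
        columns[v] = column
    return [{v: columns[v][i] for v in variables} for i in range(len(events))]

# Toy sample in which A and B are perfectly aligned.
events = [{"A": i, "B": i} for i in range(100)]
shuffled = shuffle_columns(events, 1)
```

The shuffled sample has the same size and the same values per variable, but the relationship between `A` and `B` is broken, so any alignment left in it is due to chance.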

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb,xx)))]]))))
# (0.0, {<<7,3>,2>})
# (0.0, {<<7,1>,2>})
# (0.0, {<<1,2>,46>})
# ...
# (0.0, {2ndFlrSFB})
# (0.0, {1stFlrSFB})

and by size-volume-sized-shuffle relative entropy descending, \[ \{(\mathrm{rent}(A_{\mathrm{trbb}}~\%~X,~Z_F * \hat{A}_{\mathrm{trbrb}}~\%~X),~X) : (e,X) \in L\} \] where $Z_F = \mathrm{scalar}(|V_F^{\mathrm{C}}|)$,

def vsize(uu,xx,aa):
    return resize(vol(uu,xx),aa)

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb,xx))))]]))))
# (328.5675967076595, {<<1,2>,46>})
# (38.520839912467636, {<<1,2>,44>})
# (38.520839912467636, {<<1,n>,2>})
# (23.77367675193318, {<<7,3>,2>})
# (5.730394727029356, {<<1,1>,9>})
# (7.105427357601002e-13, {GrLivAreaB})
# ...
# (-2.4868995751603507e-13, {<<1,1>,7>})
# (-3.836930773104541e-13, {GarageYrBltB})
# (-8.406608742461685e-13, {TotalBsmtSFB})

xx = [stringsVariable("<<1,2>,46>")]

len(xx)
1

The label entropy of the tuple, $X$, is $\mathrm{lent}(A_{\mathrm{trbb}},X,V_{\mathrm{bl}})$,

hrlent(uub2,hhtrbb,xx,vvbl)
2.7182017226256883

vol(uub2,xx)
3

The tuple, $X$, is very query effective, $\mathrm{size}(A_{\mathrm{tebb}}\%X * (A_{\mathrm{trbb}}\%X)^{\mathrm{F}})$,

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb,xx))))))
# 1459 % 1

The substrate variables are usually more predictive of sale price than the fud variables. This is because the substrate variables generally have larger valencies and so fewer are needed to partition the volume,

(kmax,omax,qmax) = (5, 10, 10)

ll = buildcondrr(vvbl,hhtrbb,kmax,omax,qmax)

rpln(ll)
# (0.0037980667427950365, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0037980667427950365, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.005105972568058448, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (0.006055489253757429, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.007596133485589185, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, WoodDeckSFB})
# (0.007596133485589185, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.007596133485590073, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YrSold})
# (0.007596133485590073, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold, YrSold})

None of the 5-tuples contain model variables.
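The valency argument can be made concrete with a little arithmetic. The sketch below asks how many variables of a given valency are needed before the volume reaches the training sample size of 1460. The valency of 3 matches the derived variable `<<1,2>,46>` above, while the valency of 20 is an assumption suggested by the bucket count passed to `amesBucketedIO(20)`, since actual substrate valencies vary.

```python
def variables_needed(valency, target_states):
    """Smallest k such that valency**k >= target_states."""
    k, volume = 0, 1
    while volume < target_states:
        volume *= valency
        k += 1
    return k

sample_size = 1460

# Derived variable with valency 3, as for <<1,2>,46>.
needed_derived = variables_needed(3, sample_size)

# Bucketed substrate variable with an assumed valency of 20.
needed_bucketed = variables_needed(20, sample_size)
```

Under these assumptions three bucketed substrate variables already give a volume larger than the sample, while seven trivalent derived variables are needed, so tuples limited to five variables favour the substrate.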

Now optimise for larger tuples, excluding the substrate. Let $A_{\mathrm{trbb}2} = A_{\mathrm{trbb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$, $A_{\mathrm{trbrb}2} = A_{\mathrm{trbrb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$ and $A_{\mathrm{tebb}2} = A_{\mathrm{tebb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$,

hhtrbb2 = hrhrred(hhtrbb,fvars(ff)-vvb|vvbl)

hhtrbrb2 = hrhrred(hhtrbrb,fvars(ff)-vvb|vvbl)

hhtebb2 = hrhrred(hhtebb,fvars(ff)-vvb|vvbl)

(kmax,omax,qmax) = (1, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (2.669795100904241, {<<1,1>,3>})
# (2.6829848274547423, {<<7,1>,2>})
# (2.6998273379134146, {<<1,1>,2>})
# (2.713881719423318, {<<1,n>,2>})
# (2.713881719423318, {<<1,2>,44>})
# (2.715230020247958, {<<1,1>,9>})
# (2.718001054594302, {<<7,3>,2>})
# (2.7182017226256883, {<<1,2>,46>})
# (2.7187679919428582, {<<1,1>,7>})
# (2.7237328733746518, {<<8,1>,5>})

The shuffle content derived alignment is \[ \{(\mathrm{algn}(A_{\mathrm{trbb}2}\%X)-\mathrm{algn}(A_{\mathrm{trbrb}2}\%X),~X) : (e,X) \in L\} \]

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (0.0, {<<8,1>,5>})
# (0.0, {<<7,3>,2>})
# (0.0, {<<7,1>,2>})
# (0.0, {<<1,2>,46>})
# (0.0, {<<1,2>,44>})
# (0.0, {<<1,1>,9>})
# (0.0, {<<1,1>,7>})
# (0.0, {<<1,1>,3>})
# (0.0, {<<1,1>,2>})
# (0.0, {<<1,n>,2>})

and the size-volume-sized-shuffle relative entropy is \[ \{(\mathrm{rent}(A_{\mathrm{trbb}2}~\%~X,~Z_F * \hat{A}_{\mathrm{trbrb}2}~\%~X),~X) : (e,X) \in L\} \]

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (328.5675967076595, {<<1,2>,46>})
# (38.520839912467636, {<<1,2>,44>})
# (38.520839912467636, {<<1,n>,2>})
# (23.77367675193318, {<<7,3>,2>})
# (5.730394727029356, {<<1,1>,9>})
# (9.237055564881302e-14, {<<1,1>,3>})
# (-1.0658141036401503e-14, {<<7,1>,2>})
# (-4.263256414560601e-14, {<<1,1>,2>})
# (-9.947598300641403e-14, {<<8,1>,5>})
# (-2.4868995751603507e-13, {<<1,1>,7>})

xx = [stringsVariable("<<1,2>,46>")]

len(xx)
1

fund(fdep(ff,xx))
# {BsmtQual, Foundation, GarageYrBltB, YearRemodAddB}

The label entropy of the tuple, $X$, is $\mathrm{lent}(A_{\mathrm{trbb}},X,V_{\mathrm{bl}})$,

hrlent(uub2,hhtrbb,xx,vvbl)
2.7182017226256883

vol(uub2,xx)
3

vol(uub2,fund(fdep(ff,xx)))
11970

The tuple, $X$, is also very query effective, $\mathrm{size}(A_{\mathrm{tebb}2}\%X * (A_{\mathrm{trbb}2}\%X)^{\mathrm{F}})$,

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1459 % 1

(kmax,omax,qmax) = (2, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (2.305341144972077, {<<1,1>,3>, <<7,1>,2>})
# (2.3198579037518825, {<<1,1>,3>, <<7,1>,39>})
# (2.3393213381047397, {<<1,1>,3>, <<7,1>,44>})
# (2.3395583059234233, {<<1,1>,2>, <<7,1>,2>})
# (2.3430591600610113, {<<1,1>,7>, <<7,1>,2>})
# (2.355056115647031, {<<1,1>,2>, <<7,1>,39>})
# (2.3583556507794876, {<<1,1>,3>, <<7,1>,1>})
# (2.3635828963123915, {<<1,1>,7>, <<7,1>,39>})
# (2.365723758707168, {<<1,1>,3>, <<7,3>,2>})
# (2.3695799074638098, {<<1,1>,3>, <<7,1>,42>})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (115.12929318577699, {<<1,1>,3>, <<7,1>,2>})
# (113.76007930535798, {<<1,1>,2>, <<7,1>,2>})
# (109.91491481954017, {<<1,1>,3>, <<7,3>,2>})
# (104.49829987349767, {<<1,1>,3>, <<7,1>,44>})
# (97.43300394335802, {<<1,1>,3>, <<7,1>,1>})
# (93.26220109565656, {<<1,1>,7>, <<7,1>,2>})
# (92.82522411873833, {<<1,1>,3>, <<7,1>,39>})
# (86.5499693790789, {<<1,1>,3>, <<7,1>,42>})
# (84.22246529301538, {<<1,1>,2>, <<7,1>,39>})
# (68.85149937955248, {<<1,1>,7>, <<7,1>,39>})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (161.96050031032064, {<<1,1>,3>, <<7,3>,2>})
# (42.298556271586676, {<<1,1>,3>, <<7,1>,2>})
# (38.16857739141096, {<<1,1>,3>, <<7,1>,44>})
# (33.52068556603558, {<<1,1>,2>, <<7,1>,2>})
# (30.92740438162491, {<<1,1>,3>, <<7,1>,39>})
# (28.63735377895341, {<<1,1>,3>, <<7,1>,1>})
# (28.207405835408736, {<<1,1>,7>, <<7,1>,2>})
# (25.927168307674037, {<<1,1>,3>, <<7,1>,42>})
# (25.51855137601865, {<<1,1>,2>, <<7,1>,39>})
# (20.197547061751266, {<<1,1>,7>, <<7,1>,39>})

xx = list(map(stringsVariable,["<<1,1>,3>","<<7,3>,2>"]))

len(xx)
2

fund(fdep(ff,xx))
# {Foundation, GrLivAreaB, TotRmsAbvGrd, YearBuiltB}

hrlent(uub2,hhtrbb,xx,vvbl)
2.365723758707158

vol(uub2,xx)
18

vol(uub2,fund(fdep(ff,xx)))
37044

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1459 % 1

(kmax,omax,qmax) = (3, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (1.9386864544632831, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (1.9395045487534865, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (1.9473458643540305, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (1.948943978266302, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (1.9522572171883503, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (1.9544643431992634, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})
# (1.9557094113731153, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (1.964120788397755, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (1.96735774869191, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (1.9688461943837092, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (450.43720437793763, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (423.8150201085655, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (326.52854136239284, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})
# (289.34063860936385, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (277.0541070490017, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (267.12296731992546, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (236.89657582315158, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (218.99769854904662, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (206.352038190636, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (193.11755265618422, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (558.1237513674423, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (501.6570569383912, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (471.61421102145687, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})
# (385.22984929557424, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (379.7860253904946, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (365.09582845144905, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (353.56139129481744, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (347.1745320268674, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (338.9905341231497, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (317.58896323957015, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})

xx = list(map(stringsVariable,["<<1,1>,3>","<<5,2>,67>","<<7,1>,2>"]))

len(xx)
3

fund(fdep(ff,xx))
# {BsmtQual, GrLivAreaB, MSSubClass, TotalBsmtSFB, YearBuiltB}

hrlent(uub2,hhtrbb,xx,vvbl)
1.9673577486918812

vol(uub2,xx)
96

vol(uub2,fund(fdep(ff,xx)))
740880

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1454 % 1

Continuing on to the 5-tuple,

(kmax,omax,qmax) = (5, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (1.151595152132888, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (1.1541639205451997, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (1.159747665972092, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1.163605416577954, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (1.164104463623639, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (1.1660351927564374, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1.1697783650332028, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (1.1715271391890552, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (1.1839921214905944, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (1.1854837763031236, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (1169.7369261368972, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (1122.8132005772654, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1083.6899385564275, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (991.344032497169, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (953.5517213466175, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (952.6896983821832, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (945.5980858986261, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})
# (908.15744405081, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (790.7029253629565, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (757.1364269623466, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (6800.9578676223755, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (6441.833786427975, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (6371.953254342079, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (6345.806191205978, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})
# (6234.429318904877, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (6038.452056646347, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (5983.207999944687, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (5953.347136735916, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (3819.1537833809853, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (3760.2451288998127, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})

xx = list(map(stringsVariable,["<<1,1>,7>","<<1,1>,53>","<<5,2>,66>","<<7,1>,39>","<<7,1>,42>"]))

len(xx)
5

fund(fdep(ff,xx))
# {1stFlrSFB, BldgType, GarageYrBltB, GrLivAreaB, HalfBath, TotalBsmtSFB, YearRemodAddB}

hrlent(uub2,hhtrbb,xx,vvbl)
1.183992121490582

vol(uub2,xx)
1920

vol(uub2,fund(fdep(ff,xx)))
55427085

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1271 % 1

The 5-tuple model may be compared to the 2-tuple substrate model, above,

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])

len(xx)
2

lent(aatrb,xx,vvbl)
1.2044059887997252

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# ...
# (170.12391234916913, {BsmtUnfSFB, GrLivAreaB})
# ...

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# ...
# (168.78883116031693, {BsmtUnfSFB, GrLivAreaB})
# ...

vol(uub,xx)
462

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1

The 2-tuple substrate model is more query effective but has lower derived alignment and lower relative entropy, so the 5-tuple model is a more robust model.

(kmax,omax,qmax) = (7, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (0.6510428779469537, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (0.6527333874114545, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (0.6579842790563353, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (0.6593833136968277, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (0.6604468154146392, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (0.6614815479610705, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.6625079223227841, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.6637523396828247, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})
# (0.6665422572749868, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (0.6678034581419459, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})

rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (669.8102531949735, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (636.9428039305611, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (632.1125474520368, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})
# (617.1921275716891, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (605.8336896027176, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (604.8143918693091, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (599.0865094568629, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})
# (586.828691073807, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (539.0592319851964, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (534.4426187070898, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (19016.152709960938, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (18762.087646484375, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (18434.231048583984, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (18311.437713623047, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (16355.118576049805, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (16229.837440490723, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})
# (15821.124588012695, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (14755.907958984375, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (14458.577049255371, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (14223.993101119995, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})

xx = list(map(stringsVariable,["<<1,1>,2>","<<1,1>,53>","<<5,2>,66>","<<7,1>,39>","<<10,1>,77>","<<11,3>,61>","<<12,1>,23>"]))

len(xx)
7

fund(fdep(ff,xx))
# {1stFlrSFB, Alley, BldgType, BsmtCond, BsmtFullBath, FullBath, GarageYrBltB, GrLivAreaB, HalfBath, PavedDrive, Street, TotalBsmtSFB, YearRemodAddB}

hrlent(uub2,hhtrbb,xx,vvbl)
0.6527333874114696

vol(uub2,xx)
8640

vol(uub2,fund(fdep(ff,xx)))
124710941250

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1029 % 1

Note that the 7-tuple model derived alignments are lower than for the 5-tuple model, although the relative entropies are higher.

Now skip to the 10-tuple,

(kmax,omax,qmax) = (10, 10, 10)

ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)

rpln(ll)
# (0.26196615178403615, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2625353845839111, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.26511267145079387, {<<1,1>,2>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.26593728122492166, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2669168808919986, {<<1,1>,7>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2670790785114949, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.2673522747237458, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.2674560307104965, {<<1,1>,2>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2707630250485886, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,43>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2707823213240923, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (36967.0, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (36811.25, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (36647.75, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,43>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (35137.3125, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (35091.5625, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (34303.0625, {<<1,1>,2>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (34221.75, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33440.578125, {<<1,1>,2>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33389.84375, {<<1,1>,7>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33076.046875, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})

xx = list(map(stringsVariable,["<<1,1>,2>","<<1,1>,53>","<<2,1>,41>","<<5,2>,66>","<<6,1>,92>","<<7,1>,39>","<<10,1>,73>","<<11,3>,61>","<<12,1>,23>","<<13,1>,9>"]))

len(xx)
10

fund(fdep(ff,xx))
# {1stFlrSFB, Alley, BldgType, BsmtCond, BsmtExposure, BsmtFullBath, CentralAir, Exterior1st, FireplaceQu, FullBath, GarageYrBltB, GrLivAreaB, HalfBath, KitchenAbvGr, PavedDrive, TotalBsmtSFB, YearRemodAddB}

hrlent(uub2,hhtrbb,xx,vvbl)
0.2659372812249323

vol(uub2,xx)
155520

vol(uub2,fund(fdep(ff,xx)))
239445007200000

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 616 % 1

Note that the derived volume is now very large so we have not calculated the derived alignment.

The 10-tuple sub-model of the induced model may be compared to the 3-tuple substrate model, above,

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","LotAreaB"]])

len(xx)
3

lent(aatrb,xx,vvbl)
0.1632160815826591

rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# ...
# (3658.9316126172052, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# ...

vol(uub,xx)
9702

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 369 % 1

The 3-tuple substrate model has similar label entropy and query effectiveness to the 10-tuple sub-model. The 3-tuple model has lower relative entropy, however, so the 10-tuple model is the more likely model. That is, the 10-tuple model is more accurate when effective.

The underlying of the conditional entropy fud decompositions, above, is just the query substrate, $V_{\mathrm{bk}}$. Now let us consider taking the fud decomposition fud variables, $\mathrm{vars}(F)$, of the induced model, $D$, as the underlying. First we must reframe the fud variables,

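# Reframe a fud variable <<f,l>,i> as <<<k,f>,l>,i>; any other variable is returned unchanged.
def refr1(k):
    def refr1k(v):
        if isinstance(v, VarPair):
            (w,i) = v._rep
            if isinstance(w, VarPair):
                (f,l) = w._rep
                if isinstance(f, VarInt):
                    return VarPair((VarPair((VarPair((VarInt(k),f)),l)),i))
        return v
    return refr1k

# Apply the variable mapping f to every variable of the transform tt.
def tframe(f,tt):
    reframe = transformsMapVarsFrame
    nn = sdict([(v,f(v)) for v in tvars(tt)])
    return reframe(tt,nn)

# Reframe every transform of the fud ff.
def fframe(f,ff):
    return qqff([tframe(f,tt) for tt in ffqq(ff)])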

ff1 = fframe(refr1(1),ff)
uub1 = uunion(uub,fsys(ff1))
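
The effect of the reframing can be seen without the repository types. The sketch below mirrors `refr1` on plain nested tuples, representing a fud variable `<<f,l>,i>` as `((f, l), i)`; the tuple representation and the name `reframe` are assumptions made only for illustration.

```python
def reframe(k):
    """Return a function mapping ((f, l), i) to (((k, f), l), i).

    Any value that does not have that shape is returned unchanged.
    """
    def reframe_k(v):
        if isinstance(v, tuple) and len(v) == 2:
            w, i = v
            if isinstance(w, tuple) and len(w) == 2:
                f, l = w
                if isinstance(f, int):
                    return (((k, f), l), i)
        return v
    return reframe_k

r = reframe(1)
print(r(((7, 1), 39)))   # (((1, 7), 1), 39)
print(r("GrLivAreaB"))   # GrLivAreaB
```

Because the frame number is nested inside the fud identifier, a variable that has already been reframed is left as it is, and substrate variables pass through untouched.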

Now we apply the reframed fud to the sample,

hhtrbb = hrfmul(uub1,ff1,hhtrb)
hhtrbb2 = hrhrred(hhtrbb,fvars(ff1)-vvb|vvbl)
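
The expression `fvars(ff1)-vvb|vvbl` relies on Python operator precedence: `-` binds more tightly than `|`, so the substrate is removed first and the label variables are then added back. The sketch below checks this with built-in sets; the variable names are hypothetical stand-ins, not the contents of the actual sets.

```python
# Hypothetical stand-ins for fvars(ff1), vvb and vvbl.
fud_vars = {"f1", "f2", "GrLivAreaB", "SalePriceB"}
substrate = {"GrLivAreaB", "SalePriceB"}
label = {"SalePriceB"}

kept = fud_vars - substrate | label

# Parsed as (fud_vars - substrate) | label, so the label survives.
print(sorted(kept))
print(kept == (fud_vars - substrate) | label)   # True
print(kept == fud_vars - (substrate | label))   # False
```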

Now apply the conditional entropy fud decomper to minimise the label entropy,

def decompercondrr(ll,uu,aa,kmax,omax,fmax):
    return parametersSystemsHistoryRepasDecomperConditionalFmaxRepa(kmax,omax,fmax,uu,ll,aa)

(kmax,omax) = (1,5)

(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,15)

dfund(df2)
# {<<<1,1>,1>,3>, <<<1,1>,1>,53>, <<<1,5>,2>,66>, <<<1,5>,2>,76>, <<<1,6>,1>,3>, <<<1,6>,1>,18>, <<<1,6>,1>,20>, <<<1,7>,1>,2>, <<<1,7>,1>,39>, <<<1,7>,1>,42>, <<<1,15>,1>,2>}

len(dfund(df2))
11

rpln(treesPaths(funcsTreesMap(lambda xx:(fder(xx[1]),fund(xx[1])),dfzz(df2))))
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<5,1>,1>}, {<<<1,7>,1>,42>}), ({<<15,1>,1>}, {<<<1,1>,1>,53>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<11,1>,1>}, {<<<1,5>,2>,76>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<12,1>,1>}, {<<<1,1>,1>,53>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<3,1>,1>}, {<<<1,7>,1>,2>}), ({<<6,1>,1>}, {<<<1,5>,2>,66>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<3,1>,1>}, {<<<1,7>,1>,2>}), ({<<13,1>,1>}, {<<<1,5>,2>,66>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<4,1>,1>}, {<<<1,7>,1>,39>}), ({<<8,1>,1>}, {<<<1,6>,1>,3>}), ({<<14,1>,1>}, {<<<1,15>,1>,2>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<7,1>,1>}, {<<<1,7>,1>,2>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<9,1>,1>}, {<<<1,6>,1>,18>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<10,1>,1>}, {<<<1,6>,1>,20>})]

Consider this model as a predictor of label,


ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))

uub2 = uunion(uub,fsys(ff2))

hhtrbc = hrfmul(uub2,ff2,hhtrb)

hrlent(uub2,hhtrbc,fder(ff2),vvbl)
2.047763453782215

hhtebc = hrfmul(uub2,ff2,hhteb)

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1459 % 1

Continuing on with larger fuds,

(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,63)

len(dfund(df2))
38

ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))

uub2 = uunion(uub,fsys(ff2))

hhtrbc = hrfmul(uub2,ff2,hhtrb)

hrlent(uub2,hhtrbc,fder(ff2),vvbl)
1.46016484022873

hhtebc = hrfmul(uub2,ff2,hhteb)

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1455 % 1

(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,127)

len(dfund(df2))
70

ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))

uub2 = uunion(uub,fsys(ff2))

hhtrbc = hrfmul(uub2,ff2,hhtrb)

hrlent(uub2,hhtrbc,fder(ff2),vvbl)
1.0787486822622814

hhtebc = hrfmul(uub2,ff2,hhteb)

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1454 % 1

The 127-fud conditional over 15-fud induced model has lower label entropy than the 2-tuple substrate model, and is also more query effective,

xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])

len(xx)
2

lent(aatrb,xx,vvbl)
1.2044059887997252

size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1

The 127-fud conditional over 15-fud induced model has similar label entropy to the 31-fud conditional substrate model, but is more query effective,

len(dfund(df))
11

hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.0078399275847598

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1352 % 1

(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,255)

len(dfund(df2))
106

ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))

uub2 = uunion(uub,fsys(ff2))

hhtrbc = hrfmul(uub2,ff2,hhtrb)

hrlent(uub2,hhtrbc,fder(ff2),vvbl)
0.6685934841580492

hhtebc = hrfmul(uub2,ff2,hhteb)

size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1448 % 1

The 255-fud conditional over 15-fud induced model has similar label entropy to the 63-fud conditional substrate model, but is more query effective,

len(dfund(df))
15

hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.7144285840050593

size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1243 % 1

With respect to sale price, we can see that there are sub-models of the induced model which have similar properties to models consisting of subsets of the substrate. Again, when choosing between sub-models of the induced model there is a trade-off between model likelihood and query effectiveness. When choosing between a sub-model of the induced model and a corresponding substrate variable model of similar label entropy and query effectiveness, however, the sub-model is, in general, the more likely model. That is, the sub-model of the induced model is preferable to the substrate model because it is more accurate when it is query effective.
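
The comparisons in this section can be collected in one place. The sketch below only tabulates figures already reported above (label entropy on the training sample and the number of effective test events out of 1459); the helper `effective_fraction` is a hypothetical convenience and the label entropies are rounded to four places.

```python
TEST_SIZE = 1459  # size of the test sample

# (model, label entropy, effective test events), copied from the outputs above.
reported = [
    ("2-tuple substrate", 1.2044, 1395),
    ("127-fud conditional over induced", 1.0787, 1454),
    ("31-fud conditional substrate", 1.0078, 1352),
    ("255-fud conditional over induced", 0.6686, 1448),
    ("63-fud conditional substrate", 0.7144, 1243),
]

def effective_fraction(effective, total=TEST_SIZE):
    """Fraction of test events whose query state occurs in the training sample."""
    return effective / total

for name, label_ent, effective in reported:
    print(f"{name:<34} {label_ent:.4f}  {effective_fraction(effective):.3f}")
```

Read row by row, the conditional models built over the induced model retain a larger effective fraction than the substrate models of comparable label entropy.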

