AMES - House Prices
Sections
Predicting sale price without modelling
Induced modelling of sale price
Introduction
The Ames Housing dataset describes the sale of individual residential property in Ames, Iowa from 2006 to 2010. It was compiled by Dean De Cock for use in data science education. Full details of the dataset are in Kaggle Data Set - House Prices: Advanced Regression Techniques.
The dataset contains 1460 events of 80 variables including SalePrice. There is also a test dataset containing 1459 events of 79 variables excluding SalePrice.
Here’s a brief version of what you’ll find in the data description file:
- SalePrice: The property's sale price in dollars
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
We shall analyse this dataset using the AMESPy repository which depends on the AlignmentRepaPy repository. The AlignmentRepaPy repository is a fast Python implementation of some of the practicable inducers described in the paper. The code in this section can be executed by copying and pasting the code into a Python interpreter, see README. Also see the Introduction in Notation.
Properties of the sample
First load the training sample $A_{\mathrm{tr}}$ and the test sample $A_{\mathrm{te}}$,
from AMESDev import *
(uu,aatr,aate) = amesIO()
vv = uvars(uu) - sset([VarStr("Id")])
vvl = sset([VarStr("SalePrice")])
vvk = vv - vvl
size(aatr)
# 1460 % 1
size(aate)
# 1459 % 1
len(vv)
80
The system is $U$. The sample substrate variables are $V = \mathrm{vars}(A_{\mathrm{tr}}) \setminus \{\mathrm{Id}\}$, the label variables are $V_{\mathrm{l}} = \{\mathrm{SalePrice}\}$, and the query variables form the remainder, $V_{\mathrm{k}} = V \setminus V_{\mathrm{l}}$.
Now create a joint sample on the query variables $A = A_{\mathrm{tr}}\%V_{\mathrm{k}} + A_{\mathrm{te}}\%V_{\mathrm{k}}$,
aa = add(red(aatr,vvk),red(aate,vvk))
size(aa)
# 2919 % 1
So $\mathrm{vars}(A) = V_{\mathrm{k}}$.
The variable valencies are $\{(w,|U_w|) : w \in V\}$,
rpln(sset([(vol(uu,sset([w])),w) for w in vv]))
# (2, CentralAir)
# (2, Street)
# (3, Alley)
# ...
# (10, OverallQual)
# (10, SaleType)
# (12, MoSold)
# (14, PoolArea)
# (14, TotRmsAbvGrd)
# (16, Exterior1st)
# (16, MSSubClass)
# (17, Exterior2nd)
# (25, Neighborhood)
# (31, 3SsnPorch)
# (36, LowQualFinSF)
# (38, MiscVal)
# (61, YearRemodAdd)
# (104, GarageYrBlt)
# (118, YearBuilt)
# (121, ScreenPorch)
# (129, LotFrontage)
# (183, EnclosedPorch)
# (252, OpenPorchSF)
# (273, BsmtFinSF2)
# (379, WoodDeckSF)
# (445, MasVnrArea)
# (604, GarageArea)
# (635, 2ndFlrSF)
# (663, SalePrice)
# (992, BsmtFinSF1)
# (1059, TotalBsmtSF)
# (1083, 1stFlrSF)
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)
In order to construct tuples with more than one variable, the valencies of some of the variables with ordered values can be reframed into buckets. Module AMESDev has a function isOrd that determines which variables can be bucketed,
rpln(sset([(vol(uu,sset([w])),w) for w in vv if isOrd(uu,w)]))
# (3, HalfBath)
# (4, BsmtHalfBath)
# ...
# (14, PoolArea)
# (14, TotRmsAbvGrd)
# (16, MSSubClass)
# (31, 3SsnPorch)
# ...
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)
rpln(sset([(u,w) for w in vv for u in [vol(uu,sset([w]))] if isOrd(uu,w) if u > 16]))
# (31, 3SsnPorch)
# (36, LowQualFinSF)
# (38, MiscVal)
# (61, YearRemodAdd)
# (104, GarageYrBlt)
# (118, YearBuilt)
# (121, ScreenPorch)
# (129, LotFrontage)
# (183, EnclosedPorch)
# (252, OpenPorchSF)
# (273, BsmtFinSF2)
# (379, WoodDeckSF)
# (445, MasVnrArea)
# (604, GarageArea)
# (635, 2ndFlrSF)
# (663, SalePrice)
# (992, BsmtFinSF1)
# (1059, TotalBsmtSF)
# (1083, 1stFlrSF)
# (1136, BsmtUnfSF)
# (1292, GrLivArea)
# (1951, LotArea)
vvo = sset([w for w in vv for u in [vol(uu,sset([w]))] if isOrd(uu,w) if u > 16])
rpln(aall(red(aa,sset([VarStr("3SsnPorch")]))))
# ({(3SsnPorch, 0)}, 2882 % 1)
# ({(3SsnPorch, 23)}, 1 % 1)
# ({(3SsnPorch, 86)}, 1 % 1)
# ...
# ({(3SsnPorch, 360)}, 1 % 1)
# ({(3SsnPorch, 407)}, 1 % 1)
# ({(3SsnPorch, 508)}, 1 % 1)
rpln(aall(red(aa,sset([VarStr("LotArea")]))))
# ({(LotArea, 1300)}, 1 % 1)
# ({(LotArea, 1470)}, 1 % 1)
# ({(LotArea, 1476)}, 1 % 1)
# ...
# ({(LotArea, 159000)}, 1 % 1)
# ({(LotArea, 164660)}, 1 % 1)
# ({(LotArea, 215245)}, 1 % 1)
Let us determine which variables treat ValStr "null" as a special case,
rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValStr("null"))])]))] for bb in [mul(red(aa,sset([w])),rr)] if size(bb) > 0]))
# (1 % 1, BsmtFinSF1)
# (1 % 1, BsmtFinSF2)
# (1 % 1, BsmtUnfSF)
# (1 % 1, GarageArea)
# (1 % 1, TotalBsmtSF)
# (23 % 1, MasVnrArea)
# (159 % 1, GarageYrBlt)
# (486 % 1, LotFrontage)
rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValStr("null"))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 0]))
# (8 % 1, MasVnrArea)
# (81 % 1, GarageYrBlt)
# (259 % 1, LotFrontage)
Let us determine which variables treat ValInt 0 as a special case,
rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aa,sset([w])),rr)] if size(bb) > 200]))
# (241 % 1, BsmtUnfSF)
# (929 % 1, BsmtFinSF1)
# (1298 % 1, OpenPorchSF)
# (1523 % 1, WoodDeckSF)
# (1668 % 1, 2ndFlrSF)
# (1738 % 1, MasVnrArea)
# (2460 % 1, EnclosedPorch)
# (2571 % 1, BsmtFinSF2)
# (2663 % 1, ScreenPorch)
# (2816 % 1, MiscVal)
# (2879 % 1, LowQualFinSF)
# (2882 % 1, 3SsnPorch)
rpln(sset([(size(bb),w) for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 100]))
# (118 % 1, BsmtUnfSF)
# (467 % 1, BsmtFinSF1)
# (656 % 1, OpenPorchSF)
# (761 % 1, WoodDeckSF)
# (829 % 1, 2ndFlrSF)
# (861 % 1, MasVnrArea)
# (1252 % 1, EnclosedPorch)
# (1293 % 1, BsmtFinSF2)
# (1344 % 1, ScreenPorch)
# (1408 % 1, MiscVal)
# (1434 % 1, LowQualFinSF)
# (1436 % 1, 3SsnPorch)
vvoz = sset([w for w in vvk & vvo for rr in [unit(sset([llss([(w,ValInt(0))])]))] for bb in [mul(red(aatr,sset([w])),rr)] if size(bb) > 100])
len(vvo)
22
len(vvoz)
12
There are 22 orderable variables, of which 12 treat ValInt 0 as a special case.
Now let us reframe to valencies of 20,
xx = sdict()
for v in vvk & (vvo - vvoz):
xx[v] = (VarStr(str(v)+"B"),bucket(20,aa,v))
xx[VarStr("SalePrice")] = (VarStr("SalePrice"+"B"),bucket(20,aatr,VarStr("SalePrice")))
for v in vvk & vvoz:
rr = unit(sset([llss([(v,ValInt(0))])]))
bb = mul(red(aa,sset([v])),rr)
aa1 = trim(sub(red(aa,sset([v])),bb))
xx[v] = (VarStr(str(v)+"B"),bucket(20,aa1,v))
aab = reframeb(aa,xx)
aatrb = reframeb(aatr,xx)
aateb = reframeb(aate,xx)
uub = uunion(sys(aab),uunion(sys(aatrb),sys(aateb)))
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl
rpln(sset([(vol(uub,sset([w])),w) for w in vvb]))
# (2, CentralAir)
# (2, Street)
# (3, Alley)
# (3, HalfBath)
# (3, LandSlope)
# (3, PavedDrive)
# (3, Utilities)
# (4, BsmtHalfBath)
# ...
# (14, TotRmsAbvGrd)
# (16, Exterior1st)
# (16, MSSubClass)
# (17, Exterior2nd)
# (18, LotFrontageB)
# (19, YearRemodAddB)
# ...
# (23, ScreenPorchB)
# (25, Neighborhood)
# (31, 3SsnPorchB)
rpln([(ss,q) for w in vvbk for (ss,q) in aall(red(aab,sset([w])))])
rpln([(ss,q) for w in vvbk for (ss,q) in aall(red(aatrb,sset([w])))])
The bucketed system is $U_{\mathrm{b}}$. The bucketed joint sample is $A_{\mathrm{b}}$, the bucketed training sample is $A_{\mathrm{trb}}$ and the bucketed test sample is $A_{\mathrm{teb}}$. The bucketed sample substrate variables are $V_{\mathrm{b}}$, the bucketed label variables are $V_{\mathrm{bl}} = \{\mathrm{SalePriceB}\}$, and the bucketed query variables are $V_{\mathrm{bk}}$.
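The reframing into buckets can be pictured with a small standalone sketch. It assumes equal-frequency buckets labelled by their upper bound, with one special value (such as 0) kept as its own bucket. This is only one plausible scheme and is not necessarily what `bucket` and `reframeb` in AMESDev do. The names `equal_frequency_bounds` and `to_bucket` are hypothetical.

```python
def equal_frequency_bounds(values, n):
    """Return up to n upper bounds splitting the sorted values into roughly equal groups."""
    ordered = sorted(values)
    bounds = []
    for k in range(1, n + 1):
        # Index of the last element of the k-th group.
        idx = max(0, (k * len(ordered)) // n - 1)
        # Skip repeated bounds, so heavily tied values yield fewer buckets.
        if not bounds or ordered[idx] != bounds[-1]:
            bounds.append(ordered[idx])
    return bounds

def to_bucket(value, bounds, special=None):
    """Map a value to the smallest bound that is >= value; a special value maps to itself."""
    if value == special:
        return value
    for b in bounds:
        if value <= b:
            return b
    return bounds[-1]

# Hypothetical areas: zeros are excluded before the bounds are computed,
# mirroring the special treatment of ValInt 0 described in the text.
areas = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bounds = equal_frequency_bounds([a for a in areas if a != 0], 5)
print(bounds)                           # [2, 4, 6, 8, 10]
print(to_bucket(3, bounds, special=0))  # 4
print(to_bucket(0, bounds, special=0))  # 0
```

Keeping the special value outside the bucket bounds stops a large mass of zeros from swallowing several of the buckets.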
For convenience, the bucketing is encapsulated in amesBucketedIO in AMESDev,
from AMESDev import *
(uub,aab,aatrb,aateb) = amesBucketedIO(20)
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl
The geometric mean bucketed valency of the substrate, $|V_{\mathrm{b}}^{\mathrm{C}}|^{1/|V_{\mathrm{b}}|}$, is,
exp(log(vol(uub,vvb))/len(vvb))
8.421852632661576
The label variable dimension, $|V_{\mathrm{bl}}|$, is,
len(vvbl)
1
The label bucketed variable volume, $|V_{\mathrm{bl}}^{\mathrm{C}}|$, is,
vol(uub,vvbl)
20
The query variable dimension, $|V_{\mathrm{bk}}|$, is,
len(vvbk)
79
The geometric mean query bucketed valency, $|V_{\mathrm{bk}}^{\mathrm{C}}|^{1/|V_{\mathrm{bk}}|}$, is,
exp(log(vol(uub,vvbk))/len(vvbk))
8.330151968320083
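The geometric mean valency is the n-th root of the volume, where the volume is the product of the valencies of the n variables. As a minimal standalone sketch, independent of AlignmentRepaPy, the same quantity can be computed in log space from a plain list of valencies. The function name and the example valencies are hypothetical.

```python
import math

def geometric_mean_valency(valencies):
    """Return the n-th root of the product of the valencies, computed in log space."""
    return math.exp(sum(math.log(u) for u in valencies) / len(valencies))

# Hypothetical valencies for three variables (not taken from the dataset).
example = [2, 4, 8]
print(geometric_mean_valency(example))  # approximately 4.0, since 2*4*8 = 64 = 4**3
```

Working in log space avoids overflow when the volume of many variables is too large to hold as a floating-point product.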
The bucketed sample size, $\mathrm{size}(A_{\mathrm{b}})$, is
size(aab)
# 2919 % 1
Nearly all effective states correspond to exactly one event,
size(eff(aab))
# 2916 % 1
The bucketed training sample size, $\mathrm{size}(A_{\mathrm{trb}})$, is
size(aatrb)
# 1460 % 1
All bucketed effective states correspond to exactly one event, $A_{\mathrm{trb}} = A_{\mathrm{trb}}^{\mathrm{F}}$,
size(eff(aatrb))
# 1460 % 1
Now consider how highly aligned variables might be grouped together. See Entropy and alignment. First consider pairs in the substrate, $V_{\mathrm{b}}$, \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%\{w,x\}),~w,~x) : w,x \in V_{\mathrm{b}},~w < x\} \]
rpln(reversed(list(sset([(algn(red(aatrb,sset([w,x]))),w,x) for w in vvb for x in vvb if w < x]))))
# (2465.5152987646425, GarageYrBltB, YearBuiltB)
# (2152.5485832484987, Exterior1st, Exterior2nd)
# (1978.2349802971114, YearBuiltB, YearRemodAddB)
# (1858.3580587963286, 1stFlrSFB, TotalBsmtSFB)
# (1724.7876321508177, GarageYrBltB, YearRemodAddB)
# (1599.6055353637776, HouseStyle, MSSubClass)
# (1568.0764166986735, 1stFlrSFB, GrLivAreaB)
# (1324.8418737030024, Neighborhood, YearBuiltB)
# (1142.4421276400722, GarageAreaB, GarageCars)
# (1058.2972140759603, GarageYrBltB, Neighborhood)
# (1016.1760916141229, 2ndFlrSFB, MSSubClass)
# (1002.6330777071075, BsmtFinSF1B, BsmtFinType1)
# (1001.7919977485235, 2ndFlrSFB, HouseStyle)
# (999.2949685620983, GrLivAreaB, TotalBsmtSFB)
# (997.9601406145066, FireplaceQu, Fireplaces)
# (959.3978900891489, MasVnrAreaB, MasVnrType)
# (856.1859130216717, Foundation, YearBuiltB)
# (837.2461935016695, MSSubClass, YearBuiltB)
# (835.5545116214507, MSSubClass, Neighborhood)
# (819.6253930916721, Neighborhood, YearRemodAddB)
# ...
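The pairwise alignments listed above can be illustrated with a small standalone sketch. It assumes the definition of alignment as the sum of $\ln \Gamma(A_S + 1)$ over the states of the histogram, minus the same sum over the independent histogram (the product of the marginals scaled to the same size). See Entropy and alignment for the authoritative definition. The function `alignment` below is a hypothetical stand-in for `algn` and works on a plain list of event tuples.

```python
import math
from collections import Counter
from itertools import product

def alignment(events):
    """Sum of ln Gamma(count + 1) over the histogram, minus the same sum over
    the independent histogram (product of marginals, scaled to the sample size)."""
    z = len(events)
    n = len(events[0])
    hist = Counter(events)
    marginals = [Counter(e[i] for e in events) for i in range(n)]
    actual = sum(math.lgamma(c + 1) for c in hist.values())
    independent = 0.0
    # Visit every state in the cartesian product of the observed values.
    for state in product(*[m.keys() for m in marginals]):
        expected = float(z)
        for i, v in enumerate(state):
            expected *= marginals[i][v] / z
        independent += math.lgamma(expected + 1)
    return actual - independent

# Hypothetical two-variable samples.
diagonal = [(0, 0), (0, 0), (1, 1), (1, 1)]  # the variables always agree
uniform = [(0, 0), (0, 1), (1, 0), (1, 1)]   # the variables are independent
print(alignment(diagonal))  # about 1.386, that is 2 * ln 2
print(alignment(uniform))   # 0.0
```

The sketch enumerates the whole cartesian volume, so it is only practical for small tuples such as the pairs considered here.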
We can see that some of the variables that are in highly aligned pairs are also in other highly aligned pairs, e.g. YearBuiltB or Neighborhood. This suggests that we should also consider tuple dimensions greater than two.
Now consider using the tupler to group together highly aligned variables in the substrate, $V_{\mathrm{b}}$. Note that for performance reasons we must first construct a HistoryRepa from the sample histogram, $A_{\mathrm{trb}}$. See History and HistoryRepa.
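First consider the tuple dimension by choosing a volume limit, xmax. The calculations below compare candidate tuple volumes with the sample sizes: three and four variables at the geometric mean valencies, the two highest bucketed valencies (25 and 31), and the seven and eight lowest valencies,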
8.330151968320083 ** 3
578.041172320828
8.421852632661576 ** 3
597.341809663622
25*31
775
2*2*3*3*3*3*3
972
size(aatrb)
# 1460 % 1
size(aab)
# 2919 % 1
2*2*3*3*3*3*3*4
3888
8.330151968320083 ** 4
4815.170809378394
8.421852632661576 ** 4
5030.724692314405
Now create a shuffled sample, $A_{\mathrm{trbr}}$,
hhtrb = hrhrred(aahr(uub,aatrb),vvb)
hhtrbr = historyRepasShuffle_u(hhtrb,1)
hrsize(hhtrbr)
1460
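The purpose of the shuffled sample is to provide a baseline in which each variable keeps its own distribution but any alignment between variables is destroyed. The following standalone sketch assumes that shuffling means permuting each variable's column independently. That is an assumption about `historyRepasShuffle_u`, not a description of its implementation. The name `shuffle_columns` is hypothetical.

```python
import random
from collections import Counter

def shuffle_columns(events, seed):
    """Permute each column independently, keeping every marginal unchanged."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*events)]
    for col in columns:
        rng.shuffle(col)
    return list(zip(*columns))

# Hypothetical sample in which the two variables are perfectly aligned.
events = [(i % 3, i % 3) for i in range(12)]
shuffled = shuffle_columns(events, 1)

# The marginal histograms are identical before and after the shuffle.
for i in range(2):
    assert Counter(e[i] for e in events) == Counter(e[i] for e in shuffled)
print(len(shuffled))  # 12
```

Subtracting the alignment of the shuffled sample from that of the original then removes the part of the alignment that arises by chance at this sample size.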
Now optimise the shuffle content alignment with the tuple set builder, $I_{P,U,\mathrm{B,ns,me}}$, \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}},~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]
def buildtuprr(xmax,omax,bmax,uu,vv,xx,xxrr):
return reversed(list(sset([(algn(rraa(uu,hrred(xx,kk))) - algn(rraa(uu,hrred(xxrr,kk))), kk) for ((kk,_),_) in parametersSystemsBuilderTupleNoSumlayerMultiEffectiveRepa_ui(xmax,omax,bmax,1,uu,vv,fudEmpty(),xx,hrhx(xx),xxrr,hrhx(xxrr))[0]])))
rpln(buildtuprr(1460,10,10,uub,vvb,hhtrb,hhtrbr))
# (2289.22205538067, {GarageYrBltB, YearBuiltB})
# (2287.0736209675033, {GarageYrBltB, Utilities, YearBuiltB})
# (2281.3123335781984, {GarageYrBltB, Street, YearBuiltB})
# (2268.490561832467, {BldgType, CentralAir, HouseStyle, MSSubClass})
# (2266.4793952001714, {GarageYrBltB, PavedDrive, YearBuiltB})
# (2263.6360419083876, {CentralAir, GarageYrBltB, YearBuiltB})
# (2230.8809389216776, {Alley, GarageYrBltB, YearBuiltB})
# (2219.1688869176696, {BldgType, HouseStyle, MSSubClass})
# (2217.2589976365557, {BldgType, HouseStyle, MSSubClass, Street})
# (2204.396018461853, {GarageYrBltB, LandSlope, YearBuiltB})
Now optimise again having removed the top tuple from the substrate, \[ Q_1~=~\{\mathrm{GarageYrBltB},~\mathrm{YearBuiltB}\} \] and \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}} \setminus Q_1,~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]
qq1 = sset([VarStr(s) for s in ["GarageYrBltB","YearBuiltB"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1,hhtrb,hhtrbr))
# (2268.490561832467, {BldgType, CentralAir, HouseStyle, MSSubClass})
# (2219.1688869176696, {BldgType, HouseStyle, MSSubClass})
# (2217.2589976365557, {BldgType, HouseStyle, MSSubClass, Street})
# (2188.1493956203217, {ExterQual, Exterior1st, Exterior2nd})
# (2173.3056986406436, {Exterior1st, Exterior2nd, HeatingQC})
# (2168.0794578894715, {BsmtQual, Exterior1st, Exterior2nd})
# (2148.7255972894404, {CentralAir, Exterior1st, Exterior2nd})
# (2142.8877269264362, {CentralAir, Exterior1st, Exterior2nd, Street})
# (2132.5055547558827, {Exterior1st, Exterior2nd, FullBath})
# (2125.3962435527606, {Exterior1st, Exterior2nd, PoolQC})
Now optimise again having removed the top two tuples from the substrate, \[ Q_2~=~\{\mathrm{BldgType},~\mathrm{CentralAir},~\mathrm{HouseStyle},~\mathrm{MSSubClass},~\mathrm{Street}\} \] and \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%K)-\mathrm{algn}(A_{\mathrm{trbr}}\%K),~K) : ((K,\cdot,\cdot),\cdot) \in I_{P,U,\mathrm{B,ns,me}}^{ * }((V_{\mathrm{b}} \setminus Q_1 \setminus Q_2,~\emptyset,~A_{\mathrm{trb}},~A_{\mathrm{trbr}}))\} \]
qq2 = sset([VarStr(s) for s in ["BldgType","CentralAir","HouseStyle","MSSubClass","Street"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2,hhtrb,hhtrbr))
# (2188.1493956203217, {ExterQual, Exterior1st, Exterior2nd})
# (2173.3056986406436, {Exterior1st, Exterior2nd, HeatingQC})
# (2168.0794578894715, {BsmtQual, Exterior1st, Exterior2nd})
# (2132.5055547558827, {Exterior1st, Exterior2nd, FullBath})
# (2125.3962435527606, {Exterior1st, Exterior2nd, PoolQC})
# (2123.4159165046276, {Exterior1st, Exterior2nd})
# (2121.5205221457086, {Exterior1st, Exterior2nd, Utilities})
# (2118.647962265917, {Exterior1st, Exterior2nd, GarageFinish})
# (2111.4491188041366, {Exterior1st, Exterior2nd, KitchenQual})
# (2110.495612404761, {Exterior1st, Exterior2nd, KitchenAbvGr})
Then continue in the same manner,
qq3 = sset([VarStr(s) for s in ["ExterQual","Exterior1st","Exterior2nd"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3,hhtrb,hhtrbr))
# (1694.5493452101775, {1stFlrSFB, TotalBsmtSFB, Utilities})
# (1693.1630508490578, {1stFlrSFB, TotalBsmtSFB})
# (1620.8665578986738, {1stFlrSFB, LandSlope, TotalBsmtSFB})
# (1617.1756900313953, {1stFlrSFB, PavedDrive, TotalBsmtSFB})
# (1611.3552018153305, {1stFlrSFB, Alley, TotalBsmtSFB})
# (1491.5366482964555, {1stFlrSFB, HalfBath, TotalBsmtSFB})
# (1473.1977583285745, {1stFlrSFB, GrLivAreaB, HalfBath})
# (1412.9034649265147, {GarageAreaB, GarageCars, GarageFinish})
# (1402.9331016819592, {1stFlrSFB, GrLivAreaB})
# (1401.4667646131659, {1stFlrSFB, GrLivAreaB, Utilities})
qq4 = sset([VarStr(s) for s in ["1stFlrSFB","TotalBsmtSFB","Utilities"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4,hhtrb,hhtrbr))
# (1412.9034649265147, {GarageAreaB, GarageCars, GarageFinish})
# (1331.0976721076222, {GarageAreaB, GarageCars, GarageType})
# (1323.8535812808213, {GarageAreaB, GarageCars, GarageQual})
# (1310.561464900003, {GarageAreaB, GarageCars, GarageCond})
# (1309.2152777779943, {BsmtFinSF1B, BsmtFinType1, BsmtFullBath})
# (1285.4814209909187, {BsmtQual, GarageAreaB, GarageCars})
# (1271.7784622620688, {FullBath, GarageAreaB, GarageCars})
# (1224.471189313343, {Foundation, GarageAreaB, GarageCars})
# (1223.657567001395, {GarageAreaB, GarageCars, KitchenQual})
# (1214.9020691181922, {BsmtQual, Foundation, Neighborhood})
qq5 = sset([VarStr(s) for s in ["GarageAreaB","GarageCars","GarageFinish","GarageType","GarageQual","GarageCond"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5,hhtrb,hhtrbr))
# (1453.4771115474682, {BsmtQual, FireplaceQu, Fireplaces, Foundation})
# (1363.822442537943, {BsmtQual, FireplaceQu, Fireplaces, KitchenQual})
# (1333.0081210204607, {BsmtFinType1, BsmtQual, FireplaceQu, Fireplaces})
# (1319.9957903960371, {BsmtQual, FireplaceQu, Fireplaces, FullBath})
# (1309.2152777779943, {BsmtFinSF1B, BsmtFinType1, BsmtFullBath})
# (1227.7053704690293, {BsmtExposure, BsmtQual, FireplaceQu, Fireplaces})
# (1214.9020691181922, {BsmtQual, Foundation, Neighborhood})
# (1188.545222003212, {BsmtCond, BsmtQual, FireplaceQu, Fireplaces})
# (1187.2707174927878, {BsmtQual, FireplaceQu, Fireplaces, HeatingQC})
# (1186.9910298796954, {BsmtFinSF1B, BsmtFinType1, BsmtQual})
qq6 = sset([VarStr(s) for s in ["BsmtQual","FireplaceQu","Fireplaces","Foundation","KitchenQual","FullBath","BsmtFinType1","BsmtFullBath","BsmtExposure","BsmtCond"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5-qq6,hhtrb,hhtrbr))
# (1021.5872634666134, {MasVnrAreaB, MasVnrType, OverallQual})
# (1008.192002264328, {LandContour, LandSlope, MasVnrAreaB, MasVnrType})
# (985.2892683218874, {HalfBath, KitchenAbvGr, MasVnrAreaB, MasVnrType})
# (975.6594546625283, {MasVnrAreaB, MasVnrType, SaleCondition})
# (974.0412841880125, {MasVnrAreaB, MasVnrType, SaleType})
# (971.2954032113157, {HalfBath, MasVnrAreaB, MasVnrType})
# (967.3208590480517, {HalfBath, MasVnrAreaB, MasVnrType, PoolQC})
# (963.0704190077258, {BsmtHalfBath, HalfBath, MasVnrAreaB, MasVnrType})
# (958.3037608474738, {HalfBath, LandSlope, MasVnrAreaB, MasVnrType})
# (953.7850012379358, {MasVnrAreaB, MasVnrType, RoofStyle})
qq7 = sset([VarStr(s) for s in ["MasVnrAreaB","MasVnrType","OverallQual","LandContour","LandSlope","SaleCondition","BsmtHalfBath","HalfBath","KitchenAbvGr","PoolQC","RoofStyle","SaleType"]])
rpln(buildtuprr(1460,10,10,uub,vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7,hhtrb,hhtrbr))
# (746.2103566256701, {Alley, MSZoning, Neighborhood, PavedDrive})
# (746.0276233365403, {HeatingQC, MSZoning, Neighborhood})
# (732.7649052052839, {GrLivAreaB, TotRmsAbvGrd})
# (731.0542770424454, {Alley, GrLivAreaB, TotRmsAbvGrd})
# (726.4489731521944, {GrLivAreaB, PavedDrive, TotRmsAbvGrd})
# (720.8607196839353, {MSZoning, Neighborhood, OverallCond})
# (680.9994890584201, {Neighborhood, PavedDrive, YearRemodAddB})
# (676.152201640848, {BedroomAbvGr, MSZoning, Neighborhood})
# (663.0878461206503, {Alley, MSZoning, Neighborhood})
# (658.3548208268667, {Neighborhood, YearRemodAddB})
len(vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7)
39
After this selection of 7 tuples there are 39 less closely aligned variables remaining.
That is, there is a possible partition of the substrate as follows, $\bigcup\{Q_1,~Q_2,~Q_3,~Q_4,~Q_5,~Q_6,~Q_7,~V_{\mathrm{b}} \setminus \bigcup\{Q_1,Q_2,Q_3,Q_4,Q_5,Q_6,Q_7\}\} = V_{\mathrm{b}}$,
qq1
# {GarageYrBltB, YearBuiltB}
qq2
# {BldgType, CentralAir, HouseStyle, MSSubClass, Street}
qq3
# {ExterQual, Exterior1st, Exterior2nd}
qq4
# {1stFlrSFB, TotalBsmtSFB, Utilities}
qq5
# {GarageAreaB, GarageCars, GarageCond, GarageFinish, GarageQual, GarageType}
qq6
# {BsmtCond, BsmtExposure, BsmtFinType1, BsmtFullBath, BsmtQual, FireplaceQu, Fireplaces, Foundation, FullBath, KitchenQual}
qq7
# {BsmtHalfBath, HalfBath, KitchenAbvGr, LandContour, LandSlope, MasVnrAreaB, MasVnrType, OverallQual, PoolQC, RoofStyle, SaleCondition, SaleType}
vvb-qq1-qq2-qq3-qq4-qq5-qq6-qq7
# {2ndFlrSFB, 3SsnPorchB, Alley, BedroomAbvGr, BsmtFinSF1B, BsmtFinSF2B, BsmtFinType2, BsmtUnfSFB, Condition1, Condition2, Electrical, EnclosedPorchB, ExterCond, Fence, Functional, GrLivAreaB, Heating, HeatingQC, LotAreaB, LotConfig, LotFrontageB, LotShape, LowQualFinSFB, MSZoning, MiscFeature, MiscValB, MoSold, Neighborhood, OpenPorchSFB, OverallCond, PavedDrive, PoolArea, RoofMatl, SalePriceB, ScreenPorchB, TotRmsAbvGrd, WoodDeckSFB, YearRemodAddB, YrSold}
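The claim that these components form a partition can be checked mechanically: the components must be pairwise disjoint and their union must equal the substrate. A minimal sketch with plain Python sets follows. The helper `is_partition` and the shortened variable lists are hypothetical and stand in for the `sset` values above.

```python
def is_partition(groups, universe):
    """True if the groups are pairwise disjoint and together cover the universe."""
    union = set().union(*groups)
    total = sum(len(g) for g in groups)
    # Equal sizes mean no element was counted twice, so the groups are disjoint.
    return union == set(universe) and total == len(union)

# Hypothetical, shortened substrate.
substrate = {"GarageYrBltB", "YearBuiltB", "Exterior1st", "Exterior2nd", "MoSold"}
q1 = {"GarageYrBltB", "YearBuiltB"}
q3 = {"Exterior1st", "Exterior2nd"}
rest = substrate - q1 - q3
print(is_partition([q1, q3, rest], substrate))  # True
```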
Predicting sale price without modelling
The sample query variables predict sale price. That is, there is a functional or causal relationship between the query variables and the label variables, $(A_{\mathrm{trb}}\%V_{\mathrm{bk}})^{\mathrm{FS}} \to (A_{\mathrm{trb}}\%V_{\mathrm{bl}})^{\mathrm{FS}}$. So the label entropy or query conditional entropy is zero. See Entropy and alignment. The label entropy is \[ \begin{eqnarray} \mathrm{lent}(A,W,L)~:=~\mathrm{entropy}(A~\%~(W \cup L)) - \mathrm{entropy}(A~\%~W) \end{eqnarray} \]
def lent(aa,ww,vvl):
return ent(red(aa,ww|vvl)) - ent(red(aa,ww))
Then $\mathrm{lent}(A_{\mathrm{trb}},V_{\mathrm{bk}},V_{\mathrm{bl}}) = 0$,
lent(aatrb,vvbk,vvbl)
0.0
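To make the definition concrete without the AlignmentRepaPy types, here is a standalone sketch of the label entropy on a plain list of event tuples. It uses natural logarithms, which is an assumption about `ent`. The names `entropy` and `label_entropy`, and the example events, are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of a histogram given as a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def label_entropy(events, query_idx, label_idx):
    """entropy(query and label columns) - entropy(query columns)."""
    joint = Counter(tuple(e[i] for i in query_idx + label_idx) for e in events)
    query = Counter(tuple(e[i] for i in query_idx) for e in events)
    return entropy(joint.values()) - entropy(query.values())

# Hypothetical events: (neighbourhood, quality, price bucket).
events = [
    ("A", 1, "low"),
    ("A", 2, "high"),
    ("B", 1, "low"),
    ("B", 2, "high"),
]
print(label_entropy(events, [1], [2]))  # 0.0: quality determines the price bucket
print(label_entropy(events, [0], [2]))  # about 0.693 (ln 2): neighbourhood does not
```

A label entropy of zero means each query state maps to exactly one label state, which is the functional relationship described above.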
We can determine which of the query variables has the least conditional entropy, \[ \begin{eqnarray} \{(\mathrm{lent}(A_{\mathrm{trb}},\{w\},V_{\mathrm{bl}}),~w) : w \in V_{\mathrm{bk}}\} \end{eqnarray} \]
rpln(sset([(lent(aatrb,sset([w]),vvbl),w) for w in vvbk]))
# (2.3688094014030585, Neighborhood)
# (2.401775936638896, OverallQual)
# (2.438375883488403, GrLivAreaB)
# (2.505387552265462, GarageAreaB)
# (2.5141550725111537, TotalBsmtSFB)
# (2.5286425658331457, YearBuiltB)
# (2.575379618536724, GarageYrBltB)
# (2.5772799825200803, 1stFlrSFB)
# (2.6072098566399333, GarageCars)
# (2.626737288605211, YearRemodAddB)
# (2.633387515941217, MSSubClass)
# (2.652820702365173, BsmtQual)
# (2.660976881941984, 2ndFlrSFB)
# (2.662266429504159, ExterQual)
# ...
# (2.9763251921097145, LandSlope)
# (2.980341704620559, PoolArea)
# (2.9844981605029184, PoolQC)
# (2.9889645372889797, Street)
# (2.9928053874878207, Utilities)
This may be compared to the entropy of the label variables, $\mathrm{entropy}(A_{\mathrm{trb}}\%V_{\mathrm{bl}})$,
ent(red(aatrb,vvbl))
2.9948072760546887
Utilities has the highest conditional entropy, and so makes very little prediction of sale price. Neighborhood has the least conditional entropy, and so is more predictive of sale price. Its label entropy is $\mathrm{lent}(A_{\mathrm{trb}},\{\mathrm{Neighborhood}\},V_{\mathrm{bl}})$,
vNeighborhood = VarStr("Neighborhood")
lent(aatrb,sset([vNeighborhood]),vvbl)
2.3688094014030585
Let us reduce the sample, $A_{\mathrm{trb}}~\%~(\{\mathrm{Neighborhood}\} \cup V_{\mathrm{bl}})$, to see the relationship,
rpln(aall(red(aatrb,sset([vNeighborhood])|vvbl)))
# ({(Neighborhood, Blmngtn), (SalePriceB, 163000)}, 2 % 1)
# ({(Neighborhood, Blmngtn), (SalePriceB, 172500)}, 2 % 1)
# ({(Neighborhood, Blmngtn), (SalePriceB, 179200)}, 3 % 1)
# ...
# ({(Neighborhood, Veenker), (SalePriceB, 278000)}, 1 % 1)
# ({(Neighborhood, Veenker), (SalePriceB, 326000)}, 2 % 1)
# ({(Neighborhood, Veenker), (SalePriceB, 755000)}, 1 % 1)
rpln(ssplit(vvbk,states(red(aatrb,sset([vNeighborhood])|vvbl))))
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 163000)})
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 172500)})
# ({(Neighborhood, Blmngtn)}, {(SalePriceB, 179200)})
# ...
# ({(Neighborhood, Veenker)}, {(SalePriceB, 278000)})
# ({(Neighborhood, Veenker)}, {(SalePriceB, 326000)})
# ({(Neighborhood, Veenker)}, {(SalePriceB, 755000)})
We can determine minimum subsets of the query variables that are causal or predictive by using the repa conditional entropy tuple set builder. We shall also calculate the shuffle content derived alignment and the size-volume-sized-shuffle relative entropy. \[ \{(\mathrm{lent}(A_{\mathrm{trb}},M,V_{\mathrm{bl}}),~M) : M \in \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trb}},\mathrm{L}}))\} \]
def buildcondrr(vvl,aa,kmax,omax,qmax):
return sset([(b,a) for (a,b) in parametersBuilderConditionalVarsRepa(kmax,omax,qmax,vvl,aa).items()])
(kmax,omax,qmax) = (1, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (2.3688094014030727, {Neighborhood})
# (2.4017759366388978, {OverallQual})
# (2.4383758834884066, {GrLivAreaB})
# (2.5053875522654745, {GarageAreaB})
# (2.5141550725111603, {TotalBsmtSFB})
# (2.5286425658331524, {YearBuiltB})
# (2.5753796185367293, {GarageYrBltB})
# (2.577279982520081, {1stFlrSFB})
# (2.607209856639932, {GarageCars})
# (2.626737288605217, {YearRemodAddB})
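The idea of ranking query tuples by label entropy can be sketched with a brute-force search over all tuples of a given dimension. This is not the algorithm of `parametersBuilderConditionalVarsRepa`, which takes an additional `omax` parameter that is not modelled here. It is a simplified stand-in that treats `kmax` as the tuple dimension and `qmax` as the number of results kept, which matches the shape of the listings in this section. All names and data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(counts):
    """Shannon entropy (natural log) of a histogram given as a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def label_entropy(events, query_idx, label_idx):
    """entropy(query and label columns) - entropy(query columns)."""
    joint = Counter(tuple(e[i] for i in query_idx + label_idx) for e in events)
    query = Counter(tuple(e[i] for i in query_idx) for e in events)
    return entropy(joint.values()) - entropy(query.values())

def best_tuples(events, query_idx, label_idx, kmax, qmax):
    """Score every query tuple of dimension kmax and keep the qmax lowest label entropies."""
    scored = [(label_entropy(events, list(k), label_idx), k)
              for k in combinations(query_idx, kmax)]
    return sorted(scored)[:qmax]

# Hypothetical sample: columns 0-2 are query variables, column 3 is the label,
# and the label is the exclusive-or of columns 0 and 1.
events = [(a, b, c, a ^ b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
for score, tup in best_tuples(events, [0, 1, 2], [3], 2, 3):
    print(score, tup)
```

In this toy sample only the pair of columns 0 and 1 has zero label entropy, and neither column is predictive on its own. This shows why higher tuple dimensions can reveal relationships that the mono-variate ranking misses.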
Let us sort by shuffle content derived alignment descending. Let $L = \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trb}},\mathrm{L}}))$. Then calculate \[ \{(\mathrm{algn}(A_{\mathrm{trb}}\%X)-\mathrm{algn}(A_{\mathrm{trbr}}\%X),~X) : (e,X) \in L\} \]
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (0.0, {YearRemodAddB})
# (0.0, {YearBuiltB})
# (0.0, {TotalBsmtSFB})
# (0.0, {OverallQual})
# (0.0, {Neighborhood})
# (0.0, {GrLivAreaB})
# (0.0, {GarageYrBltB})
# (0.0, {GarageCars})
# (0.0, {GarageAreaB})
# (0.0, {1stFlrSFB})
and by size-volume-sized-shuffle relative entropy descending, \[ \{(\mathrm{rent}(A_{\mathrm{trb}}~\%~X,~Z_X * \hat{A}_{\mathrm{trbr}}~\%~X),~X) : (e,X) \in L\} \] where $Z_X = \mathrm{scalar}(|X^{\mathrm{C}}|)$,
def vsize(uu,xx,aa):
return resize(vol(uu,xx),aa)
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (3.694822225952521e-13, {GrLivAreaB})
# (1.7763568394002505e-13, {GarageAreaB})
# (1.7763568394002505e-13, {1stFlrSFB})
# (1.2789769243681803e-13, {GarageYrBltB})
# (2.842170943040401e-14, {YearBuiltB})
# (2.6645352591003757e-15, {GarageCars})
# (-2.2382096176443156e-13, {OverallQual})
# (-3.765876499528531e-13, {TotalBsmtSFB})
# (-3.979039320256561e-13, {Neighborhood})
# (-6.323830348264892e-13, {YearRemodAddB})
Choose a tuple $X$ with the maximum relative entropy,
xx = sset([VarStr(s) for s in ["GrLivAreaB"]])
len(xx)
1
The label entropy, $\mathrm{lent}(A_{\mathrm{trb}},X,V_{\mathrm{bl}})$, is,
lent(aatrb,xx,vvbl)
2.438375883488403
This tuple has a volume of $|X^{\mathrm{C}}| = 21$,
vol(uub,xx)
21
Now consider the query effectiveness against the test set, $\mathrm{size}(A_{\mathrm{teb}} * (A_{\mathrm{trb}}\%X)^{\mathrm{F}})$,
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1459 % 1
So for the mono-variate tuple there exists a prediction for every event of the test set.
(kmax,omax,qmax) = (2, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (1.17259972145533, {GarageYrBltB, GrLivAreaB})
# (1.1763717756588035, {GrLivAreaB, YearBuiltB})
# (1.1825182247882395, {BsmtUnfSFB, GarageAreaB})
# (1.1902468919522748, {GarageAreaB, GrLivAreaB})
# (1.1963275646526252, {1stFlrSFB, GarageYrBltB})
# (1.2007710029111403, {GarageYrBltB, TotalBsmtSFB})
# (1.2044059887997465, {BsmtUnfSFB, GrLivAreaB})
# (1.2080664696090002, {1stFlrSFB, YearBuiltB})
# (1.218109353181501, {BsmtUnfSFB, LotAreaB})
# (1.218337507644863, {1stFlrSFB, GarageAreaB})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (235.08640048183634, {GarageAreaB, GrLivAreaB})
# (205.02701940728434, {GrLivAreaB, YearBuiltB})
# (199.00732245944096, {1stFlrSFB, GarageAreaB})
# (196.09625918630286, {1stFlrSFB, YearBuiltB})
# (172.6125934542572, {GarageYrBltB, TotalBsmtSFB})
# (170.12391234916913, {BsmtUnfSFB, GrLivAreaB})
# (153.73732911898333, {1stFlrSFB, GarageYrBltB})
# (147.96457141870314, {GarageYrBltB, GrLivAreaB})
# (62.39292498274949, {BsmtUnfSFB, GarageAreaB})
# (56.5385461499186, {BsmtUnfSFB, LotAreaB})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (195.89974557002188, {GarageAreaB, GrLivAreaB})
# (173.08992727414716, {1stFlrSFB, YearBuiltB})
# (168.78883116031693, {BsmtUnfSFB, GrLivAreaB})
# (167.82327612434983, {GrLivAreaB, YearBuiltB})
# (164.30059739055696, {1stFlrSFB, GarageAreaB})
# (160.0849264746389, {BsmtUnfSFB, GarageAreaB})
# (158.6702662871935, {GarageYrBltB, TotalBsmtSFB})
# (155.68492715470074, {1stFlrSFB, GarageYrBltB})
# (153.5325087147944, {GarageYrBltB, GrLivAreaB})
# (149.44578750669734, {BsmtUnfSFB, LotAreaB})
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])
len(xx)
2
lent(aatrb,xx,vvbl)
1.2044059887997252
vol(uub,xx)
462
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1
1459 - 1395
64
In the case of the chosen bi-variate tuple, {BsmtUnfSFB, GrLivAreaB}, which has the third highest relative entropy in the list above, the query on the test set is ineffective for 64 events.
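Query effectiveness can be read as a simple membership test: a test event is effective if its state on the tuple $X$ also occurs in the training sample. The following standalone sketch counts such events on plain tuples. The function `effective_count` and the example data are hypothetical and only mirror the expression $\mathrm{size}(A_{\mathrm{teb}} * (A_{\mathrm{trb}}\%X)^{\mathrm{F}})$.

```python
def effective_count(train, test, idx):
    """Count the test events whose reduced state also occurs in the training sample."""
    seen = {tuple(e[i] for i in idx) for e in train}
    return sum(1 for e in test if tuple(e[i] for i in idx) in seen)

# Hypothetical events: (area bucket, basement bucket).
train = [(1, "a"), (2, "b")]
test = [(1, "a"), (3, "a"), (2, "a")]

print(effective_count(train, test, [0]))     # 2: area bucket 3 was never seen in training
print(effective_count(train, test, [0, 1]))  # 1: only (1, "a") occurs in training
```

As the tuple dimension grows, the training sample covers a smaller fraction of the tuple's volume, so more test events fall on states that were never seen.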
(kmax,omax,qmax) = (3, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (0.1583902593862545, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (0.16321608158264933, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (0.17534692586874456, {BsmtUnfSFB, GarageYrBltB, LotAreaB})
# (0.17551868584768204, {GarageAreaB, GrLivAreaB, LotAreaB})
# (0.18785610344571868, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (0.19170918418881655, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (0.1920984001311803, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (0.1925880641418205, {BsmtUnfSFB, LotAreaB, YearRemodAddB})
# (0.19796856294348153, {1stFlrSFB, GarageAreaB, LotAreaB})
# (0.2007710037546273, {BsmtUnfSFB, LotAreaB, YearBuiltB})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (130.3754279569223, {1stFlrSFB, GarageAreaB, LotAreaB})
# (124.43940524792288, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (115.96748840137627, {GarageAreaB, GrLivAreaB, LotAreaB})
# (113.42933723706949, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (102.01380411810794, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (100.30833097059985, {BsmtUnfSFB, LotAreaB, YearBuiltB})
# (92.29286671992077, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (78.63867541400259, {BsmtUnfSFB, LotAreaB, YearRemodAddB})
# (73.51335432366454, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (72.3502035138589, {BsmtUnfSFB, GarageYrBltB, LotAreaB})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (3703.571392190337, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB})
# (3661.5241122377047, {BsmtUnfSFB, GarageAreaB, GrLivAreaB})
# (3658.9316126172052, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# (3658.1801888467744, {BsmtUnfSFB, GarageAreaB, LotAreaB})
# (3636.9144274921127, {GarageAreaB, GrLivAreaB, LotAreaB})
# (3599.7471796129685, {BsmtUnfSFB, LotAreaB, YearBuiltB})
# (3559.347827119549, {GarageYrBltB, GrLivAreaB, LotAreaB})
# (3558.9595478408883, {BsmtUnfSFB, GarageYrBltB, LotAreaB})
# (3519.053657775017, {1stFlrSFB, GarageAreaB, LotAreaB})
# (3509.535390459998, {BsmtUnfSFB, LotAreaB, YearRemodAddB})
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","LotAreaB"]])
len(xx)
3
lent(aatrb,xx,vvbl)
0.1632160815826591
vol(uub,xx)
9702
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 369 % 1
In the case of the tri-variate tuple with the highest relative entropy, the query on the test set is effective for only 369 events.
xx = sset([VarStr(s) for s in ["1stFlrSFB","GarageAreaB","LotAreaB"]])
len(xx)
3
lent(aatrb,xx,vvbl)
0.19796856294348153
vol(uub,xx)
9261
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 436 % 1
The tri-variate tuple with the highest content alignment is effective for only 436 events.
(kmax,omax,qmax) = (4, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (0.017264363040668584, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (0.021653557329598172, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.025045822967731723, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (0.02517147370072692, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (0.02658646719956348, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.028611151303957527, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.0290169524086199, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})
# (0.029142603141615098, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (0.029422753513280497, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (0.029435017256659535, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# (37.233006349854804, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (25.412008122784982, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})
# (21.37090807508173, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (18.950539946431263, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (17.969710693419643, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (17.446462549655052, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (17.26414099286103, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (15.772486116083314, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})
# (12.124428656489613, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (11.613603032723745, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (8586.179143302841, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotAreaB})
# (7750.455680972198, {BsmtUnfSFB, GrLivAreaB, MoSold, YearBuiltB})
# (7696.60521695565, {BsmtUnfSFB, LotAreaB, MoSold, YearBuiltB})
# (7679.156435784535, {BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (7662.92102955794, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold})
# (7655.871371792047, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold})
# (7641.2282313114265, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold})
# (7639.854904487263, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (7518.573606714141, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB})
# (7504.227315635071, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB})
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GarageAreaB","GrLivAreaB","LotAreaB"]])
len(xx)
4
lent(aatrb,xx,vvbl)
0.029422753513281386
vol(uub,xx)
203742
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 65 % 1
(kmax,omax,qmax) = (5, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (0.0037980667427950365, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0037980667427950365, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0047475834284931295, {BsmtUnfSFB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0047475834284931295, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.005105972568058448, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.005697100114192111, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (0.005697100114192999, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.006055489253756541, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (0.006055489253757429, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (12281.642656929791, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (12281.642656926066, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (12273.219033710659, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (10183.691543019377, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (10141.754584021866, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (10095.169639604632, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (9996.724686695263, {BsmtUnfSFB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (9963.535168956965, {1stFlrSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (9949.84899427509, {GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (9918.384667370003, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
xx = sset([VarStr(s) for s in ["1stFlrSFB","BsmtFinSF1B","GarageYrBltB","LotAreaB","MoSold"]])
len(xx)
5
lent(aatrb,xx,vvbl)
0.005105972568058448
vol(uub,xx)
2444904
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 18 % 1
(kmax,omax,qmax) = (6, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (0.0018990333713970742, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BedroomAbvGr, BsmtFinSF1B, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0018990333713970742, {BedroomAbvGr, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, TotalBsmtSFB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, OpenPorchSFB, YrSold})
# (0.0018990333713970742, {BsmtUnfSFB, GrLivAreaB, MoSold, OpenPorchSFB, YearRemodAddB, YrSold})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (14549.764237865806, {BsmtUnfSFB, GrLivAreaB, MoSold, OpenPorchSFB, YearRemodAddB, YrSold})
# (14501.807925760746, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14501.807925760746, {BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14491.875180616975, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, MoSold, YearRemodAddB, YrSold})
# (14470.93934185803, {BsmtUnfSFB, GrLivAreaB, LotFrontageB, MoSold, OpenPorchSFB, YrSold})
# (14433.893293440342, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (14422.87513013184, {BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, TotalBsmtSFB, YrSold})
# (14403.117766991258, {BsmtUnfSFB, GarageAreaB, GrLivAreaB, LotFrontageB, MoSold, YrSold})
# (13193.704952552915, {BedroomAbvGr, BsmtFinSF1B, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (12989.3541607894, {BedroomAbvGr, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","MoSold","OpenPorchSFB","YearRemodAddB","YrSold"]])
len(xx)
6
lent(aatrb,xx,vvbl)
0.0018990333713979624
vol(uub,xx)
11586960
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 19 % 1
(kmax,omax,qmax) = (7, 60, 10)
ll = buildcondrr(vvbl,hhtrb,kmax,omax,qmax)
rpln(ll)
# (0.0, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.0, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, YearRemodAddB, YrSold})
# (0.0, {BsmtExposure, BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtExposure, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtFinSF1B, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtFinType1, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, BsmtUnfSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, LotAreaB, MasVnrAreaB, MoSold, YearRemodAddB, YrSold})
# (0.000949516685698093, {1stFlrSFB, BedroomAbvGr, LotAreaB, MoSold, WoodDeckSFB, YearRemodAddB, YrSold})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# (17683.828986644745, {1stFlrSFB, BedroomAbvGr, BsmtExposure, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (17537.708243489265, {1stFlrSFB, BedroomAbvGr, LotAreaB, MoSold, WoodDeckSFB, YearRemodAddB, YrSold})
# (17537.708243370056, {1stFlrSFB, BedroomAbvGr, BsmtUnfSFB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (17537.708243370056, {1stFlrSFB, BedroomAbvGr, BsmtFinSF1B, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (17513.683985829353, {1stFlrSFB, BedroomAbvGr, LotAreaB, MasVnrAreaB, MoSold, YearRemodAddB, YrSold})
# (17469.78956949711, {1stFlrSFB, BedroomAbvGr, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16851.51015740633, {BsmtExposure, BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16851.51015740633, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotAreaB, MoSold, YearRemodAddB, YrSold})
# (16615.06543546915, {BsmtExposure, BsmtUnfSFB, GarageAreaB, LotFrontageB, MoSold, YearRemodAddB, YrSold})
# (15844.106867194176, {1stFlrSFB, BedroomAbvGr, BsmtFinType1, LotAreaB, MoSold, YearRemodAddB, YrSold})
So the minimum dimension of a causal tuple, i.e. one with zero label entropy, is 7. Choose a tuple $X$ with the maximum relative entropy among those with zero label entropy,
xx = sset([VarStr(s) for s in ["BsmtExposure","BsmtUnfSFB","GarageAreaB","LotAreaB","MoSold","YearRemodAddB","YrSold"]])
len(xx)
7
lent(aatrb,xx,vvbl)
0.0
vol(uub,xx)
55301400
The tuple has a large volume, but it classifies the sample into only $|(A_{\mathrm{trb}}~\%~(X \cup V_{\mathrm{bl}}))^{\mathrm{F}}| = |(A_{\mathrm{trb}}\%X)^{\mathrm{F}}| = 1460$ effective states or slices,
rpln(aall(red(aatrb,xx|vvbl)))
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 3182), (MoSold, 5), (SalePriceB, 88000), (YearRemodAddB, 1974), (YrSold, 2010)}, 1 % 1)
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 3182), (MoSold, 12), (SalePriceB, 88000), (YearRemodAddB, 1970), (YrSold, 2007)}, 1 % 1)
# ({(BsmtExposure, Av), (BsmtUnfSFB, 0), (GarageAreaB, 0), (LotAreaB, 12150), (MoSold, 5), (SalePriceB, 124000), (YearRemodAddB, 1993), (YrSold, 2007)}, 1 % 1)
# ...
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 844), (LotAreaB, 12150), (MoSold, 3), (SalePriceB, 326000), (YearRemodAddB, 2006), (YrSold, 2007)}, 1 % 1)
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 844), (LotAreaB, 14175), (MoSold, 7), (SalePriceB, 755000), (YearRemodAddB, 2009), (YrSold, 2009)}, 1 % 1)
# ({(BsmtExposure, No), (BsmtUnfSFB, 2336), (GarageAreaB, 1488), (LotAreaB, 13005), (MoSold, 8), (SalePriceB, 278000), (YearRemodAddB, 2009), (YrSold, 2009)}, 1 % 1)
size(eff(red(aatrb,xx|vvbl)))
# 1460 % 1
This, however, equals the size of the bucketed training sample, so each effective state contains a single event. So, even though the relative entropy is among the highest obtained so far, which implies a robust or likely model, it is doubtful that there is sufficient size in each component to make the tuple very query effective. This can be seen by considering the query effectiveness of the test set, $\mathrm{size}(A_{\mathrm{teb}} * (A_{\mathrm{trb}}\%X)^{\mathrm{F}})$,
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 7 % 1
Of course, if a query fails with a model of 7 variables we can retry with the less likely model of 6 variables, and so on until a prediction is made.
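The retry strategy just described can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are invented, the function names `build_lookup` and `predict_with_fallback` are hypothetical, and the prediction for an effective state is taken to be its most frequent training label, which is one possible choice rather than the method used in this analysis.

```python
def project(row, tuple_vars):
    """State of a row (dict of variable -> value) on an ordered tuple of variables."""
    return tuple(row[v] for v in tuple_vars)

def build_lookup(train_rows, tuple_vars, label):
    """Map each effective state on tuple_vars to its most frequent label."""
    counts = {}
    for row in train_rows:
        by_label = counts.setdefault(project(row, tuple_vars), {})
        by_label[row[label]] = by_label.get(row[label], 0) + 1
    return {state: max(by_label, key=by_label.get)
            for state, by_label in counts.items()}

def predict_with_fallback(train_rows, query, tuples, label):
    """Try the tuples in the given order; return (label, tuple used) or (None, None)."""
    for tuple_vars in tuples:
        lookup = build_lookup(train_rows, tuple_vars, label)
        state = project(query, tuple_vars)
        if state in lookup:
            return lookup[state], tuple_vars
    return None, None

# Invented sample with three query variables and a label "y".
train = [
    {"a": 1, "b": 1, "c": 1, "y": "low"},
    {"a": 1, "b": 2, "c": 1, "y": "high"},
    {"a": 2, "b": 2, "c": 2, "y": "high"},
]
tuples = [("a", "b", "c"), ("a", "b"), ("a",)]  # largest tuple first

prediction, used = predict_with_fallback(train, {"a": 1, "b": 2, "c": 9}, tuples, "y")
missing, unused = predict_with_fallback(train, {"a": 3, "b": 3, "c": 3}, tuples, "y")
print(prediction, used, missing, unused)
```

The lookup is rebuilt for every query to keep the sketch short; in practice it would be built once per tuple.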
Instead of determining minimum subsets of the query variables that are causal or predictive by using the conditional entropy tuple set builder, consider the conditional entropy fud decomper. The resultant decomposition consists of singleton fuds of self partition transforms of smaller tuples. In this way a set of paths of different tuples for different slices can reduce the label entropy,
def decompercondrr(ll,uu,aa,kmax,omax,fmax):
    return parametersSystemsHistoryRepasDecomperConditionalFmaxRepa(kmax,omax,fmax,uu,ll,aa)
(kmax,omax) = (1,5)
(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,31)
dfund(df)
# {1stFlrSFB, BsmtFinSF1B, BsmtUnfSFB, GarageAreaB, GarageYrBltB, GrLivAreaB, LotAreaB, MasVnrAreaB, MoSold, Neighborhood, TotalBsmtSFB}
len(dfund(df))
11
rpln(treesPaths(funcsTreesMap(lambda xx:(fder(xx[1]),fund(xx[1])),dfzz(df))))
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<20,1>,1>}, {GarageAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<23,1>,1>}, {GarageYrBltB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<24,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<26,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<27,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<28,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<2,1>,1>}, {GrLivAreaB}), ({<<29,1>,1>}, {LotAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<3,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<4,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<5,1>,1>}, {BsmtUnfSFB}), ({<<22,1>,1>}, {1stFlrSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<6,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<7,1>,1>}, {BsmtUnfSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<8,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<9,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<10,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<11,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<12,1>,1>}, {MasVnrAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<13,1>,1>}, {TotalBsmtSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<14,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<15,1>,1>}, {BsmtUnfSFB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<16,1>,1>}, {GarageAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<17,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<18,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<19,1>,1>}, {BsmtFinSF1B})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<21,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<25,1>,1>}, {MoSold})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<30,1>,1>}, {GrLivAreaB})]
# [({<<1,1>,1>}, {Neighborhood}), ({<<31,1>,1>}, {BsmtFinSF1B})]
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub1 = uunion(uub,fsys(ff))
hhtrbb = hrfmul(uub1,ff,hhtrb)
def hrlent(uu,hh,ww,vvl):
    return ent(hhaa(hrhh(uu,hrhrred(hh,ww|vvl)))) - ent(hhaa(hrhh(uu,hrhrred(hh,ww))))
hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.0078399275847598
hhteb = hrhrred(aahr(uub,aateb),vvbk)
hhtebb = hrfmul(uub1,ff,hhteb)
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1352 % 1
We can see that the query effectiveness and label entropy are similar to the 2-tuple case above.
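The label entropy computed by `hrlent` is the entropy of the reduction to the query and label variables minus the entropy of the reduction to the query variables alone. A plain-Python sketch of the same difference follows, assuming that `ent` is the Shannon entropy in nats of the normalised histogram; the rows and the helper names are invented for illustration.

```python
from collections import Counter
from math import log

def entropy(counts):
    """Shannon entropy in nats of a histogram given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def label_entropy(rows, query_vars, label_vars):
    """entropy(A % (W | V)) - entropy(A % W) for rows given as dicts."""
    def hist(variables):
        return list(Counter(tuple(r[v] for v in variables) for r in rows).values())
    w = sorted(query_vars)
    return entropy(hist(w + sorted(label_vars))) - entropy(hist(w))

# 1460 invented events: every query state is unique except for one pair
# of events that share a query state but carry different labels.
rows = [{"x": i, "y": i % 20} for i in range(1458)]
rows += [{"x": 9999, "y": 0}, {"x": 9999, "y": 1}]

value = label_entropy(rows, {"x"}, {"y"})
print(value, 2 * log(2) / 1460)
```

Under that assumption a single ambiguous pair in a sample of 1460 contributes $2 \ln 2 / 1460$, which appears consistent with the value 0.000949516685698093 reported for several 7-tuples earlier in this section.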
(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,63)
len(dfund(df))
15
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub1 = uunion(uub,fsys(ff))
hhtrbb = hrfmul(uub1,ff,hhtrb)
hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.7144285840050593
hhtebb = hrfmul(uub1,ff,hhteb)
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1243 % 1
(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,127)
len(dfund(df))
21
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub1 = uunion(uub,fsys(ff))
hhtrbb = hrfmul(uub1,ff,hhtrb)
hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.38921596419687177
hhtebb = hrfmul(uub1,ff,hhteb)
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1064 % 1
(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,255)
len(dfund(df))
28
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub1 = uunion(uub,fsys(ff))
hhtrbb = hrfmul(uub1,ff,hhtrb)
hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.1053963521125576
hhtebb = hrfmul(uub1,ff,hhteb)
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 864 % 1
The 255-fud has higher query effectiveness and lower label entropy than the 3-tuple case above.
(uub1,df) = decompercondrr(vvbl,uub,hhtrb,kmax,omax,511)
len(dfund(df))
34
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub1 = uunion(uub,fsys(ff))
hhtrbb = hrfmul(uub1,ff,hhtrb)
hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.7763568394002505e-15
hhtebb = hrfmul(uub1,ff,hhteb)
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 769 % 1
The 511-fud is causal, like the 7-tuple, and more query effective, but still only for around half of the test sample.
To conclude, the choice between models consisting of only substrate variables is a trade-off between model likelihood and accuracy/effectiveness given the sample size and substrate valencies.
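The trade-off can be made explicit by collecting the figures reported in this section. The sketch below only tabulates numbers copied from the outputs above (label entropy on the training sample and effective test events for the tuple chosen at each dimension and for each fud decomposition); the model labels and the layout are illustrative choices.

```python
TEST_SIZE = 1459  # events in the bucketed test set

# (model, label entropy on training sample, effective test events),
# copied from the outputs reported in this section.
results = [
    ("2-tuple", 1.2044059887997252, 1395),
    ("3-tuple", 0.1632160815826591, 369),
    ("4-tuple", 0.029422753513281386, 65),
    ("5-tuple", 0.005105972568058448, 18),
    ("6-tuple", 0.0018990333713979624, 19),
    ("7-tuple", 0.0, 7),
    ("31-fud", 1.0078399275847598, 1352),
    ("63-fud", 0.7144285840050593, 1243),
    ("127-fud", 0.38921596419687177, 1064),
    ("255-fud", 0.1053963521125576, 864),
    ("511-fud", 1.7763568394002505e-15, 769),
]

# Fraction of the test sample for which each model can make a prediction.
fractions = {name: n / TEST_SIZE for name, _, n in results}

for name, lent, n in results:
    print(f"{name:8s} lent={lent:10.6f} effective={n:5d} ({fractions[name]:6.1%})")
```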
Induced modelling of sale price
Consider an unsupervised induced model $D$ on the query variables, $V_{\mathrm{bk}}$, which exclude sale price. We shall analyse this model, $D$, to find a smaller submodel that predicts the label variables, $V_{\mathrm{bl}}$, or sale price. That is, we shall search in the decomposition fud for a submodel that optimises conditional entropy.
Here the induced model is created by the limited-nodes highest-layer excluded-self maximum-roll-by-derived-dimension fud decomper, $(\cdot,D) = I_{P,U_{\mathrm{b}},\mathrm{D,F,mm,xs,d,f}}((V_{\mathrm{bk}},A_{\mathrm{trb}}))$.
There is an example of model induction in the AMESPy repository.
First consider the fud decomposition AMES_model1.json (see Model induction),
from AMESDev import *
(uub,aab,aatrb,aateb) = amesBucketedIO(20)
vvb = uvars(uub) - sset([VarStr("Id")])
vvbl = sset([VarStr("SalePriceB")])
vvbk = vvb - vvbl
df = persistentsDecompFud_u(json.load(open('./AMES_model1.json','r')))
uub1 = uunion(uub,fsys(dfff(df)))
len(uvars(uub1))
354
Let us examine the tree of the fud decomposition, \[ \begin{eqnarray} \{\{(S,~\mathrm{und}(F),~\mathrm{der}(F)) : (S,F) \in L\} : L \in \mathrm{paths}(D)\} \end{eqnarray} \]
rpln(treesPaths(funcsTreesMap(lambda xx:(xx[0],fund(xx[1]),fder(xx[1])),dfzz(df))))
...
The decomposition tree contains 15 nodes with fud variables as follows, \[ \begin{eqnarray} \{\{\mathrm{fid}(F) : (\cdot,F) \in L\} : L \in \mathrm{paths}(D)\} \end{eqnarray} \]
def fid(ff):
    return variablesVariableFud(fder(ff)[0])
rpln(treesSubPaths(funcsTreesMap(lambda xx:fid(xx[1]),dfzz(df))))
# [1]
# [1, 2]
# [1, 2, 7]
# [1, 2, 7, 13]
# [1, 2, 10]
# [1, 3]
# [1, 3, 5]
# [1, 3, 5, 12]
# [1, 3, 5, 14]
# [1, 3, 5, 15]
# [1, 3, 6]
# [1, 3, 9]
# [1, 4]
# [1, 4, 8]
# [1, 4, 11]
Now consider the summed alignment and the summed alignment valency-density, $\mathrm{summation}(U_{\mathrm{b}1},D,A_{\mathrm{b}})$,
hhb = hrhrred(aahr(uub,aab),vvb)
(wmax,lmax,xmax,omax,bmax,mmax,umax,pmax,fmax,mult,seed) = (2919, 8, 2919, 20, (20*3), 3, 1459, 1, 15, 7, 5)
summation(mult,seed,uub1,df,hhb)
(23920.38686712143, 10661.48762745227)
\[ \begin{eqnarray} \{(\mathrm{fid}(F),~z_C,~a) : ((S,F),(z_C,(a,a_{\mathrm{d}}))) \in \mathrm{nodes}(\mathrm{sumtree}(U_{\mathrm{b}1},D,A_{\mathrm{b}}))\} \end{eqnarray} \]
sumtree = systemsDecompFudsHistoryRepasTreeAlignmentContentShuffleSummation_u
rpln(sorted([(fid(ff),zc,a) for ((ss,ff),(zc,(a,ad))) in sumtree(mult,seed,uub1,df,hhb).items()]))
# (1, 2919, 12657.721873016915)
# (2, 1084, 2529.6565315872417)
# (3, 981, 2809.7648124861553)
# (4, 686, 1658.96614868414)
# (5, 432, 1095.8853558730061)
# (6, 312, 662.7996448348581)
# (7, 256, 465.3573699577162)
# (8, 224, 480.4560693360156)
# (9, 176, 424.64359529323536)
# (10, 120, 179.02315566112912)
# (11, 118, 225.63355887734627)
# (12, 117, 178.25089693539442)
# (13, 103, 229.9593097628051)
# (14, 90, 184.9645923908373)
# (15, 80, 137.30395242463544)
We can see that the root fud has the highest slice size and shuffle content derived alignment, while the leaf fuds have small slice sizes and shuffle content derived alignments.
The bare model is a fud decomposition. As noted in Conversion to fud, the tree of a fud decomposition is sometimes unwieldy, so consider the fud decomposition fud, $F = D^{\mathrm{F}} \in \mathcal{F}$, (see Practicable fud decomposition fud),
ff = systemsDecompFudsNullablePracticable(uub1,df,1)
uub2 = uunion(uub,fsys(ff))
len(uvars(uub2))
518
The model, $F$, has 139 derived variables, $W_F = \mathrm{der}(F)$, and a large derived volume, $|W_F^{\mathrm{C}}|$,
len(fder(ff))
139
fder(ff)
# {<<1,n>,1>, <<1,n>,2>,...,<<15,n>,11>}
vol(uub2,fder(ff))
1081689568857469337014175565968052911323215747000367892356501629566976
The model has 50 underlying variables, $V_F = \mathrm{und}(F)$,
len(fund(ff))
50
vvbk - fund(ff)
# {3SsnPorchB, BsmtFinSF2B, BsmtFinType2, Condition1, Condition2, Electrical, EnclosedPorchB, Functional, GarageAreaB, GarageCars, Heating, LotAreaB, LotConfig, LotFrontageB, LotShape, LowQualFinSFB, MasVnrAreaB, MiscValB, MoSold, Neighborhood, OpenPorchSFB, OverallCond, OverallQual, PoolArea, RoofMatl, SaleType, ScreenPorchB, WoodDeckSFB, YrSold}
That is, a substantial part of the substrate is ignored by the model.
The underlying volume, $|V_F^{\mathrm{C}}|$, is
vol(uub,fund(ff))
214786071852303726298697564160000000000000
The derived entropy, $\mathrm{entropy}(A_{\mathrm{b}} * F)$, is
aab1 = hhaa(hrhh(uub2,hrhrred(hrfmul(uub2,ff,hhb),fder(ff))))
ent(aab1)
4.6043656843985366
This may be compared to the logarithm of the derived volume, $\ln |W_F^{\mathrm{C}}|$,
w = vol(uub2,fder(ff))
log(w)
158.9568956509107
So derived entropy is quite low. This is because there are only 243 effective derived states,
size(eff(aab1))
# 243 % 1
rpln([c for (ss,c) in aall(aab1)])
# 1 % 1
# 1 % 1
# 2 % 1
# 1 % 1
# 4 % 1
# ...
# 62 % 1
# 50 % 1
# 1 % 1
# 3 % 1
# 2 % 1
# 1 % 1
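The gap between the derived entropy and the logarithm of the derived volume follows from the fact that the entropy of a histogram cannot exceed the logarithm of its number of effective states. The sketch below checks this with the figures reported above and with an invented skewed histogram; the `entropy` helper is illustrative and assumes natural logarithms.

```python
from math import log

def entropy(counts):
    """Shannon entropy in nats of a histogram given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

effective_states = 243                  # size(eff(aab1)) reported above
reported_entropy = 4.6043656843985366   # ent(aab1) reported above
log_volume = 158.9568956509107          # log(w) reported above

# The entropy is bounded by the log of the number of effective states.
upper_bound = log(effective_states)
print(upper_bound)

# Invented skewed counts: a few large slices and several very small ones.
counts = [62, 50, 4, 3, 2, 2, 1, 1, 1]
print(entropy(counts), log(len(counts)))
```

The bound is attained only when all effective states have equal counts, so a histogram dominated by a few large slices falls well below it.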
Now apply the model to the sample. Let $A_{\mathrm{trbb}} = A_{\mathrm{trb}}~\%~V_{\mathrm{b}} * \prod\mathrm{his}(F)$,
hhtrb = hrhrred(aahr(uub,aatrb),vvb)
hhtrbb = hrfmul(uub2,ff,hhtrb)
hrsize(hhtrbb)
1460
hhteb = hrhrred(aahr(uub,aateb),vvbk)
hhtebb = hrfmul(uub2,ff,hhteb)
hrsize(hhtebb)
1459
rpln(aall(hhaa(hrhh(uub2,hrhrred(hhtrbb,fder(ff)|vvbl)))))
# ({(SalePriceB, 88000), (<<1,n>,1>, 0),...,(<<15,n>,11>, null)}, 1 % 1)
# ...
size(eff(hhaa(hrhh(uub2,hrhrred(hhtrbb,fder(ff)|vvbl)))))
# 722 % 1
The model’s label entropy or query conditional entropy is less than that of Neighborhood,
\[
\mathrm{lent}(A_{\mathrm{trbb}},W_F,V_{\mathrm{bl}}) < \mathrm{lent}(A_{\mathrm{trbb}},\{\mathrm{Neighborhood}\},V_{\mathrm{bl}}) < \mathrm{ent}(A_{\mathrm{trbb}}\%V_{\mathrm{bl}})
\]
def hrlent(uu,hh,ww,vvl):
    return ent(hhaa(hrhh(uu,hrhrred(hh,ww|vvl)))) - ent(hhaa(hrhh(uu,hrhrred(hh,ww))))
hrlent(uub2,hhtrbb,fder(ff),vvbl)
1.7888987910580774
lent(aatrb,sset([VarStr("Neighborhood")]),vvbl)
2.3688094014030585
ent(red(aatrb,vvbl))
2.9948072760546887
That is, the model is more predictive of sale price than Neighborhood.
rpln(sset([(hrlent(uub2,hhtrbb,sset([w]),vvbl),w) for w in fder(ff)]))
# (2.713881719423317, <<1,n>,2>)
# (2.7241739949478916, <<1,n>,3>)
# (2.7391770885661, <<3,n>,5>)
# ...
# (2.818326655730092, <<2,n>,4>)
# (2.819116050902175, <<2,n>,1>)
# (2.8204246647147992, <<2,n>,8>)
# (2.873366670608093, <<5,n>,6>)
# (2.8737196824165805, <<6,n>,7>)
# ...
# (2.9591678742238914, <<15,n>,9>)
# (2.9591678742238914, <<15,n>,10>)
# (2.960553324175016, <<8,n>,4>)
# ...
# (2.9670421547242567, <<11,n>,3>)
# (2.9670421547242567, <<11,n>,4>)
# (2.968514119368421, <<10,n>,2>)
# (2.968514119368421, <<10,n>,5>)
# (2.968514119368421, <<10,n>,8>)
We can see that the derived variables nearest the root fud tend to have the lowest label entropy. None have zero label entropy by themselves. Consider derived variable <<1,n>,2> in the root fud,
w1n2 = stringsVariable("<<1,n>,2>")
fund(fdep(ff,sset([w1n2])))
# {BsmtQual, GarageYrBltB, YearRemodAddB}
hrlent(uub2,hhtrbb,sset([w1n2]),vvbl)
2.713881719423317
rpln(aall(hhaa(hrhh(uub2,hrhrred(hhtrbb,sset([w1n2])|vvbl)))))
# ({(SalePriceB, 88000), (<<1,n>,2>, 0)}, 60 % 1)
# ({(SalePriceB, 88000), (<<1,n>,2>, 2)}, 15 % 1)
# ({(SalePriceB, 106250), (<<1,n>,2>, 0)}, 56 % 1)
# ({(SalePriceB, 106250), (<<1,n>,2>, 2)}, 15 % 1)
# ...
# ({(SalePriceB, 326000), (<<1,n>,2>, 0)}, 4 % 1)
# ({(SalePriceB, 326000), (<<1,n>,2>, 1)}, 57 % 1)
# ({(SalePriceB, 326000), (<<1,n>,2>, 2)}, 11 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 0)}, 4 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 1)}, 64 % 1)
# ({(SalePriceB, 755000), (<<1,n>,2>, 2)}, 5 % 1)
Now consider the label entropy for all of the fud variables, not just the fud derived variables. We can determine minimum subsets of the fud variables that are causal or predictive by using the repa conditional entropy tuple set builder to minimise the conditional entropy, \[ \{(\mathrm{lent}(A_{\mathrm{trbb}},M,V_{\mathrm{bl}}),~M) : M \in \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trbb}},\mathrm{L}}))\} \]
def buildcondrr(vvl,aa,kmax,omax,qmax):
    return sset([(b,a) for (a,b) in parametersBuilderConditionalVarsRepa(kmax,omax,qmax,vvl,aa).items()])
(kmax,omax,qmax) = (1, 30, 30)
ll = buildcondrr(vvbl,hhtrbb,kmax,omax,qmax)
rpln(ll)
# (2.3688094014030727, {Neighborhood})
# (2.4017759366388978, {OverallQual})
# (2.4383758834884066, {GrLivAreaB})
# ...
# (2.669795100904241, {<<1,1>,3>})
# (2.6829848274547423, {<<7,1>,2>})
# (2.6857543503089576, {BsmtFinSF1B})
# (2.6909428641116366, {OpenPorchSFB})
# (2.6998273379134146, {<<1,1>,2>})
# (2.703002915017576, {LotAreaB})
# (2.7083055009674, {GarageFinish})
# (2.712732624759936, {MasVnrAreaB})
# (2.713881719423318, {<<1,n>,2>})
# ...
# (2.7187679919428582, {<<1,1>,7>})
# (2.719031983537922, {FullBath})
Let us sort by shuffle content derived alignment descending. Let $L = \mathrm{botd}(\mathrm{qmax})(\mathrm{elements}(Z_{P,A_{\mathrm{trbb}},\mathrm{L}}))$. Then calculate \[ \{(\mathrm{algn}(A_{\mathrm{trbb}}\%X)-\mathrm{algn}(A_{\mathrm{trbrb}}\%X),~X) : (e,X) \in L\} \] where $A_{\mathrm{trbrb}} = A_{\mathrm{trbr}}~\%~V_{\mathrm{b}} * \prod\mathrm{his}(F)$,
hhtrbr = historyRepasShuffle_u(hhtrb,1)
hhtrbrb = hrfmul(uub2,ff,hhtrbr)
hrsize(hhtrbrb)
1460
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb,xx)))]]))))
# (0.0, {<<7,3>,2>})
# (0.0, {<<7,1>,2>})
# (0.0, {<<1,2>,46>})
# ...
# (0.0, {2ndFlrSFB})
# (0.0, {1stFlrSFB})
and by size-volume-sized-shuffle relative entropy descending, \[ \{(\mathrm{rent}(A_{\mathrm{trbb}}~\%~X,~Z_F * \hat{A}_{\mathrm{trbrb}}~\%~X),~X) : (e,X) \in L\} \] where $Z_F = \mathrm{scalar}(|V_F^{\mathrm{C}}|)$,
def vsize(uu,xx,aa):
    return resize(vol(uu,xx),aa)
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb,xx))))]]))))
# (328.5675967076595, {<<1,2>,46>})
# (38.520839912467636, {<<1,2>,44>})
# (38.520839912467636, {<<1,n>,2>})
# (23.77367675193318, {<<7,3>,2>})
# (5.730394727029356, {<<1,1>,9>})
# (7.105427357601002e-13, {GrLivAreaB})
# ...
# (-2.4868995751603507e-13, {<<1,1>,7>})
# (-3.836930773104541e-13, {GarageYrBltB})
# (-8.406608742461685e-13, {TotalBsmtSFB})
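The shuffled history used in the comparisons above serves as a baseline in which the variables are independent. The sketch below assumes that a shuffle permutes each column of the history independently, which is an assumption about `historyRepasShuffle_u` rather than something confirmed here. Under that assumption every single-variable reduction is unchanged by the shuffle, so a 1-tuple cannot differ from its shuffle in alignment. The rows and function names are invented.

```python
import random
from collections import Counter

def shuffle_columns(rows, seed):
    """Permute each column independently, keeping every marginal intact."""
    rng = random.Random(seed)
    variables = sorted(rows[0])
    columns = {}
    for v in variables:
        column = [r[v] for r in rows]
        rng.shuffle(column)
        columns[v] = column
    return [{v: columns[v][i] for v in variables} for i in range(len(rows))]

def marginal(rows, variable):
    """Histogram of a single variable."""
    return Counter(r[variable] for r in rows)

def joint(rows):
    """Histogram of the pair (a, b)."""
    return Counter((r["a"], r["b"]) for r in rows)

# Invented history in which a and b are perfectly aligned.
rows = [{"a": i % 3, "b": i % 3} for i in range(30)]
shuffled = shuffle_columns(rows, seed=1)

print(marginal(rows, "a") == marginal(shuffled, "a"))
print(joint(rows))
print(joint(shuffled))
```

The joint histogram of the shuffled rows will typically spread over off-diagonal states, while both marginals stay exactly as they were.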
xx = [stringsVariable("<<1,2>,46>")]
len(xx)
1
The label entropy of the tuple, $X$, is $\mathrm{lent}(A_{\mathrm{trbb}},X,V_{\mathrm{bl}})$,
hrlent(uub2,hhtrbb,xx,vvbl)
2.7182017226256883
vol(uub2,xx)
3
The tuple, $X$, is very query effective, $\mathrm{size}(A_{\mathrm{tebb}}\%X * (A_{\mathrm{trbb}}\%X)^{\mathrm{F}})$,
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb,xx))))))
# 1459 % 1
The substrate variables are usually more predictive of sale price than the fud variables. This is because the substrate variables generally have larger valencies and so fewer are needed to partition the volume,
(kmax,omax,qmax) = (5, 10, 10)
ll = buildcondrr(vvbl,hhtrbb,kmax,omax,qmax)
rpln(ll)
# (0.0037980667427950365, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.0037980667427950365, {GarageAreaB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.005105972568058448, {1stFlrSFB, BsmtFinSF1B, GarageYrBltB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtFinSF1B, GarageAreaB, GrLivAreaB, LotAreaB, MoSold})
# (0.006055489253756541, {BsmtUnfSFB, GarageYrBltB, GrLivAreaB, MoSold, YrSold})
# (0.006055489253757429, {1stFlrSFB, BsmtUnfSFB, GarageYrBltB, LotAreaB, MoSold})
# (0.007596133485589185, {BsmtUnfSFB, GrLivAreaB, LotAreaB, MoSold, WoodDeckSFB})
# (0.007596133485589185, {GarageYrBltB, GrLivAreaB, LotAreaB, MoSold, YrSold})
# (0.007596133485590073, {1stFlrSFB, GarageAreaB, LotAreaB, MoSold, YrSold})
# (0.007596133485590073, {1stFlrSFB, GarageYrBltB, LotAreaB, MoSold, YrSold})
None of the 5-tuples contain model variables.
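The effect of valency can be illustrated with a small calculation. A tuple can only separate every event of the sample if its volume is at least the sample size, which is a necessary but not sufficient condition. The sketch below assumes a valency of 21 for a typical bucketed substrate variable (inferred from the volume 9261 reported earlier for a tuple of three bucketed variables) and a valency of 3 for a fud variable (the volume reported above for <<1,2>,46>); the function name is hypothetical.

```python
def min_tuple_size(valency, sample_size):
    """Smallest k such that valency**k >= sample_size."""
    k, volume = 0, 1
    while volume < sample_size:
        volume *= valency
        k += 1
    return k

SAMPLE_SIZE = 1460  # events in the training sample

substrate_k = min_tuple_size(21, SAMPLE_SIZE)  # assumed bucketed substrate valency
fud_k = min_tuple_size(3, SAMPLE_SIZE)         # assumed fud variable valency
print(substrate_k, fud_k)
```

With these assumed valencies, at least 3 substrate variables but at least 7 tri-valent fud variables would be needed before the volume reaches the sample size.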
Now optimise for larger tuples, excluding the substrate. Let $A_{\mathrm{trbb}2} = A_{\mathrm{trbb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$, $A_{\mathrm{trbrb}2} = A_{\mathrm{trbrb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$ and $A_{\mathrm{tebb}2} = A_{\mathrm{tebb}}~\%~(\mathrm{vars}(F) \setminus V_{\mathrm{b}} \cup V_{\mathrm{bl}})$,
hhtrbb2 = hrhrred(hhtrbb,fvars(ff)-vvb|vvbl)
hhtrbrb2 = hrhrred(hhtrbrb,fvars(ff)-vvb|vvbl)
hhtebb2 = hrhrred(hhtebb,fvars(ff)-vvb|vvbl)
(kmax,omax,qmax) = (1, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (2.669795100904241, {<<1,1>,3>})
# (2.6829848274547423, {<<7,1>,2>})
# (2.6998273379134146, {<<1,1>,2>})
# (2.713881719423318, {<<1,n>,2>})
# (2.713881719423318, {<<1,2>,44>})
# (2.715230020247958, {<<1,1>,9>})
# (2.718001054594302, {<<7,3>,2>})
# (2.7182017226256883, {<<1,2>,46>})
# (2.7187679919428582, {<<1,1>,7>})
# (2.7237328733746518, {<<8,1>,5>})
The shuffle content derived alignment is \[ \{(\mathrm{algn}(A_{\mathrm{trbb}2}\%X)-\mathrm{algn}(A_{\mathrm{trbrb}2}\%X),~X) : (e,X) \in L\} \]
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (0.0, {<<8,1>,5>})
# (0.0, {<<7,3>,2>})
# (0.0, {<<7,1>,2>})
# (0.0, {<<1,2>,46>})
# (0.0, {<<1,2>,44>})
# (0.0, {<<1,1>,9>})
# (0.0, {<<1,1>,7>})
# (0.0, {<<1,1>,3>})
# (0.0, {<<1,1>,2>})
# (0.0, {<<1,n>,2>})
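The zero alignments for the 1-tuples are what one would expect if the alignment compares a histogram with its independent histogram, because a histogram in a single variable is its own independent. The sketch below assumes the alignment is the sum of the log factorials of the counts less the same sum over the independent counts. The function `alignment` and the toy rows are hypothetical and are not the library's `algn`.

```python
from collections import Counter
from itertools import product
from math import lgamma

def alignment(rows):
    """Sum of ln(count!) over the histogram minus the same sum over its independent."""
    z = len(rows)
    n = len(rows[0])
    hist = Counter(rows)
    marginals = [Counter(r[i] for r in rows) for i in range(n)]
    actual = sum(lgamma(c + 1) for c in hist.values())
    independent = 0.0
    # Independent count of a state: product of its marginal counts over z^(n-1).
    for state in product(*marginals):
        count = 1.0
        for i, value in enumerate(state):
            count *= marginals[i][value]
        independent += lgamma(count / z ** (n - 1) + 1)
    return actual - independent

# Hypothetical toy rows: one variable, then two perfectly correlated variables.
single = [(0,), (0,), (1,), (2,), (2,), (2,)]
correlated = [(0, 0)] * 4 + [(1, 1)] * 4

print(alignment(single))      # a single variable has no alignment
print(alignment(correlated))  # correlated variables have positive alignment
```

The difference of two such zeros, one for the sample and one for its shuffle, is again zero, which is consistent with the listing above.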
and the size-volume-sized-shuffle relative entropy is \[ \{(\mathrm{rent}(A_{\mathrm{trbb}2}~\%~X,~Z_F * \hat{A}_{\mathrm{trbrb}2}~\%~X),~X) : (e,X) \in L\} \]
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (328.5675967076595, {<<1,2>,46>})
# (38.520839912467636, {<<1,2>,44>})
# (38.520839912467636, {<<1,n>,2>})
# (23.77367675193318, {<<7,3>,2>})
# (5.730394727029356, {<<1,1>,9>})
# (9.237055564881302e-14, {<<1,1>,3>})
# (-1.0658141036401503e-14, {<<7,1>,2>})
# (-4.263256414560601e-14, {<<1,1>,2>})
# (-9.947598300641403e-14, {<<8,1>,5>})
# (-2.4868995751603507e-13, {<<1,1>,7>})
xx = [stringsVariable("<<1,2>,46>")]
len(xx)
1
fund(fdep(ff,xx))
# {BsmtQual, Foundation, GarageYrBltB, YearRemodAddB}
The label entropy of the tuple, $X$, is $\mathrm{lent}(A_{\mathrm{trbb}},X,V_{\mathrm{bl}})$,
hrlent(uub2,hhtrbb,xx,vvbl)
2.7182017226256883
vol(uub2,xx)
3
vol(uub2,fund(fdep(ff,xx)))
11970
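The volume of a set of variables is the product of their valencies, which is why the underlying volume grows so much faster than the derived volume. A minimal sketch follows, in which the function `volume`, the variable names and the valencies are all hypothetical and are not taken from the dataset.

```python
from math import prod

def volume(valencies, variables):
    """Number of possible states: the product of the variables' valencies."""
    return prod(valencies[v] for v in variables)

# Hypothetical valencies, chosen only for illustration.
valencies = {"u": 3, "v": 6, "w": 20}

print(volume(valencies, ["u"]))
print(volume(valencies, ["u", "v"]))
print(volume(valencies, ["u", "v", "w"]))
```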
The tuple, $X$, is also very query effective, $\mathrm{size}(A_{\mathrm{tebb}2}\%X * (A_{\mathrm{trbb}2}\%X)^{\mathrm{F}})$,
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1459 % 1
(kmax,omax,qmax) = (2, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (2.305341144972077, {<<1,1>,3>, <<7,1>,2>})
# (2.3198579037518825, {<<1,1>,3>, <<7,1>,39>})
# (2.3393213381047397, {<<1,1>,3>, <<7,1>,44>})
# (2.3395583059234233, {<<1,1>,2>, <<7,1>,2>})
# (2.3430591600610113, {<<1,1>,7>, <<7,1>,2>})
# (2.355056115647031, {<<1,1>,2>, <<7,1>,39>})
# (2.3583556507794876, {<<1,1>,3>, <<7,1>,1>})
# (2.3635828963123915, {<<1,1>,7>, <<7,1>,39>})
# (2.365723758707168, {<<1,1>,3>, <<7,3>,2>})
# (2.3695799074638098, {<<1,1>,3>, <<7,1>,42>})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (115.12929318577699, {<<1,1>,3>, <<7,1>,2>})
# (113.76007930535798, {<<1,1>,2>, <<7,1>,2>})
# (109.91491481954017, {<<1,1>,3>, <<7,3>,2>})
# (104.49829987349767, {<<1,1>,3>, <<7,1>,44>})
# (97.43300394335802, {<<1,1>,3>, <<7,1>,1>})
# (93.26220109565656, {<<1,1>,7>, <<7,1>,2>})
# (92.82522411873833, {<<1,1>,3>, <<7,1>,39>})
# (86.5499693790789, {<<1,1>,3>, <<7,1>,42>})
# (84.22246529301538, {<<1,1>,2>, <<7,1>,39>})
# (68.85149937955248, {<<1,1>,7>, <<7,1>,39>})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (161.96050031032064, {<<1,1>,3>, <<7,3>,2>})
# (42.298556271586676, {<<1,1>,3>, <<7,1>,2>})
# (38.16857739141096, {<<1,1>,3>, <<7,1>,44>})
# (33.52068556603558, {<<1,1>,2>, <<7,1>,2>})
# (30.92740438162491, {<<1,1>,3>, <<7,1>,39>})
# (28.63735377895341, {<<1,1>,3>, <<7,1>,1>})
# (28.207405835408736, {<<1,1>,7>, <<7,1>,2>})
# (25.927168307674037, {<<1,1>,3>, <<7,1>,42>})
# (25.51855137601865, {<<1,1>,2>, <<7,1>,39>})
# (20.197547061751266, {<<1,1>,7>, <<7,1>,39>})
xx = list(map(stringsVariable,["<<1,1>,3>","<<7,3>,2>"]))
len(xx)
2
fund(fdep(ff,xx))
# {Foundation, GrLivAreaB, TotRmsAbvGrd, YearBuiltB}
hrlent(uub2,hhtrbb,xx,vvbl)
2.365723758707158
vol(uub2,xx)
18
vol(uub2,fund(fdep(ff,xx)))
37044
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1459 % 1
(kmax,omax,qmax) = (3, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (1.9386864544632831, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (1.9395045487534865, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (1.9473458643540305, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (1.948943978266302, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (1.9522572171883503, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (1.9544643431992634, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})
# (1.9557094113731153, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (1.964120788397755, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (1.96735774869191, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (1.9688461943837092, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (450.43720437793763, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (423.8150201085655, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (326.52854136239284, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})
# (289.34063860936385, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (277.0541070490017, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (267.12296731992546, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (236.89657582315158, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (218.99769854904662, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (206.352038190636, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (193.11755265618422, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (558.1237513674423, {<<1,1>,3>, <<5,2>,67>, <<7,1>,2>})
# (501.6570569383912, {<<1,1>,3>, <<5,2>,67>, <<7,1>,39>})
# (471.61421102145687, {<<1,1>,2>, <<5,2>,67>, <<7,1>,39>})
# (385.22984929557424, {<<1,1>,3>, <<5,2>,66>, <<7,1>,2>})
# (379.7860253904946, {<<1,1>,3>, <<5,2>,66>, <<7,1>,44>})
# (365.09582845144905, {<<1,1>,2>, <<5,2>,66>, <<7,1>,2>})
# (353.56139129481744, {<<1,1>,3>, <<5,2>,66>, <<7,1>,39>})
# (347.1745320268674, {<<1,1>,7>, <<5,2>,66>, <<7,1>,2>})
# (338.9905341231497, {<<1,1>,2>, <<5,2>,66>, <<7,1>,39>})
# (317.58896323957015, {<<1,1>,7>, <<5,2>,66>, <<7,1>,39>})
xx = list(map(stringsVariable,["<<1,1>,3>","<<5,2>,67>","<<7,1>,2>"]))
len(xx)
3
fund(fdep(ff,xx))
# {BsmtQual, GrLivAreaB, MSSubClass, TotalBsmtSFB, YearBuiltB}
hrlent(uub2,hhtrbb,xx,vvbl)
1.9673577486918812
vol(uub2,xx)
96
vol(uub2,fund(fdep(ff,xx)))
740880
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1454 % 1
Continuing on to the 5-tuple,
(kmax,omax,qmax) = (5, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (1.151595152132888, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (1.1541639205451997, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (1.159747665972092, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1.163605416577954, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (1.164104463623639, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (1.1660351927564374, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1.1697783650332028, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (1.1715271391890552, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (1.1839921214905944, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (1.1854837763031236, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (1169.7369261368972, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (1122.8132005772654, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (1083.6899385564275, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (991.344032497169, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (953.5517213466175, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (952.6896983821832, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (945.5980858986261, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})
# (908.15744405081, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (790.7029253629565, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (757.1364269623466, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (6800.9578676223755, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,42>})
# (6441.833786427975, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (6371.953254342079, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,18>, <<7,1>,39>})
# (6345.806191205978, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,2>, <<10,1>,77>})
# (6234.429318904877, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (6038.452056646347, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (5983.207999944687, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>})
# (5953.347136735916, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,26>, <<7,1>,39>})
# (3819.1537833809853, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
# (3760.2451288998127, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>})
xx = list(map(stringsVariable,["<<1,1>,7>","<<1,1>,53>","<<5,2>,66>","<<7,1>,39>","<<7,1>,42>"]))
len(xx)
5
fund(fdep(ff,xx))
# {1stFlrSFB, BldgType, GarageYrBltB, GrLivAreaB, HalfBath, TotalBsmtSFB, YearRemodAddB}
hrlent(uub2,hhtrbb,xx,vvbl)
1.183992121490582
vol(uub2,xx)
1920
vol(uub2,fund(fdep(ff,xx)))
55427085
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1271 % 1
The 5-tuple model may be compared to the 2-tuple substrate model, above,
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])
len(xx)
2
lent(aatrb,xx,vvbl)
1.2044059887997252
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for aar1 in [hhaa(hrhh(uub,hrhrred(hhtrbr,xx)))]]))))
# ...
# (170.12391234916913, {BsmtUnfSFB, GrLivAreaB})
# ...
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# ...
# (168.78883116031693, {BsmtUnfSFB, GrLivAreaB})
# ...
vol(uub,xx)
462
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1
The 2-tuple substrate model has similar label entropy and is more query effective (1395 against 1271 test events), but it has lower derived alignment and lower relative entropy, so the 5-tuple model is the more robust model.
(kmax,omax,qmax) = (7, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (0.6510428779469537, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (0.6527333874114545, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (0.6579842790563353, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (0.6593833136968277, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (0.6604468154146392, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (0.6614815479610705, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.6625079223227841, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.6637523396828247, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})
# (0.6665422572749868, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (0.6678034581419459, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})
rpln(reversed(list(sset([(algn(aa1)-algn(aar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for aar1 in [hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx)))]]))))
# (669.8102531949735, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (636.9428039305611, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (632.1125474520368, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})
# (617.1921275716891, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (605.8336896027176, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (604.8143918693091, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (599.0865094568629, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})
# (586.828691073807, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (539.0592319851964, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (534.4426187070898, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (19016.152709960938, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (18762.087646484375, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,77>, <<11,3>,61>, <<12,1>,23>})
# (18434.231048583984, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (18311.437713623047, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<6,1>,61>, <<7,1>,39>, <<10,1>,77>, <<12,1>,23>})
# (16355.118576049805, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (16229.837440490723, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>})
# (15821.124588012695, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<12,1>,23>})
# (14755.907958984375, {<<1,1>,2>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (14458.577049255371, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (14223.993101119995, {<<1,1>,7>, <<1,1>,53>, <<5,2>,66>, <<7,1>,39>, <<7,1>,68>, <<10,1>,73>, <<12,1>,23>})
xx = list(map(stringsVariable,["<<1,1>,2>","<<1,1>,53>","<<5,2>,66>","<<7,1>,39>","<<10,1>,77>","<<11,3>,61>","<<12,1>,23>"]))
len(xx)
7
fund(fdep(ff,xx))
# {1stFlrSFB, Alley, BldgType, BsmtCond, BsmtFullBath, FullBath, GarageYrBltB, GrLivAreaB, HalfBath, PavedDrive, Street, TotalBsmtSFB, YearRemodAddB}
hrlent(uub2,hhtrbb,xx,vvbl)
0.6527333874114696
vol(uub2,xx)
8640
vol(uub2,fund(fdep(ff,xx)))
124710941250
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 1029 % 1
Note that the 7-tuple model derived alignments are lower than for the 5-tuple model, although the relative entropies are higher.
Now skip to the 10-tuple,
(kmax,omax,qmax) = (10, 10, 10)
ll = buildcondrr(vvbl,hhtrbb2,kmax,omax,qmax)
rpln(ll)
# (0.26196615178403615, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2625353845839111, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.26511267145079387, {<<1,1>,2>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.26593728122492166, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2669168808919986, {<<1,1>,7>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2670790785114949, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.2673522747237458, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (0.2674560307104965, {<<1,1>,2>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2707630250485886, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,43>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (0.2707823213240923, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx)))] for vaar1 in [vsize(uub2,fund(fdep(ff,xx)),hhaa(hrhh(uub2,hrhrred(hhtrbrb2,xx))))]]))))
# (36967.0, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (36811.25, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (36647.75, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,43>, <<5,2>,66>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (35137.3125, {<<1,1>,7>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (35091.5625, {<<1,1>,2>, <<1,1>,53>, <<2,1>,41>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
# (34303.0625, {<<1,1>,2>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (34221.75, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33440.578125, {<<1,1>,2>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33389.84375, {<<1,1>,7>, <<1,1>,53>, <<5,2>,43>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>, <<13,1>,9>})
# (33076.046875, {<<1,1>,7>, <<1,1>,53>, <<2,1>,40>, <<5,2>,66>, <<6,1>,92>, <<7,1>,39>, <<10,1>,9>, <<10,1>,73>, <<11,3>,61>, <<12,1>,23>})
xx = list(map(stringsVariable,["<<1,1>,2>","<<1,1>,53>","<<2,1>,41>","<<5,2>,66>","<<6,1>,92>","<<7,1>,39>","<<10,1>,73>","<<11,3>,61>","<<12,1>,23>","<<13,1>,9>"]))
len(xx)
10
fund(fdep(ff,xx))
# {1stFlrSFB, Alley, BldgType, BsmtCond, BsmtExposure, BsmtFullBath, CentralAir, Exterior1st, FireplaceQu, FullBath, GarageYrBltB, GrLivAreaB, HalfBath, KitchenAbvGr, PavedDrive, TotalBsmtSFB, YearRemodAddB}
hrlent(uub2,hhtrbb,xx,vvbl)
0.2659372812249323
vol(uub2,xx)
155520
vol(uub2,fund(fdep(ff,xx)))
239445007200000
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebb2,xx))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbb2,xx))))))
# 616 % 1
Note that the derived volume is now very large so we have not calculated the derived alignment.
The 10-tuple sub-model of the induced model may be compared to the 3-tuple substrate model, above,
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB","LotAreaB"]])
len(xx)
3
lent(aatrb,xx,vvbl)
0.1632160815826591
rpln(reversed(list(sset([(rent(aa1,vaar1),xx) for (e,xx) in ll for aa1 in [hhaa(hrhh(uub,hrhrred(hhtrb,xx)))] for vaar1 in [vsize(uub,xx,hhaa(hrhh(uub,hrhrred(hhtrbr,xx))))]]))))
# ...
# (3658.9316126172052, {BsmtUnfSFB, GrLivAreaB, LotAreaB})
# ...
vol(uub,xx)
9702
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 369 % 1
The 3-tuple substrate model has similar label entropy and query effectiveness to the 10-tuple sub-model. The 3-tuple model has lower relative entropy, however, so the 10-tuple model is the more likely model. That is, the 10-tuple model is more accurate when effective.
The underlying variables of the conditional entropy fud decompositions above are just the query substrate, $V_{\mathrm{bk}}$. Now let us consider taking the fud variables, $\mathrm{vars}(F)$, of the induced model, $D$, as the underlying variables instead. First we must reframe the fud variables,
def refr1(k):
    def refr1k(v):
        if isinstance(v, VarPair):
            (w,i) = v._rep
            if isinstance(w, VarPair):
                (f,l) = w._rep
                if isinstance(f, VarInt):
                    return VarPair((VarPair((VarPair((VarInt(k),f)),l)),i))
        return v
    return refr1k

def tframe(f,tt):
    reframe = transformsMapVarsFrame
    nn = sdict([(v,f(v)) for v in tvars(tt)])
    return reframe(tt,nn)

def fframe(f,ff):
    return qqff([tframe(f,tt) for tt in ffqq(ff)])

ff1 = fframe(refr1(1),ff)
uub1 = uunion(uub,fsys(ff1))
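The effect of the reframing on a variable identifier can be seen in a self-contained sketch that uses plain nested tuples in place of the library's variable types. Here `reframe` is a hypothetical stand-in for `refr1`, and a fud variable written `<<f,l>,i>` is represented as `((f, l), i)`.

```python
def reframe(k):
    """Return a function mapping ((f, l), i) to (((k, f), l), i); other variables pass through."""
    def reframe_k(v):
        if isinstance(v, tuple) and len(v) == 2:
            (w, i) = v
            if isinstance(w, tuple) and len(w) == 2:
                (f, l) = w
                if isinstance(f, int):
                    # Prefix the fud identifier with the frame number k.
                    return (((k, f), l), i)
        return v
    return reframe_k

r = reframe(1)
print(r(((1, 1), 3)))    # (((1, 1), 1), 3)
print(r(((1, "n"), 2)))  # (((1, 1), 'n'), 2)
print(r("GrLivAreaB"))   # GrLivAreaB
```

Substrate variables are left unchanged, while each fud variable gains a frame prefix, so that the variables of a later model built over the reframed fud cannot collide with those of the induced model.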
Now we apply the reframed fud to the sample,
hhtrbb = hrfmul(uub1,ff1,hhtrb)
hhtrbb2 = hrhrred(hhtrbb,fvars(ff1)-vvb|vvbl)
Now apply the conditional entropy fud decomper to minimise the label entropy,
def decompercondrr(ll,uu,aa,kmax,omax,fmax):
    return parametersSystemsHistoryRepasDecomperConditionalFmaxRepa(kmax,omax,fmax,uu,ll,aa)

(kmax,omax) = (1,5)
(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,15)
dfund(df2)
# {<<<1,1>,1>,3>, <<<1,1>,1>,53>, <<<1,5>,2>,66>, <<<1,5>,2>,76>, <<<1,6>,1>,3>, <<<1,6>,1>,18>, <<<1,6>,1>,20>, <<<1,7>,1>,2>, <<<1,7>,1>,39>, <<<1,7>,1>,42>, <<<1,15>,1>,2>}
len(dfund(df2))
11
rpln(treesPaths(funcsTreesMap(lambda xx:(fder(xx[1]),fund(xx[1])),dfzz(df2))))
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<5,1>,1>}, {<<<1,7>,1>,42>}), ({<<15,1>,1>}, {<<<1,1>,1>,53>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<11,1>,1>}, {<<<1,5>,2>,76>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<2,1>,1>}, {<<<1,7>,1>,2>}), ({<<12,1>,1>}, {<<<1,1>,1>,53>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<3,1>,1>}, {<<<1,7>,1>,2>}), ({<<6,1>,1>}, {<<<1,5>,2>,66>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<3,1>,1>}, {<<<1,7>,1>,2>}), ({<<13,1>,1>}, {<<<1,5>,2>,66>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<4,1>,1>}, {<<<1,7>,1>,39>}), ({<<8,1>,1>}, {<<<1,6>,1>,3>}), ({<<14,1>,1>}, {<<<1,15>,1>,2>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<7,1>,1>}, {<<<1,7>,1>,2>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<9,1>,1>}, {<<<1,6>,1>,18>})]
# [({<<1,1>,1>}, {<<<1,1>,1>,3>}), ({<<10,1>,1>}, {<<<1,6>,1>,20>})]
Consider this model as a predictor of label,
ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))
uub2 = uunion(uub,fsys(ff2))
hhtrbc = hrfmul(uub2,ff2,hhtrb)
hrlent(uub2,hhtrbc,fder(ff2),vvbl)
2.047763453782215
hhtebc = hrfmul(uub2,ff2,hhteb)
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1459 % 1
Continuing on with larger fuds,
(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,63)
len(dfund(df2))
38
ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))
uub2 = uunion(uub,fsys(ff2))
hhtrbc = hrfmul(uub2,ff2,hhtrb)
hrlent(uub2,hhtrbc,fder(ff2),vvbl)
1.46016484022873
hhtebc = hrfmul(uub2,ff2,hhteb)
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1455 % 1
(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,127)
len(dfund(df2))
70
ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))
uub2 = uunion(uub,fsys(ff2))
hhtrbc = hrfmul(uub2,ff2,hhtrb)
hrlent(uub2,hhtrbc,fder(ff2),vvbl)
1.0787486822622814
hhtebc = hrfmul(uub2,ff2,hhteb)
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1454 % 1
The 127-fud conditional over 15-fud induced model has lower label entropy than the 2-tuple substrate model, and is also more query effective,
xx = sset([VarStr(s) for s in ["BsmtUnfSFB","GrLivAreaB"]])
len(xx)
2
lent(aatrb,xx,vvbl)
1.2044059887997252
size(mul(aateb,eff(hhaa(hrhh(uub,hrhrred(hhtrb,xx))))))
# 1395 % 1
The 127-fud conditional over 15-fud induced model has similar label entropy to the 31-fud conditional substrate model, but is more query effective,
len(dfund(df))
11
hrlent(uub1,hhtrbb,fder(ff),vvbl)
1.0078399275847598
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1352 % 1
(uub2,df2) = decompercondrr(vvbl,uub1,hhtrbb2,kmax,omax,255)
len(dfund(df2))
106
ff2 = systemsDecompFudsNullablePracticable(uub2,df2,1)
ff2 = fdep(funion(ff2,ff1),fder(ff2))
uub2 = uunion(uub,fsys(ff2))
hhtrbc = hrfmul(uub2,ff2,hhtrb)
hrlent(uub2,hhtrbc,fder(ff2),vvbl)
0.6685934841580492
hhtebc = hrfmul(uub2,ff2,hhteb)
size(mul(hhaa(hrhh(uub2,hrhrred(hhtebc,fder(ff2)))),eff(hhaa(hrhh(uub2,hrhrred(hhtrbc,fder(ff2)))))))
# 1448 % 1
The 255-fud conditional over 15-fud induced model has similar label entropy to the 63-fud conditional substrate model, but is more query effective,
len(dfund(df))
15
hrlent(uub1,hhtrbb,fder(ff),vvbl)
0.7144285840050593
size(mul(hhaa(hrhh(uub1,hrhrred(hhtebb,fder(ff)))),eff(hhaa(hrhh(uub1,hrhrred(hhtrbb,fder(ff)))))))
# 1243 % 1
With respect to sale price, we can see that there are sub-models of the induced model which have similar properties to models consisting of subsets of the substrate. Again, when choosing between sub-models of the induced model there is a trade-off between model likelihood and query effectiveness. When choosing between a sub-model of the induced model and a corresponding substrate variable model of similar label entropy and query effectiveness, however, the sub-model is, in general, the more likely model. That is, the sub-model of the induced model is preferable to the substrate model because it is more accurate when it is query effective.
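The trade-off can be tabulated from the figures reported above for the selected tuples of the induced model. The sketch below only restates those outputs, rounded to four decimal places, and expresses the query effectiveness as a fraction of the 1459 test events. The variable names are hypothetical.

```python
# Test sample size, as stated in the introduction.
TEST_SIZE = 1459

# (tuple size, label entropy in nats, effective test events), copied from the outputs above.
sub_models = [
    (1, 2.7182, 1459),
    (2, 2.3657, 1459),
    (3, 1.9674, 1454),
    (5, 1.1840, 1271),
    (7, 0.6527, 1029),
    (10, 0.2659, 616),
]

for (k, lent, effective) in sub_models:
    print(f"{k:>2}-tuple  label entropy {lent:.4f}  effective {effective / TEST_SIZE:.1%}")

# Each step to a larger tuple lowers the label entropy and does not raise the effectiveness.
entropies = [lent for (_k, lent, _q) in sub_models]
effectives = [q for (_k, _lent, q) in sub_models]
print(entropies == sorted(entropies, reverse=True))
print(effectives == sorted(effectives, reverse=True))
```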