MUSH - Model 16 induction

MUSH - Analysis of the UCI Machine Learning Repository Mushroom Data Set/Model 16 induction

MUSH_model16.json is induced by MUSH_engine16.hs.

MUSH_engine16 may be built as described in the README. Then run it as follows,

stack exec MUSH_engine16.exe +RTS -s >MUSH_engine16.log 2>&1 &

tail -f MUSH_engine16.log

The first section loads the sample,

    (uu,hh) <- do
      mush <- ByteStringChar8.readFile "../MUSH/agaricus-lepiota.data"
      let aa = llaa $ map (\ll -> (llss ll,1)) $ map (\ss -> (map (\(u,(v,uu)) -> (VarStr v,ValStr (fromJust (lookup u uu)))) (zip ss names))) $ map (\l -> filter (/=',') l) $ lines $ ByteStringChar8.unpack $ mush
      let uu = sys aa
      return (uu, aahr uu aa)

    let vv = uvars uu
    let vvl = Set.singleton (VarStr "edible")
    let vvk = vv `Set.difference` vvl
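The per-record decoding step in the loader can be illustrated in isolation. The following is a minimal sketch using only `base`; the `names` table here is a hypothetical two-column abbreviation of the real one (which is not shown in this section), and plain `String` pairs stand in for the `VarStr` and `ValStr` constructors.

```haskell
import Data.Maybe (fromJust)

-- Hypothetical, abbreviated stand-in for the `names` table: each variable
-- name is paired with a lookup from one-letter codes to readable values.
names :: [(String, [(Char, String)])]
names =
  [ ("edible",    [('e', "edible"), ('p', "poisonous")])
  , ("cap-shape", [('b', "bell"), ('x', "convex")])
  ]

-- Decode one comma-separated record into (variable, value) pairs,
-- mirroring the filter / zip / lookup steps of the loader.
decode :: String -> [(String, String)]
decode line =
  [ (v, fromJust (lookup c codes))
  | (c, (v, codes)) <- zip (filter (/= ',') line) names ]
```

In GHCi, `decode "p,x"` evaluates to `[("edible","poisonous"),("cap-shape","convex")]`. As in the loader, `fromJust` fails on a code missing from the table, so this sketch assumes every code in the data file is listed.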

Then the parameters are defined,

    let model = "MUSH_model16"
    let (wmax,lmax,xmax,omax,bmax,mmax,umax,pmax,fmax,mult,seed) = ((9*9*10), 8, (9*9*10), 40, (40*4), 4, (9*9*10), 1, 20, 7, 5)

Here the limit of the underlying volume, xmax, is set to the product of the second, third and fourth largest valencies, 9*9*10,

rpln $ sort [(u,w) | w <- qqll vv, let u = vol uu (sgl w)]
...
"(9,stalk-color-above-ring)"
"(9,stalk-color-below-ring)"
"(10,cap-color)"
"(12,gill-color)"

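The choice of xmax can be checked with a few lines of arithmetic. This sketch assumes only the four largest valencies shown in the listing above; the helper name `xmaxFrom` is hypothetical and not part of the engine.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- Product of the second, third and fourth largest valencies.
xmaxFrom :: [Integer] -> Integer
xmaxFrom = product . take 3 . drop 1 . sortBy (comparing Down)

-- The four largest valencies, taken from the listing above.
topValencies :: [Integer]
topValencies = [9, 9, 10, 12]
```

Here `xmaxFrom topValencies` evaluates to `810`, which equals `9*9*10`: the largest valency, 12, is dropped and the next three are multiplied.
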
In general, increasing any of the parameters of the maximum-roll-by-derived-dimension decomper increases the summed alignment valency-density at the cost of computation time and space. In this case the parameters are chosen such that MUSH_engine16 runs on an Ubuntu 16.04 Pentium CPU G2030 @ 3.00GHz using 1784 MB total memory in 1166 seconds.

Then the decomper is run,

    Just (uu',df) <- decomperIO uu vvk hh vvl wmax lmax xmax omax bmax mmax umax pmax fmax mult seed
...
  where 
...
    decomperIO uu vv hh ll wmax lmax xmax omax bmax mmax umax pmax fmax mult seed =
      parametersSystemsHistoryRepasDecomperMaxRollByMExcludedSelfHighestFmaxLabelMinEntropyIORepa 
        wmax lmax xmax omax bmax mmax umax pmax fmax mult seed uu vv hh ll 

Although all aligned induction is unsupervised, the label-entropy decomper used here orders the decomposition by choosing the next slice as the one with the highest scaled label entropy, rather than simply choosing the largest slice. Also, the label-entropy decomper does not decompose slices with zero label entropy. In this case, the decomper terminates after 10 nodes, when all leaf slices are label modal.
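The slice selection rule can be sketched independently of the library. The following assumes that "scaled" means the Shannon entropy of the label distribution multiplied by the slice size; that scaling, the `Slice` type and all function names are illustrative assumptions, not the library's definitions.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- A slice is represented here only by its label counts,
-- e.g. [edible count, poisonous count].
type Slice = [Double]

-- Shannon entropy (in nats) of the label distribution within a slice.
labelEntropy :: Slice -> Double
labelEntropy counts = negate (sum [p * log p | c <- counts, c > 0, let p = c / n])
  where n = sum counts

-- Assumed scaling: the entropy multiplied by the slice size.
scaledEntropy :: Slice -> Double
scaledEntropy counts = sum counts * labelEntropy counts

-- Next slice to decompose; Nothing when every slice has zero label entropy.
nextSlice :: [Slice] -> Maybe Slice
nextSlice slices = case filter ((> 0) . scaledEntropy) slices of
  []         -> Nothing
  candidates -> Just (maximumBy (comparing scaledEntropy) candidates)
```

Under these assumptions, `nextSlice [[100,0],[30,30],[500,1]]` picks `[30,30]` even though `[500,1]` is the larger slice, and `nextSlice [[10,0],[0,5]]` is `Nothing` because both slices already contain a single label.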

Then the model is written to MUSH_model16.json,

    writeModel model df
...
  where 
...
    writeModel model df = ByteString.writeFile (model ++ ".json") $ decompFudsPersistentsEncode $ decompFudsPersistent df

Finally, the summed alignment and the summed alignment valency-density are calculated,

    let (a,ad) = summation mult seed uu' df hh
    printf "alignment: %.2f\n" $ a
    printf "alignment density: %.2f\n" $ ad
...
  where 
...
    summation = systemsDecompFudsHistoryRepasAlignmentContentShuffleSummation_u

where systemsDecompFudsHistoryRepasAlignmentContentShuffleSummation_u is defined in module AlignmentPracticableRepa,

systemsDecompFudsHistoryRepasAlignmentContentShuffleSummation_u :: 
  Integer -> Integer -> System -> DecompFud -> HistoryRepa -> (Double,Double)

as

systemsDecompFudsHistoryRepasAlignmentContentShuffleSummation_u mult seed uu df aa =
    Set.fold scalgn (0,0) $ treesElements $ apply mult seed uu df aa
  where
    scalgn ((_,ff),(hr,hrxx)) (a,ad) = (a + b, ad + b/(u ** (1/m)))
      where
        u = fromIntegral (vol uu (vars aa))
        m = fromIntegral (Set.size (vars aa))
        aa = araa uu (hr `hrred` fder ff)
        bb = resize (size aa) (araa uu (hrxx `hrred` fder ff))
        b = algn aa - algn bb
    apply = systemsDecompFudsHistoryRepasMultiplyWithShuffle
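
Read as formulas, the fold above appears to compute the following. For each node of the decomposition with fud F, let A be the slice histogram reduced to the derived variables of F, and let A_X be the corresponding shuffled histogram, reduced in the same way and resized to the size of A. Within `scalgn` the local `aa` shadows the function argument, so u and m are taken here to be the volume and the number of those derived variables. The symbols A, A_X, a and a_d are notation introduced for this restatement only.

```latex
\[
  b = \mathrm{algn}(A) - \mathrm{algn}(A_X), \qquad
  a = \sum_{\text{nodes}} b, \qquad
  a_d = \sum_{\text{nodes}} \frac{b}{u^{1/m}}
\]
```

The divisor u^{1/m} is the geometric mean valency of the node's derived variables, which is why the second sum is called a valency-density.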

The summed alignment and the summed alignment valency-density are,

alignment: 71310.20
alignment density: 32440.30
