We note that some features are factors and some are numerical. We are going to attempt to predict bikers. We suspect a Poisson model to be appropriate, since the number of bikers can only be a non-negative integer, and under the conditions of a specific day we will assume that bike usage follows a Poisson process.
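To make that concrete, the standard Poisson regression formulation (my notation, not tied to any particular package) models the count on day $i$ as

$$
\text{bikers}_i \mid x_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log \lambda_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
$$

so the log link guarantees a strictly positive predicted mean count.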
So we learn from EDA that we cannot use registered and casual to predict bikers, because that’s cheating!
First things first, we will split the data into training and test sets and create a cross-validation folds object (that we will use later).
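A minimal sketch of that step with rsample; the data frame name bike and the object names here are placeholders, not necessarily the ones used elsewhere:

library(tidymodels)

set.seed(123)

# Hold out a test set, stratified on the count outcome
bike_split <- initial_split(bike, prop = 0.8, strata = bikers)
bike_train <- training(bike_split)
bike_test  <- testing(bike_split)

# Cross-validation folds on the training data, for tuning later on
bike_folds <- vfold_cv(bike_train, v = 10)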
head(SingaporeAuto)
SexInsured Female VehicleType PC Clm_Count Exp_weights LNWEIGHT NCD AgeCat
1 U 0 T 0 0 0.6680356 -0.40341383 30 0
2 U 0 T 0 0 0.5667351 -0.56786326 30 0
3 U 0 T 0 0 0.5037645 -0.68564629 30 0
4 U 0 T 0 0 0.9144422 -0.08944106 20 0
5 U 0 T 0 0 0.5366188 -0.62246739 20 0
6 U 0 T 0 0 0.7529090 -0.28381095 20 0
AutoAge0 AutoAge1 AutoAge2 AutoAge VAgeCat VAgecat1
1 0 0 0 0 0 2
2 0 0 0 0 0 2
3 0 0 0 0 0 2
4 0 0 0 0 0 2
5 0 0 0 0 0 2
6 0 0 0 0 0 2
summary(SingaporeAuto)
SexInsured Female VehicleType PC Clm_Count
F: 700 Min. :0.00000 A :3842 Min. :0.0000 Min. :0.00000
M:3145 1st Qu.:0.00000 G :2882 1st Qu.:0.0000 1st Qu.:0.00000
U:3638 Median :0.00000 Q : 358 Median :1.0000 Median :0.00000
Mean :0.09355 M : 188 Mean :0.5134 Mean :0.06989
3rd Qu.:0.00000 P : 88 3rd Qu.:1.0000 3rd Qu.:0.00000
Max. :1.00000 Z : 71 Max. :1.0000 Max. :3.00000
(Other): 54
Exp_weights LNWEIGHT NCD AgeCat
Min. :0.005476 Min. :-5.2074 Min. : 0.00 Min. :0.00
1st Qu.:0.279261 1st Qu.:-1.2756 1st Qu.: 0.00 1st Qu.:0.00
Median :0.503764 Median :-0.6856 Median :20.00 Median :2.00
Mean :0.519859 Mean :-0.8945 Mean :19.85 Mean :1.94
3rd Qu.:0.752909 3rd Qu.:-0.2838 3rd Qu.:30.00 3rd Qu.:4.00
Max. :1.000000 Max. : 0.0000 Max. :50.00 Max. :7.00
AutoAge0 AutoAge1 AutoAge2 AutoAge
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000
Median :0.0000 Median :0.00000 Median :0.00000 Median :1.000
Mean :0.3905 Mean :0.05867 Mean :0.05987 Mean :0.509
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.000
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.000
VAgeCat VAgecat1
Min. :0.000 Min. :2.000
1st Qu.:0.000 1st Qu.:2.000
Median :1.000 Median :2.000
Mean :2.019 Mean :2.933
3rd Qu.:4.000 3rd Qu.:4.000
Max. :6.000 Max. :6.000
# A tibble: 1 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 poisson_log_loss standard 0.249 36 0.00399 Preprocessor1_Model40
tune_metrics <- collect_metrics(SA_tuning)
ggplot(tune_metrics, aes(x = penalty, y = mean)) +
  geom_point()
Warning: Removed 39 rows containing missing values or values outside the scale range
(`geom_point()`).
tune_metrics
# A tibble: 40 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 1 e-10 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
2 1.80e-10 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
3 3.26e-10 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
4 5.88e-10 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
5 1.06e- 9 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
6 1.91e- 9 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
7 3.46e- 9 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
8 6.24e- 9 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
9 1.13e- 8 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
10 2.03e- 8 poisson_log_loss standard NaN 0 NA Preprocessor1_Model…
# ℹ 30 more rows
Well, this suggests that something is going on with how the penalty parameters are being fed to glmnet in the case of Poisson regression. There is clearly something I do not understand.
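For reference, here is a minimal sketch of the kind of tuning setup that would produce a 40-value penalty grid like the one above (the default penalty() range of 1e-10 to 1 with 40 levels matches the printed values). SA_recipe and SA_folds are placeholder names, and this is my reconstruction rather than the exact code that produced that output:

library(tidymodels)
library(poissonreg)   # provides poisson_reg() with a glmnet engine

# Penalized Poisson regression; only the penalty is tuned here
pois_spec <- poisson_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

SA_wf <- workflow() |>
  add_recipe(SA_recipe) |>
  add_model(pois_spec)

SA_tuning <- tune_grid(
  SA_wf,
  resamples = SA_folds,
  grid      = grid_regular(penalty(), levels = 40),
  metrics   = metric_set(poisson_log_loss)
)

collect_metrics(SA_tuning)

Writing the spec and grid out explicitly at least narrows down where the mismatch could sit: in the model specification, in the penalty grid, or in how the metric is evaluated across the glmnet path.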