2024-02-16

Sparta/TIL (Today I Learned) 2024. 2. 17. 01:07

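As with the previous entries, I couldn't write this up at the time because of the project; I'll fill it in later.

-> Added later

When training yesterday, I left out the columns encoded with oe. I don't remember whether the fix happened at exactly this point, but I added them back in and retrained.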

Hyperparameters for each model

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

dt
- Package
  - sklearn.tree.DecisionTreeClassifier
  - from sklearn.tree import DecisionTreeClassifier
- Hyperparameter descriptions
  - criterion : {"gini", "entropy", "log_loss"}, default="gini"
    - The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "log_loss" and "entropy" both for the Shannon information gain, see Mathematical formulation.
  - max_depth : int, default=None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  - min_samples_split : int or float, default=2
    - The minimum number of samples required to split an internal node:
      - If int, then consider min_samples_split as the minimum number.
      - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  - min_samples_leaf : int or float, default=1
    - The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
      - If int, then consider min_samples_leaf as the minimum number.
      - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
- Values used
  - criterion = 'entropy' (default 'gini')
  - max_depth = 25?
  - min_samples_split = 2 (default)
  - min_samples_leaf = 1 (default)
  - (max_n_classes = 7 → this is the number of classes in the multiclass target, so it makes sense that it is fixed at 7, A through G)
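The int-versus-float rule quoted above for min_samples_split and min_samples_leaf is easy to misread, so here is a minimal sketch of the arithmetic. The helper name `effective_min_samples` and the sample count are made up for illustration; the real logic lives inside scikit-learn, which also enforces lower bounds of its own.

```python
import math


def effective_min_samples(value, n_samples):
    """Resolve an int-or-float setting to an absolute sample count.

    Hypothetical helper mirroring the documented rule:
    int -> used as is, float -> ceil(value * n_samples).
    """
    if isinstance(value, int):
        return value
    return math.ceil(value * n_samples)


n_samples = 1000  # made-up training set size

print(effective_min_samples(2, n_samples))     # int: used directly
print(effective_min_samples(0.01, n_samples))  # float: ceil(0.01 * 1000)
```

So with the default min_samples_split = 2 the tree can keep splitting down to pairs of samples, while a float such as 0.01 scales the threshold with the size of the training set.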

rf
from sklearn.ensemble import RandomForestClassifier
sklearn.ensemble.RandomForestClassifier
- Each fitted tree can be accessed through .estimators_
- ```
# Check the depth of each tree.
# loaded_dict is assumed to be the dict of previously saved models loaded earlier.
depths = [estimator.tree_.max_depth for estimator in loaded_dict['model_rf'].estimators_]
for idx, depth in enumerate(depths):
    print(f"Tree {idx} depth: {depth}")

# Mean depth across all trees in the forest
print(sum(depths) / len(depths))
```
- n_estimators : int, default=100
  - The number of trees in the forest.
- max_depth : int, default=None
  - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- Values used
  - n_estimators = 305 (default 100)
  - criterion = 'gini' (default)
  - max_depth = 62 (I had used 42 myself; checking the fitted trees, the deepest was 29 and the mean was 24)
  - min_samples_split = 7 (default 2)
  - min_samples_leaf = 1 (default)
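The observation above (max_depth set to 62, yet the deepest fitted tree only reached 29) suggests the cap was never the binding constraint. A minimal sketch of how to check this, assuming synthetic data from make_classification as a stand-in for the loan dataset and a small forest to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real loan dataset is not used here.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

cap = 62  # same max_depth as in the notes above
rf = RandomForestClassifier(n_estimators=20, max_depth=cap, random_state=42)
rf.fit(X, y)

# get_depth() reports the depth each fitted tree actually reached.
depths = [tree.get_depth() for tree in rf.estimators_]
print(f"deepest tree: {max(depths)}, mean depth: {sum(depths) / len(depths):.1f}")
print(f"cap reached: {max(depths) >= cap}")
```

If the cap is never reached, values such as 42 and 62 behave the same as max_depth=None, so searching over them mostly costs time without changing the model.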

et
sklearn.ensemble.ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier
- n_estimators : int, default=100
  - The number of trees in the forest.
- max_depth : int, default=None
  - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- Values used
  - n_estimators = 930 (default 100)
  - criterion = 'entropy' (default 'gini')
  - max_depth = 65 (I think I used 65 as well; the deepest fitted tree was 59 and the mean was 39.6)
  - min_samples_split = 6 (default 2)
  - min_samples_leaf = 1 (default)

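xgb

References: Python API Reference (xgboost 2.1.0-dev documentation); XGBoost main hyperparameters (with Python)

Parameter descriptions

• max_depth (Optional[int]) – Maximum tree depth for base learners.

1. General parameters

| No | Parameter | Description |
|---|---|---|
| 1 | booster | Choose gbtree (tree-based model) or gblinear (linear model). Default = 'gbtree' |
| 2 | silent | Controls output messages (set to 1 to suppress them). Default = 1. Recent XGBoost versions replace this with verbosity |
| 3 | nthread | Number of CPU threads to use. By default all are used; change it when only part of a multi-core/multi-thread CPU should be used |

2. Booster parameters (very important)

| No | Parameter | Description |
|---|---|---|
| 1 | eta (default 0.3) | Usually called the learning rate. Controls how much each weak learner contributes. Range: 0 to 1. Larger values update the model faster but make overfitting more likely |
| 2 | num_boost_round (default 10) | Number of boosting iterations, i.e. how many weak learners are trained |
| 3 | min_child_weight (default 1) | Minimum sum of instance weights (hessian) required in a leaf node; for squared-error loss this amounts to the minimum number of observations. Smaller values make overfitting more likely (used to control overfitting). Range: 0 to ∞ |
| 4 | gamma (default 0) | Minimum loss reduction required to split a leaf node further; a split is made only when the loss drops by more than this value. Larger values reduce overfitting. Range: 0 to ∞ |
| 5 | max_depth (default 6) | Maximum tree depth. 0 means no depth limit. One of the parameters that overfitting is most sensitive to (used to control overfitting). Range: 0 to ∞ |
| 6 | subsample (default 1) | Fraction of the training data sampled for each tree (controls overfitting). Values between 0.5 and 1 are typical. Range: 0 to 1 |
| 7 | colsample_bytree (default 1) | Fraction of features sampled when building each tree. Used to control overfitting when there are many features. Range: 0 to 1 |
| 8 | lambda (default 1) | L2 regularization term. Worth considering when there are many features. Larger values reduce overfitting |
| 9 | alpha (default 0) | L1 regularization term. Worth considering when there are many features. Larger values reduce overfitting |
| 10 | scale_pos_weight (default 1) | Balances an imbalanced dataset |

3. Training parameters

| No | Parameter | Description |
|---|---|---|
| 1 | objective | reg:linear: regression (renamed reg:squarederror in newer versions); binary:logistic: binary classification; multi:softmax: multiclass, returns classes; multi:softprob: multiclass, returns probabilities |
| 2 | eval_metric | Metric used for validation. Typically 'rmse' for regression and 'error' for classification. Options: rmse (Root Mean Squared Error), mae (mean absolute error), logloss (negative log-likelihood), error (binary classification error rate), merror (multiclass classification error rate), mlogloss (multiclass logloss), auc (Area Under Curve) |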

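reg_lambda: L2 regularization weight. Larger values make the model learn more conservatively. Default is 1.
reg_alpha: L1 regularization weight. Larger values make the model learn more conservatively. Default is 0.
tree_method: Defines the tree construction algorithm. "exact" uses the exact greedy algorithm over the whole dataset. Other options include 'approx', 'hist', and 'gpu_hist'. The default is 'auto', which picks a suitable algorithm automatically (from XGBoost 2.0 the default is 'hist', and 'gpu_hist' is deprecated in favor of 'hist' with device='cuda').
colsample_bytree: Fraction of columns subsampled when building each tree. It takes a value between 0 and 1; the smaller the value, the fewer features each tree learns from, which helps prevent overfitting. Default is 1.

All of the hyperparameters above control model complexity and overfitting. Setting them well can improve model performance considerably.

"Learning conservatively" → means the model reacts less sensitively to outliers and noise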
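To make "learning conservatively" concrete: in XGBoost-style boosting, each leaf's weight is computed from the sums of gradients G and hessians H of the samples in that leaf, and reg_lambda and reg_alpha shrink that weight toward zero. The sketch below uses the closed-form leaf weight from XGBoost's formulation; the function name and the numbers are made up for illustration, and it ignores learning_rate and other details.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Closed-form leaf weight with L1 (alpha) and L2 (lambda) penalties.

    Illustrative helper, not part of XGBoost's API:
    w = -sign(G) * max(|G| - alpha, 0) / (H + lambda)
    """
    shrunk = max(abs(G) - reg_alpha, 0.0)  # L1: soft-threshold the gradient sum
    sign = 1.0 if G > 0 else -1.0
    return -sign * shrunk / (H + reg_lambda)  # L2: enlarge the denominator


G, H = -4.0, 2.0  # made-up gradient and hessian sums for one leaf

print(leaf_weight(G, H, reg_lambda=0.0))                  # no regularization
print(leaf_weight(G, H, reg_lambda=1.0))                  # default L2 shrinks the weight
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.88))  # adding L1 shrinks it further
```

A leaf that holds only a few noisy samples has small |G| and H, so the penalties dominate and its weight stays near zero, which is the sense in which larger values react less to outliers.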
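- max_depth: default 6. Sets the maximum depth of each tree.
- learning_rate: default 0.3. Sets the shrinkage applied to each tree's contribution.
- n_estimators: default 100. Sets the number of boosting rounds.
  - n_estimators (Optional[int]) – Number of boosting rounds.
- objective: default 'reg:squarederror'. Sets the loss function (XGBClassifier itself defaults to 'binary:logistic').
- booster: default 'gbtree'. Sets which booster to use.
- gamma: default 0. Sets the minimum loss reduction required to make a split.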

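Values used
- n_estimators = 665 (default appears to be 100)
- reg_lambda = 0.04614513317156364 (default 1; larger means more conservative learning)
- reg_alpha = 0.8831857977740336 (default 0; larger means more conservative learning)
- tree_method = "exact" (default auto; 'approx', 'hist', 'gpu_hist' and others also exist)
- colsample_bytree = 0.7664006730032823 (limits the degrees of freedom so the model doesn't get too complex, preventing overfitting; too small a value causes underfitting)
- subsample = 0.6579847353498132 (default 1; 0.5 to 1 is typical)
- learning_rate = 0.4046062291148477 (default 0.3)
- max_depth = 64 (default 6)
- min_child_weight = 2 (default 1; smaller values make overfitting more likely)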

from xgboost import XGBClassifier

rf = RandomForestClassifier(random_state = 42
                         , n_estimators = 305
                         , criterion = 'gini'
                         , max_depth = 62
                         , min_samples_split = 7
                         , min_samples_leaf = 1)
dt = DecisionTreeClassifier(random_state = 42
                         , criterion = 'entropy'
                         , max_depth = 25
                         , min_samples_split = 2
                         , min_samples_leaf = 1)
et = ExtraTreesClassifier(random_state = 42
                         , n_estimators = 930
                         , criterion = 'entropy'
                         , max_depth = 65
                         , min_samples_split = 6
                         , min_samples_leaf = 1
                         )
xgb = XGBClassifier(random_state = 42
                   , n_estimators = 665
                   , reg_lambda = 0.04614513317156364
                   , reg_alpha = 0.8831857977740336
                   , tree_method = "exact"
                   , colsample_bytree = 0.7664006730032823
                   , subsample = 0.6579847353498132
                   , learning_rate = 0.4046062291148477
                   , max_depth = 64
                   , min_child_weight = 2
                   )

Intermediate results

Round 6

Round 7

Results
number of features for model_dt : 3
Features sorted by their rank: [(1, '대출금액'), (1, '총상환원금'), (1, '총상환이자'), (2, '연간소득'), (3, '대출기간'), (4, '부채_대비_소득_비율'), (5, '총계좌수'), (6, '근로기간'), (7, '최근_2년간_연체_횟수'), (8, '주택소유상태_1'), (9, '대출목적_1'), (10, '주택소유상태_2'), (11, '대출목적_2'), (12, '대출목적_0'), (13, '대출목적_3'), (14, '연체계좌수'), (15, '총연체금액'), (16, '주택소유상태_0')]

Optimal number of features for model_rf : 3
Features sorted by their rank: [(1, '대출금액'), (1, '총상환원금'), (1, '총상환이자'), (2, '부채_대비_소득_비율'), (3, '연간소득'), (4, '총계좌수'), (5, '근로기간'), (6, '대출기간'), (7, '최근_2년간_연체_횟수'), (8, '주택소유상태_2'), (9, '대출목적_2'), (10, '주택소유상태_1'), (11, '대출목적_1'), (12, '대출목적_0'), (13, '대출목적_3'), (14, '연체계좌수'), (15, '총연체금액'), (16, '주택소유상태_0')]

Optimal number of features for model_et : 3
Features sorted by their rank: [(1, '대출금액'), (1, '총상환원금'), (1, '총상환이자'), (2, '부채_대비_소득_비율'), (3, '연간소득'), (4, '총계좌수'), (5, '근로기간'), (6, '대출기간'), (7, '최근_2년간_연체_횟수'), (8, '대출목적_2'), (9, '주택소유상태_2'), (10, '대출목적_1'), (11, '주택소유상태_1'), (12, '대출목적_0'), (13, '대출목적_3'), (14, '연체계좌수'), (15, '총연체금액'), (16, '주택소유상태_0')]

Optimal number of features for model_xgb : 4
Features sorted by their rank: [(1, '대출금액'), (1, '대출기간'), (1, '총상환원금'), (1, '총상환이자'), (2, '대출목적_0'), (3, '총연체금액'), (4, '대출목적_2'), (5, '최근_2년간_연체_횟수'), (6, '대출목적_3'), (7, '대출목적_1'), (8, '연간소득'), (9, '부채_대비_소득_비율'), (10, '총계좌수'), (11, '연체계좌수'), (12, '주택소유상태_1'), (13, '근로기간'), (14, '주택소유상태_2'), (15, '주택소유상태_0')]

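My guess is that the rank-1 features from the xgb run are the ones to focus on: (1, '대출금액') (loan amount), (1, '대출기간') (loan term), (1, '총상환원금') (total principal repaid), (1, '총상환이자') (total interest repaid)

For xgb

The model trained at this point and its test-set predictions are saved to file (the score came out much lower, so this one probably can't be used)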
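The "Optimal number of features" and "Features sorted by their rank" output above looks like the result of recursive feature elimination with cross-validation. A minimal sketch of how such output can be produced, assuming scikit-learn's RFECV (the notes only say "RFE"), synthetic data in place of the loan dataset, and placeholder feature names:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]

selector = RFECV(
    estimator=DecisionTreeClassifier(max_depth=25, random_state=42),
    step=1,              # drop one feature per elimination round
    cv=5,
    scoring="f1_macro",  # assumed metric, matching the scores reported later
)
selector.fit(X, y)

# Rank 1 marks the selected features; higher ranks were eliminated earlier.
ranked = sorted((int(rank), name) for rank, name in zip(selector.ranking_, feature_names))
print(f"Optimal number of features : {selector.n_features_}")
print(f"Features sorted by their rank: {ranked}")
```

All selected features share rank 1, which is why several entries in the lists above are tied at the top.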

Round 8

Converted 대출기간 (loan term) to a numeric value, then scored with grid search

Results

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.7694097881771752 Best Accuracy Score: 0.834115426892591 Best Score: 0.7694097881771752

Model: model_rf Best Parameters: {'max_depth': 62, 'n_estimators': 300} Best F1-macro Score: 0.6539144737968758 Best Accuracy Score: 0.7925736773897338 Best Score: 0.6539144737968758

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.5378356178879458 Best Accuracy Score: 0.6736337472329164 Best Score: 0.5378356178879458

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 665} Best F1-macro Score: 0.8002781251216975 Best Accuracy Score: 0.8610783203038469 Best Score: 0.8002781251216975
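In every result above, F1-macro is noticeably lower than accuracy. That gap is what you would expect when some classes are predicted much worse than others, because F1-macro averages the per-class F1 scores without weighting by class size. A minimal pure-Python sketch with made-up labels (not the project's data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: class "A" dominates and the rare class "G" is always missed.
y_true = ["A"] * 8 + ["G"] * 2
y_pred = ["A"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"f1_macro: {f1_macro(y_true, y_pred):.3f}")
```

Here accuracy is 0.8 while F1-macro is about 0.444, since the missed class contributes an F1 of 0 to the average.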

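I recorded the parameters for each model, and plotted feature importance per model (dt excluded because it wasn't possible) and attached the images

Hmm, there is too much from here on, so I'll just briefly note what changed and the result for each round

The RFE runs were only to see how feature importance changes and which variables come out, so for those I'll only note briefly what was done

(Which subset of variables scored highest is probably meaningful too, but since the outcome is similar to grid search, I'll treat a single grid run as a substitute and omit it)

Round 9

RFE with derived variables added

Round 10

Grid search with derived variables added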

Scores and parameters

Model: model_rf Best Parameters: {'max_depth': 42, 'n_estimators': 300} Best F1-macro Score: 0.9163888877500093 Best Accuracy Score: 0.9377031321853584 Best Score: 0.9163888877500093

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 50} Best F1-macro Score: 0.9167744971849624 Best Accuracy Score: 0.9462936461801498 Best Score: 0.9167744971849624

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.8003327594683436 Best Accuracy Score: 0.8723128842995683 Best Score: 0.8003327594683436

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.9227966874786953 Best Accuracy Score: 0.9440780565506529 Best Score: 0.9227966874786953

Round 11

RFE using only ['연간소득', '부채_대비_소득_비율','총연체금액', '대출기간', '대출금액_대비_총상환원금_비율', '대출금액_대비_총상환이자_비율', '총상환원금/총상환이자'] as features
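The three derived variables in the list above are ratios built from 대출금액, 총상환원금, and 총상환이자. The notes don't show how they were computed, so the sketch below is an assumption based only on the column names: each "A_대비_B_비율" is taken to be B divided by A, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the notes.
df = pd.DataFrame({
    "대출금액": [1000.0, 2000.0, 1500.0],   # loan amount
    "총상환원금": [100.0, 500.0, 0.0],      # total principal repaid
    "총상환이자": [50.0, 100.0, 0.0],       # total interest repaid
})

# Assumed definitions: repaid amount relative to the loan amount.
df["대출금액_대비_총상환원금_비율"] = df["총상환원금"] / df["대출금액"]
df["대출금액_대비_총상환이자_비율"] = df["총상환이자"] / df["대출금액"]

# Guard against division by zero when no interest has been repaid yet:
# a zero denominator becomes NaN and is then filled with 0.
interest = df["총상환이자"].replace(0, np.nan)
df["총상환원금/총상환이자"] = (df["총상환원금"] / interest).fillna(0.0)

print(df)
```

Filling the undefined ratio with 0 is just one option; leaving it as NaN or adding a separate indicator column would also be reasonable, depending on how the model handles missing values.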

Round 12

Grid search with the same setup as Round 11

Scores and parameters

Model: model_rf Best Parameters: {'max_depth': 42, 'n_estimators': 300} Best F1-macro Score: 0.9321413879043347 Best Accuracy Score: 0.9472132983422256 Best Score: 0.9321413879043347

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 665} Best F1-macro Score: 0.9150310016099621 Best Accuracy Score: 0.9434405589809323 Best Score: 0.9150310016099621

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.9128942571188599 Best Accuracy Score: 0.9305966161249929 Best Score: 0.9128942571188599

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.9299382786659827 Best Accuracy Score: 0.9456247613304998 Best Score: 0.9299382786659827

Round 13

Dropped 총연체금액 (total overdue amount) and simply reran. (It just occurred to me that I had set cv to a plain 5 everywhere; it would probably be better to define kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) and pass cv=kf.) (As I recall, this run used grid search.)

Scores and parameters

Model: model_rf Best Parameters: {'max_depth': 42, 'n_estimators': 300} Best F1-macro Score: 0.9336243620283359 Best Accuracy Score: 0.9484360239430168 Best Score: 0.9336243620283359

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 50} Best F1-macro Score: 0.9152215547113555 Best Accuracy Score: 0.9436182176366369 Best Score: 0.9152215547113555

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.9172012120149905 Best Accuracy Score: 0.9342230256795044 Best Score: 0.9172012120149905

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.9311260319939804 Best Accuracy Score: 0.9453530495057576 Best Score: 0.9311260319939804

Round 14

As mentioned in Round 13, I additionally set kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42), passed cv=kf, and reran (grid search)

Scores and parameters

Model: model_rf Best Parameters: {'max_depth': 42, 'n_estimators': 300} Best F1-macro Score: 0.934750012140886 Best Accuracy Score: 0.9484778490758015 Best Score: 0.934750012140886

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 665} Best F1-macro Score: 0.9148440563680573 Best Accuracy Score: 0.9424373171686302 Best Score: 0.9148440563680573

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.9156159277363937 Best Accuracy Score: 0.934066282593655 Best Score: 0.9156159277363937

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.9288097942559841 Best Accuracy Score: 0.9445065671618302 Best Score: 0.9288097942559841
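The cv=kf setup from Round 14 can be sketched as follows. This is a hedged reconstruction, not the original script: the data is synthetic, the parameter grid is made up, and the two-metric scoring with refit="f1_macro" is an assumption chosen because it reproduces the "Best F1-macro / Best Accuracy / Best Score" layout of the logs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real loan dataset is not used here.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Shuffled, seeded folds that preserve the class ratio in each split.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [5, 16, 25]},  # made-up grid
    scoring={"f1_macro": "f1_macro", "accuracy": "accuracy"},
    refit="f1_macro",  # the metric that decides which parameters are "best"
    cv=kf,
)
grid.fit(X, y)

best = grid.best_index_
print("Best Parameters:", grid.best_params_)
print("Best F1-macro Score:", grid.cv_results_["mean_test_f1_macro"][best])
print("Best Accuracy Score:", grid.cv_results_["mean_test_accuracy"][best])
print("Best Score:", grid.best_score_)
```

With an integer cv and a classifier, scikit-learn already uses stratified folds, so the explicit kf mainly adds shuffling with a fixed seed. That would be consistent with the Round 13 and Round 14 scores being so close.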

Round 15

Boldly cut the features down to just four: ['대출기간', '대출금액_대비_총상환원금_비율', '대출금액_대비_총상환이자_비율', '총상환원금/총상환이자']

Scores and parameters

Model: model_rf Best Parameters: {'max_depth': 42, 'n_estimators': 50} Best F1-macro Score: 0.9469583286184525 Best Accuracy Score: 0.9526163403248544 Best Score: 0.9469583286184525

Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 50} Best F1-macro Score: 0.9268340709669355 Best Accuracy Score: 0.9477358315417268 Best Score: 0.9268340709669355

Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.9431641837294844 Best Accuracy Score: 0.9511323341991662 Best Score: 0.9431641837294844

Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.9420897058795743 Best Accuracy Score: 0.9500141056271938 Best Score: 0.9420897058795743

Round 16 (the file ultimately used for the presentation. I tried further hyperparameter tuning afterward, but it made little difference, and I judged there wouldn't be enough time to write it all up again, hence this decision)

Used random search for a first-pass hyperparameter selection → I set the number of iterations to only 10, which is too few; it would be worth trying with more

Scores and parameters

Model: model_rf Best Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'max_depth': 59, 'criterion': 'gini'} Best F1-macro Score: 0.9470459108647772 Best Accuracy Score: 0.9525954373149352 Best Score: 0.9470459108647772

Model: model_xgb Best Parameters: {'tree_method': 'exact', 'subsample': 1.0, 'n_estimators': 50, 'min_child_weight': 3, 'max_depth': 67, 'learning_rate': 0.3, 'colsample_bytree': 0.6779661016949152} Best F1-macro Score: 0.9491100497689358 Best Accuracy Score: 0.9526790488085283 Best Score: 0.9491100497689358

Model: model_et Best Parameters: {'n_estimators': 600, 'min_samples_split': 5, 'max_depth': 59, 'criterion': 'gini'} Best F1-macro Score: 0.9443827072187675 Best Accuracy Score: 0.9516339714878423 Best Score: 0.9443827072187675

Model: model_dt Best Parameters: {'max_depth': 16, 'criterion': 'entropy'} Best F1-macro Score: 0.9428722920385514 Best Accuracy Score: 0.9506202475898672 Best Score: 0.9428722920385514

Feature importance
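The random search step of Round 16 could look roughly like the sketch below, where n_iter is the "number of iterations" that was set to 10. The search space, the synthetic data, and the small tree counts are assumptions chosen so the example runs quickly; they are not the values from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data; the real loan dataset is not used here.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Made-up search space, kept small for speed.
param_distributions = {
    "n_estimators": [10, 30, 50],
    "max_depth": list(range(10, 70)),
    "min_samples_split": [2, 5, 7],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,  # number of sampled parameter combinations
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print("Best Score:", search.best_score_)
```

This space has 3 × 60 × 3 × 2 = 1080 combinations, so 10 draws cover under 1% of it. Raising n_iter widens the coverage at a proportional cost in training time.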

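Plan:

~~What I'm doing now: try converting 대출기간 to a number~~

Add derived variables and select features with RFE → decide based on the scores

Add derived variables and check with grid search

With the derived variables included, check RFE and feature importance, drop the unimportant features, and tune hyperparameters again with grid and random search using only the important ones

Finally, stack the models using the hyperparameters found above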
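The final stacking step in the plan could be sketched as below. The notes don't say how stacking was actually set up, so everything structural here is an assumption: scikit-learn's StackingClassifier, a LogisticRegression meta-learner, synthetic data, and only the three scikit-learn models (xgb is left out to keep the example dependency-free). The base-model settings loosely follow the Round 16 results, with tree counts reduced for speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real loan dataset is not used here.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Base models with tuned-style settings (n_estimators reduced for speed).
estimators = [
    ("dt", DecisionTreeClassifier(max_depth=16, criterion="entropy", random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, min_samples_split=5,
                                  max_depth=59, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=50, min_samples_split=5,
                                max_depth=59, random_state=42)),
]

# The meta-learner is trained on out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),  # assumed meta-learner
    cv=kf,
)

scores = cross_val_score(stack, X, y, cv=kf, scoring="f1_macro")
print(f"F1-macro per fold: {scores}")
print(f"mean F1-macro: {scores.mean():.4f}")
```

Scoring the stack with the same F1-macro cross-validation as the individual models makes it directly comparable with the per-model results in the rounds above.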
