  • 2024-02-15

    Content to be written up later because of the project

    ->

    (added below)


    Additional ideas

     

    There seem to be several kinds of scaling, so if time allows, look up and try other scalers.

    If the encodings can be checked too, compare them as well.

    So far I've looked at the grade distribution within each variable's categories; I should also check the reverse, the distribution of each variable within each grade.

    • Derived variables

    Voting: from sklearn.ensemble import VotingClassifier
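    A minimal sketch of combining the project's models with a voting ensemble; the synthetic data and the parameter choices are placeholders, not the project's actual setup:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # Placeholder data standing in for the project's train set
    X, y = make_classification(n_samples=1000, n_classes=7, n_informative=8, random_state=0)

    # 'soft' voting averages the predicted class probabilities across models
    voting = VotingClassifier(
        estimators=[
            ('dt', DecisionTreeClassifier(max_depth=7)),
            ('rf', RandomForestClassifier(n_estimators=100)),
            ('xgb', XGBClassifier(n_estimators=200, max_depth=7)),
        ],
        voting='soft',
    )
    voting.fit(X, y)
    ```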

    • Saving the trained model to a file: ~~from sklearn.externals import joblib~~ import joblib
    • joblib.dump(clf, 'model.pkl', compress=3)
    • import sys
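    For reference, a minimal save/load round trip with the standalone joblib package (the old sklearn.externals.joblib path was removed from scikit-learn); the iris data here is just a stand-in:

    ```python
    import joblib
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier().fit(X, y)

    joblib.dump(clf, 'model.pkl', compress=3)   # compress=3: moderate compression level
    restored = joblib.load('model.pkl')         # reload for later predictions
    assert (restored.predict(X) == clf.predict(X)).all()
    ```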

    Make a custom scoring function so intermediate progress can be checked.
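    A minimal sketch of one way to do this, assuming f1_macro as the metric (the actual implementation isn't recorded here; the timestamped "Score:" lines in the Round 5 log below look like this kind of output):

    ```python
    from datetime import datetime
    from sklearn.metrics import f1_score, make_scorer

    def f1_macro_logged(y_true, y_pred):
        # Print a timestamped score so long CV runs / grid searches show progress
        score = f1_score(y_true, y_pred, average='macro')
        print(f"{datetime.now():%Y-%m-%d %H:%M:%S} Score: {score}")
        return score

    # Pass anywhere a scoring= argument is accepted (cross_validate, GridSearchCV, ...)
    progress_scorer = make_scorer(f1_macro_logged)
    ```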

    Track progress with tqdm: pip install tqdm, then from tqdm import tqdm
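    Usage is just wrapping any iterable; a toy sketch where time.sleep stands in for model fitting:

    ```python
    import time
    from tqdm import tqdm  # pip install tqdm

    # tqdm prints a live progress bar as the loop advances
    for name in tqdm(['dt', 'rf', 'et', 'xgb'], desc='fitting models'):
        time.sleep(0.5)  # stand-in for model.fit(X, y)
    ```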

    Turn heatmap plotting into a function (so it can easily be redrawn when derived variables are added).
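    A minimal sketch of such a helper, assuming a pandas DataFrame and a correlation heatmap (the exact plot isn't specified here):

    ```python
    import matplotlib.pyplot as plt
    import seaborn as sns

    def plot_corr_heatmap(df, title='correlation heatmap'):
        # Recompute and redraw so newly added derived variables show up
        corr = df.select_dtypes('number').corr()
        plt.figure(figsize=(10, 8))
        sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True)
        plt.title(title)
        plt.tight_layout()
        plt.show()
    ```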

     


    Code patch notes

     

    From around this day, in the rush to move quickly, I stopped recording how the code was being modified.

    (The notes I did keep didn't seem important enough to include, so I'll omit them. The z-score-based outlier removal does matter, but since it will be covered in the project write-up later, I judged it wasn't critical to detail at this point.)
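    Since it comes up again below, here is a minimal sketch of z-score outlier removal; the columns and the 3.0 cutoff are illustrative, not the values actually used:

    ```python
    import pandas as pd

    def drop_zscore_outliers(df: pd.DataFrame, cols, threshold=3.0) -> pd.DataFrame:
        # Keep rows where every column in `cols` is within `threshold` std devs of its mean
        z = (df[cols] - df[cols].mean()) / df[cols].std()
        return df[(z.abs() < threshold).all(axis=1)]

    # Hypothetical usage on this project's numeric columns:
    # train = drop_zscore_outliers(train, ['연간소득', '부채_대비_소득_비율'])
    ```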

     


    Default hyperparameters for each model (a grid-search sketch follows the list)

     

    sklearn.model_selection.GridSearchCV

    sklearn.model_selection.RandomizedSearchCV

    • dt
    • sklearn.tree.DecisionTreeClassifier
      • max_depth: int, default=None
        • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
    • i.e., with max_depth=None the tree keeps growing until every leaf is pure
    • from sklearn.tree import DecisionTreeClassifier
    • rf
    • sklearn.ensemble.RandomForestClassifier
      • n_estimators: int, default=100
      • The number of trees in the forest.
      • max_depth: int, default=None
        • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
    • from sklearn.ensemble import RandomForestClassifier
    • et
    • sklearn.ensemble.ExtraTreesClassifier
      • n_estimators: int, default=100
      • The number of trees in the forest.
      • max_depth: int, default=None
        • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
    • from sklearn.ensemble import ExtraTreesClassifier
    • xgb
    • Python API Reference — xgboost 2.1.0-dev documentation
      • max_depth (Optional[int]) – Maximum tree depth for base learners.
    • n_estimators (Optional[int]) – Number of boosting rounds.
    • from xgboost import XGBClassifier
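    A minimal sketch of the kind of search that produces the "Model: ... Best Parameters: ..." lines below, using the Round 2 grid (max_depth 3/5/7, n_estimators 50/100/200); synthetic data stands in for the project's features, and the real preprocessing pipeline is omitted:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_classes=7, n_informative=8, random_state=0)

    models = {
        'model_dt':  (DecisionTreeClassifier(),  {'max_depth': [3, 5, 7]}),
        'model_rf':  (RandomForestClassifier(),  {'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}),
        'model_et':  (ExtraTreesClassifier(),    {'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}),
        'model_xgb': (XGBClassifier(),           {'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}),
    }

    for name, (est, grid) in models.items():
        search = GridSearchCV(est, grid, scoring='f1_macro', cv=5, n_jobs=-1)
        search.fit(X, y)
        print(f"Model: {name} Best Parameters: {search.best_params_} "
              f"Best F1-macro Score: {search.best_score_}")
    ```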

     


    Intermediate results

     

    (Some parts were too long, so pieces of the output were dropped.)
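    The per-model dicts below have the key layout of sklearn.model_selection.cross_validate with two metrics and return_train_score=True; assuming that's what was used, a minimal sketch of how one such dict is produced (synthetic data, one model shown):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_validate
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_classes=7, n_informative=8, random_state=0)

    result = cross_validate(
        DecisionTreeClassifier(), X, y, cv=5,
        scoring=['accuracy', 'f1_macro'],
        return_train_score=True,   # produces the train_accuracy / train_f1_macro keys
    )
    print({'model_dt': result})    # fit_time, score_time, test_*, train_* arrays per fold
    ```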

     

    • Round 1: compared the models' accuracy and f1_macro scores → dt, rf, xgb, lgb, et. Scaling: sc_col = ['연간소득', '부채_대비_소득_비율'], the rest mm (presumably standard vs. min-max scaling); categorical: 대출목적, 주택소유상태
      • Results:
      • 'model_dt': {'fit_time': array([0.66260242, 0.67761564, 0.70664191, 0.66460276, 0.67299581]), 'score_time': array([0.01201081, 0.01201034, 0.01100993, 0.01201081, 0.01201224]), 'test_accuracy': array([0.83603302, 0.83927265, 0.83968229, 0.83550191, 0.84307885]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.77276985, 0.77618444, 0.76610055, 0.7559306 , 0.77350753]), 'train_f1_macro': array([1., 1., 1., 1., 1.])},
      • 'model_knn': {'fit_time': array([0.01501346, 0.01501393, 0.01401353, 0.01501298, 0.01401305]), 'score_time': array([1.42609739, 1.42329311, 1.39426684, 1.34873176, 1.37875271]), 'test_accuracy': array([0.35270143, 0.35390323, 0.34728536, 0.34927105, 0.35130898]), 'train_accuracy': array([0.55118943, 0.54659107, 0.55038537, 0.54952319, 0.55028086]), 'test_f1_macro': array([0.23419952, 0.24146593, 0.23105777, 0.23069666, 0.2252951 ]), 'train_f1_macro': array([0.40529019, 0.39838383, 0.40772596, 0.40435698, 0.4075273 ])},
      • 'model_xgb': {'fit_time': array([1.17260599, 1.24229264, 1.23724437, 1.15693688, 1.2334466 ]), 'score_time': array([0.05905128, 0.04659295, 0.04604173, 0.05005455, 0.04704237]), 'test_accuracy': array([0.83378618, 0.83133034, 0.83184407, 0.83618122, 0.83821916]), 'train_accuracy': array([0.88922128, 0.88961319, 0.88757675, 0.8901241 , 0.88974526]), 'test_f1_macro': array([0.77543692, 0.77986058, 0.77048389, 0.76905818, 0.78055823]), 'train_f1_macro': array([0.90812768, 0.90799494, 0.90740538, 0.90669811, 0.90768593])},
      • 'model_et': {'fit_time': array([9.43380547, 9.42943907, 9.37981987, 9.40974283, 9.49166727]), 'score_time': array([0.57652903, 0.57800269, 0.59411597, 0.57307863, 0.58021045]), 'test_accuracy': array([0.68758491, 0.69014526, 0.68328369, 0.68265663, 0.6882479 ]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.5710648 , 0.58411625, 0.56415031, 0.55996858, 0.55780156]), 'train_f1_macro': array([1., 1., 1., 1., 1.])}}
      • 'model_lgb': {'fit_time': array([1.01751494, 0.99526429, 0.98273134, 0.99707556, 1.02645135]), 'score_time': array([0.13913178, 0.14463401, 0.14013529, 0.14613295, 0.14415145]), 'test_accuracy': array([0.82516459, 0.82349253, 0.82275174, 0.82604379, 0.82944035]), 'train_accuracy': array([0.85880939, 0.86040314, 0.85851078, 0.85951666, 0.86036577]), 'test_f1_macro': array([0.76025329, 0.76941381, 0.76254294, 0.7453647 , 0.77205294]), 'train_f1_macro': array([0.88099242, 0.88296013, 0.88113644, 0.88274915, 0.88222023])},
      • 'model_gbm': {'fit_time': array([81.38249898, 80.86995912, 80.98467851, 81.07965398, 81.31905246]), 'score_time': array([0.1411283 , 0.14513254, 0.14203811, 0.14163327, 0.14203835]), 'test_accuracy': array([0.71120284, 0.70759745, 0.71118775, 0.70998589, 0.71160579]), 'train_accuracy': array([0.72262211, 0.7242028 , 0.72519922, 0.72305683, 0.72314827]), 'test_f1_macro': array([0.64601324, 0.66192797, 0.65922547, 0.66383058, 0.6660106 ]), 'train_f1_macro': array([0.72160171, 0.71705946, 0.72559226, 0.72397636, 0.72279175])},
      • 'model_rf': {'fit_time': array([17.61120749, 17.63870502, 17.9050355 , 17.63789248, 18.32859588]), 'score_time': array([0.42438555, 0.43223834, 0.43039751, 0.44506025, 0.43639684]), 'test_accuracy': array([0.78153412, 0.78284042, 0.77723781, 0.77676752, 0.7803731 ]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.6583028 , 0.66863208, 0.66285725, 0.64895749, 0.65537035]), 'train_f1_macro': array([1., 1., 1., 1., 1.])},
      • {'model_lor': {'fit_time': array([1.38506365, 1.33310032, 1.28650403, 1.3001821 , 1.28016329]), 'score_time': array([0.00900769, 0.00800776, 0.00900793, 0.00900817, 0.00800753]), 'test_accuracy': array([0.54080886, 0.53709897, 0.53796311, 0.53738831, 0.54041908]), 'train_accuracy': array([0.53557852, 0.53681955, 0.54283475, 0.5407838 , 0.53712606]), 'test_f1_macro': array([0.34887846, 0.33862998, 0.34855732, 0.34131534, 0.34558651]), 'train_f1_macro': array([0.34313179, 0.34238225, 0.3483373 , 0.34593599, 0.34224702])},
    • be_cols = ['대출목적','주택소유상태'], oe_cols = ['대출기간', '근로기간'] (binary- and ordinal-encoded columns, presumably)
    • target_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
    • Just removed outliers with the z-score and otherwise left things largely untouched
    • Round 2 → recorded as Round 1 in the results log
      • 1)
      • Model: model_rf Best Parameters: {'max_depth': 7, 'n_estimators': 50} Best F1-macro Score: 0.32920444519060793 Best Accuracy Score: 0.5361751755672899 Best Score: 0.32920444519060793
      • Model: model_xgb Best Parameters: {'max_depth': 7, 'n_estimators': 200} Best F1-macro Score: 0.7938732862998641 Best Accuracy Score: 0.856448623065963 Best Score: 0.7938732862998641
      • Model: model_et Best Parameters: {'max_depth': 7, 'n_estimators': 50} Best F1-macro Score: 0.17016087102660077 Best Accuracy Score: 0.38641615404789836 Best Score: 0.17016087102660077
      • Model: model_dt Best Parameters: {'max_depth': 7} Best F1-macro Score: 0.5751310415573679 Best Accuracy Score: 0.6367217779727371 Best Score: 0.5751310415573679
      • 2)
      • {'model_dt': {'best_params': {'max_depth': 7}, 'best_f1_score': 0.5755512911533535, 'best_accuracy': 0.6367322289316126, 'best_scores': 0.5755512911533535}, 'model_rf': {'best_params': {'max_depth': 7, 'n_estimators': 100}, 'best_f1_score': 0.32874450898130936, 'best_accuracy': 0.5378054945711506, 'best_scores': 0.32874450898130936}, 'model_et': {'best_params': {'max_depth': 7, 'n_estimators': 50}, 'best_f1_score': 0.17182739237333444, 'best_accuracy': 0.3866355968800748, 'best_scores': 0.17182739237333444}, 'model_xgb': {'best_params': {'max_depth': 7, 'n_estimators': 200}, 'best_f1_score': 0.7938732862998641, 'best_accuracy': 0.856448623065963, 'best_scores': 0.7938732862998641}}
    • Grid: max_depth: 3, 5, 7 / n_estimators: 50, 100, 200
    • Round 3 — feature ranking (see the sketch after this list)
      • Model: model_dt Optimal number of features for model_dt : 3 Features sorted by their rank: [(14, '주택소유상태_0'), (13, '연체계좌수'), (12, '총연체금액'), (11, '대출목적_3'), (10, '대출목적_0'), (9, '대출목적_1'), (8, '주택소유상태_1'), (7, '대출목적_2'), (6, '주택소유상태_2'), (5, '최근_2년간_연체_횟수'), (4, '총계좌수'), (3, '연간소득'), (2, '부채_대비_소득_비율'), (1, '총상환이자'), (1, '총상환원금'), (1, '대출금액')]
      • Model: model_rf Optimal number of features for model_rf : 3 Features sorted by their rank: [(14, '주택소유상태_0'), (13, '총연체금액'), (12, '연체계좌수'), (11, '대출목적_0'), (10, '대출목적_3'), (9, '대출목적_1'), (8, '대출목적_2'), (7, '주택소유상태_1'), (6, '주택소유상태_2'), (5, '최근_2년간_연체_횟수'), (4, '총계좌수'), (3, '연간소득'), (2, '부채_대비_소득_비율'), (1, '총상환이자'), (1, '총상환원금'), (1, '대출금액')]
      • Model: model_et
    • Round 4
    • The simple version redone (15 232137)
      • Data
      • {'model_dt': {'fit_time': array([0.70492125, 0.68269444, 0.68830848, 0.69556308, 0.70111775]), 'score_time': array([0.01301217, 0.01201129, 0.01201129, 0.01301169, 0.01301122]), 'test_accuracy': array([0.83603302, 0.83927265, 0.83968229, 0.83550191, 0.84307885]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.77276985, 0.77618444, 0.76610055, 0.7559306 , 0.77350753]), 'train_f1_macro': array([1., 1., 1., 1., 1.])}, 'model_rf': {'fit_time': array([18.67404246, 18.45873594, 18.94830513, 18.87328792, 18.79281425]), 'score_time': array([0.48928714, 0.50013638, 0.50470519, 0.47235608, 0.5300293 ]), 'test_accuracy': array([0.78153412, 0.78284042, 0.77723781, 0.77676752, 0.7803731 ]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.6583028 , 0.66863208, 0.66285725, 0.64895749, 0.65537035]), 'train_f1_macro': array([1., 1., 1., 1., 1.])}, 'model_xgb': {'fit_time': array([1.78613806, 1.3886559 , 1.43333077, 1.39674878, 1.34752297]), 'score_time': array([0.05706191, 0.05404925, 0.04704309, 0.05455208, 0.04804325]), 'test_accuracy': array([0.83378618, 0.83133034, 0.83184407, 0.83618122, 0.83821916]), 'train_accuracy': array([0.88922128, 0.88961319, 0.88757675, 0.8901241 , 0.88974526]), 'test_f1_macro': array([0.77543692, 0.77986058, 0.77048389, 0.76905818, 0.78055823]), 'train_f1_macro': array([0.90812768, 0.90799494, 0.90740538, 0.90669811, 0.90768593])}, 'model_et': {'fit_time': array([10.67007947, 10.74230504, 10.38943934, 10.36898971, 10.39428568]), 'score_time': array([0.88480425, 0.6525929 , 0.65832424, 0.64709187, 0.69310451]), 'test_accuracy': array([0.68758491, 0.69014526, 0.68328369, 0.68265663, 0.6882479 ]), 'train_accuracy': array([1., 1., 1., 1., 1.]), 'test_f1_macro': array([0.5710648 , 0.58411625, 0.56415031, 0.55996858, 0.55780156]), 'train_f1_macro': array([1., 1., 1., 1., 1.])}}
      compressed at level 3 (joblib compress=3, presumably)
    • Submitted only dt and xgb
    • Round 5
    • Ran only the grid search again
      • Data:
      • Model: model_rf Best Parameters: {'max_depth': 62, 'n_estimators': 300} Best F1-macro Score: 0.6665468163184007 Best Accuracy Score: 0.7860942079402135 Best Score: 0.6665468163184007
      • Model: model_xgb Best Parameters: {'max_depth': 64, 'n_estimators': 50} Best F1-macro Score: 0.8096816271500844 Best Accuracy Score: 0.8631893746786329 Best Score: 0.8096816271500844
      • Model: model_et Best Parameters: {'max_depth': 65, 'n_estimators': 930} Best F1-macro Score: 0.5802458041513825 Best Accuracy Score: 0.6981930435178354 Best Score: 0.5802458041513825
      • Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.7711546435003871 Best Accuracy Score: 0.8403022520189045 Best Score: 0.7711546435003871
      • Full output window (summary lines as above; progress log below):
      • running model_dt: 2024-02-15 23:32:35 Score: 0.7734968967818564 2024-02-15 23:32:35 Score: 0.7768765067610435 2024-02-15 23:32:36 Score: 0.7643739447026154 2024-02-15 23:32:37 Score: 0.7623637109588748 2024-02-15 23:32:38 Score: 0.7786621582975456 2024-02-15 23:32:38 Score: 0.7723861550433266 2024-02-15 23:32:39 Score: 0.7745798475864126 2024-02-15 23:32:40 Score: 0.7650308516625927 2024-02-15 23:32:40 Score: 0.7613123275677983 2024-02-15 23:32:41 Score: 0.7782796992960147
      • running model_rf: 2024-02-15 23:32:52 Score: 0.6467345734811936 2024-02-15 23:33:01 Score: 0.6599997129665837 2024-02-15 23:33:11 Score: 0.6507932368458079 2024-02-15 23:33:20 Score: 0.6496164535310948 2024-02-15 23:33:30 Score: 0.6645225072697792 2024-02-15 23:34:28 Score: 0.661798558372096 2024-02-15 23:35:26 Score: 0.6712752923764389 2024-02-15 23:36:27 Score: 0.6699383851808802 2024-02-15 23:37:25 Score: 0.6551229211458003 2024-02-15 23:38:24 Score: 0.6722412825973408 2024-02-15 23:38:33 Score: 0.6484702651116608 2024-02-15 23:38:43 Score: 0.6636754451083045 2024-02-15 23:38:53 Score: 0.6503778119933712 2024-02-15 23:39:03 Score: 0.6478283586240391 2024-02-15 23:39:13 Score: 0.6675192166500906 2024-02-15 23:40:11 Score: 0.6639971611364844 2024-02-15 23:41:13 Score: 0.6720950887499757 2024-02-15 23:42:12 Score: 0.6697667631251256 2024-02-15 23:43:09 Score: 0.6534160629299078 2024-02-15 23:44:09 Score: 0.6734590056505099
      • running model_et: 2024-02-15 23:45:31 Score: 0.5591487896477373 2024-02-15 23:45:37 Score: 0.5789781954176758 2024-02-15 23:45:43 Score: 0.5669878869071248 2024-02-15 23:45:49 Score: 0.5594979153060298 2024-02-15 23:45:55 Score: 0.5590177882858921 2024-02-15 23:47:44 Score: 0.5796289916518254 2024-02-15 23:49:32 Score: 0.583740127389552 2024-02-15 23:51:18 Score: 0.585692593408991 2024-02-15 23:53:03 Score: 0.5788789888329663 2024-02-15 23:54:47 Score: 0.5732883194735778
      • running model_xgb: 2024-02-15 23:56:55 Score: 0.7974785779644502 2024-02-15 23:57:04 Score: 0.8154100137082023 2024-02-15 23:57:12 Score: 0.807892538981471 2024-02-15 23:57:20 Score: 0.8162352979054269 2024-02-15 23:57:28 Score: 0.811391707190872 2024-02-15 23:57:53 Score: 0.7932996077873283 2024-02-15 23:58:19 Score: 0.8084430584565435 2024-02-15 23:58:44 Score: 0.8046897289356689 2024-02-15 23:59:10 Score: 0.8010032771667751 2024-02-15 23:59:35 Score: 0.8090104107481139
      • Model: model_dt Best Parameters: {'max_depth': 25} Best F1-macro Score: 0.7711546435003871 Best Accuracy Score: 0.8403022520189045 Best Score: 0.7711546435003871
    • Only xgb, compressed at level 2
    • Only xgb's score seemed to improve meaningfully, so I submitted just that one
    •  submission_2024-02-16_002953
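    The Round 3 output above ("Optimal number of features ... Features sorted by their rank") matches recursive feature elimination with cross-validation; a minimal sketch under that assumption, with synthetic features standing in for the project's columns (대출금액, 연간소득, ...):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                               n_classes=7, random_state=0)
    names = [f'feat_{i}' for i in range(16)]   # placeholder feature names

    selector = RFECV(DecisionTreeClassifier(max_depth=7), cv=5, scoring='f1_macro')
    selector.fit(X, y)
    print('Optimal number of features :', selector.n_features_)
    # rank 1 = kept by the selector; higher ranks were eliminated earlier
    print('Features sorted by their rank:',
          sorted(zip(selector.ranking_, names), reverse=True))
    ```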

     

     
