๐˜š๐˜ญ๐˜ฐ๐˜ธ ๐˜ฃ๐˜ถ๐˜ต ๐˜ด๐˜ต๐˜ฆ๐˜ข๐˜ฅ๐˜บ

๋ฐ์ดํ„ฐ๋ถ„์„ ๊ณผ์ œํ…Œ์ŠคํŠธ ๋ฒผ๋ฝ์น˜๊ธฐ๋กœ ์ค€๋น„ํ•˜๊ธฐ ๋ณธ๋ฌธ

๊ธฐํƒ€

๋ฐ์ดํ„ฐ๋ถ„์„ ๊ณผ์ œํ…Œ์ŠคํŠธ ๋ฒผ๋ฝ์น˜๊ธฐ๋กœ ์ค€๋น„ํ•˜๊ธฐ

.23 2024. 10. 28. 17:49

๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ณผ์ œ ํ…Œ์ŠคํŠธ๋ž€?

๐Ÿ’ก ๋ถ„์„ → ์ „์ฒ˜๋ฆฌ → ๋ชจ๋ธ๋ง → ์„ฑ๋Šฅํ‰๊ฐ€

 

> ์ œํ•œ์‹œ๊ฐ„ ๋‚ด ์ผ๋ จ์˜ ๊ณผ์ •์„ ๋ชจ๋‘ ์™„๋ฃŒํ•œ ๋’ค ์™„์„ฑ๋œ ๋ชจ๋ธ์„ ํ†ตํ•ด ์˜ˆ์ธกํ•œ ๊ฐ’์„ ์ œ์ถœํ•˜๋Š” ๊ฒƒ.

 

์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ณผ์ œ๋ผ๋ฉด

 

๋์—†๋Š” EDA ๊ณผ์ •์„ ๊ฒช์œผ๋ฉฐ feature engineering์„ ๊ฑฐ๋“ญํ•˜๊ณ  ๊ฑฐ๋“ญํ•˜๊ณ  .. ๊ฑฐ๋“ญํ•ด์„œ

์ ์ ˆํ•œ ML/DL ๋ชจ๋ธ์„ ๊ณ ๋ฅด๊ณ  ๊ณ ๋ฅด๊ณ  .. ๊ณจ๋ผ์„œ

ํ•™์Šต ๋ชจ๋ธ์˜ parameter tuning๊นŒ์ง€ ์™„๋ฃŒํ•œ ๋’ค

์ตœ์ƒ์˜ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•˜๋Š” ๊ฒƒ์ด ๋งž์ง€๋งŒ,

 

ํ•œ ๋ฌธ์ œ๋‹น ํ•œ ์‹œ๊ฐ„, ๋งŽ์•„์•ผ ๋‘ ์‹œ๊ฐ„ ์ฃผ๋Š” ์‹œํ—˜์—์„œ๋Š” ์‹œํ–‰์ฐฉ์˜ค์˜ ๊ณผ์ •์„ ์ผ์ผํžˆ ๊ฑฐ์น˜๋Š” ๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

 

๋”ฐ๋ผ์„œ ๊ฒฐ๊ณผ๋ฅผ ์ œ์ถœํ•˜๋Š” ๊ฒƒ์ด ์šฐ์„ ์ธ์ง€๋ผ,

๋น ๋ฅด๊ฒŒ ๋ชจ๋ธ๋ง ๋‹จ๊ณ„๋กœ ์ง„์ž…ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ๋Œ€๋น„๋ฅผ ํ•˜๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

 

 

์ผ๋ฐ˜์ ์ธ ๊ธฐ์—… ์‹œํ—˜์—์„œ ๋‚ด๋Š” ๋ฌธ์ œ๋Š”

ํฌ๊ฒŒ DataFrame ํ™œ์šฉ ๋ฌธ์ œ, ์˜ˆ์ธก/ํšŒ๊ท€ ๋ฌธ์ œ ์ด๋ ‡๊ฒŒ ๋‘ ์œ ํ˜•์œผ๋กœ ๋‚˜๋‰œ๋‹ค.

 

Pandas ์–ผ๋งˆ๋‚˜ ์ž˜์“ฐ๋Š”์ง€ ๋ฌผ์–ด๋ณด๋Š” ๋ฌธ์ œ๋Š” ํŒ๋‹ค์Šค ํ•จ์ˆ˜ ์–ผ๋งˆ๋‚˜ ์ž˜ ์•Œ๊ณ ์žˆ๋Š”์ง€ ๋ณด๋Š”๊ฑฐ๋ผ ์‚ฌ์‹ค ํŒ๋‹ค์Šค๋ฅผ ๋งŽ์ด ๋‹ค๋ค„๋ดค์–ด์•ผ ๋น ๋ฅธ ์‹œ๊ฐ„ ๋‚ด ํ‘ธ๋Š”๊ฒŒ ๊ฐ€๋Šฅํ•˜์ง€๋งŒ

๋ถ„๋ฅ˜/ํšŒ๊ท€์˜ ๊ฒฝ์šฐ ์•„์ง ๋ฌธ์ œ๊ฐ€ ํฌ๊ฒŒ ์–ด๋ ค์›Œ์ง€์ง€ ์•Š์€ ์ง€๊ธˆ ์„ ์—์„œ๋Š” ๊ฐ–๋‹ค ์“ฐ๋Š” ์ฝ”๋“œ ๋ช‡๊ฐœ๋งŒ ์™ธ์šฐ๋ฉด ์–ด๋Š์ •๋„ ์ปค๋ฒ„๊ฐ€ ๊ฐ€๋Šฅํ•œ ๊ฒƒ ๊ฐ™๋”๋ผ

 

 

๊ทธ๋ž˜์„œ ์ •๋ฆฌํ•ด๋ณด๋Š” ํ•„์š”ํ•œ ์ฝ”๋“œ ์š”์•ฝ๊ธ€์ด๋‹ค.

 

์‹œํ—˜ ๊ณผ์ • ์š”์•ฝ

โœ”๏ธ ๊ฒฐ๊ณผ๋ฅผ ์ œ์ถœํ•˜๋Š” ๊ฒƒ์ด ์šฐ์„ ์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ„๋‹จํ•œ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ์ดํ›„ ๋ฐ”๋กœ ๋ชจ๋ธ๋ง ์ง„ํ–‰

  • ๋ณ€์ˆ˜ ๋ณ„ ๋ถ„ํฌ ํŒŒ์•…
  • ๊ฒฐ์ธก์น˜ ์กด์žฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ๋ฒ„๋ฆฌ๊ธฐ or ๊ฒฐ์ธก์น˜ ์ฑ„์šฐ๊ธฐ
  • ์ด์ƒ์น˜ ๋ณด์ •
  • Scale ๋ณด์ •/์ •๊ทœํ™”
  • objectํ˜• ๋ณ€์ˆ˜ categoryํ™” ํ•˜๊ธฐ
  • ์ฃผ์˜ํ•  ์ : ์ „์ฒ˜๋ฆฌ ์ง„ํ–‰ ์‹œ test dataset์—๋„ ๋ฐ˜๋“œ์‹œ ๋™์ผํ•˜๊ฒŒ ์ ์šฉํ•ด์ค˜์•ผ๋จ

โœ”๏ธ ๋ชจ๋ธ๋ง ๋ฐ ๊ฐ„๋‹จํ•œ ์„ฑ๋Šฅ ํ‰๊ฐ€

  • cross_validate ํ†ตํ•œ ์ตœ์„ ์˜ ๋ชจ๋ธ ์„ ํƒ
  • ๊ฐ„๋‹จํ•œ parameter tuning

โœ”๏ธ ์„ฑ๋Šฅ ๊ณ ๋„ํ™” ์œ„ํ•œ EDA ์ง„ํ–‰

  • ๋ณ€์ˆ˜ ๋ณ„ ๊ด€๊ณ„ ํŒŒ์•…
  • ๋ฐ์ดํ„ฐ / ๋ฐ์ดํ„ฐ ๊ฐ„ ๊ด€๊ณ„ ์‹œ๊ฐํ™”
  • ๊ฒฐ๊ณผ์— ์˜ํ–ฅ ์ฃผ๋Š” ๋ณ€์ˆ˜ ํ™•์ธ
  • Feature engineering
  • ...

โœ”๏ธ ๋‹ค์‹œ ML ๋ชจ๋ธ๋ง → ๊ฒฐ๊ณผ ์ œ์ถœ


To-do List

1. ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋กœ๋“œ

import warnings

warnings.filterwarnings("ignore")  # silence warnings

import seaborn as sns              # plotting (a bit easier to use)
import numpy as np                 # arrays (rarely used directly, but load it since you never know)
import matplotlib.pyplot as plt    # plotting, take two

# Based on the Programmers exam environment:
# the file-loading code is usually given, so strictly you don't even need to import pandas yourself
import pandas as pd

 

2. ๋ฐ์ดํ„ฐ ๋กœ๋“œ

์ผ๋ฐ˜์ ์œผ๋กœ ์ œ์ผ ์ฒ˜์Œ์— ์‹คํ–‰๋งŒ ํ•ด๋„ ๋˜๋Š” ํ˜•ํƒœ๋กœ ์ฃผ์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์‹ค ์‹œํ—˜๋ณผ๋•Œ ์˜๋ฏธ๋Š” X

train = read_csv("๊ฒฝ๋กœ/train_file.csv")
test = read_csv("๊ฒฝ๋กœ/test_file.csv")

 

3. ๋ฐ์ดํ„ฐ ํ›‘์–ด๋ณด๊ธฐ

train.head()         	# ์ƒ์œ„ 5๊ฐœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ
train.columns.values 	# ๋ฐ์ดํ„ฐ columns array๋กœ ๋ฐ˜ํ™˜
# train.columns.to_list() ์“ฐ๋ฉด ๋ฆฌ์ŠคํŠธํ˜•ํƒœ๋กœ ๋ฐ˜ํ™˜

train.info()         	# ๋ฐ์ดํ„ฐ์˜ column, non-null์ธ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜, column์˜ data type
train.describe()     	# object๊ฐ€ ์•„๋‹Œ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์น˜๊ฐ’ ๋ฐ˜ํ™˜
train.describe(incude="O") # object ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์น˜๊ฐ’(count, unique, top, freq)

 

columns ์ œ์™ธ ๊ฐ๊ฐ ์ฝ”๋“œ ์‹คํ–‰ ๊ฒฐ๊ณผ(bike-sharing-demand.csv ์˜ˆ์‹œ)

 

train.head()         	# ์ƒ์œ„ 5๊ฐœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ

train.info()         	# ๋ฐ์ดํ„ฐ์˜ column, non-null์ธ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜, column์˜ data type

train.describe()     	# object๊ฐ€ ์•„๋‹Œ ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์น˜๊ฐ’ ๋ฐ˜ํ™˜

train.describe(incude="O") # object ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜์น˜๊ฐ’(count, unique, top, freq)

 

4. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

๊ฒฐ์ธก์น˜ / ์ด์ƒ์น˜ ์ฒ˜๋ฆฌ

# ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ๋ฐฉ์‹
train['col1'] = train['col1'].fillna(0)                      # 1. ํŠน์ • ๊ฐ’(0)
train['col2'] = train['col2'].fillna(train['col2'].mean())   # 2. ํ‰๊ท ๊ฐ’
train['col3'] = train['col3'].fillna(train['col3'].median()) # 3. ์ค‘์•™๊ฐ’
train['col4'] = train['col4'].fillna(train['col4'].mode()[0])# 4. ์ตœ๋นˆ๊ฐ’
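The checklist's caution about treating the test set identically bites right here: compute the fill statistic on train once and reuse it, rather than letting test compute its own. A tiny sketch with made-up values (the column name is hypothetical):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"col2": [1.0, np.nan, 3.0]})
test = pd.DataFrame({"col2": [np.nan, 5.0]})

# Compute the statistic on train only, then reuse it for both frames,
# so test never fills from its own distribution
col2_mean = train["col2"].mean()
train["col2"] = train["col2"].fillna(col2_mean)
test["col2"] = test["col2"].fillna(col2_mean)
```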

 

๋ฐ์ดํ„ฐ ๋ณ€ํ™˜

# ๋ฐ์ดํ„ฐ์˜ ํ˜•์‹ ๋ณ€ํ™˜
pd.to_numeric(column, errors='coerce')    # ์ˆซ์ž๋กœ ๋ณ€ํ™˜
pd.to_datetime(column, format='%m/%d/%Y') # m/d/YYYY ํ˜•์‹ -> YYYY-d-m ํ˜•์‹
# objectํ˜• ๋ณ€์ˆ˜ -> categoryํ™”
train.obj_col = train.obj_col.astype('category')

# dictionary ๋งŒ๋“ค์–ด์„œ map ํ•จ์ˆ˜ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ€ํ™˜.. ์ด๊ฑฐ ์‚ฌ์šฉํ• ๋•Œ ์กฐ์‹ฌํ• ๊ฒƒ
# ์ „์— ์‹œํ—˜๋ณผ๋•Œ ์ž˜๋ชปํ•ด์„œ ๊ฐœ๊ณ ์ƒํ–ˆ๋˜ ๊ธฐ์–ต์ด ์žˆ์Œใ… ใ… 
obj_dict = {'col1':1, 'col2':2, ... }
train.obj_col = train.obj_col.map(obj_dict)
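The trap hinted at above, sketched with a made-up column: `map` silently turns any value missing from the dictionary into NaN, so it is worth checking coverage before mapping.

```python
import pandas as pd

s = pd.Series(["red", "blue", "green"])

incomplete = {"red": 1, "blue": 2}    # "green" is missing on purpose
mapped = s.map(incomplete)            # the unmapped "green" silently becomes NaN
n_missing = int(mapped.isna().sum())  # 1 silent NaN, the usual trap

# Guard against the silent NaN: verify the dict covers every category first
complete = {"red": 1, "blue": 2, "green": 3}
assert set(s.unique()) <= set(complete)
safe = s.map(complete)
```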

 

๋ฐ์ดํ„ฐ ์ •๊ทœํ™”

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# ์ฃผ์˜ !! array ํ˜•ํƒœ๋กœ ๋ฐ˜ํ™˜ํ•จ
scaler = StandardScaler
train_scaled = scaler.fit_transform(train)
df_train_scaled = pd.DataFrame(train_scaled, columns=train.columns)

# test๋„ ๋˜‘๊ฐ™์ด ์ง„ํ–‰
test['col'] = scaler.transform(test['col'])

* ์ฃผ์˜! Scaling ์‹œ ํ›ˆ๋ จ ๋ชจ๋ธ์— ๋Œ€ํ•ด์„œ๋Š” fit์„ ์ง„ํ–‰ํ•˜๊ณ , test ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ์ง„ํ–‰ํ•˜์ง€ ์•Š๊ณ  ๋ฐ”๋กœ transform ์ง„ํ–‰ํ• ๊ฒƒ

 

    + ์™ธ์— ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐฉ๋ฒ•๋“ค

        Data Transformation

            - Positive skewed

np.log(df['ํ•ด๋‹น ๋ณ€์ˆ˜'])

                0 ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ์ผ ๊ฒฝ์šฐ  np.log1p  ๋˜๋Š” 1 ๋”ํ•ด์„œ ์˜ค๋ฅ˜ ๋ฐฉ์ง€

 

            - Negative skewed

np.sqrt(df['ํ•ด๋‹น ๋ณ€์ˆ˜'])
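A quick sanity check of the two transforms above, with made-up numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 9.0])     # positively skewed values that include 0
y = np.log1p(x)                   # log(1 + x); safe at 0, where np.log would give -inf

neg = np.array([0.0, 4.0, 16.0])  # negatively skewed values
z = np.sqrt(neg)                  # square root compresses the long left tail
```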

 

5. Plot ํ†ตํ•ด ๋ฐ์ดํ„ฐ / ๋ฐ์ดํ„ฐ ๊ฐ„ ๊ด€๊ณ„ ํ‘œํ˜„ (seaborn)

์ž์ฃผ ์‚ฌ์šฉํ•˜๋Š” plot ์ •๋ฆฌ

 

์—ฐ์†ํ˜• ๋ฐ์ดํ„ฐ ๋ถ„ํฌ: distplot

sns.distplot(data.column, bins=range(min, max), ax)

    - deprecated๋œ ํ•จ์ˆ˜๋ผ ๋œจ๋Š”๋ฐ, kdeplot(์„ ) or histplot(๋ง‰๋Œ€๊ทธ๋ž˜ํ”„) ์‚ฌ์šฉ ๊ฐ€๋Šฅ

    - displot์€ subplots ๋ง๊ณ  ๋‹จ๋… ์‚ฌ์šฉ๋งŒ ๊ฐ€๋Šฅ


+ DataFrame.hist

When picking columns and drawing a histogram for each one by hand would take forever, this is a fine shortcut:

df.hist(figsize=(12, 10))

๋ณ€์ˆ˜๋“ค์˜ ์ด์ƒ์น˜ ๋ถ„ํฌ ํ™•์ธ: boxplot

sns.boxplot(data.column, data, ax)

์ด์‚ฐํ˜• ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ์„ธ๊ธฐ: countplot

sns.countplot(data.column, data)

 

์ด์‚ฐํ˜•-์ด์‚ฐํ˜• ๊ด€๊ณ„ ํ‘œํ˜„: barplot

sns.barplot(x, y, data, ax)

์ด์‚ฐํ˜•-์—ฐ์†ํ˜• / ์ด์‚ฐํ˜•-(๋ฒ”์ฃผ์˜ ๊ฐœ์ˆ˜๊ฐ€๋งŽ์€)์ด์‚ฐํ˜• ๊ด€๊ณ„ ํ‘œํ˜„: lineplot, pointplot

sns.pointplot(x, y, data, ax)
sns.lineplot(x, y, data, ax)


์—ฐ์†ํ˜•-์—ฐ์†ํ˜• ๋ฐ์ดํ„ฐ ๊ด€๊ณ„ ํ‘œํ˜„: scatterplot, regplot

sns.scatterplot(x, y, data, ax)
sns.regplot(x, y, data, ax, line_kws, scatter_kws)

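A minimal runnable sketch of the `ax=` pattern with subplots, using toy data and the non-deprecated countplot/scatterplot:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; handy when no display is available
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame: one discrete column, one continuous column
df = pd.DataFrame({"cat": list("aabbbc"), "val": [1, 2, 2, 3, 4, 5]})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(x="cat", data=df, ax=axes[0])             # one bar per category
sns.scatterplot(x="val", y="val", data=df, ax=axes[1])  # continuous vs. continuous
```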

 

+ df ์ž์ฒด์—๋„ plot ๊ธฐ๋Šฅ์ด ์žˆ์Œ (ex.  df.plot.bar(), df.plot.density(), df.plot.pie() )

df.plot.bar(), df.plot.density(), df.plot.pie()

 

๊ธ‰ํ•˜๊ฒŒ๋Š” ์šฐ์„  ์ €๋ ‡๊ฒŒ๋งŒ ํ•ด๋„ ๊ดœ์ฐฎ, ์‹œ๊ฐ„์ด ๋˜๋ฉด ์—ฌ๋Ÿฌ๋ฒˆ ๋‹ค์–‘ํ•˜๊ฒŒ ๊ทธ๋ฆฌ๋Š” ์—ฐ์Šต ํ•ด๋ณด๊ธฐ

์ƒ๊ด€๊ด€๊ณ„ plot - heatmap (๊พธ๋ฏธ๋Š” ์ฝ”๋“œ๋Š” ๊ทธ๋ƒฅ ์™ธ์›Œ์•ผ๋จ..๋ณ„์ˆ˜์—†์Œ..)

corr = train.corr(numeric_only=True)  # pairwise correlations between the numeric variables

mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle (np.bool is removed; use plain bool)
cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig, ax = plt.subplots(figsize=(20, 15))

sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=.5, annot=True, fmt='.3f')

 

6. Model selection with cross validation

# regression example
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_validate, KFold
# for classification problems, StratifiedKFold works too

rf = RandomForestRegressor()
score = cross_validate(rf, X_train, Y_train, return_train_score=True, n_jobs=-1, cv=KFold())
print(np.mean(score['train_score']), np.mean(score['test_score']))

gb = GradientBoostingRegressor()
score = cross_validate(gb, X_train, Y_train, return_train_score=True, n_jobs=-1, cv=KFold())
print(np.mean(score['train_score']), np.mean(score['test_score']))
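For the classification case, the same pattern with StratifiedKFold looks like this; synthetic data stands in for a real exam's training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold

# Synthetic stand-in for an exam's training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(random_state=0)
# StratifiedKFold keeps the class ratio roughly equal in every fold
score = cross_validate(clf, X, y, cv=StratifiedKFold(),
                       return_train_score=True, n_jobs=-1)
print(np.mean(score['train_score']), np.mean(score['test_score']))
```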

 

๋ชจ๋ธ์„ ์ €๋ ‡๊ฒŒ ๋‘๊ฐœ๋งŒ ๋ถ€๋ฅธ ์ด์œ ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋‘๊ฐœ์˜ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์•„์„œ...

 

7. ์„ฑ๋Šฅ ๋” ์ข‹์€ ๊ฒƒ ์„ ํƒํ•ด์„œ grid search ์ง„ํ–‰

from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# ์ผ๋ฐ˜์ ์œผ๋กœ rf๊ฐ€ ์ œ์ผ ์ข‹๋”๋ผ..
rf = RandomForestRegressor()
rf_params = { 'max_depth':range(2, 10, 2), 'n_estimators':[10, 100, 200] }
# ์ž˜ ๋ชจ๋ฅด๊ฒ ๋Š” parameter๋“ค์€ help ์จ์„œ ์ฐธ๊ณ ํ•ด์„œ ์“ฐ๊ธฐ

grid_rf = GridSearchCV(rf, rf_params, cv=5)
grid_rf.fit(X_train, Y_train)
pred = grid_rf.predict(X_train)

mse = mean_squared_error(Y_train, pred)
mse

 

metrics ์‚ฌ์šฉ ์‹œ

Regression - mean_squared_error / mean_absolute_error

Classification - accuracy_score / f1_score

8. Check the best parameters and train

# how to read off the best parameters
grid_rf.best_params_

rf = RandomForestRegressor(max_depth=..., n_estimators=...)
rf.fit(X_train, Y_train)
pred = rf.predict(X_valid)
mse = mean_squared_error(Y_valid, pred)
mse

predict = rf.predict(X_test)

 

9. ์กฐ๊ฑด ํ™•์ธํ•ด์„œ ์ œ์ถœ

idx = test['col']

result = pd.DataFrame({
	"index":idx,	# ๋ฌธ์ œ์—์„œ ์ œ์‹œํ•œ ์ •๋ ฌ/primary key/index ๊ธฐ์ค€ 
	"ans":predict	# ์˜ˆ์ธกํ•œ column
})


# df.to_csv๋กœ ์ €์žฅ๋ช…๊นŒ์ง€๋Š” ์ฝ”๋“œ๋กœ ์ฃผ์–ด์คŒ
# ๋’ค์— index=False ๋ถ™์—ฌ์ฃผ๋Š”๊ฑฐ๋งŒ ์žŠ์ง€๋ง๊ธฐ
result.to_csv("submission.csv", index=False)

 

 


๊ทธ ์™ธ ์•Œ์•„๋‘๋ฉด ์ข‹์€ ๊ฒƒ๋“ค

๊ฒ€์ƒ‰ํ•ด์„œ ์—ฌ๊ธฐ๊นŒ์ง€ ์˜ค๋Š” ์‚ฌ๋žŒ์ด ์žˆ์„์ง„๋ชจ๋ฅด๊ฒ ๋Š”๋ฐ..

์ถ”๊ฐ€์ ์œผ๋กœ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๋„์›€ ๋งŽ์ด ๋๋˜ ๊ฑฐ ์•Œ๋ ค๋“œ๋ฆผ๋‹ˆ๋‹ค

 

- ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๋„์›€ ๋๋˜ ๋ฌธ์ œ๋“ค

1๏ธโƒฃ ๊ฒฐ์ธก์น˜ ์กด์žฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: ์šฐ์ฃผ์„  ์ƒ์กด์ž ์˜ˆ์ธก

 

Spaceship Titanic

Predict which passengers are transported to an alternate dimension

www.kaggle.com

 

2๏ธโƒฃ ๋ฐ์ดํ„ฐ ์ด์ƒ์น˜ ์ฒ˜๋ฆฌ: ์ž์ „๊ฑฐ ์ˆ˜์š” ์˜ˆ์ธก

 

Bike Sharing Demand

Forecast use of a city bikeshare system

www.kaggle.com

    ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋กœ์„œ์˜ ์˜๋ฏธ๋„ ์žˆ๊ณ .. ๋‚ ์งœ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ๊ณ ๋ฏผํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ข‹์€๋“ฏ

 

3๏ธโƒฃ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ฒ˜๋ฆฌ: ์ฑ„์šฉ๊ณต๊ณ ์ถ”์ฒœ

 

ํ”„๋กœ๊ทธ๋ž˜๋จธ์Šค

SW๊ฐœ๋ฐœ์ž๋ฅผ ์œ„ํ•œ ํ‰๊ฐ€, ๊ต์œก, ์ฑ„์šฉ๊นŒ์ง€ Total Solution์„ ์ œ๊ณตํ•˜๋Š” ๊ฐœ๋ฐœ์ž ์„ฑ์žฅ์„ ์œ„ํ•œ ๋ฒ ์ด์Šค์บ ํ”„

programmers.co.kr

 

 

- ์™€ ๋‚˜๋Š” ์ง„์งœ ๋„์ €ํžˆ ํ‰์†Œ์— pandas ์•ˆ์จ์„œ DF ๋ฌธ์ œ ๋ชปํ’€๊ฒ ๋‹ค??

 

๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ํ•จ์ˆ˜ ๊ณต๋ถ€ : ํ”„๋กœ๊ทธ๋ž˜๋จธ์Šค ์‚ฌ์ „ํ…Œ์ŠคํŠธใ…‹ใ…‹ ๋กœ ์ฃผ์–ด์ง€๋Š” '๋ฏธ์„ธ๋จผ์ง€ ๋†๋„์— ๋”ฐ๋ฅธ WHO ์˜ˆ๋ณด๋“ฑ๊ธ‰ ~~์–ด์ฉŒ๊ณ ' ๋ฌธ์ œ ๊ผญ ํ’€์–ด๋ณด๊ธฐ

→ ๋ฌธ์ œ์—์„œ ๋‹ต๊นŒ์ง€ ๋„์ถœํ•˜๋Š”๋ฐ ์š”๊ตฌ๋˜๋Š” ์ตœ์†Œํ•œ์˜ ํ•จ์ˆ˜๋Š” ๊ฑฐ์˜ ์—ฌ๊ธฐ์„œ ์—ฐ์Šต ๊ฐ€๋Šฅ

 

 

- ๋ชจ๋ธ ๋Œ€์ฒด ๋ญ์“ฐ์ง€??

 

๋ถ„๋ฅ˜ : RandomForestClassifier, DecisionTreeClassifier

ํšŒ๊ท€ : RandomForestRegressor, GradientBoostingRegressor

 

scikit-learn์— ์ฃผ์–ด์ง€๋Š” ๊ธฐ๋ณธ ML ๋ชจ๋ธ ์ค‘์—์„œ๋Š”

์›ฌ๋งŒํผ ์ƒ์‹์ ์ธ ๋ฐ์ดํ„ฐ์…‹ ๋ฒ”์œ„ ๋‚ด์—์„œ๋Š” tree ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ธ DecisionTree / RandomForest๊ฐ€ ์ œ์ผ ์ž˜๋จนํžŒ๋‹ค

๋Œ€์‹  ๊ทธ๋งŒํผ ๊ณผ์ ํ•ฉ๋„ ์‰ฌ์šด ํŽธ์ด๋‹ˆ ๋ฌด์กฐ๊ฑด ์ž˜๋‚˜์™”๋‹ค๊ณ  ์ข‹์•„ํ• ๊ฑด ์•„๋‹ˆ๊ณ 

train_test_split ์จ์„œ validation์„ ํ•ด๋ณด๋˜๊ฐ€, parameter search ์ž˜ํ•ด์•ผ๋จ
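A minimal sketch of that validation step, with synthetic data standing in for an exam dataset; a large gap between training and validation error is the overfitting signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real train set
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, rf.predict(X_train))
valid_mse = mean_squared_error(y_valid, rf.predict(X_valid))
# if valid_mse is far larger than train_mse, the model is overfitting
```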

 

 

- ํ•จ์ˆ˜๋„ ์ž˜ ๋ชจ๋ฅด๊ฒ ๊ณ  ํŒŒ๋ผ๋ฏธํ„ฐ๋„ ์ž˜ ๊ธฐ์–ต์•ˆ๋‚ ๋•Œ

 

๋ณดํ†ต python, numpy, pandas ๊นŒ์ง€๋Š” ์ฃผํ”ผํ„ฐ๋…ธํŠธ๋ถ์—์„œ ๊ณต์‹๋ฌธ์„œ ๋งํฌ ์ฃผ๊ธฐ๋•Œ๋ฌธ์—  ์ฐธ๊ณ ํ•˜๋ฉด ๋˜๋Š”๋ฐ, ๊ทธ ์™ธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ํŠน์ • ํ•จ์ˆ˜๋ฅผ ๋ชจ๋ฅผ๋• ๋ฌด์กฐ๊ฑด help.

help๋Š” ์‹ ์ด๋‹ค.

์‹œํ—˜ํ™˜๊ฒฝ์—์„œ ๋จนํžˆ๋Š”์ง€ ์ด๋ฏธ ์—ฌ๋Ÿฌ๋ฒˆ ์จ๋จน์–ด๋ด„๐Ÿ‘

 

help(RandomForestClassifier)

 

๋‹ต์„ ๋‹ค ์ฃผ๋Š” ๊ฒƒ์€ ์•„๋‹ˆ์ง€๋งŒ.. ์—ฐ์Šต์„ ์—ฌ๋Ÿฌ๋ฒˆ ํ•ด๋ดค๋‹ค๋ฉด ํ‚ค์›Œ๋“œ๋งŒ ๋ณด๊ณ ๋„ ๋ฐ”๋กœ ์ƒ๊ฐ๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ง ์‹œํ—˜ ๋ณด๋‹ค ๊ธ‰ํ•  ๋•Œ ์‚ฌ์šฉ ๊ฐ€๋Šฅ..

์ € ์ •๋ง ์ € ๊ธฐ๋Šฅ ๋•๋ถ„์— ์—ฌ๋Ÿฌ๋ฒˆ ์‚ด์•˜์Šต๋‹ˆ๋‹ค

 

 

์ทจ์ค€์ƒ ๋ชจ๋‘ ํ™”์ดํŒ…

๋‚˜ ํ™”์ดํŒ…..
