edit : post

StartBootstrap · Sep 3, 2020 · d897fd7
1 parent 8666de0
commit d897fd7

Showing 8 changed files with 96 additions and 13 deletions.
diff --git a/_posts/2020-08-29-Titanic_1.md b/_posts/2020-08-29-Titanic(1).md
@@ -1,5 +1,5 @@
 ---
-title: "[Kaggle] Titanic Data Analysis (1)"
+title: "Titanic Data Analysis (1)"
 search: true
 categories:
  - Kaggle
@@ -11,7 +11,7 @@ tags:
 last_modified_at: 2020-08-29 15:57
 layout: jupyter
 classes: wide
-excerpt: Analyzing the Kaggle Titanic dataset.
+excerpt: "[Kaggle] Analyzing the Titanic dataset"
 toc: true
 toc_sticky: true
 toc_label: "Table of Contents"
@@ -95,6 +95,27 @@ test = pd.read_csv('kaggle/titanic/test.csv')
 
 <div class="input_area" markdown="1">
 
+```python
+import matplotlib.font_manager as fm
+plt.style.use('ggplot')
+sns.set()
+sns.set_palette("Set2")
+
+%matplotlib inline
+path = '/usr/share/fonts/NanumFont/NanumGothic.ttf'
+font_name = fm.FontProperties(fname=path, size=50).get_name()
+plt.rc('font', family=font_name)
+plt.rcParams['figure.figsize'] = 8,5
+plt.rcParams["font.family"] = "NanumGothic"
+```
+
+</div>
+
+<div class="prompt input_prompt">
+</div>
+
+<div class="input_area" markdown="1">
+
 ```python
 train.head()
 ```
@@ -438,9 +459,9 @@ plt.style.use('ggplot')
 sns.set()
 sns.set_palette("Set2")
 
-def chart(feature):
-    survived = train[train['Survived'] == 1][feature].value_counts()
-    dead = train[train['Survived'] == 0][feature].value_counts()
+def chart(dataset, feature):
+    survived = dataset[dataset['Survived'] == 1][feature].value_counts()
+    dead = dataset[dataset['Survived'] == 0][feature].value_counts()
     df = pd.DataFrame([survived, dead])
     df.index = ['Survived', 'Dead']
     df.plot(kind='bar', stacked=True)
@@ -456,7 +477,7 @@ def chart(feature):
 <div class="input_area" markdown="1">
 
 ```python
-chart('Pclass')
+chart(train, 'Pclass')
 ```
 
 </div>
@@ -473,7 +494,7 @@ chart('Pclass')
 <div class="input_area" markdown="1">
 
 ```python
-chart('Sex')
+chart(train, 'Sex')
 ```
 
 </div>
@@ -490,7 +511,7 @@ chart('Sex')
 <div class="input_area" markdown="1">
 
 ```python
-chart('SibSp')
+chart(train, 'SibSp')
 ```
 
 </div>
@@ -499,15 +520,16 @@ chart('SibSp')
 ![png](/images/Titanic_1_files/Titanic_1_17_0.png)
 
 
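-This graph plots survival by the number of siblings and spouses aboard. Proportionally, passengers with a `SibSp` value of 1 or 2 survived more often than those with a value of 0. Values of 3 and above are hard to make out, though, so they need a closer look.
+This graph plots survival by the number of siblings and spouses aboard. Proportionally, passengers with a `SibSp` value of 1 or 2 survived more often than those with a value of 0. Values of 3 and above are hard to make out, though, so they need a closer look. <br>Let's check.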
 
 <div class="prompt input_prompt">
 </div>
 
 <div class="input_area" markdown="1">
 
 ```python
-chart('Embarked')
+temp = train[(train['SibSp'] > 2)]
+chart(temp, 'SibSp')
 ```
 
 </div>
@@ -516,8 +538,69 @@ chart('Embarked')
 ![png](/images/Titanic_1_files/Titanic_1_19_0.png)
 
 
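-This time we look at survival by each passenger's port of embarkation. Before drawing the graph I wondered what the port of embarkation could have to do with survival, but the plot shows a larger difference than expected. Some ports may serve wealthier cities and others poorer ones. <br>
+The plot shows that passengers with a `SibSp` value of 3 or more did not have a high survival rate.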
+
+<div class="prompt input_prompt">
+</div>
+
+<div class="input_area" markdown="1">
+
+```python
+chart(train, 'Embarked')
+```
+
+</div>
+
+
+![png](/images/Titanic_1_files/Titanic_1_21_0.png)
+
+
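+This time we look at survival by each passenger's port of embarkation. The gap is larger than a difference in port alone would suggest. Some ports may serve wealthier cities and others poorer ones. Let's check the number of first-, second-, and third-class passengers for each port.<br>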
+
+<div class="prompt input_prompt">
+</div>
+
+<div class="input_area" markdown="1">
+
+```python
+S = train[train['Embarked'] == 'S']['Pclass'].value_counts()
+C = train[train['Embarked'] == 'C']['Pclass'].value_counts()
+Q = train[train['Embarked'] == 'Q']['Pclass'].value_counts()
+df = pd.DataFrame([S, C, Q])
+df.index = ['S', 'C', 'Q']
+df.plot(kind='bar', stacked=True)
+```
+
+</div>
+
+
+
+
+{:.output_data_text}
+
+```
+<matplotlib.axes._subplots.AxesSubplot at 0x7f23d6b214d0>
+```
+
+
+
+
+![png](/images/Titanic_1_files/Titanic_1_23_1.png)
+
+
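+The share of first-class passengers differs by port of embarkation. Among passengers whose `Embarked` value is `C`, first class accounts for nearly half. This likely goes a long way toward explaining why, in the previous graph, the survival rate for passengers who embarked at `C` came out close to 50 percent.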
 
 ---
 
 That concludes the data analysis for the Titanic problem. The next post will cover ***Feature Engineering***. Thank you!
+
+<div class="prompt input_prompt">
+</div>
+
+<div class="input_area" markdown="1">
+
+```python
+
+```
+
+</div>
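
Review note on the hunks above: the post reads survival rates and class shares off stacked bar charts, which is hard to do for small groups such as `SibSp` >= 3. A numeric table may back those claims more directly. The following is a minimal sketch, not part of this commit; the helper names `survival_rate_by` and `class_share_by_port` are hypothetical, and the `toy` rows are made up for illustration rather than taken from the Kaggle data.

```python
import pandas as pd


def survival_rate_by(dataset, feature):
    """Survival rate (mean of `Survived`) and group size per value of `feature`."""
    grouped = dataset.groupby(feature)['Survived']
    return pd.DataFrame({'rate': grouped.mean(), 'count': grouped.count()})


def class_share_by_port(dataset):
    """Share of each `Pclass` within every `Embarked` value (each row sums to 1)."""
    return pd.crosstab(dataset['Embarked'], dataset['Pclass'], normalize='index')


# Made-up rows for illustration only; these are NOT the Kaggle Titanic data.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 0, 0, 1],
    'SibSp':    [0, 0, 1, 1, 3, 3, 4, 0],
    'Pclass':   [1, 3, 1, 2, 3, 3, 3, 1],
    'Embarked': ['C', 'S', 'C', 'S', 'S', 'Q', 'S', 'C'],
})

rates = survival_rate_by(toy, 'SibSp')
shares = class_share_by_port(toy)
print(rates)
print(shares)
```

With the real data, `survival_rate_by(train, 'SibSp')` and `class_share_by_port(train)` would take the same `train` frame that `chart(train, ...)` receives. The `count` column is kept next to the rate because a rate computed from a handful of passengers deserves less weight.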
diff --git a/_posts/2020-08-31-ML_splitting_dataset.md b/_posts/2020-08-31-ML_splitting_dataset.md
@@ -1,5 +1,5 @@
 ---
-title: "[ML] Splitting a Dataset (Training Set, Test Set) (1)"
+title: "Splitting a Training Dataset (1)"
 search: true
 categories:
  - Machine Learning
@@ -9,7 +9,7 @@ tags:
 last_modified_at: 2020-08-31 23:17
 layout: jupyter
 classes: wide
-excerpt: Splitting data into a training set and a test set
+excerpt: "[ML] Splitting data into a training set and a test set"
 toc: true
 toc_sticky: true
 toc_label: "Table of Contents"

diff --git a/images/Titanic_1_files/Titanic_1_13_0.png b/images/Titanic_1_files/Titanic_1_13_0.png
diff --git a/images/Titanic_1_files/Titanic_1_15_0.png b/images/Titanic_1_files/Titanic_1_15_0.png
diff --git a/images/Titanic_1_files/Titanic_1_17_0.png b/images/Titanic_1_files/Titanic_1_17_0.png
diff --git a/images/Titanic_1_files/Titanic_1_19_0.png b/images/Titanic_1_files/Titanic_1_19_0.png
diff --git a/images/Titanic_1_files/Titanic_1_21_0.png b/images/Titanic_1_files/Titanic_1_21_0.png
diff --git a/images/Titanic_1_files/Titanic_1_23_1.png b/images/Titanic_1_files/Titanic_1_23_1.png