softeerbootcamp4th · jang-namu · Jul 8, 2024
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,112 @@
+# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,pycharm+all,python,venv,jupyternotebooks
+# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode,pycharm+all,python,venv,jupyternotebooks
+*.csv
+
+### JupyterNotebooks ###
+# gitignore template for Jupyter Notebooks
+# website: http://jupyter.org/
+
+.ipynb_checkpoints
+*/.ipynb_checkpoints/*
+
+# IPython
+profile_default/
+ipython_config.py
+
+share/
+
+# Remove previous ipynb_checkpoints
+#   git rm -r .ipynb_checkpoints/
+
+### PyCharm+all ###
+# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
+# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
+
+# User-specific stuff
+.idea/**/workspace.xml
+.idea/**/tasks.xml
+.idea/**/usage.statistics.xml
+.idea/**/dictionaries
+.idea/**/shelf
+
+# AWS User-specific
+.idea/**/aws.xml
+
+# Generated files
+.idea/**/contentModel.xml
+
+# Sensitive or high-churn files
+.idea/**/dataSources/
+.idea/**/dataSources.ids
+.idea/**/dataSources.local.xml
+.idea/**/sqlDataSources.xml
+.idea/**/dynamic.xml
+.idea/**/uiDesigner.xml
+.idea/**/dbnavigator.xml
+
+# Gradle
+.idea/**/gradle.xml
+.idea/**/libraries
+
+# Gradle and Maven with auto-import
+# When using Gradle or Maven with auto-import, you should exclude module files,
+# since they will be recreated, and may cause churn.  Uncomment if using
+# auto-import.
+# .idea/artifacts
+# .idea/compiler.xml
+# .idea/jarRepositories.xml
+# .idea/modules.xml
+# .idea/*.iml
+# .idea/modules
+# *.iml
+# *.ipr
+
+# CMake
+cmake-build-*/
+
+# Mongo Explorer plugin
+.idea/**/mongoSettings.xml
+
+# File-based project format
+*.iws
+
+# IntelliJ
+out/
+
+# mpeltonen/sbt-idea plugin
+.idea_modules/
+
+# JIRA plugin
+atlassian-ide-plugin.xml
+
+# Cursive Clojure plugin
+.idea/replstate.xml
+
+# SonarLint plugin
+.idea/sonarlint/
+
+# Crashlytics plugin (for Android Studio and IntelliJ)
+com_crashlytics_export_strings.xml
+crashlytics.properties
+crashlytics-build.properties
+fabric.properties
+
+# Editor-based Rest Client
+.idea/httpRequests
+
+# Android studio 3.1+ serialized cache file
+.idea/caches/build_file_checksums.ser
+
+### PyCharm+all Patch ###
+# Ignore everything but code style settings and run configurations
+# that are supposed to be shared within teams.
+
+.idea/*
+
+!.idea/codeStyles
+!.idea/runConfigurations
+
+### Python ###
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -76,11 +185,8 @@ docs/_build/
 target/
 
 # Jupyter Notebook
-.ipynb_checkpoints
 
 # IPython
-profile_default/
-ipython_config.py
 
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
@@ -158,3 +264,46 @@ cython_debug/
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+### Python Patch ###
+# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
+poetry.toml
+
+# ruff
+.ruff_cache/
+
+# LSP config files
+pyrightconfig.json
+
+### venv ###
+# Virtualenv
+# http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
+[Bb]in
+[Ii]nclude
+[Ll]ib
+[Ll]ib64
+[Ll]ocal
+#[Ss]cripts
+pyvenv.cfg
+pip-selfcheck.json
+
+### VisualStudioCode ###
+.vscode/*
+!.vscode/settings.json
+!.vscode/tasks.json
+!.vscode/launch.json
+!.vscode/extensions.json
+!.vscode/*.code-snippets
+
+# Local History for Visual Studio Code
+.history/
+
+# Built Visual Studio Code Extensions
+*.vsix
+
+### VisualStudioCode Patch ###
+# Ignore all local history of files
+.history
+.ionide
+
+# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode,pycharm+all,python,venv,jupyternotebooks
diff --git a/missions/W1/README.md b/missions/W1/README.md
@@ -0,0 +1,101 @@
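+# Week 1. Environment Setup
+
+## Table of Contents
+1. [Environment Setup](#1-environment-setup)
+   - [Installing pyenv](#installing-pyenv)
+   - [venv](#venv)
+   - [Installing JupyterLab and Jupyter Notebook](#installing-jupyterlab-and-jupyter-notebook)
+   - [Running Jupyter Notebook in a venv](#running-jupyter-notebook-in-a-venv)
+   - [Appendix](#appendix)
+   - [References](#references)
+
+## 1. Environment Setup
+
+### Installing pyenv
+>pyenv lets you install, manage, and switch between multiple Python versions on one machine.  
+See the link below for installation instructions.  
+[GitHub-pyenv](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation)  
+
+Since each project's concrete environment will be handled by venv, the Python installed through pyenv can simply be set as the global version.  
+```bash
+pyenv install 3.12  # install Python 3.12 (needed once before it can be selected)
+pyenv global 3.12   # use Python 3.12 as the global version
+```  
+<br>
+
+### venv
+>venv provides virtual environments, so each working directory can use its own Python interpreter and keep its dependencies isolated from other projects.  
+
+<br>
+
+### Installing JupyterLab and Jupyter Notebook
+Follow [Jupyter Install](https://jupyter.org/install) to install JupyterLab and Jupyter Notebook.  
+<br>
+
+### Running Jupyter Notebook in a venv
+If Jupyter is launched as-is, it uses the Python that pyenv set as global, and every dependency installed later ends up there as well.  
+That gives up most of the benefit of venv, so we configure Jupyter to run from the venv instead of the pyenv global Python.  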
+<br>
+>Install and use packages in the venv (local) rather than in pyenv (global)
+
+To see which Python interpreter is being run, use `which`.  
+```bash
+which python
+```  
+![figure-1](assets/figure-1.png)  
+It currently resolves to the Python 3.12 that pyenv set as global.  
+We will switch this to the venv.  
+<br>
+Create a venv in the W1 folder.  
+```bash
+python -m venv <working-directory>
+```
+![figure-2](assets/figure-2.png)  
+<br>
+Activate the virtual environment.  
+```bash
+source <working-directory>/bin/activate
+```
+![figure-3](assets/figure-3.png)  
+<br>
+Next, Jupyter Notebook has to be connected to the venv.  
+Install ipykernel.   
+- Install it after running activate so that ipykernel itself is installed inside the virtual environment.  
+```bash
+pip install ipykernel
+```
+<br>
+
+Register a kernel, specifying the virtual environment to add and the name Jupyter should display for it.  
+```bash
+python -m ipykernel install --user --name <venv-folder> --display-name "<name shown in Jupyter>"
+# e.g. python -m ipykernel install --user --name W1 --display-name W1-venv
+```
+
+<br>
+
+Start Jupyter, create an .ipynb file, and switch its kernel to the one just registered.  
+![figure-4](assets/figure-4.png)  
+
+<br>
+
+Install a new package and confirm that it lands in the right place (the venv).  
+![figure-5](assets/figure-5.png)    
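The same check can also be done from a notebook cell. The snippet below is a minimal sketch, not part of the original README; the helper name `in_virtualenv` is hypothetical. It relies on the fact that inside a venv `sys.prefix` points at the venv directory while `sys.base_prefix` points at the interpreter the venv was created from.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix is the venv directory, while sys.base_prefix
    # is the installation the venv was created from. Outside a venv they match.
    return sys.prefix != sys.base_prefix


print("interpreter :", sys.executable)  # path of the Python running this kernel
print("environment :", sys.prefix)      # where newly installed packages will go
print("inside venv :", in_virtualenv())
```

If the kernel was switched correctly, the interpreter path should sit under the venv folder rather than under the pyenv versions directory.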
+
+<br>
+
+### Appendix
+How to list the registered kernels and remove one  
+```bash
+# List all kernels and grab the name of the kernel you want to remove
+jupyter kernelspec list
+# Remove it
+jupyter kernelspec remove <kernel_name>
+```  
+
+<br>
+
+### References
+[Connecting and removing a virtual environment (Virtualenv) in JupyterLab](https://raki-1203.github.io/jupyter/JupyterLab_venv_add_delete/)
+
+
+
diff --git a/missions/W1/assets/figure-1.png b/missions/W1/assets/figure-1.png
diff --git a/missions/W1/assets/figure-2.png b/missions/W1/assets/figure-2.png
diff --git a/missions/W1/assets/figure-3.png b/missions/W1/assets/figure-3.png
diff --git a/missions/W1/assets/figure-4.png b/missions/W1/assets/figure-4.png
diff --git a/missions/W1/assets/figure-5.png b/missions/W1/assets/figure-5.png
diff --git a/missions/W1/etl/etl.md b/missions/W1/etl/etl.md
@@ -0,0 +1,52 @@
+# Implementing the ETL Process
+> You need to create the `generated` directory.
+
+Build a pipeline of web scraping (collection) -> processing (transformation) -> DB (storage).
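As a rough illustration of that three-stage shape, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the function names, the `gdp` table, and the hard-coded `SAMPLE_ROWS` that stand in for scraped data (a real `extract` would download and parse the source page).

```python
import sqlite3

# Hypothetical sample rows standing in for scraped data:
# (country, GDP in million USD as text, the way a web table would deliver it).
SAMPLE_ROWS = [("Country A", "1,500,000"), ("Country B", "250,000"), ("Country C", "N/A")]


def extract():
    """Collect raw rows. A real version would fetch and parse the source page."""
    return SAMPLE_ROWS


def transform(raw_rows):
    """Drop rows without a numeric value and convert the text to float."""
    cleaned = []
    for country, gdp_text in raw_rows:
        try:
            cleaned.append((country, float(gdp_text.replace(",", ""))))
        except ValueError:
            continue  # placeholder text such as "N/A": no figure available
    return cleaned


def load(rows, conn):
    """Store the processed rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gdp (country TEXT PRIMARY KEY, gdp_musd REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO gdp VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract -> transform -> load once."""
    load(transform(extract()), conn)


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
stored = conn.execute(
    "SELECT country, gdp_musd FROM gdp ORDER BY gdp_musd DESC"
).fetchall()
print(stored)
```

Keeping each stage in its own function means the scraping step can later be swapped for another source without touching the transform or load steps.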
+
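+**Scenario**  
+- You work as a Data Engineer at a company that wants to expand its business overseas. Management wants to assess business viability in **countries with a high GDP**.  
+- Management is expected to keep requesting this data, so you need to build an **automated script**.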
+
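+**Functional requirements**
+- Get the GDP of each country as published by the IMF. [wiki](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29)
+- Build a table that shows GDP by country.
+- Countries with a higher GDP must come first in the table.
+- GDP must be expressed in units of 1B USD and shown to two decimal places.
+- The IMF publishes this data twice a year, so the same code must be reusable to fetch the data after each update.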
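The unit and ordering requirements above can be sketched as follows. This assumes the scraped figures arrive in million USD (check the source table's header before relying on that); the function names and the sample rows are hypothetical.

```python
def to_billions(gdp_million_usd: float) -> float:
    """Convert million USD to billion USD, rounded to two decimal places."""
    return round(gdp_million_usd / 1000, 2)


def build_gdp_table(rows):
    """Return (country, GDP in 1B USD) pairs with the highest GDP first."""
    table = [(country, to_billions(gdp)) for country, gdp in rows]
    return sorted(table, key=lambda row: row[1], reverse=True)


def format_table(table) -> str:
    """Render the table as text, always showing two decimal places."""
    lines = [f"{'Country':<12}{'GDP (1B USD)':>14}"]
    lines += [f"{country:<12}{gdp:>14,.2f}" for country, gdp in table]
    return "\n".join(lines)


# Hypothetical sample input: (country, GDP in million USD).
sample = [("Country A", 250_000.0), ("Country B", 1_500_000.0), ("Country C", 98_765.4)]
gdp_table = build_gdp_table(sample)
print(format_table(gdp_table))
```

Rounding happens once at conversion time, and the formatter separately pads to two decimals so that a value like 1500.0 still prints as 1,500.00.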
+
+**Screen output**
+- Print only the countries whose GDP is 100B USD or more.
+- For each Region, print the average GDP of its top 5 countries.
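Both outputs can be computed from rows of (country, region, GDP in 1B USD). The sketch below is one possible approach in plain Python; the function names, region labels, and figures are hypothetical sample data, and a region with fewer than five countries is averaged over the countries it has.

```python
from collections import defaultdict


def countries_over(rows, threshold=100.0):
    """Keep only rows whose GDP (1B USD) is at or above the threshold."""
    return [row for row in rows if row[2] >= threshold]


def top_n_average_by_region(rows, n=5):
    """Average the n largest GDP values within each region."""
    by_region = defaultdict(list)
    for _country, region, gdp in rows:
        by_region[region].append(gdp)
    averages = {}
    for region, values in by_region.items():
        top = sorted(values, reverse=True)[:n]  # fewer than n values is fine
        averages[region] = round(sum(top) / len(top), 2)
    return averages


# Hypothetical sample rows: (country, region, GDP in 1B USD).
rows = [
    ("N1", "North", 600.0), ("N2", "North", 500.0), ("N3", "North", 400.0),
    ("N4", "North", 300.0), ("N5", "North", 200.0), ("N6", "North", 50.0),
    ("S1", "South", 120.0), ("S2", "South", 80.0),
]
big = countries_over(rows)
averages = top_n_average_by_region(rows)
print([country for country, _region, _gdp in big])
print(averages)
```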
+
+**Notes**
+- Split the code into separate functions
+- Explain the code with comments
+
+
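+# Team Activity Requirements
+1. Is there a way to pull the data directly from the IMF website rather than from the Wikipedia page? How would you do it?  
+IMF DATA provides a relevant API. If possible, obtaining an API key and using the API is the best option.  
+If an API key is hard to obtain, the report-building (query) feature on the IMF website can be automated with Selenium.  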
+[IMF WEO](https://www.imf.org/en/Publications/WEO/weo-database/2024/April)
+
+2. If the data is updated, what should happen to the old data? If past data needs to be queryable, how should the ETL process change?  
+>Right now there are only about 200 rows per run, but assume the data volume is very large.  
+
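+First, separate the ETL process from the reporting script (average of the top 5 countries per region, GDP of 100B or more).  
+As long as the data has not been updated, the ETL program does not need to run every time a report is produced.  
+In other words, the ETL process runs only when the data has been updated, and routine reports are produced solely by a separate script that queries the DB.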
+<br>
+
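+Under that assumption, when the GDP data is updated and new data is about to be written to the database, what should happen to the existing data?  
+Even if it is not a routine need, past data that may ever have to be queried must be stored somewhere.  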
+<br>
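+The simplest approach is to store three extra values alongside each row in the existing table: the date the data was scraped, the year of the IMF data, and (if the data is updated every half-year) a symbol for the half-year.  
+The year and half-year are then enough to find the data you want.  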
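A minimal sketch of that single-table layout, using SQLite from the standard library. The table and column names (`gdp`, `data_year`, `half`, `scraped_on`), the `'H1'`/`'H2'` symbols, and all sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gdp (
        country    TEXT    NOT NULL,
        gdp_busd   REAL    NOT NULL,
        data_year  INTEGER NOT NULL,  -- year of the IMF release
        half       TEXT    NOT NULL,  -- hypothetical symbol: 'H1' or 'H2'
        scraped_on TEXT    NOT NULL,  -- ISO date the data was scraped
        PRIMARY KEY (country, data_year, half)
    )
    """
)


def load_snapshot(conn, rows, data_year, half, scraped_on):
    """Insert one release; rows are (country, GDP in 1B USD)."""
    conn.executemany(
        "INSERT OR REPLACE INTO gdp VALUES (?, ?, ?, ?, ?)",
        [(country, gdp, data_year, half, scraped_on) for country, gdp in rows],
    )
    conn.commit()


def query_snapshot(conn, data_year, half):
    """Return one release, highest GDP first."""
    return conn.execute(
        "SELECT country, gdp_busd FROM gdp "
        "WHERE data_year = ? AND half = ? ORDER BY gdp_busd DESC",
        (data_year, half),
    ).fetchall()


# Hypothetical sample releases.
load_snapshot(conn, [("Country A", 240.0), ("Country B", 1450.0)], 2023, "H2", "2023-10-15")
load_snapshot(conn, [("Country A", 250.0), ("Country B", 1500.0)], 2024, "H1", "2024-07-08")
past = query_snapshot(conn, 2023, "H2")
print(past)
```

The composite primary key makes reloading the same release idempotent: rerunning the ETL for 2024 H1 overwrites those rows instead of duplicating them.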
+<br>
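+However, once the data grows large, finding anything in a single table that holds everything becomes slow and inefficient.  
+Past data, meaning everything other than the latest data needed day to day, is accessed rarely, yet keeping it all in one place slows down the routine queries as well.  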
+<br>
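+To address this, keep a separate table that holds only the most recent data.  
+That is, newly updated data replaces the contents of the latest-data table and is also written to the archive table.  
+Routine queries run against the latest-data table, and the occasional query for past data uses the archive table.  
+This keeps the latest-data table at a constant size regardless of the total volume, which keeps routine query performance predictable.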
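The two-table layout could look like the sketch below. The table names `gdp_latest` and `gdp_archive`, the column names, and the sample releases are all hypothetical; the point is only that each load swaps out the latest table and appends to the archive in one transaction.

```python
import sqlite3


def init_db(conn):
    """Create the latest-data table and the archive table."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS gdp_latest (
            country  TEXT PRIMARY KEY,
            gdp_busd REAL NOT NULL
        );
        CREATE TABLE IF NOT EXISTS gdp_archive (
            country   TEXT    NOT NULL,
            gdp_busd  REAL    NOT NULL,
            data_year INTEGER NOT NULL,
            half      TEXT    NOT NULL,
            PRIMARY KEY (country, data_year, half)
        );
        """
    )


def load_release(conn, rows, data_year, half):
    """Replace the latest table and append the same rows to the archive."""
    with conn:  # one transaction: both tables change together or not at all
        conn.execute("DELETE FROM gdp_latest")
        conn.executemany("INSERT INTO gdp_latest VALUES (?, ?)", rows)
        conn.executemany(
            "INSERT OR REPLACE INTO gdp_archive VALUES (?, ?, ?, ?)",
            [(country, gdp, data_year, half) for country, gdp in rows],
        )


conn = sqlite3.connect(":memory:")
init_db(conn)
# Hypothetical sample releases.
load_release(conn, [("Country A", 240.0), ("Country B", 1450.0)], 2023, "H2")
load_release(conn, [("Country A", 250.0), ("Country B", 1500.0)], 2024, "H1")

latest_count = conn.execute("SELECT COUNT(*) FROM gdp_latest").fetchone()[0]
archive_count = conn.execute("SELECT COUNT(*) FROM gdp_archive").fetchone()[0]
print(latest_count, archive_count)
```

After two releases the latest table still holds one row per country while the archive holds every release, which is the size property the text relies on.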