diff --git a/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts.ipynb b/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts.ipynb
new file mode 100644
index 000000000..eaed60b2c
--- /dev/null
+++ b/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts.ipynb
@@ -0,0 +1,994 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "13pL--6rycN3"
+ },
+ "source": [
+ "## Practice 02: Dealing with texts using CNN\n",
+ "\n",
+ "Today we're gonna apply the newly learned tools for the task of predicting job salary.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Based on YSDA [materials](https://github.com/yandexdataschool/nlp_course/blob/master/week02_classification/seminar.ipynb). _Special thanks to [Oleg Vasilev](https://github.com/Omrigan/) for the core assignment idea._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "P8zS7m-gycN5"
+ },
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "34x92vWQycN_"
+ },
+ "source": [
+ "### About the challenge\n",
+ "For starters, let's download and unpack the data from [here](https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=0). \n",
+ "\n",
+ "You can also get it from [yadisk url](https://yadi.sk/d/vVEOWPFY3NruT7) the competition [page](https://www.kaggle.com/c/job-salary-prediction/data) (pick `Train_rev1.*`)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "colab_type": "code",
+ "id": "vwN72gd4ycOA",
+ "outputId": "7b9e8549-3128-4041-c4be-33fb6f326c78"
+ },
+ "outputs": [],
+ "source": [
+ "# Do this only once\n",
+ "!curl -L \"https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=1\" -o Train_rev1.csv.tar.gz\n",
+ "!tar -xvzf ./Train_rev1.csv.tar.gz"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "colab_type": "code",
+ "id": "vwN72gd4ycOA",
+ "outputId": "7b9e8549-3128-4041-c4be-33fb6f326c78"
+ },
+ "outputs": [],
+ "source": [
+ "data = pd.read_csv(\"./Train_rev1.csv\", index_col=None)\n",
+ "data.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "z7kznuJfycOH"
+ },
+ "source": [
+ "One problem with salary prediction is that it's oddly distributed: there are many people who are paid standard salaries and a few that get tons o money. The distribution is fat-tailed on the right side, which is inconvenient for MSE minimization.\n",
+ "\n",
+ "There are several techniques to combat this: using a different loss function, predicting log-target instead of raw target or even replacing targets with their percentiles among all salaries in the training set. We gonna use logarithm for now.\n",
+ "\n",
+ "_You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description)._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 265
+ },
+ "colab_type": "code",
+ "id": "UuuKIKfrycOH",
+ "outputId": "e5de0f94-a4f6-4b51-db80-9d11ddc1db31"
+ },
+ "outputs": [],
+ "source": [
+ "data[\"Log1pSalary\"] = np.log1p(data[\"SalaryNormalized\"]).astype(\"float32\")\n",
+ "\n",
+ "plt.figure(figsize=[8, 4])\n",
+ "plt.subplot(1, 2, 1)\n",
+ "plt.hist(data[\"SalaryNormalized\"], bins=20)\n",
+ "\n",
+ "plt.subplot(1, 2, 2)\n",
+ "plt.hist(data[\"Log1pSalary\"], bins=20);"
+ ]
+ },
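 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "As an aside, here is a sketch of the percentile-style target mentioned above (using `scipy.stats.rankdata`; the variable name `salary_percentiles` is just for illustration). We stick with `Log1pSalary` in this notebook."
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# A sketch of the percentile-target alternative: replace each salary with its\n",
 + "# rank among all salaries, scaled to [0, 1]. Such a target is uniform by\n",
 + "# construction, so MSE is no longer dominated by the heavy right tail.\n",
 + "from scipy.stats import rankdata\n",
 + "\n",
 + "\n",
 + "salary_percentiles = rankdata(data[\"SalaryNormalized\"]) / len(data)\n",
 + "salary_percentiles[:5]"
 + ]
 + },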
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "Fcu-qmHRycOK"
+ },
+ "source": [
+ "Our task is to predict one number, __Log1pSalary__.\n",
+ "\n",
+ "To do so, our model can access a number of features:\n",
+ "* Free text: __`Title`__ and __`FullDescription`__\n",
+ "* Categorical: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 332
+ },
+ "colab_type": "code",
+ "id": "p9vyA_erycOK",
+ "outputId": "af9a21f3-10b7-4fde-d4cd-1f66939566b8"
+ },
+ "outputs": [],
+ "source": [
+ "text_columns = [\"Title\", \"FullDescription\"]\n",
+ "categorical_columns = [\"Category\", \"Company\", \"LocationNormalized\", \"ContractType\", \"ContractTime\"]\n",
+ "target_column = \"Log1pSalary\"\n",
+ "\n",
+ "# cast missing values to string \"NaN\"\n",
+ "data[categorical_columns] = data[categorical_columns].fillna(\"NaN\")\n",
+ "data.sample(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "IUdclucmycON"
+ },
+ "source": [
+ "### Preprocessing text data\n",
+ "\n",
+ "Just like last week, applying NLP to a problem begins from tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).\n",
+ "\n",
+ "__Your task__ is to lowercase and tokenize all texts under `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for performance reasons.\n",
+ "\n",
+ "It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 107
+ },
+ "colab_type": "code",
+ "id": "YzeOxD_aycOO",
+ "outputId": "b4826117-1196-4a0e-92fa-6fd3ca609202"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Raw text:\")\n",
+ "print(data[\"FullDescription\"][2::100000])"
+ ]
+ },
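 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "Before writing `normalize`, here is a quick look at what `WordPunctTokenizer` actually does (the sample sentence below is made up):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# WordPunctTokenizer splits on boundaries between alphanumeric runs and\n",
 + "# punctuation runs, so \"30,000\" becomes three tokens.\n",
 + "import nltk\n",
 + "\n",
 + "\n",
 + "print(nltk.tokenize.WordPunctTokenizer().tokenize(\"Salary: 30,000 GBP p.a. (negotiable)!\".lower()))"
 + ]
 + },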
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "RUWkpd7PycOQ"
+ },
+ "outputs": [],
+ "source": [
+ "import nltk\n",
+ "\n",
+ "\n",
+ "tokenizer = nltk.tokenize.WordPunctTokenizer()\n",
+ "\n",
+ "\n",
+ "# see task above\n",
+ "def normalize(text):\n",
+ " # YOUR CODE HERE\n",
+ " pass\n",
+ "\n",
+ "\n",
+ "data[text_columns] = data[text_columns].applymap(normalize)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "o3pQdHihycOT"
+ },
+ "source": [
+ "Now we can assume that our text is a space-separated list of tokens:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 107
+ },
+ "colab_type": "code",
+ "id": "Gs-6lnS_ycOU",
+ "outputId": "8948250d-7117-4e4f-a38d-00405f9b2cec"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Tokenized:\")\n",
+ "print(data[\"FullDescription\"][2::100000])\n",
+ "assert data[\"FullDescription\"][2][:50] == \"mathematical modeller / simulation analyst / opera\"\n",
+ "assert data[\"Title\"][54321] == \"international digital account manager ( german )\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "ouE3L2hyycOX"
+ },
+ "source": [
+ "Not all words are equally useful. Some of them are typos or rare words that are only present a few times. \n",
+ "\n",
+ "Let's count how many times is each word present in the data so that we can build a \"white list\" of known words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "colab_type": "code",
+ "id": "iC7hBwwjycOX",
+ "outputId": "70eb75fc-535f-45a3-ad97-95a98e1d020f"
+ },
+ "outputs": [],
+ "source": [
+ "# Count how many times does each token occur in both \"Title\" and \"FullDescription\" in total\n",
+ "# build a dictionary { token -> it's count }\n",
+ "from collections import Counter\n",
+ "\n",
+ "from tqdm import tqdm as tqdm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hint: you may or may not want to use collections.Counter\n",
+ "token_counts = None # YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 215
+ },
+ "colab_type": "code",
+ "id": "GiOWbc15ycOb",
+ "outputId": "1e807140-5513-4af0-d9a9-9f029059a553"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Total unique tokens :\", len(token_counts))\n",
+ "print(\"\\n\".join(map(str, token_counts.most_common(n=5))))\n",
+ "print(\"...\")\n",
+ "print(\"\\n\".join(map(str, token_counts.most_common()[-3:])))\n",
+ "\n",
+ "assert token_counts.most_common(1)[0][1] in range(2600000, 2700000)\n",
+ "assert len(token_counts) in range(200000, 210000)\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 279
+ },
+ "colab_type": "code",
+ "id": "nd5v3BNfycOf",
+ "outputId": "1c59b386-f052-4340-bf5d-09ae8d15983c"
+ },
+ "outputs": [],
+ "source": [
+ "# Let's see how many words are there for each count\n",
+ "plt.hist(list(token_counts.values()), range=[0, 10 ** 4], bins=50, log=True)\n",
+ "plt.xlabel(\"Word counts\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "znuXxeghycOh"
+ },
+ "source": [
+ "Now filter tokens a list of all tokens that occur at least 10 times."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "SeNFBWx5ycOh"
+ },
+ "outputs": [],
+ "source": [
+ "min_count = 10\n",
+ "\n",
+ "# tokens from token_counts keys that had at least min_count occurrences throughout the dataset\n",
+ "tokens = None # YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "RATIRyPKycOk",
+ "outputId": "6bb7482c-7c46-4f7e-81f2-6b70e04abc64"
+ },
+ "outputs": [],
+ "source": [
+ "# Add a special tokens for unknown and empty words\n",
+ "UNK, PAD = \"UNK\", \"PAD\"\n",
+ "tokens = [UNK, PAD] + sorted(tokens)\n",
+ "print(\"Vocabulary size:\", len(tokens))\n",
+ "\n",
+ "assert type(tokens) == list\n",
+ "assert len(tokens) in range(32000, 35000)\n",
+ "assert \"me\" in tokens\n",
+ "assert UNK in tokens\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "cqEsgbjZycOo"
+ },
+ "source": [
+ "Build an inverse token index: a dictionary from token(string) to it's index in `tokens` (int)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "L60lo1l_ycOq"
+ },
+ "outputs": [],
+ "source": [
+ "# You have already done that ;)\n",
+ "token_to_id = None # YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "colab_type": "code",
+ "id": "DeAoVo4mycOr",
+ "outputId": "8f29ef68-f9bd-4628-8222-1dc17f8f2590"
+ },
+ "outputs": [],
+ "source": [
+ "assert isinstance(token_to_id, dict)\n",
+ "assert len(token_to_id) == len(tokens)\n",
+ "for tok in tokens:\n",
+ " assert tokens[token_to_id[tok]] == tok\n",
+ "\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "cmJAkq3gycOv"
+ },
+ "source": [
+ "And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "JEsLeBjVycOw"
+ },
+ "outputs": [],
+ "source": [
+ "UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])\n",
+ "\n",
+ "\n",
+ "def as_matrix(sequences, max_len=None):\n",
+ " \"\"\"Convert a list of tokens into a matrix with padding\"\"\"\n",
+ " if isinstance(sequences[0], str):\n",
+ " sequences = list(map(str.split, sequences))\n",
+ "\n",
+ " max_len = min(max(map(len, sequences)), max_len or float(\"inf\"))\n",
+ "\n",
+ " matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))\n",
+ " for i, seq in enumerate(sequences):\n",
+ " row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]\n",
+ " matrix[i, : len(row_ix)] = row_ix\n",
+ "\n",
+ " return matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 179
+ },
+ "colab_type": "code",
+ "id": "JiBlPkdKycOy",
+ "outputId": "3866b444-1e2d-4d79-d429-fecc6d8e02a8"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Lines:\")\n",
+ "print(\"\\n\".join(data[\"Title\"][::100000].values), end=\"\\n\\n\")\n",
+ "print(\"Matrix:\")\n",
+ "print(as_matrix(data[\"Title\"][::100000]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "nGOdZ3-dycO4"
+ },
+ "source": [
+ "Now let's encode the categirical data we have.\n",
+ "\n",
+ "As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "DpOlBp7ZycO6",
+ "outputId": "30a911f2-7d35-4cb5-8991-60457b1e8bac"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_extraction import DictVectorizer\n",
+ "\n",
+ "\n",
+ "# we only consider top-1k most frequent companies to minimize memory usage\n",
+ "top_companies, top_counts = zip(*Counter(data[\"Company\"]).most_common(1000))\n",
+ "recognized_companies = set(top_companies)\n",
+ "data.loc[~data.Company.isin(recognized_companies), \"Company\"] = \"Other\"\n",
+ "\n",
+ "categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)\n",
+ "categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))"
+ ]
+ },
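 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "A quick sanity check of what the vectorizer produces for a single row (a sketch; the exact feature count depends on the data):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# Each row of categorical features becomes one long one-hot float vector;\n",
 + "# the number of ones should equal the number of categorical columns.\n",
 + "example_row = categorical_vectorizer.transform(data[categorical_columns].iloc[:1].apply(dict, axis=1))\n",
 + "print(example_row.shape, \"non-zero entries:\", int(example_row.sum()))"
 + ]
 + },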
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "yk4jmtAYycO8"
+ },
+ "source": [
+ "### The deep learning part\n",
+ "\n",
+ "Once we've learned to tokenize the data, let's design a machine learning experiment.\n",
+ "\n",
+ "As before, we won't focus too much on validation, opting for a simple train-test split.\n",
+ "\n",
+ "__To be completely rigorous,__ we've comitted a small crime here: we used the whole data for tokenization and vocabulary building. A more strict way would be to do that part on training set only. You may want to do that and measure the magnitude of changes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "TngLcWA0ycO_",
+ "outputId": "6731b28c-07b1-41dc-9574-f76b01785bba"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "\n",
+ "data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)\n",
+ "data_train.index = range(len(data_train))\n",
+ "data_val.index = range(len(data_val))\n",
+ "\n",
+ "print(\"Train size = \", len(data_train))\n",
+ "print(\"Validation size = \", len(data_val))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "2PXuKgOSycPB"
+ },
+ "outputs": [],
+ "source": [
+ "def make_batch(data, max_len=None, word_dropout=0):\n",
+ " \"\"\"\n",
+ " Creates a neural-network-friendly dict from the batch data.\n",
+ " :param word_dropout: replaces token index with UNK_IX with this probability\n",
+ " :returns: a dict with {'title' : int64[batch, title_max_len]\n",
+ " \"\"\"\n",
+ " batch = {}\n",
+ " batch[\"Title\"] = as_matrix(data[\"Title\"].values, max_len)\n",
+ " batch[\"FullDescription\"] = as_matrix(data[\"FullDescription\"].values, max_len)\n",
+ " batch[\"Categorical\"] = categorical_vectorizer.transform(\n",
+ " data[categorical_columns].apply(dict, axis=1)\n",
+ " )\n",
+ "\n",
+ " if word_dropout != 0:\n",
+ " batch[\"FullDescription\"] = apply_word_dropout(batch[\"FullDescription\"], 1.0 - word_dropout)\n",
+ "\n",
+ " if target_column in data.columns:\n",
+ " batch[target_column] = data[target_column].values\n",
+ "\n",
+ " return batch\n",
+ "\n",
+ "\n",
+ "def apply_word_dropout(\n",
+ " matrix,\n",
+ " keep_prop,\n",
+ " replace_with=UNK_IX,\n",
+ " pad_ix=PAD_IX,\n",
+ "):\n",
+ " dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])\n",
+ " dropout_mask &= matrix != pad_ix\n",
+ " return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 251
+ },
+ "colab_type": "code",
+ "id": "I6LpEQf0ycPD",
+ "outputId": "e3520cae-fba1-46cc-a216-56287b6e4929"
+ },
+ "outputs": [],
+ "source": [
+ "batch = make_batch(data_train[:3], max_len=10)\n",
+ "batch"
+ ]
+ },
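 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "And a quick look at the word dropout in action (the `keep_prop=0.7` below is an arbitrary illustration value):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# Each non-PAD token is replaced with UNK_IX with probability 1 - keep_prop.\n",
 + "dropped = apply_word_dropout(batch[\"FullDescription\"], keep_prop=0.7)\n",
 + "print(batch[\"FullDescription\"][0])\n",
 + "print(dropped[0])"
 + ]
 + },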
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "0eI5h9UMycPF"
+ },
+ "source": [
+ "#### Architecture\n",
+ "\n",
+ "Our main model consists of three branches:\n",
+ "* Title encoder\n",
+ "* Description encoder\n",
+ "* Categorical features encoder\n",
+ "\n",
+ "We will then feed all 3 branches into one common network that predicts salary.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "This clearly doesn't fit into PyTorch __Sequential__ interface. To build such a network, one will have to use [__PyTorch nn.Module API__](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "But to start with let's build the simple model using only the part of the data. Let's create the baseline solution using only the description part (so it should definetely fit into the Sequential model)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "from torch import nn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# You will need these to make it simple\n",
+ "\n",
+ "\n",
+ "class Flatten(nn.Module):\n",
+ " def forward(self, input):\n",
+ " return input.view(input.size(0), -1)\n",
+ "\n",
+ "\n",
+ "class Reorder(nn.Module):\n",
+ " def forward(self, input):\n",
+ " return input.permute((0, 2, 1))"
+ ]
+ },
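 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "Sanity check: `Reorder` swaps the time and channel axes, which is the layout `nn.Conv1d` expects (the toy tensor sizes below are arbitrary):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# Embeddings come out as [batch, time, channels]; Conv1d wants [batch, channels, time].\n",
 + "dummy = torch.randn(2, 10, 64)\n",
 + "print(Reorder()(dummy).shape)  # expected: torch.Size([2, 64, 10])"
 + ]
 + },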
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To generate minibatches we will use simple pyton generator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):\n",
+ " \"\"\"iterates minibatches of data in random order\"\"\"\n",
+ " while True:\n",
+ " indices = np.arange(len(data))\n",
+ " if shuffle:\n",
+ " indices = np.random.permutation(indices)\n",
+ "\n",
+ " for start in range(0, len(indices), batch_size):\n",
+ " batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)\n",
+ " target = batch.pop(target_column)\n",
+ " yield batch, target\n",
+ "\n",
+ " if not cycle:\n",
+ " break"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "iterator = iterate_minibatches(data_train, 3)\n",
+ "batch, target = next(iterator)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Here is some startup code:\n",
+ "n_tokens = len(tokens)\n",
+ "n_cat_features = len(categorical_vectorizer.vocabulary_)\n",
+ "hid_size = 64"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "simple_model = nn.Sequential()\n",
+ "simple_model.add_module(\"emb\", nn.Embedding(num_embeddings=n_tokens, embedding_dim=hid_size))\n",
+ "simple_model.add_module(\"reorder\", Reorder())\n",
+ "# YOUR CODE HERE: add more layers!"
+ ]
+ },
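 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "If you are unsure where to go from here, below is a sketch of one possible continuation in a throwaway `hint_model` (layer sizes are illustrative, not tuned; the completed notebook uses a deeper conv stack of the same shape):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# A sketch: conv over time -> ReLU -> global max pool -> flatten -> linear head.\n",
 + "hint_model = nn.Sequential(\n",
 + "    nn.Embedding(num_embeddings=n_tokens, embedding_dim=hid_size),\n",
 + "    Reorder(),  # [batch, time, hid] -> [batch, hid, time]\n",
 + "    nn.Conv1d(hid_size, hid_size, kernel_size=3),\n",
 + "    nn.ReLU(),\n",
 + "    nn.AdaptiveMaxPool1d(1),  # collapse the time axis\n",
 + "    Flatten(),\n",
 + "    nn.Linear(hid_size, 1),\n",
 + ")\n",
 + "hint_model(torch.tensor(batch[\"FullDescription\"], dtype=torch.long)).shape"
 + ]
 + },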
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Remember!__ We are working with regression problem and predicting only one number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Try this to check your model. `torch.long` tensors are required for nn.Embedding layers.\n",
+ "simple_model(torch.tensor(batch[\"FullDescription\"], dtype=torch.long))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now simple training pipeline:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import clear_output\n",
+ "from random import sample\n",
+ "\n",
+ "epochs = 1\n",
+ "\n",
+ "model = simple_model\n",
+ "opt = torch.optim.Adam(model.parameters())\n",
+ "loss_func = None # YOUR CODE HERE\n",
+ "\n",
+ "history = []\n",
+ " for epoch_num in range(epochs):\n",
+ " for idx, (batch, target) in enumerate(iterate_minibatches(data_train)):\n",
+ " # Preprocessing the batch data and target\n",
+ " batch = torch.tensor(batch['FullDescription'], dtype=torch.long)\n",
+ " target = torch.tensor(target)\n",
+ "\n",
+ "\n",
+ " predictions = model(batch)\n",
+ " predictions = predictions.view(predictions.size(0))\n",
+ "\n",
+ " loss = None # YOUR CODE HERE\n",
+ "\n",
+ " # YOUR CODE HERE: train with backprop\n",
+ "\n",
+ " history.append(loss.data.numpy())\n",
+ " if (idx+1)%10==0:\n",
+ " clear_output(True)\n",
+ " plt.plot(history,label='loss')\n",
+ " plt.legend()\n",
+ " plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To evaluate the model it can be switched to `eval` state."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "simple_model.eval()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's check the model quality."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "batch_size = 256"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def print_metrics(model, data, batch_size=batch_size, name=\"\", **kw):\n",
+ " squared_error = abs_error = num_samples = 0.0\n",
+ " for batch_x, batch_y in tqdm(\n",
+ " iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw)\n",
+ " ):\n",
+ " batch = torch.tensor(batch_x[\"FullDescription\"], dtype=torch.long)\n",
+ " batch_pred = model(batch)[:, 0].detach().numpy()\n",
+ " squared_error += np.sum(np.square(batch_pred - batch_y))\n",
+ " abs_error += np.sum(np.abs(batch_pred - batch_y))\n",
+ " num_samples += len(batch_y)\n",
+ " print(\"%s results:\" % (name or \"\"))\n",
+ " print(\"Mean square error: %.5f\" % (squared_error / num_samples))\n",
+ " print(\"Mean absolute error: %.5f\" % (abs_error / num_samples))\n",
+ " return squared_error, abs_error\n",
+ "\n",
+ "\n",
+ "print_metrics(simple_model, data_train, name=\"Train\")\n",
+ "print_metrics(simple_model, data_val, name=\"Val\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area: three-headed network.\n",
+ "\n",
+ "Now you can try to implement the network we've discussed above. Use [__PyTorch nn.Module API__](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class ThreeInputsNet(nn.Module):\n",
+ " def __init__(\n",
+ " self,\n",
+ " n_tokens=len(tokens),\n",
+ " n_cat_features=len(categorical_vectorizer.vocabulary_),\n",
+ " hid_size=64,\n",
+ " ):\n",
+ " super(ThreeInputsNet, self).__init__()\n",
+ " self.title_emb = nn.Embedding(n_tokens, embedding_dim=hid_size)\n",
+ " self.full_emb = nn.Embedding(num_embeddings=n_tokens, embedding_dim=hid_size)\n",
+ " self.category_out = None # YOUR CODE HERE\n",
+ "\n",
+ " def forward(self, whole_input):\n",
+ " input1, input2, input3 = whole_input\n",
+ " title_beg = self.title_emb(input1).permute((0, 2, 1)) # noqa: F841\n",
+ " title = None # YOUR CODE HERE\n",
+ "\n",
+ " full_beg = self.full_emb(input2).permute((0, 2, 1)) # noqa: F841\n",
+ " full = None # YOUR CODE HERE\n",
+ "\n",
+ " category = None # YOUR CODE HERE\n",
+ "\n",
+ " concatenated = torch.cat( # noqa: F841\n",
+ " [\n",
+ " title.view(title.size(0), -1),\n",
+ " full.view(full.size(0), -1),\n",
+ " category.view(category.size(0), -1),\n",
+ " ],\n",
+ " dim=1,\n",
+ " )\n",
+ "\n",
+ " out = None # YOUR CODE HERE\n",
+ "\n",
+ " return out"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area 2: comparing RNN to CNN\n",
+ "Try implementing simple RNN (or LSTM) and applying it to this task. Compare the quality/performance of these networks. \n",
+ "*Hint: try to build networks with ~same number of paremeters.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area 3: fixing the data leaks\n",
+ "Fix the data leak we ignored in the beginning of the __Deep Learning part__. Compare results with and without data leaks using same architectures and training time.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Terrible start-up idea #1962:__ make a tool that automaticaly rephrases your job description (or CV) to meet salary expectations :)"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "name": "week04_practice_CNN_for_texts.ipynb",
+ "provenance": [],
+ "version": "0.3.2"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts__completed.ipynb b/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts__completed.ipynb
new file mode 100644
index 000000000..931ed7249
--- /dev/null
+++ b/week1_02_cnn_for_texts_and_more_embeddings/week1_02_cnn_for_texts__completed.ipynb
@@ -0,0 +1,1198 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "13pL--6rycN3"
+ },
+ "source": [
+ "## Practice 02: Dealing with texts using CNN\n",
+ "\n",
+ "Today we're gonna apply the newly learned tools for the task of predicting job salary.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Based on YSDA [materials](https://github.com/yandexdataschool/nlp_course/blob/master/week02_classification/seminar.ipynb). _Special thanks to [Oleg Vasilev](https://github.com/Omrigan/) for the core assignment idea._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "P8zS7m-gycN5"
+ },
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "34x92vWQycN_"
+ },
+ "source": [
+ "### About the challenge\n",
+ "For starters, let's download and unpack the data from [here](https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=0). \n",
+ "\n",
+ "You can also get it from [yadisk url](https://yadi.sk/d/vVEOWPFY3NruT7) the competition [page](https://www.kaggle.com/c/job-salary-prediction/data) (pick `Train_rev1.*`)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "colab_type": "code",
+ "id": "vwN72gd4ycOA",
+ "outputId": "7b9e8549-3128-4041-c4be-33fb6f326c78"
+ },
+ "outputs": [],
+ "source": [
+ "# Do this only once\n",
+ "!curl -L \"https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=1\" -o Train_rev1.csv.tar.gz\n",
+ "!tar -xvzf ./Train_rev1.csv.tar.gz"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 143
+ },
+ "colab_type": "code",
+ "id": "vwN72gd4ycOA",
+ "outputId": "7b9e8549-3128-4041-c4be-33fb6f326c78"
+ },
+ "outputs": [],
+ "source": [
+ "data = pd.read_csv(\"./Train_rev1.csv\", index_col=None)\n",
+ "data.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "z7kznuJfycOH"
+ },
+ "source": [
+ "One problem with salary prediction is that it's oddly distributed: there are many people who are paid standard salaries and a few that get tons o money. The distribution is fat-tailed on the right side, which is inconvenient for MSE minimization.\n",
+ "\n",
+ "There are several techniques to combat this: using a different loss function, predicting log-target instead of raw target or even replacing targets with their percentiles among all salaries in the training set. We gonna use logarithm for now.\n",
+ "\n",
+ "_You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description)._"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 265
+ },
+ "colab_type": "code",
+ "id": "UuuKIKfrycOH",
+ "outputId": "e5de0f94-a4f6-4b51-db80-9d11ddc1db31"
+ },
+ "outputs": [],
+ "source": [
+ "data[\"Log1pSalary\"] = np.log1p(data[\"SalaryNormalized\"]).astype(\"float32\")\n",
+ "\n",
+ "plt.figure(figsize=[8, 4])\n",
+ "plt.subplot(1, 2, 1)\n",
+ "plt.hist(data[\"SalaryNormalized\"], bins=20)\n",
+ "\n",
+ "plt.subplot(1, 2, 2)\n",
+ "plt.hist(data[\"Log1pSalary\"], bins=20);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "Fcu-qmHRycOK"
+ },
+ "source": [
+ "Our task is to predict one number, __Log1pSalary__.\n",
+ "\n",
+ "To do so, our model can access a number of features:\n",
+ "* Free text: __`Title`__ and __`FullDescription`__\n",
+ "* Categorical: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 332
+ },
+ "colab_type": "code",
+ "id": "p9vyA_erycOK",
+ "outputId": "af9a21f3-10b7-4fde-d4cd-1f66939566b8"
+ },
+ "outputs": [],
+ "source": [
+ "text_columns = [\"Title\", \"FullDescription\"]\n",
+ "categorical_columns = [\"Category\", \"Company\", \"LocationNormalized\", \"ContractType\", \"ContractTime\"]\n",
+ "target_column = \"Log1pSalary\"\n",
+ "\n",
+ "# cast missing values to string \"NaN\"\n",
+ "data[categorical_columns] = data[categorical_columns].fillna(\"NaN\")\n",
+ "data.sample(3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "IUdclucmycON"
+ },
+ "source": [
+ "### Preprocessing text data\n",
+ "\n",
+ "Just like last week, applying NLP to a problem begins from tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).\n",
+ "\n",
+ "__Your task__ is to lowercase and tokenize all texts under `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for performance reasons.\n",
+ "\n",
+ "It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 107
+ },
+ "colab_type": "code",
+ "id": "YzeOxD_aycOO",
+ "outputId": "b4826117-1196-4a0e-92fa-6fd3ca609202"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Raw text:\")\n",
+ "print(data[\"FullDescription\"][2::100000])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "RUWkpd7PycOQ"
+ },
+ "outputs": [],
+ "source": [
+ "import nltk\n",
+ "\n",
+ "\n",
+ "tokenizer = nltk.tokenize.WordPunctTokenizer()\n",
+ "\n",
+ "\n",
+ "def normalize(text):\n",
+ " text = str(text).lower()\n",
+ " tokens = tokenizer.tokenize(text)\n",
+ " return \" \".join(tokens)\n",
+ "\n",
+ "\n",
+ "data[text_columns] = data[text_columns].applymap(normalize)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "o3pQdHihycOT"
+ },
+ "source": [
+ "Now we can assume that our text is a space-separated list of tokens:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 107
+ },
+ "colab_type": "code",
+ "id": "Gs-6lnS_ycOU",
+ "outputId": "8948250d-7117-4e4f-a38d-00405f9b2cec"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Tokenized:\")\n",
+ "print(data[\"FullDescription\"][2::100000])\n",
+ "assert data[\"FullDescription\"][2][:50] == \"mathematical modeller / simulation analyst / opera\"\n",
+ "assert data[\"Title\"][54321] == \"international digital account manager ( german )\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "ouE3L2hyycOX"
+ },
+ "source": [
+ "Not all words are equally useful. Some of them are typos or rare words that are only present a few times. \n",
+ "\n",
+ "Let's count how many times is each word present in the data so that we can build a \"white list\" of known words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "colab_type": "code",
+ "id": "iC7hBwwjycOX",
+ "outputId": "70eb75fc-535f-45a3-ad97-95a98e1d020f"
+ },
+ "outputs": [],
+ "source": [
+ "# Count how many times does each token occur in both \"Title\" and \"FullDescription\" in total\n",
+ "# build a dictionary { token -> it's count }\n",
+ "from collections import Counter\n",
+ "\n",
+ "from tqdm import tqdm as tqdm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "token_counts = Counter()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "\n",
+ "for _, row in data[text_columns].iterrows():\n",
+ " for text in row:\n",
+ " token_counts.update(text.split(\" \"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "token_counts_second = Counter()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%time\n",
+ "for row in data[text_columns].values.flatten():\n",
+ " token_counts_second.update(row.split(\" \"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "token_counts == token_counts_second"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 215
+ },
+ "colab_type": "code",
+ "id": "GiOWbc15ycOb",
+ "outputId": "1e807140-5513-4af0-d9a9-9f029059a553"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Total unique tokens :\", len(token_counts))\n",
+ "print(\"\\n\".join(map(str, token_counts.most_common(n=5))))\n",
+ "print(\"...\")\n",
+ "print(\"\\n\".join(map(str, token_counts.most_common()[-3:])))\n",
+ "\n",
+ "assert token_counts.most_common(1)[0][1] in range(2600000, 2700000)\n",
+ "assert len(token_counts) in range(200000, 210000)\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 279
+ },
+ "colab_type": "code",
+ "id": "nd5v3BNfycOf",
+ "outputId": "1c59b386-f052-4340-bf5d-09ae8d15983c"
+ },
+ "outputs": [],
+ "source": [
+ "# Let's see how many words are there for each count\n",
+ "plt.hist(list(token_counts.values()), range=[0, 10 ** 4], bins=50, log=True)\n",
+ "plt.xlabel(\"Word counts\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "znuXxeghycOh"
+ },
+ "source": [
+ "Now filter tokens a list of all tokens that occur at least 10 times."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "SeNFBWx5ycOh"
+ },
+ "outputs": [],
+ "source": [
+ "min_count = 10\n",
+ "\n",
+ "# tokens from token_counts keys that had at least min_count occurrences throughout the dataset\n",
+ "tokens = [token for token, count in token_counts.items() if count >= min_count]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "RATIRyPKycOk",
+ "outputId": "6bb7482c-7c46-4f7e-81f2-6b70e04abc64"
+ },
+ "outputs": [],
+ "source": [
+ "# Add a special tokens for unknown and empty words\n",
+ "UNK, PAD = \"UNK\", \"PAD\"\n",
+ "tokens = [UNK, PAD] + sorted(tokens)\n",
+ "print(\"Vocabulary size:\", len(tokens))\n",
+ "\n",
+ "assert type(tokens) == list\n",
+ "assert len(tokens) in range(32000, 35000)\n",
+ "assert \"me\" in tokens\n",
+ "assert UNK in tokens\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "cqEsgbjZycOo"
+ },
+ "source": [
+ "Build an inverse token index: a dictionary from token(string) to it's index in `tokens` (int)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "L60lo1l_ycOq"
+ },
+ "outputs": [],
+ "source": [
+ "# You have already done that ;)\n",
+ "token_to_id = {token: idx for idx, token in enumerate(tokens)}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 35
+ },
+ "colab_type": "code",
+ "id": "DeAoVo4mycOr",
+ "outputId": "8f29ef68-f9bd-4628-8222-1dc17f8f2590"
+ },
+ "outputs": [],
+ "source": [
+ "assert isinstance(token_to_id, dict)\n",
+ "assert len(token_to_id) == len(tokens)\n",
+ "for tok in tokens:\n",
+ " assert tokens[token_to_id[tok]] == tok\n",
+ "\n",
+ "print(\"Correct!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "cmJAkq3gycOv"
+ },
+ "source": [
+ "And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "JEsLeBjVycOw"
+ },
+ "outputs": [],
+ "source": [
+ "UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])\n",
+ "\n",
+ "\n",
+ "def as_matrix(sequences, max_len=None):\n",
+ " \"\"\"Convert a list of tokens into a matrix with padding\"\"\"\n",
+ " if isinstance(sequences[0], str):\n",
+ " sequences = list(map(str.split, sequences))\n",
+ "\n",
+ " max_len = min(max(map(len, sequences)), max_len or float(\"inf\"))\n",
+ "\n",
+ " matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))\n",
+ " for i, seq in enumerate(sequences):\n",
+ " row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]\n",
+ " matrix[i, : len(row_ix)] = row_ix\n",
+ "\n",
+ " return matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 179
+ },
+ "colab_type": "code",
+ "id": "JiBlPkdKycOy",
+ "outputId": "3866b444-1e2d-4d79-d429-fecc6d8e02a8"
+ },
+ "outputs": [],
+ "source": [
+ "print(\"Lines:\")\n",
+ "print(\"\\n\".join(data[\"Title\"][::100000].values), end=\"\\n\\n\")\n",
+ "print(\"Matrix:\")\n",
+ "print(as_matrix(data[\"Title\"][::100000]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "nGOdZ3-dycO4"
+ },
+ "source": [
+ "Now let's encode the categirical data we have.\n",
+ "\n",
+ "As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "DpOlBp7ZycO6",
+ "outputId": "30a911f2-7d35-4cb5-8991-60457b1e8bac"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_extraction import DictVectorizer\n",
+ "\n",
+ "\n",
+ "# we only consider top-1k most frequent companies to minimize memory usage\n",
+ "top_companies, top_counts = zip(*Counter(data[\"Company\"]).most_common(1000))\n",
+ "recognized_companies = set(top_companies)\n",
+ "data.loc[~data.Company.isin(recognized_companies), \"Company\"] = \"Other\"\n",
+ "\n",
+ "categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)\n",
+ "categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "yk4jmtAYycO8"
+ },
+ "source": [
+ "### The deep learning part\n",
+ "\n",
+ "Once we've learned to tokenize the data, let's design a machine learning experiment.\n",
+ "\n",
+ "As before, we won't focus too much on validation, opting for a simple train-test split.\n",
+ "\n",
+ "__To be completely rigorous,__ we've comitted a small crime here: we used the whole data for tokenization and vocabulary building. A more strict way would be to do that part on training set only. You may want to do that and measure the magnitude of changes."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Once again about embeddings"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "from torch import nn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "example_matrix = torch.LongTensor(as_matrix(data[\"Title\"][::100000]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "example_matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "embedding_layer = nn.Embedding(len(tokens), 32)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "embedded_example = embedding_layer(example_matrix)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "embedded_example.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "name, weight_matrix = list(embedding_layer.named_parameters())[0]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weight_matrix.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weight_matrix[10807]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "torch.allclose(embedded_example[0, 0], weight_matrix[10807])"
+ ]
+ },
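 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "In other words, an embedding lookup is just a row selection from the weight matrix, i.e. a one-hot matrix product done cheaply. A small check sketch (using `torch.nn.functional.one_hot`):"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "import torch.nn.functional as F\n",
 + "\n",
 + "\n",
 + "# one_hot -> [batch, time, vocab]; multiplying by the [vocab, dim] weight\n",
 + "# matrix reproduces the embedding lookup exactly.\n",
 + "one_hot = F.one_hot(example_matrix, num_classes=len(tokens)).float()\n",
 + "torch.allclose(one_hot @ weight_matrix, embedded_example)"
 + ]
 + },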
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 53
+ },
+ "colab_type": "code",
+ "id": "TngLcWA0ycO_",
+ "outputId": "6731b28c-07b1-41dc-9574-f76b01785bba"
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "\n",
+ "data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)\n",
+ "data_train.index = range(len(data_train))\n",
+ "data_val.index = range(len(data_val))\n",
+ "\n",
+ "print(\"Train size = \", len(data_train))\n",
+ "print(\"Validation size = \", len(data_val))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "2PXuKgOSycPB"
+ },
+ "outputs": [],
+ "source": [
+ "def make_batch(data, max_len=None, word_dropout=0):\n",
+ " \"\"\"\n",
+ " Creates a neural-network-friendly dict from the batch data.\n",
+ " :param word_dropout: replaces token index with UNK_IX with this probability\n",
+ " :returns: a dict with {'title' : int64[batch, title_max_len]\n",
+ " \"\"\"\n",
+ " batch = {}\n",
+ " batch[\"Title\"] = as_matrix(data[\"Title\"].values, max_len)\n",
+ " batch[\"FullDescription\"] = as_matrix(data[\"FullDescription\"].values, max_len)\n",
+ " batch[\"Categorical\"] = categorical_vectorizer.transform(\n",
+ " data[categorical_columns].apply(dict, axis=1)\n",
+ " )\n",
+ "\n",
+ " if word_dropout != 0:\n",
+ " batch[\"FullDescription\"] = apply_word_dropout(batch[\"FullDescription\"], 1.0 - word_dropout)\n",
+ "\n",
+ " if target_column in data.columns:\n",
+ " batch[target_column] = data[target_column].values\n",
+ "\n",
+ " return batch\n",
+ "\n",
+ "\n",
+ "def apply_word_dropout(\n",
+ " matrix,\n",
+ " keep_prop,\n",
+ " replace_with=UNK_IX,\n",
+ " pad_ix=PAD_IX,\n",
+ "):\n",
+ " dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])\n",
+ " dropout_mask &= matrix != pad_ix\n",
+ " return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "batch = make_batch(data_train[:3], max_len=10)\n",
+ "batch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "0eI5h9UMycPF"
+ },
+ "source": [
+ "#### Architecture\n",
+ "\n",
+ "Our main model consists of three branches:\n",
+ "* Title encoder\n",
+ "* Description encoder\n",
+ "* Categorical features encoder\n",
+ "\n",
+ "We will then feed all 3 branches into one common network that predicts salary.\n",
+ "\n",
+ "
\n",
+ "\n",
+ "This clearly doesn't fit into PyTorch __Sequential__ interface. To build such a network, one will have to use [__PyTorch nn.Module API__](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "But to start with let's build the simple model using only the part of the data. Let's create the baseline solution using only the description part (so it should definetely fit into the Sequential model)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from collections import OrderedDict\n",
+ "\n",
+ "import torch\n",
+ "from torch import nn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "nn.Flatten()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# You will need these to make it simple\n",
+ "\n",
+ "\n",
+ "class Flatten(nn.Module):\n",
+ " def forward(self, input):\n",
+ " return input.view(input.size(0), -1)\n",
+ "\n",
+ "\n",
+ "class Reorder(nn.Module):\n",
+ " def forward(self, input):\n",
+ " return input.permute((0, 2, 1))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "reorder = Reorder()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "embedded_example.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "reorder(embedded_example).shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To generate minibatches we will use simple pyton generator."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):\n",
+ " \"\"\"iterates minibatches of data in random order\"\"\"\n",
+ " while True:\n",
+ " indices = np.arange(len(data))\n",
+ " if shuffle:\n",
+ " indices = np.random.permutation(indices)\n",
+ "\n",
+ " for start in range(0, len(indices), batch_size):\n",
+ " batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)\n",
+ " target = batch.pop(target_column)\n",
+ " yield batch, target\n",
+ "\n",
+ " if not cycle:\n",
+ " break"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "iterator = iterate_minibatches(data_train, 3)\n",
+ "batch, target = next(iterator)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Here is some startup code:\n",
+ "n_tokens = len(tokens)\n",
+ "n_cat_features = len(categorical_vectorizer.vocabulary_)\n",
+ "hid_size = 64"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "layers = OrderedDict(\n",
+ " [\n",
+ " (\"emb\", nn.Embedding(num_embeddings=n_tokens, embedding_dim=hid_size)),\n",
+ " (\"reorder\", Reorder()),\n",
+ " (\"conv1\", nn.Conv1d(in_channels=hid_size, out_channels=hid_size * 2, kernel_size=3)),\n",
+ " (\"relu1\", nn.ReLU()),\n",
+ " (\"conv2\", nn.Conv1d(in_channels=hid_size * 2, out_channels=hid_size * 2, kernel_size=3)),\n",
+ " (\"relu2\", nn.ReLU()),\n",
+ " (\"bn1\", nn.BatchNorm1d(num_features=hid_size * 2)),\n",
+ " (\"conv3\", nn.Conv1d(in_channels=hid_size * 2, out_channels=hid_size * 2, kernel_size=2)),\n",
+ " (\"relu3\", nn.ReLU()),\n",
+ " (\"adaptive_pool\", nn.AdaptiveMaxPool1d(1)),\n",
+ " (\"flatten\", nn.Flatten()),\n",
+ " (\"out\", nn.Linear(2 * hid_size, 1)),\n",
+ " ]\n",
+ ")\n",
+ "\n",
+ "simple_model = nn.Sequential(layers)"
+ ]
+ },
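 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "A quick parameter count; this comes in handy in bonus area 2 below, where you will want an RNN of comparable size:"
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "# Total number of trainable parameters (dominated by the embedding matrix).\n",
 + "sum(p.numel() for p in simple_model.parameters() if p.requires_grad)"
 + ]
 + },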
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Remember!__ We are working with regression problem and predicting only one number."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Try this to check your model. `torch.long` tensors are required for nn.Embedding layers.\n",
+ "simple_model(torch.tensor(batch[\"FullDescription\"], dtype=torch.long))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "device"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "simple_model.to(device)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And now simple training pipeline:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import clear_output\n",
+ "\n",
+ "\n",
+ "epochs = 1\n",
+ "\n",
+ "model = simple_model\n",
+ "opt = torch.optim.Adam(model.parameters())\n",
+ "loss_func = nn.MSELoss()\n",
+ "\n",
+ "history = []\n",
+ "for epoch_num in range(epochs):\n",
+ " for idx, (batch, target) in enumerate(iterate_minibatches(data_train)):\n",
+ " opt.zero_grad()\n",
+ "\n",
+ " # Preprocessing the batch data and target\n",
+ " batch = torch.tensor(batch[\"FullDescription\"], dtype=torch.long).to(device)\n",
+ " target = torch.tensor(target).to(device)\n",
+ "\n",
+ " predictions = model(batch)\n",
+ " predictions = predictions.reshape(predictions.size(0))\n",
+ "\n",
+ " loss = loss_func(predictions, target)\n",
+ "\n",
+ " # train with backprop\n",
+ " loss.backward()\n",
+ " opt.step()\n",
+ "\n",
+ " history.append(loss.item())\n",
+ " if (idx + 1) % 10 == 0:\n",
+ " clear_output(True)\n",
+ " plt.plot(history, label=\"loss\")\n",
+ " plt.legend()\n",
+ " plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To evaluate the model it can be switched to `eval` state."
+ ]
+ },
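 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "simple_model.eval()"
 + ]
 + },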
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's check the model quality."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "batch_size = 256"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def print_metrics(model, data, batch_size=batch_size, name=\"\", **kw):\n",
+ " squared_error = abs_error = num_samples = 0.0\n",
+ " for batch_x, batch_y in tqdm(\n",
+ " iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw)\n",
+ " ):\n",
+ " batch = torch.tensor(batch_x[\"FullDescription\"], dtype=torch.long).to(device)\n",
+ " batch_pred = model(batch)[:, 0].detach().cpu().numpy()\n",
+ " squared_error += np.sum(np.square(batch_pred - batch_y))\n",
+ " abs_error += np.sum(np.abs(batch_pred - batch_y))\n",
+ " num_samples += len(batch_y)\n",
+ " print(\"%s results:\" % (name or \"\"))\n",
+ " print(\"Mean square error: %.5f\" % (squared_error / num_samples))\n",
+ " print(\"Mean absolute error: %.5f\" % (abs_error / num_samples))\n",
+ " return squared_error, abs_error\n",
+ "\n",
+ "\n",
+ "print_metrics(simple_model, data_train, name=\"Train\")\n",
+ "print_metrics(simple_model, data_val, name=\"Val\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area: three-headed network.\n",
+ "\n",
+ "Now you can try to implement the network we've discussed above. Use [__PyTorch nn.Module API__](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class ThreeInputsNet(nn.Module):\n",
+ " def __init__(\n",
+ " self,\n",
+ " n_tokens=len(tokens),\n",
+ " n_cat_features=len(categorical_vectorizer.vocabulary_),\n",
+ " hid_size=64,\n",
+ " ):\n",
+ " super(ThreeInputsNet, self).__init__()\n",
+ " self.title_emb = nn.Embedding(n_tokens, embedding_dim=hid_size)\n",
+ " self.full_emb = nn.Embedding(num_embeddings=n_tokens, embedding_dim=hid_size)\n",
+ " self.category_out = None # YOUR CODE HERE\n",
+ "\n",
+ " def forward(self, whole_input):\n",
+ " input1, input2, input3 = whole_input\n",
+ " title_beg = self.title_emb(input1).permute((0, 2, 1)) # noqa: F841\n",
+ " title = None # YOUR CODE HERE\n",
+ "\n",
+ " full_beg = self.full_emb(input2).permute((0, 2, 1)) # noqa: F841\n",
+ " full = None # YOUR CODE HERE\n",
+ "\n",
+ " category = None # YOUR CODE HERE\n",
+ "\n",
+ " concatenated = torch.cat( # noqa: F841\n",
+ " [\n",
+ " title.view(title.size(0), -1),\n",
+ " full.view(full.size(0), -1),\n",
+ " category.view(category.size(0), -1),\n",
+ " ],\n",
+ " dim=1,\n",
+ " )\n",
+ "\n",
+ " out = None # YOUR CODE HERE\n",
+ "\n",
+ " return out"
+ ]
+ },
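 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "Below is a sketch of one possible way to fill in the blanks: conv encoders for both text heads and a small MLP for the categorical head (the class name `ThreeInputsNetSketch` and all layer sizes are illustrative, not tuned)."
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "class ThreeInputsNetSketch(nn.Module):\n",
 + "    def __init__(\n",
 + "        self,\n",
 + "        n_tokens=len(tokens),\n",
 + "        n_cat_features=len(categorical_vectorizer.vocabulary_),\n",
 + "        hid_size=64,\n",
 + "    ):\n",
 + "        super().__init__()\n",
 + "        self.title_emb = nn.Embedding(n_tokens, embedding_dim=hid_size)\n",
 + "        self.full_emb = nn.Embedding(n_tokens, embedding_dim=hid_size)\n",
 + "        # one conv encoder per text head: conv -> ReLU -> global max pool\n",
 + "        self.title_enc = nn.Sequential(\n",
 + "            nn.Conv1d(hid_size, hid_size, kernel_size=3), nn.ReLU(), nn.AdaptiveMaxPool1d(1)\n",
 + "        )\n",
 + "        self.full_enc = nn.Sequential(\n",
 + "            nn.Conv1d(hid_size, hid_size, kernel_size=3), nn.ReLU(), nn.AdaptiveMaxPool1d(1)\n",
 + "        )\n",
 + "        self.category_out = nn.Sequential(nn.Linear(n_cat_features, hid_size), nn.ReLU())\n",
 + "        self.final = nn.Linear(3 * hid_size, 1)\n",
 + "\n",
 + "    def forward(self, whole_input):\n",
 + "        input1, input2, input3 = whole_input\n",
 + "        title = self.title_enc(self.title_emb(input1).permute((0, 2, 1)))\n",
 + "        full = self.full_enc(self.full_emb(input2).permute((0, 2, 1)))\n",
 + "        category = self.category_out(input3)\n",
 + "\n",
 + "        concatenated = torch.cat(\n",
 + "            [\n",
 + "                title.view(title.size(0), -1),\n",
 + "                full.view(full.size(0), -1),\n",
 + "                category.view(category.size(0), -1),\n",
 + "            ],\n",
 + "            dim=1,\n",
 + "        )\n",
 + "        return self.final(concatenated)"
 + ]
 + },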
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area 2: comparing RNN to CNN\n",
+ "Try implementing simple RNN (or LSTM) and applying it to this task. Compare the quality/performance of these networks. \n",
+ "*Hint: try to build networks with ~same number of paremeters.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE"
+ ]
+ },
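 + {
 + "cell_type": "markdown",
 + "metadata": {},
 + "source": [
 + "A minimal starting point (a sketch: an LSTM over the description embeddings, last hidden state into a linear head; the class name `LSTMRegressor` and the sizes are illustrative, chosen only as a rough attempt at matching the CNN's parameter count)."
 + ]
 + },
 + {
 + "cell_type": "code",
 + "execution_count": null,
 + "metadata": {},
 + "outputs": [],
 + "source": [
 + "class LSTMRegressor(nn.Module):\n",
 + "    def __init__(self, n_tokens=len(tokens), emb_size=64, hid_size=128):\n",
 + "        super().__init__()\n",
 + "        self.emb = nn.Embedding(n_tokens, emb_size)\n",
 + "        self.rnn = nn.LSTM(emb_size, hid_size, batch_first=True)\n",
 + "        self.out = nn.Linear(hid_size, 1)\n",
 + "\n",
 + "    def forward(self, text_ix):\n",
 + "        emb = self.emb(text_ix)  # [batch, time, emb_size]\n",
 + "        _, (h_n, _) = self.rnn(emb)  # h_n: [num_layers, batch, hid_size]\n",
 + "        return self.out(h_n[-1])  # [batch, 1]\n",
 + "\n",
 + "\n",
 + "rnn_model = LSTMRegressor().to(device)\n",
 + "sum(p.numel() for p in rnn_model.parameters())  # compare with the CNN above"
 + ]
 + },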
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Bonus area 3: fixing the data leaks\n",
+ "Fix the data leak we ignored in the beginning of the __Deep Learning part__. Compare results with and without data leaks using same architectures and training time.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# YOUR CODE HERE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "__Terrible start-up idea #1962:__ make a tool that automaticaly rephrases your job description (or CV) to meet salary expectations :)"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "name": "week04_practice_CNN_for_texts.ipynb",
+ "provenance": [],
+ "version": "0.3.2"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}