Merge pull request #19 from BU-Spark/labeling-data-cleaning
[UPDATED] Changing Data Cleaning & Labeling Dropped Observations
zacharymeurer authored Jun 26, 2024
2 parents 2961cc4 + 0986075 commit e5ae59d
Showing 141 changed files with 1,873,924 additions and 1,260,682 deletions.
157 changes: 30 additions & 127 deletions -2_values.ipynb
@@ -40,18 +40,27 @@
"source": [
"df = pd.read_csv(\"alldata.csv\")\n",
"\n",
"all_species = list(df['Species_name'].value_counts().index) # Top 50 most prevalent species\n",
"phenophases = list(df.columns[9:])\n",
"# Replacing incorrect -2 values with either NA or -2\n",
"all_species = list(df['Species_name'].value_counts().index) # All species names in order of frequency\n",
"phenophases = list(df.columns[9:]) # Phenophase column names\n",
"\n",
"def create_species_dict(*absent_phenophases):\n",
" \"\"\"\n",
" Creates a dictionary indicating whether each phenophase is seen in a species (0 for seen, 1 for not seen)\n",
"\n",
" Args:\n",
" *absent_phenophases (str): Names of phenophases not seen in the species\n",
" Returns:\n",
" species_dict (Dict(string,int)): A dictionary mapping each phenophase to 0 (seen) or 1 (not seen) for the species\n",
" \"\"\"\n",
" species_dict = dict(zip(phenophases, np.zeros(len(phenophases), int)))\n",
" for phenophase in absent_phenophases:\n",
" species_dict[phenophase] = 1\n",
" return species_dict\n",
"\n",
"handbook_dicts = {} # Dict mapping species to phenophase dicts. \n",
"# phenophase dicts give absent phenophases in the associated species.\n",
"# Manually input absent phenophases from SeasonWatch handbook\n",
"handbook_dicts = {} # Dict mapping species to species dicts. \n",
"# Below are manually input absent phenophases for each species in the citizen dataset. Labels derived from SeasonWatch Tree Phenology Guide\n",
"handbook_dicts[all_species[0]] = create_species_dict('Flowers_open','Fruits_open')\n",
"handbook_dicts[all_species[1]] = create_species_dict('Flowers_male', 'Flowers_Female', 'Fruits_open')\n",
"handbook_dicts[all_species[2]] = create_species_dict('Flowers_open', 'Fruits_open')\n",
@@ -436,7 +445,7 @@
"id": "38b6ce61",
"metadata": {},
"source": [
"# Replace False Positives With NaN Values & False Negative With -2 Values"
"# [Option 1] Replace False Positives With NaN & False Negatives With -2 Values"
]
},
{
@@ -448,26 +457,25 @@
},
"outputs": [],
"source": [
"# Replace Incorrect -2 Values\n",
"for species in all_species:\n",
" species_df = df[df['Species_name'] == species]\n",
" species_dict = handbook_dicts[species]\n",
" for phenophase in phenophases:\n",
" if species_dict[phenophase] == 0:\n",
" false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases that DO appear in the species\n",
" df.loc[false_positive_idx, phenophase] = np.full(len(false_positive_idx),np.nan) # turn into NaN so they will be dropped\n",
" #incorrect_negative_2_df.loc[false_positive_idx, phenophase] = np.ones(len(false_positive_idx)) # Label false positive reports as 1 for the phenophase they incorrectly report.\n",
" if species_dict[phenophase] == 1:\n",
" false_negative_idx = species_df.index[species_df[phenophase] != -2] # Indices of reports that incorrectly assign values other than -2 (false positive) to phenophases that DO NOT appear in the species\n",
" df.loc[false_negative_idx, phenophase] = np.full(len(false_negative_idx),-2.0)\n",
" #incorrect_negative_2_df.loc[false_negative_idx, f\"{phenophase}_incorrect_-2\"] = 2*np.ones(len(false_negative_idx)) # Label false negative reports as 2 for the phenophase they incorrectly report."
" if species_dict[phenophase] == 0: # Phenophase seen in species\n",
" false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases SEEN in the species\n",
" df.loc[false_positive_idx, phenophase] = np.full(len(false_positive_idx),np.nan) # Turn all false positives into NaN values (these observations will later be dropped)\n",
" if species_dict[phenophase] == 1: # Phenophase NOT seen in species\n",
" false_negative_idx = species_df.index[species_df[phenophase] != -2] # Indices of reports that incorrectly assign values other than -2 (false negative) to phenophases NOT SEEN in the species\n",
" df.loc[false_negative_idx, phenophase] = np.full(len(false_negative_idx),-2.0) # Turn all false negatives into -2 values"
]
},
{
"cell_type": "markdown",
"id": "0e19e5f9",
"metadata": {},
"source": [
"# Reconstruct alldata.csv With New Columns Indicating Incorrect -2 Values"
"# [Option 2] Reconstruct alldata.csv With New Column Labeling Incorrect -2 Values"
]
},
{
@@ -487,11 +495,11 @@
" species_df = df[df['Species_name'] == species]\n",
" species_dict = handbook_dicts[species]\n",
" for phenophase in phenophases:\n",
" if species_dict[phenophase] == 0:\n",
" false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases that DO appear in the species\n",
" if species_dict[phenophase] == 0: # Phenophase seen in species\n",
" false_positive_idx = species_df.index[species_df[phenophase] == -2] # Indices of reports that incorrectly assign -2 values (false positive) to phenophases SEEN in the species\n",
" incorrect_negative_2_df.loc[false_positive_idx, f\"{phenophase}_incorrect_-2\"] = np.ones(len(false_positive_idx)) # Label false positive reports as 1 for the phenophase they incorrectly report.\n",
" if species_dict[phenophase] == 1:\n",
" false_negative_idx = species_df.index[species_df[phenophase] != -2] # Indices of reports that incorrectly assign values other than -2 (false positive) to phenophases that DO NOT appear in the species\n",
" if species_dict[phenophase] == 1: # Phenophase NOT seen in species\n",
" false_negative_idx = species_df.index[species_df[phenophase] != -2] # Indices of reports that incorrectly assign values other than -2 (false negative) to phenophases NOT SEEN in the species\n",
" incorrect_negative_2_df.loc[false_negative_idx, f\"{phenophase}_incorrect_-2\"] = 2*np.ones(len(false_negative_idx)) # Label false negative reports as 2 for the phenophase they incorrectly report."
]
},
@@ -109439,9 +109447,7 @@
"cell_type": "code",
"execution_count": 8,
"id": "bb80a11a",
"metadata": {
"scrolled": false
},
"metadata": {},
"outputs": [
{
"name": "stdout",
@@ -109506,7 +109512,7 @@
"id": "9c6931f2",
"metadata": {},
"source": [
"# Download CSV File"
"# Download Labeled Incorrect -2 Values CSV File"
]
},
{
@@ -109518,109 +109524,6 @@
"source": [
"incorrect_negative_2_df.to_csv('alldata_labeling_-2_all_species.csv', index=False)"
]
},
{
"cell_type": "markdown",
"id": "2c599084",
"metadata": {},
"source": [
"# Mango Variety Inspection"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "7370f32e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mango (all varieties)- Mangifera indica 81855\n",
"Airi Mango- Mangifera indica 459\n",
"Alphonso Mango- Mangifera indica 181\n",
"Aabehayat Mango- Mangifera indica 127\n",
"Manjeera Mango- Mangifera indica 82\n",
"Mallika Mango- Mangifera indica 9\n",
"Chosa Mango- Mangifera indica 1\n",
"Olour Mango- Mangifera indica 1\n"
]
}
],
"source": [
"import regex as re\n",
"mango_df = pd.DataFrame()\n",
"for species in all_species:\n",
" if re.search('mangifera indica', species.lower()):\n",
" if species != \"Mango (all varieties)- Mangifera indica\":\n",
" mango_df = pd.concat([mango_df,df[df['Species_name'] == species]])\n",
" print(species, sum(df['Species_name'] == species))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "cf95e4c3",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(array([2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]),\n",
" array([ 15, 60, 67, 118, 124, 154, 93, 112, 96, 21]))"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of observations of 7 mango varieties (excluding \"Mango (all varieties)- Mangifera indica\") each year\n",
"np.unique(np.array([int(re.findall('\\d\\d\\d\\d', i)[0]) for i in mango_df['Date_of_observation']]), return_counts=True)"
]
},
{
"cell_type": "markdown",
"id": "6550db43",
"metadata": {},
"source": [
"# [Updated] Questions & Comments About SW Handbook for Geetha\n",
"\n",
"*!!! Note: Species names are written in the format Common Name (Scientific Name) !!!*\n",
"\n",
"### Q1: Are the open flowers phenophase and the gendered flowers phenophases mutually exclusive (i.e. If open flowers is a present phenophase, there cannot be male or female flowers and vice versa)?\n",
"\n",
"Silkworm Mulberry (Morus Alba), Box-myrtle (Myrica Esculenta), and Wild Almond (Sterculia Foetida) are listed in the handbook as having open flowers **AND** female flowers, but no male flowers. Is this a mistake or is the open flowers section for these species supposed to represent male flowers? The respective SW handbook pages for these species are pp. 109, 112, & 155.\n",
"\n",
"According to the Wikipedia pages for [Morus Alba](https://en.wikipedia.org/wiki/Morus_alba) and [Sterculia Foetida](https://en.wikipedia.org/wiki/Sterculia_foetida), these species have both male and female flowers. This is further evidence of an error in the handbook. Please let us know whether this is an error in the handbook or intentional.\n",
"\n",
"### Q2: Is there a difference between Indian Charcoal Trees and Indian Mahagony?\n",
"\n",
"In the [SeasonWatch Handbook](https://www.seasonwatch.in/wp-content/uploads/2023/11/SW-phenophases-guide-compressed.pdf), there are pages for Indian Mahagony (Trema Orientale) and Indian Mahagony (Toona Ciliata). In the citizen database, however, the species names for these scientific names differ: Indian Charcoal Tree (Trema Orientale) and Indian Mahagony (Toona Ciliata). The common name for Trema Orientale is therefore inconsistent between the two sources, so I was wondering whether to rely on the common name or the scientific name when associating the citizen data with the handbook.\n",
"\n",
"### Q3: What is the difference between the mango varieties? Are the phenophases that appear in each mango variety consistent?\n",
"\n",
"Each mango variety in the species name column and its number of recorded observations:\n",
"\n",
"```\n",
"Mango (all varieties)- Mangifera indica : 81855\n",
"Airi Mango- Mangifera indica : 459\n",
"Alphonso Mango- Mangifera indica : 181\n",
"Aabehayat Mango- Mangifera indica : 127\n",
"Manjeera Mango- Mangifera indica : 82\n",
"Mallika Mango- Mangifera indica : 9\n",
"Chosa Mango- Mangifera indica : 1\n",
"Olour Mango- Mangifera indica : 1\n",
"```\n",
"\n",
"### Comment: I believe there is a mistake in the SW Handbook about Mohru Oak.\n",
"\n",
"There is an error in the [SeasonWatch Handbook](https://www.seasonwatch.in/wp-content/uploads/2023/11/SW-phenophases-guide-compressed.pdf) (p. 142) in the entry for Mohru Oak (Quercus Floribunda). It lists a box for male flowers and a box for male/female flowers. I expect the male/female flowers box is meant to be just female flowers."
]
}
],
"metadata": {
@@ -109639,7 +109542,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.11.6"
}
},
"nbformat": 4,
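The labeling rule in the `-2_values.ipynb` hunks above compares each reported value with the handbook entry for the species. A minimal pure-Python sketch of that rule, written without pandas so it can be checked in isolation, might look like the following. The four phenophase names, the `NOT_SEEN_CODE` constant, and the helpers `label_report` and `clean_report` are assumptions made for illustration and are not part of the commit. Returning 0 for a consistent report is also an assumption, since the diff does not show how `incorrect_negative_2_df` is initialised.

```python
import math

# Hypothetical subset of phenophase names; the notebook reads the real list from df.columns[9:].
PHENOPHASES = ["Flowers_open", "Flowers_male", "Flowers_Female", "Fruits_open"]

# Code the notebook compares against for phenophases NOT seen in a species.
NOT_SEEN_CODE = -2


def create_species_dict(*absent_phenophases):
    """Map each phenophase to 0 (seen in the species) or 1 (not seen)."""
    species_dict = dict.fromkeys(PHENOPHASES, 0)
    for phenophase in absent_phenophases:
        species_dict[phenophase] = 1
    return species_dict


def label_report(value, absent):
    """Option 2 rule: 1 = false positive, 2 = false negative, 0 = consistent (assumed default)."""
    if absent == 0 and value == NOT_SEEN_CODE:
        return 1  # -2 reported for a phenophase that IS seen in the species
    if absent == 1 and value != NOT_SEEN_CODE:
        return 2  # value other than -2 reported for a phenophase NOT seen in the species
    return 0


def clean_report(value, absent):
    """Option 1 rule: false positives become NaN, false negatives become -2.0."""
    label = label_report(value, absent)
    if label == 1:
        return math.nan
    if label == 2:
        return float(NOT_SEEN_CODE)
    return value
```

As in the notebook's `!= -2` comparison, a missing (NaN) value in a phenophase that is not seen in the species counts as a false negative here, because NaN compares unequal to -2.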
10 changes: 10 additions & 0 deletions Extracting reference data (Isolation forest).ipynb
@@ -31,6 +31,16 @@
"# Selecting New Reference Data With Isolation Forests"
]
},
{
"cell_type": "code",
"execution_count": 133,
"id": "1b733f40",
"metadata": {},
"outputs": [],
"source": [
"a = pd.read_csv(\"all data/citizen/kerala.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 2,