diff --git a/README.md b/README.md index 5dfaee8..a2c61b8 100644 --- a/README.md +++ b/README.md @@ -18,18 +18,25 @@ More information on how to install and run the program can be found in the [Docu ![Screenshot](docs/gallery/MINT-interface-1.png) +## News +Starting with version 1.0.0, we have updated the installation setup to use pyproject.toml. Additionally, the main script to start Mint has been changed from Mint.py to Mint. Furthermore, each release of the repository will now be assigned a DOI to facilitate citation of the software. + +## Publications that used Mint +1. Brown K, Thomson CA, Wacker S, Drikic M, Groves R, Fan V, et al. [Microbiota alters the metabolome in an age- and sex- dependent manner in mice.](https://pubmed.ncbi.nlm.nih.gov/36906623/) Nat Commun. 2023;14: 1348. + +2. Ponce LF, Bishop SL, Wacker S, Groves RA, Lewis IA. [SCALiR: A Web Application for Automating Absolute Quantification of Mass Spectrometry-Based Metabolomics Data.](https://pubs.acs.org/doi/10.1021/acs.analchem.3c04988) Anal Chem. 2024;96: 6566–6574. + ## Installation You can find installation instructions [here](https://lewisresearchgroup.github.io/ms-mint-app/install/) ## Contributions -All contributions, bug reports, code reviews, bug fixes, documentation improvements, enhancements, and ideas are welcome. -Before you modify the code please reach out to us using the [issues](https://github.com/LewisResearchGroup/ms-mint/issues) page. +All contributions, bug reports, code reviews, bug fixes, documentation improvements, enhancements, and ideas are welcome. This includes recommendations for software architecture, code design, and efficiency improvements. ## Code standards The project follows PEP8 standard and uses Black and Flake8 to ensure a consistent code format throughout the project. ## Get in touch -Open an [issue](https://github.com/LewisResearchGroup/ms-mint-app/issues) or join the [slack](https://ms-mint.slack.com/) channel. 
+To get in touch, please open a GitHub [issue](https://github.com/LewisResearchGroup/ms-mint-app/issues). ## Acknowledgements This project would not be possible without the help of the open-source community. diff --git a/docs/gallery/analysis-pca.png b/docs/gallery/analysis-pca.png new file mode 100644 index 0000000..41aa37a Binary files /dev/null and b/docs/gallery/analysis-pca.png differ diff --git a/docs/gui.md b/docs/gui.md index 26ce304..79103c7 100644 --- a/docs/gui.md +++ b/docs/gui.md @@ -63,6 +63,8 @@ with the optimization tools. ## Add Metabolites +> Since version 1.0.0 this functionality has been removed and will be provided as an optional plugin. + - Search for metabolites from ChEBI three stars database - Add selected metabolites to peaklist (without RT estimation) @@ -94,7 +96,6 @@ identifes the closest peak with respect to the expected RT which is displayed as - Remove peaks from peaklist - Set expected retention time - ![Manual peak optimization](image/manual-peak-optimization.png "Manual peak optimization") When a peak is selected in the drop down box the chromatograms for the particular mass windows @@ -110,23 +111,35 @@ to the current view and updated the peaklist accordingly. ![Processing](image/processing.png "Processing") - When all peaks look good the data can be processed using `RUN MINT`. This will apply the current peaklist to the MS-files in the workspace and extract additional properties. When the results tables are present the results can be explored with the following tabs. The generated results can be downloaded with the `DOWNLOAD` button. +- `RUN MINT`: Will process all files in the workspace using the current target list. The progress is displayed in the progress bar on the top. +- `DOWNLOAD ALL RESULTS`: The generated results can be downloaded in tidy format. +- `DOWNLOAD DENSE MATRIX`: This will download a dense data table with targets as rows and files as columns. 
The observable used for the cells can be selected in the drop down menu. Optionally, you can transpose the table by checking the `Transposed` checkbox. +- `DELETE RESULTS`: Delete results file if present, and start from scratch. -## Analysis +## Quality Control +Analytical visualizations to display a few quality metrics and comparisons. The `m/z drift` compares the observed m/z values with the ones set in the target list. This value will always be lower than the `mz_width` set in the target list for each target. It is one way of evaluating how well the machine is calibrated. Generally speaking, values between [-5, 5] are acceptable, but it depends on the specific assay and experiment. + +The graphs are categorized by `sample_type` set in the Metadata tab. You should have some quality control or calibration samples with known metabolite composition to be able to make judgements about the quality. + +The second plot breaks down the `m/z drift` by target, to see how the calibration varies between targets. + +The PCA (Principal Component Analysis) plot shows a PCA using `peak_area_top3`. You can compare different groups of samples as set in the `sample_types` column in the Metadata tab. + +The final plot displays peak shapes of a random sample of files for all targets. To change the sample, you can refresh this page. +## Analysis After running MINT the results can be downloaed or analysed using the provided tools. For quality control purposes histograms and boxplots can be generated in the quality control tab. The interactive heatmap tool can be used to explore the results data after `RUN MINT` has been exectuted. The tool allows to explore the generated data in from of heatmaps. -## General selection elements - +## Selections and transformations - Include/exclude file types (based on `Type` column in metadata) - Include/exclude peak labels for analysis - Set file sorting (e.g. by name, by batch etc.) @@ -134,22 +147,71 @@ has been exectuted. 
The tool allows to explore the generated data in from of hea ![Selections](image/general-selection-elements.png "Selections") +- `Types of files to include`: Uses the `sample_types` column in the Metadata tab to select files. If nothing is selected, all files are included. +- `Include peak_labels`: Targets to include. If nothing is selected all targets are included. +- `Exclude peak_labels`: Targets to exclude. If nothing is selected no target is excluded. +- `Variable to plot`: This determines which column to run the analysis on. For example, you can set this to `peak_mass_diff_50pc` to analyse the instrument calibration. The default is `peak_area_top3`. +- `MS-file sorting`: Sets the order of the MS-files in the underlying dataframe before plotting. This will change the order of files in some plots. +- `Color by`: The PCA and `Plotting` tools can use a categorical or numeric column to color-code samples. Some plots (e.g. the Hierarchical clustering tool) are unaffected. +- `Transformation`: The values can be log transformed before being subjected to normalization. If nothing is selected, the raw values are used. +- `Scaling group(s)`: Column or selection of columns to group the data and apply the normalization function in the dropdown menu for each group. If you want z-scores for each target, select `peak_label` here and `Standard scaling (z-scores)` in the scaling dropdown menu. +- `Scaling method`: You can choose standard scaling, min-max scaling, robust scaling, or no scaling (if nothing is selected). -## Heatmap +### Scaling Techniques + +#### 1. Standard Scaling + +**Standard scaling** (also known as z-score normalization) transforms the data such that the mean of each feature becomes 0 and the standard deviation becomes 1. This is useful when the features have different units or magnitudes, as it ensures they are on the same scale. + +The formula for standard scaling is: + + z = (x - mean) / standard_deviation + +Where: +- `x` is the original value. 
+- `mean` is the mean of the feature. +- `standard_deviation` is the standard deviation of the feature. +#### 2. Robust Scaling + +**Robust scaling** is used to scale features using statistics that are robust to outliers. This scaling technique uses the median and the interquartile range (IQR) instead of the mean and standard deviation, making it more suitable for datasets with outliers. + +The formula for robust scaling is: + + x_scaled = (x - median) / IQR + +Where: +- `x` is the original value. +- `median` is the median of the feature. +- `IQR` is the interquartile range of the feature (IQR = Q3 - Q1). + +#### 3. Min-Max Scaling + +**Min-max scaling** (also known as normalization) transforms the data to fit within the range [0, 1]. This scaling technique is useful when you want to preserve the relationships within the data, but want to adjust the scale. + +The formula for min-max scaling is: + + x_scaled = (x - x_min) / (x_max - x_min) + +Where: +- `x` is the original value. +- `x_min` is the minimum value of the feature. +- `x_max` is the maximum value of the feature. + + +## Heatmap ![Heatmap](image/heatmap.png "Heatmap") The first dropdown menu allows to include certain file types e.g. biological samples rather than quality control samples. The second dropdown menu distinguishes the how the heatmap is generated. -- Normalized by biomarer: devide values by column maxium. -- Cluster: Cluster rows with hierachical clustering. -- Dendrogram: Plots a dendrogram instead of row labels. -- Transpose: Switch columns and rows. -- Correlation: Calculate pearson correlation between columns. -- Show in new tab: The figure will be generated in a new independent tab. That way multiple heatmaps can be generated at the same time. +- `Cluster`: Cluster rows with hierarchical clustering. +- `Dendrogram`: Plots a dendrogram instead of row labels (only in combination with `Cluster`). +- `Transpose`: Switch columns and rows. 
+- `Correlation`: Calculate Pearson correlation between columns. +- `Show in new tab`: The figure will be generated in a new independent tab. That way multiple heatmaps can be generated at the same time. This may only work when you serve MINT locally, since the plot is served on a different port. If the app becomes unresponsive to changes, reload the tab. -### Correlation of (scaled) peak_max +### Example: Plot correlation between metabolites using scaled peak_area_top3 values ![Heatmap](image/heatmap-correlation.png "Correlation") @@ -160,6 +222,8 @@ The second dropdown menu distinguishes the how the heatmap is generated. - Density distributions - Boxplots +### Example: Box-plot of scaled peak_area_top3 values by metabolite + ![Quality Control](image/distributions.png "Quality Control") The MS-files can be grouped based on the values in the metadata table. If nothing @@ -169,9 +233,9 @@ to generate. The third dropdown menu allows to include certain file types. For example, the analysis can be limited to only the biological samples if such a type has been defined in the type column of the metadata table. -The checkbox can be used to create a dense view. If the box is unchecked the output will be -visually grouped into an individual section for each metabolite. +The checkbox can be used to create a dense view. If the box is unchecked the output will be visually grouped into an individual section for each metabolite. +The plots are interactive. You can switch off labels, zoom in on particular areas of interest, or hover the mouse cursor over a datapoint to get more information about the underlying sample and/or target. ## PCA @@ -183,20 +247,30 @@ visually grouped into an individual section for each metabolite. ## Hierarchical clustering +Hierarchical clustering is a technique for cluster analysis that seeks to build a hierarchy of clusters. It can be divided into two main types: **agglomerative** and **divisive**. 
MINT uses agglomerative hierarchical clustering (also known as bottom-up clustering), which starts with each data point as a separate cluster and iteratively merges the closest clusters until all points are in a single cluster or a stopping criterion is met. -![Hierarchical clustering](image/hierarchical_clustering.png "Hierarchical clustering") +### Steps for Agglomerative Clustering +1. **Initialization**: Start with each data point as its own cluster. +2. **Distance Calculation**: Compute the pairwise distance between all clusters. +3. **Merge Closest Clusters**: Find the two closest clusters and merge them into a single cluster. +4. **Update Distances**: Recalculate the distances between the new cluster and all other clusters. +5. **Repeat**: Repeat steps 3 and 4 until all data points are in a single cluster or the desired number of clusters is achieved. +### Dendrogram -## Plotting +The output of hierarchical clustering is often visualized using a dendrogram, which is a tree-like diagram that shows the arrangement of clusters and their hierarchical relationships. Each branch of the dendrogram represents a merge or split, and the height of the branches indicates the distance or dissimilarity between clusters. -MINT comes with a flexible and powerful plotting interface that is based on the powerful [Seaborn](http://seaborn.pydata.org/) library. +### Example: Hierarchical clustering with different metrics using z-scores (for each metabolite) +![Hierarchical clustering](image/hierarchical-clustering.png "Hierarchical clustering") - - Bar plots - - Violin plots - - Boxen plot - - Scatter plots - - and more... +## Plotting +With great power comes great responsibility. The plotting tool can generate impressive and very complex plots, but it can be a bit overwhelming in the beginning. It uses the [Seaborn](http://seaborn.pydata.org/) library under the hood. Familiarity with this library helps in understanding what the different settings do. 
We recommend starting with a basic plot and then increasing its complexity step by step. +- Bar plots +- Violin plots +- Boxen plots +- Scatter plots +- and more... ![Plot setting](image/plotting_settings.png "Plot settings") diff --git a/docs/image/distributions.png b/docs/image/distributions.png index aeae40d..68a58d7 100644 Binary files a/docs/image/distributions.png and b/docs/image/distributions.png differ diff --git a/docs/image/general-selection-elements.png b/docs/image/general-selection-elements.png index 5d6c405..59fcc33 100644 Binary files a/docs/image/general-selection-elements.png and b/docs/image/general-selection-elements.png differ diff --git a/docs/image/heatmap-correlation.png b/docs/image/heatmap-correlation.png index 251efde..8177c07 100644 Binary files a/docs/image/heatmap-correlation.png and b/docs/image/heatmap-correlation.png differ diff --git a/docs/image/heatmap.png b/docs/image/heatmap.png index 083d8d0..db66490 100644 Binary files a/docs/image/heatmap.png and b/docs/image/heatmap.png differ diff --git a/docs/image/hierarchical-clustering.png b/docs/image/hierarchical-clustering.png new file mode 100644 index 0000000..2b51452 Binary files /dev/null and b/docs/image/hierarchical-clustering.png differ diff --git a/docs/image/hierarchical_clustering.png b/docs/image/hierarchical_clustering.png deleted file mode 100644 index 887ee7b..0000000 Binary files a/docs/image/hierarchical_clustering.png and /dev/null differ diff --git a/docs/image/pca.png b/docs/image/pca.png index c1e5cae..5730b8a 100644 Binary files a/docs/image/pca.png and b/docs/image/pca.png differ diff --git a/docs/image/processing.png b/docs/image/processing.png index 353e452..63db768 100644 Binary files a/docs/image/processing.png and b/docs/image/processing.png differ diff --git a/docs/targets.md b/docs/targets.md index 7f46a49..b3cb637 100644 --- a/docs/targets.md +++ b/docs/targets.md @@ -10,12 +10,13 @@ The target list is the determining protocol for the data processing step. 
You ca The input files contains a number of columns headers in the target list should contain: -- **peak_label** : A __unique__ identifier such as the biomarker name or ID. Even if multiple peaklist files are used, the label have to be unique across all the files. +- **peak_label** : A __unique__ identifier such as the biomarker name or ID. - **mz_mean** : The target mass (m/z-value) in [Da]. - **mz_width** : The width of the peak in the m/z-dimension in units of ppm. The window will be *mz_mean* +/- (mz_width * mz_mean * 1e-6). Usually, a values between 5 and 10 are used. -- **rt** : Estimated retention time in [min] (optional, see above). -- **rt_min** : The start of the retention time for each peak in [min]. -- **rt_max** : The end of the retention time for each peak in [min]. +- **rt** : Estimated retention time (optional, see above), used for reference and in automated peak optimization. +- **rt_min** : The start of the retention time for each peak. +- **rt_max** : The end of the retention time for each peak. +- **rt_unit** : Time unit, either `min` (minutes) or `s` (seconds). Mint always converts the values to seconds. - **intensity_threshold** : A threshold that is applied to filter noise for each window individually. Can be set to 0 or any positive value. #### Example file @@ -27,4 +28,7 @@ Biomarker-B,151.02585,10,4.18,4.53,0 ``` -A template can be created using the [GUI](gui.md). +A template can be created using the [GUI](gui.md): + +1. Go to the targets tab. +2. Click on `EXPORT` to download a `target.csv` file with all necessary columns. 
diff --git a/src/ms_mint_app/plugins/analysis.py b/src/ms_mint_app/plugins/analysis.py index 24244e5..db763a8 100644 --- a/src/ms_mint_app/plugins/analysis.py +++ b/src/ms_mint_app/plugins/analysis.py @@ -49,9 +49,9 @@ def outputs(self): var_name_options = T.list_to_options(RESULTS_COLUMNS) scaler_options = [ - {"value": "standard", "label": "Standard Scaling (z-scores)"}, - #{"value": "minmax", "label": "MinMax Scaling"}, - {"value": "robust", "label": "Robust Scaling"} + {"value": "standard", "label": "Standard scaling (z-scores)"}, + {"value": "minmax", "label": "Min-Max scaling"}, + {"value": "robust", "label": "Robust scaling"} ] @@ -92,7 +92,7 @@ def outputs(self): dcc.Dropdown( id="ana-var-name", options=var_name_options, - value='peak_max', + value='peak_area_top3', placeholder="Variable to plot", ) ]), @@ -117,14 +117,14 @@ def outputs(self): id="ana-groupby", options=[], value=None, - placeholder="Normalize by", + placeholder="Scaling group(s)", multi=True, ), dcc.Dropdown( id="ana-scaler", options=scaler_options, - value=[], - placeholder="Scaler", + value=None, + placeholder="Scaling method", multi=False, ), ]), diff --git a/src/ms_mint_app/plugins/analysis_tools/plotting.py b/src/ms_mint_app/plugins/analysis_tools/plotting.py index 6738c9f..cebf8b8 100644 --- a/src/ms_mint_app/plugins/analysis_tools/plotting.py +++ b/src/ms_mint_app/plugins/analysis_tools/plotting.py @@ -469,7 +469,7 @@ def create_figure( sharex="share-x" in options, sharey="share-y" in options, dodge="no-dodge" not in options, - facet_kws=dict(legend_out=True), + #facet_kws=dict(legend_out=True), ) try: @@ -488,8 +488,8 @@ def create_figure( **kwargs ) except Exception as e: - logging.error(e) - return dbc.Alert(str(e), color="danger") + logging.error(f"Failed to generate plot: {e}\nwith arguments:\n{kwargs}") + return "" #dbc.Alert(str(e), color="danger") g.fig.subplots_adjust(top=0.9) g.set_titles(col_template="{col_name}", row_template="{row_name}", y=1.05)
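
The three scaling formulas documented in `docs/gui.md` in this change set, together with the `Scaling group(s)` option, can be illustrated with a minimal sketch in plain Python. This is not the code used in `analysis.py`; the function names are hypothetical, and the sketch assumes the population standard deviation and inclusive quartiles, which may differ from what the app computes.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(values):
    """z = (x - mean) / standard_deviation (population std, an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


def robust_scale(values):
    """x_scaled = (x - median) / IQR, with IQR = Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(x - med) / (q3 - q1) for x in values]


def min_max_scale(values):
    """x_scaled = (x - x_min) / (x_max - x_min), mapping values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def scale_by_group(rows, group_key, value_key, scaler):
    """Apply `scaler` separately within each group of rows.

    Mirrors the idea of selecting `peak_label` as the scaling group:
    every target is scaled using only its own values.
    Rows are returned grouped, in order of first appearance of each group.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    scaled_rows = []
    for members in groups.values():
        scaled = scaler([m[value_key] for m in members])
        for member, value in zip(members, scaled):
            scaled_rows.append({**member, value_key: value})
    return scaled_rows


# Hypothetical results: two targets measured in two files each.
results = [
    {"peak_label": "A", "peak_area_top3": 10.0},
    {"peak_label": "A", "peak_area_top3": 20.0},
    {"peak_label": "B", "peak_area_top3": 1.0},
    {"peak_label": "B", "peak_area_top3": 3.0},
]
scaled = scale_by_group(results, "peak_label", "peak_area_top3", min_max_scale)
```

Note that all three functions divide by a spread measure, so a group with identical values (zero standard deviation, IQR, or range) raises `ZeroDivisionError` in this sketch; a real implementation would need to handle that case.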