.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_hbn_site_profiles.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_plot_hbn_site_profiles.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_hbn_site_profiles.py:


===============================
Harmonize HBN data using ComBat
===============================

This example loads AFQ data from the Healthy Brain Network (HBN) preprocessed
diffusion derivatives [1]_. The HBN is a landmark pediatric mental health study.
Over the course of the study, it will collect diffusion MRI data from
approximately 5,000 children and adolescents. We recently processed the
available data from over 2,000 of these subjects, and provide the tract profiles
from this dataset, which can be downloaded from AWS thanks to
[INDI](http://fcon_1000.projects.nitrc.org/).

We first load the data by using the :func:`AFQDataset.from_files` static method
and supplying AWS S3 URIs instead of local file names. We then impute missing
values and plot the mean bundle profiles by scanning site, noting that there are
substantial site differences. Lastly, we harmonize the site differences using
NeuroComBat [2]_ and plot the harmonized bundle profiles to verify that the site
differences have been removed.

.. [1]  Adam Richie-Halford, Matthew Cieslak, Lei Ai, Sendy Caffarra, Sydney
   Covitz, Alexandre R. Franco, Iliana I. Karipidis, John Kruper, Michael
   Milham, Bárbara Avelar-Pereira, Ethan Roy, Valerie J. Sydnor, Jason Yeatman,
   The Fibr Community Science Consortium, Theodore D. Satterthwaite, and Ariel
   Rokem,
   "An open, analysis-ready, and quality controlled resource for pediatric brain
   white-matter research"
   bioRxiv 2022.02.24.481303;
   doi: https://doi.org/10.1101/2022.02.24.481303

.. [2] Jean-Philippe Fortin, Drew Parker, Birkan Tunc, Takanori Watanabe, Mark A
   Elliott, Kosha Ruparel, David R Roalf, Theodore D Satterthwaite, Ruben C Gur,
   Raquel E Gur, Robert T Schultz, Ragini Verma, Russell T Shinohara.
   "Harmonization Of Multi-Site Diffusion Tensor Imaging Data"
   NeuroImage, 161, 149-170, 2017;
   doi: https://doi.org/10.1016/j.neuroimage.2017.08.047

.. GENERATED FROM PYTHON SOURCE LINES 39-47

.. code-block:: default

    import numpy as np

    from afqinsight import AFQDataset
    from afqinsight.plot import plot_tract_profiles
    from neurocombat_sklearn import CombatModel
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split


.. GENERATED FROM PYTHON SOURCE LINES 48-55

Fetch the HBN data
------------------

The :func:`AFQDataset.from_files` static method expects a path to
nodes.csv and subjects.csv files, but these file paths can be remote
URLs or AWS S3 URIs. We'll use S3 URIs to grab the HBN data. After dropping
participants with null phenotypic values, it has 1,867 participants.

.. GENERATED FROM PYTHON SOURCE LINES 55-67

.. code-block:: default


    dataset = AFQDataset.from_files(
        fn_nodes="s3://fcp-indi/data/Projects/HBN/BIDS_curated/derivatives/afq/combined_tract_profiles.csv",
        fn_subjects="s3://fcp-indi/data/Projects/HBN/BIDS_curated/derivatives/qsiprep/participants.tsv",
        dwi_metrics=["dki_fa", "dki_md"],
        target_cols=["age", "sex", "scan_site_id"],
        label_encode_cols=["sex", "scan_site_id"],
        index_col="subject_id",
    )
    dataset.drop_target_na()
    print(dataset)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    AFQDataset(n_samples=1867, n_features=4800, n_targets=3, targets=['age', 'sex', 'scan_site_id'])


.. GENERATED FROM PYTHON SOURCE LINES 68-73

Train / test split
------------------

We can use the dataset in the :func:`train_test_split` function just as we
would with an array.

.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: default


    dataset_train, dataset_test = train_test_split(dataset, test_size=0.5)


.. GENERATED FROM PYTHON SOURCE LINES 77-83

Impute missing values
---------------------

Next we impute missing values using median imputation. We fit the imputer
using the training set and then use it to transform both the training and test
sets.

.. GENERATED FROM PYTHON SOURCE LINES 83-88

.. code-block:: default


    imputer = dataset_train.model_fit(SimpleImputer(strategy="median"))
    dataset_train = dataset_train.model_transform(imputer)
    dataset_test = dataset_test.model_transform(imputer)


.. GENERATED FROM PYTHON SOURCE LINES 89-96

Plot average bundle profiles by scan site
-----------------------------------------

Next we plot the mean bundle profiles in the test set by scanning site. The
:func:`plot_tract_profiles` function takes as input an :class:`AFQDataset` and
returns matplotlib figures displaying the mean bundle profile for each bundle
and metric, optionally grouped by a categorical or continuous variable.

.. GENERATED FROM PYTHON SOURCE LINES 96-104

.. code-block:: default


    site_figs = plot_tract_profiles(
        X=dataset_test,
        group_by=dataset_test.classes["scan_site_id"][dataset_test.y[:, 2].astype(int)],
        group_by_name="Site",
        figsize=(14, 14),
    )


.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_001.png
         :alt: IFOL, UNCL, UNCR, IFOR, ATRL, CSTL, CSTR, ATRR, ARCL, SLFL, SLFR, ARCR, ILFL, CGCL, CGCR, ILFR, Orbital, AntFrontal, SupFrontal, Motor, SupParietal, Temporal, PostParietal, Occipital
         :srcset: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_002.png
         :alt: IFOL, UNCL, UNCR, IFOR, ATRL, CSTL, CSTR, ATRR, ARCL, SLFL, SLFR, ARCR, ILFL, CGCL, CGCR, ILFR, Orbital, AntFrontal, SupFrontal, Motor, SupParietal, Temporal, PostParietal, Occipital
         :srcset: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_002.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      0%|          | 0/2 [00:00<?, ?it/s]
      0%|          | 0/24 [00:00<?, ?it/s]
      4%|4         | 1/24 [00:05<01:56,  5.07s/it]
      8%|8         | 2/24 [00:10<01:52,  5.13s/it]
     12%|#2        | 3/24 [00:15<01:48,  5.17s/it]
     17%|#6        | 4/24 [00:20<01:41,  5.10s/it]
     21%|##        | 5/24 [00:25<01:37,  5.11s/it]
     25%|##5       | 6/24 [00:30<01:32,  5.14s/it]
     29%|##9       | 7/24 [00:35<01:26,  5.11s/it]
     33%|###3      | 8/24 [00:40<01:20,  5.01s/it]
     38%|###7      | 9/24 [00:45<01:14,  4.96s/it]
     42%|####1     | 10/24 [00:50<01:08,  4.91s/it]
     46%|####5     | 11/24 [00:55<01:03,  4.89s/it]
     50%|#####     | 12/24 [00:59<00:58,  4.87s/it]
     54%|#####4    | 13/24 [01:04<00:53,  4.85s/it]
     58%|#####8    | 14/24 [01:09<00:48,  4.85s/it]
     62%|######2   | 15/24 [01:14<00:43,  4.85s/it]
     67%|######6   | 16/24 [01:19<00:38,  4.85s/it]
     71%|#######   | 17/24 [01:24<00:33,  4.84s/it]
     75%|#######5  | 18/24 [01:28<00:29,  4.84s/it]
     79%|#######9  | 19/24 [01:33<00:24,  4.84s/it]
     83%|########3 | 20/24 [01:38<00:19,  4.82s/it]
     88%|########7 | 21/24 [01:43<00:14,  4.82s/it]
     92%|#########1| 22/24 [01:48<00:09,  4.82s/it]
     96%|#########5| 23/24 [01:53<00:04,  4.82s/it]
    100%|##########| 24/24 [01:58<00:00,  4.87s/it]    100%|##########| 24/24 [01:58<00:00,  4.92s/it]
     50%|#####     | 1/2 [01:58<01:58, 118.89s/it]
      0%|          | 0/24 [00:00<?, ?it/s]
      4%|4         | 1/24 [00:04<01:51,  4.84s/it]
      8%|8         | 2/24 [00:09<01:47,  4.91s/it]
     12%|#2        | 3/24 [00:14<01:42,  4.90s/it]
     17%|#6        | 4/24 [00:19<01:37,  4.88s/it]
     21%|##        | 5/24 [00:24<01:32,  4.87s/it]
     25%|##5       | 6/24 [00:29<01:27,  4.84s/it]
     29%|##9       | 7/24 [00:34<01:22,  4.84s/it]
     33%|###3      | 8/24 [00:38<01:17,  4.84s/it]
     38%|###7      | 9/24 [00:43<01:12,  4.85s/it]
     42%|####1     | 10/24 [00:48<01:07,  4.85s/it]
     46%|####5     | 11/24 [00:53<01:03,  4.86s/it]
     50%|#####     | 12/24 [00:58<00:58,  4.88s/it]
     54%|#####4    | 13/24 [01:03<00:53,  4.87s/it]
     58%|#####8    | 14/24 [01:08<00:49,  4.92s/it]
     62%|######2   | 15/24 [01:13<00:44,  4.98s/it]
     67%|######6   | 16/24 [01:18<00:39,  4.95s/it]
     71%|#######   | 17/24 [01:23<00:34,  4.91s/it]
     75%|#######5  | 18/24 [01:27<00:29,  4.89s/it]
     79%|#######9  | 19/24 [01:32<00:24,  4.88s/it]
     83%|########3 | 20/24 [01:37<00:19,  4.89s/it]
     88%|########7 | 21/24 [01:42<00:14,  4.88s/it]
     92%|#########1| 22/24 [01:47<00:09,  4.87s/it]
     96%|#########5| 23/24 [01:52<00:04,  4.87s/it]
    100%|##########| 24/24 [01:57<00:00,  4.89s/it]    100%|##########| 24/24 [01:57<00:00,  4.88s/it]
    100%|##########| 2/2 [03:57<00:00, 118.49s/it]    100%|##########| 2/2 [03:57<00:00, 118.55s/it]


.. GENERATED FROM PYTHON SOURCE LINES 105-123

Harmonize the sites and replot
------------------------------

We can see that there are substantial scan site differences in both the
FA and MD profiles. Let's use neuroComBat to harmonize the site differences
and then replot the mean bundle profiles.

N.B. We use the excellent `neurocombat_sklearn
<https://github.com/Warvito/neurocombat_sklearn>`_ package to apply ComBat to
our data. We love this library, however it is not fully compliant with the
scikit-learn transformer API, so we cannot use the
:func:`AFQDataset.model_fit_transform` method to apply this transformer to our
dataset. No problem! We can simply copy the unharmonized dataset into a new
variable and then overwrite the features of the new dataset with the ComBat
output.

Lastly, we replot the mean bundle profiles and confirm that ComBat did its
job.

.. GENERATED FROM PYTHON SOURCE LINES 123-150

.. code-block:: default


    # Fit the ComBat transformer to the training set
    combat = CombatModel()
    combat.fit(
        dataset_train.X,
        dataset_train.y[:, 2][:, np.newaxis],
        dataset_train.y[:, 1][:, np.newaxis],
        dataset_train.y[:, 0][:, np.newaxis],
    )

    # And then transform a copy of the test set
    harmonized_test = dataset_test.copy()
    harmonized_test.X = combat.transform(
        dataset_test.X,
        dataset_test.y[:, 2][:, np.newaxis],
        dataset_test.y[:, 1][:, np.newaxis],
        dataset_test.y[:, 0][:, np.newaxis],
    )

    site_figs = plot_tract_profiles(
        X=harmonized_test,
        group_by=harmonized_test.classes["scan_site_id"][
            harmonized_test.y[:, 2].astype(int)
        ],
        group_by_name="Site",
        figsize=(14, 14),
    )


.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_003.png
         :alt: IFOL, UNCL, UNCR, IFOR, ATRL, CSTL, CSTR, ATRR, ARCL, SLFL, SLFR, ARCR, ILFL, CGCL, CGCR, ILFR, Orbital, AntFrontal, SupFrontal, Motor, SupParietal, Temporal, PostParietal, Occipital
         :srcset: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_003.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_004.png
         :alt: IFOL, UNCL, UNCR, IFOR, ATRL, CSTL, CSTR, ATRR, ARCL, SLFL, SLFR, ARCR, ILFL, CGCL, CGCR, ILFR, Orbital, AntFrontal, SupFrontal, Motor, SupParietal, Temporal, PostParietal, Occipital
         :srcset: /auto_examples/images/sphx_glr_plot_hbn_site_profiles_004.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      0%|          | 0/2 [00:00<?, ?it/s]
      0%|          | 0/24 [00:00<?, ?it/s]
      4%|4         | 1/24 [00:04<01:54,  4.99s/it]
      8%|8         | 2/24 [00:09<01:49,  4.96s/it]
     12%|#2        | 3/24 [00:14<01:43,  4.94s/it]
     17%|#6        | 4/24 [00:19<01:38,  4.93s/it]
     21%|##        | 5/24 [00:24<01:34,  4.98s/it]
     25%|##5       | 6/24 [00:29<01:30,  5.01s/it]
     29%|##9       | 7/24 [00:34<01:25,  5.04s/it]
     33%|###3      | 8/24 [00:40<01:20,  5.04s/it]
     38%|###7      | 9/24 [00:45<01:15,  5.05s/it]
     42%|####1     | 10/24 [00:50<01:10,  5.07s/it]
     46%|####5     | 11/24 [00:55<01:05,  5.07s/it]
     50%|#####     | 12/24 [01:00<01:00,  5.05s/it]
     54%|#####4    | 13/24 [01:05<00:55,  5.04s/it]
     58%|#####8    | 14/24 [01:10<00:50,  5.07s/it]
     62%|######2   | 15/24 [01:15<00:45,  5.10s/it]
     67%|######6   | 16/24 [01:20<00:40,  5.12s/it]
     71%|#######   | 17/24 [01:25<00:35,  5.13s/it]
     75%|#######5  | 18/24 [01:31<00:30,  5.14s/it]
     79%|#######9  | 19/24 [01:36<00:25,  5.16s/it]
     83%|########3 | 20/24 [01:41<00:20,  5.15s/it]
     88%|########7 | 21/24 [01:46<00:15,  5.10s/it]
     92%|#########1| 22/24 [01:51<00:10,  5.12s/it]
     96%|#########5| 23/24 [01:56<00:05,  5.09s/it]
    100%|##########| 24/24 [02:01<00:00,  5.03s/it]    100%|##########| 24/24 [02:01<00:00,  5.06s/it]
     50%|#####     | 1/2 [02:02<02:02, 122.36s/it]
      0%|          | 0/24 [00:00<?, ?it/s]
      4%|4         | 1/24 [00:04<01:52,  4.88s/it]
      8%|8         | 2/24 [00:09<01:47,  4.88s/it]
     12%|#2        | 3/24 [00:14<01:41,  4.85s/it]
     17%|#6        | 4/24 [00:19<01:37,  4.87s/it]
     21%|##        | 5/24 [00:24<01:32,  4.89s/it]
     25%|##5       | 6/24 [00:29<01:27,  4.89s/it]
     29%|##9       | 7/24 [00:34<01:24,  4.96s/it]
     33%|###3      | 8/24 [00:39<01:20,  5.01s/it]
     38%|###7      | 9/24 [00:44<01:14,  4.99s/it]
     42%|####1     | 10/24 [00:49<01:10,  5.04s/it]
     46%|####5     | 11/24 [00:54<01:05,  5.06s/it]
     50%|#####     | 12/24 [00:59<01:00,  5.07s/it]
     54%|#####4    | 13/24 [01:04<00:54,  4.98s/it]
     58%|#####8    | 14/24 [01:09<00:49,  4.94s/it]
     62%|######2   | 15/24 [01:14<00:44,  4.97s/it]
     67%|######6   | 16/24 [01:19<00:39,  5.00s/it]
     71%|#######   | 17/24 [01:24<00:34,  4.97s/it]
     75%|#######5  | 18/24 [01:29<00:29,  4.93s/it]
     79%|#######9  | 19/24 [01:34<00:24,  4.92s/it]
     83%|########3 | 20/24 [01:39<00:19,  4.90s/it]
     88%|########7 | 21/24 [01:43<00:14,  4.90s/it]
     92%|#########1| 22/24 [01:48<00:09,  4.90s/it]
     96%|#########5| 23/24 [01:53<00:04,  4.91s/it]
    100%|##########| 24/24 [01:58<00:00,  4.90s/it]    100%|##########| 24/24 [01:58<00:00,  4.94s/it]
    100%|##########| 2/2 [04:01<00:00, 120.67s/it]    100%|##########| 2/2 [04:01<00:00, 120.93s/it]


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 8 minutes  45.792 seconds)


.. _sphx_glr_download_auto_examples_plot_hbn_site_profiles.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_hbn_site_profiles.py <plot_hbn_site_profiles.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_hbn_site_profiles.ipynb <plot_hbn_site_profiles.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_