{ "cells": [ { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuromatch/climate-course-content/blob/main/tutorials/W1D4_Paleoclimate/instructor/W1D4_Tutorial5.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Tutorial 5: Paleoclimate Data Analysis Tools\n", "**Week 1, Day 4, Paleoclimate**\n", "\n", "**Content creators:** Sloane Garelick\n", "\n", "**Content reviewers:** Yosmely Bermúdez, Dionessa Biton, Katrina Dobson, Maria Gonzalez, Will Gregory, Nahid Hasan, Paul Heubel, Sherry Mi, Beatriz Cosenza Muralles, Brodie Pearson, Jenna Pearson, Chi Zhang, Ohad Zivan \n", "\n", "**Content editors:** Yosmely Bermúdez, Paul Heubel, Zahra Khodakaramimaghsoud, Jenna Pearson, Agustina Pesce, Chi Zhang, Ohad Zivan\n", "\n", "**Production editors:** Wesley Banfield, Paul Heubel, Jenna Pearson, Konstantine Tsafatinos, Chi Zhang, Ohad Zivan\n", "\n", "**Our 2024 Sponsors:** CMIP, NFDI4Earth" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Tutorial Objectives\n", "\n", "*Estimated timing of tutorial:* 30 minutes\n", "\n", "In this tutorial, you will explore various computational analyses for interpreting paleoclimate data and understand why these methods are useful. A common issue in paleoclimate data is uneven time spacing between consecutive observations. `Pyleoclim` includes several methods that can deal with uneven sampling effectively, but there are certain applications and analyses for which it's necessary to place the records on a uniform time axis. In this tutorial, you'll learn a few ways to do this with `Pyleoclim`. Additionally, we will explore another useful paleoclimate data analysis tool, Principal Component Analysis (PCA), which allows us to identify a common signal between various paleoclimate reconstructions. 
\n", "\n", "By the end of this tutorial, you'll be able to perform the following data analysis techniques on proxy-based climate reconstructions:\n", "\n", "* Interpolation\n", "* Binning \n", "* Principal component analysis (PCA)\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# installations ( uncomment and run this cell ONLY when using google colab or kaggle )\n", "\n", "# !pip install pyleoclim" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# imports\n", "import pandas as pd\n", "import pyleoclim as pyleo\n", "import matplotlib.pyplot as plt\n", "import pooch\n", "import os\n", "import tempfile\n", "\n", "from sklearn.decomposition import PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install and import feedback gadget\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Install and import feedback gadget\n", "\n", "!pip3 install vibecheck datatops --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", " return DatatopsContentReviewContainer(\n", " \"\", # No text prompt\n", " notebook_section,\n", " {\n", " \"url\": \"https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab\",\n", " \"name\": \"comptools_4clim\",\n", " \"user_key\": \"l5jpxuee\",\n", " },\n", " ).render()\n", "\n", "\n", "feedback_prefix = \"W1D4_T5\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Figure Settings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Figure Settings\n", "import ipywidgets as widgets # interactive 
display\n", "\n", "%config InlineBackend.figure_format = 'retina'\n", "plt.style.use(\n", " \"https://raw.githubusercontent.com/neuromatch/climate-course-content/main/cma.mplstyle\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Helper functions\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Helper functions\n", "\n", "def pooch_load(filelocation=None, filename=None, processor=None):\n", " shared_location = \"/home/jovyan/shared/Data/tutorials/W1D4_Paleoclimate\" # this is different for each day\n", " user_temp_cache = tempfile.gettempdir()\n", "\n", " if os.path.exists(os.path.join(shared_location, filename)):\n", " file = os.path.join(shared_location, filename)\n", " else:\n", " file = pooch.retrieve(\n", " filelocation,\n", " known_hash=None,\n", " fname=os.path.join(user_temp_cache, filename),\n", " processor=processor,\n", " )\n", "\n", " return file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Video 1: Proxy Data Analysis Tools\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @title Video 1: Proxy Data Analysis Tools\n", "\n", "from ipywidgets import widgets\n", "from IPython.display import YouTubeVideo\n", "from IPython.display import IFrame\n", "from IPython.display import display\n", "\n", "\n", "class PlayVideo(IFrame):\n", " def __init__(self, id, source, page=1, width=400, height=300, **kwargs):\n", " self.id = id\n", " if source == 'Bilibili':\n", " src = f'https://player.bilibili.com/player.html?bvid={id}&page={page}'\n", " elif source == 'Osf':\n", " src = f'https://mfr.ca-1.osf.io/render?url=https://osf.io/download/{id}/?direct%26mode=render'\n", " super(PlayVideo, self).__init__(src, width, height, **kwargs)\n", "\n", "\n", "def display_videos(video_ids, W=400, 
H=300, fs=1):\n", " tab_contents = []\n", " for i, video_id in enumerate(video_ids):\n", " out = widgets.Output()\n", " with out:\n", " if video_ids[i][0] == 'Youtube':\n", " video = YouTubeVideo(id=video_ids[i][1], width=W,\n", " height=H, fs=fs, rel=0)\n", " print(f'Video available at https://youtube.com/watch?v={video.id}')\n", " else:\n", " video = PlayVideo(id=video_ids[i][1], source=video_ids[i][0], width=W,\n", " height=H, fs=fs, autoplay=False)\n", " if video_ids[i][0] == 'Bilibili':\n", " print(f'Video available at https://www.bilibili.com/video/{video.id}')\n", " elif video_ids[i][0] == 'Osf':\n", " print(f'Video available at https://osf.io/{video.id}')\n", " display(video)\n", " tab_contents.append(out)\n", " return tab_contents\n", "\n", "\n", "video_ids = [('Youtube', 'UOunkA_NB1Q'), ('Bilibili', 'BV1rh4y1E7NV')]\n", "tab_contents = display_videos(video_ids, W=730, H=410)\n", "tabs = widgets.Tab()\n", "tabs.children = tab_contents\n", "for i in range(len(tab_contents)):\n", " tabs.set_title(i, video_ids[i][0])\n", "display(tabs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Proxy_Data_Analysis_Tools_Video\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "pycharm": { "name": "#%%\n" }, "tags": [ "remove-input" ] }, "outputs": [], "source": [ "# @markdown\n", "from ipywidgets import widgets\n", "from IPython.display import IFrame\n", "\n", "link_id = \"ujbqz\"\n", "\n", "print(f\"If you want to download the slides: https://osf.io/download/{link_id}/\")\n", 
"IFrame(src=f\"https://mfr.ca-1.osf.io/render?url=https://osf.io/{link_id}/?direct%26mode=render%26action=download%26mode=render\", width=854, height=480)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Proxy_Data_Analysis_Tools_Slides\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 1: Load the sample dataset for analysis\n", "\n", "For this tutorial, we are going to use an example dataset to practice the various data analysis techniques. The dataset we'll be using is a record of hydrogen isotopes of leaf waxes (δDwax) from Lake Tanganyika in East Africa [(Tierney et al., 2008)](https://www.science.org/doi/10.1126/science.1160485?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed). Recall from the video that δDwax is a proxy that is typically thought to record changes in the amount of precipitation in the tropics via the amount effect. In the previous tutorial, we looked at δD data from high-latitude ice cores. In that case, δD was a proxy for temperature, but in the tropics, δD more commonly reflects rainfall amount, as explained in the introductory video.\n", "\n", "Let's first read the data from a `.csv` file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "filename_tang = \"tanganyika_dD.csv\"\n", "url_tang = \"https://osf.io/sujvp/download/\"\n", "tang_dD = pd.read_csv(pooch_load(filelocation=url_tang, filename=filename_tang))\n", "tang_dD.head()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We can now create a `Series` in Pyleoclim and assign names to different variables so that we can easily plot the data." 
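, "\n", "\n", "As a quick aside, you can also inspect the spacing of a time axis directly, without `Pyleoclim`. A minimal sketch with `numpy` (using short, hypothetical ages rather than this dataset):\n", "\n", "```python\n", "import numpy as np\n", "\n", "age = np.array([0.0, 180.0, 420.0, 900.0])  # hypothetical ages in yr BP\n", "steps = np.diff(age)  # spacing between consecutive samples\n", "print(steps)  # unequal steps indicate uneven sampling\n", "```\n", "\n", "If the printed steps are not all equal, the record is unevenly sampled, which is the situation the resampling tools in Section 2 address."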
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 736, "status": "ok", "timestamp": 1681484071869, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "ts_tang = pyleo.Series(\n", " time=tang_dD[\"Age\"],\n", " value=tang_dD[\"dD_IVonly\"],\n", " time_name=\"Age\",\n", " time_unit=\"yr BP\",\n", " value_name=\"dDwax\",\n", " value_unit=\"per mille\",\n", " label=\"Lake Tanganyika dDprecip\",\n", ")\n", "\n", "ts_tang.plot(color=\"C1\", invert_yaxis=True)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "You may notice that the y-axis is inverted. When we're plotting δD data, we typically invert the y-axis because more negative (\"depleted\") values suggest increased rainfall, whereas more positive (\"enriched\") values suggest decreased rainfall." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 2: Uniform Time-Sampling of the Data\n", "There are a number of reasons we might want to assign new values to our data. For example, if the data is not evenly spaced, we might need to resample in order to use a specific data analysis technique or to more easily compare to other data of a different sampling resolution. \n", "\n", "First, let's check whether our data is already evenly spaced using the `.is_evenly_spaced()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 138, "status": "ok", "timestamp": 1681484075182, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "ts_tang.is_evenly_spaced()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Our data is not evenly spaced. There are a few different methods available in `pyleoclim` to place the data on a uniform axis, and in this tutorial, we'll explore two: interpolation and binning. In general, these methods use the available data near a chosen time to estimate what the value was at that time, but each method differs in which nearby data points it uses and how it uses them.\n" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 2.1: Interpolation\n", "To start, let's try using interpolation to evenly space our data. Interpolation projects the data onto an evenly spaced time axis with a distance between points (step size) of our choosing. The data can be interpolated using a variety of methods: `linear`, `nearest`, `zero`, `slinear`, `quadratic`, `cubic`, `previous`, and `next`. More on these and their associated keyword arguments can be found in the [documentation](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.series.Series.interp). By default, the function `.interp()` implements linear interpolation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "tang_linear = ts_tang.interp() # default method = 'linear'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 167, "status": "ok", "timestamp": 1681484080381, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "# check whether or not the series is now evenly spaced\n", "tang_linear.is_evenly_spaced()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Now that we've interpolated our data, let's compare the original dataset to the linearly interpolated dataset we just created."
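, "\n", "\n", "Conceptually, linear interpolation connects neighboring samples with straight lines and reads values off those lines at the new, evenly spaced times. A minimal sketch of the idea with `numpy` (hypothetical values, not the Tanganyika record):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# hypothetical, unevenly spaced record\n", "age = np.array([0.0, 250.0, 900.0, 1000.0])  # yr BP\n", "dD = np.array([-110.0, -120.0, -100.0, -105.0])\n", "\n", "# evenly spaced 250-yr axis spanning the record\n", "age_even = np.arange(0.0, 1001.0, 250.0)\n", "dD_even = np.interp(age_even, age, dD)  # straight-line estimates\n", "print(dD_even)\n", "```\n", "\n", "`Pyleoclim`'s `.interp()` delegates to richer `scipy` interpolators for the other methods, but its linear default behaves like this sketch."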
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 443, "status": "ok", "timestamp": 1681484174043, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "fig, ax = plt.subplots() # assign a new plot axis\n", "ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n", "tang_linear.plot(ax=ax, label=\"Linear\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Notice that there are only minor differences between the original and linearly interpolated data.\n", "\n", "You can print the data in the original and interpolated time series to see the difference in the ages between the two datasets. The interpolated dataset is now evenly spaced with a δD value every ~290 years." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "ts_tang" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "tang_linear" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Let's compare a few of the different interpolation methods (e.g., quadratic, next, zero) with one another just to see how they are similar and different:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "fig, ax = plt.subplots() # assign a new plot axis\n", "ts_tang.plot(ax=ax, label=\"original\", invert_yaxis=True)\n", "for method in [\"linear\", \"quadratic\", \"next\", \"zero\"]:\n", " ts_tang.interp(method=method).plot(\n", " ax=ax, label=method, alpha=0.9\n", " ) # plot all the methods we want" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "The methods can produce slightly different results, but mostly reproduce the same overall trend. In this case, the quadratic method may be less appropriate than the other methods." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "## Section 2.2: Binning\n", "Another option for resampling our data onto a uniform time axis is binning. In binning, we define a set of time intervals, group the data points that fall within each interval, and average all the points in a ***bin*** to obtain a single value for that bin. By default, the bin size is set to the coarsest time spacing present in the dataset, and the data are averaged over a uniform sequence of such intervals. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "tang_bin = (\n", " ts_tang.bin()\n", ") # default settings pick the coarsest time spacing in the data as the binning period" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 457, "status": "ok", "timestamp": 1681484224785, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "fig, ax = plt.subplots() # assign a new plot axis\n", "ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n", "tang_bin.plot(ax=ax, label=\"Binned\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Again, notice that although there are some minor differences between the original and binned data, the records still capture the same overall trend." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "### Coding Exercises 2.2\n", "\n", "\n", "1. Experiment with different bin sizes to see how this affects the resampling. You can do so by adding `bin_size = ` to `ts_tang.bin()`. Try a bin size of 500 years and 1,000 years, and plot both of them on the same graph as the original data and the binned data using the default bin size."
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "execution": {} }, "source": [ "```python\n", "# bin size of 500\n", "tang_bin_500 = ...\n", "\n", "# bin size of 1000\n", "tang_bin_1000 = ...\n", "\n", "# plot\n", "fig, ax = plt.subplots() # assign a new plot axis\n", "_ = ...\n", "_ = ...\n", "_ = ...\n", "_ = ...\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove solution\n", "\n", "# bin size of 500\n", "tang_bin_500 = ts_tang.bin(bin_size=500)\n", "\n", "# bin size of 1000\n", "tang_bin_1000 = ts_tang.bin(bin_size=1000)\n", "\n", "# plot\n", "fig, ax = plt.subplots() # assign a new plot axis\n", "_ = ts_tang.plot(ax=ax, label=\"Original\", invert_yaxis=True)\n", "_ = tang_bin.plot(ax=ax, label=\"Binned Default\")\n", "_ = tang_bin_500.plot(ax=ax, label=\"Binned 500 yrs\")\n", "_ = tang_bin_1000.plot(ax=ax, label=\"Binned 1000 yrs\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Coding_Exercises_2_2\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Section 3: Principal Component Analysis (PCA)\n", "Large datasets, such as global climate datasets, are often difficult to interpret due to their multiple dimensions. Although tools such as Xarray help us to organize large, multidimensional climate datasets, it can still sometimes be difficult to interpret certain aspects of such data. Principal Component Analysis (PCA) is a tool for reducing the dimensionality of such datasets and increasing interpretability with minimal modification or loss of data. 
In other words, PCA allows us to reduce the number of variables of a dataset while preserving as much information as possible.\n", "\n", "The first step in PCA is to calculate a matrix that summarizes how the variables in a dataset all relate to one another. This matrix is then broken down into new uncorrelated variables that are linear combinations or mixtures of the initial variables. These new variables are the **principal components**. The initial dimensions of the dataset determine the number of principal components calculated. Most of the information within the initial variables is compressed into the first components. Additional details about PCA and the calculations involved can be found [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis) or [here](https://github.com/pbloem/pca-book/releases).\n", "\n", "Applied to paleoclimate, PCA can reduce the dimensionality of large paleoclimate datasets with multiple variables and can help us identify a common signal between various paleoclimate reconstructions. An example of a study that applies PCA to paleoclimate is [Otto-Bliesner et al., 2014](https://www.science.org/doi/full/10.1126/science.1259531). This study applies PCA to rainfall reconstructions from models and proxies from throughout Africa to determine common climate signals in these reconstructions.\n", "\n", "In this section, we will calculate the PCA of four δD paleoclimate records from Africa to assess common climate signals in the four records. In order to calculate the PCA of multiple paleoclimate time series, all of the records need to be on a common time step. To do so, we can apply the resampling tools we've learned in this tutorial." ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "So far, we've been looking at δD data from Lake Tanganyika in tropical East Africa. 
Let's compare this δD record to other existing δD records from lake and marine sediment cores in tropical Africa from the Gulf of Aden [(Tierney and deMenocal, 2013)](https://doi.org/10.1126/science.1240411), Lake Bosumtwi [(Shanahan et al., 2015)](https://doi.org/10.1038/ngeo2329), and the West African Margin [(Tierney et al., 2017)](https://doi.org/10.1126/sciadv.1601503).\n", "\n", "First, let's load these datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 484, "status": "ok", "timestamp": 1681484228855, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "# Gulf of Aden\n", "filename_aden = \"aden_dD.csv\"\n", "url_aden = \"https://osf.io/gm2v9/download/\"\n", "aden_dD = pd.read_csv(pooch_load(filelocation=url_aden, filename=filename_aden))\n", "aden_dD.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 370, "status": "ok", "timestamp": 1681484235076, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "# Lake Bosumtwi\n", "filename_Bosumtwi = \"bosumtwi_dD.csv\"\n", "url_Bosumtwi = \"https://osf.io/mr7d9/download/\"\n", "bosumtwi_dD = pd.read_csv(\n", " pooch_load(filelocation=url_Bosumtwi, filename=filename_Bosumtwi)\n", ")\n", "bosumtwi_dD.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 354, "status": "ok", "timestamp": 1681484237400, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "# GC27 (West African Margin)\n", "filename_GC27 = \"gc27_dD.csv\"\n", "url_GC27 = \"https://osf.io/k6e3a/download/\"\n", "gc27_dD = pd.read_csv(pooch_load(filelocation=url_GC27, filename=filename_GC27))\n", "gc27_dD.head()" ] }, {
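"cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Each of these records sits on its own uneven time axis. The bin-averaging we can lean on to put them on a common time step amounts to grouping samples into fixed-width age intervals and averaging within each interval. A minimal `pandas` sketch with made-up numbers (not these datasets):\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# made-up, unevenly spaced samples\n", "df = pd.DataFrame({\"age\": [120, 480, 510, 1450, 1520],\n", "                   \"dD\": [-110.0, -120.0, -118.0, -100.0, -104.0]})\n", "\n", "# label each sample with the start of its 500-yr bin, then average\n", "bin_start = (df[\"age\"] // 500) * 500\n", "binned = df.groupby(bin_start)[\"dD\"].mean()\n", "print(binned)\n", "```\n", "\n", "`Pyleoclim` handles the bookkeeping (bin edges, empty bins, units) for us, so this is only meant to demystify the averaging step." ] }, {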
"cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Next, let's convert each dataset into a `Series` in Pyleoclim." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "ts_tanganyika = pyleo.Series(\n", " time=tang_dD[\"Age\"],\n", " value=tang_dD[\"dD_IVonly\"],\n", " time_name=\"Age\",\n", " time_unit=\"yr BP\",\n", " value_name=\"dDwax\",\n", " label=\"Lake Tanganyika\",\n", ")\n", "ts_aden = pyleo.Series(\n", " time=aden_dD[\"age_calBP\"],\n", " value=aden_dD[\"dDwaxIVcorr\"],\n", " time_name=\"Age\",\n", " time_unit=\"yr BP\",\n", " value_name=\"dDwax\",\n", " label=\"Gulf of Aden\",\n", ")\n", "ts_bosumtwi = pyleo.Series(\n", " time=bosumtwi_dD[\"age_calBP\"],\n", " value=bosumtwi_dD[\"d2HleafwaxC31ivc\"],\n", " time_name=\"Age\",\n", " time_unit=\"yr BP\",\n", " value_name=\"dDwax\",\n", " label=\"Lake Bosumtwi\",\n", ")\n", "ts_gc27 = pyleo.Series(\n", " time=gc27_dD[\"age_BP\"],\n", " value=gc27_dD[\"dDwax_iv\"],\n", " time_name=\"Age\",\n", " time_unit=\"yr BP\",\n", " value_name=\"dDwax\",\n", " label=\"GC27\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "Now let's set up a `MultipleSeries` using Pyleoclim with all four δD datasets. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "ts_list = [ts_tanganyika, ts_aden, ts_bosumtwi, ts_gc27]\n", "ms_africa = pyleo.MultipleSeries(ts_list, label=\"African dDwax\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We can now create a stackplot with all four δD records:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 756, "status": "ok", "timestamp": 1681484266500, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "fig, ax = ms_africa.stackplot()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "By creating a stackplot, we can visually compare the datasets. However, the four δD records aren't the same resolution and don't span the same time interval.\n", "\n", "To better compare the records and assess a common trend, we can use PCA. First, we can use [`.common_time()`](https://pyleoclim-util.readthedocs.io/en/v0.7.2/core/ui.html#pyleoclim.core.ui.MultipleSeries.common_time) to place the records on a shared time axis with a common sampling frequency. This function takes the argument `method`, which can be `bin`, `interp`, or `gdkernel`. The binning and interpolation methods are what we just covered in the previous section. Let's set the time step to 500 years, the method to `interp`, and standardize the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 1048, "status": "ok", "timestamp": 1681484350993, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "africa_ct = ms_africa.common_time(method=\"interp\", step=0.5).standardize()\n", "fig, ax = africa_ct.stackplot()" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "We now have standardized δD records that share the same sampling resolution and span the same time interval. Note this meant trimming the longer time series down to around 20,000 years in length.\n", "\n", "We can now apply PCA, which will allow us to quantitatively identify a common signal between the four δD paleoclimate reconstructions, through the [`.pca()`](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.multipleseries.MultipleSeries.pca) method. \n", "\n", "The `pyleoclim` package has its own PCA method (e.g., `africa_ct.pca()`), but it is too computationally intensive for use in this course. For our purposes, we will use the PCA implementation from `sklearn`. See this [example](https://vitalflux.com/pca-explained-variance-concept-python-example/) for further details." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "pca = PCA()\n", "africa_PCA = pca.fit(africa_ct.to_pandas())" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "The result is an object containing multiple outputs. The two outputs we'll look at are the percentage of variance accounted for by each mode and the principal components themselves. \n", "\n", "First, let's print the percentage of variance accounted for by each mode, noting that the first principal component is first in the list, and so on."
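, "\n", "\n", "For intuition: the percentages reported by `sklearn` are the eigenvalues of the data's covariance matrix, sorted in decreasing order and normalized to sum to one. A minimal check on synthetic data (not our δD records):\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.decomposition import PCA\n", "\n", "rng = np.random.default_rng(0)\n", "X = rng.normal(size=(100, 4))\n", "X[:, 1] += 2 * X[:, 0]  # plant a shared signal in two columns\n", "\n", "ratios = PCA().fit(X).explained_variance_ratio_\n", "\n", "# recover the same ratios from the covariance eigenvalues\n", "eig = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]\n", "print(np.allclose(ratios, eig / eig.sum()))  # True\n", "```\n", "\n", "Because we planted a shared signal, the first ratio dominates, just as the leading mode will for the four δD records."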
] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 224, "status": "ok", "timestamp": 1681484449770, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "print(africa_PCA.explained_variance_ratio_)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "This means that most of the variance in the four paleoclimate records is explained by the first principal component (the first value displayed). The number of datasets in the PCA constrains the number of principal components that can be defined, which is why we only have four components in this example.\n", "\n", "We can now look at the principal component of the first mode of variance. Let's create a new series for the first principal component and plot it against the original datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "africa_pc1 = africa_PCA.transform(africa_ct.to_pandas())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "africa_mode1 = pyleo.Series(\n", " time=africa_ct.series_list[0].time,\n", " value=africa_pc1[:, 0],\n", " label=r'$PC_1$',\n", " value_name='PC1',\n", " time_name='age',\n", " time_unit='yr BP'\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {}, "executionInfo": { "elapsed": 1026, "status": "ok", "timestamp": 1681484506972, "user": { "displayName": "Sloane Garelick", "userId": "04706287370408131987" }, "user_tz": 240 } }, "outputs": [], "source": [ "fig, ax1 = plt.subplots()\n", "\n", "ax1.set_ylabel(\"dDwax\")\n", "ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis\n", "ax2.set_ylabel(\"PC1\") # we already handled the x-label with ax1\n", "\n", "africa_mode1.plot(color=\"black\", ax=ax2, 
invert_yaxis=True)\n", "africa_ct.plot(ax=ax1, linewidth=0.5)" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "The original δD records are shown in the colored lines, and the first principal component (PC1) time series is shown in black. \n", "\n", "## Questions 3: Climate Connection\n", " \n", "\n", "1. How do the original time series compare to the PC1 time series? Do they record similar trends?\n", "2. Which original δD record most closely resembles the PC1 time series? Which is the most different?\n", "3. What changes in climate does the PC1 time series record over the past 20,000 years? *Hint: remember that more depleted (more negative) δD suggests increased rainfall.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": {} }, "outputs": [], "source": [ "# to_remove explanation\n", "\n", "\"\"\"\n", "1. The overall trends are similar. However, there is more variability in the original δD records than in the PC1 time series. This makes sense because the principal component analysis identifies a common signal between original δD reconstructions and therefore is inherently less variable.\n", "2. The Gulf of Aden δD record is the most similar to the PC1 time series. The Lake Bosumtwi δD record is the least similar to the PC1 time series. This difference is particularly noticeable between 5,000-10,000 years ago.\n", "3. The PC1 time series suggests a drier climate (more positive δD) in Africa over the past 20,000 years. Recall that 20,000 years ago was the last glacial period. African rainfall increased ~15,000 years ago during the deglacial transition. There is a short period of drying ~12,000 years ago, which corresponds to the timing of the Younger Dryas (a millennial-scale, Northern Hemisphere high-latitude cooling event). 
This is followed by an increase in rainfall associated with what is known as the African Humid Period (a period of wet, humid conditions in northern and equatorial Africa, driven by intensification of the African monsoon). Finally, there is a decrease in rainfall over the past ~8,000 years towards the present. Note: δD only records qualitative changes in rainfall (i.e., wetter or drier relative to another time period), and does not provide quantitative measurements of the amount of rainfall.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit your feedback\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "execution": {}, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# @title Submit your feedback\n", "content_review(f\"{feedback_prefix}_Questions_3\")" ] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Summary\n", "In this tutorial, you explored a variety of computational techniques for analyzing paleoclimate data. You learned how to handle irregularly sampled data and place these records on a uniform time axis using interpolation and binning. \n", "\n", "You then explored Principal Component Analysis (PCA), a method that reveals common signals across various paleoclimate reconstructions. To effectively utilize PCA, it's essential to have all records at the same sampling resolution on a common time axis, which can be achieved using the resampling tools you learned in this tutorial.\n", "\n", "As a practical example, we used a dataset of hydrogen isotopes of leaf waxes (δDwax) from Lake Tanganyika in East Africa to further enhance your understanding of the discussed techniques, equipping you to better analyze and understand the complexities of paleoclimate data." 
] }, { "cell_type": "markdown", "metadata": { "execution": {} }, "source": [ "# Resources\n", "\n", "Code for this tutorial is based on existing notebooks from LinkedEarth for [analyzing LiPD datasets](https://github.com/LinkedEarth/paleoHackathon/blob/main/notebooks/PaleoHack-nb03_EDA.ipynb) and [resampling data with `Pyleoclim`](https://github.com/LinkedEarth/PyleoTutorials/blob/main/notebooks/L1_uniform_time_sampling.ipynb).\n", "\n", "Data from the following sources are used in this tutorial:\n", "\n", "* Tierney, J.E., et al. 2008. Northern Hemisphere Controls on Tropical Southeast African Climate During the Past 60,000 Years. Science, 322(5899), 252-255. https://doi.org/10.1126/science.1160485.\n", "* Tierney, J.E., and deMenocal, P. 2013. Abrupt Shifts in Horn of Africa Hydroclimate Since the Last Glacial Maximum. Science, 342(6160), 843-846. https://doi.org/10.1126/science.1240411.\n", "* Tierney, J.E., Pausata, F., and deMenocal, P. 2017. Rainfall Regimes of the Green Sahara. Science Advances, 3(1), e1601503. https://doi.org/10.1126/sciadv.1601503.\n", "* Shanahan, T.M., et al. 2015. The time-transgressive termination of the African Humid Period. Nature Geoscience, 8(2), 140-144. https://doi.org/10.1038/ngeo2329." 
] } ], "metadata": { "colab": { "collapsed_sections": [], "include_colab_link": true, "machine_shape": "hm", "name": "W1D4_Tutorial5", "provenance": [ { "file_id": "1l594NT5i1ubNU9d5esRFZvwj87ivhusI", "timestamp": 1680610931501 }, { "file_id": "1lHuVrVtAW4fQzc0dFdlZuwY0i71KWw_t", "timestamp": 1677637469824 } ], "toc_visible": true }, "gpuClass": "standard", "kernel": { "display_name": "Python 3", "language": "python", "name": "python3" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 4 }