{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pkKT4yIHPgAE"
   },
   "source": [
    "## Lecture 19, 21 October 2025"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TbBer1CPPgAa"
   },
   "source": [
    "### Pandas (Python and data analysis)\n",
    "- Built on top of numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tj2ECmErPgAb"
   },
   "source": [
    "### Series and data frames\n",
    "\n",
    "- Numpy defines homogeneous n-dimensional arrays\n",
    "\n",
    "- Data science works with tables: 2-dimensional arrays\n",
    "\n",
    "- Pandas has two fundamental data structures\n",
    "\n",
    "    - Series : A column of data\n",
    "    - Data Frame : A table of data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mGkRmLdiPgAb"
   },
   "source": [
    "### Key difference\n",
    "- Numpy indices are always `[0..n-1]` in each dimension\n",
    "- Pandas allows more flexible “named” indices for rows and columns\n",
    "    - Dictionary vs list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Vg_F3C1ZPgAb"
   },
   "source": [
    "### Load pandas\n",
    "\n",
    "- Don't need to import numpy unless one is separately using numpy arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "UMOE0-Z0PgAb"
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NlTrw05-PgAc"
   },
   "source": [
    "### Create a series\n",
    "\n",
    "- Convert a sequence into a series (column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Ki4ofeqFPgAc",
    "outputId": "d6cd35c8-1905-4183-c330-e2b4b3618da5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     AMA\n",
       "1     IBM\n",
       "2    GOOG\n",
       "3    META\n",
       "dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "h = ['AMA', 'IBM', 'GOOG', 'META']\n",
    "s = pd.Series(h)\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xCEa9a0kPgAd",
    "outputId": "7d2a9a11-7592-41b2-aa85-15651045f664"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The intention is that a series represents a column, so it should be of a uniform type, like a numpy array\n",
    "- However, pandas does permit non-uniform series! Best avoided in practice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       1\n",
       "1     Two\n",
       "2    True\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hbad = [1,'Two',True]\n",
    "sbad = pd.Series(hbad)\n",
    "sbad"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bgnPs1yvPgAd"
   },
   "source": [
    "### Convert a dictionary to a series\n",
    "- Keys become \"row indices\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "tsPs72m5PgAd",
    "outputId": "56b4491f-d519-4505-bc0c-bec5c93c1079"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A     AMA\n",
       "B     IBM\n",
       "C    GOOG\n",
       "D    META\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = {'A':'AMA', 'B':'IBM', 'C':'GOOG', 'D':'META'}\n",
    "ds = pd.Series(d)\n",
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wC5zhBx3PgAe",
    "outputId": "15ca5295-3db0-4d4a-b006-1c87b2bb29e3"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.series.Series"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(ds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ggyM0h0rPgAe"
   },
   "source": [
    "### Creating an index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Provide a separate sequence of index headers, same length as values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "k9uXQWIlPgAe"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A     AMA\n",
       "B     IBM\n",
       "C    GOOG\n",
       "D    META\n",
       "dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f = ['AMA', 'IBM', 'GOOG', 'META']\n",
    "fs = pd.Series(f, index = ['A','B', 'C', 'D'])\n",
    "fs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PNzTNgEBPgAf"
   },
   "source": [
    "### Accessing elements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Using named index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "RE2GcqQNPgAf",
    "outputId": "68b4e74a-dc92-4947-bdb3-bfbc35d01a7d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'IBM'"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs['B']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Using position\n",
    "    - Note:`fs[3]` also works but is *deprecated* -- may be disallowed in future versions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 38
    },
    "id": "XVuamQAVPgAf",
    "outputId": "3b5015c1-3413-4738-e9d5-a7355d7bd4e5"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_88733/1188991964.py:1: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`\n",
      "  fs.iloc[3], fs['D'], fs[3]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "('META', 'META', 'META')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs.iloc[3], fs['D'], fs[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Using a slice of positions\n",
    "    - Here the indices behave like positions in a list\n",
    "    - Slice `[i:j]` runs from `i` to `j-1`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "hoEAChl3PgAf",
    "outputId": "e73953ea-6eb0-44a1-ddd0-22e25dd25607"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A    AMA\n",
       "B    IBM\n",
       "dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs.iloc[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Using a sequence of positions\n",
    "- Can be out of order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "pEpTNQ01PgAg",
    "outputId": "4e920548-e8a1-451f-99cf-e10bd85cfa1d"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "A     AMA\n",
       "C    GOOG\n",
       "B     IBM\n",
       "dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs.iloc[[0,2,1]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Can do the same with named indices\n",
    "    - Slice uses position of indices\n",
    "    - However, includes last index in the slice, unlike positional slice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "9OpuHyldPgAg",
    "outputId": "c1b4892b-a48b-4f98-e2ac-aaf4d01c0752"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "B     IBM\n",
       "C    GOOG\n",
       "D    META\n",
       "dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs['B':'D']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- An invalid slice range produces an empty output, as usual, not an error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Series([], dtype: object)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs['C':'A']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "C    GOOG\n",
       "B     IBM\n",
       "dtype: object"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fs[['C','B']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mWfQXDtFPgAg"
   },
   "source": [
    "### Data frames\n",
    "\n",
    "- A table is a sequence of columns\n",
    "- A data frame is a sequence of series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "id": "tNjNBmz5PgAh"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name        date  shares  price\n",
       "0    AA  2001-12-01     100   12.3\n",
       "1   IBM  2012-02-10      30   10.3\n",
       "2  GOOG  2010-04-09      90   32.2"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data2 = {'name' : ['AA', 'IBM', 'GOOG'], \n",
    "        'date' : ['2001-12-01', '2012-02-10', '2010-04-09'],\n",
    "        'shares' : [100, 30, 90],\n",
    "        'price' : [12.3, 10.3, 32.2]\n",
    "}\n",
    "df2 = pd.DataFrame(data2)\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Table (data frame) is specified by column (series)\n",
    "- Each column should have the same length, else error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "id": "tNjNBmz5PgAh"
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "All arrays must be of the same length",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mValueError\u001b[39m                                Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[16]\u001b[39m\u001b[32m, line 6\u001b[39m\n\u001b[32m      1\u001b[39m data2bad = {\u001b[33m'\u001b[39m\u001b[33mname\u001b[39m\u001b[33m'\u001b[39m : [\u001b[33m'\u001b[39m\u001b[33mAA\u001b[39m\u001b[33m'\u001b[39m, \u001b[33m'\u001b[39m\u001b[33mIBM\u001b[39m\u001b[33m'\u001b[39m, \u001b[33m'\u001b[39m\u001b[33mGOOG\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mMETA\u001b[39m\u001b[33m'\u001b[39m], \n\u001b[32m      2\u001b[39m         \u001b[33m'\u001b[39m\u001b[33mdate\u001b[39m\u001b[33m'\u001b[39m : [\u001b[33m'\u001b[39m\u001b[33m2001-12-01\u001b[39m\u001b[33m'\u001b[39m, \u001b[33m'\u001b[39m\u001b[33m2012-02-10\u001b[39m\u001b[33m'\u001b[39m, \u001b[33m'\u001b[39m\u001b[33m2010-04-09\u001b[39m\u001b[33m'\u001b[39m],\n\u001b[32m      3\u001b[39m         \u001b[33m'\u001b[39m\u001b[33mshares\u001b[39m\u001b[33m'\u001b[39m : [\u001b[32m100\u001b[39m, \u001b[32m30\u001b[39m, \u001b[32m90\u001b[39m],\n\u001b[32m      4\u001b[39m         \u001b[33m'\u001b[39m\u001b[33mprice\u001b[39m\u001b[33m'\u001b[39m : [\u001b[32m12.3\u001b[39m, \u001b[32m10.3\u001b[39m, \u001b[32m32.2\u001b[39m]\n\u001b[32m      5\u001b[39m }\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m df2bad = \u001b[43mpd\u001b[49m\u001b[43m.\u001b[49m\u001b[43mDataFrame\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata2bad\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m      7\u001b[39m df2bad\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/frame.py:778\u001b[39m, in \u001b[36mDataFrame.__init__\u001b[39m\u001b[34m(self, data, index, columns, dtype, copy)\u001b[39m\n\u001b[32m    772\u001b[39m     mgr = \u001b[38;5;28mself\u001b[39m._init_mgr(\n\u001b[32m    773\u001b[39m         data, axes={\u001b[33m\"\u001b[39m\u001b[33mindex\u001b[39m\u001b[33m\"\u001b[39m: index, \u001b[33m\"\u001b[39m\u001b[33mcolumns\u001b[39m\u001b[33m\"\u001b[39m: columns}, dtype=dtype, copy=copy\n\u001b[32m    774\u001b[39m     )\n\u001b[32m    776\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, \u001b[38;5;28mdict\u001b[39m):\n\u001b[32m    777\u001b[39m     \u001b[38;5;66;03m# GH#38939 de facto copy defaults to False only in non-dict cases\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m778\u001b[39m     mgr = \u001b[43mdict_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdata\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcopy\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcopy\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[43m=\u001b[49m\u001b[43mmanager\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    779\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data, ma.MaskedArray):\n\u001b[32m    780\u001b[39m     \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mnumpy\u001b[39;00m\u001b[34;01m.\u001b[39;00m\u001b[34;01mma\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m mrecords\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/internals/construction.py:503\u001b[39m, in \u001b[36mdict_to_mgr\u001b[39m\u001b[34m(data, index, columns, dtype, typ, copy)\u001b[39m\n\u001b[32m    499\u001b[39m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m    500\u001b[39m         \u001b[38;5;66;03m# dtype check to exclude e.g. range objects, scalars\u001b[39;00m\n\u001b[32m    501\u001b[39m         arrays = [x.copy() \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(x, \u001b[33m\"\u001b[39m\u001b[33mdtype\u001b[39m\u001b[33m\"\u001b[39m) \u001b[38;5;28;01melse\u001b[39;00m x \u001b[38;5;28;01mfor\u001b[39;00m x \u001b[38;5;129;01min\u001b[39;00m arrays]\n\u001b[32m--> \u001b[39m\u001b[32m503\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43marrays_to_mgr\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolumns\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtyp\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtyp\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconsolidate\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcopy\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/internals/construction.py:114\u001b[39m, in \u001b[36marrays_to_mgr\u001b[39m\u001b[34m(arrays, columns, index, dtype, verify_integrity, typ, consolidate)\u001b[39m\n\u001b[32m    111\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m verify_integrity:\n\u001b[32m    112\u001b[39m     \u001b[38;5;66;03m# figure out the index, if necessary\u001b[39;00m\n\u001b[32m    113\u001b[39m     \u001b[38;5;28;01mif\u001b[39;00m index \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m114\u001b[39m         index = \u001b[43m_extract_index\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrays\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    115\u001b[39m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m    116\u001b[39m         index = ensure_index(index)\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/internals/construction.py:677\u001b[39m, in \u001b[36m_extract_index\u001b[39m\u001b[34m(data)\u001b[39m\n\u001b[32m    675\u001b[39m lengths = \u001b[38;5;28mlist\u001b[39m(\u001b[38;5;28mset\u001b[39m(raw_lengths))\n\u001b[32m    676\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(lengths) > \u001b[32m1\u001b[39m:\n\u001b[32m--> \u001b[39m\u001b[32m677\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33mAll arrays must be of the same length\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m    679\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m have_dicts:\n\u001b[32m    680\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m    681\u001b[39m         \u001b[33m\"\u001b[39m\u001b[33mMixing dicts with non-Series may lead to ambiguous ordering.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    682\u001b[39m     )\n",
      "\u001b[31mValueError\u001b[39m: All arrays must be of the same length"
     ]
    }
   ],
   "source": [
    "data2bad = {'name' : ['AA', 'IBM', 'GOOG','META'], \n",
    "        'date' : ['2001-12-01', '2012-02-10', '2010-04-09'],\n",
    "        'shares' : [100, 30, 90],\n",
    "        'price' : [12.3, 10.3, 32.2]\n",
    "}\n",
    "df2bad = pd.DataFrame(data2bad)\n",
    "df2bad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can create a data frame from an anonymous sequence of sequences\n",
    "    - In this case, each inner sequence is interpreted as a *row*, not a *column*\n",
    "- Both rows and columns are indexed by position"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>IBM</td>\n",
       "      <td>GOOG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>2010-04-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>100</td>\n",
       "      <td>30</td>\n",
       "      <td>90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12.3</td>\n",
       "      <td>10.3</td>\n",
       "      <td>32.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            0           1           2\n",
       "0          AA         IBM        GOOG\n",
       "1  2001-12-01  2012-02-10  2010-04-09\n",
       "2         100          30          90\n",
       "3        12.3        10.3        32.2"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data3 = (['AA', 'IBM', 'GOOG'], \n",
    "         ['2001-12-01', '2012-02-10', '2010-04-09'],\n",
    "         [100, 30, 90],\n",
    "         [12.3, 10.3, 32.2])\n",
    "df3 = pd.DataFrame(data3)\n",
    "df3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2ehGsALqPgAh"
   },
   "source": [
    "### Add a column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can add a column\n",
    "    - Provide a default value for all rows, or\n",
    "    - Provide a sequence of values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "id": "6EXQJvyJPgAi"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name        date  shares  price    owner\n",
       "0    AA  2001-12-01     100   12.3  Unknown\n",
       "1   IBM  2012-02-10      30   10.3  Unknown\n",
       "2  GOOG  2010-04-09      90   32.2  Unknown"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2['owner'] = 'Unknown'\n",
    "# df2['owner'] = ['a','b','c']\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "G4UslwEYPgAi",
    "outputId": "27f9625d-677c-4f76-86a5-63c877febaf5"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "      <th>owner2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name        date  shares  price    owner owner2\n",
       "0    AA  2001-12-01     100   12.3  Unknown      a\n",
       "1   IBM  2012-02-10      30   10.3  Unknown      b\n",
       "2  GOOG  2010-04-09      90   32.2  Unknown      c"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#df2['owner'] = 'Unknown'\n",
    "df2['owner2'] = ['a','b','c']\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- If we provide a sequence of values, it must cover all rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Length of values (2) does not match length of index (3)",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mValueError\u001b[39m                                Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[21]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m      1\u001b[39m \u001b[38;5;66;03m#df2['owner'] = 'Unknown'\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m \u001b[43mdf2\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m'\u001b[39;49m\u001b[33;43mowner3\u001b[39;49m\u001b[33;43m'\u001b[39;49m\u001b[43m]\u001b[49m = [\u001b[33m'\u001b[39m\u001b[33ma\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mb\u001b[39m\u001b[33m'\u001b[39m]\n\u001b[32m      3\u001b[39m df2\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/frame.py:4316\u001b[39m, in \u001b[36mDataFrame.__setitem__\u001b[39m\u001b[34m(self, key, value)\u001b[39m\n\u001b[32m   4313\u001b[39m     \u001b[38;5;28mself\u001b[39m._setitem_array([key], value)\n\u001b[32m   4314\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m   4315\u001b[39m     \u001b[38;5;66;03m# set column\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m4316\u001b[39m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_set_item\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvalue\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/frame.py:4529\u001b[39m, in \u001b[36mDataFrame._set_item\u001b[39m\u001b[34m(self, key, value)\u001b[39m\n\u001b[32m   4519\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_set_item\u001b[39m(\u001b[38;5;28mself\u001b[39m, key, value) -> \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m   4520\u001b[39m \u001b[38;5;250m    \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m   4521\u001b[39m \u001b[33;03m    Add series to DataFrame in specified column.\u001b[39;00m\n\u001b[32m   4522\u001b[39m \n\u001b[32m   (...)\u001b[39m\u001b[32m   4527\u001b[39m \u001b[33;03m    ensure homogeneity.\u001b[39;00m\n\u001b[32m   4528\u001b[39m \u001b[33;03m    \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m4529\u001b[39m     value, refs = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_sanitize_column\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalue\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   4531\u001b[39m     \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[32m   4532\u001b[39m         key \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.columns\n\u001b[32m   4533\u001b[39m         \u001b[38;5;129;01mand\u001b[39;00m value.ndim == \u001b[32m1\u001b[39m\n\u001b[32m   4534\u001b[39m         \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(value.dtype, ExtensionDtype)\n\u001b[32m   4535\u001b[39m     ):\n\u001b[32m   4536\u001b[39m         \u001b[38;5;66;03m# broadcast across multiple columns if necessary\u001b[39;00m\n\u001b[32m   4537\u001b[39m         \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m.columns.is_unique \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(\u001b[38;5;28mself\u001b[39m.columns, MultiIndex):\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/frame.py:5273\u001b[39m, in \u001b[36mDataFrame._sanitize_column\u001b[39m\u001b[34m(self, value)\u001b[39m\n\u001b[32m   5270\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m _reindex_for_setitem(value, \u001b[38;5;28mself\u001b[39m.index)\n\u001b[32m   5272\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m is_list_like(value):\n\u001b[32m-> \u001b[39m\u001b[32m5273\u001b[39m     \u001b[43mcom\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrequire_length_match\u001b[49m\u001b[43m(\u001b[49m\u001b[43mvalue\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mindex\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   5274\u001b[39m arr = sanitize_array(value, \u001b[38;5;28mself\u001b[39m.index, copy=\u001b[38;5;28;01mTrue\u001b[39;00m, allow_2d=\u001b[38;5;28;01mTrue\u001b[39;00m)\n\u001b[32m   5275\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[32m   5276\u001b[39m     \u001b[38;5;28misinstance\u001b[39m(value, Index)\n\u001b[32m   5277\u001b[39m     \u001b[38;5;129;01mand\u001b[39;00m value.dtype == \u001b[33m\"\u001b[39m\u001b[33mobject\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m   (...)\u001b[39m\u001b[32m   5280\u001b[39m     \u001b[38;5;66;03m# TODO: Remove kludge in sanitize_array for string mode when enforcing\u001b[39;00m\n\u001b[32m   5281\u001b[39m     \u001b[38;5;66;03m# this deprecation\u001b[39;00m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/common.py:573\u001b[39m, in \u001b[36mrequire_length_match\u001b[39m\u001b[34m(data, index)\u001b[39m\n\u001b[32m    569\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m    570\u001b[39m \u001b[33;03mCheck the length of data matches the length of the index.\u001b[39;00m\n\u001b[32m    571\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m    572\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(data) != \u001b[38;5;28mlen\u001b[39m(index):\n\u001b[32m--> \u001b[39m\u001b[32m573\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m    574\u001b[39m         \u001b[33m\"\u001b[39m\u001b[33mLength of values \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    575\u001b[39m         \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m(\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mlen\u001b[39m(data)\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m) \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    576\u001b[39m         \u001b[33m\"\u001b[39m\u001b[33mdoes not match length of index \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    577\u001b[39m         \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m(\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mlen\u001b[39m(index)\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m)\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m    578\u001b[39m     )\n",
      "\u001b[31mValueError\u001b[39m: Length of values (2) does not match length of index (3)"
     ]
    }
   ],
   "source": [
    "#df2['owner'] = 'Unknown'\n",
    "df2['owner3'] = ['a','b']\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AH576FnePgAi"
   },
   "source": [
    "### Add row indices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "id": "Gnjq4v9mPgAi"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "      <th>owner2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       name        date  shares  price    owner owner2\n",
       "one      AA  2001-12-01     100   12.3  Unknown      a\n",
       "two     IBM  2012-02-10      30   10.3  Unknown      b\n",
       "three  GOOG  2010-04-09      90   32.2  Unknown      c"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.index = ['one','two','three']\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6EIzcQ2fPgAj"
   },
   "source": [
    "### Convert one of the columns into an index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Note that we lose the previous indices `(one, two, three)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "id": "w7ZqJ6DNPgAj"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "      <th>owner2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price    owner owner2\n",
       "name                                           \n",
       "AA    2001-12-01     100   12.3  Unknown      a\n",
       "IBM   2012-02-10      30   10.3  Unknown      b\n",
       "GOOG  2010-04-09      90   32.2  Unknown      c"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = df2.set_index(['name'])\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BguaWXy9PgAj"
   },
   "source": [
    "### Replace an index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Again, we lose the previous index, `name`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "id": "VKD73SSAPgAj"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>owner</th>\n",
       "      <th>owner2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>price</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12.3</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>a</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10.3</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32.2</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>c</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             date  shares    owner owner2\n",
       "price                                    \n",
       "12.3   2001-12-01     100  Unknown      a\n",
       "10.3   2012-02-10      30  Unknown      b\n",
       "32.2   2010-04-09      90  Unknown      c"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = df2.set_index(['price'])\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JfohmzPGPgAk"
   },
   "source": [
    "### Use multiple columns for indexing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "id": "s_cNSqlsPgAk"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th>price</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <th>12.3</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <th>10.3</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <th>32.2</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  date  shares    owner\n",
       "name price                             \n",
       "AA   12.3   2001-12-01     100  Unknown\n",
       "IBM  10.3   2012-02-10      30  Unknown\n",
       "GOOG 32.2   2010-04-09      90  Unknown"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame(data2)  # Reset data frame to original\n",
    "df2['owner'] = 'Unknown'\n",
    "df2 = df2.set_index(['name','price'])\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZQIBx-bvPgAk"
   },
   "source": [
    "### Accessing values in a dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cmn7wKT7PgAl"
   },
   "source": [
    "### By column index\n",
    "- Similar to projection in relational algebra"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "KeujX9clPgAl",
    "outputId": "c3c6fce0-63f3-45ac-8865-989fbba6a608"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>shares</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th>price</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <th>12.3</th>\n",
       "      <td>100</td>\n",
       "      <td>2001-12-01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <th>10.3</th>\n",
       "      <td>30</td>\n",
       "      <td>2012-02-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <th>32.2</th>\n",
       "      <td>90</td>\n",
       "      <td>2010-04-09</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            shares        date\n",
       "name price                    \n",
       "AA   12.3      100  2001-12-01\n",
       "IBM  10.3       30  2012-02-10\n",
       "GOOG 32.2       90  2010-04-09"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2[['shares','date']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QV7Hpjp3PgAl"
   },
   "source": [
    "### By row index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 112
    },
    "id": "EIMVZnAvPgAl",
    "outputId": "117560c8-8721-4499-ae5c-a6ad519cefc1"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>price</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>12.3</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             date  shares    owner\n",
       "price                             \n",
       "12.3   2001-12-01     100  Unknown"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.loc['AA']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lLB6UWegPgAl"
   },
   "source": [
    "### Individual element - specify row and column index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "igcbooi9PgAm",
    "outputId": "91b4a4d3-a8cb-4fc7-e46f-a99fc4a9d0df"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "price\n",
       "12.3    100\n",
       "Name: shares, dtype: int64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.loc['AA','shares']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- What happens if the index column does not have unique values?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AA</td>\n",
       "      <td>2025-10-21</td>\n",
       "      <td>100</td>\n",
       "      <td>33.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name        date  shares  price\n",
       "0    AA  2001-12-01     100   12.3\n",
       "1   IBM  2012-02-10      30   10.3\n",
       "2  GOOG  2010-04-09      90   32.2\n",
       "3    AA  2025-10-21     100   33.3"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data3dup = {'name' : ['AA', 'IBM', 'GOOG','AA'], \n",
    "        'date' : ['2001-12-01', '2012-02-10', '2010-04-09', '2025-10-21'],\n",
    "        'shares' : [100, 30, 90, 100],\n",
    "        'price' : [12.3, 10.3, 32.2, 33.3]\n",
    "}\n",
    "df3dup = pd.DataFrame(data3dup)\n",
    "df3dup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2025-10-21</td>\n",
       "      <td>100</td>\n",
       "      <td>33.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price\n",
       "name                           \n",
       "AA    2001-12-01     100   12.3\n",
       "IBM   2012-02-10      30   10.3\n",
       "GOOG  2010-04-09      90   32.2\n",
       "AA    2025-10-21     100   33.3"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df3dup = df3dup.set_index('name')\n",
    "df3dup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- So far so good"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2025-10-21</td>\n",
       "      <td>100</td>\n",
       "      <td>33.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price\n",
       "name                           \n",
       "AA    2001-12-01     100   12.3\n",
       "AA    2025-10-21     100   33.3"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df3dup.loc['AA']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- So there is no requirement that the index column have unique values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wQR8PHE1PgAm"
   },
   "source": [
    "### Slices, etc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bZI5mNE3PgAm",
    "outputId": "25040106-e2c4-4f56-f81f-6b640d822091"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name  price\n",
       "AA    12.3     100\n",
       "IBM   10.3      30\n",
       "GOOG  32.2      90\n",
       "Name: shares, dtype: int64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.loc[:,'shares']  # All rows, column 'shares'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "fKcermSgPgAm",
    "outputId": "f77bda9d-3050-42cb-f0b7-3dd8c22dbedc"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price    owner\n",
       "name                                    \n",
       "AA    2001-12-01     100   12.3  Unknown\n",
       "IBM   2012-02-10      30   10.3  Unknown\n",
       "GOOG  2010-04-09      90   32.2  Unknown"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame(data2)  # Reset data frame to original\n",
    "df2['owner'] = 'Unknown'\n",
    "df2 = df2.set_index(['name'])\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- An arbitrary subtable of rows and columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "-p1yY-FiT3er",
    "outputId": "6a45dca9-f734-41ca-aef5-26e279a90f3a"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      shares  price    owner\n",
       "name                        \n",
       "AA       100   12.3  Unknown\n",
       "IBM       30   10.3  Unknown"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.loc['AA':'IBM','shares':'owner']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GxCUyNbAPgAm"
   },
   "source": [
    "- Unlike series, cannot use position indices for rows if \"real\" index exists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "5xGtvWCCPgAn",
    "outputId": "6fa732b8-9852-4acc-df0f-c3f8db2cd028"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GOOG</td>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   name        date  shares  price    owner\n",
       "0    AA  2001-12-01     100   12.3  Unknown\n",
       "1   IBM  2012-02-10      30   10.3  Unknown\n",
       "2  GOOG  2010-04-09      90   32.2  Unknown"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = pd.DataFrame(data2) # Reset data frame to original\n",
    "df2['owner'] = 'Unknown'\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can slice the rows\n",
    "    - Note that for data frames, `0`, `1`, ... are treated as labels, so slice works like for named indices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 112
    },
    "id": "fv7FuhpbPgAn",
    "outputId": "93e05955-8a08-4a07-af6f-70a1d04354c6"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AA</td>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IBM</td>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  name        date  shares  price    owner\n",
       "0   AA  2001-12-01     100   12.3  Unknown\n",
       "1  IBM  2012-02-10      30   10.3  Unknown"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.loc[0:1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Now create a row index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "rslbUAhgPgAn",
    "outputId": "4587aa0e-59a0-4cde-92b0-b8d5930bb4cc"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GOOG</th>\n",
       "      <td>2010-04-09</td>\n",
       "      <td>90</td>\n",
       "      <td>32.2</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price    owner\n",
       "name                                    \n",
       "AA    2001-12-01     100   12.3  Unknown\n",
       "IBM   2012-02-10      30   10.3  Unknown\n",
       "GOOG  2010-04-09      90   32.2  Unknown"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = df2.set_index(['name'])\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Can no longer slice rows by position\n",
    "    - This is because the row numbers are really labels, not positions\n",
    "    - The behaviour for `Series` is inconsistent with this interpretation for `DataFrane`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 312
    },
    "id": "DMyeYsgqPgAn",
    "outputId": "36d1e25f-55cf-42a3-d667-d319ad654528"
   },
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "cannot do slice indexing on Index with these indexers [0] of type int",
     "output_type": "error",
     "traceback": [
      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
      "\u001b[31mTypeError\u001b[39m                                 Traceback (most recent call last)",
      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[38]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[43mdf2\u001b[49m\u001b[43m.\u001b[49m\u001b[43mloc\u001b[49m\u001b[43m[\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m:\u001b[49m\u001b[32;43m2\u001b[39;49m\u001b[43m]\u001b[49m \n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexing.py:1191\u001b[39m, in \u001b[36m_LocationIndexer.__getitem__\u001b[39m\u001b[34m(self, key)\u001b[39m\n\u001b[32m   1189\u001b[39m maybe_callable = com.apply_if_callable(key, \u001b[38;5;28mself\u001b[39m.obj)\n\u001b[32m   1190\u001b[39m maybe_callable = \u001b[38;5;28mself\u001b[39m._check_deprecated_callable_usage(key, maybe_callable)\n\u001b[32m-> \u001b[39m\u001b[32m1191\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_getitem_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmaybe_callable\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[43m=\u001b[49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexing.py:1411\u001b[39m, in \u001b[36m_LocIndexer._getitem_axis\u001b[39m\u001b[34m(self, key, axis)\u001b[39m\n\u001b[32m   1409\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(key, \u001b[38;5;28mslice\u001b[39m):\n\u001b[32m   1410\u001b[39m     \u001b[38;5;28mself\u001b[39m._validate_key(key, axis)\n\u001b[32m-> \u001b[39m\u001b[32m1411\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_get_slice_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[43m=\u001b[49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   1412\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m com.is_bool_indexer(key):\n\u001b[32m   1413\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._getbool_axis(key, axis=axis)\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexing.py:1443\u001b[39m, in \u001b[36m_LocIndexer._get_slice_axis\u001b[39m\u001b[34m(self, slice_obj, axis)\u001b[39m\n\u001b[32m   1440\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m obj.copy(deep=\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[32m   1442\u001b[39m labels = obj._get_axis(axis)\n\u001b[32m-> \u001b[39m\u001b[32m1443\u001b[39m indexer = \u001b[43mlabels\u001b[49m\u001b[43m.\u001b[49m\u001b[43mslice_indexer\u001b[49m\u001b[43m(\u001b[49m\u001b[43mslice_obj\u001b[49m\u001b[43m.\u001b[49m\u001b[43mstart\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mslice_obj\u001b[49m\u001b[43m.\u001b[49m\u001b[43mstop\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mslice_obj\u001b[49m\u001b[43m.\u001b[49m\u001b[43mstep\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   1445\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(indexer, \u001b[38;5;28mslice\u001b[39m):\n\u001b[32m   1446\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m.obj._slice(indexer, axis=axis)\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6708\u001b[39m, in \u001b[36mIndex.slice_indexer\u001b[39m\u001b[34m(self, start, end, step)\u001b[39m\n\u001b[32m   6664\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mslice_indexer\u001b[39m(\n\u001b[32m   6665\u001b[39m     \u001b[38;5;28mself\u001b[39m,\n\u001b[32m   6666\u001b[39m     start: Hashable | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m   6667\u001b[39m     end: Hashable | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m   6668\u001b[39m     step: \u001b[38;5;28mint\u001b[39m | \u001b[38;5;28;01mNone\u001b[39;00m = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m   6669\u001b[39m ) -> \u001b[38;5;28mslice\u001b[39m:\n\u001b[32m   6670\u001b[39m \u001b[38;5;250m    \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m   6671\u001b[39m \u001b[33;03m    Compute the slice indexer for input labels and step.\u001b[39;00m\n\u001b[32m   6672\u001b[39m \n\u001b[32m   (...)\u001b[39m\u001b[32m   6706\u001b[39m \u001b[33;03m    slice(1, 3, None)\u001b[39;00m\n\u001b[32m   6707\u001b[39m \u001b[33;03m    \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m6708\u001b[39m     start_slice, end_slice = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mslice_locs\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstart\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mend\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstep\u001b[49m\u001b[43m=\u001b[49m\u001b[43mstep\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   6710\u001b[39m     \u001b[38;5;66;03m# return a slice\u001b[39;00m\n\u001b[32m   6711\u001b[39m     \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_scalar(start_slice):\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6934\u001b[39m, in \u001b[36mIndex.slice_locs\u001b[39m\u001b[34m(self, start, end, step)\u001b[39m\n\u001b[32m   6932\u001b[39m start_slice = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m   6933\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m start \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m6934\u001b[39m     start_slice = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mget_slice_bound\u001b[49m\u001b[43m(\u001b[49m\u001b[43mstart\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mleft\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m   6935\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m start_slice \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m   6936\u001b[39m     start_slice = \u001b[32m0\u001b[39m\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6849\u001b[39m, in \u001b[36mIndex.get_slice_bound\u001b[39m\u001b[34m(self, label, side)\u001b[39m\n\u001b[32m   6845\u001b[39m original_label = label\n\u001b[32m   6847\u001b[39m \u001b[38;5;66;03m# For datetime indices label may be a string that has to be converted\u001b[39;00m\n\u001b[32m   6848\u001b[39m \u001b[38;5;66;03m# to datetime boundary according to its resolution.\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m6849\u001b[39m label = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_maybe_cast_slice_bound\u001b[49m\u001b[43m(\u001b[49m\u001b[43mlabel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mside\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   6851\u001b[39m \u001b[38;5;66;03m# we need to look up the label\u001b[39;00m\n\u001b[32m   6852\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:6782\u001b[39m, in \u001b[36mIndex._maybe_cast_slice_bound\u001b[39m\u001b[34m(self, label, side)\u001b[39m\n\u001b[32m   6780\u001b[39m \u001b[38;5;66;03m# reject them, if index does not contain label\u001b[39;00m\n\u001b[32m   6781\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m (is_float(label) \u001b[38;5;129;01mor\u001b[39;00m is_integer(label)) \u001b[38;5;129;01mand\u001b[39;00m label \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m:\n\u001b[32m-> \u001b[39m\u001b[32m6782\u001b[39m     \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_raise_invalid_indexer\u001b[49m\u001b[43m(\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mslice\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlabel\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   6784\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m label\n",
      "\u001b[36mFile \u001b[39m\u001b[32m~/python-venv/lib/python3.13/site-packages/pandas/core/indexes/base.py:4308\u001b[39m, in \u001b[36mIndex._raise_invalid_indexer\u001b[39m\u001b[34m(self, form, key, reraise)\u001b[39m\n\u001b[32m   4306\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m reraise \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m lib.no_default:\n\u001b[32m   4307\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(msg) \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01mreraise\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m4308\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(msg)\n",
      "\u001b[31mTypeError\u001b[39m: cannot do slice indexing on Index with these indexers [0] of type int"
     ]
    }
   ],
   "source": [
    "df2.loc[0:2] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- However, `iloc` works, like for `Series`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>shares</th>\n",
       "      <th>price</th>\n",
       "      <th>owner</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>name</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AA</th>\n",
       "      <td>2001-12-01</td>\n",
       "      <td>100</td>\n",
       "      <td>12.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>IBM</th>\n",
       "      <td>2012-02-10</td>\n",
       "      <td>30</td>\n",
       "      <td>10.3</td>\n",
       "      <td>Unknown</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            date  shares  price    owner\n",
       "name                                    \n",
       "AA    2001-12-01     100   12.3  Unknown\n",
       "IBM   2012-02-10      30   10.3  Unknown"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.iloc[0:2] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lzw_dijtPgAn"
   },
   "source": [
    "## Reading csv files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- By convention, the first line of a csv file is interpreted to be column names\n",
    "- `index_col=None` says do not create a row index from any of the given columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "id": "4YXNPgJoPgAo"
   },
   "outputs": [],
   "source": [
    "casts = pd.read_csv('cast.csv',index_col=None)\n",
    "titles = pd.read_csv('titles.csv',index_col=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Examine the first few rows using `head()`\n",
    "    - Default is 5 rows\n",
    "    - Can ask for `n` rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "CBg4-jSpPgAo",
    "outputId": "a8b92049-3273-49e1-9b53-6d3260abcad0"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "      <th>name</th>\n",
       "      <th>type</th>\n",
       "      <th>character</th>\n",
       "      <th>n</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Closet Monster</td>\n",
       "      <td>2015</td>\n",
       "      <td>Buffy #1</td>\n",
       "      <td>actor</td>\n",
       "      <td>Buffy 4</td>\n",
       "      <td>31.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Suuri illusioni</td>\n",
       "      <td>1985</td>\n",
       "      <td>Homo $</td>\n",
       "      <td>actor</td>\n",
       "      <td>Guests</td>\n",
       "      <td>22.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Battle of the Sexes</td>\n",
       "      <td>2017</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>Bobby Riggs Fan</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Secret in Their Eyes</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>2002 Dodger Fan</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Steve Jobs</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>1988 Opera House Patron</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  title  year      name   type                character     n\n",
       "0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0\n",
       "1       Suuri illusioni  1985    Homo $  actor                   Guests  22.0\n",
       "2   Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan  10.0\n",
       "3  Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   NaN\n",
       "4            Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   NaN"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 269
    },
    "id": "TovesnKGPgAo",
    "outputId": "193bd5da-a583-447d-cb0f-933489927f89"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "      <th>name</th>\n",
       "      <th>type</th>\n",
       "      <th>character</th>\n",
       "      <th>n</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Closet Monster</td>\n",
       "      <td>2015</td>\n",
       "      <td>Buffy #1</td>\n",
       "      <td>actor</td>\n",
       "      <td>Buffy 4</td>\n",
       "      <td>31.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Suuri illusioni</td>\n",
       "      <td>1985</td>\n",
       "      <td>Homo $</td>\n",
       "      <td>actor</td>\n",
       "      <td>Guests</td>\n",
       "      <td>22.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Battle of the Sexes</td>\n",
       "      <td>2017</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>Bobby Riggs Fan</td>\n",
       "      <td>10.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Secret in Their Eyes</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>2002 Dodger Fan</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Steve Jobs</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>1988 Opera House Patron</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Straight Outta Compton</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>Club Patron</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Straight Outta Compton</td>\n",
       "      <td>2015</td>\n",
       "      <td>$hutter</td>\n",
       "      <td>actor</td>\n",
       "      <td>Dopeman</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    title  year      name   type                character  \\\n",
       "0          Closet Monster  2015  Buffy #1  actor                  Buffy 4   \n",
       "1         Suuri illusioni  1985    Homo $  actor                   Guests   \n",
       "2     Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan   \n",
       "3    Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   \n",
       "4              Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   \n",
       "5  Straight Outta Compton  2015   $hutter  actor              Club Patron   \n",
       "6  Straight Outta Compton  2015   $hutter  actor                  Dopeman   \n",
       "\n",
       "      n  \n",
       "0  31.0  \n",
       "1  22.0  \n",
       "2  10.0  \n",
       "3   NaN  \n",
       "4   NaN  \n",
       "5   NaN  \n",
       "6   NaN  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts.head(7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Likewise, `tail()` gives last few lines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "csAeRKsFPgAo",
    "outputId": "57fac4a8-aa53-4528-d969-eaa671fa44b7"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>49995</th>\n",
       "      <td>Rebel</td>\n",
       "      <td>1970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49996</th>\n",
       "      <td>Suzanne</td>\n",
       "      <td>1996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>Bomba</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>Aao Jao Ghar Tumhara</td>\n",
       "      <td>1984</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>Mrs. Munck</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      title  year\n",
       "49995                 Rebel  1970\n",
       "49996               Suzanne  1996\n",
       "49997                 Bomba  2013\n",
       "49998  Aao Jao Ghar Tumhara  1984\n",
       "49999            Mrs. Munck  1995"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titles.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>49992</th>\n",
       "      <td>Legend of Horror</td>\n",
       "      <td>1972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49993</th>\n",
       "      <td>Corruption.Gov</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49994</th>\n",
       "      <td>Lille Fridolf blir morfar</td>\n",
       "      <td>1957</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49995</th>\n",
       "      <td>Rebel</td>\n",
       "      <td>1970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49996</th>\n",
       "      <td>Suzanne</td>\n",
       "      <td>1996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>Bomba</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>Aao Jao Ghar Tumhara</td>\n",
       "      <td>1984</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>Mrs. Munck</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           title  year\n",
       "49992           Legend of Horror  1972\n",
       "49993             Corruption.Gov  2010\n",
       "49994  Lille Fridolf blir morfar  1957\n",
       "49995                      Rebel  1970\n",
       "49996                    Suzanne  1996\n",
       "49997                      Bomba  2013\n",
       "49998       Aao Jao Ghar Tumhara  1984\n",
       "49999                 Mrs. Munck  1995"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titles.tail(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hvbLBEJiPgAp"
   },
   "source": [
    "### Filtering data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fjswr_FNPgAp"
   },
   "source": [
    "- Movies after 1985\n",
    "- Like`select` in relational algebra"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "id": "Su8HI1zcPgAp",
    "outputId": "0b9cea25-519f-4d3c-dd20-d6e20909c225"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>The Rising Son</td>\n",
       "      <td>1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Crucea de piatra</td>\n",
       "      <td>1993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Country</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Gaiking II</td>\n",
       "      <td>2011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Medusa (IV)</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49990</th>\n",
       "      <td>Junebug</td>\n",
       "      <td>2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49993</th>\n",
       "      <td>Corruption.Gov</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49996</th>\n",
       "      <td>Suzanne</td>\n",
       "      <td>1996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>Bomba</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>Mrs. Munck</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>29814 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  title  year\n",
       "0        The Rising Son  1990\n",
       "2      Crucea de piatra  1993\n",
       "3               Country  2000\n",
       "4            Gaiking II  2011\n",
       "5           Medusa (IV)  2015\n",
       "...                 ...   ...\n",
       "49990           Junebug  2005\n",
       "49993    Corruption.Gov  2010\n",
       "49996           Suzanne  1996\n",
       "49997             Bomba  2013\n",
       "49999        Mrs. Munck  1995\n",
       "\n",
       "[29814 rows x 2 columns]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "after85 = titles[titles['year'] > 1985]\n",
    "# select * from titles where year > 1985\n",
    "after85"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Project the output of select on column `title`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0          The Rising Son\n",
       "2        Crucea de piatra\n",
       "3                 Country\n",
       "4              Gaiking II\n",
       "5             Medusa (IV)\n",
       "               ...       \n",
       "49990             Junebug\n",
       "49993      Corruption.Gov\n",
       "49996             Suzanne\n",
       "49997               Bomba\n",
       "49999          Mrs. Munck\n",
       "Name: title, Length: 29814, dtype: object"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "after85titles = titles[titles['year'] > 1985]['title']\n",
    "# Select title from titles where year > 1985\n",
    "after85titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Boolean combinations of conditions\n",
    "    - `&` for and, `|` for or, `~` for not"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-Zx-ErapPgAp"
   },
   "source": [
    "- Movies in years 1990 - 1999"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "id": "UDIb8SdnPgAp",
    "outputId": "99f67c83-931e-41ea-cc1f-87591c9e9be6"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>The Rising Son</td>\n",
       "      <td>1990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Crucea de piatra</td>\n",
       "      <td>1993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Poka Makorer Ghar Bosoti</td>\n",
       "      <td>1996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Maa Durga Shakti</td>\n",
       "      <td>1999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Conflict of Interest</td>\n",
       "      <td>1993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49969</th>\n",
       "      <td>Chi mei wang liang</td>\n",
       "      <td>1998</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49979</th>\n",
       "      <td>Gagay: Prinsesa ng brownout</td>\n",
       "      <td>1993</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49987</th>\n",
       "      <td>I Won't Dance</td>\n",
       "      <td>1992</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49996</th>\n",
       "      <td>Suzanne</td>\n",
       "      <td>1996</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49999</th>\n",
       "      <td>Mrs. Munck</td>\n",
       "      <td>1995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4803 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                             title  year\n",
       "0                   The Rising Son  1990\n",
       "2                 Crucea de piatra  1993\n",
       "12        Poka Makorer Ghar Bosoti  1996\n",
       "19                Maa Durga Shakti  1999\n",
       "24            Conflict of Interest  1993\n",
       "...                            ...   ...\n",
       "49969           Chi mei wang liang  1998\n",
       "49979  Gagay: Prinsesa ng brownout  1993\n",
       "49987                I Won't Dance  1992\n",
       "49996                      Suzanne  1996\n",
       "49999                   Mrs. Munck  1995\n",
       "\n",
       "[4803 rows x 2 columns]"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = titles\n",
    "movies90 = t[(t['year'] >= 1990) & (t['year'] < 2000)]\n",
    "movies90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Complement of the previous condition, using negation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The Thousand Plane Raid</td>\n",
       "      <td>1969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Country</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Gaiking II</td>\n",
       "      <td>2011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Medusa (IV)</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>The Fresh Air Will Do You Good</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49993</th>\n",
       "      <td>Corruption.Gov</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49994</th>\n",
       "      <td>Lille Fridolf blir morfar</td>\n",
       "      <td>1957</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49995</th>\n",
       "      <td>Rebel</td>\n",
       "      <td>1970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>Bomba</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>Aao Jao Ghar Tumhara</td>\n",
       "      <td>1984</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>45197 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                title  year\n",
       "1             The Thousand Plane Raid  1969\n",
       "3                             Country  2000\n",
       "4                          Gaiking II  2011\n",
       "5                         Medusa (IV)  2015\n",
       "6      The Fresh Air Will Do You Good  2008\n",
       "...                               ...   ...\n",
       "49993                  Corruption.Gov  2010\n",
       "49994       Lille Fridolf blir morfar  1957\n",
       "49995                           Rebel  1970\n",
       "49997                           Bomba  2013\n",
       "49998            Aao Jao Ghar Tumhara  1984\n",
       "\n",
       "[45197 rows x 2 columns]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = titles\n",
    "notmovies90 = t[~((t['year'] >= 1990) & (t['year'] < 2000))]\n",
    "notmovies90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Complement of the previous condition, using or"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>The Thousand Plane Raid</td>\n",
       "      <td>1969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Country</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Gaiking II</td>\n",
       "      <td>2011</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Medusa (IV)</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>The Fresh Air Will Do You Good</td>\n",
       "      <td>2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49993</th>\n",
       "      <td>Corruption.Gov</td>\n",
       "      <td>2010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49994</th>\n",
       "      <td>Lille Fridolf blir morfar</td>\n",
       "      <td>1957</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49995</th>\n",
       "      <td>Rebel</td>\n",
       "      <td>1970</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49997</th>\n",
       "      <td>Bomba</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49998</th>\n",
       "      <td>Aao Jao Ghar Tumhara</td>\n",
       "      <td>1984</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>45197 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                title  year\n",
       "1             The Thousand Plane Raid  1969\n",
       "3                             Country  2000\n",
       "4                          Gaiking II  2011\n",
       "5                         Medusa (IV)  2015\n",
       "6      The Fresh Air Will Do You Good  2008\n",
       "...                               ...   ...\n",
       "49993                  Corruption.Gov  2010\n",
       "49994       Lille Fridolf blir morfar  1957\n",
       "49995                           Rebel  1970\n",
       "49997                           Bomba  2013\n",
       "49998            Aao Jao Ghar Tumhara  1984\n",
       "\n",
       "[45197 rows x 2 columns]"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = titles\n",
    "notmovies90or = t[(t['year'] < 1990) | (t['year'] >= 2000)]\n",
    "notmovies90or"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8TnjO4KJPgAq"
   },
   "source": [
    "### Sorting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WI3sfBcKPgAq"
   },
   "source": [
    "- All movies named 'Macbeth'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "id": "wOfYJ5BSPgAq"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4226</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9322</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11722</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17166</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25847</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1998</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         title  year\n",
       "4226   Macbeth  1913\n",
       "9322   Macbeth  2006\n",
       "11722  Macbeth  2013\n",
       "17166  Macbeth  1997\n",
       "25847  Macbeth  1998"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macbeth = t[t['title'] == 'Macbeth']\n",
    "macbeth"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "3ir-RO7tPgAq",
    "outputId": "bac49826-ac19-4de1-9f94-0b512b293574"
   },
   "source": [
    "- Sort by year\n",
    "    - Note that sort is in-place"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "id": "cq6tHjnSPgAq"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4226</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17166</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25847</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1998</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9322</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11722</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         title  year\n",
       "4226   Macbeth  1913\n",
       "17166  Macbeth  1997\n",
       "25847  Macbeth  1998\n",
       "9322   Macbeth  2006\n",
       "11722  Macbeth  2013"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macbeth = macbeth.sort_values('year')\n",
    "macbeth"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 206
    },
    "id": "jCAqIe07PgAq",
    "outputId": "f2e1081b-0218-44c3-a47b-40e2086faaaa"
   },
   "source": [
    "- To restore original order, sort by index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "id": "i2GZvWkmPgAr"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4226</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9322</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11722</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17166</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1997</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25847</th>\n",
       "      <td>Macbeth</td>\n",
       "      <td>1998</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         title  year\n",
       "4226   Macbeth  1913\n",
       "9322   Macbeth  2006\n",
       "11722  Macbeth  2013\n",
       "17166  Macbeth  1997\n",
       "25847  Macbeth  1998"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macbeth = macbeth.sort_index()\n",
    "macbeth"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-nNvPJEpPgAr"
   },
   "source": [
    "### Summaries and descriptive statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `info()` gives overall summary of a data frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "qVuB5byQPgAr",
    "outputId": "d7d3e1e6-e842-4d20-f1e8-52df3f3b0af5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 50000 entries, 0 to 49999\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   title   50000 non-null  object\n",
      " 1   year    50000 non-null  int64 \n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 781.4+ KB\n"
     ]
    }
   ],
   "source": [
    "titles.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dQNX6QeXPgAr",
    "outputId": "2933bbaa-c3c3-48fd-b748-7ecccad1f7be"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 75001 entries, 0 to 75000\n",
      "Data columns (total 6 columns):\n",
      " #   Column     Non-Null Count  Dtype  \n",
      "---  ------     --------------  -----  \n",
      " 0   title      75000 non-null  object \n",
      " 1   year       75001 non-null  int64  \n",
      " 2   name       75001 non-null  object \n",
      " 3   type       75001 non-null  object \n",
      " 4   character  75001 non-null  object \n",
      " 5   n          46035 non-null  float64\n",
      "dtypes: float64(1), int64(1), object(4)\n",
      "memory usage: 3.4+ MB\n"
     ]
    }
   ],
   "source": [
    "casts.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `describe()` gives statistical summary of numeric columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 300
    },
    "id": "dWq8rhPzPgAs",
    "outputId": "35f96475-f479-443c-e105-d81f724093ca"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>50000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1986.106120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>29.293942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1900.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1967.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>1996.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2011.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2024.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               year\n",
       "count  50000.000000\n",
       "mean    1986.106120\n",
       "std       29.293942\n",
       "min     1900.000000\n",
       "25%     1967.000000\n",
       "50%     1996.000000\n",
       "75%     2011.000000\n",
       "max     2024.000000"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "titles.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 300
    },
    "id": "3eM258gTPgAs",
    "outputId": "f35e6bce-810a-44f0-9659-74333b4d098e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>n</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>75001.000000</td>\n",
       "      <td>46035.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>1990.536473</td>\n",
       "      <td>16.814359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>26.748233</td>\n",
       "      <td>24.695616</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1912.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1974.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2002.000000</td>\n",
       "      <td>10.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2012.000000</td>\n",
       "      <td>21.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2023.000000</td>\n",
       "      <td>701.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               year             n\n",
       "count  75001.000000  46035.000000\n",
       "mean    1990.536473     16.814359\n",
       "std       26.748233     24.695616\n",
       "min     1912.000000      1.000000\n",
       "25%     1974.000000      4.000000\n",
       "50%     2002.000000     10.000000\n",
       "75%     2012.000000     21.000000\n",
       "max     2023.000000    701.000000"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J3cICXF1PgAs"
   },
   "source": [
    "### Descriptive statistics for categorical data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Can also get summary for a non-numeric column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "UG--SUr8PgAs",
    "outputId": "d546e20e-3f9d-4b4d-fd03-b545dee2aee6"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count           75001\n",
       "unique          29319\n",
       "top       Ernie Adams\n",
       "freq              431\n",
       "Name: name, dtype: object"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts['name'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "name\n",
       "$hutter             5\n",
       "'Babe' Agamenoni    1\n",
       "'Babe' Agaminono    1\n",
       "'El Guisa'          1\n",
       "'El Viti'           1\n",
       "                   ..\n",
       "Zura Abesadze       1\n",
       "Zuri Alexander      1\n",
       "Zuzu Abu            1\n",
       "Zvone Agrez         2\n",
       "gregg Alexander     1\n",
       "Name: name, Length: 29319, dtype: int64"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts.groupby(['name'])['name'].count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count       75001\n",
       "unique      50299\n",
       "top       Himself\n",
       "freq          405\n",
       "Name: character, dtype: object"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "casts['character'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Another example, housing data by locality in California"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "id": "ZWAaBuwdPgAs"
   },
   "outputs": [],
   "source": [
    "housing = pd.read_csv('housing.csv', index_col=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "      <th>housing_median_age</th>\n",
       "      <th>total_rooms</th>\n",
       "      <th>total_bedrooms</th>\n",
       "      <th>population</th>\n",
       "      <th>households</th>\n",
       "      <th>median_income</th>\n",
       "      <th>median_house_value</th>\n",
       "      <th>ocean_proximity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-122.23</td>\n",
       "      <td>37.88</td>\n",
       "      <td>41.0</td>\n",
       "      <td>880.0</td>\n",
       "      <td>129.0</td>\n",
       "      <td>322.0</td>\n",
       "      <td>126.0</td>\n",
       "      <td>8.3252</td>\n",
       "      <td>452600.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-122.22</td>\n",
       "      <td>37.86</td>\n",
       "      <td>21.0</td>\n",
       "      <td>7099.0</td>\n",
       "      <td>1106.0</td>\n",
       "      <td>2401.0</td>\n",
       "      <td>1138.0</td>\n",
       "      <td>8.3014</td>\n",
       "      <td>358500.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-122.24</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52.0</td>\n",
       "      <td>1467.0</td>\n",
       "      <td>190.0</td>\n",
       "      <td>496.0</td>\n",
       "      <td>177.0</td>\n",
       "      <td>7.2574</td>\n",
       "      <td>352100.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52.0</td>\n",
       "      <td>1274.0</td>\n",
       "      <td>235.0</td>\n",
       "      <td>558.0</td>\n",
       "      <td>219.0</td>\n",
       "      <td>5.6431</td>\n",
       "      <td>341300.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-122.25</td>\n",
       "      <td>37.85</td>\n",
       "      <td>52.0</td>\n",
       "      <td>1627.0</td>\n",
       "      <td>280.0</td>\n",
       "      <td>565.0</td>\n",
       "      <td>259.0</td>\n",
       "      <td>3.8462</td>\n",
       "      <td>342200.0</td>\n",
       "      <td>NEAR BAY</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \\\n",
       "0    -122.23     37.88                41.0        880.0           129.0   \n",
       "1    -122.22     37.86                21.0       7099.0          1106.0   \n",
       "2    -122.24     37.85                52.0       1467.0           190.0   \n",
       "3    -122.25     37.85                52.0       1274.0           235.0   \n",
       "4    -122.25     37.85                52.0       1627.0           280.0   \n",
       "\n",
       "   population  households  median_income  median_house_value ocean_proximity  \n",
       "0       322.0       126.0         8.3252            452600.0        NEAR BAY  \n",
       "1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  \n",
       "2       496.0       177.0         7.2574            352100.0        NEAR BAY  \n",
       "3       558.0       219.0         5.6431            341300.0        NEAR BAY  \n",
       "4       565.0       259.0         3.8462            342200.0        NEAR BAY  "
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "housing.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Vyk0rPIMPgAt",
    "outputId": "c015e299-4baa-44c3-dc01-0431fdd793cc"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 20640 entries, 0 to 20639\n",
      "Data columns (total 10 columns):\n",
      " #   Column              Non-Null Count  Dtype  \n",
      "---  ------              --------------  -----  \n",
      " 0   longitude           20640 non-null  float64\n",
      " 1   latitude            20640 non-null  float64\n",
      " 2   housing_median_age  20640 non-null  float64\n",
      " 3   total_rooms         20640 non-null  float64\n",
      " 4   total_bedrooms      20433 non-null  float64\n",
      " 5   population          20640 non-null  float64\n",
      " 6   households          20640 non-null  float64\n",
      " 7   median_income       20640 non-null  float64\n",
      " 8   median_house_value  20640 non-null  float64\n",
      " 9   ocean_proximity     20640 non-null  object \n",
      "dtypes: float64(9), object(1)\n",
      "memory usage: 1.6+ MB\n"
     ]
    }
   ],
   "source": [
    "housing.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 320
    },
    "id": "rzmHi2aLPgAt",
    "outputId": "d9a5bb20-8543-4567-81cd-0b0ead337a7d"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>longitude</th>\n",
       "      <th>latitude</th>\n",
       "      <th>housing_median_age</th>\n",
       "      <th>total_rooms</th>\n",
       "      <th>total_bedrooms</th>\n",
       "      <th>population</th>\n",
       "      <th>households</th>\n",
       "      <th>median_income</th>\n",
       "      <th>median_house_value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20433.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "      <td>20640.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>-119.569704</td>\n",
       "      <td>35.631861</td>\n",
       "      <td>28.639486</td>\n",
       "      <td>2635.763081</td>\n",
       "      <td>537.870553</td>\n",
       "      <td>1425.476744</td>\n",
       "      <td>499.539680</td>\n",
       "      <td>3.870671</td>\n",
       "      <td>206855.816909</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>2.003532</td>\n",
       "      <td>2.135952</td>\n",
       "      <td>12.585558</td>\n",
       "      <td>2181.615252</td>\n",
       "      <td>421.385070</td>\n",
       "      <td>1132.462122</td>\n",
       "      <td>382.329753</td>\n",
       "      <td>1.899822</td>\n",
       "      <td>115395.615874</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-124.350000</td>\n",
       "      <td>32.540000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.499900</td>\n",
       "      <td>14999.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-121.800000</td>\n",
       "      <td>33.930000</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>1447.750000</td>\n",
       "      <td>296.000000</td>\n",
       "      <td>787.000000</td>\n",
       "      <td>280.000000</td>\n",
       "      <td>2.563400</td>\n",
       "      <td>119600.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-118.490000</td>\n",
       "      <td>34.260000</td>\n",
       "      <td>29.000000</td>\n",
       "      <td>2127.000000</td>\n",
       "      <td>435.000000</td>\n",
       "      <td>1166.000000</td>\n",
       "      <td>409.000000</td>\n",
       "      <td>3.534800</td>\n",
       "      <td>179700.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>-118.010000</td>\n",
       "      <td>37.710000</td>\n",
       "      <td>37.000000</td>\n",
       "      <td>3148.000000</td>\n",
       "      <td>647.000000</td>\n",
       "      <td>1725.000000</td>\n",
       "      <td>605.000000</td>\n",
       "      <td>4.743250</td>\n",
       "      <td>264725.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>-114.310000</td>\n",
       "      <td>41.950000</td>\n",
       "      <td>52.000000</td>\n",
       "      <td>39320.000000</td>\n",
       "      <td>6445.000000</td>\n",
       "      <td>35682.000000</td>\n",
       "      <td>6082.000000</td>\n",
       "      <td>15.000100</td>\n",
       "      <td>500001.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          longitude      latitude  housing_median_age   total_rooms  \\\n",
       "count  20640.000000  20640.000000        20640.000000  20640.000000   \n",
       "mean    -119.569704     35.631861           28.639486   2635.763081   \n",
       "std        2.003532      2.135952           12.585558   2181.615252   \n",
       "min     -124.350000     32.540000            1.000000      2.000000   \n",
       "25%     -121.800000     33.930000           18.000000   1447.750000   \n",
       "50%     -118.490000     34.260000           29.000000   2127.000000   \n",
       "75%     -118.010000     37.710000           37.000000   3148.000000   \n",
       "max     -114.310000     41.950000           52.000000  39320.000000   \n",
       "\n",
       "       total_bedrooms    population    households  median_income  \\\n",
       "count    20433.000000  20640.000000  20640.000000   20640.000000   \n",
       "mean       537.870553   1425.476744    499.539680       3.870671   \n",
       "std        421.385070   1132.462122    382.329753       1.899822   \n",
       "min          1.000000      3.000000      1.000000       0.499900   \n",
       "25%        296.000000    787.000000    280.000000       2.563400   \n",
       "50%        435.000000   1166.000000    409.000000       3.534800   \n",
       "75%        647.000000   1725.000000    605.000000       4.743250   \n",
       "max       6445.000000  35682.000000   6082.000000      15.000100   \n",
       "\n",
       "       median_house_value  \n",
       "count        20640.000000  \n",
       "mean        206855.816909  \n",
       "std         115395.615874  \n",
       "min          14999.000000  \n",
       "25%         119600.000000  \n",
       "50%         179700.000000  \n",
       "75%         264725.000000  \n",
       "max         500001.000000  "
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "housing.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Only one non-numeric column, `ocean_proximity`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count         20640\n",
       "unique            5\n",
       "top       <1H OCEAN\n",
       "freq           9136\n",
       "Name: ocean_proximity, dtype: object"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "housing['ocean_proximity'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['NEAR BAY', 'NEAR BAY', 'NEAR BAY', ..., 'INLAND', 'INLAND',\n",
       "       'INLAND'], shape=(20640,), dtype=object)"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "housing['ocean_proximity'].values"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Lecture-18-25nov2021.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}