diff --git a/tools_pandas.ipynb b/tools_pandas.ipynb
index f9b4a75..b4ec061 100644
--- a/tools_pandas.ipynb
+++ b/tools_pandas.ipynb
@@ -5,11 +5,16 @@
    "metadata": {},
    "source": [
     "# Tools - pandas\n",
-    "*The `pandas` library provides high-performance, easy-to-use data structures and data analysis tools. The main data structure is the `DataFrame`, which you can think of as a spreadsheet (including column names and row labels).*\n",
+    "*The `pandas` library provides high-performance, easy-to-use data structures and data analysis tools. The main data structure is the `DataFrame`, which you can think of as an in-memory 2D table (like a spreadsheet, with column names and row labels). Many features available in Excel are available programmatically, such as creating pivot tables, computing columns based on other columns, plotting graphs, etc. You can also group rows by column value, or join tables much like in SQL. Pandas is also great at handling time series.*\n",
     "\n",
     "**Prerequisites:**\n",
-    "* NumPy – if you are not familiar with NumPy, we recommend that you go through the [NumPy tutorial](tools_numpy.ipynb) now.\n",
-    "\n",
+    "* NumPy – if you are not familiar with NumPy, we recommend that you go through the [NumPy tutorial](tools_numpy.ipynb) now."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "## Setup\n",
     "First, let's make sure this notebook works well in both python 2 and 3:"
    ]
   },
@@ -51,10 +56,15 @@
    "source": [
     "## `Series` objects\n",
     "The `pandas` library contains these useful data structures:\n",
-    "* `Series` objects, that we will discuss now. A `Series` object is similar to a column in a spreadsheet (with a column name and row labels).\n",
-    "* `DataFrame` objects. You can see this as a full spreadsheet (with column names and row labels).\n",
-    "* `Panel` objects. You can see a `Panel` a a dictionary of `DataFrame`s (less used). These are less used, so we will not discuss them here.\n",
-    "\n",
+    "* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).\n",
+    "* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).\n",
+    "* `Panel` objects. You can see a `Panel` as a dictionary of `DataFrame`s. These are less used, so we will not discuss them here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
     "### Creating a `Series`\n",
     "Let's start by creating our first `Series` object!"
    ]
   },
@@ -106,14 +116,14 @@
    },
    "outputs": [],
    "source": [
-    "s + pd.Series([1000,2000,3000,4000])"
+    "s + [1000,2000,3000,4000]"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`:"
+    "Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`. This is called *broadcasting*:"
    ]
   },
   {
@@ -150,7 +160,7 @@
    "metadata": {},
    "source": [
     "### Index labels\n",
-    "Each item in a `Series` object has a unique identifier called the *index label*. By default, it is simply the index of the item in the `Series` but you can also set the index labels manually:"
+    "Each item in a `Series` object has a unique identifier called the *index label*.
By default, it is simply the rank of the item in the `Series` (starting at `0`) but you can also set the index labels manually:" ] }, { @@ -187,7 +197,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can still access the items by location, like in a regular array:" + "You can still access the items by integer location, like in a regular array:" ] }, { @@ -205,7 +215,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Slicing a `Series` also slices the index labels:" + "To make it clear when you are accessing by label or by integer location, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location:" ] }, { @@ -216,19 +226,48 @@ }, "outputs": [], "source": [ - "s2[1:3]" + "s2.loc[\"bob\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s2.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This can lead to unexpected results when using the default labels, so be careful:" + "Slicing a `Series` also slices the index labels:" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 13, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s2.iloc[1:3]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This can lead to unexpected results when using the default numeric labels, so be careful:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, "metadata": { "collapsed": false }, @@ -240,7 +279,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 15, "metadata": { "collapsed": false }, @@ -259,7 +298,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 16, "metadata": { "collapsed": false }, @@ -275,12 +314,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "But you can access elements by location using the `iloc` attribute:" + "But remember that you can access elements by integer location using the `iloc` attribute. This illustrates another reason why it's always better to use `loc` and `iloc` to access `Series` objects:" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 17, "metadata": { "collapsed": false }, @@ -299,7 +338,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 18, "metadata": { "collapsed": false }, @@ -314,18 +353,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can control which elements you want to include in the `Series` and in what order by passing a second argument to the constructor with the list of desired index labels:" + "You can control which elements you want to include in the `Series` and in what order by explicitly specifying the desired `index`:" ] }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ - "s4 = pd.Series(weights, [\"colin\", \"alice\"])\n", + "s4 = pd.Series(weights, index = [\"colin\", \"alice\"])\n", "s4" ] }, @@ -339,7 +378,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 20, "metadata": { "collapsed": false }, @@ -347,6 +386,7 @@ "source": [ "print(s2.keys())\n", "print(s3.keys())\n", + "\n", "s2 + s3" ] }, @@ -354,14 +394,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The resulting `Series` contains the union of index labels from `s2` and `s3`. 
Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value (ie. Not-a-Number means *missing*).\n", + "The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `\"colin\"` is missing from `s2` and `\"charles\"` is missing from `s3`, these items have a `NaN` result value. (ie. Not-a-Number means *missing*).\n", "\n", "Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the right index labels, you can have surprising results:" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 21, "metadata": { "collapsed": false }, @@ -370,10 +410,17 @@ "s5 = pd.Series([1000,1000,1000,1000])\n", "print(\"s2 =\", s2.values)\n", "print(\"s5 =\", s5.values)\n", - "print(\"s2 + s5 =\")\n", + "\n", "s2 + s5" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pandas could not align the `Series`, since their labels do not match at all, hence the full `NaN` result." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -384,7 +431,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 22, "metadata": { "collapsed": false }, @@ -404,7 +451,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 23, "metadata": { "collapsed": false }, @@ -424,7 +471,7 @@ }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 24, "metadata": { "collapsed": false, "scrolled": true @@ -433,7 +480,8 @@ "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", - "s7 = pd.Series([4,9,10,8,14,12,11,9,17,16,19,13], name=\"temperature\")\n", + "temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]\n", + "s7 = pd.Series(temperatures, name=\"Temperature\")\n", "s7.plot()\n", "plt.show()" ] @@ -445,12 +493,472 @@ "There are *many* options for plotting your data. It is not necessary to list them all here: if you need a particular type of plot (histograms, pie charts, etc.), just look for it in the excellent [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) section of pandas' documentation, and look at the example code." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Time series\n", + "Many datasets have timestamps, and pandas is awesome at manipulating such data:\n", + "* it can represent periods (such as 2016Q3) and frequencies (such as \"monthly\"),\n", + "* it can convert periods to actual timestamps, and *vice versa*,\n", + "* it can resample data and aggregate values any way you like,\n", + "* it can handle timezones.\n", + "\n", + "### Time range\n", + "Let's start by creating a time series using `timerange`. This returns a `DatetimeIndex` containing one datetime per hour for 12 hours starting on October 29th 2016 at 5:30pm." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "dates = pd.date_range('2016/10/29 5:30pm', periods=12, freq='H')\n",
+    "dates"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This `DatetimeIndex` may be used as an index in a `Series`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series = pd.Series(temperatures, dates)\n",
+    "temp_series"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's plot this series:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series.plot(kind=\"bar\")\n",
+    "\n",
+    "plt.grid(True)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Resampling\n",
+    "Pandas lets us resample a time series very simply. Just call the `resample` method and specify a new frequency:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series_freq_2H = temp_series.resample(\"2H\")\n",
+    "temp_series_freq_2H"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's take a look at the result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series_freq_2H.plot(kind=\"bar\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note how the values have automatically been aggregated into 2-hour periods. If we look at the 6-8pm period, for example, we had a value of `5.1` at 6:30pm, and `6.1` at 7:30pm. After resampling, we just have one value of `5.6`, which is the mean of `5.1` and `6.1`. Computing the mean is the default behavior, but it is also possible to use a different aggregation function, for example we can decide to keep the minimum value of each period:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series_freq_2H = temp_series.resample(\"2H\", how=np.min)\n",
+    "temp_series_freq_2H"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Upsampling and interpolation\n",
+    "This was an example of downsampling. We can also upsample (i.e. increase the frequency), but this creates holes in our data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "temp_series_freq_15min = temp_series.resample(\"15Min\")\n",
+    "temp_series_freq_15min.head(n=10) # `head` displays the top n values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One solution is to fill the gaps by interpolating. We just call the `interpolate` method.
The default is to use linear interpolation, but we can also select another method, such as cubic interpolation:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [], + "source": [ + "temp_series_freq_15min = temp_series.resample(\"15Min\").interpolate(method=\"cubic\")\n", + "temp_series_freq_15min.head(n=10)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "temp_series.plot(label=\"Period: 1 hour\")\n", + "temp_series_freq_15min.plot(label=\"Period: 15 minutes\")\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Timezones\n", + "By default datetimes are *naive*: they are not aware of timezones, so 2016-10-30 02:30 might mean October 30th 2016 at 2:30am in Paris or in New York. We can make datetimes timezone *aware* by calling the `tz_localize` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "temp_series_ny = temp_series.tz_localize(\"America/New_York\")\n", + "temp_series_ny" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `-04:00` is now appended to all the datetimes. This means that these datetimes refer to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) - 4 hours.\n", + "\n", + "We can convert these datetimes to Paris time like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "temp_series_paris = temp_series_ny.tz_convert(\"Europe/Paris\")\n", + "temp_series_paris" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may have noticed that the UTC offset changes from `+02:00` to `+01:00`: this is because France switches to winter time at 3am that particular night (time goes back to 2am). Notice that 2:30am occurs twice! Let's go back to a naive representation (if you log some data hourly using local time, without storing the timezone, you might get something like this):" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "temp_series_paris_naive = temp_series_paris.tz_localize(None)\n", + "temp_series_paris_naive" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now `02:30` is really ambiguous. If we try to localize these naive datetimes to the Paris timezone, we get an error:" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "try:\n", + " temp_series_paris_naive.tz_localize(\"Europe/Paris\")\n", + "except Exception as e:\n", + " print(type(e))\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Fortunately using the `ambiguous` argument we can tell pandas to infer the right DST (Daylight Saving Time) based on the order of the ambiguous timestamps:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "temp_series_paris_naive.tz_localize(\"Europe/Paris\", ambiguous=\"infer\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Periods\n", + "The `period_range` function returns a `PeriodIndex` instead of a `DatetimeIndex`. 
For example, let's get all quarters in 2016 and 2017:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarters = pd.period_range('2016Q1', periods=8, freq='Q')\n", + "quarters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding a number `N` to a `PeriodIndex` shifts the periods by `N` times the `PeriodIndex`'s frequency:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarters + 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `asfreq` method lets us change the frequency of the `PeriodIndex`. All periods are lengthened or shortened accordingly. For example, let's convert all the quarterly periods to monthly periods (zooming in):" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarters.asfreq(\"M\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, the `asfreq` zooms on the end of each period. We can tell it to zoom on the start of each period instead:" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarters.asfreq(\"M\", how=\"start\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And we can zoom out:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarters.asfreq(\"A\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Of course we can create a `Series` with a `PeriodIndex`:" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarterly_revenue = pd.Series([300, 320, 290, 390, 320, 360, 310, 410], index = quarters)\n", + "quarterly_revenue" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "quarterly_revenue.plot(kind=\"line\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can convert periods to timestamps by calling `to_timestamp`. By default this will give us the first day of each period, but by setting `how` and `freq`, we can get the last hour of each period:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "last_hours = quarterly_revenue.to_timestamp(how=\"end\", freq=\"H\")\n", + "last_hours" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And back to periods by calling `to_period`:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "last_hours.to_period()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pandas also provides many other time-related functions that we recommend you check out in the [documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html). 
To whet your appetite, here is one way to get the last business day of each month in 2016, at 9am:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "months_2016 = pd.period_range(\"2016\", periods=12, freq=\"M\")\n", + "one_day_after_last_days = months_2016.asfreq(\"D\") + 1\n", + "last_bdays = one_day_after_last_days.to_timestamp() - pd.tseries.offsets.BDay()\n", + "last_bdays.to_period(\"H\") + 9" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `DataFrame` objects\n", - "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can think of them as dictionaries of `Series` objects.\n", + "A DataFrame object represents a spreadsheet, with cell values, column names and row index labels. You can define expressions to compute columns based on other columns, create pivot-tables, group rows, draw graphs, etc. You can see `DataFrame`s as dictionaries of `Series`.\n", "\n", "### Creating a `DataFrame`\n", "You can create a DataFrame by passing a dictionary of `Series` objects:" @@ -458,18 +966,17 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [], "source": [ - "people_ids = [\"alice\", \"bob\", \"charles\"]\n", "people_dict = {\n", - " \"weight\": pd.Series([68, 83, 112], index=people_ids),\n", - " \"birthyear\": pd.Series([1985, 1984, 1992], index=people_ids, name=\"year\"),\n", - " \"children\": pd.Series([np.nan, 3, 0], index=people_ids),\n", - " \"hobby\": pd.Series([\"Biking\", \"Dancing\", \"Reading\"], index=people_ids),\n", + " \"weight\": pd.Series([68, 83, 112], index=[\"alice\", \"bob\", \"charles\"]),\n", + " \"birthyear\": pd.Series([1984, 1985, 1992], index=[\"bob\", \"alice\", \"charles\"], name=\"year\"),\n", + " \"children\": pd.Series([0, 3], index=[\"charles\", \"bob\"]),\n", + " \"hobby\": pd.Series([\"Biking\", \"Dancing\"], index=[\"alice\", \"bob\"]),\n", "}\n", "people = pd.DataFrame(people_dict)\n", "people" @@ -479,7 +986,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that DataFrames are displayed nicely in Jupyter notebooks! Also, note that `Series` names are ignored (`\"year\"` was dropped)." + "A few things to note:\n", + "* the `Series` were automatically aligned based on their index,\n", + "* missing values are represented as `NaN`,\n", + "* `Series` names are ignored (the name `\"year\"` was dropped),\n", + "* `DataFrame`s are displayed nicely in Jupyter notebooks, woohoo!" 
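   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For example, a quick way to see the effect of the alignment is to look at the resulting row labels and column names (with this version of pandas, both end up sorted alphabetically):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(people.index)    # row labels: the union of all the Series indices\n",
+    "print(people.columns)  # column names: the keys of the dictionary\n",
+    "people.shape           # (number of rows, number of columns)"
+   ]
+  },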
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
@@ -491,7 +1002,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 24,
+   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "people[\"birthyear\"]"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also get multiple columns at once:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "people[[\"birthyear\", \"hobby\"]]"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -509,7 +1038,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 25,
+   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
@@ -527,22 +1056,47 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, and specify the column names and row index labels separately:"
+    "Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, or a list of lists, and specify the column names and row index labels separately:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
-    "values = np.array([\n",
+    "values = [\n",
     "    [1985, np.nan, \"Biking\", 68],\n",
     "    [1984, 3, \"Dancing\", 83],\n",
-    "    [1992, 0, \"Reading\", 112]\n",
-    "    ])\n",
+    "    [1992, 0, np.nan, 112]\n",
+    "    ]\n",
     "d3 = pd.DataFrame(\n",
     "        values,\n",
     "        columns=[\"birthyear\", \"children\", \"hobby\", \"weight\"],\n",
     "        index=[\"alice\", \"bob\", \"charles\"]\n",
     "    )\n",
     "d3"
    ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To specify missing values, you can either use `np.nan` or NumPy's masked arrays:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "masked_array = np.ma.asarray(values, dtype=np.object)\n",
+    "masked_array[(0, 2), (1, 2)] = np.ma.masked\n",
+    "d3 = pd.DataFrame(\n",
+    "        masked_array,\n",
+    "        columns=[\"birthyear\", \"children\", \"hobby\", \"weight\"],\n",
+    "        index=[\"alice\", \"bob\", \"charles\"]\n",
+    "    )\n",
+    "d3"
+   ]
+  },
@@ -560,7 +1114,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
@@ -583,17 +1137,17 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "people = pd.DataFrame({\n",
-    "    \"birthyear\": {\"alice\":1985, \"bob\": 1984, \"charles\": 1992},\n",
-    "    \"hobby\": {\"alice\":\"Biking\", \"bob\": \"Dancing\", \"charles\": \"Reading\"},\n",
-    "    \"weight\": {\"alice\":68, \"bob\": 83, \"charles\": 112},\n",
-    "    \"children\": {\"alice\":np.nan, \"bob\": 3, \"charles\": 0}\n",
+    "    \"birthyear\": {\"alice\":1985, \"bob\": 1984, \"charles\": 1992},\n",
+    "    \"hobby\": {\"alice\":\"Biking\", \"bob\": \"Dancing\"},\n",
+    "    \"weight\": {\"alice\":68, \"bob\": 83, \"charles\": 112},\n",
+    "    \"children\": {\"bob\": 3, \"charles\": 0}\n",
    "})\n",
    "people"
   ]
  },
@@ -608,7 +1162,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "d5 = pd.DataFrame(\n",
    "  {\n",
    "    (\"public\", \"birthyear\"):\n",
    "        {(\"Paris\",\"alice\"):1985, (\"Paris\",\"bob\"): 1984, (\"London\",\"charles\"): 1992},\n",
    "    (\"public\", \"hobby\"):\n",
-    "        {(\"Paris\",\"alice\"):\"Biking\", (\"Paris\",\"bob\"): \"Dancing\", (\"London\",\"charles\"): \"Reading\"},\n",
+    "        {(\"Paris\",\"alice\"):\"Biking\", (\"Paris\",\"bob\"): \"Dancing\"},\n",
(\"private\", \"weight\"):\n", " {(\"Paris\",\"alice\"):68, (\"Paris\",\"bob\"): 83, (\"London\",\"charles\"): 112},\n", " (\"private\", \"children\"):\n", @@ -638,7 +1192,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 58, "metadata": { "collapsed": false }, @@ -649,7 +1203,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 59, "metadata": { "collapsed": false }, @@ -658,6 +1212,152 @@ "d5[\"public\", \"hobby\"] # Same result as d4[\"public\"][\"hobby\"]" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dropping a level\n", + "Let's look at `d5` again:" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two levels of columns, and two levels of indices. We can drop a column level by calling `droplevel` (the same goes for indices):" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d5.columns = d5.columns.droplevel(level = 0)\n", + "d5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Transposing\n", + "You can swap columns and indices using the `T` attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d6 = d5.T\n", + "d6" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Stacking and unstacking levels\n", + "Calling the `stack` method will push the lowest column level after the lowest index:" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d7 = d6.stack()\n", + "d7" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that many `NaN` values appeared. This makes sense because many new combinations did not exist before (eg. there was no `bob` in `London`).\n", + "\n", + "Calling `unstack` will do the reverse, once again creating many `NaN` values." + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d8 = d7.unstack()\n", + "d8" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we call `unstack` again, we end up with a `Series` object:" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "d9 = d8.unstack()\n", + "d9" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `stack` and `unstack` methods let you select the `level` to stack/unstack. You can even stack/unstack multiple levels at once:" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [], + "source": [ + "d10 = d9.unstack(level = (0,1))\n", + "d10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Most methods return modified copies\n", + "As you may have noticed, the `stack` and `unstack` methods do not modify the object they apply to. Instead, they work on a copy and return that copy. This is true of most methods in pandas." 
+ ] + }, { "cell_type": "markdown", "metadata": {}, @@ -668,7 +1368,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 67, "metadata": { "collapsed": false }, @@ -686,7 +1386,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 68, "metadata": { "collapsed": false }, @@ -699,12 +1399,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can also access rows by location using the `iloc` attribute:" + "You can also access rows by integer location using the `iloc` attribute:" ] }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 69, "metadata": { "collapsed": false }, @@ -722,7 +1422,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 70, "metadata": { "collapsed": false }, @@ -740,7 +1440,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 71, "metadata": { "collapsed": false }, @@ -758,7 +1458,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 72, "metadata": { "collapsed": false }, @@ -777,7 +1477,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 73, "metadata": { "collapsed": false }, @@ -788,7 +1488,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 74, "metadata": { "collapsed": false }, @@ -804,7 +1504,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 75, "metadata": { "collapsed": false }, @@ -822,7 +1522,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 76, "metadata": { "collapsed": false }, @@ -841,7 +1541,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 77, "metadata": { "collapsed": false }, @@ -861,7 +1561,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 78, "metadata": { "collapsed": false }, @@ -882,7 +1582,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 79, "metadata": { "collapsed": false }, @@ -906,7 +1606,7 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 80, "metadata": { "collapsed": false }, @@ -925,7 +1625,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 81, "metadata": { "collapsed": false }, @@ -949,7 +1649,7 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 82, "metadata": { "collapsed": false }, @@ -978,7 +1678,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 83, "metadata": { "collapsed": false }, @@ -996,7 +1696,7 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 84, "metadata": { "collapsed": false }, @@ -1015,7 +1715,7 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 85, "metadata": { "collapsed": false }, @@ -1036,7 +1736,7 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 86, "metadata": { "collapsed": false }, @@ -1055,7 +1755,7 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 87, "metadata": { "collapsed": false }, @@ -1073,7 +1773,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 88, "metadata": { "collapsed": false }, @@ -1092,7 +1792,7 @@ }, { "cell_type": "code", - "execution_count": 54, + "execution_count": 89, "metadata": { "collapsed": false }, @@ -1114,7 +1814,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 90, "metadata": { "collapsed": false }, @@ -1133,7 +1833,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": 91, 
"metadata": { "collapsed": false, "scrolled": true @@ -1161,7 +1861,7 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 92, "metadata": { "collapsed": false }, @@ -1181,7 +1881,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 93, "metadata": { "collapsed": false }, @@ -1199,7 +1899,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 94, "metadata": { "collapsed": false }, @@ -1217,7 +1917,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 95, "metadata": { "collapsed": false, "scrolled": false @@ -1236,7 +1936,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 96, "metadata": { "collapsed": false }, @@ -1254,7 +1954,7 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 97, "metadata": { "collapsed": false }, @@ -1272,7 +1972,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": 98, "metadata": { "collapsed": false }, @@ -1290,7 +1990,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": 99, "metadata": { "collapsed": false }, @@ -1308,7 +2008,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 100, "metadata": { "collapsed": false }, @@ -1326,7 +2026,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 101, "metadata": { "collapsed": false }, @@ -1344,7 +2044,7 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 102, "metadata": { "collapsed": false, "scrolled": true @@ -1364,7 +2064,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 103, "metadata": { "collapsed": false }, @@ -1377,7 +2077,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 104, "metadata": { "collapsed": false, "scrolled": true @@ -1401,7 +2101,7 @@ }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 105, "metadata": { "collapsed": false, "scrolled": true @@ -1420,7 +2120,7 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": 106, "metadata": { "collapsed": false }, @@ -1443,7 +2143,7 @@ }, { "cell_type": "code", - "execution_count": 72, + "execution_count": 107, "metadata": { "collapsed": false }, @@ -1461,7 +2161,7 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 108, "metadata": { "collapsed": false, "scrolled": false @@ -1480,7 +2180,7 @@ }, { "cell_type": "code", - "execution_count": 74, + "execution_count": 109, "metadata": { "collapsed": false }, @@ -1502,7 +2202,7 @@ }, { "cell_type": "code", - "execution_count": 75, + "execution_count": 110, "metadata": { "collapsed": false }, @@ -1515,14 +2215,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well I guess some teachers probably do).\n", - "\n", "It is slightly annoying that the September column ends up on the right. This is because the `DataFrame`s we are adding do not have the exact same columns (the `grades` `DataFrame` is missing the `\"dec\"` column), so to make things predictable, pandas orders the final columns alphabetically. 
To fix this, we can simply add the missing column before adding:" ] }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 111, "metadata": { "collapsed": false, "scrolled": true @@ -1538,22 +2236,60 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Aggregating with `groupby`\n", - "Similar to the SQL language, pandas allows grouping your data into groups to run calculations over each group.\n", - "\n", - "First, let's add some extra data about each person so we can group them:" + "There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well I guess some teachers probably do). So let's call the `dropna` method to get rid of rows that are full of `NaN`s:" ] }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 112, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "final_grades_clean = final_grades.dropna(how=\"all\")\n", + "final_grades_clean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's remove columns that are full of `NaN`s by setting the `axis` argument to `1`:" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "final_grades_clean = final_grades_clean.dropna(axis=1, how=\"all\")\n", + "final_grades_clean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Aggregating with `groupby`\n", + "Similar to the SQL language, pandas allows grouping your data into groups to run calculations over each group.\n", + "\n", + "First, let's add some extra data about each person so we can group them, and let's go back to the `final_grades` `DataFrame` so we can see how `NaN` values are handled:" + ] + }, + { + "cell_type": "code", + "execution_count": 114, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ - "final_grades[\"hobby\"] = [\"Biking\", \"Dancing\", \"Reading\", \"Dancing\", \"Biking\"]\n", + "final_grades[\"hobby\"] = [\"Biking\", \"Dancing\", np.nan, \"Dancing\", \"Biking\"]\n", "final_grades" ] }, @@ -1566,7 +2302,7 @@ }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 115, "metadata": { "collapsed": false }, @@ -1580,12 +2316,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now let's compute the average grade per hobby:" + "We are ready to compute the average grade per hobby:" ] }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 116, "metadata": { "collapsed": false }, @@ -1598,7 +2334,112 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "That was easy! Note that the `NaN` values have simply been skipped." + "That was easy! Note that the `NaN` values have simply been skipped when computing the means." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pivot tables\n", + "Pandas supports spreadsheet-like [pivot tables](https://en.wikipedia.org/wiki/Pivot_table) that allow quick data summarization. 
To illustrate this, let's create a simple `DataFrame`:" + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "bonus_points" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "more_grades = final_grades_clean.stack().reset_index()\n", + "more_grades.columns = [\"name\", \"month\", \"grade\"]\n", + "more_grades[\"bonus\"] = [np.nan, np.nan, np.nan, 0, np.nan, 2, 3, 3, 0, 0, 1, 0]\n", + "more_grades" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can call the `pivot_table` function for this `DataFrame`, asking to group by the `name` column. By default, `pivot_table` computes the `mean` of each numeric column:" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pd.pivot_table(more_grades, index=\"name\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can change the aggregation function by setting the `aggfunc` attribute, and we can also specify the list of columns whose values will be aggregated:" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pd.pivot_table(more_grades, index=\"name\", values=[\"grade\",\"bonus\"], aggfunc=np.max)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also specify the `columns` to aggregate over horizontally, and request the grand totals for each row and column by setting `margins=True`:" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pd.pivot_table(more_grades, index=\"name\", values=\"grade\", columns=\"month\", margins=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can specify multiple index or column names, and pandas will create multi-level indices:" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "pd.pivot_table(more_grades, index=(\"name\", \"month\"), margins=True)" ] }, { @@ -1611,7 +2452,7 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 123, "metadata": { "collapsed": false, "scrolled": false @@ -1619,10 +2460,10 @@ "outputs": [], "source": [ "much_data = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))\n", - "large = pd.DataFrame(much_data, columns=list(\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"))\n", - "large[large%16==0] = np.nan\n", - "large.insert(3,\"some_text\", \"Blabla\")\n", - "large" + "large_df = pd.DataFrame(much_data, columns=list(\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"))\n", + "large_df[large_df % 16 == 0] = np.nan\n", + "large_df.insert(3,\"some_text\", \"Blabla\")\n", + "large_df" ] }, { @@ -1634,14 +2475,14 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 124, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ - "large.head()" + "large_df.head()" ] }, { @@ -1653,13 +2494,13 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [], "source": [ - "large.tail(n=2)" + "large_df.tail(n=2)" ] }, { @@ -1671,14 +2512,14 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 126, "metadata": { "collapsed": false, 
"scrolled": false }, "outputs": [], "source": [ - "large.info()" + "large_df.info()" ] }, { @@ -1696,28 +2537,511 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": 127, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ - "large.describe()" + "large_df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# To be continued...\n", - "Coming soon:\n", - "* categories\n", - "* pivot-tables\n", - "* stacking\n", - "* merging\n", - "* time series\n", - "* loading & saving" + "## Saving & loading\n", + "Pandas can save `DataFrame`s to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let's create a `DataFrame` to demonstrate this:" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "my_df = pd.DataFrame(\n", + " [[\"Biking\", 68.5, 1985, np.nan], [\"Dancing\", 83.1, 1984, 3]], \n", + " columns=[\"hobby\",\"weight\",\"birthyear\",\"children\"],\n", + " index=[\"alice\", \"bob\"]\n", + ")\n", + "my_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving\n", + "Let's save it to CSV, HTML and JSON:" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "my_df.to_csv(\"my_df.csv\")\n", + "my_df.to_html(\"my_df.html\")\n", + "my_df.to_json(\"my_df.json\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Done! Let's take a peek at what was saved:" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for filename in (\"my_df.csv\", \"my_df.html\", \"my_df.json\"):\n", + " print(\"#\", filename)\n", + " with open(filename, \"rt\") as f:\n", + " print(f.read())\n", + " print()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the index is saved as the first column (with no name) in a CSV file, as `