Commit a5445ffb authored by Erxleben, Fredo

Merge branch '1-split-the-episode-on-series-and-dataframes' into 'main'

Resolve "Split the Episode on Series and Dataframes"

Closes #1

See merge request hifis/software/education/workshop-pandas!3
parents 6ba13a86 b676fdfb
Pipeline #149359 passed with stage
in 2 minutes and 55 seconds
......@@ -37,9 +37,11 @@ nav:
- About the Workshop: index.md
- Episodes:
- episodes/00-introduction.md
- episodes/01-dataframes-and-series.md
- episodes/02-accessing-filtering.md
- episodes/03-modifying-dataframes.md
- episodes/01-series.md
- episodes/02-dataframes.md
- episodes/03-accessing-data.md
- episodes/04-filtering.md
- episodes/05-modifying-dataframes.md
- Tasks:
- exercises/00-about.md
- exercises/01-getting-data.md
......
# Introduction
## Overview
* Teaching: 5 min
* Exercises: -
* Questions
* What is _pandas_ and what do we use it for?
* Objectives
* Learn about the background of the _pandas_ framework
---
title: "Introduction"
---
# Introduction
## What is _pandas_?
* _Pandas_ is a framework - i.e. a collection of functionality, not a program on its own
* _Pandas_ is based on the numerical mathematics framework _numpy_
* Compared to _numpy_ it offers more usability and convenience but sacrifices speed
_Pandas_ is a framework - i.e. a collection of functionality, not a program on its own.
It is based on the numerical mathematics framework _numpy_.
Compared to _numpy_, _pandas_ offers more usability and convenience but sacrifices speed.
## What is _pandas_ used for?
Its main application is **data processing**.
This includes:
* Reading, exploring, cleaning, transforming and visualizing data
Common areas that make use of it are:
* Data Science
* Machine Learning
## How to get _pandas_?
It can be installed via _pip_.
It can be installed via _pip_ or _conda_ ([cf. _pandas_ on pypi.org](https://pypi.org/project/pandas/)).
Make sure that the dependencies are installed as well.
## Where to find help?
* Official Documentation: https://pandas.pydata.org/docs/
---
!!! important "Key Points"
## Key Points
* _pandas_ is based on _numpy_
* It offers additional utility functions but sacrifices speed
* _pandas_ is a data processing framework based on _numpy_
* It offers additional utility functions but sacrifices speed
# Dataframes and Series
## Overview
* Teaching: 15 min
* Exercises: -
* Questions
* What are _Dataframes_ in _pandas_?
* What are _Series_ in _pandas_?
* Objectives
* Get acquainted with the central data structures of the framework
---
## Introducing _Series_
* _Series_ in _pandas_ represent 1-dimensional data, i.e. a sequence of values
* It is provided as its own data type by _pandas_
* _Series_ are often used to represent values changing over time
* The values within a _series_ do usually have the same data type
* Each of the values can have an index associated with it
To import the _Series_ data type, run:
```python
from pandas import Series # Note the initial upper-case letter
```
### Creating _Series_ from various kinds of Data
Let's say we have a cat and we noticed it is sneezing a lot.
We suspect it might be allergic to something.
So we track the count of sneezes over one week.
```python
sneeze_counts = Series(data=[32, 41, 56, 62, 30, 22, 17])
print(sneeze_counts)
```
Output:
```
0 32
1 41
2 56
3 62
4 30
5 22
6 17
dtype: int64
```
* The _Series_ automatically adds an index on the left side
* It also automatically infers the best fitting data type for the elements (here `int64` = 64-bit integer)
To make the data a bit more meaningful, let's set a custom index:
```python
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sneeze_counts.index = days_of_week
print(sneeze_counts)
```
Output:
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
dtype: int64
```
Also, we add a name to the series, so we can distinguish it later:
```python
sneeze_counts.name = "Sneezes"
```
* The index and name can also be passed directly while creating the series
We suspect that the illness of our cat is related to the weather, so we also log the average temperature and humidity:
```python
temperatures = Series(data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4], index=days_of_week, name="Temperature")
humidities = Series(data=[62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2], index=days_of_week, name="Humidity")
```
> **Note:** Alternatively you can provide the index while creating the _series_ by passing a dictionary:
> ```python
> sneeze_counts = Series(
>     data={
>         "Monday": 32,
>         "Tuesday": 41,
>         "Wednesday": 56,
>         "Thursday": 62,
>         "Friday": 30,
>         "Saturday": 22,
>         "Sunday": 17
>     },
>     name="Sneezes"
> )
> ```
* To get a first statistical impression of the data, use the `describe()`-method:
```python
print(temperatures.describe())
```
Output:
```
count 7.000000
mean 9.628571
std 1.871465
min 7.600000
25% 8.000000
50% 9.400000
75% 11.000000
max 12.400000
Name: Temperature, dtype: float64
```
## Introducing _Dataframes_
To correlate our various measurements, we want some table-like data structure, so we import _Dataframes_:
```python
from pandas import DataFrame # Note the camel-case spelling
```
* A _dataframe_ can be created from a list of _series_, where each _series_ forms a **row** in the resulting table
```python
measurements = DataFrame(data=[sneeze_counts, temperatures, humidities])
print(measurements)
```
Output:
```
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Sneezes 32.0 41.0 56.0 62.0 30.0 22.0 17.0
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
Humidity 62.5 76.3 82.4 98.2 77.4 58.9 41.2
```
* A _dataframe_ can be created from a dictionary of _series_ where each _series_ forms a **column** in the resulting table
```python
measurements = DataFrame(
    data={
        sneeze_counts.name: sneeze_counts,
        temperatures.name: temperatures,
        humidities.name: humidities
    }
)
print(measurements)
```
Output:
```
Sneezes Temperature Humidity
Monday 32 10.9 62.5
Tuesday 41 8.2 76.3
Wednesday 56 7.6 82.4
Thursday 62 7.8 98.2
Friday 30 9.4 77.4
Saturday 22 11.1 58.9
Sunday 17 12.4 41.2
```
* To flip rows and columns, _dataframes_ can be transposed using the `T`-property:
```python
column_wise = DataFrame(data=temperatures)
print(column_wise)
print() # Add a blank line as separator
row_wise = column_wise.T
print(row_wise)
```
Output:
```
Temperature
Monday 10.9
Tuesday 8.2
Wednesday 7.6
Thursday 7.8
Friday 9.4
Saturday 11.1
Sunday 12.4
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
```
* Note: Store the transposed dataframe in a new variable; the original will not be changed.
---
## Key Points
* _Series_ represent 1-dimensional data
* _Dataframes_ represent 2-dimensional (tabular) data
* Each column in a _dataframe_ is a _series_
* _dataframes_ and _series_ have row indices to label the data
* _Dataframes_ may be transposed to switch rows and columns
---
title: Series
---
# Series
Let's say we have a cat and we noticed it is sneezing a lot.
We suspect it might be allergic to something.
So we track the count of sneezes over one week.
For this purpose, we could employ the _Series_ data type provided by _pandas_.
Start by importing it:
```python
from pandas import Series # Note the initial upper-case letter
```
## Creating a _Series_
There are different ways we can add data to a _Series_.
We start out with a simple list:
```python
sneeze_counts = Series(data=[32, 41, 56, 62, 30, 22, 17])
print(sneeze_counts)
```
??? hint "Output"
```
0 32
1 41
2 56
3 62
4 30
5 22
6 17
dtype: int64
```
Note that the _Series_ automatically adds an index on the left side.
It also automatically infers the best fitting data type for the elements (here `int64` = 64-bit integer).
> **Note:** If you are not familiar with Object-oriented Programming you might be caught a bit off guard by the way this actually works.
> In short, _pandas_ introduces the series as a new data type (like `int`, `str` and all the others) and as such the value of `sneeze_counts` is actually the whole series at once.
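To make this note more concrete, here is a minimal sketch that inspects the object stored in `sneeze_counts`. It reuses the values from this episode; `index` and `dtype` are regular attributes of a _pandas_ series.

```python
from pandas import Series

sneeze_counts = Series(data=[32, 41, 56, 62, 30, 22, 17])

# The variable holds one single object of type Series, not seven separate numbers
print(type(sneeze_counts))

# The automatically generated index labels
print(list(sneeze_counts.index))

# The data type that was inferred for the elements
print(sneeze_counts.dtype)

# The number of values stored in the series
print(len(sneeze_counts))
```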
## Extra Information
To make the data a bit more meaningful, let's set a custom index:
```python
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sneeze_counts.index = days_of_week
print(sneeze_counts)
```
??? hint "Output"
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
dtype: int64
```
Also, we add a name to the series, so we can distinguish it later:
```python
sneeze_counts.name = "Sneezes"
```
## All at Once
The index and name can also be passed directly while creating the series.
We suspect that the illness of our cat is related to the weather, so we also log the average temperature and humidity:
```python
temperatures = Series(
    data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4],
    index=days_of_week,
    name="Temperature"
)
humidities = Series(
    data=[62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2],
    index=days_of_week,
    name="Humidity"
)
```
!!! note ""
Alternatively you can provide the index while creating the _series_ by passing a dictionary:
```python
sneeze_counts = Series(
    data={
        "Monday": 32,
        "Tuesday": 41,
        "Wednesday": 56,
        "Thursday": 62,
        "Friday": 30,
        "Saturday": 22,
        "Sunday": 17
    },
    name="Sneezes"
)
```
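As a quick check, the two ways of creating a labelled series should lead to the same result. The sketch below builds the series once from a list plus index and once from a dictionary, then compares them. The variable names `counts`, `from_list` and `from_dict` are only chosen for this illustration.

```python
from pandas import Series

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
counts = [32, 41, 56, 62, 30, 22, 17]

# Variant 1: values as a list, labels passed separately as the index
from_list = Series(data=counts, index=days_of_week, name="Sneezes")

# Variant 2: a dictionary, where the keys become the index labels
from_dict = Series(data=dict(zip(days_of_week, counts)), name="Sneezes")

# equals() checks that both series hold the same values under the same labels
print(from_list.equals(from_dict))
```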
## Quick Maths
To get a first statistical impression of the data, use the `describe()`-method:
```python
print(temperatures.describe())
```
??? hint "Output"
```
count 7.000000
mean 9.628571
std 1.871465
min 7.600000
25% 8.000000
50% 9.400000
75% 11.000000
max 12.400000
Name: Temperature, dtype: float64
```
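The individual values of this summary can also be computed on their own. The following sketch assumes the same temperature values as above and uses the series methods `mean()`, `min()`, `median()` and `max()`; the variable `summary` is introduced only for this example.

```python
from pandas import Series

temperatures = Series(
    data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4],
    name="Temperature"
)

# describe() returns a series itself, labelled with the names of the statistics
summary = temperatures.describe()
print(summary["mean"])

# The same statistics are available as separate methods
print(temperatures.mean())
print(temperatures.min(), temperatures.median(), temperatures.max())
```

Note that the `50%` entry of the summary corresponds to the median.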
!!! important "Key Points"
* _Series_ are a 1-dimensional data structure
* You can use indices to label the data and a name to label the whole Series
---
title: "Dataframes"
---
# Dataframes
To correlate our various measurements, we want some table-like data structure, so we import _Dataframes_:
```python
from pandas import DataFrame # Note the camel-case spelling
```
## Creating Dataframes
A _dataframe_ can be created from a list of _series_, where each _series_ forms a **row** in the resulting table.
```python
measurements = DataFrame(data=[sneeze_counts, temperatures, humidities])
print(measurements)
```
??? hint "Output"
```
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Sneezes 32.0 41.0 56.0 62.0 30.0 22.0 17.0
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
Humidity 62.5 76.3 82.4 98.2 77.4 58.9 41.2
```
A _dataframe_ can also be created from a dictionary of _series_ where each _series_ forms a **column** in the resulting table.
```python
measurements = DataFrame(
    data={
        sneeze_counts.name: sneeze_counts,
        temperatures.name: temperatures,
        humidities.name: humidities
    }
)
print(measurements)
```
??? hint "Output"
```
Sneezes Temperature Humidity
Monday 32 10.9 62.5
Tuesday 41 8.2 76.3
Wednesday 56 7.6 82.4
Thursday 62 7.8 98.2
Friday 30 9.4 77.4
Saturday 22 11.1 58.9
Sunday 17 12.4 41.2
```
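You may have noticed that the sneeze counts show up as `32.0` in the row-based table but as `32` in the column-based one. A likely explanation is that _pandas_ stores one data type per column: when each series forms a row, every column mixes integers and decimal numbers and is therefore stored as decimal numbers. The sketch below compares the column data types of both variants; the names `row_based` and `column_based` are chosen only for this illustration.

```python
from pandas import DataFrame, Series

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sneeze_counts = Series(data=[32, 41, 56, 62, 30, 22, 17], index=days_of_week, name="Sneezes")
temperatures = Series(data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4], index=days_of_week, name="Temperature")
humidities = Series(data=[62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2], index=days_of_week, name="Humidity")

# Each series becomes a row: every column (one per day) mixes integers and floats
row_based = DataFrame(data=[sneeze_counts, temperatures, humidities])

# Each series becomes a column and keeps its own data type
column_based = DataFrame(
    data={
        sneeze_counts.name: sneeze_counts,
        temperatures.name: temperatures,
        humidities.name: humidities
    }
)

print(row_based.dtypes)
print()  # Add a blank line as separator
print(column_based.dtypes)
```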
## Turn around
To flip rows and columns, _dataframes_ can be transposed using the `T`-property:
```python
column_wise = DataFrame(data=temperatures)
row_wise = column_wise.T
print(column_wise)
print() # Add a blank line as separator
print(row_wise)
```
??? hint "Output"
```
Temperature
Monday 10.9
Tuesday 8.2
Wednesday 7.6
Thursday 7.8
Friday 9.4
Saturday 11.1
Sunday 12.4
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
```
Don't forget to store the transposed dataframe in a new variable (or overwrite the old one), as the original will not be changed by the transposition.
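A small sketch to confirm this behaviour: after the transposition, the original dataframe still has its old shape. The example relies on the `shape` attribute, which reports the number of rows and columns.

```python
from pandas import DataFrame, Series

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
temperatures = Series(
    data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4],
    index=days_of_week,
    name="Temperature"
)

column_wise = DataFrame(data=temperatures)
row_wise = column_wise.T

# The original still has 7 rows and 1 column
print(column_wise.shape)

# Only the new dataframe has 1 row and 7 columns
print(row_wise.shape)
```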
!!! important "Key Points"
* _Dataframes_ represent 2-dimensional (tabular) data
* Each column in a _dataframe_ is a _series_
* _Dataframes_ have row and column indices
* _Dataframes_ may be transposed to switch rows and columns
# Accessing and Filtering Data
## Overview
* Teaching: 10 min
* Exercises: -
* Questions
* How to selectively access data from a dataframe
* Objectives
* Understand the `[ ]`-access within data frames
---
title: "Accessing Data"
---
## Accessing Data
# Accessing Data
Reminder: We currently have a dataframe called `measurements` and it looks like this:
```
Sneezes Temperature Humidity
Monday 32 10.9 62.5
......@@ -25,142 +17,103 @@ Saturday 22 11.1 58.9
Sunday 17 12.4 41.2
```
### Selecting Columns
## Selecting Columns
To get all available column names, run
```python
print(measurements.columns.values)
```
Output:
```
['Sneezes' 'Temperature' 'Humidity']
```
??? hint "Output"
```
['Sneezes' 'Temperature' 'Humidity']
```
We can extract a single column by using the `[]`-operator:
```python
print(measurements["Sneezes"])
```
Output:
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
Name: Sneezes, dtype: int64
```
* **Note** that the output is a _series_ again
??? hint "Output"
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
Name: Sneezes, dtype: int64
```
Note that the output is a _series_ again.
To access a selection of columns, we pass in a list of column names in the desired order:
```python
print(measurements[ ["Humidity", "Sneezes"] ])
```
Output:
```
Humidity Sneezes
Monday 62.5 32
Tuesday 76.3 41
Wednesday 82.4 56
Thursday 98.2 62
Friday 77.4 30
Saturday 58.9 22
Sunday 41.2 17
```
### Selecting Rows
??? hint "Output"
```
Humidity Sneezes
Monday 62.5 32
Tuesday 76.3 41
Wednesday 82.4 56
Thursday 98.2 62
Friday 77.4 30
Saturday 58.9 22
Sunday 41.2 17
```
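The two access forms above return different data types, which can be surprising at first. The following sketch rebuilds the `measurements` dataframe (here from plain lists, purely to keep the example self-contained) and checks what each form gives back. The names `single_column` and `column_selection` are chosen for illustration only.

```python
from pandas import DataFrame

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
measurements = DataFrame(
    data={
        "Sneezes": [32, 41, 56, 62, 30, 22, 17],
        "Temperature": [10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4],
        "Humidity": [62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2]
    },
    index=days_of_week
)

# A single column name yields a series
single_column = measurements["Sneezes"]
print(type(single_column).__name__)

# A list of column names yields a dataframe, even if the list has only one entry
column_selection = measurements[["Sneezes"]]
print(type(column_selection).__name__)

# The order of the list determines the order of the columns in the result
print(list(measurements[["Humidity", "Sneezes"]].columns))
```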
## Selecting Rows
To access given rows you can use the slicing operation as known from lists:
```python
print(measurements[0:3])
```
> **Note:** If you pass in a single number instead of `[start:stop]`, _pandas_ will look for a column with that number as a label.
> This will fail in our example since there is no column with a number as its label.
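The behaviour described in this note can be tried out safely with a `try`/`except` block. In the sketch below, the `measurements` dataframe is rebuilt from plain lists to keep the example self-contained; `first_three` and `lookup_failed` are names chosen only for this illustration.

```python
from pandas import DataFrame

days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
measurements = DataFrame(
    data={
        "Sneezes": [32, 41, 56, 62, 30, 22, 17],
        "Temperature": [10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4],
        "Humidity": [62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2]
    },
    index=days_of_week
)

# Slicing selects rows by position; the stop position is excluded, as with lists
first_three = measurements[0:3]
print(first_three)

# A single number is treated as a label, which does not exist in our example
try:
    measurements[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```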
!!! caution ""