Verified Commit b81fb95f authored by Erxleben, Fredo

Add basic material from First Steps in Python - course

parent 76dbf200
This project is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
You can read the full license text under
https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
This workshop was created for HIFIS - Helmholtz Federated IT Systems
Website: https://hifis.net
E-mail: support@hifis.net
The license holder is the Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
Website: https://www.hzdr.de
site_name: Object Oriented Programming in Python
strict: false
theme:
  name: 'material'
  custom_dir: theme_overrides
  sticky-navigation: true
plugins:
  - search
  - kroki:
      ServerURL: https://kroki.hzdr.de
      DownloadImages: True
      FencePrefix: ""
docs_dir: workshop_materials
markdown_extensions:
  - smarty
  - toc:
      baselevel: 1
      permalink: true
  - admonition
  - def_list
  - pymdownx.details
  #== Highlighting ==#
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
nav:
  - Getting started:
      - About the Workshop: index.md
  - Episodes:
      - episodes/00-introduction.md
      - episodes/01-dataframes-and-series.md
      - episodes/02-accessing-filtering.md
      - episodes/03-modifying-dataframes.md
  - Tasks:
      - exercises/00-exercises.md
[virtualenvs]
in-project = true
create = true
[tool.poetry]
name = "workshop-oop-in-python"
version = "0.1.0"
description = ""
authors = ["The HIFIS Education team <support@hifis.net>"]
readme = "README.md"
#packages = [{include = "workshop_oop_in_python"}]
[tool.poetry.dependencies]
python = "^3.9"
mkdocs = "^1.3.0"
mkdocs-material = "^8.2.15"
mkdocs-kroki-plugin = "^0.3.0"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
{% block copyright %}
<div>
<a
rel="license"
href="http://creativecommons.org/licenses/by-nc-sa/4.0/"
>
<img
alt="Creative Commons License"
style="border-width:0"
src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png"
/>
</a>
<a
rel="author"
href="https://hzdr.de"
>
<img
alt="HZDR logo"
style="height: 31px; padding: 2px; background: white; border: 1px white; border-radius: 2px;"
src="https://www.hzdr.de/assets/cms/v2018/images/headerlogo.png"
/>
</a>
<a
rel="author"
href="https://hifis.net"
>
<img
alt="HIFIS logo"
style="height: 31px; padding: 2px; background: white; border: 1px white; border-radius: 2px;"
src="https://hifis.net/assets/img/HIFIS_Logo_short_RGB_cropped.svg"
/>
</a>
</div>
{% endblock %}
# _Pandas_ - Introduction
## Overview
* Teaching: 5 min
* Exercises: -
* Questions
* What is _pandas_ and what do we use it for?
* Objectives
* Learn about the background of the _pandas_ framework
---
## What is _pandas_?
* _Pandas_ is a framework - i.e. a collection of functionality, not a program on its own
* _Pandas_ is based on the numerical mathematics framework _numpy_
* Compared to _numpy_ it offers more usability and convenience, but sacrifices some speed
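To see the _numpy_ foundation for yourself, here is a minimal sketch (it assumes both packages are installed and uses a `Series`, which the following episodes introduce properly):
```python
import pandas

# Every pandas Series is backed by a numpy array under the hood.
series = pandas.Series([1, 2, 3])
print(type(series.to_numpy()))  # <class 'numpy.ndarray'>
```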
## What is _pandas_ used for?
Its main application is **data processing**.
This includes:
* Reading, exploring, cleaning, transforming and visualizing data
Common areas that make use of it are:
* Data Science
* Machine Learning
## How to get _pandas_?
It can be installed via _pip_.
Make sure that the dependencies are installed as well.
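A quick way to verify the installation (a minimal sketch; it assumes _pandas_ was installed into the currently active environment, e.g. with `pip install pandas`):
```python
# If this import succeeds, pandas and its dependencies are available.
import pandas

print(pandas.__version__)  # prints the installed version
```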
## Where to find help?
* Official Documentation: https://pandas.pydata.org/docs/
---
## Key Points
* _pandas_ is based on _numpy_
* It offers additional utility functions but sacrifices speed
# _Pandas_ - Dataframes and Series
## Overview
* Teaching: 15 min
* Exercises: -
* Questions
* What are _Dataframes_ in _pandas_?
* What are _Series_ in _pandas_?
* Objectives
* Get acquainted with the central data structures of the framework
---
## Introducing _Series_
* _Series_ in _pandas_ represent 1-dimensional data, i.e. a sequence of values
* _Series_ are provided as their own data type by _pandas_
* _Series_ are often used to represent values changing over time
* The values within a _series_ usually have the same data type
* Each of the values can have an index associated with it
To import the _Series_ type, run:
```python
from pandas import Series # Note the initial upper-case letter
```
### Creating _Series_ from various kinds of Data
Let's say we have a cat and we noticed it is sneezing a lot.
We suspect it might be allergic to something.
So we track the count of sneezes over one week.
```python
sneeze_counts = Series(data=[32, 41, 56, 62, 30, 22, 17])
print(sneeze_counts)
```
Output:
```
0 32
1 41
2 56
3 62
4 30
5 22
6 17
dtype: int64
```
* The _Series_ automatically adds an index on the left side
* It also automatically infers the best fitting data type for the elements (here `int64` = 64-bit integer)
To make the data a bit more meaningful, let's set a custom index:
```python
days_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sneeze_counts.index = days_of_week
print(sneeze_counts)
```
Output:
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
dtype: int64
```
Also, we add a name to the series, so we can distinguish it later:
```python
sneeze_counts.name = "Sneezes"
```
* The index and name can also be passed directly while creating the series
We suspect that the illness of our cat is related to the weather, so we also log the average temperature and humidity:
```python
temperatures = Series(data=[10.9, 8.2, 7.6, 7.8, 9.4, 11.1, 12.4], index=days_of_week, name="Temperature")
humidities = Series(data=[62.5, 76.3, 82.4, 98.2, 77.4, 58.9, 41.2], index=days_of_week, name="Humidity")
```
> **Note:** Alternatively you can provide the index while creating the _series_ by passing a dictionary:
> ```python
> sneeze_counts = Series(
> data= {
> "Monday": 32,
> "Tuesday": 41,
> "Wednesday": 56,
> "Thursday": 62,
> "Friday": 30,
> "Saturday": 22,
> "Sunday": 17
> },
> name="Sneezes"
> )
> ```
* To get a first statistical impression of the data, use the `describe()`-method:
```python
print(temperatures.describe())
```
Output:
```
count 7.000000
mean 9.628571
std 1.871465
min 7.600000
25% 8.000000
50% 9.400000
75% 11.000000
max 12.400000
Name: Temperature, dtype: float64
```
## Introducing _Dataframes_
To correlate our various measurements, we want some table-like data structure, so we import _Dataframes_:
```python
from pandas import DataFrame # Note the camel-case spelling
```
* A _dataframe_ can be created from a list of _series_, where each _series_ forms a **row** in the resulting table
```python
measurements = DataFrame(data=[sneeze_counts, temperatures, humidities])
print(measurements)
```
Output:
```
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Sneezes 32.0 41.0 56.0 62.0 30.0 22.0 17.0
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
Humidity 62.5 76.3 82.4 98.2 77.4 58.9 41.2
```
* A _dataframe_ can be created from a dictionary of _series_ where each _series_ forms a **column** in the resulting table
```python
measurements = DataFrame(
data={
sneeze_counts.name: sneeze_counts,
temperatures.name: temperatures,
humidities.name: humidities
}
)
print(measurements)
```
Output:
```
Sneezes Temperature Humidity
Monday 32 10.9 62.5
Tuesday 41 8.2 76.3
Wednesday 56 7.6 82.4
Thursday 62 7.8 98.2
Friday 30 9.4 77.4
Saturday 22 11.1 58.9
Sunday 17 12.4 41.2
```
* To flip rows and columns, _dataframes_ can be transposed using the `T`-property:
```python
column_wise = DataFrame(data=temperatures)
print(column_wise)
print() # Add a blank line as separator
row_wise = column_wise.T
print(row_wise)
```
Output:
```
Temperature
Monday 10.9
Tuesday 8.2
Wednesday 7.6
Thursday 7.8
Friday 9.4
Saturday 11.1
Sunday 12.4

Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Temperature 10.9 8.2 7.6 7.8 9.4 11.1 12.4
```
* Note: `T` returns a new, transposed dataframe and leaves the original unchanged, so store the result in a (new) variable if you want to keep it.
---
## Key Points
* _Series_ represent 1-dimensional data
* _Dataframes_ represent 2-dimensional (tabular) data
* Each column in a _dataframe_ is a _series_
* _dataframes_ and _series_ have row indices to label the data
* _Dataframes_ may be transposed to switch rows and columns
# _Pandas_ - Accessing and Filtering Data
## Overview
* Teaching: 10 min
* Exercises: -
* Questions
* How to selectively access data from a dataframe?
* Objectives
* Understand the `[ ]`-access within dataframes
---
## Accessing Data
Reminder: We currently have a dataframe called `measurements` and it looks like this:
```
Sneezes Temperature Humidity
Monday 32 10.9 62.5
Tuesday 41 8.2 76.3
Wednesday 56 7.6 82.4
Thursday 62 7.8 98.2
Friday 30 9.4 77.4
Saturday 22 11.1 58.9
Sunday 17 12.4 41.2
```
### Selecting Columns
To get all available column names, run:
```python
print(measurements.columns.values)
```
Output:
```
['Sneezes' 'Temperature' 'Humidity']
```
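The row labels are available in the same way through the `index`-attribute (a small aside, reusing the `measurements` dataframe; it mirrors the `columns`-attribute above):
```python
print(measurements.index.values)
```
Output:
```
['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday']
```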
We can extract a single column by using the `[]`-operator:
```python
print(measurements["Sneezes"])
```
Output:
```
Monday 32
Tuesday 41
Wednesday 56
Thursday 62
Friday 30
Saturday 22
Sunday 17
Name: Sneezes, dtype: int64
```
* **Note:** The output is a _series_ again
To access a selection of columns, we pass in a list of column names in the desired order:
```python
print(measurements[ ["Humidity", "Sneezes"] ])
```
Output:
```
Humidity Sneezes
Monday 62.5 32
Tuesday 76.3 41
Wednesday 82.4 56
Thursday 98.2 62
Friday 77.4 30
Saturday 58.9 22
Sunday 41.2 17
```
### Selecting Rows
To access a given range of rows, you can use the slicing operation known from lists:
```python
print(measurements[0:3])
```
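Output (the first three rows, i.e. positions 0, 1 and 2 of the `measurements` dataframe):
```
           Sneezes  Temperature  Humidity
Monday          32         10.9      62.5
Tuesday         41          8.2      76.3
Wednesday       56          7.6      82.4
```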
> **Note:** If you pass a single number instead of a `[start:stop]` slice, _pandas_ interprets it as a **column** label rather than a row position.
> This fails with a `KeyError` in our example, since our columns are named, not numbered.
### Access via `loc`
The property `loc` gives label-based access to the elements of a dataframe.
It follows the pattern `dataframe.loc[row_slice, column_slice]`.
For example:
```python
print(measurements.loc["Monday": "Friday", "Temperature":"Humidity"])
```
Output:
```
Temperature Humidity
Monday 10.9 62.5
Tuesday 8.2 76.3
Wednesday 7.6 82.4
Thursday 7.8 98.2
Friday 9.4 77.4
```
### Access via `iloc`
The `iloc`-property works similarly to `loc`, except that it takes integer positions instead of row/column labels:
```python
print(measurements.iloc[0:5, 1:])
```
> Output same as above. Note that, unlike with `loc`, the stop value of an `iloc` slice is exclusive, just like in regular Python slicing.
## Creating Filter Masks
We want to extract only the data for cold days, which we consider to be below 10 degrees Celsius.
For this purpose we generate a series to use as a filter mask:
```python
cold_days = measurements["Temperature"] < 10
print(cold_days)
```
Output:
```
Monday False
Tuesday True
Wednesday True
Thursday True
Friday True
Saturday False
Sunday False
Name: Temperature, dtype: bool
```
We can apply this filter to our dataframe:
```python
print(measurements[cold_days])
```
Output:
```
Sneezes Temperature Humidity
Tuesday 41 8.2 76.3
Wednesday 56 7.6 82.4
Thursday 62 7.8 98.2
Friday 30 9.4 77.4
```
These steps often get combined into one:
```python
print(measurements[measurements["Sneezes"] == 56])
```
Output:
```
Sneezes Temperature Humidity
Wednesday 56 7.6 82.4
```
> **Note:** A filter mask can be inverted by using the `~` prefix operator:
> ```python
> print(~cold_days)
> ```
> Output:
> ```
> Monday True
> Tuesday False
> Wednesday False
> Thursday False
> Friday False
> Saturday True
> Sunday True
> Name: Temperature, dtype: bool
> ```
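Applying the inverted mask to the dataframe gives the complementary selection, i.e. the warm days (a small sketch reusing `measurements` and `cold_days` from above):
```python
# The inverted mask keeps exactly the rows the original mask excluded:
# Monday, Saturday and Sunday.
print(measurements[~cold_days])
```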
---
## Key Points
* Rows and columns can be selected by their labels with the `loc`-property, or by their integer positions with the `iloc`-property
* Combining a selection with a boolean comparison generates a filter mask, which can in turn be used to filter a dataframe
# _Pandas_ - Modifying Dataframes
## Overview
* Teaching: 15 min
* Exercises: -
* Questions
* How to manipulate dataframes?
* Objectives
* Learn ways to add, change and remove data from a dataframe
---
## Side note: Incomplete Data
We intend to also note down the cleaning habits of our cat.
For this purpose we have created a new series of measurements.
```python
cleaning = Series(
data={"Monday": 2, "Friday": 1, "Saturday": 3},
index=days_of_week,
name="Cleaning"
)
print(cleaning)
```
Output:
```
Monday 2.0
Tuesday NaN
Wednesday NaN
Thursday NaN
Friday 1.0
Saturday 3.0
Sunday NaN
Name: Cleaning, dtype: float64
```
**Note** that not all weekdays have a value associated with them.
Incomplete data is a common problem in real-world measurements.
_Pandas_ tends to represent "no data" as `NaN`, which can be a pitfall.
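As a quick illustration of the pitfall, here is a minimal sketch reusing the `cleaning` series from above; `NaN` never compares equal to anything, not even to itself, and aggregation methods silently skip missing values:
```python
# NaN is not equal to anything, including itself, so naive comparisons fail.
print(float("nan") == float("nan"))  # False

# Aggregations skip NaN by default, which can hide that data is missing.
print(cleaning.sum())   # 6.0 -- only the three recorded values are summed
print(cleaning.mean())  # 2.0 -- the mean of three values, not of seven days
```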
## Calculating with DataFrames
Our veterinarian friend wants to help us and requests that we send them the temperatures we measured.
Since they live in the US, they would prefer to have the measurements in Fahrenheit:
```python
print(measurements["Temperature"] * 1.8 + 32)
```
Output:
```
Monday 51.62
Tuesday 46.76
Wednesday 45.68
Thursday 46.04
Friday 48.92
Saturday 51.98
Sunday 54.32
Name: Temperature, dtype: float64
```
## Adding another column to a Dataframe
To extend our dataframe, we can use the `join`-method:
```python
measurements.join(cleaning)
print(measurements)
```
This does not seem to have worked as expected!
The reason is that many dataframe manipulations return a new dataframe with the result instead of modifying the original.
We can assign the result to our original variable (or a new one):
```python
measurements = measurements.join(cleaning)
print(measurements)
```
Output:
```
Sneezes Temperature Humidity Cleaning
Monday 32 10.9 62.5 2.0
Tuesday 41 8.2 76.3 NaN
Wednesday 56 7.6 82.4 NaN
Thursday 62 7.8 98.2 NaN
Friday 30 9.4 77.4 1.0
Saturday 22 11.1 58.9 3.0
Sunday 17 12.4 41.2 NaN
```
## Side Note: Advanced filtering
Dataframes offer additional methods to generate filter masks, for example `isnull()`, which marks every missing value:
```python
missing_data = measurements.isnull()
print(missing_data)