Commit c499bec8 authored by Erxleben, Fredo's avatar Erxleben, Fredo

Split exercises in dedicated tasks

parent 2d56f8d2
Pipeline #148713 passed with stage
in 52 seconds
@@ -41,4 +41,9 @@ nav:
- episodes/02-accessing-filtering.md
- episodes/03-modifying-dataframes.md
- Tasks:
- exercises/00-exercises.md
- exercises/00-about.md
- exercises/01-getting-data.md
- exercises/02-loading-data.md
- exercises/03-cleaning-data.md
- exercises/04-initial-exploration.md
- exercises/05-advanced-exploration.md
---
title: "About the Exercises"
---
# About the Exercises
This is a stand-alone, open-ended exercise to practise working with _pandas_ on a real-life data set.
> It is highly recommended to keep the [_pandas_ documentation][docs] open during this exercise; it will be needed a lot.
> Not all required functions will be listed in the exercise document; finding out what to use is part of the intended training.
Please take your time to read the instructions and hints carefully to avoid getting stuck on minor details.
[docs]: https://pandas.pydata.org/docs
# Pandas Exercises
This is a stand-alone, open-ended exercise to practise working with _pandas_ on a real-life data set.
> It is highly recommended to keep the _pandas_ documentation open during this exercise; it will be needed a lot.
> Not all required functions will be listed in the exercise document; finding out what to use is part of the intended training.
Please take your time to read the instructions and hints carefully to avoid getting stuck on minor details.
## Getting the Data
The [NOAA][noaa] provides open weather data from stations around the world.
* Here is a [list of all stations][stations] and their respective codes
* This list in itself already makes for an interesting data set to explore
* This is the [weather data archive][archive].
* It is sorted by year and then station code from the station list
### Tasks
* Pick, download and extract a sample data set from the archive.
* If you are not sure, _New York, Central Park_ (Code 725060) in 2020 would be a good pick
* Visually inspect the data set with a text editor to make sure it does not only contain a few rows of data
* Get acquainted with the [ISD Lite data format][documentation]; it holds valuable information on how to interpret what you have in front of you
## Loading the Data
* For loading, the `pandas.read_csv()`-function can be used.
* [`read_csv()` documentation][pandas-read-csv-doc]
* The downloaded data is compressed in a `gz`-archive. You _could_ decompress it before working with it (especially useful if you want to inspect the data beforehand with a plain text editor or other tools/programs); the `read_csv()`-function itself, however, can handle such an archive just fine
* Note that for these data sets the separator for the data fields is not a comma, but multiple whitespaces. You can use the [regular expression][regex] `r"\s+"` to express this in Python.
* Note the parameter `parse_dates` which can come in extremely handy
* Note that the data set as provided has **no header**
### Tasks
* Consider **first** what the loaded data should look like
* Load the data set
* Display the loaded data, compare with your expectations and do a plausibility check
* Assign a proper header based on the information from the [data documentation][documentation]
## Cleaning the data
* According to the data documentation, the value `-9999` indicates missing data
* Some data columns have been scaled by a factor
### Tasks
* Replace the value `-9999` with something more appropriate
  * A suggestion would be the constant `nan` from the `math` module
* Check for columns that have no data at all and remove them if convenient
* Re-scale the columns so they all use a factor of 1 (and can be read and interpreted more easily by humans)
* Check if there are entries missing for some dates/hours.
* Consider how many hours the given year should have
* Add placeholders for those missing rows, so the averaging works as expected.
## Initial Exploration
### Tasks
* Find the most extreme temperatures, wind speeds and precipitation values measured
* Calculate the year average for temperatures, pressures and wind speed
* Find the total precipitation for the year
## Advanced Tasks
### Tasks
* Calculate the differences in air temperature from one hour to the next
* Find the biggest rise/ drop in temperature over an hour and when they happened
* Find out what the most common wind directions in your location are
  * Note: Wind directions can be rather fuzzy, so you might want to come up with a better metric here than simply taking the most common value
  * Note: A direction of `0` means that it is _undetermined_; _North_ is designated by `360`
  * Note: The wind direction wraps around after `360` to `10` (directions are given in increments of 10°); take this into account.
* You might find the _modulo_-operator useful (`%` in python)
* Extrapolate daily statistics from the hourly ones
* Come up with a metric for a "nice day" and an "awful day". What were the most _nice_ or _awful_ days in your location?
[noaa]: https://en.wikipedia.org/wiki/National_Oceanic_and_Atmospheric_Administration
[stations]: https://www.ncei.noaa.gov/pub/data/noaa/isd-history.txt
[archive]: https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/
[documentation]: https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf
[pandas-read-csv-doc]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[regex]: https://en.wikipedia.org/wiki/Regular_expression
---
title: Getting the Data
---
# Task 1: Getting the Data
The [NOAA][noaa] provides open weather data from stations around the world.
It has a [list of all weather stations][stations] and their respective codes among other info. The main part is the [weather data archive][archive]. It is sorted by year and then station code from the station list.
> Side note: The station list in itself already makes for an interesting data set to explore, if you are looking for practice opportunities later on.
## Tasks
1. Pick and download a sample data set from the archive.
If you are not sure, _New York, Central Park_ (Station code 725060) in 2020 would be a good starting point.
2. The data is provided in a compressed `gz`-archive, which can be extracted by most regular archive tools.
Extract the data set and visually inspect it with a text editor to make sure it does not only contain a few rows of data.
3. Get acquainted with the [ISD Lite data format][documentation]; it holds valuable information on how to interpret what you have in front of you.
> Side note: Step 2 is not strictly necessary; _pandas_ can handle this kind of archive out of the box. It is however a good idea when starting out, to get a first impression of how the data set looks and whether it has a lot of missing or repeating values that might reduce its usefulness.
[noaa]: https://en.wikipedia.org/wiki/National_Oceanic_and_Atmospheric_Administration
[stations]: https://www.ncei.noaa.gov/pub/data/noaa/isd-history.txt
[archive]: https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/
[documentation]: https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf
---
title: "Loading the Data"
---
# Task 2: Loading the Data
To load the data you can use the `pandas.read_csv()` function. ([`read_csv()` documentation][pandas-read-csv-doc])
#### Hints
* In these data sets the separator for the data fields is not a comma, but multiple whitespaces.
You can use the [regular expression][regex] `r"\s+"` to express this in Python.
* Note the parameter `parse_dates` of the `read_csv()`-function which can come in extremely handy.
* Note that the data set as provided has **no header**.
> Side note: As noted previously, the downloaded data is compressed in a `gz`-archive.
> You _could_ decompress it before working with it
> (especially useful if you want to inspect the data beforehand with a plain text editor or other tool/programs),
> the `read_csv()`-function itself, however, can handle such an archive just fine.
## Tasks
1. Consider **first** what the loaded data should look like
2. Load the data set using the `read_csv()`-function from _pandas_.
3. Display the loaded data, compare the result with your expectations
4. Do a plausibility check:
* Check the number of rows and columns
* Check if the data inside the rows is displayed correctly (i.e. no columns got joined or torn apart), especially the date column
5. Assign a proper header based on the information from the [data documentation][documentation]
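The steps above might be sketched as follows. The column names below are one possible choice (the official field names are in the ISD Lite documentation), and the inline sample merely stands in for your downloaded file; `read_csv()` accepts the path of the `.gz` file directly. The timestamp is assembled after loading with `pd.to_datetime()`, which is one alternative to the `parse_dates` parameter:

```python
import io
import pandas as pd

# Two rows in ISD Lite layout, standing in for the downloaded file;
# pass the path of your .gz archive to read_csv() instead.
sample = io.StringIO(
    "2020 01 01 00   -28   -89 10258 360    26     0 -9999 -9999\n"
    "2020 01 01 01   -33   -94 10262 350    31     0 -9999 -9999\n"
)

# One possible naming, following the ISD Lite format documentation.
columns = ["year", "month", "day", "hour", "air_temperature", "dew_point",
           "pressure", "wind_direction", "wind_speed", "sky_coverage",
           "precipitation_1h", "precipitation_6h"]

df = pd.read_csv(
    sample,
    sep=r"\s+",     # fields are separated by runs of whitespace
    header=None,    # the raw file carries no header row
    names=columns,  # assign a proper header while loading
)

# Combine the four date/time columns into a single timestamp column.
df["timestamp"] = pd.to_datetime(df[["year", "month", "day", "hour"]])

print(df.shape)  # plausibility check: expected number of rows and columns
print(df.head())
```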
[pandas-read-csv-doc]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[regex]: https://en.wikipedia.org/wiki/Regular_expression
[documentation]: https://www1.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf
---
title: "Cleaning the Data"
---
# Task 3: Cleaning the data
The data as loaded is not yet ready for our work.
For technical reasons, the data representation has a few peculiarities:
* According to the data documentation, the value `-9999` indicates missing data.
* Some data columns have been scaled by a factor.
* A wind direction of `0` means that it is _undetermined_, _North_ is designated by `360`.
## Tasks
1. Replace the value `-9999` with something more appropriate, for example the constant `nan` from the `math` module.
2. Replace the measurements where no wind direction is given in a similar fashion.
3. Now the value `0` is free to represent _North_ as usual. This will come in handy in a later task.
4. Check for columns that have no useful data at all and remove them if convenient
5. Re-scale the columns so they all use a factor of 1 (and can be read and interpreted more easily by humans)
6. Check if there are entries missing for some dates/hours.
Consider first how many hours the given year should have (Account for the additional day of leap years if applicable.)
How many rows are missing in your data set? (If your data set has a significant number of rows missing, consider choosing another one.)
7. Add suitable placeholders for those missing rows, so the averaging works as expected.
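One possible shape for these cleaning steps is sketched below on a tiny hand-made excerpt (the column names and values are illustrative, not taken from a real file); the scaling factor of 10 for temperatures comes from the data documentation:

```python
import math
import pandas as pd

# Tiny illustrative excerpt; in practice this is your loaded data set.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 02:00"]),
    "air_temperature": [-28, -9999],
    "wind_direction": [360, 0],
})

# 1. Replace the missing-data marker with nan.
df = df.replace(-9999, math.nan)

# 2./3. An undetermined wind direction (0) becomes nan; reducing modulo
#       360 then maps 360 to 0, so 0 now represents North as usual.
df["wind_direction"] = df["wind_direction"].replace(0, math.nan) % 360

# 5. Undo the scaling factor of 10 on the temperature column.
df["air_temperature"] = df["air_temperature"] / 10

# 6./7. Re-index on the timestamp, so every missing hour of the (leap)
#       year 2020 appears as a placeholder row full of nan.
full_year = pd.date_range("2020-01-01", "2020-12-31 23:00", freq="h")
df = df.set_index("timestamp").reindex(full_year)
print(len(df))  # 8784 = 366 days * 24 hours
```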
---
title: "Initial Exploration"
---
# Task 4: Initial Exploration
Now we can start working with our data set.
Let's start by finding some basic statistical values for our location.
## Tasks
1. Find the most extreme temperatures, wind speeds and precipitation values measured.
2. Calculate the year average for temperatures, pressures and wind speed.
3. Find the total precipitation for the year.
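These statistics map directly onto pandas aggregations; a minimal sketch, using made-up values and the column names assumed in the loading step:

```python
import numpy as np
import pandas as pd

# Small cleaned excerpt standing in for the full year; nan marks the
# placeholders for missing measurements.
df = pd.DataFrame({
    "air_temperature": [-2.8, -3.3, 1.5, np.nan],
    "wind_speed": [2.6, 3.1, 0.0, np.nan],
    "precipitation_1h": [0.0, 0.3, np.nan, 1.2],
})

# 1. Most extreme values measured; min/max skip nan entries by default.
print(df[["air_temperature", "wind_speed"]].agg(["min", "max"]))

# 2. Year averages.
print(df[["air_temperature", "wind_speed"]].mean())

# 3. Total precipitation for the year.
print(df["precipitation_1h"].sum())
```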
---
title: "Advanced Exploration"
---
# Advanced Tasks
Finally, let's consider more intricate questions about our data set.
## Tasks
1. Calculate the differences in air temperature from one hour to the next.
2. Find the biggest rise/drop in temperature over an hour and when they happened.
3. Find out what the most common wind directions in your location are.
Note that wind directions can be rather fuzzy values, so you might want to come up with a better metric here than simply calculating the most common value.
They are given in increments of 10°.
To make it even more complicated, the wind direction wraps around after `350` to `0` thanks to the adaptation we made when cleaning the data.
You might find the [_modulo_-operator][modulo] (`%` in python) quite useful in your calculations.
4. Extrapolate daily statistics for all columns from the hourly ones.
5. Come up with a metric for a "nice day" and an "awful day".
What were the most _nice_ or _awful_ days in your location?
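The building blocks for tasks 1–4 might look like the sketch below (values invented, column names as assumed earlier). For the wind directions, one possible metric is to group neighbouring headings into coarse bins; shifting by half a bin width before reducing modulo 360 keeps headings just below 360 and just above 0 in the same "North" bin:

```python
import pandas as pd

# Hourly excerpt standing in for the cleaned data set.
idx = pd.date_range("2020-07-01", periods=6, freq="h")
df = pd.DataFrame({
    "air_temperature": [18.0, 17.5, 19.0, 22.5, 21.0, 20.5],
    "wind_direction": [0.0, 10.0, 350.0, 0.0, 180.0, 350.0],
}, index=idx)

# 1./2. Hour-to-hour differences; idxmax/idxmin tell us *when*.
diffs = df["air_temperature"].diff()
print(diffs.max(), "rise at", diffs.idxmax())
print(diffs.min(), "drop at", diffs.idxmin())

# 3. Bin wind directions into 30-degree sectors centred on the compass
#    points; the modulo handles the wrap-around at 360.
bin_width = 30
binned = ((df["wind_direction"] + bin_width / 2) % 360) // bin_width * bin_width
print(binned.mode())  # most common sector (0.0 == North here)

# 4. Daily statistics extrapolated from the hourly values.
print(df.resample("D").mean())
```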
[modulo]: https://en.wikipedia.org/wiki/Modulo_operation