Your First Python Tutorial for Scientists¶

Welcome to Your First Python tutorial for scientists! In this self-paced course you will learn how to write Python code using Python best practices. Through these instructions you will develop Python scripts and use Git and GitHub to save and organize your work. At the end of this tutorial you will have a grasp of how to begin building your own library of Python tools for your scientific analysis workflows.

Why Python?¶

You’re already here because you want to learn to use Python for your data analysis and visualizations. Python can be compared to other high-level, interpreted, object-oriented languages, but is especially great because it is free and open source!

High-level languages: Other high-level languages include MATLAB, IDL, and NCL. The advantage of high-level languages is that they provide commonly used functions, data structures, and other utilities, which means it takes less code to get real work done. The disadvantage is that they tend to obscure the low-level aspects of the machine, such as memory use, how many floating-point operations are happening, and other information related to performance. C and C++ are examples of lower-level languages. The “higher” the level of the language, the more computing fundamentals are abstracted away.
Interpreted languages: Most of your work is probably already in interpreted languages if you’ve ever used IDL, NCL, or MATLAB (interpreted languages are typically also high level). So you are already familiar with their advantages: you don’t have to worry about compiling or machine compatibility (your code is portable). And you are probably familiar with their deficiencies: they can be slower than compiled languages and potentially more memory intensive.
Object-oriented languages: Objects are custom datatypes. For every custom datatype, you usually have a set of operations you might want to perform. For example, if you have an object that is a list of numbers, you might want to apply a mathematical operation, such as sum, to the whole list at once. Not every function can be applied to every datatype; it wouldn’t make sense to apply a logarithm to a string of letters or to capitalize a list of numbers. Data and the operations applied to them are grouped together into one object.
Open source: Python as a language is open source, which means that there is a community of developers behind its codebase. Anyone can join the developer community and contribute to deciding the future of the language. When someone identifies gaps in Python’s abilities, they can write the code to fill these gaps. The open-source nature of Python means that the language is very adaptable to the shifting needs of the user community.
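
To make the object-oriented point above concrete, here is a minimal sketch using only built-in types. The variable names are illustrative and not part of the tutorial's scripts. It shows that an operation that makes sense for one datatype is simply not available on another.

```python
# Data and the operations that belong to it travel together in one object.
numbers = [3.5, 1.0, 2.5]      # a list object holding numbers
word = "boulder"               # a str object holding letters

total = sum(numbers)           # summing makes sense for a list of numbers
shout = word.upper()           # capitalizing makes sense for a string

# Capitalizing a list is not defined, so Python raises an error.
try:
    numbers.upper()
    list_has_upper = True
except AttributeError:
    list_has_upper = False

print(total, shout, list_has_upper)
```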

Python is a language designed for rapid prototyping and efficient programming. It is easy to write new code quickly with less typing.

Why another Python tutorial?¶

What makes this Python tutorial unique is that it has been designed specifically to meet the needs of, and feedback from, atmospheric and oceanic scientists making the transition with the NCAR-wide pivot-to-Python. In particular, this tutorial should be useful to any scientist who already knows how to program in some other language but is taking up Python for the first time. By spending the first course on pure Python without importing any additional packages, our “Your First” tutorial addresses the concerns that most tutorials either pick up speed too quickly by going into the intricacies of third-party packages before explaining how Python is different from other languages, or get too bogged down in basic programming concepts that anyone with programming experience already knows. This tutorial attempts to hit the sweet spot between too high-level and too low-level. By using coding examples with real atmospheric datasets and questions, the skills and techniques taught are easily applied to actual atmospheric or oceanic workflows. We hope that this tailored approach to teaching and sharing computational tools effectively addresses the concerns and needs of the geoscience community.

Requirements & Installation¶

We will be using the Conda package manager in this tutorial. If you don’t have Conda installed at all, please install it. Conda is an excellent package manager for Python development, but it is capable of managing installations of more than just Python packages. Both Miniconda (just conda) and the full Anaconda (conda plus a lot of pre-installed packages and tools) are acceptable, but we recommend trying to install Miniconda first. Miniconda is the most lightweight solution, and it is the ideal solution when trying to install Conda on a remote system (i.e., with only SSH access).

We will also be doing all of the following in a bash shell on macOS. You may have different experiences on other OSes, and you may even have problems on macOS! That’s okay. If you have a problem, please let us know. We will try to work with you to find solutions on your OS, and we will post the solution here on this page.

Check that you have conda or miniconda installed on your OS by checking your conda version:
```
$ conda --version
```
At the time of writing, the latest version of conda is 4.8. If you have an older version installed, update it:
```
$ conda update -n base conda
```
Updating your Conda package manager should not have any effect on your existing Conda environments.

Note

If you have a really old version of conda, it might be easier to delete it and reinstall it. Before doing this, list your environments with conda env list to see whether there are any you created and want to save.
Check your conda version again.
```
$ conda --version
```
Initialize Conda to work with your shell (e.g., bash):
```
$ conda init
```
This step may modify your shell configuration script (e.g., .bash_profile) to make the conda command available in your shell, and it will make the conda activate command work.
Install and Configure Git

Git is a program that tracks changes made to files. This makes it easy to maintain access to multiple versions of your code as you improve it, and revert your code back to a previous version if you’ve made any mistakes.

First Python Script¶

This section of the tutorial will focus on teaching you Python through the creation of your first script. Along the way you will learn about syntax and the reasoning behind why things are done the way they are. We will also incorporate lessons on the use of Git because we highly recommend that you version control your work.

We are assuming you are familiar with bash and terminal commands. If not, here is a cheat sheet.

Reading a .txt File¶

In building your first Python script we will set up our workspace, read a .txt file, and learn Git fundamentals.

Let’s begin.

Open a terminal.

Note

On Windows, open Anaconda Prompt. On a Mac or Linux machine, simply open Terminal.
Create a directory:
```
$ mkdir python_tutorial
```
The first thing we have to do is create a directory to store our work. Let’s call it python_tutorial.
Go into the directory:
```
$ cd python_tutorial
```
Create a virtual environment for this project:
```
$ conda create --name python_tutorial python
```
A conda environment is a directory that contains a collection of packages or libraries that you would like installed and accessible for this workflow. Type conda create --name and the name of your project, here that is python_tutorial, and then specify that you would like to install Python in the virtual environment for this project.

It is a good idea to create a new environment for each project. Because Python is open source, new versions of the tools you use may become available at any time; a dedicated environment is a way of guaranteeing that your script keeps using the same versions of packages and libraries and runs the way you expect it to.

See also

More information on Conda environments
And activate your Conda environment:
```
$ conda activate python_tutorial
```
Make the directory a Git repository:
```
$ git init .
```
A Git repository tracks changes made to files within your project. It looks like a .git/ folder inside that project.

This command adds version control to this new python_tutorial directory and all of its contents.

See also

More information on Git repositories
Create a data directory:
```
$ mkdir data
```
And we’ll make a directory for our data.
Go into the data directory:
```
$ cd data
```
Download sample data from the CU Boulder weather station:
```
$ curl -kO https://sundowner.colorado.edu/weather/atoc8/wxobs20170821.txt
```
This weather station is a Davis Instruments wireless Vantage Pro2 located on the CU-Boulder east campus at the SEEC building (40.01 N, 105.24 W, 5250 ft elevation). The station is monitored by the Atmospheric and Oceanic Sciences (ATOC) department and is part of the larger University of Colorado ATOC Weather Network.
Check the status of your repository:
```
$ git status
```
You will see the newly created data directory (which is listed as ./, since you are currently in that directory) is listed as “untracked,” which means all of the files you added to that directory are also untracked by Git. The git status command will tell you what to do with untracked files. Those instructions mirror the next 2 steps:
Add the file to the Git staging area:
```
$ git add wxobs20170821.txt
```
By adding this datafile to your directory, you have made a change that is not yet reflected in your Git repository. Every file in your working directory is classified by Git as “untracked”, “unmodified”, “modified”, or “staged.” Type git add and then the name of the altered file to stage your change, i.e. to move a file that is either untracked or modified to the staged category so it can be committed.

See also

More information on git add
Check your git status once again:
```
$ git status
```
Now this file is listed as a “change to be committed,” i.e. staged. Staged changes can now be committed to your repository history.
Commit the file to the Git repository:
```
$ git commit -m "Adding sample data file"
```
With git commit, you’ve updated your repository with all the changes you staged, in this case just one file.

Note

On a Windows machine you may see the following: warning: LF will be replaced by CRLF. The file will have its original line endings in your working directory. Do not worry too much about this warning. CRLF refers to “Carriage Return Line Feed” and LF refers to “Line Feed.” Both are used to indicate line termination. In Windows both a Carriage Return and a Line Feed are required to mark the end of a line, but in Linux/UNIX only a Line Feed is required. Most text editors can account for line ending differences between operating systems, but sometimes a conversion is necessary. To silence this warning you can type git config --global core.autocrlf false in the terminal.
Look at the Git logs:
```
$ git log
```
Typing git log shows a log of all the commits, or changes, made to your repository.
Go back to the top-level directory:
```
$ cd ..
```
And now that you’ve set up your workspace, create a blank Python script called mysci.py:
```
$ touch mysci.py
```
Note

If you are working on a Windows machine it is possible that touch will not be recognized as an internal or external command. If this is the case, run conda install m2-base to enable unix commands such as touch.
Edit the mysci.py file using nano, vim, or your favorite text editor:
print("Hello, world!")
Your classic first command will be to print Hello, world!.

Note

On a Windows machine, it is possible nano or vim are not recognized as text editors within your terminal. In this case simply try to run mysci.py to open a notepad editor.
Try testing the script by typing python and then the name of your script:
```
$ python mysci.py
```
Yay! You’ve just created your first Python script.

You probably won’t need to run your Hello World script again, so delete the print("Hello, world!") line and start over with something more useful: we’ll read the first 4 lines from our datafile.

Change the mysci.py script to read:

# Read the data file
filename = "data/wxobs20170821.txt"
datafile = open(filename, 'r')

print(datafile.readline())
print(datafile.readline())
print(datafile.readline())
print(datafile.readline())

datafile.close()

First create a variable for your datafile name, which is a string - this can be in single or double quotes.

Then create a variable associated with the opened file, here it is called datafile.

The ‘r’ argument in the open command indicates that we are opening the file for reading capabilities. Other input arguments for open include ‘w’, for example, if you wanted to write to the file.

The readline command moves through the open file, always reading the next line.

And remember to close your datafile.

Comments in Python are indicated with a hash, as you can see in the first line # Read the data file. Comments are ignored by the interpreter.
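
The following is a small, self-contained sketch of the same open, readline, close pattern. Because the weather file may not be at hand, it first writes a hypothetical three-line sample file to a temporary directory; the file name and its contents are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for the weather file: three short lines of text.
sample_text = "header line\n2017-08-21 00:05\n2017-08-21 00:10\n"

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "sample.txt")

    # Write the sample so there is something to read back.
    datafile = open(filename, 'w')
    datafile.write(sample_text)
    datafile.close()

    # Each readline() call returns the next line, newline included.
    datafile = open(filename, 'r')
    first = datafile.readline()
    second = datafile.readline()
    datafile.close()

print(repr(first))
print(repr(second))
```

Each string returned by readline keeps its trailing newline character, and print adds one of its own, which is why the output of the script above shows a blank line after each printed line.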

See also

More information on the open() function

And test your script again by typing:
```
$ python mysci.py
```
Testing of your script with python mysci.py should be done every time you wish to execute the script. This will no longer be specified as a unique step in between every change to our script.
Change the mysci.py script to read your whole data file:
# Read the data file
filename = "data/wxobs20170821.txt"
datafile = open(filename, 'r')

data = datafile.read()

datafile.close()

# DEBUG
print(data)
print('data')
Our code is similar to before, but now we’ve read the entire file. To test that this worked, we’ll print(data). Print statements in Python require parentheses around the object you wish to print, in this case the data object.

Try print('data') as well. Now Python will print the string data, as it did for the Hello, world! string, instead of the information stored in the variable data.

Don’t forget to execute with python mysci.py.
Change the mysci.py script to read your whole data file using a context manager with:
# Read the data file
filename = "data/wxobs20170821.txt"
with open(filename, 'r') as datafile:
   data = datafile.read()

# DEBUG
print(data)
Again this is a similar method of opening the datafile, but we now use with open. The with statement works with a context manager that provides clean-up and ensures that the file is automatically closed after you’ve read it.

The indentation of the line data = datafile.read() is very important. Python is sensitive to white space and will not work if you mix spaces and tabs (Python does not know your tab width). It is best practice to use spaces as opposed to tabs (tab width is not consistent between editors).

Combined these two lines mean: with the datafile opened, I’d like to read it.

And execute with python mysci.py.
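
One way to convince yourself that with really closes the file is to inspect the file object's closed attribute inside and after the block. The sketch below does this with a hypothetical sample file written to a temporary directory, so the file name and contents are assumptions rather than part of the tutorial's data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "sample.txt")   # hypothetical file

    with open(filename, 'w') as datafile:
        datafile.write("line one\nline two\n")

    with open(filename, 'r') as datafile:
        data = datafile.read()
        open_inside = not datafile.closed   # still open inside the block

    closed_after = datafile.closed          # closed once the block ends

print(open_inside, closed_after)
```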

See also

More information on context managers
What did we just see? What is the data object? What type is data? How do we find out?

Change the DEBUG section of our script to:
# DEBUG
print(type(data))
And execute with python mysci.py

Object types refer to float, integer, string or other types that you can create.

Python is a dynamically typed language, which means you don’t have to explicitly specify the datatype when you name a variable; Python will automatically figure it out from the nature of the data.
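
As a brief illustration of dynamic typing, the sketch below rebinds one hypothetical variable name to values of three different types and checks the type each time. It also shows the explicit conversion you will need for numbers that arrive as text from a file.

```python
temperature = 72        # an int
print(type(temperature))

temperature = 72.5      # the same name now refers to a float
print(type(temperature))

temperature = "72.5"    # and now to a str, as when read from a text file
print(type(temperature))

# Values read from a file are strings until you convert them.
as_number = float(temperature)
```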
Now, clean up the script by removing the DEBUG section, before we commit this to Git.
Let’s check the status of our Git repository
```
$ git status
```
Note

Take a look at which files have been changed in the repository!
Stage these changes:
```
$ git add mysci.py
```
Let’s check the status of our Git repository again. What’s different from the last time we checked the status?
```
$ git status
```
Commit these changes:
```
$ git commit -m "Adding script file"
```
Here a good commit message -m for our changes would be "Adding script file"
Let’s check the status of our Git repository, now. It should tell you that there are no changes made to your repository (i.e., your repository is up-to-date with the state of the code in your directory).
```
$ git status
```
Look at the Git logs, again:
```
$ git log
```
You can also print simplified logs with the --oneline option.

That concludes the first lesson of this virtual tutorial.

In this section you set up a workspace by creating your directory, conda environment, and git repository. You downloaded a .txt file and read it using the Python commands of open(), readline(), read(), close(), and print(), as well as the context manager with. You should be familiar with the str datatype. You also used fundamental git commands such as git init, git status, git add, git commit, and git log.

Creating a Data Dictionary¶

This section picks up right where “Reading a .txt File” left off: you had just committed your new script file that reads in the data from a file as a string. You will now manipulate your data into a more usable format, a dictionary. In doing so you will learn how to write iterative for loops and about Python data structures.

Let’s begin.

One big string isn’t very useful, so use str.split() to parse the data file into a data structure you can use.

With your terminal open and python_tutorial environment activated, change the mysci.py script to read:

# Initialize my data variable
data = []

# Read and parse the data file
filename = "data/wxobs20170821.txt"
with open(filename, 'r') as datafile:

    # Read the first three lines (header)
    for _ in range(3):
        datafile.readline()

    # Read and parse the rest of the file
    for line in datafile:
        datum = line.split()
        data.append(datum)

# DEBUG
for datum in data:
   print(datum)

The first thing that is different in this script is an initialized data variable; data = [] creates the variable data as an empty list, which we will populate as we read the file. Python list objects are a collection data type that is ordered and changeable, meaning you can retrieve information from the list by its index and you can add elements to or delete elements from your list. Lists are denoted by square brackets, [].

Then, with the datafile open for reading, we are going to write two separate for loops. A for loop is used for iterating over a sequence (such as a list). It is important to note the syntax of Python for loops: the : at the end of the for line, the indentation of all lines within the for loop, and the absence of the closing end that is found in languages such as MATLAB.

In your first for loop, loop through the dummy variable _ in range(3). The range function returns a sequence of numbers, starting at 0 and incrementing by 1 (by default), and stopping just before the specified number. Here if you were to print(_) on each line of the for loop you would see:

0
1
2

Try it out if you are unsure of how this works. Here the _ variable is a placeholder, meaning the variable is never used within the loop.

So again, in the first for loop, you execute the readline command (which you will remember moves down to the next line each time it is consecutively called) 3 times to read through the file header (which is 3 lines long). Yay! You have just written your first for loop!

Then in a second for loop, you loop through lines in the remainder of your datafile. On each line, split it along white space. The str.split() method splits a string into a list on a specified separator, the default being white space. You could use any character you like, but other useful options are \t for splitting along tabs or , for splitting along commas.

Then you append this split line list to the end of your data list. The list.append() method adds a single item to the end of your list. After every line in your for loop iteration, the data list that was empty is one element longer. Now we have a list of lists for our data variable - a list of the data in each line for multiple lines.

When you print each datum in data, you’ll see that each datum is a list of string values.
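
The split-then-append pattern described above can be tried on its own, without the data file. In this sketch the two lines of text are hypothetical stand-ins for rows of the weather file, not values copied from it.

```python
# Hypothetical lines standing in for rows of the weather file.
lines = ["08/21/17  12:05a  61.4  61.4",
         "08/21/17  12:10a  61.3  61.3"]

data = []
for line in lines:
    datum = line.split()      # split on any run of white space
    data.append(datum)        # data grows by one row each iteration

tab_fields = "a\tb\tc".split("\t")   # split along tabs
csv_fields = "a,b,c".split(",")      # split along commas

print(data)
```

Note that every element is still a string, including the ones that look like numbers.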

We just covered a lot of Python nuances in very little code!

Now, to practice list indexing, get the first, 10th, and last row in data.

Change the DEBUG section of our mysci.py script to:
# DEBUG
print(data[0])
print(data[9])
print(data[-1])
Index your list by adding the number of your index in square brackets, [], after the name of the list. Python is 0-indexed, so data[0] refers to the first element and data[-1] refers to the last element.
Now, to practice slice indexing, get the first 10 rows in data.

Change the DEBUG section of our mysci.py script to:
# DEBUG
for datum in data[0:10]:
   print(datum)
Using a colon, :, between two index integers a and b, you get all elements from index a up to, but not including, index b. See what happens when you print data[:10], data[0:10:2], and data[slice(0,10,2)]. What’s the difference?
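
If you want to compare these slice forms without scrolling through weather data, a short list of integers works well. The list below is a hypothetical stand-in for the rows of data.

```python
rows = list(range(20))        # stand-in for the rows of data

first_ten = rows[0:10]        # indexes 0 through 9; index 10 is excluded
same_first_ten = rows[:10]    # an omitted start defaults to 0
every_other = rows[0:10:2]    # a third number sets the step
with_slice = rows[slice(0, 10, 2)]   # slice() builds the same selection

print(first_ten)
print(every_other)
```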
Now, to practice nested indexing, get the 5th, the first 5, and every other column of row 9 in the data object.

Change the DEBUG section of the mysci.py script to:
# DEBUG
print(data[8][4])
print(data[8][:5])
print(data[8][::2])
In nested list indexing, the first index determines the row, and the second determines the element from that row. Also try printing data[5:8][4]. Why doesn’t this work?
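
A small synthetic table can help with the question above. The sketch below builds a hypothetical 10 by 6 list of lists (not the weather data) in which each value encodes its own row and column, then repeats the nested indexing and shows why slicing rows first and indexing second does not select a column.

```python
# Hypothetical 10 x 6 table: row r holds the values r*10 ... r*10 + 5.
data = []
for r in range(10):
    row = []
    for c in range(6):
        row.append(r * 10 + c)
    data.append(row)

element = data[8][4]          # row 9, column 5
first_five = data[8][:5]      # row 9, first five columns
every_other = data[8][::2]    # row 9, every other column

# data[5:8] is a list of three rows, so it only has indexes 0, 1 and 2.
try:
    data[5:8][4]
    raised = False
except IndexError:
    raised = True

# To take column 5 from rows 6 through 8, index each row in turn.
column = []
for row in data[5:8]:
    column.append(row[4])

print(element, first_five, every_other, column)
```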
Clean up the file (remove DEBUG section), stage the changes, and commit.
```
$ git add mysci.py
$ git commit -m "Parsing file"
```

Can you remember which column is which? Is time the first column or the second? Which column is the temperature?

Each column is a time-series of data. We would ideally like each time-series easily accessible, which is not the case when data is row-column ordered (like it currently is). (Remember what happens when you try to do something like data[:][4]!)

Let’s get our data into a more convenient named-column format.

Change mysci.py to the following:

# Initialize my data variable
data = {'date': [],
  'time': [],
  'tempout': []}

# Read and parse the data file
filename = "data/wxobs20170821.txt"
with open(filename, 'r') as datafile:

   # Read the first three lines (header)
   for _ in range(3):
      datafile.readline()

   # Read and parse the rest of the file
   for line in datafile:
      split_line = line.split()
      data['date'].append(split_line[0])
      data['time'].append(split_line[1])
      data['tempout'].append(split_line[2])

# DEBUG
print(data['time'])

First we’ll initialize a dictionary, dict, indicated by the curly brackets, {}. Dictionaries, like lists, are changeable, but their elements are looked up by key rather than by position (since Python 3.7 a dictionary does remember the order in which keys were inserted). Here you have created 3 elements of your dictionary, all currently empty lists, specified by the keys date, time, and tempout. Keys act similarly to indexes: to pull out the tempout element from data you would type data['tempout'].

Grab date (the first column of each line), time (the second column of each line), and temperature data (the third column), from each line and append it to the list associated with each of these data variables.
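
Here is the same named-column idea in miniature. The two parsed rows are hypothetical sample values standing in for lines of the weather file; the point is only to show that each key now gives you a whole time series.

```python
data = {'date': [], 'time': [], 'tempout': []}

# Hypothetical parsed lines standing in for rows of the weather file.
rows = [["08/21/17", "12:05a", "61.4"],
        ["08/21/17", "12:10a", "61.3"]]

for split_line in rows:
    data['date'].append(split_line[0])
    data['time'].append(split_line[1])
    data['tempout'].append(split_line[2])

print(data['tempout'])        # one whole time series, selected by name
print(list(data.keys()))
```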

Writing Functions¶

This section picks up right where “Creating a Data Dictionary” left off: you had just committed your new script that reads the file, saving the variables of date, time, and tempout in a data dictionary. In this section you will compute wind chill index by writing your first function and learning about basic math operators.

Let’s begin.

Now that you’ve read the data in a way that is easy to modify later, it is time to actually do something with the data.

Compute the wind chill factor, which is the cooling effect of the wind. As wind speed increases the rate at which a body loses heat increases. The formula for this is:

\[WCI = a + (b * t) - (c * v^{0.16}) + (d * t * v^{0.16})\]

Where WCI refers to the Wind Chill in degrees F, t is temperature in degrees F, v is wind speed in mph, and the other variables are as follows: a = 35.74, b = 0.6215, c = 35.75, and d = 0.4275. Wind Chill Index is only defined for temperatures within the range -45 to +45 degrees F.

You’ve read the temperature data into the tempout variable, but to do this calculation, you also need to read the windspeed variable from column 7.

With your terminal open and python_tutorial environment activated, modify the columns variable in mysci.py to read:
# Column names and column indices to read
columns = {'date': 0, 'time': 1, 'tempout': 2, 'windspeed': 7}
and modify the types variable to be:
# Data types for each column (only if non-string)
types = {'tempout': float, 'windspeed': float}

Great! Save this in your Git repo. Stage and commit:

$ git add mysci.py
$ git commit -m "Reading windspeed as well"

Now, let’s write our first function to compute the wind chill factor. We’ll add this function to the bottom of the file.

# Compute the wind chill temperature
def compute_windchill(t, v):
   a = 35.74
   b = 0.6215
   c = 35.75
   d = 0.4275

   v16 = v ** 0.16
   wci = a + (b * t) - (c * v16) + (d * t * v16)
   return wci

To define a function in Python you type def (for define), the name of your function, and then in parentheses the input arguments of that function, followed by a colon. The following lines, the body of your function, are all indented. If necessary, specify your return value.
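
Once defined, a function does nothing until it is called. The sketch below repeats the function, using the exponent 0.16 from the wind chill formula, and calls it on a few hypothetical temperature and wind speed readings; the sample values and the names temps and winds are assumptions for illustration. The built-in zip pairs up the two lists element by element.

```python
def compute_windchill(t, v):
    a = 35.74
    b = 0.6215
    c = 35.75
    d = 0.4275

    v16 = v ** 0.16           # exponent from the wind chill formula
    wci = a + (b * t) - (c * v16) + (d * t * v16)
    return wci

# Hypothetical readings: temperatures in degrees F, wind speeds in mph.
temps = [30.0, 20.0]
winds = [10.0, 15.0]

windchill = []
for temp, wind in zip(temps, winds):
    windchill.append(compute_windchill(temp, wind))

print(windchill)
```

For 30 degrees F and 10 mph the result is a little over 21 degrees F, which is a quick sanity check that the coefficients and exponent were typed correctly.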

First Python Package¶

In this section of the tutorial we will learn how to create a Python package and the basics of how to use the built-in math package. This will prepare you to learn any package you think may be useful for your scientific analysis.

Creating Your Own Package¶

In this section you will learn how to move functions and code blocks into Python packages that you can import into your analysis methods, making them easier to write, read, and share.

Perhaps you are already familiar with importing packages into your workflow. Many scientists pass around files that contain unique user-written functions to reduce redundant work between scientists, but what if the original author found a bug in their script? It is difficult to track down every user of their code to let them know. In Python, package managers help you know what version of those functions you are using. MATLAB also has packages that you can pay extra money to install and use; again, Python is free!

Let’s begin.

Open a terminal and make sure you are in the python_tutorial directory and have activated the corresponding environment.
Make a copy of your first script with a new name:
```
$ cp windchillcomp.py heatindexcomp.py
```

Git add and commit this new file:

$ git add heatindexcomp.py
$ git commit -m "Copying first script to start second"

Now you will compute the Heat Index.

Like wind chill, which is a measure of how much colder the weather feels to the human body due to wind speed, heat index is a measure of how much hotter the weather feels to the human body due to humidity. The Rothfusz formula for heat index is:

\[\textit{HI} = a + (b * T) + (c * H) + (d * T * H) + (e * T^2) + (f * H^2) + (g * T^2 * H) + (h * T * H^2) + (i * T^2 * H^2)\]

where HI is the Heat Index, T is temperature in degrees F, H is humidity in %, a = -42.379, b = 2.04901523, c = 10.14333127, d = -0.22475541, e = -0.00683783, f = -0.05481717, g = 0.00122874, h = 0.00085282, and i = -0.00000199. The Rothfusz regression is not valid for extreme temperature or humidity conditions.
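
Before wrapping the formula in a function, it can be worth evaluating it once by hand. The sketch below plugs in a hypothetical reading of 90 degrees F and 50 % humidity (sample values chosen for illustration, with H kept in percent as the formula states) and evaluates the regression term by term.

```python
# Coefficients of the Rothfusz regression as listed above.
a, b, c = -42.379, 2.04901523, 10.14333127
d, e, f = -0.22475541, -0.00683783, -0.05481717
g, h, i = 0.00122874, 0.00085282, -0.00000199

T = 90.0    # hypothetical temperature in degrees F
H = 50.0    # hypothetical relative humidity in percent

# Parentheses let one expression continue across several lines.
HI = (a + (b * T) + (c * H) + (d * T * H)
      + (e * T**2) + (f * H**2) + (g * T**2 * H)
      + (h * T * H**2) + (i * T**2 * H**2))

print(round(HI, 1))
```

The result is about 94.6 degrees F, a few degrees warmer than the air temperature, as you would expect on a humid day.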

Replace the compute_windchill function in your heatindexcomp.py script with a compute_heatindex function:

# Compute the heat index
def compute_heatindex(t, rh_pct):
   a = -42.379
   b = 2.04901523
   c = 10.14333127
   d = -0.22475541
   e = -0.00683783
   f = -0.05481717
   g = 0.00122874
   h = 0.00085282
   i = -0.00000199

   rh = rh_pct  # the Rothfusz coefficients expect relative humidity in percent

   hi = (a + (b * t) + (c * rh) + (d * t * rh)
         + (e * t**2) + (f * rh**2) + (g * t**2 * rh)
         + (h * t * rh**2) + (i * t**2 * rh**2))
   return hi

Change the columns and types dictionaries we read from the data file to read in the humidity and heat index values as floats:

# Column names and column indices to read
columns = {'date': 0, 'time': 1, 'tempout': 2, 'humout': 5, 'heatindex': 13}

# Data types for each column (only if non-string)
types = {'tempout': float, 'humout': float, 'heatindex': float}

Update the function call and printing sections of the script to match:

# Compute the heat index
heatindex = []
for temp, hum in zip(data['tempout'], data['humout']):
   heatindex.append(compute_heatindex(temp, hum))

# Output comparison of data
print('                ORIGINAL  COMPUTED')
print(' DATE    TIME  HEAT INDX HEAT INDX DIFFERENCE')
print('------- ------ --------- --------- ----------')
for date, time, hi_orig, hi_comp in zip(data['date'], data['time'], data['heatindex'], heatindex):
   print(f'{date} {time:>6} {hi_orig:9.6f} {hi_comp:9.6f} {hi_orig-hi_comp:10.6f}')

Run this script with "python heatindexcomp.py" and see the results.
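
The printing loop above combines two features that may be new to you: zip, which walks several lists in step, and f-string format specifiers. The sketch below isolates them using hypothetical sample values (not taken from the data file) and a shorter format string. It assumes Python 3.6 or newer, which is when f-strings were introduced.

```python
dates = ["08/21/17", "08/21/17"]        # hypothetical sample values
times = ["12:05a", "1:10p"]
hi_orig = [61.4, 88.25]

rows = []
for date, time, hi in zip(dates, times, hi_orig):
    # >6 right-aligns in 6 characters; 9.6f is a float 9 wide, 6 decimals
    rows.append(f'{date} {time:>6} {hi:9.6f}')

for row in rows:
    print(row)
```

Fixed widths are what keep the columns lined up under the header lines even when the time strings differ in length.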

So far you have only revisited concepts from “First Python Script”.

Git stage and commit this new script.

$ git add heatindexcomp.py
$ git commit -m "Updating new heat index script"

Now, you have two scripts that do very similar things. In fact, all of the data reading and parsing code is duplicated! And the output is similarly formatted, too. Let’s remove that duplication!

Create a new file called readdata.py:
```
$ touch readdata.py
```
This new file will include the common code for reading the data file from both the windchillcomp.py and heatindexcomp.py scripts.

Copy and paste the lines for reading in the data file into readdata.py:

# Initialize my data variable
data = {}
for column in columns:
   data[column] = []

# Read and parse the data file
with open(filename, 'r') as datafile:

   # Read the first three lines (header)
   for _ in range(3):
      datafile.readline()

   # Read and parse the rest of the file
   for line in datafile:
      split_line = line.split()
      for column in columns:
         i = columns[column]
         t = types.get(column, str)
         value = t(split_line[i])
         data[column].append(value)

Turn these lines into a function:

def read_data(columns, types={}, filename="data/wxobs20170821.txt"):
   # Initialize my data variable
   data = {}
   for column in columns:
      data[column] = []

   # Read and parse the data file
   with open(filename, 'r') as datafile:

      # Read the first three lines (header)
      for _ in range(3):
         datafile.readline()

      # Read and parse the rest of the file
      for line in datafile:
         split_line = line.split()
         for column in columns:
            i = columns[column]
            t = types.get(column, str)
            value = t(split_line[i])
            data[column].append(value)

   return data

The function arguments for our read_data function are columns, types, and filename. The types and filename variables are both keyword arguments, which means that it is not necessary to include them in your function call; if you do not pass them, they take the default values assigned to them in the function definition.

When you see types={} it means that types is presumed to be an empty dictionary when unspecified (and so you don’t have to specify it every time you call the function when this keyword isn’t relevant).

Similarly, filename is set to the path of our data file as long as the user doesn’t specify a different file.

Keyword arguments can be called in any order, but they must follow all positional arguments (i.e., arguments that do not have default values).
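
The rules above are easier to see on a tiny example. The function describe below is hypothetical and exists only to show one positional argument followed by two keyword arguments with default values.

```python
# Hypothetical function: one positional argument, two keyword arguments.
def describe(column, units="F", precision=1):
    return f"{column} in {units}, {precision} decimal(s)"

default_call = describe("tempout")                  # both defaults used
one_keyword = describe("windspeed", units="mph")    # override one default
any_order = describe("tempout", precision=2, units="C")

print(default_call)
print(one_keyword)
print(any_order)
```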

Add a docstring to the function:

def read_data(columns, types={}, filename="data/wxobs20170821.txt"):
   """
   Read data from CU Boulder Weather Station data file

   Parameters:
      columns: A dictionary of column names mapping to column indices
      types: A dictionary of column names mapping to types to which
         to convert each column of data
      filename: The string path pointing to the CU Boulder Weather
            Station data file
   """

   # Initialize my data variable
   data = {}
   for column in columns:
      data[column] = []

   # Read and parse the data file
   with open(filename, 'r') as datafile:

      # Read the first three lines (header)
      for _ in range(3):
         datafile.readline()

      # Read and parse the rest of the file
      for line in datafile:
         split_line = line.split()
         for column in columns:
            i = columns[column]
            t = types.get(column, str)
            value = t(split_line[i])
            data[column].append(value)

   return data

The section between the triple quotes """ is the docstring. The summary line (“Read data from CU Boulder Weather Station data file”) describing what the function does and the list of parameters are standard information to include in a docstring, but neither is required. Everything between the triple quotes is essentially a comment that you can write and format any way you want.
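
Unlike an ordinary comment, a docstring is stored on the function object and can be read back at run time. The short sketch below uses a stub function (read_data_stub is a made-up name, not part of the tutorial code) to show two ways of retrieving it.

```python
import inspect


def read_data_stub(columns, types={}, filename="data/wxobs20170821.txt"):
    """
    Read data from CU Boulder Weather Station data file

    Parameters:
       columns: A dictionary of column names mapping to column indices
    """
    return {}


# The raw docstring, exactly as written between the triple quotes
raw_doc = read_data_stub.__doc__

# inspect.getdoc strips the common leading indentation for nicer display
clean_doc = inspect.getdoc(read_data_stub)

print(clean_doc)
```

In an interactive session, help(read_data_stub) displays the same text together with the function signature.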

This new file is a module. Modules are simply files containing Python code, meant to be called up (or “imported”) within a different Python script. We’ll get to this later.

Stage and commit this new file:

$ git add readdata.py
$ git commit -m "Adding new readdata module"

Amend your two Python scripts (heatindexcomp.py and windchillcomp.py) by deleting the equivalent read-file code in them.
Add the following import statement to the top of each script:

from readdata import read_data

In Python you can call up functionality from files outside of your active script using the import statement. Here we import our read_data function from the readdata module. And now we can call up the function from these scripts.
And after the initializations of the columns and types variables, replace the deleted code with a function call:

# Read data from file
data = read_data(columns, types=types)

The types=types says that the keyword argument types is being set equal to our dictionary types.

Test out both of these scripts to make sure they still work!
Do a "git status" now.

Do you notice something new? Running our new scripts created the __pycache__ directory.

What is __pycache__? When you run a Python program with an import statement, Python learns that you have written code that you may call again. The interpreter compiles the imported modules to bytecode and stores them in a cache, so that they load a little faster next time. As a user, you can for the most part ignore this new folder. If you change a module or delete the folder, the bytecode is recompiled and reappears in this folder the next time the module is imported.
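
If you are curious where the cached bytecode for a given module ends up, the standard library can tell you. This is a small sketch using importlib.util.cache_from_source; the module name readdata.py comes from the tutorial, and the file does not need to exist for the call to work.

```python
import importlib.util

# Ask Python where it would store the cached bytecode for this module file
cache_path = importlib.util.cache_from_source('readdata.py')

print(cache_path)
```

The exact file name depends on your Python version; with CPython 3.11, for example, it looks something like __pycache__/readdata.cpython-311.pyc.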

However, you don’t want to add this directory to our project repository, so before you commit anything, tell git to ignore it!

Create a new file (in the top-level directory of your project) called .gitignore
```
$ touch .gitignore
```
with the following contents:
```
__pycache__/
```
Do another git status. What do you see?

Now, instead of __pycache__ being listed as “untracked”, you see .gitignore being listed as “untracked”, and no mention of __pycache__.
Stage and commit the new .gitignore file.
```
$ git add .gitignore
$ git commit -m "Ignoring pycache"
```
Do another git status. Notice that the edits you made to your two scripts have still not been committed to the project repository, because they have not yet been staged.
Stage both files and commit all new changes in one commit:
```
$ git add -A
$ git commit -m "Refactor scripts to use new module"
```
You can type -A instead of the name of your files to add all unstaged changes.

There is still some duplicated code between the two scripts. Let’s combine the final output code and printing code.

Create another module file called printing.py:

$ touch printing.py

And create a printing function (with docstring!) in printing.py:

def print_comparison(name, dates, times, original_data, computed_data):
   """
   Print a comparison of two time series (original and computed)

   Parameters:
      name: A string name for the data being compared. (Limited
         to 9 characters in length)
      dates: List of strings representing the dates for each data element
      times: List of strings representing time of day for each data element
      original_data: List of original data (floats)
      computed_data: List of computed data (floats)
   """

   print(f'                ORIGINAL  COMPUTED')
   print(f' DATE    TIME  {name.upper():>9} {name.upper():>9} DIFFERENCE')
   print(f'------- ------ --------- --------- ----------')
   for date, time, orig, comp in zip(dates, times, original_data, computed_data):
      print(f'{date} {time:>6} {orig:9.6f} {comp:9.6f} {orig-comp:10.6f}')

The only new functionality shown here is string.upper() (or, specifically, name.upper()), which capitalizes all lower case letters in a string.
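
The format specifications inside the f-strings do most of the alignment work in print_comparison. The sketch below isolates them; the sample date, time, and data values are invented purely for illustration.

```python
name = 'dew pt'
date, time = '08/21/17', '0:05'
orig, comp = 58.5, 58.493217

# :>9 right-aligns the text in a field 9 characters wide
header = f'{name.upper():>9}'

# 9.6f prints a float with 6 decimal places in a field at least 9 characters wide
row = f'{date} {time:>6} {orig:9.6f} {comp:9.6f} {orig-comp:10.6f}'

print(header)
print(row)
```

Because every field has a fixed width, each row lines up under the column headings regardless of the individual values.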

Edit the two scripts to use this new module (similar methods to step #13-15), and test your results.

Try to do this on your own first, but if you are getting error messages the solution looks like:
1. Add the "from printing import print_comparison" line to the top of each script.
2. Replace the printing output section at the bottom of each script with:

  # Output comparison of data
  print_comparison('WINDCHILL', data['date'], data['time'], data['windchill'], windchill)

  or

  # Output comparison of data
  print_comparison('HEAT INDX', data['date'], data['time'], data['heatindex'], heatindex)

Stage all changes and commit:

$ git add -A
$ git commit -m "Creating printing module"

You now have 2 different modules related to the same project. It is best practice to separate different functions into different modules depending upon the kind of functionality they represent. In this case, you’ve separated out the concepts of “data input” and “printing output” into different modules.

Do the same thing with the computation functions, compute_windchill and compute_heatindex.

Move these functions into a new module called computation.py, and modify the scripts to use this new module. Remember to add docstrings!

Try to do this on your own first!!

Your new computation.py module should look similar to the following:

def compute_windchill(t, v):
   """
   Compute the wind chill factor given the temperature and wind speed

   NOTE: This computation is valid only for
      temperatures between -45F and +45F and for
      wind speeds between 3 mph and 60 mph.

   Parameters:
      t: The temperature in units of F (float)
      v: The wind speed in units of mph (float)
   """

   a = 35.74
   b = 0.6215
   c = 35.75
   d = 0.4275

   v16 = v ** 0.16
   wci = a + (b * t) - (c * v16) + (d * t * v16)
   return wci


def compute_heatindex(t, rh_pct):
   """
   Compute the heat index given the temperature and the humidity

   Parameters:
      t: The temperature in units of F (float)
      rh_pct: The relative humidity in units of % (float)
   """

   a = -42.379
   b = 2.04901523
   c = 10.14333127
   d = -0.22475541
   e = -0.00683783
   f = -0.05481717
   g = 0.00122874
   h = 0.00085282
   i = -0.00000199

   rh = rh_pct / 100

   # Parentheses let the sum continue across several lines
   hi = (a + (b * t) + (c * rh) + (d * t * rh)
         + (e * t**2) + (f * rh**2) + (g * t**2 * rh)
         + (h * t * rh**2) + (i * t**2 * rh**2))
   return hi
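
A long sum such as the heat index expression has to be split across lines with care. As a minimal illustration (the variable names here are made up), compare what happens with and without enclosing parentheses:

```python
a, b, c = 1.0, 2.0, 3.0

# Without parentheses the assignment ends at the end of the first line;
# the next line is a separate expression whose value is thrown away.
total_wrong = a + b
+ c

# Inside parentheses Python keeps reading until the closing parenthesis.
total_right = (a + b
               + c)

print(total_wrong, total_right)
```

The first version runs without any error message but silently drops the last term, which is why wrapping a multi-line expression in parentheses is the safer habit.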

Then modify the scripts accordingly, as in steps #13-15 and #19, by adding your import statements "from computation import compute_windchill" OR "from computation import compute_heatindex" and removing the redundant function definitions.

Your two scripts should look as follows:

For windchillcomp.py:

from readdata import read_data
from printing import print_comparison
from computation import compute_windchill

# Column names and column indices to read
columns = {'date':0, 'time':1, 'tempout':2, 'windspeed':7, 'windchill':12}

# Data types for each column (only if non-string)
types = {'tempout': float, 'windspeed':float, 'windchill':float}

# Read data from file
data = read_data(columns, types=types)

# Compute the wind chill factor
windchill = []
for temp, windspeed in zip(data['tempout'], data['windspeed']):
   windchill.append(compute_windchill(temp, windspeed))

# Output comparison of data
print_comparison('WINDCHILL', data['date'], data['time'], data['windchill'], windchill)

And for heatindexcomp.py:

from readdata import read_data
from printing import print_comparison
from computation import compute_heatindex

# Column names and column indices to read
columns = {'date': 0, 'time': 1, 'tempout': 2, 'humout': 5, 'heatindex': 13}

# Data types for each column (only if non-string)
types = {'tempout': float, 'humout': float, 'heatindex': float}

# Read data from file
data = read_data(columns, types=types)

# Compute the heat index
heatindex = []
for temp, hum in zip(data['tempout'], data['humout']):
   heatindex.append(compute_heatindex(temp, hum))

# Output comparison of data
print_comparison('HEAT INDX', data['date'], data['time'], data['heatindex'], heatindex)

Stage and commit everything:

$ git stage -A
$ git commit -m "Creating computation module"

Now, you’ve got quite a few Python files in the main directory. Which ones are scripts? Which ones are modules meant to be imported?

Typically, you should group all of the modules meant for import only into another directory called a package. A package is a directory containing a file called __init__.py inside it. (Note that this file is commonly empty.)

Create a new directory called mysci and create an empty file in it called __init__.py:
```
$ mkdir mysci
$ cd mysci
$ touch __init__.py
$ cd ..
```
Then, move the 3 modules into this package:
```
$ git mv readdata.py mysci/
$ git mv printing.py mysci/
$ git mv computation.py mysci/
```
Then, let’s modify the import statements at the top of our two scripts so that the modules are automatically imported from the new package:

from mysci.readdata import read_data
from mysci.printing import print_comparison
from mysci.computation import compute_heatindex

Stage everything (don’t forget the __init__.py file!) and commit
```
$ git add -A
$ git commit -m "Creating mysci package"
```
Our commits are getting bigger, but that’s okay. Each commit corresponds to a single conceptual change to the codebase.

With this last change, our project should look like this (ignoring the __pycache__ directories):
```
python_tutorial

   data/
      wxobs20170821.txt

   mysci/
      __init__.py
      readdata.py
      printing.py
      computation.py

   heatindexcomp.py
   windchillcomp.py
```
As a brief aside – look at the use of the computation functions in these scripts.

In the case of the wind chill factor computation, it looks like this:

# Compute the wind chill factor
windchill = []
for temp, windspeed in zip(data['tempout'], data['windspeed']):
   windchill.append(compute_windchill(temp, windspeed))

This divides the initialization of the windchill variable as an empty list from the “filling” of that list with computed values.

Python gives you some shortcuts to doing this via a concept called “comprehensions”, which are ways of initializing containers (lists, dicts, etc.) with an internal loop. For example, we could have written the previous 3 lines in the form of a “one-liner” like so:

# Compute the wind chill factor
windchill = [compute_windchill(t, w) for t, w in zip(data['tempout'], data['windspeed'])]

This is a list comprehension, and it initializes the entire list with the computed contents, rather than initializing an empty list and appending values to it after the fact. Computationally, this is actually more efficient.
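
To see that the two forms really produce the same list, here is a self-contained sketch. The function add and the sample values are hypothetical stand-ins for compute_windchill and the weather data.

```python
def add(t, w):
    """Hypothetical stand-in for a two-argument computation function."""
    return t + w


temps = [30.0, 32.5, 35.0]
winds = [10.0, 5.0, 0.0]

# Loop form: create an empty list, then append to it
looped = []
for t, w in zip(temps, winds):
    looped.append(add(t, w))

# Comprehension form: build the whole list in one expression
comprehended = [add(t, w) for t, w in zip(temps, winds)]

# The same idea works for dictionaries, using braces and key: value pairs
by_temp = {t: add(t, w) for t, w in zip(temps, winds)}

print(looped == comprehended, by_temp)
```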

Use list comprehensions to make the computation steps in both scripts one-liners.

Do a final stage and commit of your changes:

$ git add -A
$ git commit -m "Using list comprehensions"

That concludes the lesson on “Creating Your First Package”, the first in our introduction to Python packages series.

You should now be familiar with modules, using the import statement, some more f-string formatting options, __pycache__, .gitignore, __init__.py, and list comprehensions.

Using a Built-In Package¶

So far you have created separate readdata, printing, and computation modules to remove redundant code blocks from your scripts. And you have combined these modules into a package that we imported into our scripts.

Python comes with many different built-in packages (i.e., libraries) that you can import and use. The beauty of using built-in packages is that you don’t have to install anything new! If you can use and run Python, you already have access to these packages. For this tutorial, we are going to cover just a little bit of the built-in math package, which extends the computational capabilities beyond the basic math operators we’ve already covered.

Here is a video recording of the live tutorial covering “Using a Built-In Package and Publishing Your Package”:

Note

The recording is missing the first 10 minutes or so of the presentation and thus does not contain the introduction, the review from last session, and a brief list comprehension aside. The video begins when editing the mysci/computation.py module. There should be an import math statement at the beginning of the script and then you should be able to follow along with writing a new function compute_dewpoint, as shown.

Let’s begin.

Open your terminal, navigate to your python_tutorial directory and activate the corresponding environment.

Now we’re going to add a function for calculating dew point temperature to your mysci/computation.py module:

The formula for this is:

\[\Gamma = \log{(h)} + \frac{b * t}{c + t}\]

\[\textit{DPT} = \frac{c * \Gamma}{b - \Gamma}\]

Where DPT represents Dew Point Temperature in degrees C, h is the relative humidity expressed as a fraction (the percentage divided by 100), t is the temperature in degrees C, b = 18.678, and c = 257.14 degrees C.

In order to compute a natural logarithm, we will need to import the math package. It is best practice to import packages and modules at the beginning (top) of the file.

import math

To access the logarithmic function within the module math you would type math.log.

Then write the function at the bottom of the computation.py file (with best practice suggesting 2 empty lines between each function):

def compute_dewpoint(t, rh_pct):
   """
   Compute the dew point temperature given the temperature and humidity

   Parameters:
      t: The temperature in units of F (float)
      rh_pct: The relative humidity in units of % (float)
   """

   tempC = (t - 32) * 5 / 9 # Convert temperature from deg F to deg C
   rh = rh_pct / 100

   b = 18.678
   c = 257.14 # deg C

   gamma = math.log(rh) + (b * tempC) / (c + tempC)
   tdp = c * gamma / (b - gamma)

   tdp_F = 9 / 5 * tdp + 32 # Convert deg C to deg F
   return tdp_F

This function converts our input temperature to degrees Celsius and humidity to relative humidity, specifies the constants, calculates the dew point temperature, and finally converts that temperature to degrees Fahrenheit.
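
A quick sanity check of the formula is possible without the data file. At 100% relative humidity the logarithm term is zero, so the dew point should come out equal to the air temperature, and drier air should give a lower dew point. The sketch below repeats the function so that it runs on its own; the input values are arbitrary test values, not measurements.

```python
import math


def compute_dewpoint(t, rh_pct):
    """Same formula as the tutorial's compute_dewpoint, repeated so this runs alone."""
    tempC = (t - 32) * 5 / 9   # deg F to deg C
    rh = rh_pct / 100          # percent to fraction
    b = 18.678
    c = 257.14                 # deg C
    gamma = math.log(rh) + (b * tempC) / (c + tempC)
    tdp = c * gamma / (b - gamma)
    return 9 / 5 * tdp + 32    # deg C to deg F


# Saturated air: the dew point should match the air temperature
saturated = compute_dewpoint(68.0, 100.0)

# Drier air at the same temperature: the dew point should be lower
half = compute_dewpoint(68.0, 50.0)

print(saturated, half)
```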

Git add and commit mysci/computation.py:

$ git add mysci/computation.py
$ git commit -m "Function for Computing DPT"

Make a copy of your second script with the new name dewpointtempcomp.py:
```
$ cp windchillcomp.py dewpointtempcomp.py
```

Git add and commit dewpointtempcomp.py:

$ git add dewpointtempcomp.py
$ git commit -m "Creating a 3rd Script for DPT calculation"

Edit dewpointtempcomp.py:

Make changes to the import statements to include:

from mysci.computation import compute_dewpoint

And change your columns and types dictionaries to include dewpt:

# Column names and column indices to read
columns = {'date':0, 'time':1, 'tempout':2, 'humout':5, 'dewpt':6}

# Data types for each column (only if non-string)
types = {'tempout':float, 'humout':float, 'dewpt':float}

And finally, make changes to the function calls:

# Compute the dew point temperature
dewpointtemp = [compute_dewpoint(t, h) for t, h in zip(data['tempout'], data['humout'])]

# Output comparison of data
print_comparison('DEW PT', data['date'], data['time'], data['dewpt'], dewpointtemp)

Git add and commit:

$ git add dewpointtempcomp.py
$ git commit -m "Computed dew point temperature"

Let’s learn more about the math module!

Since you already imported code from your readdata, printing, and computation modules, importing from the built-in package math should seem a little less intimidating.

So far you have only used the math.log function, but let’s test out some other common functions within math.

Perhaps you want to change the base of your logarithm. To do this you could type math.log(x, base). Here base is an optional argument: like filename or types in our read_data() function it has a default, which means that base does not need to be specified. (Unlike those, it has to be passed by position, as in math.log(x, 10), because math.log does not accept keyword arguments.) When it is not specified, the logarithm is assumed to be natural (base e). When both arguments are entered, the function returns the logarithm of x to the given base, calculated by log(x)/log(base). Let’s test this out:
```
import math as m

x = m.e

y_natural = m.log(x)
y_base10 = m.log(x, 10)

print(x, y_natural, y_base10)
```
Something new that we have done here is use the "import ... as ..." statement. This essentially allows us to shorten the name of the module for convenience if it is very long or if we are going to be calling it a lot.

The symbol math.e represents Euler’s number (e), the base of the natural logarithm. Euler’s number (e) is an irrational number with infinite decimal places, often approximated as 2.718. How much more accurate is math.e than this approximation?

The function math.log(x, base) is very useful for computing logarithms in any base - but for some common bases there are separate logarithmic functions. Try using log10(x):
```
import math as m

x = m.e

y_natural = m.log(x)
y_base10 = m.log(x, 10)
y_log10 = m.log10(x)

print(x, y_natural, y_base10, y_log10)
```
Do the two values differ? The math.log10(x) function is considered to be more accurate than math.log(x, 10). Similarly math.log2(x) is more accurate than math.log(x, 2).
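
The difference is easier to spot with an exact power of ten than with e. The following sketch is only an illustration; the precise digits printed can vary with platform and Python build.

```python
import math

x = 1000

via_log = math.log(x, 10)   # computed as log(x) / log(10)
via_log10 = math.log10(x)   # dedicated base-10 function

print(via_log, via_log10, via_log == via_log10)
```

On many systems the first value prints as 2.9999999999999996 while the second is 3.0, because dividing two rounded natural logarithms introduces a small rounding error.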
Let’s cover some math trigonometry examples!

The math symbol $\pi$ is an irrational number (like e) that is approximately $\frac{22}{7}$ or 3.14159. We can access the most accurate float version of this number (depending on your C compiler), with math.pi.

Say we wanted to convert an angle of 60 degrees to radians. We have two options:
```
import math as m

deg = 60

rads = deg * m.pi / 180
rads_fromfunc = m.radians(deg)

print(deg, m.pi, rads, rads_fromfunc)
```
In the first example we used math.pi to perform our calculation (by default printed to 15 digits). In the second conversion, we used the function math.radians(x) which converts angle x from degrees to radians.

We can also use trigonometric functions: math.sin to get the sine value of an angle, math.cos to get the cosine, math.tan for the tangent, math.asin for the arc sine, math.acos to get the arc cosine, and math.atan to get the arc tangent. You can also calculate the hypotenuse of a right triangle from its two side lengths with math.hypot(). The input angle for sin, cos, and tan must be in radians to get the expected result, and the inverse functions return their angles in radians!
```
import math as m

deg = 180

cos_deg = m.cos(deg)
cos_rad = m.cos(m.radians(deg))

print(deg, cos_deg, cos_rad)
```
You might remember that the cosine of 180 degrees is -1. You can see that we only get the correct value if we enter the angle in radians (180 degrees = $\pi$ radians).
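
The inverse functions and math.hypot can be tried on a 3-4-5 right triangle. This is a small sketch with made-up side lengths; math.degrees is the counterpart of math.radians and converts the result back to degrees.

```python
import math

# Hypotenuse of a right triangle with legs 3 and 4
hyp = math.hypot(3, 4)

# Angle opposite the side of length 3; asin returns radians
angle_rad = math.asin(3 / hyp)
angle_deg = math.degrees(angle_rad)

print(hyp, angle_rad, angle_deg)
```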
Let’s use math.factorial():

Another popular math function is factorial() which is much faster and requires a lot less code than writing your own for loops to find the factorial of a number. Try math.factorial(5) and see what you get!
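
For comparison, here is what a hand-written version might look like next to the built-in call. The helper factorial_loop is a hypothetical name used only for this sketch.

```python
import math


def factorial_loop(n):
    """Compute n! with an explicit loop, for comparison with math.factorial."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


print(math.factorial(5), factorial_loop(5))
```

Both give the same answer, but the built-in version needs one line and no helper function.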

That concludes the “Using a Built-In Package” section of this tutorial.

You should now be familiar with importing packages that you did not build and some functions within the math module - specifically the log function.

See also

More information on the Math module

Publishing Your Package¶

Time to publish our package! …But what does that mean? Haven’t we published it already by hosting the git repository on GitHub?

In a sense, yes, you have already published your code. But you haven’t published it in a way that makes your code easy for someone else to install. That’s what packaging is all about.

In its current state, your code could be downloaded by somebody from GitHub using the git clone command:

$ git clone https://github.com/username/repo.git

where username and repo are your GitHub username and the name of the GitHub repository, respectively. This will download the git repository from GitHub and put it in a directory called repo. But then to use this code in your own project you would have to copy the contents of the repo directory into your own project space so that you could import the mysci package in your own scripts and code.

That’s burdensome! Fortunately, the Python developers created a way of installing external packages into a common space where your Python interpreter can find them. That tool is called pip, which is short for the “package installer for Python.” With pip, you can install a package that was downloaded (i.e., cloned) from GitHub, like so (make sure you are outside of any existing repository directory before you clone):

$ cd ..
$ git clone https://github.com/username/some_package.git
$ pip install some_package

…assuming that the some_package repository has been properly packaged, which is what this section of this tutorial is all about!

Now, before we begin teaching you how to package your code properly, so that other people can easily share it, you might be asking yourself, “Can’t I just install a package directly from GitHub?” And the answer is YES! You can! The git clone step can be skipped entirely by writing:

$ pip install git+https://github.com/username/some_package.git

where you should note the git+https:// protocol syntax, instead of just https://, which tells pip to do the git clone step first before trying to install the package.

…And, lastly, if you are looking at this and still thinking that it looks messy, then you are in luck! The Python development community has created a free online service called the Python Package Index (PyPI) that allows you to publish your package to the PyPI servers so that other users can then install your package by simply executing:

$ pip install some_package

At the end of this section, we’ll talk about how to publish your package to PyPI.

But first, let’s learn how to package our code properly.

Create a setup.py file one level above your mysci package (in the python_tutorial directory):

$ touch setup.py

The setup.py file is a Python file necessary for package distribution. This file tells pip how to install your package into the common Python space for your python interpreter. Required information is the name of your package, the version of your package (which you can choose), and a list of packages you’d like installed by pip (e.g., your mysci package).

Its contents will look as follows (but with your name and email):

from setuptools import setup

setup(
    name="mysci",
    version="1.0.0",
    description="A sample package",
    author="Xdev",
    author_email="xdev@ucar.edu",
    packages=["mysci"],
    install_requires=[],
)

This setup.py includes the information on the package name, version, description, author, contact info, contents, and dependencies (install_requires) which is set to an empty list since our current package uses no external packages.

Push to GitHub!

$ git add setup.py
$ git commit -m "Adding setup.py"
$ git push origin main

Pip Install your package locally.

To test that our package is set up correctly, let’s install it from our project directory.
```
$ pip install .
```
Everything should install smoothly, and now you will be able to import mysci in any Python code that you write, regardless of where that code is…*as long as you use the same python interpreter*! See the Note below.

Note

The pip and python commands are tied to one another. You can think of it as the pip command installing packages into python. At the beginning of this tutorial, when we created the Conda environment python_tutorial, we installed python into that Conda environment. Conda also installed pip into that environment, so you can use that Conda environment’s pip to install packages into that same Conda environment’s python.
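
One way to check which interpreter you are running, and whether it can see a given package, is sketched below. The helper is_installed is a made-up name for this example; it relies on importlib.util.find_spec, which returns None when a top-level package cannot be found.

```python
import importlib.util
import sys

# Path of the Python interpreter that is running this code
print(sys.executable)


def is_installed(package_name):
    """Return True if this interpreter can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


print(is_installed('math'))    # part of the standard library
print(is_installed('mysci'))   # depends on whether you have run pip install
```

Run this from a directory other than python_tutorial, otherwise Python finds the local mysci folder directly and the check tells you nothing about the installation.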

Now, before moving on, let’s use pip to uninstall the package we just installed:
```
$ pip uninstall mysci
```
Install from your GitHub repository

Now, let’s re-install our package directly from GitHub.
```
$ pip install git+https://github.com/Username/Project.git
```
To do this replace Username and Project with your target username and repository (likely mysci for this example).

Note

If you are not comfortable with people using your code you can change the privacy and permission settings of your repository.
How to publish to PyPI

With our package containing a properly formed setup.py, it is now ready for publication on PyPI (https://pypi.org/). We don’t recommend that you actually publish this package (i.e., the one you just created in this tutorial) because every package on PyPI needs a unique name, which means only one of you will be able to actually perform this step of the tutorial successfully! Also, the package we’ve created in this tutorial is probably not the most useful package out there, so maybe it’s not worth sharing.

Anyway, we will give you the instructions for how to publish a package to PyPI here, so that when you do actually create a package you want to share with the world, you will know how to do it.

The first step is to create an account on PyPI. Follow this link to do so: (https://pypi.org/account/register/). Take note of your newly created username and password.

To upload our package, we will need to use another external package called Twine (https://twine.readthedocs.io/en/latest/). We’ll install this new package with pip:
```
$ pip install twine
```
By installing this package, a new utility called twine will be installed that you can use to upload your package. First, however, we need to build a distribution package using our newly created setup.py file. To do that, execute the following command in the same directory where the setup.py file is located:
```
$ python setup.py sdist bdist_wheel
```
Then, to upload your newly created distribution package to PyPI, execute the following:
```
$ twine upload dist/*
```
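Twine will then ask for your PyPI credentials. (PyPI currently expects an API token, created in your account settings, in place of your account password.)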

Once the upload succeeds, head to PyPI and see your package displayed as a new release!

That concludes the “Publishing Your Package” section of this tutorial.