1. What is Python?

Python is an increasingly popular high-level programming language. It emphasizes legibility over highly complex structure. Python innately provides simple data structures allowing for easy data manipulation. Python provides a simple approach to object-oriented programming, which in turn allows for intuitive programming, and has resulted in a large user community that has created numerous libraries that extend the basic capacities of the language.

Python is an "interpreted" language. This means that Python code is not compiled to machine code ahead of time; instead, a program called the interpreter reads each command and translates it into lower-level instructions as it runs. Lower-level programming languages are very fast and powerful, but writing programs in these languages can be difficult.

Numerous Python tutorials exist; some can be found here:

2. Downloading Python/Anaconda:

Note: If you're using this tutorial on an Athena computer, you can skip this step.

While you can certainly download Python directly from the Python homepage, we suggest installing it as part of Anaconda - a platform built to complement Python by creating customizable and easily accessible environments in which you can run Python scripts.

We won't be going through the features of Anaconda in this tutorial, so we encourage you to check out the 30-minute test drive for a crash course if you're interested.

Download Anaconda directly from their website. Version 3.X is still somewhat newer, so we'll go ahead and use version 2.7. The great thing about Anaconda is that once you have one version installed, you can actually install another version and run them in separate environments for specific projects. Follow the instructions for the graphical installer (additional information here).

Check to make sure the install worked properly by opening Terminal or the Command Prompt and entering:

conda info

If you see the version, platform, and environment information appear, the installation was successful.


3. Jupyter (IPython) Notebooks:

A Jupyter Notebook, formerly known as IPython Notebook, is a web-based tool for interactively writing and executing Python code. It allows the programmer to easily write and test code by allowing snippets of code and their results to be displayed side-by-side. Each snippet of code is called a "cell". It also makes documentation easy by allowing some cells to contain text, HTML code, images, or even LaTeX.

To open a new Jupyter Notebook, open Terminal/Command Prompt and navigate to the appropriate folder (I'll simply use my desktop):

cd Desktop

Then launch Jupyter Notebook with:

jupyter notebook

This opens the Jupyter Notebook homescreen in our browser. Create a new notebook by selecting Notebooks -> Python [Root] from the New dropdown menu in the upper-right corner. This will launch a new notebook.

Notebooks are arranged as a series of individual cells. Each cell can contain a small piece of code that can be run for testing purposes. The output of the code or any errors will be "printed" below the cell. This allows us to check individual snippets of code and view output quickly. Since each cell is run individually, any cells that depend on outputs from prior cells will need to be run after the cells that come before. This is true each time a notebook is re-opened.

We can add/delete cells using the add/delete (plus-sign/scissors) buttons in the notebook menu. To run the code in a cell, simply hit CTRL+Enter. We'll start entering code in the next section.


4. Python Libraries:

Python is a dynamically typed language. A language has dynamic typing when variable types are not declared in advance, as they are in a statically typed language such as C or Java; instead, the type belongs to the value a variable currently holds, and it is determined when the code is run.
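As a minimal sketch of what this means in practice, the snippet below rebinds one name to values of three different types and records the type each time. The variable names are our own, and print is written with parentheses so that the snippet runs under both Python 2.7 and Python 3.

```python
# A name is not tied to one type; the type belongs to the value it holds.
value = 4                          # an integer
first_type = type(value).__name__
value = 4.0                        # the same name now holds a float
second_type = type(value).__name__
value = "four"                     # and now a string
third_type = type(value).__name__

print(first_type)
print(second_type)
print(third_type)
```

No declaration was needed at any point; each assignment simply replaced the previous value.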

Dynamic typing and a number of other language-specific characteristics, like readability and reusability, make Python a very popular programming language with a large user community. However, Python on its own only provides a limited set of modules and functionality. In order to extend Python's functionality, the active community has created a very large number of libraries. A library is a built-in or external module that can be imported into our current code to add functionality. Libraries usually take advantage of object-oriented programming, defining Python objects in additional scripts that can then be instantiated in our current code.

Loading libraries into our current context can be expensive; for that reason, Python requires us to explicitly load the libraries that we want to use. Type the following code into the first cell, and hit CTRL+Enter to run the code:

import math

To enter another line of code, we can either type it on a second line in our current cell and re-run the cell, or add a new cell. Let's add a new cell and enter the following code in order to be more specific and import only particular classes or functions of the library:

from math import pi

We can also change the name of the library when it gets imported:

import numpy as np
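The three import styles can be compared side by side using only the built-in math module. This is a small sketch: the alias m and the variable names are our own choices, not a convention.

```python
import math                 # load the whole module; names are reached as math.<name>
from math import pi, sqrt   # load only the selected names
import math as m            # load the module under a shorter alias

circle_area = pi * 2 ** 2          # uses the directly imported name pi
diagonal = sqrt(3 ** 2 + 4 ** 2)   # uses the directly imported function sqrt
same_pi = (math.pi == m.pi)        # both names refer to the same module

print(circle_area)
print(diagonal)
print(same_pi)
```

Importing only the names we need keeps the code short, while the module-prefixed form makes it clear where each name comes from.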

Python Data Structures

A data structure is a way of organizing data in the computer's memory. Data structures can implement one or more abstract data types (ADTs). Some data structures can be built based on a basic type or basic building block. Most languages allow more complicated composite types to be recursively constructed starting from basic types.

0. Comments

Comments aren't technically a data structure, but they are a critical component of good coding. Comments let you document your code and are not executed. They are very useful when sharing code, or even when going back to your own code after a while. It is good practice to comment your code. We can have single or multi-line comments.

Single-line comments start with the hash character, #, and extend to the end of the physical line. A comment may appear at the start of a line or following whitespace or code.

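Multi-line comments are usually written as triple-quoted strings. They start with """ and end with """. Strictly speaking these are string literals that Python evaluates and discards; when one is the first statement in a module, function, or class, it is known as a docstring.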
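The difference between a docstring and an ordinary comment can be seen with a short sketch. The function area_of_square is a made-up example (functions are covered in a later section), and print is written with parentheses so the snippet runs under both Python 2.7 and Python 3.

```python
def area_of_square(side):
    """Return the area of a square with the given side length."""
    # This hash comment is ignored entirely by Python.
    return side * side

# The docstring is stored on the function and can be read back.
doc = area_of_square.__doc__
print(doc)
print(area_of_square(3))
```

The triple-quoted text stays attached to the function, while the hash comment leaves no trace at run time.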

# This is a single line comment

"""
This is a
multi-line
comment.
"""

1. Variables

Variables allow us to store results, functions, or data in order to later retrieve them by name. They give our code persistence. Compared to other programming languages, you don't have to explicitly define a variable name or its datatype beforehand - you can create variables on the fly! Variables can also be continuously redefined. Python emphasizes legibility, and good naming practices refer specifically to the type of data you are storing.

From here on out, we'll use print to display the output of our code:

# We can define variables without having to declare their type. We can name it whatever we want.
x = 4.0
print x
print x*2
# you can change the value of x as often as you want.
x = 3
print x
x = x+2
print x
# the following retrieves the value stored in x, adds 2 to it, and stores the result in x.
x+=2
print x
x-=10
print x

2. Numeric Types and Their Methods

Python implements four distinct numeric types: plain integers, long integers, floating point numbers, and complex numbers. In addition, Booleans are a subtype of plain integers.

Variables can be defined based on constructing a numeric data type. Every time we define a variable with a number, we are constructing an instance of the datatype. Different datatypes have different constructors; to construct a numeric data type, you only need to type it! In general, numeric types (and data types) have methods and properties. Properties allow you to access specific data of a given object, and methods allow you to do operations with them.

# Integers (int) are a numerical datatype.
print 1+2
# Floats are another numeric type that allows for decimals.
print 1.0+.5
print 3.0-2.0
print 6*5.0
print 7.0/2

Notice integer division and floating-point error below:

1/2,1/2.0,3*3.2
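Both effects can be isolated in a short sketch. It uses the // operator, which performs floor division in Python 2.7 and Python 3 alike, and the well-known 0.1 + 0.2 case for rounding error; the tolerance of 1e-9 is an arbitrary choice for illustration.

```python
whole = 7 // 2       # floor division discards the remainder
exact = 7 / 2.0      # one float operand is enough to get a float result
total = 0.1 + 0.2    # stored in binary, so not exactly 0.3

is_equal = (total == 0.3)               # exact comparison fails
is_close = abs(total - 0.3) < 1e-9      # comparing with a tolerance succeeds

print(whole)
print(exact)
print(is_equal)
print(is_close)
```

When comparing floats, testing whether two values are close enough is usually safer than testing for exact equality.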

3. Strings

Strings are sequences of characters. We can create them by enclosing characters in quotes. Python treats single quotes the same as double quotes.

Strings can be thought of as a list of characters. Therefore, we can access substrings by using a syntax similar to that of lists (which we'll cover shortly). They also have a number of methods or operations that we can perform with them.

# A string is a sequence of characters.
print "Hello World."
    
# Let's store a string in a variable called "string".
# Note that you can use either ' or " to define strings.
string = 'This is a string.'
print string
    
# We can access individual characters from the string.
print string[2]
    
# You can add strings together
string = string + " Another string."
print string
string+=" A third string."
print string
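Substrings and string methods can be sketched as follows. The variable names are our own; upper, split, and find are standard string methods, and print is written with parentheses so the snippet runs under both Python 2.7 and Python 3.

```python
text = "Hello World."

first_word = text[:5]            # characters at index 0 through 4
last_char = text[-1]             # negative indices count from the end
shouted = text.upper()           # methods return a new string
words = text.split(" ")          # split on spaces into a list of strings
position = text.find("World")    # index where the substring starts, or -1

print(first_word)
print(last_char)
print(shouted)
print(words)
print(position)
```

None of these operations changes text itself; strings cannot be modified in place, so each call produces a new value.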

4. Booleans

Booleans are a binary datatype. They have 2 values: True and False (or 1 and 0). Booleans are useful when testing for truth value; we can test them in an if or while condition or as an operand of Boolean operations.

# booleans, or bools, can hold only two possible values: True or False.
is_true = True
print is_true
  
# There are several operators that act on booleans.
# Let x and y be variables storing booleans.
# "not x" switches the value of x. If x is True, then "not x" is False.
# "x and y" returns True only if both x and y are True.
# "x or y" returns True if x or y (or both) is True.
x = True
y = False
print x
print not x
print x and y
print x and not y
print x or y
print not x or y
# There are operators to make comparisons between values and produce bools.
# '==' tests for equality
print 1 == 2
print 1 == 1
# '!=' tests for inequality
print 1 != 2
# comparison functions also apply to strings
print "abcd" == "abcd"
# Variables and the values they refer to can be compared.
x = 1
y = 2
# compare the values of x and y
print x == y
# make x refer to the same value as y
x = y
print x
print y
# check if x and y refer to the same object using the 'is' operator.
print x is y
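The difference between == and is becomes clearer with lists (covered in a later section), because two separate lists can hold equal contents. This is a sketch with names of our own choosing; print is written with parentheses so the snippet runs under both Python 2.7 and Python 3.

```python
a = [1, 2, 3]
b = [1, 2, 3]    # a second list with equal contents
c = a            # a second name for the same list object

equal_values = (a == b)      # compares contents
same_object_ab = (a is b)    # compares identity: two separate lists
same_object_ac = (a is c)    # both names refer to one list

c.append(4)                  # changes the single list that a and c share

print(equal_values)
print(same_object_ab)
print(same_object_ac)
print(a)
```

Use == to ask whether two values are equal, and is only to ask whether two names refer to the very same object.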

5. Flow Control

Python supports imperative programming, a paradigm in which a program is a sequence of statements that change the program's state. Which statements run, and in what order, is determined by control-flow structures. In Python there are three main categories of program control flow:

  • loops
  • branches
  • function calls

Booleans can be used to control the flow of execution of the code. If-statements execute a section of code if a given bool evaluates to True. If-statements have a specific syntax: the condition ends with a colon, and the code it controls is indented.

flag = True
x = 0
if flag:
    x = 1
    print "Flag is True."
else:
    x = 2
    print "Flag is False."
print x
# We can check for other cases as well. Controlling the execution of code
# like this is referred to as "flow of control".
if x == 0:
    print "A"
elif x == 1:
    print "B"
else:
    print "C"

6. Lists

Lists are a data structure designed for easy storage and access to data. They are initialized by using "[]" to enclose a comma-separated sequence of values. These values can be anything. Lists can contain the same type of values or a heterogeneous mix of values. We can access individual elements of a list, a subset of elements, or the whole list. Lists are mutable: we can modify their elements.

Python deals with multiple data structures in a similar manner. For example, lists, dictionaries, files, and iterators can all be looped over with the same for syntax.

# an empty list
L1 = []
x = 5
# a list containing different values
L2 = [1,2.0,'a',"abcd",True,x]
# lists can be built dynamically using "append" and "extend"
L1.append(1)
L2.append(2)
print L1
print L2
L3 = ['a','b','c']
L1.extend(L3)
print L1
L1.append(L3)
print L1
# Values stored in lists are accessible by their index in the list.
# Lists maintain the ordering in which values were stored in them.
# We use "[i]" to retrieve the ith element in a list.
# Note that the first element in a list in Python has an index of 0.
L = ['a','b','c','d','e']
print L[0]
print L[1]
# We can access from the ends of lists as well.
print L[-1]
print L[-2]
# We can access chunks of a list to produce sub-lists.
print L[:2]
print L[2:4]
# There is a useful function for producing sequences of numbers.
print range(10)
print range(2,10)
print range(4,10,2)
# The length of a list can be calculated using "len()"
print len(range(10))
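Mutability, mentioned at the start of this section, can be sketched as follows. The names letters, alias, and snapshot are our own, and print is written with parentheses so the snippet runs under both Python 2.7 and Python 3.

```python
letters = ['a', 'b', 'c', 'd', 'e']

letters[0] = 'z'             # replace a single element
letters[1:3] = ['x', 'y']    # replace a slice of elements
removed = letters.pop()      # remove and return the last element

alias = letters              # another name for the same list
snapshot = letters[:]        # a full slice makes a separate copy
alias.append('q')            # visible through letters, not through snapshot

print(removed)
print(letters)
print(snapshot)
```

Assigning a list to a new name does not copy it, so changes made through one name show up under the other; slicing with [:] is one simple way to get an independent copy.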

7. Dictionaries

Dictionaries, called "dicts" for short, allow you to store values by providing identifying keys and values. They always have key/value pairs. Dicts are initialized using "{}". Placing a comma-separated list of key:value pairs within the braces adds initial key:value pairs to the dictionary. Dictionaries are indexed by keys instead of numeric indices.

It is best to think of a dictionary as an unordered set of key:value pairs, with the requirement that the keys are unique (within one dictionary).

The keys() method of a dictionary object returns a list of all the keys used in the dictionary, in arbitrary order. To check whether a single key is in the dictionary, use the in keyword.

# an empty dict
d = {}
D2 = {'key1':1,'key2':'moose',4:5}
print D2
# Key-value pairs can also be defined like this
D2[6] = False
print D2
# values can be retrieved using their keys
print D2['key1']
print D2[6]

if not D2[6]:
    print "Dicts are fun."
else:
    print "Dicts are not that fun."

# The keys and values of dicts can be accessed as lists
print "keys: "+str(D2.keys())
print "values: "+str(D2.values())
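The in keyword and key-based lookup can be sketched with a small made-up dictionary. The inventory data is purely illustrative; get is a standard dict method that returns a default instead of raising an error when a key is missing, and the keys are sorted only to make the printed order predictable.

```python
inventory = {'apples': 3, 'pears': 0}

has_apples = 'apples' in inventory     # membership is tested against the keys
has_plums = 'plums' in inventory
plums = inventory.get('plums', 0)      # fall back to 0 when the key is missing

inventory['plums'] = 5                 # add a new key:value pair
names = sorted(inventory.keys())       # sorted list of the keys

for name in names:
    print(name + ": " + str(inventory[name]))
```

Looking up a missing key with square brackets raises a KeyError, so checking with in or using get is a common way to guard against that.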

8. Iteration

Loops allow us to automate repetitive tasks. The repeated execution of a set of statements is called iteration. There are a number of ways to iterate in Python: we can use for loops or while loops. The syntax is like the syntax of if-statements. A for loop steps through each of the elements of a list or iterator, assigning the current element to the variable name given. A while loop repeats a sequence of statements until some condition becomes false.

X = range(10)
print X
for x in X:
    print x

for i in range(len(X)):
    # doubles the list element
    X[i]*=2
print X

We can control the execution of a loop through different statements. Python includes statements to exit a loop prematurely. To exit a loop, use the break statement. The loop below is a for loop.

for i in range(10):
    print i
    if (i > 5):
        break

While loops are also supported. Inside a loop, continue jumps straight to the next iteration, skipping the rest of the loop body, while break exits the loop entirely.

i=0
while i < 10:
    print i
    i=i+1
    if i < 5:
        continue
    else:
        break
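The two statements can also be combined in a for loop. The sketch below collects odd numbers and stops once they exceed 7; the list name odds and the cutoff are arbitrary choices for illustration.

```python
odds = []
for i in range(10):
    if i % 2 == 0:
        continue        # even number: skip the rest of the body
    if i > 7:
        break           # leave the loop entirely
    odds.append(i)      # only reached for odd numbers up to 7

print(odds)
```

Here continue filters out individual iterations, whereas break ends the loop as soon as its condition is met.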

Now, build a list that contains all the courses that you are taking this semester, and print them. Every time you print a course number, add the phrase "In Spring 2016, I am taking:"

9. Functions

Functions allow a programmer to write reusable code to perform a single action. Functions provide better modularity for your application and a high degree of code reuse. Once a function is defined, it can be called by typing the name of the function and passing the arguments. For example, Python gives you many built-in functions like len() and range(). Functions can just perform an operation, or they can return values.

Functions are defined using the keyword def.

# consider this example:
# First choose an initial value for x.
x = 0
for i in range(100):
    x+=i
print x

# what if we do this for a new initial value for x?
# what if we use a different number instead of 100?
# we don't want to rewrite this for loop every time
# let's define a function.
def ForSum(x,y):
    for i in range(y):
        x+=i
    # "return" indicates what values to output
    return x

# same calculation as our previous for loop
print ForSum(0,100)
# now using different values
print ForSum(10,50)

# interestingly, variables can refer to functions.
# this means functions can be inputs into other functions
F = ForSum
print F(0,100)

def execute(funct,x):
    return funct(x,100)
    
print execute(F,10)

# now, just for fun:
print F(F(F(10,100),50),1000)
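The difference between a function that only performs an operation and one that returns a value can be sketched like this. Both function names are our own; for_sum mirrors ForSum above, and print is written with parentheses so the snippet runs under both Python 2.7 and Python 3.

```python
def describe(total):
    # Performs an action (printing) but has no return statement,
    # so calling it evaluates to None.
    print("The total is " + str(total))

def for_sum(start, count):
    # Adds 0, 1, ..., count-1 to start and hands the result back.
    for i in range(count):
        start += i
    return start

result = for_sum(0, 100)      # the returned value can be stored and reused
nothing = describe(result)    # nothing useful comes back from describe

print(result)
print(nothing)
```

A returned value can be passed on to other code, whereas a function without a return statement is useful only for its side effect.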

Now wrap the code you created for the previous section in a function called "course_printer".

Vectorization

NumPy arrays are a bit different from regular Python lists and are the bread and butter of data science. Pandas Series are built atop them.

Operations on NumPy arrays and, by extension, Pandas Series, are vectorized: they apply to every element at once. You can add two NumPy arrays elementwise by just using +, whereas for regular Python lists + joins the two lists end to end, which may not be what you expect. To add regular Python lists elementwise, you will need to use a loop:

alist=[1,2,3,4,5]
newlist = []
for item in alist:
    newlist.append(item+item)
newlist
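To see what + does with two regular lists, and what the elementwise alternative looks like, consider this sketch. The second list blist is our own addition, and zip is the built-in function that pairs up elements from two sequences.

```python
alist = [1, 2, 3, 4, 5]
blist = [2, 4, 6, 8, 10]

joined = alist + blist     # + joins the lists end to end

elementwise = []
for x, y in zip(alist, blist):     # walk through both lists in step
    elementwise.append(x + y)

print(joined)
print(elementwise)
```

The loop does the work that a single + performs on vectorized arrays.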

Vectorization is a powerful tool. For almost all data-intensive computing, you will use numpy arrays rather than Python lists.

You've seen a similar concept in a spreadsheet where you add an entire column to another one.

# numpy must be imported before use (harmless if already done above)
import numpy as np

# we'll use numpy's array function to vectorize lists
# we can do this on the fly:
a=np.array([1,2,3,4,5])
print a
print type(a)

# or we can reference a previously defined list:
L=[2,4,6,8,10]
b=np.array(L)
print b
print type(b)

# now we can use vector math to combine the two
print a+b
print a-b
print a*b
print a/b # with integer arrays in Python 2.7, / performs integer division
print a**b

Pandas

Big Data and Society Tutorial

