Part 02 gives us a taste of rudimentary intelligence that allows the software to react based on conditions or exceptions. Some brains, if you will.
Data containers allow us to retain more than one value per variable and perform operations on them. To make the software more convenient for the pursuit of automation, we would need some muscles too, which would be ways to perform repetitions.
Similar to strings, but more capable, Python lists (`list`) can contain a sequence of more than one type of data.
'''norse_shop.py'''
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
row1 = ['Yggdrasil', 790.2, 477.85, 53, 7]
row2 = ['Valhalla', 1700.65, 1500, 11, 10]
One can conceptualize the above example as a data table or spreadsheet. The `header` variable holds a `list` of string values, while `row1` and `row2` each contain a `list` of mixed string, `float`, and `int` values.
'''norse_shop.py'''
# ...
csv_header = ','.join(header)
print(csv_header)
In fact, we can loosely translate a `list` into a CSV (comma-separated values) formatted string by leveraging the `str.join()` method.
% python norse_shop.py
poi,revenue,cost,visits,unique_visitors
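As a minimal sketch of how `str.join()` behaves, the snippet below joins the same `header` list with two different separators. The `tsv_header` name is made up for this illustration and does not appear in `norse_shop.py`.

```python
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']

# the string that join() is called on becomes the separator between items
csv_header = ','.join(header)    # comma-separated
tsv_header = '\t'.join(header)   # tab-separated (hypothetical variant)

print(csv_header)
print(tsv_header)

# for simple values without embedded separators, str.split() reverses the join
assert csv_header.split(',') == header
```

The separator only appears between items, so there is no trailing comma to clean up.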
Let us attempt the same with the actual data rows:
'''norse_shop.py'''
# ...
csv_row1 = ','.join(row1)
print(csv_row1)
csv_row2 = ','.join(row2)
print(csv_row2)
% python norse_shop.py
Traceback (most recent call last):
File "norse_shop.py", line 9, in <module>
csv_row1 = ','.join(row1)
TypeError: sequence item 1: expected str instance, float found
The above error indicates that the `str.join()` method expects a sequence of `str` values only. The first violation is item 1 of `row1` (the second item), `790.2`, which is of type `float`. We can apply type casting to fix that and all other non-string values:
'''norse_shop.py'''
# ...
row1[1] = str(row1[1]) # index 1 (second item)
row1[2] = str(row1[2]) # index 2 (third item)
row1[3] = str(row1[3]) # index 3 (fourth item)
row1[4] = str(row1[4]) # index 4 (fifth item)
csv_row1 = ','.join(row1)
print(csv_row1)
poi,revenue,cost,visits,unique_visitors
Yggdrasil,790.2,477.85,53,7
As `list` is a sequence type like `str`, it also supports index operations. However, one key difference involves the concept of mutation:
>>> s = 'Canada'
>>> s[0] = 'B'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> l = ['C', 'a', 'n', 'a', 'd', 'a']
>>> l[0] = 'B'
>>> l[-2] = 'n'
>>> l
['B', 'a', 'n', 'a', 'n', 'a']
>>> ''.join(l)
'Banana'
Individual items of a string cannot be mutated by assignment, while in a list they can.
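To make the contrast concrete, here is a small sketch: "changing" a string means building a new one, whereas a list keeps its identity while its items change. The use of `id()` to compare object identity is an addition for illustration only.

```python
s = 'Canada'
# strings are immutable: build a new string instead of assigning to s[0]
t = 'B' + s[1:]
assert s == 'Canada'   # the original string is unchanged
assert t == 'Banada'

l = ['C', 'a', 'n', 'a', 'd', 'a']
identity_before = id(l)
l[0] = 'B'
l[-2] = 'n'
# still the very same list object, only its items were mutated
assert id(l) == identity_before
assert ''.join(l) == 'Banana'
```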
Performing operations on individual items of a list one by one feels like a chore. Through a `for` loop we can automate that chore away:
'''norse_shop.py'''
# ...
for i in range(len(row1)):
    if type(row1[i]) is not str:
        row1[i] = str(row1[i])
csv_row1 = ','.join(row1)
print(csv_row1)
To digest the code snippet above:
1. A `range()` of item indexes of the target `list` (`row1`) is built with the help of the `len()` function.
2. A `for` loop iterates `in` that range of indexes, where the temporary variable `i` represents the positional index, from left to right (or from 0 to length - 1), that corresponds to each list item.
3. An `if` condition specifies our intent to cast only non-string values into the `str` type.
4. When the condition from point 3 is satisfied, we mutate the item at the given index `i` by casting it into the `str` type.
Note: The `if` condition within the `for` loop does not serve a practical purpose. The code works without it because `str('already string') == 'already string'`, and the computational cost is negligible in this particular case.
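The note can be checked with a short sketch that runs both variants on two identical rows. The names `with_check` and `without_check` are made up for this comparison.

```python
with_check = ['Yggdrasil', 790.2, 477.85, 53, 7]
without_check = ['Yggdrasil', 790.2, 477.85, 53, 7]

# variant 1: only cast items that are not already strings
for i in range(len(with_check)):
    if type(with_check[i]) is not str:
        with_check[i] = str(with_check[i])

# variant 2: cast every item, strings included
for i in range(len(without_check)):
    without_check[i] = str(without_check[i])

# both variants end up with the same list of strings
assert with_check == without_check
print(without_check)  # ['Yggdrasil', '790.2', '477.85', '53', '7']
```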
Let us wrap this operation into a function and apply it to both rows through another `for` loop:
'''norse_shop.py'''
# ...
def mutate_row(row):
    for i in range(len(row)):
        row[i] = str(row[i])

for row in [row1, row2]:
    mutate_row(row)
    csv_row = ','.join(row)
    print(csv_row)
% python norse_shop.py
poi,revenue,cost,visits,unique_visitors
Yggdrasil,790.2,477.85,53,7
Valhalla,1700.65,1500,11,10
Notice the difference in the `for` loop usages. Unlike in the `mutate_row()` function, the outer loop iterates through a list of lists by value (instead of by index). Both cases iterate over sequences, but with different objectives and access patterns.
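The choice between the two access patterns matters when mutation is the goal. The following sketch, using a standalone `row` list for illustration, suggests why `mutate_row()` needs the index: reassigning the loop variable in a by-value loop leaves the list untouched.

```python
row = ['Valhalla', 1700.65, 1500, 11, 10]

# by value: `v` is just a name for the current item,
# so rebinding it does not change the list
for v in row:
    v = str(v)
assert row == ['Valhalla', 1700.65, 1500, 11, 10]

# by index: assigning through the index mutates the list in place
for i in range(len(row)):
    row[i] = str(row[i])
assert row == ['Valhalla', '1700.65', '1500', '11', '10']
```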
Things are coming along nicely. But if we want to find the profit (`revenue - cost`), we will surely encounter another, albeit familiar, type error:
'''norse_shop.py'''
# ...
# add profit header
header.append('profit')
csv_header = ','.join(header)
print(csv_header)
for row in [row1, row2]:
    mutate_row(row)
    csv_row = ','.join(row)
    # compute profit for each row and concatenate to the csv_row
    profit = row[1] - row[2]
    # another way to concatenate strings
    csv_row = ','.join([csv_row, str(profit)])
    print(csv_row)
% python norse_shop.py
Traceback (most recent call last):
File "norse_shop.py", line 17, in <module>
profit = row[1] - row[2]
TypeError: unsupported operand type(s) for -: 'str' and 'str'
The issue is trivial to fix. Before we attempt to do so, let us revisit the function `mutate_row()` and discuss the very concept of mutation.
Mutations exist for good reasons. The most prominent is that they allow in-place operations on a data container without the extra memory (space) overhead otherwise needed to achieve the same objective.
In this case, however, if we accept some extra cost in space, we retain the integrity of the original rows and can carry on with the intended profit computation in a straightforward manner.
'''norse_shop.py'''
# ...
def convert_row(row):
    new_row = []
    for i in range(len(row)):
        new_row.append(str(row[i]))
    return new_row

for row in [row1, row2]:
    new_row = convert_row(row)
    csv_row = ','.join(new_row)
    # compute profit for each row and concatenate to the csv_row
    profit = row[1] - row[2]
    # another way to concatenate strings
    csv_row = ','.join([csv_row, str(profit)])
    print(csv_row)
% python norse_shop.py
poi,revenue,cost,visits,unique_visitors,profit
Yggdrasil,790.2,477.85,53,7,312.35
Valhalla,1700.65,1500,11,10,200.6500000000001
You may argue that we can perform all necessary computations before the mutation for CSV string formation. While that is true, the point of avoiding mutation is about where the responsibility for the integrity of the original data lies. Mutable approaches such as `mutate_row()` give no flexibility and push that responsibility onto their users before the mutation, while immutable ways like `convert_row()` handle it more gracefully and leave users with options afterwards:
# users have a flexible choice with an immutable approach
new_row1 = convert_row(row1) # assign anew
row1 = convert_row(row1) # override the original to emulate mutation if desired
# workaround with a mutable approach
# basically re-implement `convert_row()` itself
new_row1 = []
for i in range(len(row1)):
    new_row1.append(row1[i])
mutate_row(new_row1) # new_row1 is now mutated
The capability that a mutable data type (such as `list`) grants requires greater responsibility from its users. As a matter of convention and etiquette, abstractions involving mutable data types usually carry out immutable operations.
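One reason this responsibility matters is that assigning a list to another name does not copy it. The sketch below reuses `mutate_row()` and `convert_row()` from earlier; the `alias` and `converted` names are introduced only for this illustration.

```python
def mutate_row(row):
    for i in range(len(row)):
        row[i] = str(row[i])

def convert_row(row):
    new_row = []
    for i in range(len(row)):
        new_row.append(str(row[i]))
    return new_row

row1 = ['Yggdrasil', 790.2, 477.85, 53, 7]
alias = row1  # a second name for the same list, not a copy

converted = convert_row(row1)
assert row1[1] == 790.2         # the original is untouched
assert converted[1] == '790.2'

mutate_row(alias)
assert alias is row1
assert row1[1] == '790.2'       # row1 changed through the alias
```

Any code that still expects numbers in `row1` after the `mutate_row(alias)` call would hit the same type error seen earlier.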
The intention of using a `for` loop to generate `new_row` is basically to copy the original list so that any potential mutation conducted on `new_row` does not contaminate the original. Python lists come with a built-in method for such a purpose:
def convert_copy_row(row):
    new_row = row.copy()
    for i in range(len(new_row)):
        new_row[i] = str(new_row[i])
    return new_row
Unlike the `convert_row()` function, where we start with an empty `list` and iteratively populate it with the string version of the `row` items, the `convert_copy_row()` function starts with a shallow copy of `row` and performs in-place mutation on the copy instead of the original. It is, however, only a shallow copy of the immediate items, which means that if any of the items are also mutable data types, they may still suffer from undesired mutations:
a = ['a', [1, 2, 3]]
b = a.copy()
# mutation tests
b[0] = 'b'
assert b[0] == 'b'
assert a[0] == 'a' # list a still intact
b[1][0] = 10
assert b[1][0] == 10
assert a[1][0] == 1 # would raise AssertionError
Traceback (most recent call last):
...
assert a[1][0] == 1
AssertionError
To fix the above, you can iteratively copy the nested list items:
a = ['a', [1, 2, 3]]
# custom deeper copy
b = [] # outer new list
for i in range(len(a)):
    if type(a[i]) is list:
        inner = [] # inner new list
        for ii in range(len(a[i])):
            inner.append(a[i][ii]) # copy each nested item into the new inner list
        b.append(inner)
    else:
        b.append(a[i])
# mutation tests
b[0] = 'b'
assert b[0] == 'b'
assert a[0] == 'a' # list a still intact
b[1][0] = 10
assert b[1][0] == 10
assert a[1][0] == 1
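The hand-written loop above only reaches one level of nesting. As an alternative not covered in the text so far, the standard library's `copy` module provides `copy.deepcopy()`, which copies nested containers recursively. A minimal sketch with the same test data:

```python
import copy

a = ['a', [1, 2, 3]]
b = copy.deepcopy(a)  # recursively copies the nested list as well

# mutation tests
b[0] = 'b'
b[1][0] = 10
assert b == ['b', [10, 2, 3]]
assert a == ['a', [1, 2, 3]]  # list a still intact, including the nested list
assert a[1] is not b[1]       # the inner lists are separate objects
```

The trade-off is the same as before: every nested item is duplicated, so the copy costs more space than a shallow one.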
We should prefer a flat list data structure to avoid undesired mutations on nested mutable data items. Recall "The Zen of Python" (`>>> import this`):
Flat is better than nested.
a = ['a', 1, 2, 3]
b = a.copy()
# mutation tests
b[0] = 'b'
assert b[0] == 'b'
assert a[0] == 'a' # list a still intact
b[1] = 10
assert b[1] == 10
assert a[1] == 1 # list a still intact
In computer science terms, copying a list using a `for` loop is imperative (programmers describe the "how"), whereas the `list.copy()` method represents a declarative way (programmers state the "what").
The imperative techniques are usually more verbose but offer more internal visibility of the mechanism. On the other hand, their declarative counterparts (where applicable) offer simplicity with less control.
The Python language offers a compromise between the two, called comprehensions, that offer greater control than the declarative equivalent while keeping the expressions less verbose than the imperative ways:
a = ['a', 1, 2, 3]
# copy `a` through list comprehension
b = [v for v in a]
This approach leverages the fact that we only intend to utilize the individual values of the items of the original list. With this in mind, the original `norse_shop.py` can be revised as such:
'''norse_shop.py'''
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
row1 = ['Yggdrasil', 790.2, 477.85, 53, 7]
row2 = ['Valhalla', 1700.65, 1500, 11, 10]
header.append('profit')
csv_header = ','.join(header)
print(csv_header)
def get_profit(row):
    return row[1] - row[2]

for row in [row1, row2]:
    # list comprehension to replace `convert_row()`
    new_row = [str(v) for v in row]
    # compute profit
    profit = get_profit(row)
    new_row.append(str(profit))
    # transform to CSV string and print out
    csv_row = ','.join(new_row)
    print(csv_row)
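To illustrate the extra control that comprehensions offer over a plain `list.copy()`, the sketch below transforms and filters items while building the new list. The variable names `as_strings`, `numbers_only`, and `rounded` are made up for this example.

```python
row = ['Yggdrasil', 790.2, 477.85, 53, 7]

# transform every item while copying
as_strings = [str(v) for v in row]

# keep only some items with a trailing `if` filter
numbers_only = [v for v in row if type(v) is not str]

# choose a transformation per item with a conditional expression
rounded = [round(v) if type(v) is float else v for v in row]

assert as_strings == ['Yggdrasil', '790.2', '477.85', '53', '7']
assert numbers_only == [790.2, 477.85, 53, 7]
assert rounded == ['Yggdrasil', 790, 478, 53, 7]
assert row == ['Yggdrasil', 790.2, 477.85, 53, 7]  # the original is never mutated
```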
Like strings, lists can be concatenated too:
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
header = header + ['profit', 'profit_margin', 'avg_revenue', 'avg_visits']
csv_header = ','.join(header)
print(csv_header)
poi,revenue,cost,visits,unique_visitors,profit,profit_margin,avg_revenue,avg_visits
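In keeping with the earlier discussion on mutation, it may help to note that `+` builds a new list, while the `list.extend()` method grows the existing one in place. A short sketch, where `original` and `combined` are names introduced just for the comparison:

```python
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
original = header  # second name for the same list

# concatenation creates a new list and leaves `header` alone
combined = header + ['profit']
assert combined is not header
assert len(header) == 5

# extend() mutates the existing list in place
header.extend(['profit'])
assert header is original
assert header == combined
```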
A sequence can be unpacked by prepending a `*` operator to it within a call context (such as a function call):
'''mad_libs.py'''
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
print('''
Around this {0}
Our {1} is great
While the {2} is minimal
We gather massive {3}
From quite a small number of {4}
'''.format(*header))
% python mad_libs.py
Around this poi
Our revenue is great
While the cost is minimal
We gather massive visits
From quite a small number of unique_visitors
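The same unpacking works with any callable, not only `str.format()`. In the sketch below, `profit_from()` and `figures` are hypothetical names used to show that `*` spreads the list items into separate positional arguments.

```python
def profit_from(revenue, cost):
    # hypothetical helper taking two separate arguments
    return revenue - cost

figures = [790.2, 477.85]

# profit_from(*figures) is the same call as profit_from(790.2, 477.85)
assert profit_from(*figures) == profit_from(figures[0], figures[1])

row1 = ['Yggdrasil', 790.2, 477.85, 53, 7]
# print() accepts any number of arguments and converts each to text itself
print(*row1, sep=',')  # Yggdrasil,790.2,477.85,53,7
```

The number of items must match what the callable accepts, otherwise Python raises a `TypeError`.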
For some more random and comical effect:
'''mad_libs.py'''
import random
header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
random.shuffle(header)
print('''
Around this {0}
Our {1} is great
While the {2} is minimal
We gather massive {3}
From quite a small number of {4}
'''.format(*header))
% python mad_libs.py
Around this revenue
Our visits is great
While the cost is minimal
We gather massive unique_visitors
From quite a small number of poi
Note: `random` is a module in the Python standard library, and `shuffle()` is a function it offers to randomly shuffle a given mutable sequence in-place (by performing a mutation of the original). You can find more about this module in its official documentation.
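If the original order of `header` should be preserved, one option from the same module is `random.sample()`, which returns a new list instead of mutating its input. A sketch, with `shuffled` as an illustrative name:

```python
import random

header = ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']

# sampling all items without replacement yields a shuffled copy
shuffled = random.sample(header, k=len(header))

# the original list keeps its order
assert header == ['poi', 'revenue', 'cost', 'visits', 'unique_visitors']
# the copy holds the same items, in some random order
assert sorted(shuffled) == sorted(header)
```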
Take the final `norse_shop.py` as a base, implement:
- `get_profit_margin()` - to compute the profit margin based on `profit / revenue`
- `get_avg_revenue()` - to compute the average `revenue` by `unique_visitors`
- `get_avg_visits()` - to compute the average `visits` by `unique_visitors`
Then apply these newly implemented functions to enrich the output CSV as such:
poi,revenue,cost,visits,unique_visitors,profit,profit_margin,avg_revenue,avg_visits
Yggdrasil,790.2,477.85,53,7,312.35,0.3952796760313845,112.88571428571429,7.571428571428571
Valhalla,1700.65,1500,11,10,200.6500000000001,0.11798430012054219,170.065,1.1