A lesson of advanced python from Juan Rodríguez

The lecture by Juan Luis Cano Rodríguez in Master in Business Analytics and Big Data, 2021-2022, which focused on modern SWE practices using Python and the insight we can gain on how Data Science projects are put into production.
python
pydata
Author
Published

July 8, 2024

Modified

August 1, 2024

This lesson was found on Juan Luis Cano Rodríguez’s github profile when I came across his talk “Building the composable Python data stack with Kedro & Ibis” (bookmarked for learning later) in Pydata London 2024.

This is my notes:

00 intro

learning objectives

  • Learn modern software engineering practices using Python
  • Understand the value of automation in the software engineering process
  • Gain insight into how Data Science projects are put in production
  • Learn better techniques to collaborate in software projects

The instructor used Linux / Conda, err I just have Window here so there is no other way but right I will try to make stuffs work in Window. Let’s get started!

stolen from the lecture

01 git

glossary

  • Repository: Directory tracked by git, contains a .git folder and it’s created by $ git init;
  • Commit: State or snapshot of the repository, they are created by $ git commit;
  • Branch: A parallel or separate line of development, the default one is master and they are created by $ git branch or $ git checkout -b.

git uses a Linux-like cli so now Linux cli 101 (and how to do the same on Window pwsh)

  • whoami: who am i
  • pwd: print working directory
  • ls: list all file in the dir, . for current folder, .. for parent one (seems there is no --color or -a in pwsh)
  • cd: change dir
  • touch: create empty file (in pwsh we use echo "" >> file_name)
  • cat: concatenate, print file contents
  • nano: edit a file from the command line (there is a way that allows us to do the same in win10, but ya, nope)

workflow

I did these in pwsh:

  1. Create a directory mkdir test_project and navigate there cd test_project;
  2. Init a git repository git init;
  3. Check status git status (“on branch master, no commits yet, nothing to commit”);
  4. Create some files echo "#Hello, world!" >> readme.md;
  5. Stage the files git add readme.md;
  6. Commit the changes git commit -m "initial commit";
Warning

Do not run git init on your home directory, as it can lead to confusion and potential data loss. If git status gives a lot of untracked files unrelated to your project, you might want to rm -rf .git and start in another directory. Notice that this command removes all git history.

branching

  1. Create and checkout to new branch git switch -c branch1 (-c stands for create);
  2. Commit there (see above);
  3. Go back to main branch git switch master;
  4. Merge changes git merge branch1;
  5. Delete branch git branch -d branch1 (-d stands for delete, don’t forget this step!).
Tip
  • Normally, the git merge step happens online using pull requests or merge requests, which are not git concepts, but GitHub/GitLab concepts.
  • If git switch does not work for you, you might have an older version of Git. Consider upgrading, or alternatively replace all git switch -c with git checkout -b.

Pull requests (PR) let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch. – GitHub

A merge request (MR) is a proposal to incorporate changes from a source branch to a target branch. – GitLab

Remember to never commit to master. – Git Workflow

merging

2 types of merging:

  • Fast-forward merge: There is no diverging history, and git just “advances the pointer” of the current branch. git merge new-branch --ff-only will fail if a fast-forward merge is not possible;
  • Non fast-forward merge: The history diverged, and git will create a merge commit (hence ask for a commit message) with two parents that combines the two branches. git merge new-branch --no-ff always creates a merge commit even if a fast-forward merge is possible.

GitHub use --no--ff option for pull requests, see here, and this old-but-gold discussion.

  • Non fast-forward merges can end up in conflicts. In that case, git will halt the merge operation and leave traces in the affected files;
  • To abort a merge git merge --abort (useful if we are scared and don’t know what to do);
  • To merge overriding everything with the upcoming branch git merge new-branch --strategy-option theirs;
  • To merge overriding everything with the current branch git merge new-branch --strategy-option ours.

Be careful while editing files that are in conflict.

other

  • Ignoring files .gitignore;
  • Amend the last commit: git commit --amend;
  • Show pretty history: git log --graph --oneline --decorate --all;
  • Configuring git aliases: git config --global alias.lg "log --graph --oneline --decorate" (and now you have git lg!).

This excellent chart will help you in git workflow decision making.

git flowchart, photo credit to this SO thread

triangular workflows in git

When collaborating with a project hosted online on GitHub or GitLab, the most common setup is having a central repository, one remote fork per user, and local clones/checkouts:

triangular workflows in git, source

Notice the different naming conventions between this website and the first image:

  • Convention 1: upstream/origin/local
  • Convention 2: origin//local

We will be consistent with the Aaron Meurer guide and therefore use Convention 2 all the time.

after creating a PR

After your pull request has been merged to master, your local master and <username>/master will be outdated with respect to origin/master. On the other hand, you should avoid working on this branch anymore in the future: remember branches should be ephemeral and short-lived.

To put yourself in a clean state again, you have to:

  1. Click “remove branch” in the pull request (don’t click “remove fork”!);
  2. git checkout master (go back to master);
  3. git fetch origin (never, ever use git pull unless you know exactly what you’re doing)
  4. git merge --ff-only origin master (update your local master with origin/master, and fail if you accidentally made any commit in master)
  5. git fetch -p <username> (✨ acknowledge the removal of the remote branch ✨)
  6. git branch -d old-branch (remove the old branch)
  7. git push <username> master (update your fork with respect to origin)
  8. git checkout -b new-branch (start working in the new feature!)

This process has to be repeated after every pull request.

Some organizations where all the members are trusted do not use forks, and everybody pushes their branches to the same repository instead. While this simplifies some parts of the workflow, it also requires proper checks in place to prevent bad code to be merged - for example, by requiring a minimum number of reviews or some automated status checks.

02 pythonpath

how does import work?

How do import os, pandas work? If pandas was not installed, what happen?

how can I import my code?

There are three ways to import our own code:

  • Being on the same directory: This is the quickest, however it scales quite poorly (imagine having all of pandas and scikit-learn in a single directory to do any data analysis project!)
  • Appending our code location to PYTHONPATH: This is effective, but we will try to avoid it because it can bring problems in the future.
  • Making our code installable: Since any code that’s installed can be imported, this shifts the question to “how to make our code installable”.

my first python lib

We will create a new Python library, “IE Titanic utils”, to analyze the Titanic dataset. I will create a project ie-titanic-utils, add readme.md and .gitignore files.

mkdir it-titanic-utils
git init

echo "# This is utility to help analyze Titanic dataset" >> readme.md
Invoke-WebRequest -Uri "https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks" | Select-Object -ExpandProperty Content | Out-File -FilePath .gitignore -Encoding utf8

git add readme.md .gitignore
git commit -m "initial commit"

Now I will create str_utils.py file (using VS code for convenience), with a function called tokenize that takes a str sentence and splits it into a list of words.

str_utils.py
def tokenize(sentence: str) -> list[str]:
    return sentence.split()

In pwsh, I can import and use this function

python
Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from str_utils import tokenize
>>> tokenize("Toi la Le Khac Tuan")
['Toi', 'la', 'Le', 'Khac', 'Tuan']
>>>

the PYTHONPATH

We saw above that we could easily import our tokenize function. However, this only works if we are in the same directory. Why? Python looks in some predefined locations to know where to find what we want to import, called the “PATH”.

Show the code
import sys
sys.path # i will not execute this code

Therefore, there are two ways of making our code globally importable:

  1. Modify the “PATH”
  2. Put our code inside a location predefined in the “PATH”

The first option can be achieved like this:

Show the code
>>> sys.path.insert(0, "/home/username/ie-titanic-utils")
>>> import str_utils  # Works!

Or, alternatively, from outside of the interpreter: export PYTHONPATH=/home/username/ie-titanic-utils.

However, both are bad practices and should be avoided.

what does import do?

Python code is normally written in .py scripts. These scripts can be imported in the same way that any model or package from the standard library can:

$ python3
>>> import math  # Works, because it's in stdlib
>>> import numpy as np  # Works if you ran `pip install numpy` in advance
>>> import str_utils  # Works if you are in the same directory
['Hello,', 'world!']
>>> 

When the user imports a script, Python runs the script. That’s the way all the possible functions and classes inside it are available.

how to separate “running code” from reusable pieces

A Python module (any .py script) might contain code that we want to run, as well as code that we only want to import. To separate these, we use this trick:

 ie-titanic-utils  get-content -Tail 2 .str_utils.py
if __name__ == "__main__":
    print(tokenize("hello, world!"))

 ie-titanic-utils  python .str_utils.py # The `print` runs
['hello,', 'world!']

 ie-titanic-utils  python
>>> from str_utils import tokenize
>>> tokenize("Hi, world!")
['Hi,', 'world!'] # The `print` doesn't run!
>>>
Tip

Here is what Claude sonnet 3.5 gave me:

To separate code that we want to run directly from code that we only want to import, we should use the if __name__ == "__main__": idiom in Python. This is a common pattern that allows a Python script to be both importable and executable. Here’s an explanation:

  1. When a Python file is run directly, Python sets the special __name__ variable to “__main__”.
  2. When a Python file is imported as a module,__name__is set to the name of the module.

By using this idiom, we can control which code runs when the script is executed directly versus when it’s imported as a module.

03&04 pip vs. conda

managing python environments

stolen from the lecture

Simple is better than complex.

Complex is better than complicated.

How do people install and upgrade Python? -> Most of people choose Python.org, the same for me!

Caution

This way ships a tool to create development environments (venv). However, venv cannot create environments with different Python versions (you’re tied to the one you downloaded) and certain packages will require extra steps to be installed. Therefore, it is not for everyone.

Juan chose to use conda. As I am learning Docker, I choose Docker for this tutorial.

How do people create isolated development environments? -> The most popular is Virtualenv. But normally when doing an analysis task, I use venv which is built-in python lib. Recently I followed my dev team to use pipenv, I do also see poetry is worth-learning approach.

“More than a half of the users of Jupyter Notebook and JupyterLab choose Conda”

As I think a model which is not deployed yet is useless model, I choose VS Code and Docker - more deployment-oriented.

summary

For the user, the most salient distinction is probably this: pip installs python packages within any environment; conda installs any package within conda environments.

—Jake Vanderplas

However, I will be using pipenv to achieve what Juan done in upcoming sections. I will specify version of each package I used.

pip and PyPI

pip is the default Python installer. By default, it fetches packages from https://pypi.org/, which is the community repository for Python packages.

05 layout

here is my project’s layout:

ie-titanic-utils
├─ src
  └─ ie_titanic_utils
     ├─ __init__.py
     └─ ...
├─ tests
  └─ ...
├─ .gitignore
├─ README.md
└─ pyproject.toml
  • The src/package_name contains the source code of the library. package_name must be Python identifier. It should contain a __init__.py that can be empty;
  • The tests directory contains the tests. It must not contain any __init__.py because it’s not meant to be imported as a package. In very specific cases it’s included inside src/package_name;
  • Every project contains a README.md that at least explains what the project is;
  • pyproject.toml contains the metadata of the project. The absolutely required fields are module, author, and some extra information that tells Python how to install the package.

creating a package

  1. run flit init to create the metadata
 ie-titanic-utils  flit init
pyproject.toml exists - overwrite it? [y/N]: y
Module name [ie_titanic_utils]: ie_utils
Author: Tuan Le Khac
Author email: tuan.lekhac0905@gmail.com
Home page: 
Choose a license (see http://choosealicense.com/ for more info)
1. MIT - simple and permissive
2. Apache - explicitly grants patent rights
3. GPL - ensures that code based on this is shared with the same terms
4. Skip - choose a license later
Enter 1-4: 4

Written pyproject.toml; edit that file to add optional extra info.
  1. place some code under the source directory. In __init__.py there must be a docstring giving a description of the project and a __version__ variable indicating the version:
__init__.py
"""IE utils (test package)."""

__version__ = "0.1.0"
  1. Install the code using pip install! (this did not work for me currently. edit: the project name should match the src/package_name omg)
 ie-titanic-utils  pip install .
Processing .ie-titanic-utils
...
Successfully installed ie_titanic_utils-0.1.0
 ie-titanic-utils  python
Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ie_titanic_utils
>>> ie_titanic_utils.__version__
'0.1.0'
>>>    
  1. readme.md and a .gitignore files were created.
  2. commit the change
git checkout -b first_module # i should be using switch (-c) haha, checkout is old syntax
git status
git add .
git commit -m "initial very first module"
git checkout master

 ie-titanic-utils  git merge --no-ff first_module
Merge made by the 'ort' strategy.
 pyproject.toml                    | 30 ++++++++++++++++++++++++++++++
 src/ie_titanic_utils/__init__.py  |  6 ++++++
 src/ie_titanic_utils/str_utils.py |  6 ++++++
 3 files changed, 42 insertions(+)
 create mode 100644 pyproject.toml
 create mode 100644 src/ie_titanic_utils/__init__.py
 create mode 100644 src/ie_titanic_utils/str_utils.py

git branch -d first_module
# git push origin master
# as I will not upload this code to GitHub
# https://stackoverflow.com/questions/32238616/git-push-fatal-origin-does-not-appear-to-be-a-git-repository-fatal-could-n

Now if I log the git, I will see this:

 ie-titanic-utils  git log --graph --oneline --decorate --all
*   39105c7 (HEAD -> master) Merge branch 'first_module'
|
| * 7a33844 initial very first module
|/
* 793bd67 initial commit

Yeah til now I can create a “package” in my computer and install so I can use it globally. But I have not use any env management yet.

intermezzo: version numbers

  • Version numbers for Python packages are explained in PEP 440
  • For libraries, the most widely used convention is semantic versioning: X.Y.Z
    • Z must be incremented if only backwards compatible bug fixes are introduced (a bug fix is defined as an internal change that fixes incorrect behavior)
    • Y must be incremented every time there is new, backwards-compatible functionality
    • X must be incremented every time there are backwards-incompatible changes
  • Between releases, the version should have the .dev0 suffix
  • Recommendation: start with 0.1.dev0 (development version), then make a 0.1.0 release, then progress to 0.1.1 for quick fixes and 0.2.0 for new functionality, and when you want to make a promise of relative stability jump to 1.0.0.
  • For applications, other conventions are more appropriate, like calendar versioning: [YY]YY.MM.??

project requirements

Sometimes our project will depend on third-party libraries (pandas, scikit-learn). To make pip install those dependencies automatically, we can add them to our pyproject.toml under the [tool.flit.metadata] section, using the requires option:

[build-system]
requires = ["flit_core>=3.4"]
build-backend = "flit_core.buildapi"

[project]
name = "ie_titanic_utils"
authors = [{name = "Tuan Le Khac", email = "tuan.lekhac0905@gmail.com"}]
readme = "readme.md"
requires-python = ">=3.11"
dynamic = ["version", "description"]
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
requires = [
    "pandas",
    "matplotlib>=2",
]

We might want to specify optional dependencies that should only be installed upon request, or for some specific purposes. A typical example will be development dependencies: we will need things like pytest and black, but we don’t want the user to install them as part as our library. To do that, we can specify groups of optional dependencies under the tool.flit.metadata.requires-extra section:

In my case, I use `project.optional-dependencies`: flit_core.config.ConfigError: Use [project] table for metadata or [tool.flit.metadata], not both.
[project.optional-dependencies]
dev = [
    "pytest>=6.0",
    "black>=20.8b1",
]

That way, they will only get installed when [dev] is added after the name of our library:

 ie-titanic-utils  pip install .[dev]
# Successfully installed black-24.4.2 ie_titanic_utils-0.1.0 iniconfig-2.0.0 pytest-8.2.2

06 unit test

If you use software that lacks automated tests, you are the tests.

— Jenny Bryan (@JennyBryan) September 22, 2018

Testing is essential. Computers excel at doing repetitive tasks: they basically never make mistakes (the mistake might be in what we told the computer to do). Humans, on the other hand, fail more often, especially under pressure, or on Friday afternoons and Monday mornings. Therefore, instead of letting the humans be the tests, we will use the computer to frequently verify that our software works as specified.

I will be using pytest to achieve this.

test-driven development

The “test-driven development mantra” is Red - Green - Refactor:

Make it work. Make it right. Make it fast.
  1. Write a test. Watch it fail.
  2. Write just enough code to pass the test.
  3. Improve the code without breaking the test.

Repeat.

testing in Python

Summary: use pytest. Everybody does. It rocks.

pytest is a testing framework for Python that makes writing tests extremely easy. It is much more powerful than the standard library equivalent, unittest. We can use by install it first pip install pytest.

We can write a function that test the tokenize funtion:

tests/test_tokenize.py
from ie_titanic_utils import tokenize  # This will fail right away!


def test_tokenize_returns_expected_list():
    sentence = "This is a sentence"
    expected_tokens = ["This", "is", "a", "sentence"]

    tokens = tokenize(sentence)

    assert tokens == expected_tokens

and we run it from the command line:

 ie-titanic-utils  pytest
============================================================================================ test session starts ============================================================================================
platform win32 -- Python 3.11.4, pytest-8.2.2, pluggy-1.5.0
rootdir: .ie-titanic-utils
configfile: pyproject.toml
plugins: anyio-3.7.1, time-machine-2.14.0
collected 1 item

teststest_tokenize.py .                                                                                                                                                                               [100%] 

============================================================================================= 1 passed in 0.02s ============================================================================================= 

The test successed after I fixed the __init__.py:

src/./__init__.py
from ie_titanic_utils.str_utils import tokenize
__all__ = ["tokenize"]

07&08 oop

what are “objects” anyway?

like pandas’s DataFrame or matplotlib’s Figure, an object is sthg that has:

  • object-bound variables: call properties;
  • object-bound functions: call methods.

if the object’s properties can change, we say they have states, in that case they are mutable. otherwise, they are stateless and immutable. a typical example is list (mutable) and tuple (immutable).

Show the code
my_list = [1, 2, 3]
print(my_list)
my_list.append(4)
print(my_list)
[1, 2, 3]
[1, 2, 3, 4]
Show the code
# The operator that creates tuples is not parentheses:
# is the comma!
my_tuple = 1, 2, 3  # Notice that I don't need parentheses!
my_tuple
(1, 2, 3)
Show the code
print(dir(my_tuple))  # Nothing that allows us to change the state of the tuple
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'count', 'index']
Show the code
my_tuple[0] = 99 # This wont work, "TypeError: 'tuple' object does not support item assignment"
Note

Immutable objects have the advantage that they can be hashed, that is: they can be transformed, using some cryptographical function, into something that uniquely represents that object. Mutable objects can’t, because the hash would have to change every time the state of the object changed. Dictionary keys have to be hashable objects.

Show the code
{
    my_tuple: "my_tuple"
}
{(1, 2, 3): 'my_tuple'}
Show the code
hash(my_tuple)
529344067295497451
Show the code
# this wont work: "TypeError: unhashable type: 'list'"
{
    my_list: "my_list"
}
Show the code
# this wont work: "TypeError: unhashable type: 'list'"
hash(my_list)

classes and instances

objects are defined by instantiating a class. a class is a template for objects, where we define it’s behaviours, an instance is a particular realization of that class.

Example

we want model the User of our company’s product, to later study their behaviours:

Show the code
class User:
    pass
type(User)
type

User class is of type type, which means that is can be used to created new objects. Let’s create 2 instances:

Show the code
user1 = User()
user2 = User()
print(user1)
print(user2)
<__main__.User object at 0x000001B1AD8D4F90>
<__main__.User object at 0x000001B1AD8D4D90>

with a slight abuse of notation, we would say we have 2 User objects, or just 2 Users.

Using the instance: self

let’s add a very simple method to demonstrate explicit self, a very important concept. a method is like a function bounded to the object, an can use it’s properties:

Show the code
class User:
    def whoami(self):
        print(f'This is : {self}')
Show the code
user1 = User()
user1.whoami()
This is : <__main__.User object at 0x000001B1AD8D4590>

why are methods (instead of plain functions) interesting? Because of duck typing:

“If it walks like a duck and it quacks like a duck, then it must be a duck” – https://en.wikipedia.org/wiki/Duck_typing

if something has a method that I need, I don’t care about its type.

Show the code
def do_stuff(obj):
    return obj.mean()
Show the code
import numpy as np
import pandas as pd

print(do_stuff(np.arange(5)))
print(do_stuff(pd.Series([0, 1, 2, 3, 4])))
2.0
2.0

notice how we called user1.test() without passing an extra argument? This is because Python is automatically passing the instance. It’s the equivalent of doing this (never do this):

Show the code
User.whoami(user1)
This is : <__main__.User object at 0x000001B1AD8D4590>

in fact, if we define a method without a first parameter, it will fail when we call it:

Show the code
class TestClass:
    def test():
        pass  # Don't do anything

t = TestClass()
t.test()  # Fails: "TypeError: test() takes 0 positional arguments but 1 was given"

this first parameter can be called anything, but everybody uses self. Remember, conventions are important to minimize surprise and enhance collaboration!

intermezzo: f-strings

Show the code
print(f"This is {user1}")  # Python >= 3.6
print("This is {}".format(user1))  # Python < 3.6, equivalent
print(User.whoami(user1))  # DON'T use! (Although it's equivalent)   
# %timeit User.whoami(user1)  # They have about the same performance 
# %timeit user1.whoami()
This is <__main__.User object at 0x000001B1AD8D4590>
This is <__main__.User object at 0x000001B1AD8D4590>
This is : <__main__.User object at 0x000001B1AD8D4590>
None

initializing our instances

The ellipsis (...) is a built-in constant in Python. It’s an instance of the ellipsis (dấu chấm lửng) class.

Show the code
...
Ellipsis

common uses of ellipsis:

  • as a placeholder in function definitions or class bodies
  • in type hinting (especially for variable-length tuples)
  • in slicing operations (especially for multidimensional arrays in libraries like NumPy)
Show the code
# As a placeholder
def function_to_be_implemented_later():
    ...

# In type hinting
from typing import Tuple

def process_points(*points: Tuple[int, ...]) -> None:
    for point in points:
        print(f"Processing point: {point}")

# Usage
process_points((1, 2), (3, 4, 5), (6,))
Processing point: (1, 2)
Processing point: (3, 4, 5)
Processing point: (6,)
Show the code
user1.this_property = ...  # "99", whatever
user1.this_property
Ellipsis

however, this is considered a bad practice, and can confuse editors and static analysis tools. These properties should be specified on creation, in a way that I cannot have a user without name and signup_date. Python provides us a special method, __init__ (this should not be confused with file __init__.py we put to project to tell Python our code is a package), that initializes1 the object:

1 This philosophy used to be summarized by the sentence “we are all consenting adults here”, which is nowadays being less used.

Show the code
class User:
    # "dunder init" = double underscore init
    def __init__(self, name, signup_date):
        self.name = name
        self.signup_date = signup_date
Show the code
import datetime as dt
user1 = User(name="John Doe", signup_date=dt.datetime.now())
print(user1.name, user1.signup_date, sep='\n')
John Doe
2024-08-01 23:07:08.587740

that’s something! However, there are several things we can improve:

  • it can be cumbersome to specify the date every time, and it would be nice to have some default.
  • the default representation of the instances contains some hexadecimal memory address and nothing else. It would be nice to at least see the user name and the signup date
  • nothing stops me from changing the name and signup_date of a existing user:

exercise

  • make signup_date optional by providing a default value;
  • make the __repr__ method return a string containing the name and signup_date, which will override the default.
Show the code
class User:
    def __init__(self, name, signup_date=None):
        if signup_date is None:
            signup_date = dt.datetime.now() # Watch out with default parameters! They are created when the function is defined.

        self.name = name
        self.signup_date = signup_date

    def __repr__(self):
        return f"User(name='{self.name}', signup_date={repr(self.signup_date)})"
Show the code
user1 = User("John Doe")
user1
User(name='John Doe', signup_date=datetime.datetime(2024, 8, 1, 23, 7, 8, 618751))

extra: date formatting

Show the code
dt.datetime.now().isoformat()  # ISO 8601
# If you don't like it, there's http://strftime.org/
'2024-08-01T23:07:08.625045'
Show the code
user1.signup_date.strftime("%Y ::: %d")
'2024 ::: 01'

protecting properties

in Python, there are no private attributes (neither properties nor methods), and in fact everything can be accessed 2. However, we can “hide” them by default in autocomplete and other environments by using a leading underscore _: this is usually called protected variables.

2 This philosophy used to be summarized by the sentence “we are all consenting adults here”, which is nowadays being less used.

there is a common pattern in which, if I want to make some property read-only, we can:

  • make it protected
  • create a “getter” using the @property decorator, which gets the value of the protected property with a public name
Show the code
class User:
    def __init__(self, name, signup_date=None):
        if signup_date is None:
            signup_date = dt.datetime.now()

        self._name = name
        self._signup_date = signup_date

    @property
    def name(self):
        return self._name

    @property
    def signup_date(self):
        return self._signup_date

    def __repr__(self):
        return f"User(name='{self.name}', signup_date='{self.signup_date}')"
Show the code
user1 = User("Tuan Le")
user1.name
'Tuan Le'
Show the code
# this wont work: "AttributeError: can't set attribute"
user1.name = "Tan Le"
Important

If you see tutorials mentioning “true private variables”, they are wrong!

Show the code
class Test:
    def __init__(self, name):
        self.__name = name  # Not what you think!
t1 = Test("This name")
Show the code
# this wont work
t1.__name
Show the code
t1._Test__name  # These are *NOT* "private" properties
'This name'
Note

The behavior you’re experiencing is due to a feature in Python called “name mangling” for private attributes. Let’s break down what’s happening:

Double underscore prefix:

  • When you define an attribute with a double underscore prefix (__) in a class, Python automatically mangles the name to avoid naming conflicts in inherited classes.
  • Name mangling process: Python changes the name from __name to _ClassName__name. In your case, it becomes _Test__name.
  • Accessing the attribute: Because of this name mangling, you can’t access t1.__name directly. Instead, you would need to use the mangled name.
Show the code
class Test:
    def __init__(self, name):
        self.__name = name  # This gets mangled

t1 = Test("This name")

# This will raise an AttributeError
try:
    print(t1.__name)
except AttributeError as e:
    print(f"AttributeError: {e}")

# This will work
print(t1._Test__name)

# You can see all attributes, including mangled ones
print(dir(t1))
AttributeError: 'Test' object has no attribute '__name'
This name
['_Test__name', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']

inheritance

Show the code
class SpecialUser(User):
    def __init__(self, name, age, signup_date=None):
        # Initializes self._name and self._signup_date
        super().__init__(name, signup_date)

        self._age = age

    @property
    def age(self):
        return self._age

    def greet(self):
        print(f"Hi! I'm {self.name}")
Show the code
s_user1 = SpecialUser("John Doe", 27)
#s_user1
Show the code
s_user1.name
'John Doe'
Show the code
s_user1.greet()
Hi! I'm John Doe

more special methods

go search for python data model

09&10 flask

  • flask is a “lightweight framework” or a “microframework”. “It is designed to make getting started quick and easy, with the ability to scale up to complex applications”, and therefore requires less context to get started with;
  • django is a “batteries included” framework that is more focused on good practices and encouraging a “clean, pragmatic design”. It’s more complex and requires more experience to master.

conclusion

where do we go from here

  • Keep improving the art of Python packaging
  • Explore other options for high performance Python
  • Help “bridging the gap”
  • Engage with the (open source) Python community

keep improving the art of python packaging

  • recommend: pip-tools + requirements.in = requirements.txt
  • Poetry, Pipenv… Yes, they work, but they are way more complex and have “lock-in”
  • All companies I worked for struggle sooner or later with their code deployment practices. Now, you know better

explore other options for high performance python

  • Juan mentioned: Numba, Modin, Vaex, Dask, Spark, Coiled, Prefect
  • Now we have more: polar, pandas 2.0, aiflow, dagster, dbt, etc

stolen from the lecture

other resources

  1. NBViewer for this lecture;
  2. PyData, a community for developers and users of open source data tools;
  3. Pro Git, --distributed-is-the-new-centralized;
  4. Pull Requests;
  5. Merge Requests;
  6. Git Workflow, the git workflow for contributing to open source repositories;
  7. Git Ignore builder;
  8. Git commit –amend, rewriting history;
  9. The right way to distribute Python code
  10. Python packaging
  11. Pytest.