Performance Optimization¶

1. Performance Measurement Basics¶

Always measure before optimizing. "Don't guess, measure."

timeit Module¶

import timeit

# Measure code as string
time = timeit.timeit(
    'sum(range(1000))',
    number=10000
)
print(f"Execution time: {time:.4f}s")

# Measure function
def my_sum():
    return sum(range(1000))

time = timeit.timeit(my_sum, number=10000)
print(f"Execution time: {time:.4f}s")

In IPython/Jupyter¶

# Measure single line
%timeit sum(range(1000))

# Measure entire cell
%%timeit
total = 0
for i in range(1000):
    total += i

Timing Decorator¶

import time
from functools import wraps

def timing(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__}: {end - start:.4f}s")
        return result
    return wrapper

@timing
def slow_function():
    time.sleep(1)
    return "Done"

slow_function()  # slow_function: 1.0012s

2. Profiling¶

cProfile¶

import cProfile
import pstats

def expensive_function():
    total = 0
    for i in range(10000):
        total += sum(range(100))
    return total

# Profiling
profiler = cProfile.Profile()
profiler.enable()

expensive_function()

profiler.disable()

# Print results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10

Command Line¶

# Basic profiling
python -m cProfile my_script.py

# Sort results
python -m cProfile -s cumulative my_script.py

# Save to file
python -m cProfile -o output.prof my_script.py

Analyzing Profile Results¶

import pstats

# Load saved profile
stats = pstats.Stats('output.prof')

# Sort criteria:
# 'calls': call count
# 'time': internal time
# 'cumulative': cumulative time
stats.sort_stats('cumulative')
stats.print_stats(20)

# Show specific function
stats.print_stats('my_function')

line_profiler (Line-by-Line Analysis)¶

pip install line_profiler

# my_script.py
@profile  # Add decorator
def slow_function():
    result = []
    for i in range(10000):
        result.append(i ** 2)
    return result

slow_function()

kernprof -l -v my_script.py

3. Memory Profiling¶

memory_profiler¶

pip install memory_profiler

from memory_profiler import profile

@profile
def memory_hungry():
    big_list = [i for i in range(1000000)]
    del big_list
    small_list = [i for i in range(1000)]
    return small_list

memory_hungry()

Output:

Line #    Mem usage    Increment  Occurrences   Line Contents
     3     38.5 MiB     38.5 MiB           1   @profile
     4                                         def memory_hungry():
     5     76.8 MiB     38.3 MiB           1       big_list = [i for i in range(1000000)]
     6     38.5 MiB    -38.3 MiB           1       del big_list
     7     38.5 MiB      0.0 MiB           1       small_list = [i for i in range(1000)]
     8     38.5 MiB      0.0 MiB           1       return small_list

sys.getsizeof()¶

import sys

# Check object size
print(sys.getsizeof([]))         # 56 (empty list)
print(sys.getsizeof([1, 2, 3]))  # 88
print(sys.getsizeof({}))         # 64 (empty dict)
print(sys.getsizeof("hello"))    # 54

# Doesn't include nested objects
nested = [[1, 2, 3] for _ in range(10)]
print(sys.getsizeof(nested))  # Only outer list

tracemalloc (Memory Tracking)¶

import tracemalloc

tracemalloc.start()

# Memory-consuming code
big_list = [i ** 2 for i in range(100000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 Memory Usage ]")
for stat in top_stats[:10]:
    print(stat)

tracemalloc.stop()

4. Data Structure Selection¶

List vs Tuple¶

import sys

# Tuples are smaller and faster
lst = [1, 2, 3, 4, 5]
tup = (1, 2, 3, 4, 5)

print(sys.getsizeof(lst))  # 104
print(sys.getsizeof(tup))  # 80

# Creation speed
%timeit [1, 2, 3, 4, 5]  # Slower
%timeit (1, 2, 3, 4, 5)  # Faster (constant)

List vs Set (Membership Testing)¶

import timeit

data_list = list(range(10000))
data_set = set(range(10000))

# List: O(n)
print(timeit.timeit('9999 in data_list', globals=globals(), number=10000))
# ~0.8s

# Set: O(1)
print(timeit.timeit('9999 in data_set', globals=globals(), number=10000))
# ~0.001s

Dictionary vs List (Search)¶

# Search by name
users_list = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    # ... 10000 items
]

users_dict = {
    "Alice": {"age": 30},
    "Bob": {"age": 25},
    # ... 10000 items
}

# List: O(n)
def find_in_list(name):
    for user in users_list:
        if user["name"] == name:
            return user

# Dictionary: O(1)
def find_in_dict(name):
    return users_dict.get(name)

Time Complexity Summary¶

Operation	list	dict	set
Index access	O(1)	-	-
Search (in)	O(n)	O(1)	O(1)
Insert (end)	O(1)	O(1)	O(1)
Insert (middle)	O(n)	-	-
Delete	O(n)	O(1)	O(1)

5. String Optimization¶

String Concatenation¶

# Bad: O(n²)
result = ""
for i in range(10000):
    result += str(i)

# Good: O(n)
result = "".join(str(i) for i in range(10000))

# Better (using list)
parts = []
for i in range(10000):
    parts.append(str(i))
result = "".join(parts)

f-string vs format vs %¶

name = "Alice"
age = 30

# f-string (fastest, Python 3.6+)
s = f"Name: {name}, Age: {age}"

# format (medium)
s = "Name: {}, Age: {}".format(name, age)

# % formatting (slowest)
s = "Name: %s, Age: %d" % (name, age)

String Interning¶

# Python automatically interns short strings
a = "hello"
b = "hello"
print(a is b)  # True

# Long strings
a = "hello world" * 100
b = "hello world" * 100
print(a is b)  # False

# Manual interning
import sys
a = sys.intern("hello world" * 100)
b = sys.intern("hello world" * 100)
print(a is b)  # True

6. Loop Optimization¶

Use Local Variables¶

import math

# Slow: global lookup every time
def slow():
    result = []
    for i in range(10000):
        result.append(math.sqrt(i))
    return result

# Fast: cache as local variable
def fast():
    result = []
    sqrt = math.sqrt  # Local variable
    append = result.append  # Local variable
    for i in range(10000):
        append(sqrt(i))
    return result

List Comprehension¶

# for loop
result = []
for i in range(10000):
    result.append(i ** 2)

# List comprehension (faster)
result = [i ** 2 for i in range(10000)]

# map (similar speed)
result = list(map(lambda x: x ** 2, range(10000)))

Remove Unnecessary Work¶

# Bad: len() called every iteration
for i in range(len(my_list)):
    if i < len(my_list) - 1:
        pass

# Good: precompute
length = len(my_list)
for i in range(length):
    if i < length - 1:
        pass

# Better: use enumerate
for i, item in enumerate(my_list):
    pass

7. Function Optimization¶

Memoization¶

from functools import lru_cache

# Without memoization (slow)
def fib_slow(n):
    if n < 2:
        return n
    return fib_slow(n - 1) + fib_slow(n - 2)

# With memoization (fast)
@lru_cache(maxsize=None)
def fib_fast(n):
    if n < 2:
        return n
    return fib_fast(n - 1) + fib_fast(n - 2)

# fib_slow(35) → several seconds
# fib_fast(35) → instant

Use Generators¶

# Memory intensive
def get_squares_list(n):
    return [i ** 2 for i in range(n)]

# Memory efficient
def get_squares_gen(n):
    for i in range(n):
        yield i ** 2

# Usage
for square in get_squares_gen(1000000):
    if square > 100:
        break

Use Built-in Functions¶

# Custom implementation (slow)
def my_sum(numbers):
    total = 0
    for n in numbers:
        total += n
    return total

# Built-in (fast, implemented in C)
total = sum(numbers)

# Other examples
max_val = max(numbers)      # Maximum
min_val = min(numbers)      # Minimum
sorted_nums = sorted(numbers)  # Sorting
any_positive = any(n > 0 for n in numbers)  # Condition check

8. slots Optimization¶

Reduces memory usage for class instances.

import sys

# Regular class
class PointRegular:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Using __slots__
class PointSlots:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y

# Memory comparison
p1 = PointRegular(1, 2)
p2 = PointSlots(1, 2)

print(sys.getsizeof(p1.__dict__))  # 104 (dict size)
# PointSlots has no __dict__

# Large difference with many instances
regular_points = [PointRegular(i, i) for i in range(100000)]
slots_points = [PointSlots(i, i) for i in range(100000)]

slots Cautions¶

class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)

# Cannot add dynamic attributes
# p.z = 3  # AttributeError

# Careful with inheritance
class Point3D(Point):
    __slots__ = ['z']  # Only define additional slots

    def __init__(self, x, y, z):
        super().__init__(x, y)
        self.z = z

9. Parallel Processing¶

multiprocessing (CPU-bound)¶

from multiprocessing import Pool
import time

def cpu_bound_task(n):
    """CPU-intensive task"""
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    numbers = [10000000] * 4

    # Sequential execution
    start = time.time()
    results = [cpu_bound_task(n) for n in numbers]
    print(f"Sequential: {time.time() - start:.2f}s")

    # Parallel execution
    start = time.time()
    with Pool(4) as pool:
        results = pool.map(cpu_bound_task, numbers)
    print(f"Parallel: {time.time() - start:.2f}s")

concurrent.futures¶

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time

def task(n):
    return sum(range(n))

# ProcessPoolExecutor (CPU-bound)
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(task, [10000000] * 4))

# ThreadPoolExecutor (I/O-bound)
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(task, n) for n in [10000000] * 4]
    results = [f.result() for f in futures]

asyncio (I/O-bound)¶

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ["http://example.com"] * 10

    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return results

# asyncio.run(main())

10. Using NumPy¶

Provides performance improvements for scientific computing.

Basic Comparison¶

import numpy as np

# Pure Python
def python_sum_squares(n):
    return sum(i ** 2 for i in range(n))

# NumPy
def numpy_sum_squares(n):
    arr = np.arange(n)
    return np.sum(arr ** 2)

# NumPy is much faster
%timeit python_sum_squares(1000000)  # ~100ms
%timeit numpy_sum_squares(1000000)   # ~2ms

Vectorized Operations¶

import numpy as np

# Python loop
def normalize_python(data):
    result = []
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    for x in data:
        result.append((x - mean) / std)
    return result

# NumPy vectorization
def normalize_numpy(data):
    arr = np.array(data)
    return (arr - arr.mean()) / arr.std()

11. Cython Introduction¶

Compile Python to C for performance.

Simple Example¶

# fib.pyx
def fib_py(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int i
    for i in range(n):
        a, b = b, a + b
    return a

Build¶

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("fib.pyx")
)

python setup.py build_ext --inplace

12. Optimization Checklist¶

General Principles¶

Measure First: Don't guess, profile
Find Bottlenecks: 20% causes 80% of time
Improve Algorithms: Data structures and algorithms matter most
Maintain Readability: Don't over-optimize

Quick Wins¶

Item	Solution
Membership test	list → set
String concat	+ → join()
Loop	for → comprehension
Function calls	Cache local variables
Repeated calc	@lru_cache
Many objects	slots

Situation-based Choices¶

Situation	Solution
CPU-bound	multiprocessing, NumPy, Cython
I/O-bound	asyncio, threading
Memory shortage	Generators, slots
Repeated calc	Memoization

13. Summary¶

Tool	Purpose
timeit	Execution time measurement
cProfile	Function-level profiling
line_profiler	Line-level profiling
memory_profiler	Memory profiling
tracemalloc	Memory tracking

Optimization Technique	Effect
Appropriate data structure	O(n) → O(1)
List comprehension	Faster than loop
Local variable caching	Reduce lookup cost
Memoization	Eliminate duplicate calculations
slots	Save memory
Parallel processing	Maximize CPU utilization

14. Practice Problems¶

Exercise 1: Profiling¶

Profile slow code and find and improve bottlenecks.

def slow_function(n):
    result = ""
    for i in range(n):
        if i in [j for j in range(i)]:
            result += str(i)
    return result

Exercise 2: Memory Optimization¶

Write a function to efficiently process large CSV files.

Exercise 3: Parallel Processing¶

Parallelize CPU-bound tasks to improve performance.

Conclusion¶

This guide covered Python advanced syntax. Apply to real projects to gain experience!

Learning Completion Checklist¶

[ ] Improve code quality with type hints
[ ] Reuse code with decorators
[ ] Manage resources with context managers
[ ] Memory efficiency with generators
[ ] Encapsulate state with closures
[ ] Customize classes with metaclasses
[ ] Control attributes with descriptors
[ ] Async processing with asyncio
[ ] Apply functional patterns
[ ] Measure and optimize performance