From data wrangling and statistical analysis to machine learning pipelines — build production-ready data science skills with Python's most powerful libraries.
Data science sits at the intersection of statistics, programming, and domain expertise. Python has emerged as the dominant language in this field thanks to its readable syntax, vast ecosystem of scientific libraries, and strong community support. This course takes you on a structured journey from Python fundamentals through advanced machine learning, culminating in a capstone project where you build a complete end-to-end pipeline.
Across ten modules, you will work with real-world datasets to develop practical skills in data acquisition, cleaning, exploration, modelling, and evaluation. Each module pairs conceptual instruction with hands-on coding labs completed in Jupyter Notebooks, giving you immediate feedback as you write and refine your code. The emphasis throughout is on understanding not just how to use tools, but when and why to apply specific techniques.
The self-paced format gives you flexibility, while a structured progression ensures that each module builds logically on the last. By the end, you will have a portfolio of completed projects and a capstone pipeline that demonstrates your ability to solve data problems from end to end.
The best data scientists are not those who know the most algorithms — they are the ones who ask the right questions of their data and communicate the answers with clarity.
This course is built for professionals who want to develop rigorous, practical data science skills using Python. It is particularly well-suited for:
Establish a strong Python foundation tailored specifically for data work. This module covers data types, control flow, functions, list comprehensions, and file I/O with a focus on patterns you will use repeatedly in data science contexts. You will also set up your development environment with Anaconda, Jupyter Notebooks, and virtual environments, ensuring a reproducible workflow from the start.
Labs include writing utility functions for data parsing, working with CSV and JSON files, and building a simple command-line data summarisation tool that reads, processes, and reports on a dataset.
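To give a feel for these labs, here is a minimal sketch of such a summarisation utility using only the standard library; the file name and column name in the usage comment are hypothetical placeholders.

```python
import csv

def summarise_csv(path, numeric_column):
    """Report row count, min, max, and mean for one numeric column of a CSV."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                values.append(float(row[numeric_column]))
            except (KeyError, ValueError):
                continue  # skip rows where the column is missing or non-numeric
    if not values:
        return "no numeric data found"
    return (f"rows={len(values)}, min={min(values):.2f}, "
            f"max={max(values):.2f}, mean={sum(values) / len(values):.2f}")

# Hypothetical usage: a file sales.csv containing a numeric 'price' column.
# print(summarise_csv("sales.csv", "price"))
```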
NumPy and Pandas form the backbone of Python data science. This module covers NumPy arrays, broadcasting, vectorised operations, and random number generation before moving into Pandas DataFrames, Series, indexing, grouping, merging, and reshaping. You will learn how to load data from multiple sources — CSV, Excel, SQL databases, and APIs — and transform it into analysis-ready structures.
Through hands-on exercises, you will wrangle a messy real-world dataset, combining multiple tables, handling mixed data types, and creating derived features that prepare the data for downstream analysis.
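As a small illustration of the merge-then-aggregate pattern covered in this module, the sketch below joins two toy tables and derives per-group features; the table and column names are invented for the example.

```python
import pandas as pd

# Two small example tables standing in for the multi-table lab data.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, 40.0, 15.5, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["north", "south", "north"],
})

# Merge the tables, then group and aggregate to create derived features.
merged = orders.merge(customers, on="customer_id", how="left")
per_region = (merged.groupby("region")["amount"]
                    .agg(total="sum", average="mean")
                    .reset_index())
print(per_region)
```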
Raw data is rarely ready for analysis. This module tackles the unglamorous but essential work of data cleaning — handling missing values, detecting and treating outliers, resolving inconsistent formatting, deduplicating records, and encoding categorical variables. You will explore multiple imputation strategies, understand when to drop versus fill missing data, and learn how preprocessing choices affect downstream model performance.
The lab presents you with a deliberately messy dataset containing every common data quality issue. You will build a reproducible cleaning pipeline that transforms it into a clean, documented dataset ready for modelling.
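A compact sketch of what such a reproducible cleaning pipeline can look like in Pandas is shown below; the columns, imputation choices, and outlier threshold are illustrative assumptions, not the lab's prescribed solution.

```python
import numpy as np
import pandas as pd

def clean(df):
    """A cleaning pipeline sketch: each chained step returns a new frame."""
    return (df
            .drop_duplicates()  # remove exact duplicate records
            .assign(
                # impute missing ages with the median (one of several strategies)
                age=lambda d: d["age"].fillna(d["age"].median()),
                # cap extreme incomes at the 99th percentile to tame outliers
                income=lambda d: d["income"].clip(upper=d["income"].quantile(0.99)))
            .pipe(pd.get_dummies, columns=["city"]))  # encode the categorical column

raw = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "income": [52_000, 48_000, 120_000, 48_000],
    "city": ["leeds", "york", "leeds", "york"],
})
print(clean(raw))
```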
Before building models, you must understand your data. This module covers the principles of exploratory data analysis — summary statistics, distribution analysis, correlation matrices, and hypothesis generation. On the visualisation side, you will work with Matplotlib for foundational plotting, Seaborn for statistical graphics, and Plotly for interactive visualisations.
You will learn how to choose the right chart type for your data and audience, create multi-panel figures that tell a cohesive story, and apply visual design principles that make your findings clear and compelling. Labs include building an EDA report for a multi-dimensional dataset with narrative annotations.
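The sketch below shows the kind of multi-panel figure you will build, using Seaborn's bundled 'tips' demo dataset (fetched over the network on first use) purely for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships small demo datasets; 'tips' stands in for your own data here.
tips = sns.load_dataset("tips")

# A two-panel figure: a distribution plot and a relationship plot side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(tips["total_bill"], kde=True, ax=ax1)
ax1.set_title("Distribution of total bill")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=ax2)
ax2.set_title("Tip vs. total bill")
fig.tight_layout()
plt.show()
```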
Statistical rigour separates data science from data guessing. This module covers probability distributions, hypothesis testing (t-tests, chi-squared, ANOVA), confidence intervals, A/B testing methodology, and correlation versus causation. You will use SciPy's statistical functions alongside statsmodels for regression analysis and diagnostic testing.
Practical exercises include designing and analysing a simulated A/B test, performing multi-group comparisons on experimental data, and interpreting results with appropriate caveats about statistical significance versus practical significance.
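As a taste of the A/B testing lab, here is a minimal simulation analysed with Welch's t-test; the group sizes and effect size are arbitrary choices for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated A/B test: a continuous metric for control and variant groups.
control = rng.normal(loc=10.0, scale=2.0, size=500)
variant = rng.normal(loc=10.3, scale=2.0, size=500)

# Welch's t-test does not assume equal variances between groups.
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)

# A small p-value says the difference is unlikely under the null hypothesis;
# whether a 0.3-unit lift matters is a practical, not statistical, question.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, "
      f"observed lift = {variant.mean() - control.mean():.3f}")
```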
This module bridges the gap between statistics and machine learning. You will learn the Scikit-Learn API, understand the train-test split paradigm, and explore the bias-variance trade-off that governs all predictive modelling. Topics include feature scaling, cross-validation, and the importance of avoiding data leakage during preprocessing.
By the end of this module, you will be able to frame business problems as machine learning tasks, select appropriate algorithms, and evaluate models using metrics that align with your objectives.
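The sketch below illustrates these ideas with Scikit-Learn: placing the scaler inside a pipeline keeps preprocessing within each cross-validation fold, which avoids data leakage. The bundled dataset and estimator are stand-ins for whatever your problem calls for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is fitted only on the training portion of each CV fold,
# so no information from held-out data leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f}, "
      f"test accuracy: {model.score(X_test, y_test):.3f}")
```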
Dive deep into supervised learning with comprehensive coverage of both regression and classification algorithms. For regression, you will implement linear regression, polynomial regression, ridge, lasso, and elastic net regularisation. For classification, you will work with logistic regression, decision trees, random forests, gradient boosting (XGBoost), and support vector machines.
Each algorithm is presented with its mathematical intuition, practical implementation, and guidance on when it excels versus when alternatives are preferable. Labs have you build and compare multiple models on the same dataset, interpreting coefficient values, feature importances, and decision boundaries.
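A minimal sketch of that compare-on-equal-footing workflow, using Scikit-Learn's bundled diabetes dataset as a stand-in for the lab data:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Evaluate several regressors with the same data and the same CV protocol,
# so differences in scores reflect the models rather than the setup.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10_000),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>14}: mean R^2 = {scores.mean():.3f}")
```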
Not all problems come with labelled data. This module covers unsupervised techniques including k-means clustering, hierarchical clustering, DBSCAN, and Gaussian mixture models. You will also explore dimensionality reduction with PCA and t-SNE, learning how to visualise high-dimensional data and identify natural groupings.
Application-focused labs include customer segmentation on transaction data, anomaly detection in sensor readings, and topic discovery in a text corpus using latent semantic analysis.
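As an illustration of combining dimensionality reduction with clustering, the sketch below scales the classic iris measurements, projects them with PCA, and clusters with k-means; the choice of three clusters is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels ignored: unsupervised setting

# Scale the features, project onto two principal components, then cluster.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

print(f"variance explained by 2 components: "
      f"{pca.explained_variance_ratio_.sum():.3f}")
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```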
A model is only as good as its evaluation. This module covers precision, recall, F1-score, ROC-AUC, and calibration for classification; RMSE, MAE, and R-squared for regression. You will learn how to perform hyperparameter tuning with grid search, random search, and Bayesian optimisation, and how to use learning curves to diagnose underfitting and overfitting.
Advanced topics include ensemble methods, stacking, and model interpretability tools such as SHAP values and partial dependence plots. The lab challenges you to take an underperforming model and systematically improve it through feature engineering, algorithm selection, and hyperparameter optimisation.
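Here is a minimal grid search sketch in the spirit of this module, scored by ROC-AUC; the estimator and parameter grid are illustrative choices rather than a recommended recipe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Exhaustive search over a small hyperparameter grid, scored by ROC-AUC.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print(f"test ROC-AUC: {roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]):.3f}")
print(classification_report(y_test, grid.predict(X_test)))
```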
Everything comes together in the capstone. You will select a dataset, define a problem statement, and build a complete machine learning pipeline from data ingestion through model deployment. The pipeline must include automated data validation, feature engineering, model training with cross-validation, evaluation against baseline metrics, and a simple API endpoint for serving predictions.
You will document your work in a structured notebook that follows industry best practices for reproducibility, and present your findings in a format suitable for both technical and non-technical stakeholders. This capstone project becomes a portfolio piece that demonstrates your end-to-end capabilities.
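Serving layers vary, but as one possible sketch of the simple API endpoint requirement, the example below wraps a saved Scikit-Learn model in a Flask route; the model file name, route, and input format are hypothetical placeholders.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical: a fitted scikit-learn pipeline previously saved with joblib.
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = np.asarray(request.get_json()["features"])
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```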
Upon completing this course, you will be able to:
Participants should bring the following to the course:
Throughout this course, you will gain hands-on experience with the following tools and libraries: Anaconda and Jupyter Notebooks for a reproducible development environment; NumPy, Pandas, SciPy, and statsmodels for computation and statistics; Matplotlib, Seaborn, and Plotly for visualisation; and Scikit-Learn with XGBoost for machine learning.