From data wrangling and statistical analysis to machine learning pipelines — build production-ready data science skills with Python's most powerful libraries.
Data science sits at the intersection of statistics, programming, and domain expertise. Python has emerged as the dominant language in this field thanks to its readable syntax, vast ecosystem of scientific libraries, and strong community support. This course takes you on a structured journey from Python fundamentals through advanced machine learning, culminating in a capstone project where you build a complete end-to-end pipeline.
Across ten modules, you will work with real-world datasets to develop practical skills in data acquisition, cleaning, exploration, modelling, and evaluation. Each module pairs conceptual instruction with hands-on coding labs completed in Jupyter Notebooks, giving you immediate feedback as you write and refine your code. The emphasis throughout is on understanding not just how to use tools, but when and why to apply specific techniques.
The self-paced format gives you flexibility, while a structured progression ensures that each module builds logically on the last. By the end, you will have a portfolio of completed projects and a capstone pipeline that demonstrates your ability to solve data problems from end to end.
The best data scientists are not those who know the most algorithms — they are the ones who ask the right questions of their data and communicate the answers with clarity.
This course is built for professionals who want to develop rigorous, practical data science skills using Python. It is particularly well-suited for:
Establish a strong Python foundation tailored specifically for data work. This module covers data types, control flow, functions, list comprehensions, and file I/O with a focus on patterns you will use repeatedly in data science contexts. You will also set up your development environment with Anaconda, Jupyter Notebooks, and virtual environments, ensuring a reproducible workflow from the start.
Labs include writing utility functions for data parsing, working with CSV and JSON files, and building a simple command-line data summarisation tool that reads, processes, and reports on a dataset.
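To give a feel for these labs, here is a minimal sketch of such a summarisation utility using only the standard library; the file name and column name in the usage comment are hypothetical placeholders.

```python
import csv

def summarise_csv(path, numeric_column):
    """Report row count, min, max, and mean for one numeric column of a CSV."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                values.append(float(row[numeric_column]))
            except (KeyError, ValueError):
                continue  # skip rows where the column is missing or non-numeric
    if not values:
        return "no numeric data found"
    return (f"rows={len(values)}, min={min(values):.2f}, "
            f"max={max(values):.2f}, mean={sum(values) / len(values):.2f}")

# Hypothetical usage: a file sales.csv containing a numeric 'price' column.
# print(summarise_csv("sales.csv", "price"))
```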
NumPy and Pandas form the backbone of Python data science. This module covers NumPy arrays, broadcasting, vectorised operations, and random number generation before moving into Pandas DataFrames, Series, indexing, grouping, merging, and reshaping. You will learn how to load data from multiple sources — CSV, Excel, SQL databases, and APIs — and transform it into analysis-ready structures.
Through hands-on exercises, you will wrangle a messy real-world dataset, combining multiple tables, handling mixed data types, and creating derived features that prepare the data for downstream analysis.
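As a small illustration of the merge-then-aggregate pattern covered in this module, the sketch below joins two toy tables and derives per-group features; the table and column names are invented for the example.

```python
import pandas as pd

# Two small example tables standing in for the multi-table lab data.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [25.0, 40.0, 15.5, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["north", "south", "north"],
})

# Merge the tables, then group and aggregate to create derived features.
merged = orders.merge(customers, on="customer_id", how="left")
per_region = (merged.groupby("region")["amount"]
                    .agg(total="sum", average="mean")
                    .reset_index())
print(per_region)
```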
Raw data is rarely ready for analysis. This module tackles the unglamorous but essential work of data cleaning — handling missing values, detecting and treating outliers, resolving inconsistent formatting, deduplicating records, and encoding categorical variables. You will explore multiple imputation strategies, understand when to drop versus fill missing data, and learn how preprocessing choices affect downstream model performance.
The lab presents you with a deliberately messy dataset containing every common data quality issue. You will build a reproducible cleaning pipeline that transforms it into a clean, documented dataset ready for modelling.
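A compact sketch of what such a reproducible cleaning pipeline can look like in Pandas is shown below; the columns, imputation choices, and outlier threshold are illustrative assumptions, not the lab's prescribed solution.

```python
import numpy as np
import pandas as pd

def clean(df):
    """A cleaning pipeline sketch: each chained step returns a new frame."""
    return (df
            .drop_duplicates()  # remove exact duplicate records
            .assign(
                # impute missing ages with the median (one of several strategies)
                age=lambda d: d["age"].fillna(d["age"].median()),
                # cap extreme incomes at the 99th percentile to tame outliers
                income=lambda d: d["income"].clip(upper=d["income"].quantile(0.99)))
            .pipe(pd.get_dummies, columns=["city"]))  # encode the categorical column

raw = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan],
    "income": [52_000, 48_000, 120_000, 48_000],
    "city": ["leeds", "york", "leeds", "york"],
})
print(clean(raw))
```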
Before building models, you must understand your data. This module covers the principles of exploratory data analysis — summary statistics, distribution analysis, correlation matrices, and hypothesis generation. On the visualisation side, you will work with Matplotlib for foundational plotting, Seaborn for statistical graphics, and Plotly for interactive visualisations.
You will learn how to choose the right chart type for your data and audience, create multi-panel figures that tell a cohesive story, and apply visual design principles that make your findings clear and compelling. Labs include building an EDA report for a multi-dimensional dataset with narrative annotations.
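The sketch below shows the kind of multi-panel figure you will build, using Seaborn's bundled 'tips' demo dataset (fetched over the network on first use) purely for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships small demo datasets; 'tips' stands in for your own data here.
tips = sns.load_dataset("tips")

# A two-panel figure: a distribution plot and a relationship plot side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(tips["total_bill"], kde=True, ax=ax1)
ax1.set_title("Distribution of total bill")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=ax2)
ax2.set_title("Tip vs. total bill")
fig.tight_layout()
plt.show()
```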
Statistical rigour separates data science from data guessing. This module covers probability distributions, hypothesis testing (t-tests, chi-squared, ANOVA), confidence intervals, A/B testing methodology, and correlation versus causation. You will use SciPy's statistical functions alongside statsmodels for regression analysis and diagnostic testing.
Practical exercises include designing and analysing a simulated A/B test, performing multi-group comparisons on experimental data, and interpreting results with appropriate caveats about statistical significance versus practical significance.
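As a taste of the A/B testing lab, here is a minimal simulation analysed with Welch's t-test; the group sizes and effect size are arbitrary choices for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated A/B test: a continuous metric for control and variant groups.
control = rng.normal(loc=10.0, scale=2.0, size=500)
variant = rng.normal(loc=10.3, scale=2.0, size=500)

# Welch's t-test does not assume equal variances between groups.
t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)

# A small p-value says the difference is unlikely under the null hypothesis;
# whether a 0.3-unit lift matters is a practical, not statistical, question.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, "
      f"observed lift = {variant.mean() - control.mean():.3f}")
```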
This module bridges the gap between statistics and machine learning. You will learn the Scikit-Learn API, understand the train-test split paradigm, and explore the bias-variance trade-off that governs all predictive modelling. Topics include feature scaling, cross-validation, and the importance of avoiding data leakage during preprocessing.
By the end of this module, you will be able to frame business problems as machine learning tasks, select appropriate algorithms, and evaluate models using metrics that align with your objectives.
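The sketch below illustrates these ideas with Scikit-Learn: placing the scaler inside a pipeline keeps preprocessing within each cross-validation fold, which avoids data leakage. The bundled dataset and estimator are stand-ins for whatever your problem calls for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is fitted only on the training portion of each CV fold,
# so no information from held-out data leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
model.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f}, "
      f"test accuracy: {model.score(X_test, y_test):.3f}")
```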
Dive deep into supervised learning with comprehensive coverage of both regression and classification algorithms. For regression, you will implement linear regression, polynomial regression, ridge, lasso, and elastic net regularisation. For classification, you will work with logistic regression, decision trees, random forests, gradient boosting (XGBoost), and support vector machines.
Each algorithm is presented with its mathematical intuition, practical implementation, and guidance on when it excels versus when alternatives are preferable. Labs have you build and compare multiple models on the same dataset, interpreting coefficient values, feature importances, and decision boundaries.
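A minimal sketch of that compare-on-equal-footing workflow, using Scikit-Learn's bundled diabetes dataset as a stand-in for the lab data:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Evaluate several regressors with the same data and the same CV protocol,
# so differences in scores reflect the models rather than the setup.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10_000),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>14}: mean R^2 = {scores.mean():.3f}")
```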
Not all problems come with labelled data. This module covers unsupervised techniques including k-means clustering, hierarchical clustering, DBSCAN, and Gaussian mixture models. You will also explore dimensionality reduction with PCA and t-SNE, learning how to visualise high-dimensional data and identify natural groupings.
Application-focused labs include customer segmentation on transaction data, anomaly detection in sensor readings, and topic discovery in a text corpus using latent semantic analysis.
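As an illustration of combining dimensionality reduction with clustering, the sketch below scales the classic iris measurements, projects them with PCA, and clusters with k-means; the choice of three clusters is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels ignored: unsupervised setting

# Scale the features, project onto two principal components, then cluster.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

print(f"variance explained by 2 components: "
      f"{pca.explained_variance_ratio_.sum():.3f}")
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```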
A model is only as good as its evaluation. This module covers precision, recall, F1-score, ROC-AUC, and calibration for classification; RMSE, MAE, and R-squared for regression. You will learn how to perform hyperparameter tuning with grid search, random search, and Bayesian optimisation, and how to use learning curves to diagnose underfitting and overfitting.
Advanced topics include ensemble methods, stacking, and model interpretability tools such as SHAP values and partial dependence plots. The lab challenges you to take an underperforming model and systematically improve it through feature engineering, algorithm selection, and hyperparameter optimisation.
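Here is a minimal grid search sketch in the spirit of this module, scored by ROC-AUC; the estimator and parameter grid are illustrative choices rather than a recommended recipe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Exhaustive search over a small hyperparameter grid, scored by ROC-AUC.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print(f"test ROC-AUC: {roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1]):.3f}")
print(classification_report(y_test, grid.predict(X_test)))
```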
Everything comes together in the capstone. You will select a dataset, define a problem statement, and build a complete machine learning pipeline from data ingestion through model deployment. The pipeline must include automated data validation, feature engineering, model training with cross-validation, evaluation against baseline metrics, and a simple API endpoint for serving predictions.
You will document your work in a structured notebook that follows industry best practices for reproducibility, and present your findings in a format suitable for both technical and non-technical stakeholders. This capstone project becomes a portfolio piece that demonstrates your end-to-end capabilities.
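Serving layers vary, but as one possible sketch of the simple API endpoint requirement, the example below wraps a saved Scikit-Learn model in a Flask route; the model file name, route, and input format are hypothetical placeholders.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical: a fitted scikit-learn pipeline previously saved with joblib.
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = np.asarray(request.get_json()["features"])
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```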
Upon completing this course, you will be able to:
Participants should bring the following to the course:
Throughout this course, you will gain hands-on experience with the following tools and libraries: Anaconda and Jupyter Notebooks for a reproducible development environment; NumPy, Pandas, SciPy, and statsmodels for computation and statistics; Matplotlib, Seaborn, and Plotly for visualisation; and Scikit-Learn with XGBoost for machine learning.