Install Python Data Science Packages
Python is a high-level, general-purpose programming language with a large collection of data science and machine learning packages. As a first step, install Python on Windows, MacOS, or Linux; the video below walks through the installation.
Install Python Packages
The power of Python is in the packages that are available through either the pip or conda package manager. This page is an overview of some of the best packages for machine learning and data science and how to install them.
We will explore the Python packages that are commonly used for data science and machine learning. You may need to install the packages from the terminal, Anaconda prompt, command prompt, or from the Jupyter Notebook. If you have multiple versions of Python or have specific dependencies then use an environment manager such as pyenv. For most users, a single installation is typically sufficient. The Python package manager pip has all of the packages (such as gekko) that we need for this course. If there is an administrative access error, install to the local profile with the --user flag.
Install Method #1
Run pip install gekko from the terminal, Anaconda prompt, or command prompt, or run !pip install gekko from a Jupyter Notebook cell. Append --user to install to the local profile.
Install Method #2
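Packages can also be installed from within a Python script, although this is not recommended because it relies on pip's internal API, which may change between pip versions.
from pip._internal import main as pipmain  # internal pip entry point, not a public API
pipmain(['install','gekko'])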
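A script can also call pip in a subprocess of the running interpreter, which avoids importing pip directly. The sketch below is one way to do this; the helper names pip_command and install are hypothetical and are not part of pip.

```python
import subprocess
import sys


def pip_command(package, user=False):
    """Build the pip command for the interpreter that runs this script."""
    cmd = [sys.executable, "-m", "pip", "install", package]
    if user:
        cmd.append("--user")  # install to the local profile
    return cmd


def install(package, user=False):
    """Run pip in a subprocess; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_command(package, user=user))


# Show the command without running it
print(" ".join(pip_command("gekko", user=True)))

# Uncomment to install (requires network access)
# install("gekko")
```

Using sys.executable keeps the install tied to the same Python that runs the script, which matters when several Python versions are present.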
List Package Version Numbers
Many of the modules come pre-packaged with distributions such as Anaconda. List the current packages and version numbers with pip list.
Package                            Version
---------------------------------- -------------------
anaconda-client                    1.7.2
anaconda-navigator                 1.10.0
anaconda-project                   0.8.3
beautifulsoup4                     4.9.3
conda                              4.9.2
gekko                              1.0.4
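The same listing can be produced from within Python. This is a minimal sketch that assumes Python 3.8 or later, where importlib.metadata is part of the standard library; the function name installed_packages is hypothetical.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return a sorted list of (name, version) pairs for installed packages."""
    pairs = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            pairs.add((name, dist.version))
    return sorted(pairs, key=lambda p: p[0].lower())


for name, version in installed_packages():
    print(f"{name:34s} {version}")
```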
Additional packages for visualization, data science, and machine learning are listed below.
Beautiful Soup
Beautiful Soup is a Python package for extracting (scraping) information from web pages. It works with an HTML or XML parser (such as lxml) and provides functions for iterating, searching, and modifying the parse tree.
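As a small illustration of searching a parse tree, the sketch below parses an HTML string instead of a live web page. The HTML content is made up for the example, and the built-in html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical page content for illustration
html = """
<html><body>
<h1>Packages</h1>
<ul>
  <li><a href="https://pypi.org/project/gekko/">gekko</a></li>
  <li><a href="https://pypi.org/project/numpy/">numpy</a></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the first h1 tag
title = soup.h1.get_text()

# Map link text to the link address for every anchor tag
links = {a.get_text(): a["href"] for a in soup.find_all("a")}

print(title)
for name, url in links.items():
    print(name, url)
```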
Gekko
Gekko provides an interface to gradient-based solvers for machine learning and optimization of mixed-integer, differential algebraic equations, and time series models. Gekko provides exact first and second derivatives through automatic differentiation and discretization with simultaneous or sequential methods.
Keras
Keras provides a high-level interface for building and training artificial neural networks. Multiple backend packages were supported until version 2.4, after which TensorFlow became the only backend; Keras 3 again supports multiple backends (TensorFlow, JAX, and PyTorch). The backend is installed separately, for example with pip install tensorflow.
Matplotlib
The package matplotlib generates plots in Python.
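A minimal line plot looks like the sketch below. The data and the file name sine.png are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
line, = ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Write the figure to a file in the current directory
fig.savefig("sine.png")
```

In an interactive session, add plt.show() to display the figure in a window or notebook cell.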
NumPy
NumPy is a numerical computing package for mathematics, science, and engineering. Many data science packages use NumPy as a dependency.
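As an example of the array operations that other packages build on, this sketch solves a small linear system. The matrix and right-hand side are arbitrary values chosen for illustration.

```python
import numpy as np

# Linear system A x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)

# The residual should be near zero if the solve succeeded
residual = A @ x - b

print(x)
print(residual)
```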
OpenCV
OpenCV (Open Source Computer Vision Library) is a package for real-time computer vision that was originally developed with support from Intel Research.
Pandas
Pandas visualizes and manipulates data tables. It includes many functions for reading, cleaning, reshaping, and summarizing data, which are the preliminary steps of most data analysis problems.
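The sketch below builds a small table, adds a calculated column, and summarizes it. The column names and temperature values are made up for the example.

```python
import pandas as pd

# Hypothetical temperature data for illustration
data = pd.DataFrame({
    "time": [0, 1, 2, 3],
    "T1": [20.0, 22.5, 25.1, 27.0],
    "T2": [20.0, 21.0, 22.2, 23.1],
})

# New column from existing columns
data["dT"] = data["T1"] - data["T2"]

# Column averages
means = data[["T1", "T2"]].mean()

print(data)
print(means)
```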
Plotly
Plotly renders interactive plots with HTML and JavaScript. Plotly Express is included with Plotly.
PyTorch
PyTorch is a deep learning framework that is widely used for computer vision and natural language processing. It was originally developed by Facebook's AI Research lab (FAIR), now Meta AI, and is governed by the PyTorch Foundation.
Scikit-Learn
Scikit-Learn (or sklearn) includes a wide variety of classification, regression, and clustering algorithms, including neural networks, support vector machines, random forests, gradient boosting, k-means clustering, and other supervised or unsupervised learning methods.
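Most scikit-learn models follow the same fit and score pattern. This sketch trains a random forest classifier on synthetic data; the data set size and model settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Fraction of test samples that are classified correctly
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Other classifiers, such as a support vector machine or neural network, can be swapped in by changing only the line that creates clf.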
SciPy
SciPy is a general-purpose package for mathematics, science, and engineering and extends the base capabilities of NumPy.
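Numerical integration is one example of what SciPy adds on top of NumPy. The sketch below integrates sin(x) from 0 to pi, which has an exact value of 2.

```python
import numpy as np
from scipy.integrate import quad

# quad returns the integral and an estimate of the absolute error
area, abs_error = quad(np.sin, 0.0, np.pi)

print(area, abs_error)
```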
Seaborn
Seaborn is built on matplotlib and produces detailed statistical plots in a few lines of code.
Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and performing statistical tests. It includes descriptive statistics, statistical tests, plotting functions, and result statistics.
Temperature Control Lab
The Temperature Control Lab is used throughout the course for hands-on activities such as the Learn Python and Data Science modules. Data can also be generated from a digital twin simulator if a TCLab device is not connected. Use TCLabModel to generate simulated data wherever TCLab is used to connect Python to the physical lab.
TensorFlow
TensorFlow is an open-source machine learning platform with a particular focus on training and inference of deep neural networks. It was developed by the Google Brain team.
XGBoost
XGBoost is an open-source gradient boosting library for Python and other data science platforms. Unique features include tree penalization, proportional leaf node shrinking, Newton boosting, and scalable computing architectures. It is frequently the tool of choice of winning teams in Kaggle machine learning competitions.