If you want to master data analysis, or just want to work in Python, this is a good time to do it. Python is easy to learn, has extensive and deep community support, and nearly every machine learning framework and data science library offers a Python interface.
Over the past few months, several major data science projects in Python have released new versions with significant updates. Some focus on numerical computation; others make it easier for Pythonistas to write fast code optimized for those tasks.
Python Data Science Basics: SciPy 1.7
Python users who need a fast and powerful math library can use NumPy, but NumPy by itself is not very task-oriented. SciPy builds on NumPy to provide libraries for common mathematical and scientific computing tasks, from linear algebra to statistics to signal processing.
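As a minimal sketch of the kind of task SciPy covers, the snippet below (variable names are my own) solves a small linear system with scipy.linalg, which wraps battle-tested LAPACK routines:

```python
import numpy as np
from scipy import linalg

# Solve the linear system A @ x = b for x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)
print(x)  # [2. 3.]
```

The same scipy.linalg module also offers decompositions (LU, QR, SVD), matrix functions, and more, all operating on plain NumPy arrays.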
How SciPy helps data science
SciPy has long been instrumental in providing practical and widely used tools for math and statistics. But for a long time the project, while maintaining strong backward compatibility between releases, never shipped a proper version 1.0.
According to lead developer Ralf Gommers, the main motivation behind SciPy 1.0 was maturing the project's governance and management processes. But the release also brought continuous integration for the macOS and Windows builds, and proper support for prebuilt Windows binaries. That last feature means Windows users can use SciPy without jumping through extra hoops.
Since SciPy 1.0 arrived in 2017, the project has delivered seven major releases, with many improvements along the way:
- Dropping support for the deprecated Python 2.7, and a subsequent modernization of the codebase.
- Continuous improvements and updates to SciPy's submodules, including more features, better documentation, and many new algorithms, such as a new fast Fourier transform module with better performance and a more modern interface.
- Improved support for functions in LAPACK, a Fortran package for solving common linear algebra problems.
- Better compatibility with PyPy, an alternative Python runtime that includes a JIT compiler to speed up long-running code.
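To illustrate the newer FFT interface mentioned above, here is a small sketch (signal parameters are my own) using scipy.fft to find the dominant frequency of a sampled sine wave:

```python
import numpy as np
from scipy import fft

# Sample a 5 Hz sine wave at 100 Hz for one second
t = np.linspace(0.0, 1.0, 100, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t)

# scipy.fft returns complex coefficients; the magnitude
# spectrum peaks at the bin matching the signal's frequency
spectrum = np.abs(fft.fft(signal))
peak_bin = int(np.argmax(spectrum[:50]))  # positive frequencies only
print(peak_bin)  # 5
```

The scipy.fft module supersedes the older scipy.fftpack interface and accepts a `workers` argument for parallel transforms over multiple dimensions.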
Where to download SciPy
SciPy binaries can be downloaded from the Python Package Index, or installed by typing pip install scipy at the command line. The source code is available on GitHub.
Python Data Science Basics: Numba 0.53.0
Numba allows Python functions or modules to be compiled to machine code via the LLVM compiler framework. The compilation can happen just in time, when a program runs, or ahead of time. In that sense, Numba is similar to Cython, but Cython-accelerated code is easier to distribute to third parties, while Numba is often easier to work with.
How Numba helps data science
The most obvious way Numba helps data scientists is by speeding up operations written in Python. You can prototype a project in pure Python, then annotate it with Numba to make it fast enough for production use.
Numba can also provide faster execution on hardware built for machine learning and data science applications. Earlier versions of Numba supported compiling to CUDA-accelerated code, but recent releases sport a new, far more efficient GPU code reduction algorithm, faster compilation, and support for both Nvidia CUDA and AMD ROCm APIs.
Numba can also optimize JIT-compiled functions for parallel execution across CPU cores whenever possible, although a bit of extra syntax is needed in the code to do it properly.
Where to download Numba
Numba is available on the Python Package Index and can be installed by typing pip install numba at the command line. Prebuilt binaries are available for Windows, macOS, and generic Linux. It is also available as part of the Anaconda Python distribution, where it can be installed by typing conda install numba. The source code is available on GitHub.
Python Data Science Basics: Cython 3.0 (Beta)
Cython converts Python code into C code that can execute orders of magnitude faster. This conversion is most useful for mathematically intensive code or code that runs in tight loops, both of which are common in Python programs written for engineering, science, and machine learning.
How Cython Helps Data Science
Cython code is essentially Python code with additional syntax. Python code can be compiled to C with Cython as-is, but the best performance improvements, on the order of tens to hundreds of times faster, come from using Cython's type annotations.
Before Cython 3, Cython used a 0.xx version numbering scheme. With Cython 3, the language drops support for Python 2 syntax. Cython 3 is still in beta, but its maintainers recommend it over the previous versions. Cython 3 also emphasizes greater use of "pure Python" mode, in which many (though not all) of Cython's features can be used with 100% Python-compatible syntax.
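A sketch of what pure Python mode looks like, assuming the `cython` package is installed (the function is my own example): types are declared with ordinary Python annotations via the `cython` module, so the file also runs unmodified, just slower, under the plain Python interpreter.

```python
import cython

def integrate_x_squared(a: cython.double, b: cython.double,
                        n: cython.int) -> cython.double:
    # Left Riemann sum for the integral of x**2 over [a, b].
    # When compiled with Cython, the typed locals become fast
    # C variables; uncompiled, this is still valid Python.
    dx: cython.double = (b - a) / n
    s: cython.double = 0.0
    i: cython.int
    for i in range(n):
        x: cython.double = a + i * dx
        s += x * x * dx
    return s

print(integrate_x_squared(0.0, 1.0, 100000))  # ~0.3333
```

The same file can be compiled with `cythonize` for the speedup, or shipped as-is as a fallback.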
Cython also integrates with IPython/Jupyter notebooks. Code compiled with Cython can be used in Jupyter notebooks via inline annotations, as if it were any other Python code.
You can also compile Cython modules for Jupyter with profile-guided optimization enabled. Modules built with this option are compiled and optimized based on profiling information generated for them, so they run faster. Note that this option is only available for Cython when used with the GCC compiler; MSVC is not yet supported.
Where to get Cython
Cython is available on the Python Package Index and can be installed with pip install cython at the command line. Binary versions for 32-bit and 64-bit Windows, generic Linux, and macOS are included. The source code is on GitHub. Note that you must have a C compiler on your platform to use Cython.
Python Data Science Basics: Dask 2021.07.0
Processing power is cheaper than ever, but it can be difficult to harness it in the most powerful ways, such as splitting work across multiple CPU cores, physical processors, or compute nodes.
Dask takes Python tasks and schedules them efficiently across multiple systems. And because the syntax used to launch Dask tasks is virtually the same as the syntax used elsewhere in Python, you can take advantage of Dask with little need to rewrite your existing code.
How Dask Helps Data Science
Dask provides its own versions of the interfaces of several popular machine learning and scientific computing libraries in Python. Its DataFrame object works the same as the one in the Pandas library; likewise, its Array object works like NumPy's. So Dask lets you quickly parallelize existing code by changing only a few lines.
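A minimal sketch of that NumPy-mirroring interface (array shape and chunking are my own choices): a Dask array splits its data into chunks, builds a lazy task graph, and executes it only when `.compute()` is called.

```python
import dask.array as da

# A Dask array mirrors the NumPy API but splits the data
# into chunks that can be processed in parallel
x = da.ones((10000, 10000), chunks=(1000, 1000))

# Operations only build a task graph; .compute() runs it
total = (x + x.T).sum().compute()
print(total)  # 200000000.0
```

Swapping `import numpy as np` for `import dask.array as da` and adding the final `.compute()` is often most of the migration work.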
Dask can also be used to parallelize jobs written in pure Python, via an object type called a bag that is suited to operations such as groupby on collections of generic Python objects.
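A small sketch of a bag-based groupby (the data and grouping key are my own illustration): the bag splits a plain Python sequence across partitions and groups its elements by an arbitrary key function.

```python
import dask.bag as db

# A Dask bag parallelizes operations over generic Python objects
b = db.from_sequence(range(10), npartitions=2)

# Group the numbers by parity; groupby shuffles items
# between partitions to bring each group together
groups = dict(b.groupby(lambda n: n % 2).compute())
print(sorted(groups[0]))  # [0, 2, 4, 6, 8]
print(sorted(groups[1]))  # [1, 3, 5, 7, 9]
```

For large datasets, Dask's documentation steers users toward foldby, which combines grouping and reduction without a full shuffle; groupby is the simpler, more general tool.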
Where to download Dask
Dask is available on the Python Package Index and can be installed with pip install dask. It is also available via the Anaconda distribution of Python, by typing conda install dask. The source code is available on GitHub.
Python Data Science Basics: Vaex 4.3.0
Vaex allows users to perform lazy operations on large tabular datasets (essentially data frames, as in NumPy or Pandas). "Large" in this case means billions of rows, with all operations performed as efficiently as possible: zero data copies, minimal memory use, and built-in visualization tools.
How Vaex helps data science
Working with large datasets in Python often demands a great deal of memory or processing power, especially when you only care about a subset of the data, such as a single column in a table. Vaex performs calculations on demand, only when they are actually needed, making the most of the computing resources available.
Where to download Vaex
Vaex is available on the Python Package Index and can be installed with pip install vaex at the command line. For best results, install Vaex in a virtual environment, or use the Anaconda distribution of Python.
Python Data Science Basics: Intel SDC
Intel's Scalable Dataframe Compiler (SDC), formerly the High Performance Analytics Toolkit (HPAT), is an experimental project for accelerating data analytics and machine learning on clusters. It compiles a subset of Python into code that is automatically parallelized across clusters using the mpirun utility from the Open MPI project.
How Intel SDC Helps Data Science
SDC builds on Numba, but unlike that project and Cython, it does not compile Python as-is. Instead, it takes a restricted subset of the Python language, chiefly NumPy arrays and Pandas data frames, optimizes it, and runs it across multiple nodes.
Like Numba, SDC uses a @jit decorator to transform specific functions into their optimized counterparts. It also includes a native I/O module for reading from and writing to HDF5 (not HDFS) files.
Where to download Intel SDC
SDC is available only as source code on GitHub; no binaries are provided.
Copyright © 2021 IDG Communications, Inc.