Scalable Scientific Computing using Dask

Pandas and NumPy are great tools to dig through data, do analysis, and train machine learning models. They provide intuitive APIs and superb performance. Sadly, they are both restricted to the main memory of a single machine and, for the most part, to a single CPU. Dask is a flexible tool for parallelizing NumPy and Pandas code on a single machine or a cluster.

Tags: Algorithms, Big Data, Data Science, Parallel Programming, Python

Scheduled on Wednesday at 14:00 in room Lounge

Speaker

Uwe L. Korn (@xhochy)

Uwe Korn is a Senior Data Scientist at the German RetailTec company Blue Yonder. His expertise is in building scalable architectures for machine learning services. Nowadays he focuses on the data engineering infrastructure that is needed to provide the building blocks to bring machine learning models into production. As part of his work to provide efficient data interchange, he became a core committer to the Apache Parquet and Apache Arrow projects.

Description

Pandas and NumPy are great tools to dig through data, do analysis, and train machine learning models. They provide intuitive APIs and superb performance. Sadly, they are both restricted to the main memory of a single machine and, for the most part, to a single CPU. Once our code reaches these boundaries, we can use Dask to scale it to multiple CPUs or even across a cluster.
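As a rough sketch of what this scaling step can look like (it requires the optional distributed package, and the scheduler address shown in the comment is only a placeholder), the same Dask code can be driven either by a pool of local worker processes or by a remote cluster:

    from dask.distributed import Client

    # Start a local "cluster" that uses the cores of this machine;
    # subsequent Dask computations are executed by these workers.
    client = Client()

    # Alternatively, connect to an existing cluster by pointing the
    # client at its scheduler (the address below is a placeholder).
    # client = Client("tcp://scheduler-host:8786")

    # Shows the number of workers, threads and memory available.
    print(client)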

Dask provides high-level Array, Bag, and DataFrame collections that mimic the APIs of NumPy, Python lists, and Pandas but operate in parallel on data that does not need to fit into main memory. At the low level, Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also drive custom, user-defined workloads.
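A small, self-contained sketch of how these collections mirror the familiar APIs and how the low-level scheduler handles custom work (the column names and toy computations are made up for illustration): operations only build a task graph, and nothing runs until .compute() is called.

    import numpy as np
    import pandas as pd
    import dask.array as da
    import dask.dataframe as dd
    from dask import delayed

    # dask.array: a NumPy-like array split into 1000x1000 chunks.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    row_means = (x + x.T).mean(axis=0)
    print(row_means.compute()[:5])

    # dask.dataframe: a Pandas-like DataFrame; built from an in-memory
    # Pandas frame here for illustration, in practice typically read
    # from many files, e.g. dd.read_csv("events-*.csv").
    pdf = pd.DataFrame({"group": np.random.choice(["a", "b"], size=1_000),
                        "value": np.random.random(1_000)})
    df = dd.from_pandas(pdf, npartitions=4)
    print(df.groupby("group").value.mean().compute())

    # dask.delayed: the task scheduler can also run custom graphs,
    # e.g. independent function calls combined into one result.
    @delayed
    def inc(i):
        return i + 1

    total = delayed(sum)([inc(i) for i in range(3)])
    print(total.compute())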

In the workshop, we want to show how to turn typical Pandas and NumPy code into parallel/distributed code using dask.array and dask.dataframe. We will highlight things that can easily be transformed into Dask code and other things that need a bit more thought. In addition, we will show the utilities Dask provides for inspecting the execution graphs and the behaviour of our distributed code.
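A brief sketch of the kind of inspection utilities meant here, using a toy computation on the local scheduler (graph rendering needs the optional graphviz dependency, the resource profiler needs psutil, and the HTML report needs bokeh):

    import dask.array as da
    from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize

    x = da.random.random((2_000, 2_000), chunks=(1_000, 1_000))
    y = (x @ x.T).sum()

    # Render the task graph that Dask built for this computation.
    y.visualize(filename="task-graph.svg")

    # Profile task scheduling and resource usage while the graph runs.
    with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
        y.compute()

    # Produce an interactive HTML report of tasks and CPU/memory usage.
    visualize([prof, rprof], filename="profile.html", show=False)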