Build a modern data infrastructure

Data is the new oil 🛢️ So you want to build a data-driven company or create machine learning models, but guess what? You need a proper data infrastructure to support it! During this workshop you will learn how to create one using OSS tools, along with best practices that fit both batch and stream processing.

Tags: Big Data, Data Science, DevOps, Infrastructure, Jupyter, Web

Scheduled on Wednesday at 16:00 in room openhub

Speaker

Christian Barra (@christianbarra)

I do Python, conferences and often play with data.

Description

During this tutorial you will learn how to create a scalable and reliable data infrastructure using OSS libraries.

Given the limited amount of time, we will focus mainly on the batch side, but some useful insights about stream processing and OLAP will be shared as well.

The first part will introduce the main concepts, while the second will focus on building stuff.

Tutorial outline

First part

  • Why bother?
  • Unified data warehouse
  • Stream vs. Batch (fast vs. slow data)
  • Kafka? I'd rather use Redis
  • Short-term storage vs. long-term storage
  • Airflow
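Redis Streams gives you a Kafka-like append-only log: producers append events, consumers read from an offset and explicitly acknowledge what they have processed. The sketch below illustrates that pattern with a tiny in-memory stand-in (the class, stream name, and events are invented for illustration); in the workshop the same pattern runs against a real Redis server via redis-py's `xadd`/`xread`/`xack`.

```python
class MiniStream:
    """In-memory stand-in for a Redis stream: an append-only log
    that consumers read from and explicitly acknowledge."""

    def __init__(self):
        self.log = []          # the append-only event log
        self.pending = set()   # delivered but not yet acknowledged

    def xadd(self, event):
        """Append an event and return its offset (Redis returns an entry ID)."""
        self.log.append(event)
        return len(self.log) - 1

    def xread(self, last_offset):
        """Return events newer than last_offset and mark them as pending."""
        new = list(enumerate(self.log))[last_offset + 1:]
        self.pending.update(offset for offset, _ in new)
        return new

    def xack(self, offset):
        """Acknowledge an event so it will not be redelivered."""
        self.pending.discard(offset)


stream = MiniStream()
stream.xadd({"user": "alice", "action": "click"})
stream.xadd({"user": "bob", "action": "buy"})

last = -1
for offset, event in stream.xread(last):
    # process the event here, then acknowledge it
    stream.xack(offset)
    last = offset
```

The explicit acknowledgement step is what lets a crashed consumer resume from its pending entries instead of losing events, which is the main thing a stream buys you over a plain queue.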

Second part

  • Build a Python consumer
  • Use Airflow to move things around
  • Use Airflow to train machine learning models
  • What's next?
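A batch pipeline of the kind Airflow schedules can be sketched as plain Python callables run in dependency order; the data and the "model" below are toy stand-ins invented for illustration.

```python
def extract():
    # in a real DAG this would pull raw events from long-term storage
    return [{"clicks": 3, "bought": 1},
            {"clicks": 0, "bought": 0},
            {"clicks": 5, "bought": 1}]

def transform(rows):
    # keep only the (feature, label) pairs the model needs
    return [(r["clicks"], r["bought"]) for r in rows]

def train(pairs):
    # toy "model": the lowest click count seen among buyers
    buyers = [clicks for clicks, bought in pairs if bought]
    return min(buyers) if buyers else None

# run the tasks in dependency order, as the scheduler would
threshold = train(transform(extract()))
```

In Airflow each callable would become a task (e.g. a `PythonOperator` with `python_callable=extract`), and the DAG's dependency declarations replace the hand-written call chain, so the scheduler can retry and backfill each step independently.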

Tools that we are going to use

  • Redis Streams
  • Scylla
  • Airflow
  • Docker
  • Pandas and Parquet
  • Python ❤️

Prerequisites

  • You don't freak out when asked to clone a repo locally and pull a few Docker images
  • A good knowledge of Python