Speaker: Costas Yiotis
PCCW Global, Security Data Architect & Product Owner
This workshop will provide insights and best practices on data engineering. The aim is to enable anyone to build robust, scalable data pipelines that process enterprise network traffic data and support business analytics, while covering big data engineering concepts such as data ingestion, message queuing, batch vs. stream processing and analysis, and data visualization. Attendees will gain hands-on experience analyzing real network traffic in a horizontally scalable manner, using frameworks such as Apache Spark, Apache Kafka and Logstash. Finally, the workshop will touch on domain-specific peculiarities that can influence data pipeline design, such as maintaining and updating the reference data used for enrichment.
Level:
Beginner / Intermediate
Target audience:
Students, Data Engineers, Big Data Analysts, Network Engineers
Prerequisites on Audience:
a) SW/HW:
Bring a laptop (preferably with at least 8GB of RAM) and Docker already installed.
b) Know-how:
No prior knowledge of data engineering frameworks is necessary. However, a basic understanding of the Python programming language, Docker and computer networks will help.
Deliverables:
i) Presentation on data engineering for network monitoring;
ii) Docker Compose file containing the whole testbed infrastructure;
iii) Notebook with all code generated during the workshop;
iv) Claim your Data Engineering prize!
Schedule:
Presentation (P) and Lab work (L)
P1. Introduction to Data Engineering & a use case (15 min)
– Bringing together data from across the enterprise
– A PCCW Global use case: how we use open source frameworks to support network analytics
P2. Network Flow Data Ingestion & Queuing (25 min)
– Learn to ingest NetFlow data and ship it to Apache Kafka for further processing
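As a sketch of what this ingestion stage can look like, the Logstash pipeline below receives NetFlow over UDP and publishes the parsed records to Kafka. The port number, broker address and topic name are placeholder assumptions, not values from the workshop testbed:

```
# Hypothetical Logstash pipeline: NetFlow in, Kafka out.
input {
  udp {
    port  => 2055        # common NetFlow export port (assumption)
    codec => netflow     # logstash-codec-netflow parses the flow records
  }
}
output {
  kafka {
    bootstrap_servers => "kafka:9092"    # placeholder broker address
    topic_id          => "netflow-raw"   # placeholder topic name
    codec             => json            # serialize parsed flows as JSON
  }
}
```

Decoupling collection from processing through a Kafka topic like this lets downstream consumers (batch or streaming) scale independently of the collectors.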
P3. Data Transformation & Enrichment (25 min)
– Data enrichment and normalization; why domain understanding matters
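To illustrate the kind of enrichment meant here, the snippet below normalizes a flow record and tags it using a reference table keyed by source IP prefix. The field names and the customer table are illustrative assumptions, not the workshop's actual schema; in practice such reference data must be maintained and refreshed separately:

```python
import ipaddress

# Hypothetical reference data: map customer networks to labels.
CUSTOMER_PREFIXES = {
    ipaddress.ip_network("10.1.0.0/16"): "customer-a",
    ipaddress.ip_network("10.2.0.0/16"): "customer-b",
}

def enrich_flow(flow: dict) -> dict:
    """Normalize byte counts and attach a customer label to a flow record."""
    enriched = dict(flow)
    enriched["bytes"] = int(flow["bytes"])  # normalize string counters to int
    src = ipaddress.ip_address(flow["src_ip"])
    enriched["customer"] = next(
        (label for net, label in CUSTOMER_PREFIXES.items() if src in net),
        "unknown",
    )
    return enriched

flow = {"src_ip": "10.1.4.7", "dst_ip": "8.8.8.8", "bytes": "1500"}
print(enrich_flow(flow)["customer"])  # → customer-a
```

This is where domain understanding matters: which prefixes belong to whom, and how stale a lookup table can be before the analytics mislead.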
L1. Lab 1: Collecting, normalizing and queuing NetFlow data samples (30 min)
P4. Data Processing & Storage (30 min)
– Introduction to HDFS, YARN and Apache Spark; choosing batch vs. stream processing
L2. Lab 2: Create a batch data processing pipeline to process NetFlow data (30 min)
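A typical batch job over flow data is a "top talkers" report: total bytes per source IP across a batch. The framework-free Python sketch below mirrors the group-by/sum/sort that the equivalent Spark job would express; the field names are assumptions, not the lab's actual schema:

```python
from collections import defaultdict

def top_talkers(flows, n=3):
    """Sum bytes per source IP over a batch of flow records and
    return the n largest senders (the batch analogue of a
    Spark groupBy / sum / orderBy)."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["src_ip"]] += flow["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

batch = [
    {"src_ip": "10.0.0.1", "bytes": 5000},
    {"src_ip": "10.0.0.2", "bytes": 1200},
    {"src_ip": "10.0.0.1", "bytes": 800},
]
print(top_talkers(batch, n=2))  # → [('10.0.0.1', 5800), ('10.0.0.2', 1200)]
```

In Spark the same logic runs partitioned across workers, which is what makes the pipeline horizontally scalable as traffic volumes grow.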
L3. Lab 3: Data Engineering Challenge (15 min)
– Workshop participants will compete for the Data Engineering prize