Data Integration and Large-Scale Analysis WS2022/23
(VU, 706.520 Data Integration and Large-Scale Analysis)

DIA is a 5 ECTS bachelor and master course, applicable to the bachelor programs 'Computer Science' or 'Software Engineering and Management', as well as the master catalog 'Data Science'. This course covers major data integration architectures, key techniques for data integration and cleaning, as well as methods for large-scale, i.e., distributed, data storage and analysis.


Lectures

In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures, which take place Fridays at 3pm in HS-i5 or virtually.

A: Data Integration and Preparation

  • 01 Introduction and Overview [Oct 07, pdf, pptx]
  • 02 Data Warehousing, ETL, and SQL/OLAP [Oct 14, pdf, pptx]
  • 03 Message-oriented Middleware, EAI, and Replication [Oct 21, pdf, pptx]
  • 04 Schema Matching and Mapping [Oct 28, pdf, pptx]
  • 05 Entity Linking and Deduplication [Nov 04, pdf, pptx]
  • 06 Data Cleaning and Data Fusion [Nov 11, pdf, pptx]

B: Large-Scale Data Management and Analysis

  • 07 Cloud Computing Fundamentals [Nov 18, pdf, pptx]
  • 08 Cloud Resource Management and Scheduling [Nov 25, pdf, pptx]
  • 09 Distributed Data Storage [Dec 02, pdf, pptx]
  • 10 Distributed, Data-Parallel Computation [Jan 20, pdf, pptx]
  • 11 Distributed Stream Processing [Jan 27, pdf, pptx]
  • 12 Distributed Machine Learning Systems [Jan 27, pdf, pptx]


Exercises

The lectures are accompanied by mandatory programming exercises (to the extent of 2 ECTS, i.e., roughly 50 working hours), preferably in Python or Java.

Exercise Description


Organization

  • Lecturer: M.Sc. Shafaq Siddiqi, ISDS
  • Final written exams: Feb 10, 2023 at 14:30 - 16:00 in HS i13 (additional oral exam slots via Doodle, e.g., for international students)
  • Grading: 30% project, 70% final exam