High Performance Data Analysis

Objectives

The main learning objective of this curricular unit is to promote a comprehensive understanding of the principles of high-performance data analysis. Specifically, the course aims to enable students to:

  • understand the challenges of high-performance data analysis,
  • solve the challenges of high-performance data analysis,
  • analyse high-performance data analysis case studies,
  • learn tools and best practices for more effective, scalable, robust and reproducible data analysis,
  • design effective high-performance data analysis solutions,
  • implement high-performance data analysis solutions.

In addition, students will develop skills in using:

  • libraries suitable for implementing basic transformations, visualisations and analyses of large volumes of data,
  • programming techniques suitable for implementing basic transformations, visualisations and analyses of large volumes of data.

Program

The course covers several important aspects of high-performance tools, including:

  • Big Data: variety (data formats), velocity (real-time data flows to enable real-time decisions) and volume
  • Fundamental concepts and methods for designing, storing, analyzing and managing semi-structured and unstructured data: data models such as tabular, tree, graph, multi-dimensional (cubes), text; and row vs. column-oriented storage
  • Basic aspects of data analysis pipelines: acquisition, integration, exploration, mining, analysis, visualization and interpretation
  • Data scalability, availability, coherence, distribution and expressiveness
  • Distributed processing: MapReduce, Dataflow/DAG and Graphs
  • Batch vs Stream processing
  • Performance optimization in data analysis
  • Analyzing large amounts of data in Python: Jupyter Notebooks, Pandas, NumPy, Dask or PySpark
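
As an illustration of the MapReduce model listed above, the following is a minimal sketch in plain Python (standard library only). It runs on a single machine and only mimics the map, shuffle and reduce phases that a distributed framework would spread across workers. The function names and the sample documents are hypothetical and are not taken from the course materials.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over an iterable of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    docs = ["big data big ideas", "data pipelines move data"]
    print(word_count(docs))
```

In a real cluster the map and reduce functions would run in parallel on separate partitions of the data; tools such as Dask or PySpark take care of that distribution, so this sketch is only meant to show the shape of the computation.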

Bibliography

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. O’Reilly Media, Inc., 2017.
  • High Performance Python. Micha Gorelick, Ian Ozsvald. O’Reilly Media, Inc., 2020.
  • Data Science with Python and Dask. Jesse C. Daniel. Manning, 2019.
  • Spark: The Definitive Guide. Bill Chambers, Matei Zaharia. O’Reilly Media, Inc., 2018.
