HPDA
High Performance Data Analysis
Objectives
The main learning objective of this curricular unit is to promote a comprehensive understanding of the principles of high-performance data analysis. Specifically, the course aims to enable students to:
- understand the challenges of high-performance data analysis,
- address the challenges of high-performance data analysis,
- analyse high-performance data analysis case studies,
- learn tools and best practices for more effective, scalable, robust and reproducible data analysis,
- design effective high-performance data analysis solutions,
- implement high-performance data analysis solutions.
In addition, students will develop skills in using:
- libraries suitable for implementing basic transformations, visualisations and analyses of large volumes of data,
- programming techniques suitable for implementing basic transformations, visualisations and analyses of large volumes of data.
Program
The course covers several important aspects of high-performance data analysis and its tools, including:
- Big Data: variety (data formats), velocity (real-time data flows that enable real-time decisions) and volume
- Fundamental concepts and methods for designing, storing, analyzing and managing semi-structured and unstructured data: data models such as tabular, tree, graph, multi-dimensional (cubes), text; and row vs. column-oriented storage
- Basic aspects of data analysis pipelines: acquisition, integration, exploration, mining, analysis, visualization and interpretation
- Data scalability, availability, consistency, distribution and expressiveness
- Distributed processing: MapReduce, Dataflow/DAG and Graphs
- Batch vs Stream processing
- Performance optimization in data analysis
- Analyzing large amounts of data in Python: Jupyter Notebooks, Pandas, NumPy, Dask or PySpark
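As an illustration of the distributed-processing topic listed above, the MapReduce pattern can be sketched in plain Python. This is only a minimal, single-process sketch: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical, chosen for illustration, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical sample input, not course data.
documents = ["big data big ideas", "data flows"]
counts = word_count(documents)
print(counts)
```

Because each call to map_phase depends only on its own document, and each key is reduced independently, both steps could in principle be distributed, which is the property that tools such as Dask or PySpark exploit.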
Bibliography
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. O’Reilly Media, Inc., 2017.
- High Performance Python. Micha Gorelick, Ian Ozsvald. O’Reilly Media, Inc., 2020.
- Data Science with Python and Dask. Jesse C. Daniel. Manning, 2019.
- Spark: The Definitive Guide. Bill Chambers, Matei Zaharia. O’Reilly Media, Inc., 2018.