Scala Spark
Introduction to Scala
Scala is a modern, multi-paradigm programming language that combines the object-oriented and functional programming paradigms. It runs on the Java Virtual Machine (JVM) and is known for its concise syntax, strong static typing, and compatibility with Java libraries. Scala is commonly used in domains including data science, big data processing (with frameworks like Apache Spark), and web development. It offers features like pattern matching, immutability, higher-order functions, and type inference, making it a powerful tool for building scalable and maintainable applications. It was designed in part to address perceived shortcomings of Java. Here's what makes Scala special:
- Object-Oriented and Functional Powerhouse:
- Everything is an Object: Scala is firmly object-oriented (OO); unlike Java, even primitive values such as Int are objects. Classes, inheritance, and methods are familiar constructs.
- Functions Rule: Scala takes it a step further by treating functions as first-class citizens. You can pass functions as arguments and return functions from other functions, which encourages composable code (see the first sketch after this list).
- Conciseness and Expressivity:
- Less Boilerplate: Scala eliminates a lot of Java's verbosity. Features like type inference and powerful data structures allow you to do more with fewer lines of code.
- Pattern Matching: Elegant way to deconstruct data and make decisions, replacing lengthy if-else chains.
- Static Typing and Safety:
- Errors Caught Early: The compiler rigorously checks types, drastically reducing runtime surprises.
- Immutability by Default: Scala encourages side-effect-free coding through immutable variables and data structures, promoting safer and more reliable programs.
- Powerful Concurrency Support:
- Actor Model: Scala-based libraries like Akka streamline building concurrent applications using actors (small units of computation that communicate via messages).
- Futures and Promises: Convenient abstractions for handling asynchronous operations (see the second sketch after this list).
- Rich Ecosystem:
- Play Framework: Popular web framework for building RESTful APIs and dynamic web applications.
- Apache Kafka: Stream processing platform often used in conjunction with Spark.
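To make the functional side concrete, here is a minimal, self-contained sketch of first-class functions and pattern matching. All names are illustrative, not from any particular library:

```scala
object FunctionalBasics {
  // A function stored in a value: functions are first-class citizens.
  val double: Int => Int = x => x * 2

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Pattern matching deconstructs data and replaces long if-else chains.
  def describe(value: Any): String = value match {
    case 0               => "zero"
    case n: Int if n < 0 => "a negative number"
    case n: Int          => s"a positive number: $n"
    case s: String       => s"a string of length ${s.length}"
    case _               => "something else"
  }

  def main(args: Array[String]): Unit = {
    println(applyTwice(double, 3)) // prints 12
    println(describe(-5))          // prints "a negative number"
    println(describe("Spark"))     // prints "a string of length 5"
  }
}
```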
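And here is a small sketch of the Future abstraction from the standard library; the sleep is a stand-in for real asynchronous work such as a network call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FuturesDemo {
  def main(args: Array[String]): Unit = {
    // The Future starts running asynchronously on the implicit execution context.
    val answer: Future[Int] = Future {
      Thread.sleep(100) // stand-in for slow I/O or a long computation
      21 * 2
    }

    // Futures compose: map transforms the eventual result without blocking.
    val message: Future[String] = answer.map(n => s"Computed: $n")

    // Block only at the very edge of the program, e.g. in main or a test.
    println(Await.result(message, 2.seconds)) // Computed: 42
  }
}
```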
Where Scala Shines
Scala is designed with the following goals in mind:
- To express common programming patterns using as little code as possible.
- To produce software that runs at least as fast as equivalent Java code.
- To make it easier to reason about correctness by providing compile-time safety checks.
- To make it easy to add new features to an existing codebase without tearing everything apart. This means you can gradually refactor a legacy Java codebase into a more modular, functional style without modifying hundreds of files.
- To have a concise syntax that is easy to read and write.
- To catch type errors at compile time rather than at runtime.
- To support functional programming constructs such as higher-order functions while still embracing object-oriented design.
- To integrate well with existing Java codebases. This means you can use any existing Java library from within your Scala code without having to write new Java wrappers (see the interop sketch after this list).
- To be easily embedded into existing Java applications.
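As a minimal illustration of that Java interoperability, the following calls plain JDK classes directly from Scala with no wrappers. Any JDK 8+ API would work the same way:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object JavaInterop {
  def main(args: Array[String]): Unit = {
    // JDK classes are called exactly as Java code would call them.
    val today: LocalDate = LocalDate.now()
    println(s"Today is ${today.format(DateTimeFormatter.ISO_LOCAL_DATE)}")

    // Java collections work too; converters bridge them to Scala types.
    val javaList = new java.util.ArrayList[String]()
    javaList.add("scala")
    javaList.add("spark")

    import scala.jdk.CollectionConverters._ // Scala 2.13+; use JavaConverters on 2.12
    println(javaList.asScala.map(_.toUpperCase).mkString(", "))
  }
}
```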
Getting Started with Scala
Installation
- Visit the Download Page: Head over to https://www.scala-lang.org/download/ and you'll see different options:
- IDE (Recommended for beginners): An IDE includes the Scala compiler, build tools, an editor, debugger, and more. IntelliJ IDEA with the Scala plugin is a popular choice.
- Standalone Installer: Provides just the core Scala compiler and tools. Useful for advanced users or experimenting on the command line.
- Follow Instructions: The download page offers instructions for Windows, macOS, and Linux.
Important Notes
- Java is Required: Scala code compiles to Java bytecode and runs on the Java Virtual Machine (JVM). Make sure you have an appropriate Java Development Kit (JDK) installed.
- IDE Convenience: While you can install Scala and write code with a simple text editor, an IDE makes things tremendously easier with auto-completion, code navigation, and debugging capabilities.
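Once everything is installed, a classic hello-world program is a quick way to verify the toolchain; the file name and message here are just examples:

```scala
// Hello.scala
// Compile with: scalac Hello.scala
// Run with:     scala Hello
object Hello {
  def main(args: Array[String]): Unit = {
    println(s"Hello from Scala ${scala.util.Properties.versionString}")
  }
}
```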
Fundamentals: The Building Blocks
Here's a summary of essential concepts to grasp right at the beginning (a short example follows this list):
- Variables: `val` declares an immutable value (cannot be changed once assigned); `var` declares a mutable variable (can be reassigned).
- Data Types: Scala provides the familiar numeric types (`Int`, `Double`, etc.), booleans (`Boolean`), strings (`String`), and more.
- Functions: Define reusable blocks of code using the `def` keyword, specifying name, parameters, and return type.
- Classes and Objects: Classes are blueprints for creating objects; objects are instances of classes, holding data and behavior.
- Traits: Similar to interfaces in Java, they define abstract methods and fields. A class can extend multiple traits.
- Expressions: Any piece of code that evaluates to a value.
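A short sketch tying these building blocks together (all names are illustrative):

```scala
// A trait defines abstract members, much like a Java interface.
trait Greeter {
  def greeting: String
}

// A class is a blueprint for objects; this one mixes in the trait.
class Person(val name: String) extends Greeter {
  def greeting: String = s"Hello, $name"
}

object Fundamentals {
  def main(args: Array[String]): Unit = {
    val person = new Person("Ada") // val: immutable, cannot be reassigned
    var count  = 0                 // var: mutable, can be reassigned
    count += 1

    // In Scala, if-else is an expression: it evaluates to a value.
    val label: String = if (count == 1) "first run" else "later run"

    println(person.greeting) // Hello, Ada
    println(label)           // first run
  }
}
```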
File formats: JSON, Avro, Parquet, ORC, Arrow, and CSV
The selection of a file format profoundly influences the performance and manageability of a data system. As a result, various open-source solutions have emerged to efficiently store data. Among these, popular storage formats include JSON, Apache Avro, Apache Parquet, Apache ORC, Apache Arrow, and traditional delimited text files like CSV. Each format entails tradeoffs regarding factors such as flexibility, software compatibility, efficiency, and performance.
- JSON (JavaScript Object Notation):
- JSON is a lightweight and human-readable data interchange format.
- It is commonly used for transmitting data between a server and a web application.
- JSON is text-based and easy to parse, making it popular for web APIs and configuration files.
- However, JSON files can be larger and less efficient compared to binary formats for large datasets.
- Apache Avro:
- Avro is a binary serialization format developed within the Apache Hadoop project.
- It provides a compact, fast, and efficient data serialization framework.
- Avro supports schema evolution, allowing for changes to the schema without breaking compatibility with older data.
- It is widely used in the Hadoop ecosystem, especially with tools like Apache Kafka and Apache Spark.
- Apache Parquet:
- Parquet is a columnar storage format optimized for use with big data processing frameworks.
- It stores data column-wise rather than row-wise, which improves compression and query performance.
- Parquet is efficient for analytics workloads, especially when dealing with large datasets.
- It is commonly used with Apache Hadoop, Apache Spark, and other big data tools.
- Apache ORC (Optimized Row Columnar):
- ORC is another columnar storage format developed within the Apache Hive project.
- Similar to Parquet, ORC organizes data by column rather than by row for improved compression and query performance.
- ORC supports advanced features like predicate pushdown and stripe-level statistics, enhancing query performance further.
- It is often used in data warehousing and analytics applications within the Hadoop ecosystem.
- Apache Arrow:
- Arrow is a cross-language development platform for in-memory data.
- It provides a standardized language-independent columnar memory format for data interchange between different systems.
- Arrow enables efficient data sharing and interoperability between various programming languages and analytical tools.
- It is particularly beneficial for high-performance analytics and machine learning applications.
- CSV (Comma-Separated Values):
- CSV is a simple and widely used file format for tabular data.
- It stores data in plain text format with each record separated by a delimiter, commonly a comma.
- CSV files are human-readable and widely supported by spreadsheet applications and databases.
- However, they lack advanced features like schema enforcement and efficient compression compared to more modern formats like Parquet or Avro.
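Since this tutorial centers on Spark, here is a sketch of reading and writing several of these formats with the Scala DataFrame API. The paths are placeholders, and Avro additionally requires the external spark-avro package on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object FileFormatsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-formats-demo")
      .master("local[*]") // local mode, convenient for experimenting
      .getOrCreate()

    // CSV: schema inference is convenient here; in production an explicit
    // schema is faster and safer.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/input.csv") // placeholder path

    // JSON: by default Spark expects one JSON object per line.
    val jsonDf = spark.read.json("data/input.json") // placeholder path
    jsonDf.printSchema()

    // Columnar formats: Parquet and ORC store their schema with the data.
    df.write.mode("overwrite").parquet("data/out.parquet")
    df.write.mode("overwrite").orc("data/out.orc")

    // Avro needs the spark-avro package (e.g. spark-submit --packages
    // org.apache.spark:spark-avro_2.12:3.5.0), hence commented out here.
    // df.write.format("avro").save("data/out.avro")

    spark.read.parquet("data/out.parquet").show(5)
    spark.stop()
  }
}
```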
References
- Official Documentation
- Databricks Learning Academy
- Spark by Examples
- DataCamp tutorial.
- For Databricks, see the YouTube tutorial playlist by Bryan Cafferky, author of the book "Master Azure Databricks". It is a great playlist for anyone who wants to learn big data analytics on the Azure Databricks cloud platform.
- See the PySpark basics video by Krish Naik, a great video for getting started.
- A great YouTube video on how Apache Spark works on premises.
Some other interesting things to know:
- Visit my website for content on data, big data, data modeling, data warehousing, SQL, and cloud compute.
- Visit my website for content on data engineering.