- SQL Cheat Sheet
- Background: What is SQL? Why do we need it?
- Managing Tables
- Manipulating Data
- Retrieving Attributes
- JOINS
- Subqueries
- Using Functions to Customize ResultSet
- GROUPING DATA
SQL Cheat Sheet
Background: What is SQL? Why do we need it?
This part of the Spark, Scala, and Python training includes the PySpark SQL Cheat Sheet. In this part, you will learn various aspects of PySpark SQL that are frequently asked about in interviews, and you will get familiar with the most important PySpark SQL terminology.
Spark SQL uses a nested data model based on Hive. It supports all major SQL data types, including boolean, integer, double, decimal, string, date, and timestamp, as well as user-defined types. DataFrames can be queried with the DataFrame DSL or with SQL, and a DataFrame can be created from an RDD either by letting Spark infer the schema from Row objects or by specifying the schema explicitly. Spark also offers a rich set of machine learning libraries out of the box, is compatible with existing Hadoop v1 (SIMR) and 2.x (YARN) deployments, and processes data in distributed memory over a cluster.
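The fragmentary snippet above appears to build a DataFrame from a text file by inferring the schema from Row objects; a cleaned-up sketch (the people.txt file and its two-column name,age layout come from the fragment) might look like:

```python
from pyspark.sql import Row

# Build an RDD of Row objects and let Spark infer the schema.
sc = spark.sparkContext
lines = sc.textFile('people.txt')            # each line: "name,age"
parts = lines.map(lambda l: l.split(','))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1].strip())))
peopledf = spark.createDataFrame(people)     # query via the DSL, or register a temp view for SQL
```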
SQL is a database language used to query and manipulate the data in the database.
Main objectives:
- To provide an efficient and convenient environment for storing and retrieving data
- To manage information about the users who interact with the DBMS
SQL statements can be categorized as follows:
Data Definition Language(DDL) Commands:
- CREATE: creates a new database object, such as a table.
- ALTER: used to modify an existing database object, such as a table.
- DROP: used to delete database objects.
Data Manipulation Language(DML) Commands:
- INSERT: used to insert a new data row record in a table.
- UPDATE: used to modify an existing record in a table.
- DELETE: used to delete a record from the table.
Data Control Language(DCL) Commands:
- GRANT: used to assign permission to users to access database objects.
- REVOKE: used to deny permission to users to access database objects.
Data Query Language(DQL) Commands:
- SELECT: it is the DQL command to select data from the database.
Transaction Control Language(TCL) Commands:
- COMMIT: used to save any transaction into the database permanently.
- ROLLBACK: restores the database to the last committed state.
Identifying Data Types
Data types specify the type of data that an object can contain, such as integer data or character data. We need to specify the data type according to the data to be stored.
Following are some of the essential data types:
Data Type | Used to Store |
int | Integer data (32-bit) |
smallint | Integer data (16-bit) |
tinyint | Integer data (8-bit, 0 to 255) |
bigint | Integer data (64-bit) |
decimal | Numeric data with a fixed precision and scale |
numeric | Numeric data with a fixed precision and scale (synonym of decimal) |
float | Floating-point data |
money | Monetary data |
datetime | Date and time data |
char(n) | Fixed-length character data |
varchar(n) | Variable-length character data |
text | Character string |
bit | Integer data with the value 0 or 1 |
image | Variable-length binary data, e.g. for storing images |
real | Floating-point number (lower precision than float) |
binary | Fixed-length binary data |
cursor | Cursor reference |
sql_variant | Values of different data types |
timestamp | Database-wide unique number that is updated every time a row containing a timestamp column is inserted or updated |
table | Temporary set of rows, such as the result set of a table-valued function |
xml | Stores and returns XML values |
Managing Tables
Create Table
Table can be created using the CREATE TABLE statement. The syntax is as follows:
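A hedged sketch of the general form (column and constraint names are placeholders):

```sql
CREATE TABLE table_name (
    column_name1 data_type [constraint],
    column_name2 data_type [constraint]
);
```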
Example: Create a table named EmployeeLeave in the Human Resource schema with the following attributes (a sketch of the statement follows the column list):
Columns | Data Type | Checks |
EmployeeID | int | NOT NULL |
LeaveStartDate | date | NOT NULL |
LeaveEndDate | date | NOT NULL |
LeaveReason | varchar(100) | NOT NULL |
LeaveType | char(2) | NOT NULL |
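A possible CREATE TABLE statement for this example (the HumanResource schema name is an assumption; create the schema first if it does not exist):

```sql
CREATE TABLE HumanResource.EmployeeLeave (
    EmployeeID     int          NOT NULL,
    LeaveStartDate date         NOT NULL,
    LeaveEndDate   date         NOT NULL,
    LeaveReason    varchar(100) NOT NULL,
    LeaveType      char(2)      NOT NULL
);
```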
Constraints in SQL
Constraints define rules that must be followed to maintain the consistency and correctness of data. A constraint can be created either when the table is created (as part of the CREATE TABLE statement) or added later (with ALTER TABLE); a sketch follows the table of constraint types below.
Types of Constraints:
Constraint | Description | Syntax |
Primary key | Column or columns that uniquely identify all rows in the table. | CREATE TABLE table_name ( col_name data_type CONSTRAINT constraint_name PRIMARY KEY, ... ) |
Unique key | Enforces uniqueness on non-primary-key columns. | |
Foreign key | Removes inconsistency between two tables when the data in one table depends on data in the other. | |
Check | Enforces domain integrity by restricting the values that can be inserted into a column. | |
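A hedged sketch showing all four constraint types on hypothetical Department and Employee tables:

```sql
CREATE TABLE Department (
    DepartmentID int PRIMARY KEY,
    Name         varchar(50) CONSTRAINT UQ_Department_Name UNIQUE
);

CREATE TABLE Employee (
    EmployeeID   int PRIMARY KEY,
    DepartmentID int CONSTRAINT FK_Employee_Department
                     FOREIGN KEY REFERENCES Department(DepartmentID),
    Salary       money CONSTRAINT CK_Employee_Salary CHECK (Salary > 0)
);
```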
Modifying Tables
Modify table using ALTER TABLE statement when:
- Adding column
- Altering data type
- Adding or removing constraints
Syntax of ALTER TABLE:
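Hedged sketches of the common forms (T-SQL flavour; the column and constraint names are placeholders):

```sql
ALTER TABLE Employee ADD PhoneNumber varchar(20);           -- add a column
ALTER TABLE Employee ALTER COLUMN PhoneNumber varchar(30);  -- change a data type
ALTER TABLE Employee ADD CONSTRAINT CK_Phone CHECK (LEN(PhoneNumber) >= 10);  -- add a constraint
ALTER TABLE Employee DROP CONSTRAINT CK_Phone;              -- remove a constraint
```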
Renaming a Table
A table can be renamed whenever required using RENAME TABLE statement:
RENAME TABLE old_table_name TO new_table_name;
Dropping a Table versus Truncate Table
A table can be dropped or deleted when no longer required using DROP TABLE statement:
The contents of the table can be deleted when no longer required without deleting the table itself using TRUNCATE TABLE statement:
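For example (the Student table name is used for illustration):

```sql
DROP TABLE Student;      -- removes the table definition and all of its data
TRUNCATE TABLE Student;  -- removes all rows but keeps the table structure
```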
Manipulating Data
Storing Data in a Table
Syntax:
Example: Inserting data into Student table.
Example: Inserting multiple rows into the Student table.
Copying Data from one table to another:
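A hedged sketch covering the general syntax and the three examples listed above (the Student columns come from the sample table shown later in this sheet; StudentBackup is a hypothetical copy target):

```sql
-- General form
INSERT INTO table_name (column1, column2) VALUES (value1, value2);

-- Insert a single row into Student
INSERT INTO Student (StudentID, FirstName, LastName, Marks)
VALUES (107, 'Lara', 'Croft', 82);

-- Insert multiple rows in one statement
INSERT INTO Student (StudentID, FirstName, LastName, Marks)
VALUES (108, 'Sam', 'Wise', 74),
       (109, 'Tina', 'Marsh', 91);

-- Copy data from one table to another
INSERT INTO StudentBackup (StudentID, FirstName, LastName, Marks)
SELECT StudentID, FirstName, LastName, Marks FROM Student;
```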
Updating Data in a Table
Data can be updated in the table using UPDATE DML statement:
Example: Update the marks of Andy to 85.
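A possible statement (column names taken from the Student table below):

```sql
UPDATE Student
SET Marks = 85
WHERE FirstName = 'Andy';
```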
Deleting Data from a Table
A row can be deleted when no longer required using DELETE DML statement.
Syntax:
Deleting all records from a table:
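Hedged sketches of both cases:

```sql
-- Delete selected rows
DELETE FROM Student WHERE StudentID = 104;

-- Delete all records from the table
DELETE FROM Student;
```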
Retrieving Attributes
One or more columns can be displayed while retrieving data from the table.
One may want to view all the details of the Employee table or only a few columns.
The required data can be retrieved from the database tables by using the SELECT statement.
The syntax of SELECT statement is:
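The general form is roughly:

```sql
SELECT column1, column2   -- or * for all columns
FROM table_name
WHERE condition;          -- the WHERE clause is optional
```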
Consider the following Student table:
StudentID | FirstName | LastName | Marks |
101 | John | Ray | 78 |
102 | Steve | Jobs | 89 |
103 | Ben | Matt | 77 |
104 | Ron | Neil | 65 |
105 | Andy | Clifton | 65 |
106 | Park | Jin | 90 |
Retrieving Selected Rows
To retrieve selected rows from a table, use the WHERE clause in the SELECT statement.
The HAVING clause is used instead of WHERE when the condition involves aggregate functions; it filters groups rather than individual rows.
Comparison Operators
Comparison operators compare two expressions, for example =, <>, >, <, >=, and <=.
Syntax and examples of some comparison operators:
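A hedged sketch against the Student table shown above:

```sql
SELECT * FROM Student WHERE Marks > 75;        -- greater than
SELECT * FROM Student WHERE LastName = 'Ray';  -- equal to
SELECT * FROM Student WHERE Marks <> 65;       -- not equal to
```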
Logical Operators
Logical operators are used in the SELECT statement to retrieve records based on one or more conditions. More than one logical operator can be combined to apply multiple search conditions.
Types of Logical Operators (examples follow the list):
OR Operator
AND Operator
NOT Operator
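Possible examples against the Student table:

```sql
SELECT * FROM Student WHERE Marks > 70 AND Marks < 90;                -- AND
SELECT * FROM Student WHERE FirstName = 'Ron' OR FirstName = 'Andy';  -- OR
SELECT * FROM Student WHERE NOT LastName = 'Jobs';                    -- NOT
```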
Range Operator
The range operators retrieve rows whose values fall within (or outside) a specified range.
Types of Range Operators (examples follow the list):
BETWEEN
NOT BETWEEN
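For example, using the Marks column of the Student table:

```sql
SELECT * FROM Student WHERE Marks BETWEEN 65 AND 80;
SELECT * FROM Student WHERE Marks NOT BETWEEN 65 AND 80;
```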
Retrieve Records That Match a Pattern
Rows whose values match a specific pattern can be retrieved from the table.
The LIKE keyword matches the given character string with a specific pattern.
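A possible example (% matches any string, _ matches a single character):

```sql
SELECT * FROM Student WHERE FirstName LIKE 'R%';   -- names starting with R
SELECT * FROM Student WHERE LastName LIKE '_a%';   -- second letter is a
```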
Displaying in a Sequence
Use ORDER BY clause to display the data retrieved in a specific order.
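For example:

```sql
SELECT * FROM Student ORDER BY Marks DESC;          -- highest marks first
SELECT * FROM Student ORDER BY LastName, FirstName; -- ascending by default
```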
Displaying without Duplication
The DISTINCT keyword is used to eliminate rows with duplicate values in a column.
Syntax:
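A minimal sketch against the Student table (the duplicate value 65 appears only once):

```sql
SELECT DISTINCT Marks FROM Student;
```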
JOINS
Joins are used to retrieve data from more than one table together as a part of a single result set. Two or more tables can be joined based on a common attribute.
Types of JOINS:
Consider two tables Employees and EmployeeSalary
EmployeeID (PK) | FirstName | LastName | Title |
1001 | Ron | Brent | Developer |
1002 | Alex | Matt | Manager |
1003 | Ray | Maxi | Tester |
1004 | August | Berg | Quality |
EmployeeID (FK) | Department | Salary |
1001 | Application | 65000 |
1002 | Digital Marketing | 75000 |
1003 | Web | 45000 |
1004 | Software Tools | 68000 |
INNER JOIN
An inner join retrieves records from multiple tables by using a comparison operator on a common column.
Syntax and example:
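A hedged sketch using the Employees and EmployeeSalary tables above:

```sql
-- General form
SELECT columns
FROM table1
INNER JOIN table2 ON table1.common_column = table2.common_column;

-- Example
SELECT e.EmployeeID, e.FirstName, s.Department, s.Salary
FROM Employees e
INNER JOIN EmployeeSalary s ON e.EmployeeID = s.EmployeeID;
```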
OUTER JOIN
An outer join returns a result set containing all the rows from one table and the matching rows from another table.
It displays NULL in the columns of the related table wherever no matching record is found.
Types of Outer Join:
- LEFT OUTER JOIN: all rows from the table on the left side of the LEFT OUTER JOIN keyword are returned, along with the matching rows from the table on the right side.
- RIGHT OUTER JOIN: all rows from the table on the right side of the RIGHT OUTER JOIN keyword are returned, along with the matching rows from the table on the left side.
- FULL OUTER JOIN: a combination of left and right outer join; it returns all matching and non-matching rows from both tables, with matching records displayed only once.
The syntax and examples of all three forms are sketched below.
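Hedged sketches against the Employees and EmployeeSalary tables:

```sql
-- LEFT OUTER JOIN: all rows from Employees, matches from EmployeeSalary
SELECT e.FirstName, s.Salary
FROM Employees e
LEFT OUTER JOIN EmployeeSalary s ON e.EmployeeID = s.EmployeeID;

-- RIGHT OUTER JOIN: all rows from EmployeeSalary, matches from Employees
SELECT e.FirstName, s.Salary
FROM Employees e
RIGHT OUTER JOIN EmployeeSalary s ON e.EmployeeID = s.EmployeeID;

-- FULL OUTER JOIN: all matching and non-matching rows from both tables
SELECT e.FirstName, s.Salary
FROM Employees e
FULL OUTER JOIN EmployeeSalary s ON e.EmployeeID = s.EmployeeID;
```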
CROSS JOIN
Also known as the Cartesian product, a cross join pairs each row of one table with each row of another table. The number of rows in the result set is the number of rows in the first table multiplied by the number of rows in the second table.
Syntax:
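For example (the sample tables above would yield 4 x 4 = 16 rows):

```sql
SELECT e.FirstName, s.Department
FROM Employees e
CROSS JOIN EmployeeSalary s;
```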
EQUI JOIN
An equi join is an inner join that matches rows using an equality comparison, typically on a primary key–foreign key pair, and is commonly written to display all columns from both tables; the INNER JOIN example above, which joins on an equality condition, is an equi join.
SELF JOIN
In a self join, a table is joined with itself, so that a row in the table is correlated with other rows of the same table. The table name is mentioned twice in the query; to differentiate the two instances of the single table, each instance is given an alias. Syntax:
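A hedged sketch, assuming a ManagerID column (not shown in the sample tables) that references EmployeeID in the same table:

```sql
SELECT emp.FirstName AS Employee, mgr.FirstName AS Manager
FROM Employees emp
INNER JOIN Employees mgr ON emp.ManagerID = mgr.EmployeeID;
```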
Subqueries
An SQL statement that is used inside another SQL statement is termed a subquery.
Subqueries are nested inside the WHERE or HAVING clause of SELECT, INSERT, UPDATE, and DELETE statements.
- Outer Query: Query that represents the parent query.
- Inner Query: Query that represents the subquery.
Using IN Keyword
If a subquery returns more than one value, the outer query matches the column specified in its condition against any value in the result set of the subquery; the IN keyword expresses this condition.
Syntax:
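A possible example with the sample tables:

```sql
SELECT FirstName, LastName
FROM Employees
WHERE EmployeeID IN (SELECT EmployeeID FROM EmployeeSalary WHERE Salary > 60000);
```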
Using EXISTS Keyword
EXISTS clause is used with subquery to check if a set of records exists.
EXISTS evaluates to TRUE if the subquery returns at least one row.
Syntax:
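For example (a hedged sketch):

```sql
SELECT e.FirstName, e.LastName
FROM Employees e
WHERE EXISTS (SELECT 1 FROM EmployeeSalary s
              WHERE s.EmployeeID = e.EmployeeID AND s.Salary > 60000);
```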
Using Nested Subqueries
A subquery can itself contain one or more subqueries. Subqueries are nested when the condition of a query depends on the result of another query, which in turn depends on the result of yet another subquery.
Syntax:
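A hedged sketch with one subquery nested inside another, reusing the sample tables:

```sql
SELECT FirstName, LastName
FROM Employees
WHERE EmployeeID IN (SELECT EmployeeID
                     FROM EmployeeSalary
                     WHERE Salary > (SELECT AVG(Salary) FROM EmployeeSalary));
```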
Correlated Subquery
A correlated subquery can be defined as a query that depends on the outer query for its evaluation.
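For example (a hedged sketch; the inner query references e.EmployeeID from the outer query, so it is re-evaluated for every outer row):

```sql
SELECT e.FirstName, e.LastName
FROM Employees e
WHERE 60000 < (SELECT s.Salary
               FROM EmployeeSalary s
               WHERE s.EmployeeID = e.EmployeeID);
```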
Using Functions to Customize ResultSet
Various in-built functions can be used to customize the result set; the general form is SELECT function_name(arguments) FROM table_name.
Using String Functions
String values in the result set can be manipulated by using string functions.
They are used with char and varchar data types.
The following are commonly used string functions (examples follow the table):
Function Name | Example |
left | |
len | |
lower | |
reverse | |
right | |
space | |
str | |
substring | |
upper |
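Possible T-SQL examples for the Example column above (literal strings are illustrative):

```sql
SELECT LEFT('Database', 4);          -- Data
SELECT LEN('Database');              -- 8
SELECT LOWER('SQL');                 -- sql
SELECT REVERSE('SQL');               -- LQS
SELECT RIGHT('Database', 4);         -- base
SELECT 'a' + SPACE(3) + 'b';         -- a   b
SELECT STR(123.45, 6, 1);            -- 123.5
SELECT SUBSTRING('Database', 1, 4);  -- Data
SELECT UPPER('sql');                 -- SQL
```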
Using Date Functions
Date functions are used to manipulate date time values or to parse the date values.
Date parsing includes extracting components, such as day, month, and year from a date value.
Some of the commonly used date functions are listed below (examples follow the table):
Function Name | Parameters | Description |
dateadd | (date part, number, date) | Adds the number of date parts to the date. |
datediff | (date part, date1, date2) | Calculates the number of date parts between two dates. |
datename | (date part, date) | Returns the specified date part of the listed date as a character value. |
datepart | (date part, date) | Returns the specified date part of the listed date as an integer. |
getdate | () | Returns the current date and time. |
day | (date) | Returns an integer, which represents the day. |
month | (date) | Returns an integer, which represents the month. |
year | (date) | Returns an integer, which represents the year. |
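Possible T-SQL examples (the literal dates are illustrative):

```sql
SELECT DATEADD(day, 7, '2023-01-01');              -- 2023-01-08
SELECT DATEDIFF(day, '2023-01-01', '2023-02-01');  -- 31
SELECT DATENAME(month, '2023-01-15');              -- January
SELECT DATEPART(month, '2023-01-15');              -- 1
SELECT GETDATE();                                  -- current date and time
SELECT DAY('2023-01-15'), MONTH('2023-01-15'), YEAR('2023-01-15');  -- 15, 1, 2023
```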
Using Mathematical Functions
Numeric values in a result set can be manipulated by using mathematical functions.
The following table lists the mathematical functions:
Function Name | Parameters | Description |
abs | (numeric_expression) | Returns an absolute value |
acos, asin, atan | (float_expression) | Returns an angle in radians |
cos, sin, cot, tan | (float_expression) | Returns the cosine, sine, cotangent, or tangent of the angle in radians. |
degrees | (numeric_expression) | Converts an angle from radians to degrees. |
exp | (float_expression) | Returns the exponential value of the specified value. |
floor | (numeric_expression) | Returns the largest integer less than or equal to the specified value. |
log | (float_expression) | Returns the natural logarithm of the specified value. |
pi | () | Returns the constant value 3.141592653589793 |
power | (numeric_expression, y) | Returns the value of the numeric expression raised to the power y |
radians | (numeric_expression) | Converts from degrees to radians. |
rand | ([seed]) | Returns a random float number between 0 and 1. |
round | (numeric_expression,length) | Returns a numeric expression rounded off to the length specified as an integer expression. |
sign | (numeric_expression) | Returns +1, -1, or 0 depending on whether the specified value is positive, negative, or zero. |
sqrt | (float_expression) | Returns the square root of the specified value. |
Using Ranking Functions
Ranking functions are used to generate sequential numbers for each row to give a rank based on specific criteria.
Ranking functions return a ranking value for each row. Following functions are used to rank the records:
- row_number Function: This function returns the sequential numbers, starting at 1, for the rows in a result set based on a column.
- rank Function: This function returns the rank of each row in a result set based on specified criteria.
- dense_rank Function: The dense_rank() function is used where consecutive ranking values need to be given based on specified criteria.
These functions use the OVER clause that determines the ascending or descending sequence in which rows are assigned a rank.
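A hedged sketch over the Student table from earlier (the two students with Marks = 65 share a rank; RANK leaves a gap afterwards, DENSE_RANK does not):

```sql
SELECT FirstName, Marks,
       ROW_NUMBER() OVER (ORDER BY Marks DESC) AS RowNum,
       RANK()       OVER (ORDER BY Marks DESC) AS RankNo,
       DENSE_RANK() OVER (ORDER BY Marks DESC) AS DenseRankNo
FROM Student;
```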
Using Aggregate Functions
The aggregate functions, on execution, summarize the values for a column or group of columns and produce a single value.
The general form is SELECT aggregate_function(column) FROM table_name.
The following are the aggregate functions (an example follows the table):
Function Name | Description |
avg | returns the average of values in a numeric expression, either all or distinct. |
count | returns the number of values in an expression, either all or distinct. |
min | returns the lowest value in an expression. |
max | returns the highest value in an expression. |
sum | returns the total of values in an expression, either all or distinct. |
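For example, against the Student table:

```sql
SELECT AVG(Marks) AS AvgMarks,
       COUNT(*)   AS NumStudents,
       MIN(Marks) AS Lowest,
       MAX(Marks) AS Highest,
       SUM(Marks) AS Total
FROM Student;
```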
GROUPING DATA
Grouping data means viewing rows that match specific criteria together in the result set.
Data can be grouped by using the GROUP BY, COMPUTE, COMPUTE BY, and PIVOT clauses in the SELECT statement.
GROUP BY Clause
Summarizes the result set into groups as defined in the query by using aggregate functions.
Syntax:
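A possible example using the EmployeeSalary table (HAVING filters the groups):

```sql
SELECT Department, SUM(Salary) AS TotalSalary
FROM EmployeeSalary
GROUP BY Department
HAVING SUM(Salary) > 50000;
```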
COMPUTE and COMPUTE BY Clause
This COMPUTE clause, with the SELECT statement, is used to generate summary rows by using aggregate functions in the query result.
The COMPUTE BY clause can be used to calculate summary values of the result set on a group of data.
Syntax:
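A hedged sketch in the old T-SQL form (COMPUTE and COMPUTE BY were removed in SQL Server 2012 and later; the query requires an ORDER BY on the grouping column):

```sql
SELECT EmployeeID, Department, Salary
FROM EmployeeSalary
ORDER BY Department
COMPUTE SUM(Salary) BY Department;   -- summary row per department
```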
PIVOT Clause
The PIVOT operator is used to transform a set of row values into columns: PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output.
Syntax:
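A hedged T-SQL sketch that turns department values into columns (department names taken from the sample EmployeeSalary table):

```sql
SELECT [Application], [Web]
FROM (SELECT Department, Salary FROM EmployeeSalary) AS src
PIVOT (SUM(Salary) FOR Department IN ([Application], [Web])) AS pvt;
```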
This page contains a bunch of Spark pipeline transformation methods, which we can use for different problems. Use this as a quick cheat sheet on how to do a particular operation on a Spark DataFrame or in PySpark.
These code snippets are tested on spark-2.4.x, and mostly work on spark-2.3.x also, but I'm not sure about older versions.
Read the partitioned json files from disk
The same approach is applicable to all supported file types.
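A minimal sketch (the path and partition layout are assumptions):

```python
# Read a directory of partitioned JSON files; the partition column (e.g. date)
# is reconstructed from the directory names.
df = spark.read.json("data/events/")   # e.g. data/events/date=2019-01-01/part-*.json
# The same pattern works for other formats: spark.read.parquet(...), spark.read.csv(...)
```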
Save partitioned files into a single file.
Here we are merging all the partitions into one file and dumping it onto the disk. This happens at the driver node, so be careful with the size of the data set that you are dealing with; otherwise, the driver node may go out of memory.
Use the coalesce method to adjust the partition size of the RDD based on our needs.
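A minimal sketch (output path assumed):

```python
# Bring every partition down to a single partition, then write one output file.
# This pulls all data through one task, so it only suits small result sets.
df.coalesce(1).write.mode("overwrite").json("out/single_file_json")
```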
Filter rows which meet particular criteria
Map with case class
Use a case class if you want to map on multiple columns with a complex data structure, or use the Row class.
Use selectExpr to access inner attributes
selectExpr provides easy access to nested data structures such as JSON and lets you filter them using any existing UDFs, or your own UDF for more flexibility.
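A hedged sketch, assuming a nested payload column with user and items fields:

```python
# Reach into a nested struct with SQL expressions, then filter with an expression.
df = spark.read.json("data/events/")
flat = df.selectExpr("payload.user.name as user_name", "payload.items[0] as first_item")
flat.filter("user_name is not null").show()
```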
How to access RDD methods from pyspark side
Using standard RDD operations via the PySpark API isn't straightforward; we need to invoke .rdd to convert the DataFrame before these features can be used.
For example, here we are converting a sparse vector to dense and summing it column-wise.
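A minimal sketch of the dense-vector sum described above (the features column of ml SparseVectors is an assumption):

```python
import numpy as np

# Drop down to the RDD API, convert each sparse vector to a dense numpy array,
# and reduce element-wise to get per-column sums.
col_sums = (df.select("features").rdd
              .map(lambda row: row.features.toArray())
              .reduce(lambda a, b: a + b))
print(col_sums)
```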
Pyspark Map on multiple columns
Filtering a DataFrame column of type Seq[String]
Filter a column with custom regex and udf
Sum a column elements
Remove Unicode characters from tokens
Sometimes we only need to work with ASCII text, so it's better to clean out the other characters.
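A possible UDF-based approach (the token column name is an assumption):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Keep only ASCII characters in the token column.
to_ascii = udf(lambda s: s.encode("ascii", "ignore").decode("ascii") if s else s, StringType())
df = df.withColumn("token_ascii", to_ascii("token"))
```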
Connecting to jdbc with partition by integer column
When using Spark to read data from a SQL database and then run the rest of the pipeline on it, it is recommended to partition the data according to natural segments in the data, or at least on an integer column, so that Spark can fire multiple SQL queries to read data from the SQL server and operate on them separately; the results land in separate Spark partitions.
The commands below are in PySpark, but the APIs are the same for the Scala version as well.
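A hedged sketch (the connection details, table, and id column are assumptions):

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
      .option("dbtable", "dbo.orders")
      .option("user", "spark_reader").option("password", "***")
      .option("partitionColumn", "order_id")   # integer column to split on
      .option("lowerBound", "1").option("upperBound", "1000000")
      .option("numPartitions", "8")            # fires 8 parallel queries
      .load())
```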
Parse nested json data
This will be very helpful when working with PySpark and you want to pass very nested JSON data between JVM and Python processes. Lately the Spark community relies on the Apache Arrow project to avoid multiple serialization/deserialization costs when sending data from Java memory to Python memory or vice versa.
So, to process the inner objects, you can make use of the getItem method to filter out the required parts of the object and pass them over to Python memory via Arrow. In the future Arrow might support arbitrarily nested data, but right now it won't support complex nested formats, so the general recommendation is to go without nesting.
'string ⇒ array<string>' conversion
The type annotation .as[String] avoids relying on an assumed implicit conversion.
A crazy string collection and groupby
This is a stream of operations on a column of type Array[String] to collect the tokens and count the n-gram distribution over all the tokens.
How to access AWS s3 on spark-shell or pyspark
Most of the time we require a cloud storage provider like S3 / GS etc. to read and write the data for processing. Very few keep an in-house HDFS to handle the data themselves, but for the majority, I think cloud storage is easy to start with, and you don't need to bother about size limitations.
Supply the aws credentials via environment variable
Supply the credentials via default aws ~/.aws/config file
Recent versions of awscli expect the configuration to be kept under the ~/.aws/credentials file, but old versions look at the ~/.aws/config path. Spark 2.4.x looks at the ~/.aws/config location, since it comes with default Hadoop jars of version 2.7.x.
Set spark scratch space or tmp directory correctly
This might be required when working with a huge dataset that your machine can't hold entirely in memory for the given pipeline steps; in those cases the data will be spilled over to disk and saved in the tmp directory.
Set the properties below to ensure you have enough space in the tmp location.
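A possible configuration (the path is a placeholder; it can also be set via SPARK_LOCAL_DIRS or spark-defaults.conf):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")  # scratch space for shuffle/spill files
         .getOrCreate())
```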
Pyspark doesn’t support all the data types.
When using Arrow to transport data between JVM and Python memory, Arrow may throw an error if the types aren't compatible with the existing converters. The fixes may come in the future on the Arrow project. I'm keeping this here to show how PySpark gets data from the JVM and what can go wrong in that process.
Work with spark standalone cluster manager
Start the spark clustering in standalone mode
Once you have downloaded the same version of the Spark binary across the machines, you can start the Spark master and slave processes to form the standalone Spark cluster, or you could run both of these services on the same machine.
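Roughly, using the scripts that ship in the Spark 2.4.x sbin directory (host names, core, and memory values are placeholders):

```bash
# On the master machine
./sbin/start-master.sh                      # prints the spark://<master-host>:7077 URL

# On each worker machine (start-slave.sh in 2.4.x; start-worker.sh in newer releases)
./sbin/start-slave.sh spark://master-host:7077 --cores 8 --memory 16g
```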
In standalone mode:
- A worker can have multiple executors.
- A worker is like a node manager in YARN.
- We can set worker max core and memory usage settings.
- When defining the Spark application via spark-shell or similar, define the executor memory and cores.
When submitting the job, to get 10 executors with 1 CPU and 2 GB RAM each, flags along the lines of the sketch below can be used.
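On a standalone cluster this is typically expressed with per-executor cores/memory plus a total-core cap (a hedged sketch; on YARN you would use --num-executors instead):

```bash
# 10 total cores at 1 core per executor gives 10 executors of 2 GB each
spark-submit \
  --master spark://master-host:7077 \
  --executor-cores 1 \
  --executor-memory 2g \
  --total-executor-cores 10 \
  my_job.py
```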
This page will be updated as and when I see some reusable snippet of code for Spark operations.