Overview of Apache Spark
Apache Spark is a distributed processing system used for large data workloads. It relies on optimized execution and in-memory caching for fast performance, and it supports general batch processing, machine learning, ad hoc queries, streaming analytics, graph processing, and more.
This article gives an overview of Apache Spark security and of installing Spark on AWS. On AWS, Apache Spark is offered as part of Amazon EMR. You can create a managed Apache Spark cluster from the AWS Management Console, the AWS CLI, or the Amazon EMR API. You can also use additional Amazon EMR features such as Amazon S3 connectivity through Amazon EMRFS, integration with the EC2 Spot market and the Glue Data Catalog, and auto-scaling to add and remove instances from the cluster.
What are the features and use cases of Apache Spark?
The features and use cases of Apache Spark are listed below:
Features
- Fast Performance
- Develop the application quickly
- Create a variety of workflows
- Integration with the Amazon EMR feature set
Use Cases
- Stream processing
- Machine learning
- Interactive SQL
What is Apache Spark Security?
Apache Spark supports authentication through a shared secret. Authentication is configured through the spark.authenticate configuration parameter, which controls whether Spark's communication protocols authenticate using the shared secret.
Both the sender and the receiver must hold the same shared secret to communicate. If the secrets do not match, they are not allowed to communicate with each other. The shared secret is created as follows:
- For Spark on YARN and local deployments, setting spark.authenticate to true generates and distributes the shared secret automatically.
- For any other Spark deployment, the secret should be configured on every node through spark.authenticate.secret.
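As a minimal sketch of the second case, the lines below append the two authentication properties to a Spark defaults file. The CONF_FILE and SPARK_SECRET variables, the file location, and the placeholder secret are assumptions for illustration; spark.authenticate and spark.authenticate.secret are the property names from the Spark security configuration.

```shell
# Assumption: CONF_FILE points at the spark-defaults.conf read by this node.
CONF_FILE="${CONF_FILE:-./spark-defaults.conf}"

# Placeholder secret; use a real random value and keep it out of version control.
SPARK_SECRET="${SPARK_SECRET:-change-me}"

# Append the authentication settings (the heredoc expands SPARK_SECRET).
cat >> "$CONF_FILE" <<EOF
spark.authenticate          true
spark.authenticate.secret   ${SPARK_SECRET}
EOF

# The file now holds the secret, so restrict it to the owner.
chmod 600 "$CONF_FILE"
```

The same file, with the same secret, would need to be present on every node for the nodes to authenticate to each other.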
Web UI: The Spark web UI can be secured by using HTTPS/SSL settings and by using javax servlet filters through the spark.ui.filters setting.
Authentication: The user specifies javax servlet filters that can authenticate the user. Once the user is logged in, Spark compares the user against the view ACLs to ensure that they are authorized to view the UI. The settings spark.acls.enable, spark.ui.view.acls, and spark.ui.view.acls.groups control the behavior of the view ACLs. To control who may modify a running Spark application, Spark also supports modify ACLs, which are controlled by spark.acls.enable, spark.modify.acls, and spark.modify.acls.groups.
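The ACL settings can be sketched as entries in a Spark defaults file. The user names, group names, and the CONF_FILE variable below are hypothetical; the property names are the ones documented in the Spark security configuration.

```shell
# Assumption: CONF_FILE points at the spark-defaults.conf used by the application.
CONF_FILE="${CONF_FILE:-./spark-defaults.conf}"

cat >> "$CONF_FILE" <<'EOF'
# Turn on ACL checks for viewing and modifying applications.
spark.acls.enable           true
# Hypothetical users and groups allowed to view the application UI.
spark.ui.view.acls          alice,bob
spark.ui.view.acls.groups   analysts
# Hypothetical users and groups allowed to modify a running application.
spark.modify.acls           alice
spark.modify.acls.groups    spark-admins
EOF
```

Keeping the modify list shorter than the view list is a common choice, since modify access includes the ability to kill a running application.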
Event Logging: If applications use event logging, the directory that holds the event log files must have the proper permissions set. The owner of the directory should be the super user. The group permissions should allow users to write to the directory while preventing unauthorized users from altering or removing a file; only the owner of a file is permitted to do that.
Encryption in Apache Spark Security: Spark supports SSL for HTTP protocols and SASL encryption for block transfer. It also supports AES-based encryption for RPC connections.
SSL Configuration for Apache Spark Security: SSL configuration is organized hierarchically, so basic settings can be provided once and applied to all protocols. The SSL settings live in the spark.ssl namespace. SSL must be configured on every node and for each component involved in the communication.
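A minimal sketch of the base settings in the spark.ssl namespace follows. The paths, passwords, and the CONF_FILE variable are placeholders chosen for illustration; the property names come from the Spark SSL configuration.

```shell
# Assumption: CONF_FILE points at the spark-defaults.conf used on this node.
CONF_FILE="${CONF_FILE:-./spark-defaults.conf}"

cat >> "$CONF_FILE" <<'EOF'
# Base SSL settings; they apply to all protocols unless overridden.
spark.ssl.enabled              true
spark.ssl.protocol             TLSv1.2
# Placeholder paths and passwords; replace with the node's own stores.
spark.ssl.keyStore             /path/to/node-keystore.jks
spark.ssl.keyStorePassword     change-me
spark.ssl.keyPassword          change-me
spark.ssl.trustStore           /path/to/truststore.jks
spark.ssl.trustStorePassword   change-me
EOF
```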
YARN Mode: The key-store is prepared on the client side and then distributed, and the executors use it as a component of their application. This is possible because the user can deploy files before the application starts in YARN by using the spark.yarn.dist.archives configuration setting.
Standalone Mode in Apache Spark Security: The user should provide the key-stores and configuration options for the master and the workers. The user may allow the executors to use the SSL settings inherited from the worker that spawned the executor. This is done by setting spark.ssl.useNodeLocalConf to true. When this parameter is set, the executors do not use the settings provided by the user on the client side.
Preparing the key-stores of Apache Spark
Key-stores are generated with the keytool program. The steps involved in configuring the key-stores and the trust-stores are as follows:
- For each node, a key pair is generated.
- The public key from the pair is exported to a file on each node.
- All the public keys are imported into a single trust-store.
- The trust-store is then distributed to the nodes.
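The steps above can be sketched with keytool as follows. The node names, working directory, password, and the build_stores function are hypothetical; the keytool subcommands (-genkeypair, -exportcert, -importcert) are standard JDK options. The script skips the work when keytool is not installed.

```shell
# Assumptions: two hypothetical nodes, a scratch directory, a placeholder password.
NODES="node1 node2"
WORK_DIR="${WORK_DIR:-./spark-ssl}"
STORE_PASS="${STORE_PASS:-change-me}"

build_stores() {
  mkdir -p "$WORK_DIR"
  for node in $NODES; do
    # 1. Generate a key pair for the node in its own key-store.
    keytool -genkeypair -alias "$node" -keyalg RSA -keysize 2048 \
      -dname "CN=$node" -validity 365 \
      -keystore "$WORK_DIR/$node-keystore.jks" \
      -storepass "$STORE_PASS" -keypass "$STORE_PASS"

    # 2. Export the node's public certificate to a file.
    keytool -exportcert -alias "$node" \
      -keystore "$WORK_DIR/$node-keystore.jks" \
      -storepass "$STORE_PASS" -file "$WORK_DIR/$node.cer"

    # 3. Import the certificate into the single shared trust-store.
    keytool -importcert -noprompt -alias "$node" \
      -file "$WORK_DIR/$node.cer" \
      -keystore "$WORK_DIR/truststore.jks" -storepass "$STORE_PASS"
  done
  # 4. Distributing truststore.jks to every node (for example with scp) is not shown.
}

if command -v keytool >/dev/null 2>&1; then
  build_stores
else
  echo "keytool not found; install a JDK first" >&2
fi
```

In practice each key pair would be generated on its own node so that private keys never leave the machine; the single loop here only keeps the sketch short.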
Configuring SASL Encryption
Presently, SASL encryption is supported for block transfer when authentication is enabled. To enable SASL encryption, set spark.authenticate.enableSaslEncryption to true. When using an external shuffle service, unencrypted connections can be disabled by setting spark.network.sasl.serverAlwaysEncrypt to true.
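As a minimal sketch, the three related properties can be appended to a Spark defaults file. The CONF_FILE variable and file location are assumptions; the property names are those given in the Spark security configuration.

```shell
# Assumption: CONF_FILE points at the spark-defaults.conf used on this node.
CONF_FILE="${CONF_FILE:-./spark-defaults.conf}"

cat >> "$CONF_FILE" <<'EOF'
# SASL encryption requires authentication to be enabled.
spark.authenticate                        true
spark.authenticate.enableSaslEncryption   true
# With an external shuffle service: refuse unencrypted connections.
spark.network.sasl.serverAlwaysEncrypt    true
EOF
```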
Installation of Apache Spark on AWS
The steps to install Apache Spark on AWS are listed below:
Prerequisites
- You need an AWS account to use AWS services.
- You need to create an EC2 key pair and import it into an SSH client.
Why EC2?
Installing Spark requires a server to run it on. Amazon EC2 provides virtual servers (instances) that can fill this role.
Configuring and Launching a new EC2 instance
The steps to configure and launch a new EC2 instance are below:
Creating an IAM role
- Log in to your AWS Management Console and select the Identity and Access Management (IAM) service.
- Select ‘Create new Role.’
- On step 1, set role name.
- On step 2, set role type.
- Step 3 can be skipped. On step 4, attach the policy.
- On step 5, Review, select Create Role.
- Select the cube icon and return to the list of AWS service offerings.
Creating a security group
- From Management console, select EC2 service.
- Create a security group by navigating to Network and Security.
- Set the security group name to a value of your choice and set the description to something like "security group protecting the Spark instance".
- Select the Inbound tab and then select Add Rule.
- Set the type to SSH and the source to My IP. If your IP address changes, the rule can be updated from here.
- Select Add Rule to add another rule if needed.
- Select the Outbound tab to review the outbound rules.
- Select Create.
- You can set the name if the name field is blank.
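The same security group can be sketched with the AWS CLI. This assumes the AWS CLI is installed and configured with credentials; the group name, the create_spark_security_group function, and the use of checkip.amazonaws.com to discover the current IP are illustrative choices, not part of the console workflow above.

```shell
# Assumptions: AWS CLI configured; group name is a hypothetical example.
SG_NAME="${SG_NAME:-spark-sg}"
SG_DESC="Security group protecting the Spark instance"

create_spark_security_group() {
  # Current public IP of this machine, used as a /32 CIDR block.
  my_ip="$(curl -s https://checkip.amazonaws.com)"

  # Create the group in the default VPC and capture its ID.
  sg_id="$(aws ec2 create-security-group \
    --group-name "$SG_NAME" \
    --description "$SG_DESC" \
    --query 'GroupId' --output text)"

  # Inbound rule: SSH (TCP port 22) from the current IP only.
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 22 --cidr "${my_ip}/32"

  echo "$sg_id"
}
```

Calling create_spark_security_group prints the new group ID, which can then be passed to the instance launch step.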
Creating the EC2 Instance
- Select the EC2 service from the AWS Management Console.
- Select Launch Instance by navigating to Instances. This starts a wizard workflow now.
- On step 1, select an Amazon Machine Image (AMI).
- On step 2, select the Instance Type.
- On step 3, configure the details of the instances. Set IAM Role to the IAM Role that has been created earlier.
- On step 4, add Storage.
- On step 5, tag an Instance.
- On step 6, configure the security group: select ‘Select an existing security group’ and choose the one you created earlier.
- On step 7, review the instance launch and select Launch.
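For readers who prefer the command line, the wizard steps roughly correspond to a single aws ec2 run-instances call. Every identifier below (AMI ID, instance type, key name, security group ID, instance profile name, tag value) is a hypothetical placeholder, and the launch_spark_instance function is only a sketch that assumes a configured AWS CLI.

```shell
# All values are hypothetical placeholders; replace them with your own.
AMI_ID="${AMI_ID:-ami-0123456789abcdef0}"
INSTANCE_TYPE="${INSTANCE_TYPE:-t2.micro}"
KEY_NAME="${KEY_NAME:-my-ec2-key}"
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
INSTANCE_PROFILE="${INSTANCE_PROFILE:-spark-ec2-role}"

launch_spark_instance() {
  # AMI, instance type, IAM role, tag, and security group map to wizard steps 1-6.
  aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --iam-instance-profile "Name=$INSTANCE_PROFILE" \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=spark}]' \
    --count 1 \
    --query 'Instances[0].InstanceId' --output text
}
```

Calling launch_spark_instance prints the ID of the new instance, which the management commands need later.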
Managing the EC2 Instance
A running EC2 instance accrues charges until it is stopped.
- To start or stop an EC2 instance, select Actions from the table of instances and choose Start or Stop from the menu. A stopped instance does not accrue instance usage charges, although attached EBS storage is still billed.
- To permanently terminate an instance, select Actions from the table of instances. Termination protection can be changed under Instance Settings.
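The same lifecycle actions can be sketched with the AWS CLI, assuming it is installed and configured. The instance ID is a hypothetical placeholder and the function names are illustrative wrappers around the real aws ec2 subcommands.

```shell
# Hypothetical instance ID; replace with the ID shown in the instances table.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"

# Stop the instance (instance usage charges stop; EBS storage is still billed).
stop_instance() { aws ec2 stop-instances --instance-ids "$INSTANCE_ID"; }

# Start a stopped instance again.
start_instance() { aws ec2 start-instances --instance-ids "$INSTANCE_ID"; }

# Turn off termination protection so the instance can be terminated.
allow_termination() {
  aws ec2 modify-instance-attribute \
    --instance-id "$INSTANCE_ID" --no-disable-api-termination
}

# Permanently terminate the instance; this cannot be undone.
terminate_instance() { aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"; }
```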
Connecting to the EC2 Instance
- Select the EC2 instance from the dashboard. Details about the instance will appear.
- Record the public IP address of the instance. You may access this via a web browser.
- You may use an SSH client to connect to the Public IP once your instance starts running.
- There will be a login message on your first login.
What are the steps to install Apache Spark?
The steps to install Apache Spark are listed below:
Downloading Spark
Visit the Apache Spark download page in your web browser and generate a download link that can be accessed from the EC2 instance. Copy the download link to your clipboard so that you can paste it into your EC2 session. On the EC2 instance, type these commands:
# Download Spark to the ec2-user's home directory
cd ~
wget http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
# Unpack Spark in the /opt directory
sudo tar zxvf spark-2.4.0-bin-hadoop2.7.tgz -C /opt
# Update permissions on installation
sudo chown -R ec2-user:ec2-user /opt/spark-2.4.0-bin-hadoop2.7
# Create a symbolic link to make it easier to access
sudo ln -fs spark-2.4.0-bin-hadoop2.7 /opt/spark
Set the SPARK_HOME environment variable to complete your installation. You need to log out and log in again for the change to take effect.
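A minimal sketch of this last step follows. It assumes bash is the login shell, that it reads ~/.bash_profile, and that Spark is reachable through the /opt/spark symbolic link created earlier; the PROFILE_FILE variable is introduced here only for illustration.

```shell
# Assumption: the login shell reads ~/.bash_profile.
PROFILE_FILE="${PROFILE_FILE:-$HOME/.bash_profile}"

# Make the variables available in the current session.
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"

# Persist them for future logins (quoted heredoc keeps $PATH unexpanded in the file).
cat >> "$PROFILE_FILE" <<'EOF'
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"
EOF
```

Exporting in the current session as well means the Spark commands under $SPARK_HOME/bin are on the PATH immediately, without waiting for the next login.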