
Use Amazon EMR with S3 Access Grants to scale Spark access to Amazon S3


Amazon EMR is happy to announce integration with Amazon Simple Storage Service (Amazon S3) Access Grants, which simplifies Amazon S3 permission management and allows you to enforce granular access at scale. With this integration, you can scale job-based Amazon S3 access for Apache Spark jobs across all Amazon EMR deployment options and enforce granular Amazon S3 access for a better security posture.

In this post, we'll walk through a few different scenarios of how to use Amazon S3 Access Grants. Before we get started on the Amazon EMR and Amazon S3 Access Grants integration, we'll set up and configure S3 Access Grants. Then, we'll use the AWS CloudFormation template below to create an Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) cluster, an EMR Serverless application, and two different job roles.

After the setup, we'll run a few scenarios of how you can use Amazon EMR with S3 Access Grants. First, we'll run a batch job on EMR on Amazon EC2 to import CSV data and convert it to Parquet. Second, we'll use Amazon EMR Studio with an interactive EMR Serverless application to analyze the data. Finally, we'll show how to set up cross-account access for Amazon S3 Access Grants. Many customers use different accounts across their organization, and even outside their organization, to share data. Amazon S3 Access Grants makes it easy to grant cross-account access to your data, even when filtering by different prefixes.

Beyond this post, you can learn more about Amazon S3 Access Grants from Scaling data access with Amazon S3 Access Grants.

Prerequisites

Before you launch the AWS CloudFormation stack, ensure you have the following:

  • An AWS account that provides access to AWS services
  • The latest version of the AWS Command Line Interface (AWS CLI)
  • An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS CLI, and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation
  • A second AWS account if you wish to test the cross-account functionality

Walkthrough

Create resources with AWS CloudFormation

In order to use Amazon S3 Access Grants, you'll need a cluster with Amazon EMR 6.15.0 or later. For more information, see the documentation for using Amazon S3 Access Grants with an Amazon EMR cluster, an Amazon EMR on EKS cluster, and an Amazon EMR Serverless application. For the purpose of this post, we'll assume that you have two different types of data access users in your organization: analytics engineers with read and write access to the data in the bucket, and business analysts with read-only access. We'll utilize two different AWS IAM roles, but you can also connect your own identity provider directly to IAM Identity Center if you like.

Right here’s the structure for this primary portion. The AWS CloudFormation stack creates the next AWS sources:

  • A Virtual Private Cloud (VPC) stack with private and public subnets to use with EMR Studio, route tables, and a Network Address Translation (NAT) gateway.
  • An Amazon S3 bucket for EMR artifacts like log files, Spark code, and Jupyter notebooks.
  • An Amazon S3 bucket with sample data to use with S3 Access Grants.
  • An Amazon EMR cluster configured to use runtime roles and S3 Access Grants.
  • An Amazon EMR Serverless application configured to use S3 Access Grants.
  • An Amazon EMR Studio where users can log in and create workspace notebooks with the EMR Serverless application.
  • Two AWS IAM roles we'll use for our EMR job runs: one for Amazon EC2 with write access and another for Serverless with read access.
  • One AWS IAM role that will be used by S3 Access Grants to access bucket data (that is, the role to use when registering a location with S3 Access Grants; S3 Access Grants uses this role to create temporary credentials). An example trust policy for such a role follows this list.
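
The CloudFormation template creates that last role for you, but if you bring your own, the role you register with an S3 Access Grants location needs a trust policy that lets the S3 Access Grants service assume it. The following is a minimal sketch of such a trust policy; adapt it to your environment and confirm it against the current S3 Access Grants documentation:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "access-grants.s3.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetSourceIdentity",
                "sts:SetContext"
            ]
        }
    ]
}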

To get started, complete the following steps:

  1. Choose Launch Stack:
  2. Accept the defaults and select I acknowledge that this template may create IAM resources.

The AWS CloudFormation stack takes roughly 10–15 minutes to complete. Once the stack is finished, go to the Outputs tab, where you'll find the information needed for the following steps.

Create Amazon S3 Access Grants resources

First, we're going to create the Amazon S3 Access Grants resources in our account. We create an S3 Access Grants instance, an S3 Access Grants location that refers to the data bucket created by the AWS CloudFormation stack and is only accessible by our data bucket AWS IAM role, and then grant different levels of access to our reader and writer roles.

To create the necessary S3 Access Grants resources, use the following AWS CLI commands as an administrative user and replace any of the fields between the arrows with the outputs from your CloudFormation stack.

aws s3control create-access-grants-instance \
  --account-id <YOUR_ACCOUNT_ID>

Next, we create a new S3 Access Grants location. What's a location? Amazon S3 Access Grants works by vending AWS IAM credentials with access scoped to a specific S3 prefix. An S3 Access Grants location is associated with an AWS IAM role from which these temporary sessions are created.

In our case, we're going to scope the AWS IAM role to the bucket created with our AWS CloudFormation stack and give access to the data bucket role created by the stack. Go to the Outputs tab to find the values to substitute into the following code snippet:

aws s3control create-access-grants-location \
  --account-id <YOUR_ACCOUNT_ID> \
  --location-scope "s3://<DATA_BUCKET>/" \
  --iam-role-arn <DATA_BUCKET_ROLE>

Note the AccessGrantsLocationId value in the response. We'll need that for the next steps, where we'll walk through creating the necessary S3 Access Grants to limit read and write access to your bucket.
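
If you need to retrieve that location ID again later, you can list the registered locations at any time:

aws s3control list-access-grants-locations \
  --account-id <YOUR_ACCOUNT_ID>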

  • For the read/write user, use s3control create-access-grant to allow READWRITE access to the "output/*" prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READWRITE \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_WRITER_ROLE>

  • For the read user, use s3control create-access-grant again to allow only READ access to the same prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READ \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_READER_ROLE>
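
At this point, you can optionally confirm that both grants were registered as expected by listing them; the response should show one READWRITE grant for the writer role and one READ grant for the reader role on the output/* prefix:

aws s3control list-access-grants \
  --account-id <YOUR_ACCOUNT_ID>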

Demo Scenario 1: Amazon EMR on EC2 Spark job to generate Parquet data

Now that we've got our Amazon EMR environments set up and granted access to our roles via S3 Access Grants, it's important to note that the two AWS IAM roles for our EMR cluster and EMR Serverless application have an IAM policy that only allows access to our EMR artifacts bucket. They have no IAM access to our S3 data bucket and instead use S3 Access Grants to fetch short-lived credentials scoped to the bucket and prefix. Specifically, the roles are granted the s3:GetDataAccess and s3:GetAccessGrantsInstanceForPrefix permissions to request access via the specific S3 Access Grants instance created in our region. This allows you to manage your S3 access in one place in a highly scoped and granular fashion that enhances your security posture. By combining S3 Access Grants with job roles on EMR on Amazon Elastic Kubernetes Service (Amazon EKS) and EMR Serverless, as well as runtime roles for Amazon EMR steps beginning with EMR 6.7.0, you can easily manage access control for individual jobs or queries. S3 Access Grants support is available on EMR 6.15.0 and later. Let's first run a Spark job on EMR on EC2 as our analytics engineer to convert some sample data into Parquet.
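
For illustration, the statement that enables this in the job roles' IAM policy looks roughly like the following. This is a sketch based on the actions named above and the default S3 Access Grants instance in us-east-2; the policy created by the CloudFormation template may be scoped differently:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowRequestsViaAccessGrants",
            "Effect": "Allow",
            "Action": [
                "s3:GetDataAccess",
                "s3:GetAccessGrantsInstanceForPrefix"
            ],
            "Resource": "arn:aws:s3:us-east-2:<YOUR_ACCOUNT_ID>:access-grants/default"
        }
    ]
}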

For this, use the sample code provided in converter.py. Download the file and copy it to the EMR_ARTIFACTS_BUCKET created by the AWS CloudFormation stack. We'll submit our job with the ReadWrite AWS IAM role. Note that for the EMR cluster, we configured S3 Access Grants to fall back to the IAM role if access isn't provided by S3 Access Grants. The DATA_WRITER_ROLE has read access to the EMR artifacts bucket through an IAM policy so it can read our script. As before, replace all the values with the <> symbols from the Outputs tab of your CloudFormation stack.
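
We won't reproduce converter.py here, but a minimal sketch of a CSV-to-Parquet script like it might look as follows. The input dataset path is hypothetical, and the actual sample script may read from a different source and apply a different schema:

import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The job receives the S3 Access Grants-protected output prefix as its only argument
    output_path = sys.argv[1]

    spark = SparkSession.builder.appName("csv-to-parquet-converter").getOrCreate()

    # Hypothetical CSV source bucket; substitute your own input data
    df = spark.read.option("header", "true").csv("s3://<SOURCE_CSV_BUCKET>/weather/")

    # Write the data back out as Parquet; credentials for this prefix are vended by S3 Access Grants
    df.write.mode("overwrite").parquet(output_path)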

aws s3 cp converter.py s3://<EMR_ARTIFACTS_BUCKET>/code/
aws emr add-steps --cluster-id <EMR_CLUSTER_ID> \
    --execution-role-arn <DATA_WRITER_ROLE> \
    --steps '[
        {
            "Type": "CUSTOM_JAR",
            "Name": "converter",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Args": [
                    "spark-submit",
                    "--deploy-mode",
                    "client",
                    "s3://<EMR_ARTIFACTS_BUCKET>/code/converter.py",
                    "s3://<DATA_BUCKET>/output/weather-data/"
            ]
        }
    ]'

Once the job finishes, we should see some Parquet data in s3://<DATA_BUCKET>/output/weather-data/. You can see the status of the job in the Steps tab of the EMR console.
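
To confirm the output from the command line, you can list the prefix as your administrative user (this relies on your admin IAM permissions, not S3 Access Grants):

aws s3 ls s3://<DATA_BUCKET>/output/weather-data/ --recursive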

Demo Scenario 2: EMR Studio with an interactive EMR Serverless application to analyze data

Now let's go ahead and log in to EMR Studio and connect to your EMR Serverless application with the ReadOnly runtime role to analyze the data from scenario 1. First we need to enable the interactive endpoint on your Serverless application.

  • Select the EMRStudioURL in the Outputs tab of your AWS CloudFormation stack.
  • Select Applications under the Serverless section on the left-hand side.
  • Select the EMRBlog application, then the Actions dropdown, and Configure.
  • Expand the Interactive endpoint section and make sure that Enable interactive endpoint is checked.
  • Scroll down and click Configure application to save your changes.
  • Back on the Applications page, select the EMRBlog application, then the Start application button.

Next, create a new workspace in our Studio.

  • Choose Workspaces on the left-hand side, then the Create workspace button.
  • Enter a Workspace name, leave the remaining defaults, and choose Create Workspace.
  • After creating the workspace, it should launch in a new tab in a few seconds.

Now connect your Workspace to your EMR Serverless application.

  • Select the EMR Compute button on the left-hand side.
  • Choose EMR Serverless as the compute type.
  • Choose the EMRBlog application and the runtime role that starts with EMRBlog.
  • Choose Attach. The window will refresh and you can open a new PySpark notebook and follow along below. To execute the code yourself, download the AccessGrantsReadOnly.ipynb notebook and upload it into your workspace using the Upload Files button in the file browser.

Let's do a quick read of the data.

df = spark.read.parquet(f"s3://{DATA_BUCKET}/output/weather-data/")
df.createOrReplaceTempView("weather")
df.show()

We'll do a simple count(*):

spark.sql("SELECT year, COUNT(*) FROM weather GROUP BY 1").show()


You can also see that if we try to write data to the output location, we get an Amazon S3 error.

df.write.format("csv").mode("overwrite").save("s3://<DATA_BUCKET>/output/weather-data-2/")

While you can also grant similar access via AWS IAM policies, Amazon S3 Access Grants can be useful for situations where your organization has outgrown managing access via IAM, wants to map S3 Access Grants to IAM Identity Center principals or roles, or has previously used EMR File System (EMRFS) role mappings. S3 Access Grants credentials are also temporary, providing more secure access to your data. In addition, as shown below, cross-account access also benefits from the simplicity of S3 Access Grants.
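
If you're curious what those temporary credentials look like, a sketch like the following requests them directly with the CLI, assuming the caller has the s3:GetDataAccess permission and a matching grant exists for the prefix; the response contains a short-lived access key, secret key, and session token scoped to the matched grant:

aws s3control get-data-access \
  --account-id <YOUR_ACCOUNT_ID> \
  --target "s3://<DATA_BUCKET>/output/weather-data/*" \
  --permission READ \
  --privilege Default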

Demo Scenario 3: Cross-account access

One of the other more common access patterns is accessing data across accounts. This pattern has become increasingly common with the emergence of data mesh, where data producers and consumers are decentralized across different AWS accounts.

Previously, cross-account access required setting up complex cross-account assume-role actions and custom credentials providers when configuring your Spark job. With S3 Access Grants, we only need to do the following:

  • Create an Amazon EMR job role and cluster in a second data consumer account
  • The data producer account grants access to the data consumer account with a new instance resource policy
  • The data producer account creates an access grant for the data consumer job role

And that's it! If you have a second account handy, go ahead and deploy this AWS CloudFormation stack in the data consumer account to create a new EMR Serverless application and job role. If not, just follow along below. The AWS CloudFormation stack should finish creating in under a minute. Next, let's go ahead and grant our data consumer access to the S3 Access Grants instance in our data producer account.

  • Replace <DATA_PRODUCER_ACCOUNT_ID> and <DATA_CONSUMER_ACCOUNT_ID> with the relevant 12-digit AWS account IDs.
  • You may also need to change the region in the command and policy.
    aws s3control put-access-grants-instance-resource-policy \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --policy '{
        "Version": "2012-10-17",
        "Id": "S3AccessGrantsPolicy",
        "Statement": [
            {
                "Sid": "AllowAccessToS3AccessGrants",
                "Principal": {
                    "AWS": "<DATA_CONSUMER_ACCOUNT_ID>"
                },
                "Effect": "Allow",
                "Action": [
                    "s3:ListAccessGrants",
                    "s3:ListAccessGrantsLocations",
                    "s3:GetDataAccess"
                ],
                "Resource": "arn:aws:s3:us-east-2:<DATA_PRODUCER_ACCOUNT_ID>:access-grants/default"
            }
        ]
    }'

  • Then grant READ access to the output folder to our EMR Serverless job role in the data consumer account.
    aws s3control create-access-grant \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --access-grants-location-id default \
        --access-grants-location-configuration S3SubPrefix="output/*" \
        --permission READ \
        --grantee GranteeType=IAM,GranteeIdentifier=arn:aws:iam::<DATA_CONSUMER_ACCOUNT_ID>:role/<EMR_SERVERLESS_JOB_ROLE>

Now that we've done that, we can read data in the data consumer account from the bucket in the data producer account. We'll just run a simple COUNT(*) again. Replace <APPLICATION_ID>, <DATA_CONSUMER_JOB_ROLE>, and <DATA_CONSUMER_LOG_BUCKET> with the values from the Outputs tab of the AWS CloudFormation stack created in your second account.

And replace <DATA_PRODUCER_BUCKET> with the bucket from your first account.

aws emr-serverless start-job-run \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <DATA_CONSUMER_JOB_ROLE> \
  --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://<DATA_CONSUMER_LOG_BUCKET>/logs/"
            }
        }
    }' \
  --job-driver '{
    "sparkSubmit": {
        "entryPoint": "SELECT COUNT(*) FROM parquet.`s3://<DATA_PRODUCER_BUCKET>/output/weather-data/`",
        "sparkSubmitParameters": "--class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -e"
    }
  }'

Wait for the job to reach a completed state, and then fetch the stdout log from your bucket, replacing <APPLICATION_ID> and <JOB_RUN_ID> from the job above, and <DATA_CONSUMER_LOG_BUCKET>.

aws emr-serverless get-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>
{
    "jobRun": {
        "applicationId": "00feq2s6g89r2n0d",
        "jobRunId": "00feqnp2ih45d80e",
        "state": "SUCCESS",
        ...
}

If you are on a Unix-based machine and have gunzip installed, you can use the following command as your administrative user.

Note that this command only uses AWS IAM role policies, not Amazon S3 Access Grants.

aws s3 cp s3://<DATA_CONSUMER_LOG_BUCKET>/logs/applications/<APPLICATION_ID>/jobs/<JOB_RUN_ID>/SPARK_DRIVER/stdout.gz - | gunzip

Otherwise, you can use the get-dashboard-for-job-run command and open the resulting URL in your browser to view the driver stdout logs in the Executors tab of the Spark UI.

aws emr-serverless get-dashboard-for-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>

Cleaning up

In order to avoid incurring future costs for the example resources in your AWS accounts, be sure to take the following steps:

  • You must manually delete the Amazon EMR Studio workspace created in the first part of the post
  • Empty the Amazon S3 buckets created by the AWS CloudFormation stacks
  • Make sure you delete the Amazon S3 Access Grants, resource policies, and S3 Access Grants location created in the steps above using the delete-access-grant, delete-access-grants-instance-resource-policy, delete-access-grants-location, and delete-access-grants-instance commands (a sketch of these commands follows this list)
  • Delete the AWS CloudFormation stacks created in each account
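
Here's a sketch of those cleanup commands, run as an administrative user. The grant and location IDs are placeholders that you can look up with the corresponding list commands, and you'll need to repeat delete-access-grant for each grant you created:

aws s3control delete-access-grant --account-id <YOUR_ACCOUNT_ID> --access-grant-id <ACCESS_GRANT_ID>
aws s3control delete-access-grants-location --account-id <YOUR_ACCOUNT_ID> --access-grants-location-id <ACCESS_GRANTS_LOCATION_ID>
aws s3control delete-access-grants-instance-resource-policy --account-id <DATA_PRODUCER_ACCOUNT_ID>
aws s3control delete-access-grants-instance --account-id <YOUR_ACCOUNT_ID>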

Comparison to AWS IAM Role Mapping

In 2018, EMR introduced EMRFS role mapping as a way to provide storage-level authorization by configuring EMRFS with multiple IAM roles. While effective, role mapping required managing users or groups locally on your EMR cluster in addition to maintaining the mappings between those identities and their corresponding IAM roles. In conjunction with runtime roles on EMR on EC2 and job roles for EMR on EKS and EMR Serverless, it's now easier to grant access to your data on S3 directly to the relevant principal on a per-job basis.

Conclusion

In this post, we showed you how to set up and use Amazon S3 Access Grants with Amazon EMR in order to easily manage data access for your Amazon EMR workloads. With S3 Access Grants and EMR, you can easily configure access to data on S3 for IAM identities or by using your corporate directory in IAM Identity Center as your identity source. S3 Access Grants is supported across EMR on EC2, EMR on EKS, and EMR Serverless starting in EMR release 6.15.0.

To learn more, see the S3 Access Grants and EMR documentation, and feel free to ask any questions in the comments!


About the author

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.
