
Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting


In AWS, hundreds of thousands of customers use AWS Glue, a serverless data integration service, to discover, combine, and prepare data for analytics and machine learning. When you have complex datasets and demanding Apache Spark workloads, you may experience performance bottlenecks or errors during Spark job runs. Troubleshooting these issues can be difficult and can delay getting jobs working in production. Customers often use the Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. AWS Glue supports the Spark UI in two different ways, but you need to set it up yourself. This requires time and effort spent managing networking and EC2 instances, or trial-and-error with Docker containers.

Today, we are pleased to announce serverless Spark UI built into the AWS Glue console. You can now use the Spark UI easily because it is a built-in component of the AWS Glue console, enabling you to access it with a single click when examining the details of any given job run. There is no infrastructure setup or teardown required. AWS Glue serverless Spark UI is a fully managed serverless offering and typically starts up in a matter of seconds. Serverless Spark UI makes it significantly faster and easier to get jobs working in production because you have ready access to low-level details for your job runs.

This post describes how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue job runs.

Getting started with serverless Spark UI

You can access the serverless Spark UI for a given AWS Glue job run by navigating from your job's page in the AWS Glue console.

  1. On the AWS Glue console, choose ETL jobs.
  2. Choose your job.
  3. Choose the Runs tab.
  4. Select the job run you want to investigate, then choose Spark UI.

The Spark UI will display in the lower pane, as shown in the following screen capture:

Alternatively, you can get to the serverless Spark UI for a specific job run by navigating from Job run monitoring in AWS Glue.

  1. On the AWS Glue console, choose Job run monitoring under ETL jobs.
  2. Select your job run, and choose View run details.

Scroll down to the bottom to view the Spark UI for the job run.

Prerequisites

Complete the following prerequisite steps:

  1. Enable Spark UI event logs for your job runs. This is enabled by default on the Glue console. Once enabled, Spark event log files are created during the job run and stored in your S3 bucket. The serverless Spark UI parses a Spark event log file generated in your S3 bucket to visualize detailed information for both running and completed job runs. A progress bar shows the percentage to completion, with a typical parsing time of less than a minute.
  2. When logs are parsed, you can use the built-in Spark UI to debug, troubleshoot, and optimize your jobs.

For more information about the Apache Spark UI, refer to Web UI in the Apache Spark documentation.
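If you prefer to configure the prerequisite programmatically rather than through the console, the event-log settings are plain job parameters. The following boto3 sketch turns them on for an existing job; the job name and S3 path are assumptions for illustration:

```python
def spark_ui_args(log_path: str) -> dict:
    """Job parameters that turn on Spark UI event logging."""
    return {
        "--enable-spark-ui": "true",          # emit Spark event logs
        "--spark-event-logs-path": log_path,  # S3 prefix the serverless UI parses
    }


def enable_spark_ui(job_name: str, log_path: str) -> None:
    """Merge the Spark UI parameters into a job's existing default arguments."""
    import boto3  # imported here so the helper above stays dependency-free

    glue = boto3.client("glue")
    job = glue.get_job(JobName=job_name)["Job"]
    merged = {**job.get("DefaultArguments", {}), **spark_ui_args(log_path)}
    # UpdateJob replaces the JobUpdate structure, so carry over at least
    # the required Role and Command fields alongside the new arguments.
    glue.update_job(
        JobName=job_name,
        JobUpdate={
            "Role": job["Role"],
            "Command": job["Command"],
            "DefaultArguments": merged,
        },
    )


# Example (hypothetical job and bucket names):
# enable_spark_ui("my-etl-job", "s3://amzn-s3-demo-bucket/sparkHistoryLogs/")
```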

Monitor and Troubleshoot with Serverless Spark UI

A common workload for AWS Glue for Apache Spark jobs is loading data from relational databases to S3-based data lakes. This section demonstrates how to monitor and troubleshoot an example job run for this workload with serverless Spark UI. The sample job reads data from a MySQL database and writes it to S3 in Parquet format. The source table has approximately 70 million records.

The following screen capture shows a sample visual job authored in the AWS Glue Studio visual editor. In this example, the source MySQL table has already been registered in the AWS Glue Data Catalog in advance. It can be registered through an AWS Glue crawler or the AWS Glue Data Catalog API. For more information, refer to Data Catalog and crawlers in AWS Glue.
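In script form, the equivalent of this visual job is a short Glue ETL script. The sketch below assumes hypothetical database, table, and bucket names; the awsglue and pyspark modules are only available inside the Glue job runtime, so they are imported lazily:

```python
def parquet_sink_options(output_path: str) -> dict:
    """S3 sink options for the Parquet writer (no partition columns assumed)."""
    return {"path": output_path, "partitionKeys": []}


def run_job(database: str, table_name: str, output_path: str) -> None:
    """Read a catalog-registered MySQL table and write it to S3 as Parquet."""
    # These modules are provided by the AWS Glue job environment.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read through the Data Catalog entry for the MySQL table; the
    # parallelism of this read is what the table-level hashfield and
    # hashpartitions properties control.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options=parquet_sink_options(output_path),
        format="parquet",
    )


# Inside a Glue job this would be invoked as, for example:
# run_job("sample_db", "employees", "s3://amzn-s3-demo-bucket/output/")
```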

Now it's time to run the job! The first job run finished in 30 minutes and 10 seconds, as shown:

Let's use the Spark UI to optimize the performance of this job run. Open the Spark UI tab on the Job runs page. When you drill down to Stages and view the Duration column, you will notice that Stage Id=0 spent 27.41 minutes running the job, and the stage had only one Spark task in the Tasks:Succeeded/Total column. That means there was no parallelism when loading data from the source MySQL database.

To optimize the data load, introduce parameters called hashfield and hashpartitions to the source table definition. For more information, refer to Reading from JDBC tables in parallel. Continuing to the Glue Catalog table, add two properties: hashfield=emp_no and hashpartitions=18 in Table properties.
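You can add these table properties through the console as described, or programmatically with boto3, as in this sketch (database and table names are assumptions; catalog parameter values are always strings):

```python
def jdbc_parallel_properties(hashfield: str, hashpartitions: int) -> dict:
    """Table properties that enable parallel JDBC reads in Glue."""
    return {"hashfield": hashfield, "hashpartitions": str(hashpartitions)}


def add_table_properties(database: str, table_name: str, props: dict) -> None:
    """Merge properties into a Data Catalog table's Parameters map."""
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    table.setdefault("Parameters", {}).update(props)
    # UpdateTable accepts only TableInput fields, so strip the read-only
    # metadata (CreateTime, CatalogId, and so on) returned by GetTable.
    allowed = {
        "Name", "Description", "Owner", "Retention", "StorageDescriptor",
        "PartitionKeys", "TableType", "Parameters",
    }
    table_input = {k: v for k, v in table.items() if k in allowed}
    glue.update_table(DatabaseName=database, TableInput=table_input)


# Example (hypothetical names), matching the values used in this post:
# add_table_properties("sample_db", "employees",
#                      jdbc_parallel_properties("emp_no", 18))
```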

This means new job runs will parallelize the data load from the source MySQL table.

Let's try running the same job again! This time, the job run finished in 9 minutes and 9 seconds, saving 21 minutes compared to the previous job run.

As a best practice, view the Spark UI and compare the runs before and after the optimization. Drilling down to Completed stages, you will notice that there was one stage with 18 tasks instead of one task.

In the first job run, AWS Glue automatically shuffled data across multiple executors before writing to the destination because there were too few tasks. In the second job run, by contrast, there was only one stage because there was no need for extra shuffling, and there were 18 tasks loading data in parallel from the source MySQL database.

Considerations

Keep in mind the following considerations:

  • Serverless Spark UI is supported in AWS Glue 3.0 and later.
  • Serverless Spark UI is available for jobs that ran after November 20, 2023, due to a change in how AWS Glue emits and stores Spark logs.
  • Serverless Spark UI can visualize Spark event logs up to 1 GB in size.
  • There is no limit on retention, because serverless Spark UI scans the Spark event log files in your S3 bucket.
  • Serverless Spark UI is not available for Spark event logs stored in an S3 bucket that can only be accessed by your VPC.
  • Spark UI in the AWS Glue console does not support rolling logs, such as those generated by default in streaming jobs. You can turn off rolling logs for a streaming job by passing in additional configuration. Be aware that very large log files may be costly to maintain. To turn off rolling logs, provide the following two job parameters:
    • Key: --spark-ui-event-logs-path, Value: true
    • Key: --conf, Value: spark.eventLog.rolling.enabled=false
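When setting these parameters programmatically, they go into the same default-arguments map as any other Glue job parameter. A minimal sketch, using the keys exactly as listed above:

```python
def disable_rolling_logs_args() -> dict:
    """Job parameters (from the list above) that turn off rolling event logs.

    The --conf value is passed straight through to the Spark configuration.
    """
    return {
        "--spark-ui-event-logs-path": "true",
        "--conf": "spark.eventLog.rolling.enabled=false",
    }


# These would be merged into DefaultArguments when creating or updating
# the streaming job, for example via glue.update_job(...).
```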

Conclusion

This post described how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue jobs. By providing instant access to the Spark UI directly within the AWS Management Console, you can now examine the low-level details of job runs to identify and resolve issues. With the serverless Spark UI, there is no infrastructure to manage: the UI spins up automatically for each job run and tears down when no longer needed. This streamlined experience saves you time and effort compared to launching Spark UIs yourself.

Give the serverless Spark UI a try today. We think you'll find it invaluable for optimizing performance and quickly troubleshooting errors. We look forward to hearing your feedback as we continue improving the AWS Glue console experience.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Glue team in New York City. She is a passionate advocate for usability and accessibility. In her free time, she's an espresso enthusiast and enjoys building mechanical keyboards.

Matt Sampson is a Software Development Manager on the AWS Glue team. He loves working with his fellow Glue team members to build services that our customers benefit from. Outside of work, he can be found fishing and maybe singing karaoke.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers discover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.
