Introduction

AWS Glue gives data engineers a broad toolkit for building ETL processes on Amazon resources. It provisions the compute units on which jobs run, and those jobs can be Python or Spark scripts written from scratch or built in AWS Glue Studio with its interactive visual designer. The designer has a simple interface and comes with a helpful set of ready-to-use transformations. Still, it also has some limitations and problems.

The Limitations

The visual designer automatically generates a script for every added transformation. This script can be modified; however, any change to it blocks further visual development, because user code cannot be translated back into visual transformations.

Currently there are 15 available transformations, such as Select Fields, Join, or Filter. These basic operations cover most typical data operations, yet there is always a need for more complex calculations. In those situations, the SQL and Custom transformations come to the rescue. The first extends the job’s capabilities only with SQL functions. The second lets you create a new transformation from a user-written Python function, which can accept only one input collection and must always return a DynamicFrameCollection.
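
As a rough illustration, a Custom transformation might look like the sketch below. It assumes the function template that Glue Studio generates, where the function receives the Glue context alongside the single input collection. The names KeepAdults, adults and single_input_key, as well as the age column, are hypothetical. The awsglue import sits inside the function only so that the small helper can be tried outside a Glue job.

```python
def single_input_key(keys):
    """Return the only key of the input collection, or fail loudly."""
    keys = list(keys)
    if len(keys) != 1:
        raise ValueError(f"expected exactly one input frame, got {len(keys)}")
    return keys[0]


def KeepAdults(glueContext, dfc):
    """Hypothetical custom transformation: keep rows where age >= 18."""
    # Imported here because awsglue is only available inside a Glue environment.
    from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

    # Unwrap the single input frame and switch to a Spark DataFrame.
    source = dfc.select(single_input_key(dfc.keys()))
    df = source.toDF()

    # The actual business logic (the "age" column is an assumption).
    adults = df.filter(df["age"] >= 18)

    # Wrap the result again: the function must return a DynamicFrameCollection.
    result = DynamicFrame.fromDF(adults, glueContext, "adults")
    return DynamicFrameCollection({"adults": result}, glueContext)
```

Because the result is always a collection, a later step usually has to pick the single frame back out of it before other transformations can use it.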


If a job needs additional parameters, they must be added in the job’s configuration and also added manually to the script. For a developer who built the job from visual templates, this makes further development in the visual designer impossible, because there is no visual operation for adding job parameters to the script.
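
The manual script change usually amounts to a few lines. The sketch below assumes a hypothetical parameter named target_bucket, passed to the job as --target_bucket. Inside Glue it is read with getResolvedOptions from awsglue.utils; the argparse fallback and the read_job_parameters wrapper are assumptions added here only so the snippet also runs on a local machine.

```python
import argparse


def read_job_parameters(argv, names):
    """Return the named job parameters as a dict of strings."""
    try:
        # Available inside a Glue job.
        from awsglue.utils import getResolvedOptions
        return getResolvedOptions(argv, names)
    except ImportError:
        # Local fallback (not part of Glue): parse "--name value" pairs.
        parser = argparse.ArgumentParser()
        for name in names:
            parser.add_argument(f"--{name}", required=True)
        known, _unknown = parser.parse_known_args(argv[1:])
        return vars(known)


# In the job script (hypothetical parameter name):
#   import sys
#   params = read_job_parameters(sys.argv, ["JOB_NAME", "target_bucket"])
#   bucket = params["target_bucket"]
```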


The Problems

Some transformations, such as SelectFields, do not handle empty datasets properly. If an empty dataset needs to be processed, those transformations return an empty object without headers. This in turn leads to an error in the next step if any processing is applied to the expected columns.
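
One possible guard, sketched here rather than taken from Glue itself, is to check the columns after such a transformation and substitute an empty DataFrame that carries the expected schema. The helper names missing_columns and ensure_schema are hypothetical; df is assumed to be a Spark DataFrame (for example the result of toDF()), spark a SparkSession, and expected_schema a StructType listing the columns the next step relies on.

```python
def missing_columns(actual_columns, required_columns):
    """Return the required columns that are absent, preserving their order."""
    present = set(actual_columns)
    return [name for name in required_columns if name not in present]


def ensure_schema(df, spark, expected_schema):
    """Return df, or an empty DataFrame with expected_schema if df lost its columns.

    df:              pyspark.sql.DataFrame, e.g. dynamic_frame.toDF()
    spark:           SparkSession, e.g. glueContext.spark_session
    expected_schema: pyspark.sql.types.StructType with the expected columns
    """
    missing = missing_columns(df.columns, expected_schema.fieldNames())
    if not missing:
        return df
    if df.head(1):
        # Rows exist but columns are missing: a genuine schema problem.
        raise ValueError(f"missing columns in non-empty dataset: {missing}")
    # Empty input without headers: rebuild an empty frame that has them.
    return spark.createDataFrame([], expected_schema)
```

The returned DataFrame can be converted back with DynamicFrame.fromDF before it is handed to the next step.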


There are also several problems with the web interface itself. For example, a large number of visual transformations in one job slows the designer down to the point of being unusable, and changing the data type of a single column in ApplyMapping through the selection menu sometimes causes unexpected changes in all other columns.

Data preview is a great addition to AWS Glue Studio, as it lets you observe how a sample of the data is processed by each transformation. However, if there is any error in the job, it prints a generic error message and restarts, printing the same message over and over. This makes it impossible to diagnose the error, which sometimes forces you to leave Data preview and run the job in standard mode.