Data lake
A data lake is built to store large volumes of raw data of all kinds, which is then accessed, processed, and analyzed as needed. For the storage layer, the common open-source choice is HDFS; the major cloud vendors also provide their own storage services, such as Amazon S3 and Azure Blob Storage.
Because the data in a data lake is raw, it generally needs ETL (Extract-Transform-Load) before it can be used. For large data sets, Spark (and PySpark) is the commonly used framework. Once the ETL is done, the cleaned data is written back to the storage system (e.g. HDFS, S3), and data analysts or machine learning engineers can run analyses or train models on it. Throughout this process, another very important question is: how is the metadata of the data managed?
In AWS, the Glue service provides not only ETL but also metadata management. Below we use S3 + Glue + EMR to walk through a simple data lake + ETL + data analysis workflow.
Prepare data
This time I used the GDELT data set, available at:
https://registry.opendata.aws/gdelt/
In this data set, each file name contains the date of its data. As the raw data, we first put the 2015 data under a year=2015 S3 prefix:
aws s3 cp s3://xxx/data/20151231.export.csv s3://xxxx/gdelt/year=2015/20151231.export.csv
Use Glue to crawl data definitions
Create a crawler in Glue to infer the schema of this data. The data source path is set to s3://xxxx/gdelt/.
For details on how crawlers work, see the official AWS documentation:
https://docs.aws.amazon.com/zh_cn/glue/latest/dg/console-crawlers.html
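For reference, the same crawler can also be created and started from the AWS CLI; in the sketch below, the crawler name and IAM role are placeholders:

# Create a crawler over the raw GDELT prefix; the inferred table is written
# to the "default" database in the Glue Data Catalog
aws glue create-crawler \
    --name gdelt-raw-crawler \
    --role AWSGlueServiceRoleDefault \
    --database-name default \
    --targets '{"S3Targets": [{"Path": "s3://xxxx/gdelt/"}]}'

# Run the crawler once
aws glue start-crawler --name gdelt-raw-crawler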
After the crawler finishes, you can see the newly created gdelt table in the Glue Data Catalog:
The raw data is in CSV format. Since there is no header row, the columns are named col0, col1, ..., col57. Because the directory structure under S3 is year=2015, the crawler automatically recognizes year as a partition column.
At this point, the metadata of the raw data is stored in Glue. Before doing the ETL, we can use AWS EMR to verify that this metadata is usable.
AWS EMR
AWS EMR is AWS's managed big data cluster service. With one click you can launch a cluster preinstalled with common frameworks such as Hive, HBase, Presto, and Spark.
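For reference, a comparable cluster can also be launched from the AWS CLI. The sketch below is only illustrative (the release label, instance type and count, and key name are assumptions); the configurations block is what points Hive and Spark at the Glue Data Catalog as their metastore:

aws emr create-cluster \
    --name "gdelt-analysis" \
    --release-label emr-5.29.0 \
    --applications Name=Hive Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair \
    --configurations '[
      {"Classification": "hive-site",
       "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},
      {"Classification": "spark-hive-site",
       "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}
    ]'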
Launch an EMR cluster, select Hive and Spark, and configure them to use Glue as the metastore for their tables. After EMR starts, log in to the master node and start Hive:
> show tables;
gdelt
Time taken: 0.154 seconds, Fetched: 1 row(s)
The table is visible in Hive. Now execute a query:
> select * from gdelt where year=2015 limit 3;
OK
498318487 20060102 200601 2006 2006.0055 CVL COMMUNITY CVL 1 53 53 5 1 3.8 3 1 3 -2.42718446601942 1 United States US US 38.0 -97.0 US 0 NULL NULL 1 United States US US 38.0 -97.0 US 20151231 http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896 2015
498318488 20060102 200601 2006 2006.0055 CVL COMMUNITY CVL USA UNITED STATES USA 1 51 51 5 1 3.4 3 1 3 -2.42718446601942 1 United States US US 38.0 -97.0 US 1 United States US US 38.0 -97.0 US 1 United States US US 38.0 -97.0 US 20151231 http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896 2015
498318489 20060102 200601 2006 2006.0055 CVL COMMUNITY CVL USA UNITED STATES USA 1 53 53 5 1 3.8 3 1 3 -2.42718446601942 1 United States US US 38.0 -97.0 US 1 United States US US 38.0 -97.0 US 1 United States US US 38.0 -97.0 US 20151231 http://www.inlander.com/spokane/after-dolezal/Content?oid=2646896 2015
You can see that the raw data has many columns. Suppose we only need 4 of them for our analysis: event ID, country code, date, and URL. Our next step is therefore the ETL.
GLUE ETL
The Glue service also provides ETL tooling: you can write a Spark- or Python-based script and submit it to Glue ETL for execution. In this example we extract the col0, col52, col56, col57, and year columns and rename them, keep only the records whose country code is "UK", and finally write the result to the output S3 directory under date=current_day in Parquet format. The Glue programming interface can be used from Python or Scala; this article uses Scala:
import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import java.text.SimpleDateFormat
import java.util.Date

object Gdelt_etl {
  def main(sysArgs: Array[String]) {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark = glueContext.getSparkSession

    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Source db and table in the Glue Data Catalog
    val dbName = "default"
    val tblName = "gdelt"

    // S3 location for the output, partitioned by the current date
    val format = new SimpleDateFormat("yyyy-MM-dd")
    val curdate = format.format(new Date())
    val outputDir = "s3://xxx-xxx-xxx/cleaned-gdelt/date=" + curdate + "/"

    // Read the raw data into a DynamicFrame
    val raw_data = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()

    // Keep only the columns we need and rename them
    val cleanedDyF = raw_data.applyMapping(Seq(
      ("col0", "long", "EventID", "string"),
      ("col52", "string", "CountryCode", "string"),
      ("col56", "long", "Date", "string"),
      ("col57", "string", "url", "string"),
      ("year", "string", "year", "string")))

    // Run Spark SQL on a Spark DataFrame
    val cleanedDF = cleanedDyF.toDF()
    cleanedDF.createOrReplaceTempView("gdlttable")

    // Get only the UK data
    val only_uk_sqlDF = spark.sql("select * from gdlttable where CountryCode = 'UK'")
    val cleanedSQLDyF = DynamicFrame(only_uk_sqlDF, glueContext).withName("only_uk_sqlDF")

    // Write the result out as Parquet
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions(Map("path" -> outputDir)),
      format = "parquet"
    ).writeDynamicFrame(cleanedSQLDyF)

    Job.commit()
  }
}
Save this script as gdelt.scala and submit it as a Glue ETL job (a CLI sketch of this step follows the listing below). After the job finishes executing, we can see that the output file has been generated in S3:
> aws s3 ls s3://xxxx-xxx-xxx/cleaned-gdelt/
date=2020-04-12/
part-00000-d25201b8-2d9c-49a0-95c8-f5e8cbb52b5b-c000.snappy.parquet
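For reference, the script can also be registered and run as a Glue job from the AWS CLI; in the sketch below, the job name, IAM role, and script location are placeholders, and --job-language / --class are set because the script is Scala:

# Upload the script and register it as a Glue ETL (Spark) job
aws s3 cp gdelt.scala s3://xxx-xxx-xxx/scripts/gdelt.scala

aws glue create-job \
    --name gdelt-etl-job \
    --role AWSGlueServiceRoleDefault \
    --command '{"Name": "glueetl", "ScriptLocation": "s3://xxx-xxx-xxx/scripts/gdelt.scala"}' \
    --default-arguments '{"--job-language": "scala", "--class": "Gdelt_etl"}'

# Start a run of the job
aws glue start-job-run --job-name gdelt-etl-job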
Then we run a new Glue crawler over this /cleaned-gdelt/ directory; a CLI sketch is shown below.
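A sketch of this second crawler from the CLI (again, the crawler name and role are placeholders):

aws glue create-crawler \
    --name gdelt-cleaned-crawler \
    --role AWSGlueServiceRoleDefault \
    --database-name default \
    --targets '{"S3Targets": [{"Path": "s3://xxx-xxx-xxx/cleaned-gdelt/"}]}'

aws glue start-crawler --name gdelt-cleaned-crawler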
After this crawler finishes, you can see that a new table has been created in Glue. The structure of this table is:
You can see that the input and output formats are both Parquet, the partition key is date, and the table contains only the columns we need.
Entering Hive on EMR again, you can see that the new table has appeared:
hive> describe cleaned_gdelt;
OK
eventid              string
countrycode          string
date                 string
url                  string
year                 string
date                 string

# Partition Information
# col_name            data_type            comment
date                 string
Query this table:
hive> select * from cleaned_gdelt limit 10;
OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
498318821  UK  20151231  http://wmpoweruser.com/microsoft-denies-lumia-950-xl-withdrawn-due-issues-says-stock-due-strong-demand/  2015
498319466  UK  20151231  http://www.princegeorgecitizen.com/news/police-say-woman-man-mauled-by-2-dogs-in-home-in-british-columbia-1.2142296  2015
498319777  UK  20151231  http://www.catchnews.com/life-society-news/happy-women-do-not-live-any-longer-than-sad-women-1451420391.html  2015
498319915  UK  20151231  http://www.nationalinterest.org/feature/the-perils-eu-army-14770  2015
…
Time taken: 0.394 seconds, Fetched: 10 row(s)
You can see that the CountryCode of every result is UK, which is exactly what we wanted.
Automation
Next, we automate the Glue crawling + ETL steps. In Glue ETL, create a workflow, as shown below:
As shown in the figure, the workflow proceeds as follows (a CLI sketch of the equivalent triggers appears after the list):
- The workflow starts at 11:40 every night
- Trigger the gdelt crawler job to crawl the metadata of the raw data
- Trigger gdelt's ETL job
- Trigger the gdelt-cleaned crawler to crawl the metadata of the cleaned data
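The same workflow can also be wired up from the AWS CLI. The sketch below is only illustrative: the workflow name, crawler names, job name, and the scheduled-trigger name are placeholders, and the schedule assumes 11:40 PM (23:40):

# Create the workflow container
aws glue create-workflow --name gdelt-workflow

# Scheduled trigger: start the raw-data crawler at 23:40 every night
aws glue create-trigger \
    --name raw_crawler_trigger \
    --workflow-name gdelt-workflow \
    --type SCHEDULED \
    --schedule "cron(40 23 * * ? *)" \
    --actions CrawlerName=gdelt-raw-crawler \
    --start-on-creation

# Conditional trigger: run the ETL job once the raw-data crawler succeeds
aws glue create-trigger \
    --name raw_crawler_done \
    --workflow-name gdelt-workflow \
    --type CONDITIONAL \
    --predicate '{"Conditions": [{"LogicalOperator": "EQUALS", "CrawlerName": "gdelt-raw-crawler", "CrawlState": "SUCCEEDED"}]}' \
    --actions JobName=gdelt-etl-job \
    --start-on-creation

# Conditional trigger: crawl the cleaned data after the ETL job succeeds
aws glue create-trigger \
    --name etl_done \
    --workflow-name gdelt-workflow \
    --type CONDITIONAL \
    --predicate '{"Conditions": [{"LogicalOperator": "EQUALS", "JobName": "gdelt-etl-job", "State": "SUCCEEDED"}]}' \
    --actions CrawlerName=gdelt-cleaned-crawler \
    --start-on-creation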
Next we add a new file to the raw data directory; this new file contains year=2016 data:
aws s3 cp s3://xxx-xxxx/data/20160101.export.csv s3://xxx-xxx-xxx/gdelt/year=2016/20160101.export.csv
Then execute this workflow.
While the workflow runs, we can see that the ETL job is triggered as expected after raw_crawler_done:
After the job is completed, the 2016 data can be queried in Hive:
select * from cleaned_gdelt where year=2016 limit 10;
OK
498554334  UK  20160101  http://medicinehatnews.com/news/national-news/2015/12/31/support-overwhelming-for-bc-couple-mauled-by-dogs-on-christmas-day/  2016
498554336  UK  20160101  http://medicinehatnews.com/news/national-news/2015/12/31/support-overwhelming-for-bc-couple-mauled-by-dogs-on-christmas-day/  2016
…