Use Azure Blob to optimize Hadoop cluster storage costs

  Big data and cloud computing are like two inseparable sides of the same coin: big data is one of the most important application scenarios for cloud computing, and the cloud provides the best technical solutions for big data processing and data mining. The rapid provisioning, elastic scaling, and pay-per-use advantages of cloud computing have brought tremendous change to the IT industry, and the cloud has increasingly become the IT platform of choice for enterprises. While using data to improve business insight, efficiency, and effectiveness, IT departments are also frequently concerned with how to reduce the cost of the big data platform itself.

 

  Why use Azure Blob as the distributed file system for Hadoop big data?

 

  HDFS is the distributed file system of a Hadoop cluster. Files are split into blocks (the default block size is 64 MB) and stored on DataNodes. To lower the risk of data loss from rack failures, HDFS keeps 3 copies of each block. Public cloud storage, whether disk storage or object storage, likewise improves reliability and availability through data redundancy, usually also with 3 copies. You may already see the issue: if you build a Hadoop cluster on public cloud virtual machines and disks, the stacked storage layers produce even more redundant copies, and disk storage is more expensive than Blob object storage, so the overall cost of this approach is higher. If Blob object storage can instead be used as the Hadoop cluster file system to manage large amounts of data, costs can be reduced significantly.

  Can Hadoop use cloud object storage as its distributed file system? Yes. Starting in 2015, Microsoft Azure contributed the hadoop-azure module to the Apache Hadoop community. Through this cloud storage connector, Hadoop integrates with Azure Blob storage (and Azure Data Lake), so it can store and manage data at HDFS-like scale while keeping costs relatively low. Besides Apache Hadoop, both CDH and HDP (Cloudera and Hortonworks have since merged) already support this feature. Hadoop accesses Azure Blob through the WASB protocol and Azure Data Lake through the ADLS protocol. Azure HDInsight, the fully managed big data analytics service that Microsoft has long built in cooperation with Hortonworks, also uses Azure Blob as its default file system.
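For illustration, once the hadoop-azure connector and the account credentials are configured, WASB and ADLS paths can be addressed directly from the Hadoop command line. The account, container, and store names below are placeholders:

# List an Azure Blob container via the WASB scheme
$ hadoop fs -ls wasb://<container>@<account>.blob.core.windows.net/

# List an Azure Data Lake Store (Gen1) via the ADLS scheme
$ hadoop fs -ls adl://<store>.azuredatalakestore.net/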


In addition to Azure Blob, AWS S3 can also be used as a Hadoop distributed file system; the equivalent support on GCP is currently in technology preview.

 

Using HDP 2.6.5 as an example, the following briefly shows how to set up a Hadoop cluster based on Azure Blob + VMs.

 

1. Install HDP 2.6.5

 

We use Ambari to install and configure the HDP cluster, following these steps:

Step 1: Deploy the Ambari VM and a number of Hadoop VMs with an Azure ARM template. The Azure CLI is used here; please install it according to the documentation.

 

$ az group create -n rg-ryhdp01 -l westus2

$ az group deployment create -g rg-ryhdp01 --template-uri https://raw.githubusercontent.com/yongboyang/arm-templates/master/deployhdp26/azuredeploy.json
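Optionally, once the deployment completes, the created VMs can be listed as a quick sanity check (not part of the original steps; output columns depend on the CLI version):

$ az vm list -g rg-ryhdp01 -o table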

 

After the virtual machines are created, a custom script performs the following operations on the Ambari Server VM (a rough sketch of these commands follows the list):

Download the Ambari Server package and install it on the Ambari VM

Add the IP addresses, FQDNs, and host names of all deployed VMs to /etc/hosts

Disable iptables and Transparent Huge Pages on all VMs; leaving iptables enabled will prevent services from starting

Configure passwordless SSH key login
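For reference, a rough sketch of what such a custom script does is shown below; the IP address, host entry, and key paths are illustrative only, and the actual ARM template script may differ:

# Add every deployed VM to /etc/hosts (illustrative entry)
$ echo "10.0.0.11  ry-ambarisrv-n2.westus2.cloudapp.azure.com ry-ambarisrv-n2" >> /etc/hosts

# Disable the firewall (iptables, or firewalld depending on the OS) so Hadoop services can start
$ systemctl stop iptables && systemctl disable iptables

# Disable Transparent Huge Pages for the current boot
$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Generate an SSH key on the Ambari server and distribute it for passwordless login
$ ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
$ ssh-copy-id -i /root/.ssh/id_rsa.pub root@<node-fqdn>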


Step 2: Use Ambari to deploy the HDP 2.6.5 cluster

After the ARM template deployment finishes, log in to the Ambari management portal at http://ry-ambarisrv-n2.westus2.cloudapp.azure.com:8080/ (the username and password are both: admin).


After logging in, click "Launch Install Wizard" to enter the installation interface.


Log in via ssh to ry-ambarisrv-n2.westus2.cloudapp.azure.com to get the list of server FQDNs and the private key. Username / password: linuxadmin / MSLovesLinux!

 

$ sudo -s

$ cat /etc/hosts

$ cat /root/.ssh/id_rsa

 

Fill in the relevant information on the installation pages, then accept the defaults all the way through. Along the way you will be prompted to set passwords for several metadata and service management accounts.


 2. Modify parameters to configure Hive and HBase to use Azure Blob

Create a GPv2 storage account named ryhdp. For production, consider capacity and performance, and use one or more storage accounts according to the storage account quota limits.
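As a reference, a sketch of creating this storage account and the containers used below with the Azure CLI might look like the following; the resource group and region are assumed to match the earlier deployment, and the hdp and data containers correspond to the WASB URIs in the configuration:

$ az storage account create -n ryhdp -g rg-ryhdp01 -l westus2 --kind StorageV2 --sku Standard_LRS
$ az storage account keys list -n ryhdp -g rg-ryhdp01 -o table
$ az storage container create -n hdp --account-name ryhdp --account-key <key>
$ az storage container create -n data --account-name ryhdp --account-key <key>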

In the Ambari management portal, select HDFS -> Configs -> Advanced -> Custom core-site.

# Configure HDFS Custom core-site

Following the naming convention, add the storage account and access key: fs.azure.account.key.<account name>.blob.core.windows.net=<key>. If the account is located in Azure China, use the corresponding Azure China domain: fs.azure.account.key.<account name>.blob.core.chinacloudapi.cn=<key>

The configuration here is as follows (multiple storage accounts can be added):

fs.defaultFS=wasb://[email protected]

fs.azure.account.key.ryhdp.blob.core.windows.net=exyVpVnuBXXKcilWi2KBrCNtm9bGsyHbKsu1hdgNkJePqjQ+sDsllOoAJhR0MyqpokC2AFNpC256PkVQ8VJgQ==

fs.azure.page.blob.dir=/WALs,/oldWALs

fs.azure.page.blob.size=262144000

fs.azure.page.blob.extension.size=262144000

 

Azure Blob comes in two types, Block Blob and Page Blob, which suit different scenarios. Block Blob is the default type and is appropriate for Hive, Pig, and MapReduce jobs; each file can be up to 200 GB with a throughput of about 60 MB/s per file. Page Blob can be up to 1 TB and is suited to random reads and writes, which makes it a good fit for the HBase WAL. The fs.azure.page.blob.* parameters here configure Page Blob usage, so HBase writes its WAL and oldWAL logs to page blobs. By default each RegionServer has 1 WAL; in production you can also set hbase.wal.provider=multiwal so that multiple WAL streams are written in parallel over multiple pipelines, increasing total throughput during writes.

# Configure hive-site.xml

fs.azure.account.key.ryhdp.blob.core.windows.net=nz3iBFEiW9dG+cfVut65cLV8m3ISMndYXUbvIgJZyl4WcqGypPzkq36sZQdCbaSV/kLoCD5apbrWvOUdl8mLxQ==

 

# Configure hbase-site.xml

hbase.rootdir=wasb://[email protected]/

 

Then restart the HDFS, Hive, and HBase services.

 

 3. Verification

# Verify via HDFS

Setting fs.defaultFS=wasb://[email protected] makes Hadoop FS use Azure Storage as the default file system, so hdfs dfs -ls / lists the contents of the Azure Blob container.
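A minimal command-line verification sketch (the /tmp/test path is hypothetical; /etc/hosts is just used as a convenient local file to upload):

# Confirm the default file system points at the WASB container
$ hdfs getconf -confKey fs.defaultFS

# Listing the root now shows the contents of the Azure Blob container
$ hdfs dfs -ls /

# Write a small test file and read the directory back
$ hdfs dfs -mkdir -p /tmp/test
$ hdfs dfs -put /etc/hosts /tmp/test/
$ hdfs dfs -ls /tmp/test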


# Verify via Hive View that everything works, then check the file types in the storage account (the table files can also be listed from the command line, as shown after the statements)

CREATE EXTERNAL TABLE t1 (t1 string, t2 string, t3 string, t4 string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

LOCATION 'wasb://[email protected]/t1';

load data inpath 'wasb://[email protected]/table1.csv' into table t1;

SELECT * FROM t1;
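As an additional check, the files Hive created for the external table can be listed directly at the WASB location used above:

$ hdfs dfs -ls wasb://[email protected]/t1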


# Verify via HBase Shell that everything works, then check the file types in the storage account (see the Azure CLI sketch after the commands)

create 'emp','personal data','professional data'

put 'emp','1','personal data:name','raju'

put 'emp','1','personal data:city','hyderabad'

put 'emp','1','professional data:designation','manager'

put 'emp','1','professional data:salary','50000'

scan 'emp'

get 'emp', '1'
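To check the blob types from the storage account side, a sketch with the Azure CLI is shown below. The WALs prefix reflects fs.azure.page.blob.dir above; the account key must be supplied, for example via the AZURE_STORAGE_KEY environment variable. WAL files should be reported as PageBlob, while Hive and HBase data files should be BlockBlob:

# WAL files written under /WALs should be page blobs
$ az storage blob list --account-name ryhdp --container-name hdp --prefix WALs --query "[].{name:name, type:properties.blobType}" -o table

# Other files in the container should be block blobs
$ az storage blob list --account-name ryhdp --container-name hdp --query "[].{name:name, type:properties.blobType}" -o table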


Summary:

When building your own Hadoop cluster on VMs, using Azure Blob or Azure Data Lake to manage big data delivers better cost-effectiveness. How do you migrate data from HDFS to Azure Blob? That will be covered in the next installment.

 

 

Appendix:

https://blogs.msdn.microsoft.com/cindygross/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool/

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage

 
