HBase: Introduction and application scenarios

1. Introduction

Origin

  • Google's three papers, the "troika" of big data in the early 21st century
    • GFS → HDFS
    • MapReduce → Hadoop MapReduce
    • Bigtable → HBase

The difference between real-time and offline

  • The essence of big data: processing data to extract value from it
    • The value of data gradually decreases over time
  • Offline: data is processed some time after it has been generated
    • T+1: yesterday's data is processed today
    • Annual bill: processed once a year
    • Timeliness: hours or longer
  • Real-time: data is processed as soon as it is generated
    • Real-time risk control
    • Real-time recommendation
    • Timeliness: within seconds
      • Data generated in one second is processed and applied in the next
  • How to build a real-time pipeline?
    • Real-time data generation
    • Real-time data collection: Flume
    • Real-time data storage: HBase, Kafka
    • Real-time data processing: Flink, Spark Streaming
    • Real-time data application

Bigtable/HBase design ideas

  • Requirement: how can big data be read and written quickly?
  • Distributed solutions for storing big data
    • Distributed database
    • Manage the stored big data through structures such as databases and tables
    • Hive
      • Manages big data in the form of databases and tables
      • The underlying storage is still HDFS
      • Row and column data management is implemented on top of files
        • Reading a given row or column requires reading the files
    • HDFS
      • Application scenario: write once, read many times
      • Disk-based distributed storage: random reads and writes go to disk

Question 1: How can a computer read and write data quickly?

  • Where a computer stores data: hard disk, memory
  • Read and write speed, from fastest to slowest
    • Sequential memory access
    • Sequential disk access
    • Random memory access
    • Random disk access
  • To improve read and write performance, the data must be in memory
  • HBase reads and writes memory first
    • Write: write to memory first
    • Read: read from memory first

Question 2: Memory is relatively small, so how can it hold big data?

  • Distributed design
    • Build distributed in-memory storage
      • Data written to HBase is partitioned and written to the memory of different machines
      • 100 GB of data, 10 machines [32 GB * 10 = 320 GB]
      • 10 GB is written into the memory of each machine
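
The partitioning idea above can be illustrated with a small sketch. It assumes range partitioning by row key, which is one possible scheme; the split keys, machine names, and function names are made up for illustration and are not taken from the original notes.

```python
from bisect import bisect_right

# Hypothetical split points: keys below "g" go to node0,
# keys from "g" up to (not including) "n" go to node1, and so on.
SPLIT_KEYS = ["g", "n", "t"]
MACHINES = ["node0", "node1", "node2", "node3"]


def machine_for(row_key: str) -> str:
    """Return the machine whose key range contains row_key."""
    return MACHINES[bisect_right(SPLIT_KEYS, row_key)]


def per_machine_gb(total_gb: float, machine_count: int) -> float:
    """Even share of the data that each machine keeps in memory."""
    return total_gb / machine_count


if __name__ == "__main__":
    for key in ["apple", "grape", "zebra"]:
        print(key, "->", machine_for(key))
    # The example from the notes: 100 GB spread over 10 machines.
    print(per_machine_gb(100, 10), "GB per machine")
```

With the numbers from the notes, each machine holds 10 GB, which fits comfortably in 32 GB of memory per machine.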

Question 3: What if memory capacity is still not enough to hold all the data?

  • Data cannot stay in memory forever
  • A general rule of data processing: new data is much more likely to be processed than old data
  • Solution
    • New data is kept in memory
      • New data and frequently used data
    • Old data is persisted to the hard disk
    • Question: what if old data needs to be read?
      • The first read comes from the hard disk; after that, the data is put into the cache [memory]
        • Later reads of the same data come directly from the cache
      • Sorting: all data written to the hard disk is kept in sorted order
        • Looking up data in sorted data is very fast
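
The two ideas above (sorted data on disk, plus a cache filled on the first read) can be sketched roughly as follows. This is a toy model under stated assumptions: `SortedFile` and `CachedReader` are hypothetical names, the "disk" is just a pair of Python lists, and binary search stands in for whatever index a real storage file uses.

```python
from bisect import bisect_left


class SortedFile:
    """Stand-in for a file on disk whose records are sorted by key."""

    def __init__(self, records: dict):
        items = sorted(records.items())
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]
        self.reads = 0  # counts simulated disk reads

    def get(self, key):
        self.reads += 1
        i = bisect_left(self.keys, key)  # binary search over sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class CachedReader:
    """Reads from the cache first and falls back to the sorted file."""

    def __init__(self, disk_file: SortedFile):
        self.disk_file = disk_file
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.disk_file.get(key)
        if value is not None:
            self.cache[key] = value  # keep it in memory for later reads
        return value
```

Reading the same key twice through `CachedReader` touches the simulated disk only once; the second read is served from memory.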

Question 4: Hard disks fail easily. What if the data on a hard disk is lost?

  • How can data remain readable when a hard disk breaks?
    • Store the data in HDFS
      • HDFS data is stored on hard disks
      • Replicas ensure that data is not lost
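
Why replicas protect against disk failure can be shown with a small simulation. This is only a sketch: the round-robin placement below is a simplification chosen for illustration, not HDFS's actual placement policy, and the node and block names are invented. The default of three replicas matches the HDFS default replication factor.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }
```

With three replicas on distinct nodes, any two nodes can fail and every block stays readable; with a single copy, losing one node loses every block stored only there.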

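Summary

  • Reads and writes go to memory first, by building distributed in-memory storage
    • Write: memory
      • Old data in memory is periodically written to HDFS
      • This frees memory for new data
    • Read
      • Read memory first
      • If the data is not there, read the cache [memory]
      • If it is not there either, read HDFS
      • After reading, put the data into the cache
  • The underlying layer uses HDFS to persist the data
  • This achieves fast reads and writes while also keeping the data safe
  • The main difference from Hive at the underlying layer
    • Hive -> read/write -> HDFS
      • File-based rows
    • HBase -> read/write memory first -> read/write HDFS
      • File-based columns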
  • Official website: hbase.apache.org
  • Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
  • HBase is Hadoop's database: a distributed, scalable store for big data
  • Use Apache HBase™ when you need random, realtime read/write access to your Big Data.
  • Use HBase when you need random, real-time read and write access to big data
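
The write and read path summarized in this section can be modeled in a few lines. This is a toy sketch, not HBase's implementation: `ToyStore` and its attribute names are hypothetical, a plain dict stands in for files on HDFS, and the flush is triggered by a size threshold here, whereas the notes only say that old data is written out periodically.

```python
class ToyStore:
    """Toy model: write buffer and read cache in memory, a dict standing in for HDFS."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}    # newest writes, held in memory
        self.cache = {}       # data recently read from persistent storage
        self.persistent = {}  # stand-in for files on HDFS
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        self.cache.pop(key, None)  # drop any stale cached copy
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist buffered writes in key order, then free the memory.
        for key in sorted(self.memstore):
            self.persistent[key] = self.memstore[key]
        self.memstore.clear()

    def get(self, key):
        """Return (value, where_it_was_found)."""
        if key in self.memstore:
            return self.memstore[key], "memory"
        if key in self.cache:
            return self.cache[key], "cache"
        if key in self.persistent:
            self.cache[key] = self.persistent[key]  # cache it for next time
            return self.persistent[key], "hdfs"
        return None, "miss"
```

A freshly written key is served from memory; after a flush, the first read is served from the HDFS stand-in and the next one from the cache.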

2. Function

  • Provides fast, random, real-time reads and writes on big data

3. Application scenarios

  • E-commerce
    • Large amounts of product and order information are stored in back-end databases
    • MySQL stores orders from the last six months
    • HBase can store the information for all products
  • Gaming
    • Orders, operations, upgrades
  • Finance
    • Every purchase
  • Telecommunications
    • SMS, phone calls
    • Printing call records
  • Transportation
    • Roadside probes
  • Real-time, high-performance random reads and writes over large data volumes

Origin blog.csdn.net/qq_46893497/article/details/114181210