Druid Flatten JSON Parsing

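Druid can parse nested JSON during real-time ingestion; to do so you write a flattenSpec in the ingestion config. Take the following nested JSON record as an example: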

{"time":1529209115078,"product_type":"unknown","model":"Other","log_type":"timeon.behavior","api_no":"81","data":{"page":"POPUP:confirmAutorecPack","title":"22222222","page_session":"12","category":"ForumSubscribeAndAutorecFlow","action":"ToggleSubscribeFromWizard","label":"Unsubscribe pack with autorec disabled","area_code":25,"operation_time":"20180611044635","time_zone":"+0900","pdid":"0741968309533296"},"pdid":"0741968309533296","uid":"no_uid"}

For the record above, the JSON config file for reading the data from Kafka looks like this:

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "timeon.behavior",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "flattenSpec": {
            "useFieldDiscovery":true,
            "fields": [
              {
                "type": "root",
                "name": "product_type"
              },
              {
                "type": "root",
                "name": "model"
              },
              {
                "type": "root",
                "name": "log_type"
              },
              {
                "type": "root",
                "name": "api_no"
              },
              {
                "type": "path",
                "name": "page",
                "expr": "$.data.page"
              },
              {
                "type": "path",
                "name": "title",
                "expr": "$.data.title"
              },
              {
                "type": "path",
                "name": "page_session",
                "expr": "$.data.page_session"
              },
              {
                "type": "path",
                "name": "category",
                "expr": "$.data.category"
              },
              {
                "type": "path",
                "name": "action",
                "expr": "$.data.action"
              }, 
              {
                "type": "path",
                "name": "label",
                "expr": "$.data.label"
              }, 
              {
                "type": "path",
                "name": "value",
                "expr": "$.data.value"
              }, 
              {
                "type": "path",
                "name": "area_code",
                "expr": "$.data.area_code"
              }, 
              {
                "type": "path",
                "name": "operation_time",
                "expr": "$.data.operation_time"
              },
              {
                "type": "path",
                "name": "time_zone",
                "expr": "$.data.time_zone"
              }, 
              {
                "type": "path",
                "name": "data_pdid",
                "expr": "$.data.pdid"
              },   
              {
                "type": "root",
                "name": "pdid"
              },
              {
                "type": "root",
                "name": "uid"
              }                              
            ]
        },
        "dimensionsSpec": {
          "dimensions": [],
          "dimensionExclusions": [],
          "spatialDimensions": []
        },
        "timestampSpec": {
          "column": "time",
          "format": "millis"
        }
      }
    },
    "metricsSpec": [
      {
        "name": "count",
        "type": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 5000000
  },
  "ioConfig": {
    "topic": "timeon.behavior",
    "consumerProperties": {
      "bootstrap.servers": "kafka-a:9092,kafka-b:9092,kafka-c:9092"
    },
    "taskCount": 3,
    "replicas": 1,
    "taskDuration": "PT1H10M"
  }
}
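
To make the effect of the flattenSpec concrete, here is a rough sketch of the flat row that the fields above would yield for the sample record. It is a conceptual view written by hand, not actual Druid output: the nested data object is gone, its members appear under the names given in the path entries, and value is shown as null on the assumption that a path with no match in the record ($.data.value here) simply produces no value.

```json
{
  "time": 1529209115078,
  "product_type": "unknown",
  "model": "Other",
  "log_type": "timeon.behavior",
  "api_no": "81",
  "page": "POPUP:confirmAutorecPack",
  "title": "22222222",
  "page_session": "12",
  "category": "ForumSubscribeAndAutorecFlow",
  "action": "ToggleSubscribeFromWizard",
  "label": "Unsubscribe pack with autorec disabled",
  "value": null,
  "area_code": 25,
  "operation_time": "20180611044635",
  "time_zone": "+0900",
  "data_pdid": "0741968309533296",
  "pdid": "0741968309533296",
  "uid": "no_uid"
}
```

Note that data.pdid is exposed as data_pdid so that it does not collide with the root-level pdid.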

A few points are worth explaining:

(1) parseSpec mainly consists of flattenSpec, dimensionsSpec and timestampSpec.

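(2) flattenSpec mainly configures the fields of the data. Entries of type root, path and jq can be used to read nested data. In addition, with useFieldDiscovery=true Druid automatically picks up root-level fields, so the root entries could be left out. Nested objects such as data are not discovered, which is why their members need explicit path (or jq) entries.

The timestamp column should not be configured in fields.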
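
The config above only uses root and path entries. As a sketch of the jq type mentioned in point (2), the page field could be read with a jq expression instead of a JSONPath one. This assumes the jq expression form described in the Druid flattenSpec documentation, where the expression starts with a dot rather than with $:

```json
{
  "type": "jq",
  "name": "page",
  "expr": ".data.page"
}
```

For a plain member lookup like this one the path form is simpler; jq is mainly of interest when the value has to be transformed while it is extracted.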

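(3) dimensionsSpec mainly configures the dimension columns. If dimensions=[], Druid uses the fields from flattenSpec directly as dimensions (apart from the timestamp column and anything listed in dimensionExclusions), which saves configuring each dimension one by one.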
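
If you would rather control the columns than rely on the empty list, the dimensions can be listed explicitly. The following is a minimal sketch, shown as a standalone object (in the real spec the dimensionsSpec entry sits inside parseSpec). The choice of columns is only illustrative, and the typed entry for area_code assumes the object form of a dimension described in the Druid documentation:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "category",
      "action",
      { "type": "long", "name": "area_code" }
    ]
  }
}
```

With an explicit list, only the named columns become dimensions; the other flattened fields are not stored as dimensions.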

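(4) timestampSpec mainly configures the timestamp column, and the key setting is format. Druid supports two kinds of timestamp columns, string and numeric, and a numeric timestamp written as a string is accepted as well. posix means seconds since the epoch, millis means milliseconds since the epoch, iso means an ISO 8601 time, and any other value is treated as a Joda time pattern (see http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html). The sample record's time value, 1529209115078, has 13 digits and is therefore in milliseconds, so millis is the matching format. format only describes how the timestamp column is written in the source data, not how it is stored in Druid, so it must be chosen according to the actual data; otherwise the data will be parsed incorrectly.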
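
For a string timestamp, format takes a Joda pattern. The sketch below assumes a hypothetical root-level column named event_time holding values such as "2018-06-17 04:18:35"; neither the column nor the value appears in the sample record, they are only here to illustrate the pattern form:

```json
{
  "timestampSpec": {
    "column": "event_time",
    "format": "yyyy-MM-dd HH:mm:ss"
  }
}
```

The pattern has to match the source text exactly, including separators; a mismatch leads to the parse errors mentioned in point (4).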
