MaxCompute basics and MaxCompute SQL optimization

General information:

Big Data Computing Service (MaxCompute, formerly known as ODPS) is a fast, fully managed, TB/PB-scale data warehousing solution. MaxCompute provides users with a complete data import scheme and a variety of classic distributed computing models, so users can solve massive-data computing problems more quickly, reduce costs, and keep data secure. MaxCompute is also closely integrated with the Big Data Development Kit, which provides one-stop data synchronization, task development, data workflow development, data management, and data operation and maintenance on top of MaxCompute; see the Big Data Development Kit introduction for an in-depth understanding.

MaxCompute mainly provides batch storage and computing for structured data, offering massive data warehouse solutions as well as analysis and modeling services for big data. As data-collection instruments keep improving and spreading, more and more industries accumulate data at a massive scale (hundreds of GB, TB, or even PB) that traditional software can no longer handle.

When analyzing massive data, the processing capacity of a single server is limited, so data analysis usually adopts a distributed computing model. But the distributed computing model places higher demands on data analysts and is hard to maintain: analysts must understand not only the business requirements but also the underlying computing model. The purpose of MaxCompute is to provide users with a convenient means of data-intensive analysis; users do not have to care about the details of distributed computing in order to analyze big data.

    First, MaxCompute is unlike an ordinary relational database such as MySQL or Oracle. It is a comprehensive data service platform, and it does not return query results at the second or even millisecond level; executing an ODPS command usually goes through the following process:

Submitting a job (simplified):

  1. The client submits a SQL statement by sending a request to the RESTful HTTP server.
  2. The HTTP server performs user authentication. After authentication succeeds, the request is forwarded to a Worker over the Kuafu communication protocol.
  3. The Worker determines whether the job needs to start a Fuxi job. If not, it executes the job locally and returns the result. If so, it generates an instance and sends it to the Scheduler.
  4. The Scheduler registers the instance information in OTS and sets its state to Running, then adds the instance to the instance queue.
  5. The Worker returns the Instance ID to the client.

Running a job (simplified):

  1. The Scheduler splits the instance into multiple Tasks and generates a task-flow DAG.
  2. The Tasks are placed into the priority queue TaskPool to wait for execution.
  3. A background thread in the Scheduler periodically sorts the tasks in the TaskPool, and another background thread periodically queries the resource situation of the computing cluster. When an Executor has spare resources, it polls the TaskPool and requests a Task; the Scheduler checks the computing resources, and if the cluster has resources it sends the Task to the Executor.
  4. The Executor calls the SQL Parser/Planner to generate a SQL Plan, converts the SQL Plan into the Fuxi job description file of the computing layer, submits the description file to the computing layer to run, and queries the Task execution status. After the Task finishes, the Executor updates the Task information in OTS and reports to the Scheduler.
  5. The Scheduler determines that the instance has finished, updates the instance information in OTS, and sets its state to Terminated.

Checking the status:

After the Instance ID is returned, the client can query the job status by the Instance ID:

  1. The client sends another REST request to query the job status.
  2. The HTTP server performs user authentication based on the configuration information. After authentication, it forwards the query request to a Worker.
  3. The Worker queries the execution status of the job from OTS by the Instance ID and returns the status to the client.

    In fact, MaxCompute is a transparent data service platform: users do not need to know the details of distributed data processing and can conveniently handle PB-level data from the client. After understanding the process above, you should have a clearer idea of why MaxCompute only returns results at the minute level.
PS: all of this is transparent in the Big Data Development Kit.

MaxCompute SQL basics:


    MaxCompute SQL is roughly similar to the SQL of a common relational database, except that MaxCompute does not support transactions, primary-key constraints, indexes, and so on; it can be seen as a subset of standard SQL.

    DDL: data definition language

    MaxCompute operates on tables; DDL covers a series of operations on tables, including create, drop, and alter.
    Take a table on the Big Data Development Kit as an example:

CREATE TABLE IF NOT EXISTS xxxx
(
    aa     STRING COMMENT 'xxxx',
    bb     STRING COMMENT 'xxxx',
    cc     STRING COMMENT 'xxxx',
    dd     STRING COMMENT 'xxxx',
    ee     STRING COMMENT 'xxxx',
    ff     STRING COMMENT 'xxx',
    gg     BIGINT COMMENT 'xxx'
)
COMMENT 'xxxx'
PARTITIONED BY (dt     STRING COMMENT '')
LIFECYCLE 10;

In a workflow we want tasks to execute smoothly, so for both DDL and DML we try to write statements that always return success (IF NOT EXISTS, OVERWRITE).
Comments include the table comment and the comment on each field; they can be changed with alter.
Unlike conventional SQL, MaxCompute table names are global, and even create table xxxx as select xxxx from xxx requires every selected column to have a name.
Pay attention to the partition fields: because MaxCompute operates on very large amounts of data, the choice of partition fields generally deserves special care.
lifecycle: a very convenient property that lets users free up storage space without the complicated space maintenance and data recovery process of traditional systems. It works off LastDataModifiedTime, and touch can be used flexibly (it sets the modification time to the current time); note the difference between partitioned and non-partitioned tables.
For copying a large table's structure, ODPS provides a very flexible create table ... like statement.
Dropping a table throws the table and its data into the recycle bin; with the purge keyword it is deleted immediately and cannot be restored.
alter can change almost all table properties, including columns, comments, partitions, partition properties, lifecycle, and so on.
archive can be used to compress a large table and reduce the space it occupies. A sketch of these statements follows.
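A minimal sketch of the statements mentioned above, assuming the xxxx table from the earlier example (the backup table name and partition value are made up for illustration):

-- copy the structure of a large table (no data is copied)
CREATE TABLE IF NOT EXISTS xxxx_bak LIKE xxxx;

-- change the lifecycle of an existing table to 30 days
ALTER TABLE xxxx SET LIFECYCLE 30;

-- reset LastDataModifiedTime of a partition to now, so the lifecycle clock restarts
ALTER TABLE xxxx TOUCH PARTITION (dt = '20190101');

-- compress a cold partition to reduce the space it occupies
ALTER TABLE xxxx PARTITION (dt = '20190101') ARCHIVE;

-- drop the copied table (without purge it can still go to the recycle bin, per the notes above)
DROP TABLE IF EXISTS xxxx_bak;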

 

DML: data manipulation language


    Common statements include insert, select, and join.

   insert overwrite|into table tablename
       [partition (partcol1=val1, partcol2=val2 ...)]
       select_statement
       from from_statement;

     With static partitions, the partition fields are given constant values; with dynamic partitions, no value is specified for the partition column and its value comes from the select clause.
     multi-insert: read once, write many times, reducing the amount of data read. Both are sketched below.
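A minimal sketch of both forms, assuming a staging table src_xxxx with the same columns as the xxxx table above (all names here are illustrative):

-- dynamic partition: dt is not given a constant; its value comes from the last select column
INSERT OVERWRITE TABLE xxxx PARTITION (dt)
SELECT aa, bb, cc, dd, ee, ff, gg, dt
FROM src_xxxx;

-- multi-insert: read src_xxxx once, write to two destinations
FROM src_xxxx
INSERT OVERWRITE TABLE xxxx PARTITION (dt = '20190101')
    SELECT aa, bb, cc, dd, ee, ff, gg WHERE dt = '20190101'
INSERT OVERWRITE TABLE xxxx_summary PARTITION (dt = '20190101')
    SELECT aa, COUNT(*) WHERE dt = '20190101' GROUP BY aa;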

     

   select [all | distinct] select_expr, select_expr, ...
        from table_reference
        [where where_condition]
        [group by col_list]
        [order by order_condition]
        [distribute by distribute_condition [sort by sort_condition] ]
        [limit number]

     Unlike traditional SQL, distinct applies to all selected fields.

     The evaluation order is group by > select > order/sort/distribute; understanding this order tells you where a column alias can be used.
     distribute by: shards the data by the hash value of the specified columns.
     sort by: local sorting; it must follow a distribute by clause, i.e. sort by sorts the result of distribute by within each shard.
     Be clear about the restrictions: order by cannot be used together with distribute by / sort by, and group by also cannot be used together with distribute by / sort by. Two small sketches follow.
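Two small sketches of the points above, using the xxxx table from the DDL example:

-- distinct applies to all selected columns: this returns distinct (aa, bb) pairs
SELECT DISTINCT aa, bb
FROM xxxx
WHERE dt = '20190101';

-- distribute by shards rows by hash(aa); sort by then sorts within each shard
SELECT aa, gg
FROM xxxx
WHERE dt = '20190101'
DISTRIBUTE BY aa
SORT BY gg;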
     
     join behaves largely like traditional SQL; ODPS supports left outer join, right outer join, full outer join, and inner join.
     mapjoin hint: when joining a small table with a large table, the user can specify mapjoin so that all the small tables are loaded into memory, which speeds up the join; it also supports non-equi join conditions. It cannot be used with full outer join, and the main table of the join must be the large table.
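A minimal mapjoin sketch, assuming a hypothetical small dimension table dim_bb: the large table xxxx is the main table, and the small table named in the hint is loaded into memory:

SELECT /*+ MAPJOIN(b) */
       a.aa,
       a.gg,
       b.cc
FROM xxxx a
LEFT OUTER JOIN dim_bb b
  ON a.bb = b.bb
WHERE a.dt = '20190101';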
 

Built-in functions:


    These include mathematical and statistical functions, string manipulation functions, time functions, window functions, aggregate functions, row/column transposition functions, and so on;
    too many to list, and powerful. A small illustration follows.
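A few of these built-ins in one query, for illustration only (function names as commonly documented for MaxCompute; the table is the xxxx example above):

-- string, time, and window functions on the xxxx example table
SELECT aa,
       SUBSTR(bb, 1, 4)                                      AS bb_prefix,
       DATEDIFF(GETDATE(), TO_DATE(dt, 'yyyymmdd'), 'dd')    AS days_ago,
       ROW_NUMBER() OVER (PARTITION BY aa ORDER BY gg DESC)  AS rn
FROM xxxx
WHERE dt = '20190101';

-- an aggregate function
SELECT aa, SUM(gg) AS gg_sum
FROM xxxx
WHERE dt = '20190101'
GROUP BY aa;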
    

UDF: user-defined functions


   These include UDF, UDTF, and UDAF.
   UDF: user-defined scalar function.
   UDTF: user-defined table-valued function (returns multiple fields).
   UDAF: user-defined aggregate function.
   UDF example:

  package org.alidata.odps.udf.examples;

  import com.aliyun.odps.udf.UDF;

  public final class Lower extends UDF {
    public String evaluate(String s) {
      if (s == null) { return null; }
      return s.toLowerCase();
    }
  }

 A UDF extends the UDF class and implements the evaluate method. Multiple evaluate methods can be defined to support overloading (polymorphism).

 UDAF:
A UDAF extends com.aliyun.odps.udf.Aggregator and mainly implements the iterate, merge, and terminate interfaces; the core UDAF logic lives in these three methods. In addition, the user needs to implement a custom Writable buffer, because the main UDAF logic traverses each data slice and then merges the partial results.

UDTF:
A UDTF extends the com.aliyun.odps.udf.UDTF class and mainly implements the process and forward interfaces. Each record in the SQL corresponds to one call of process, whose parameters are the UDTF's input parameters, passed in as an Object[]; output is produced by calling the forward function.

The unified way to add a UDF:
add jar xxx;
create function xxx as 'packagename.classname' using 'jarname';
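For the Lower class above, registration and use might look like this (the jar name and function name are illustrative):

add jar lower_udf.jar;
create function my_lower as 'org.alidata.odps.udf.examples.Lower' using 'lower_udf.jar';

select my_lower(aa) from xxxx where dt = '20190101';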
    

PL: Stored Procedures


Similar to conventional SQL, except that a variable reference is prefixed with $:

DECLARE
 var_name var_type;
BEGIN
 -- executable statements
END;

    Other commands:


     explain, show instances, merge smallfiles, and add / remove / show statistics (add, remove, or show a statistic on a value or expression). A few sketches follow.
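Rough sketches of a few of these commands (exact syntax may vary by client version; the table is the xxxx example):

-- show the execution plan of a query without running it
EXPLAIN
SELECT aa, COUNT(*) FROM xxxx WHERE dt = '20190101' GROUP BY aa;

-- list recently submitted instances
SHOW INSTANCES;

-- merge the small files of a partition to reduce the file count
ALTER TABLE xxxx PARTITION (dt = '20190101') MERGE SMALLFILES;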

MaxCompute SQL optimization and Big Data Development Kit:

Table selection principles:

  • Choose small tables that meet the need, such as summary tables; prefer the full snapshot for dimension tables and the incremental table for fact tables;
  • Choose tables whose output is produced early;
  • Choose tables that can be rolled back, for example use an incremental event table plus the previous full table instead of the full table;
  • When depending on N upstream tables, try to ensure their output times are consistent; if they differ, reconsider which tables to depend on;

Small table principles:

  • A table with fewer than one million rows can be considered a small table; using mapjoin on it improves performance a lot;
  • Add partition filter conditions when reading data, so that a large table becomes a small table; turn commonly filtered fields into dynamic partitions to make downstream filtering easier;
  • Instead of reading a big table N times, combine multiple days of transactions with union all (see the sketch after this list);
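A sketch of the last two points, using the xxxx example table (partition values are illustrative):

-- partition filter: the big table xxxx becomes small because only one dt is read
SELECT aa, gg FROM xxxx WHERE dt = '20190101';

-- combine several days of transactions with union all instead of re-reading a full table
SELECT aa, gg FROM xxxx WHERE dt = '20190101'
UNION ALL
SELECT aa, gg FROM xxxx WHERE dt = '20190102';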

Code principles:

  • Joins should be on the primary key whenever possible, and the types of the join fields must be consistent;
  • For multi-day summaries, first generate a light 1-day summary each day, then aggregate the multi-day data from the 1-day summaries (see the sketch after this list);
  • Use multi-insert to read once and write many times;
  • Use the system's built-in functions instead of writing your own UDF;
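A sketch of the "light daily summary, then multi-day aggregation" principle (the dws_xxxx_1d table is illustrative):

-- step 1: generate a light 1-day summary per day
INSERT OVERWRITE TABLE dws_xxxx_1d PARTITION (dt = '20190101')
SELECT aa, COUNT(*) AS cnt, SUM(gg) AS gg_sum
FROM xxxx
WHERE dt = '20190101'
GROUP BY aa;

-- step 2: aggregate N days from the 1-day summaries instead of the detail table
SELECT aa, SUM(cnt) AS cnt_7d, SUM(gg_sum) AS gg_sum_7d
FROM dws_xxxx_1d
WHERE dt >= '20190101' AND dt <= '20190107'
GROUP BY aa;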

Scheduling principles:

  • Depend on max_pt to avoid depending on the current day's partition (see the sketch after this list);
  • If the upstream task runs hourly, be careful when using max_pt;
  • Pay attention to tasks that take more than one hour to run;
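A sketch of the max_pt point, assuming a hypothetical partitioned dimension table my_project.dim_xxxx: the query always reads the latest finished partition instead of depending on today's partition:

SELECT bb, cc
FROM dim_xxxx
WHERE dt = MAX_PT('my_project.dim_xxxx');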

Big Data Development Kit:

The Big Data Development Kit provides intuitive data operations; code development, debugging, optimization, and publishing for data-writing processes can all be carried out in the Big Data Development Kit.
Take a job that takes too long as an example, and look at how we deal with such a problem on the Big Data Development Kit.

If the execution time of a task is too long, and performance problems in the code itself have been ruled out, there are two main possibilities:
one is a waiting problem, the other is a data-skew problem.
Waiting may be caused by insufficient system resources, a busy system, insufficient priority, too much data, a bad disk, and other reasons;
we can handle it by adjusting the priority, re-running the job, filtering the input data early, and other methods.

Skew is generally a problem of the data itself. What commonly causes data skew?

During a shuffle, all rows with the same key on each node are pulled to one task on one node for processing, for example for aggregation or a join by key. If the amount of data for a particular key is especially large, data skew occurs, and the skewed shard becomes the bottleneck of the entire running time.

Operators that commonly trigger a shuffle: distinct, group by, join, and so on.

To solve a data-skew problem, first locate where the data is skewed, that is, in which stage; this can be seen directly in the D2 UI:
logview - odps task - detail - stage - long tail

From the stage log, determine on which operator the data is skewed.

According to the skewed stage, skew can be divided into map skew, reduce skew, and join skew.

Generally speaking, when skew occurs, we first look at the distribution of the keys that cause the skew; then there are roughly the following treatment options (sketches of options 1 and 3 follow this list):

1. Filter the data:
filter out dirty data, for example remove NULLs, or remove rows matching certain conditions.
2. Increase the degree of parallelism:
add processing resources to the task and increase the number of instances; brute force.
3. Split the data and divide and conquer:
if a large table joins a small table, use mapjoin to cache the small table in memory; split the data set into hot keys and non-hot keys, and handle the hot part with a random prefix (which expands the data) and a second-stage distribution; for a large table joining a large table, a bloom filter can also be considered.
4. Combine methods:
combine the methods above.
5. Modify the business:
if there is no room for improvement, filter the data from the business side.
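Sketches of options 1 and 3 above (all table and column names are made up):

-- option 1: filter dirty data, e.g. NULL join keys that all land on one instance
SELECT a.*, b.cc
FROM big_a a
JOIN dim_b b
  ON a.user_id = b.user_id
WHERE a.user_id IS NOT NULL;

-- option 3: divide and conquer: mapjoin the hot keys, normal join for the rest
SELECT /*+ MAPJOIN(b) */ a.*, b.cc
FROM (SELECT * FROM big_a WHERE user_id = 'hot_key') a
JOIN dim_b b ON a.user_id = b.user_id
UNION ALL
SELECT a.*, b.cc
FROM (SELECT * FROM big_a WHERE user_id <> 'hot_key') a
JOIN dim_b b ON a.user_id = b.user_id;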

Original Address: http://click.aliyun.com/m/20506/  

Origin blog.csdn.net/qq1021979964/article/details/97390506