Database Data Analysis Extension—MADlib

MADlib is an open source software project that began as a collaboration involving the University of California, Berkeley. It provides accurate, data-parallel implementations of statistical and machine learning methods for analyzing structured and unstructured data. Its main purpose is to extend the analytical capabilities of the database: MADlib is loaded into the database, where its analysis functions become available alongside the database's own. In July 2015, MADlib became an incubating project of the Apache Software Foundation. At the time of writing, the latest version is MADlib 1.9, which supports PostgreSQL, Greenplum Database, and Apache HAWQ. Official website: http://madlib.incubator.apache.org/.

Features of MADlib

MADlib supports supervised classification, cluster analysis, text analysis, regression analysis, association rule mining, descriptive statistics, validation analysis, and more.

MADlib installation

The latest released version of MADlib is 1.9. According to the documentation, it supports two PostgreSQL versions: PostgreSQL 9.3 and PostgreSQL 9.4. To use MADlib with either version, first install the package that matches your operating system. Packages can be downloaded from: https://dist.apache.org/repos/dist/release/incubator/madlib/1.9-incubating/.

Tip: The latest development version of MADlib already supports PostgreSQL 9.5 and PostgreSQL 9.6. To use MADlib with these two versions, you need to download the MADlib source code, then compile and install it yourself.

After MADlib is successfully installed, it can be loaded into any database. According to the documentation, the command for loading MADlib into a PostgreSQL database has the following format:

/usr/local/madlib/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] install

In this command, -s specifies the database schema into which MADlib is installed. For example, -s madlib creates a new schema named madlib and loads all of MADlib's data analysis functions into it. An example command, which installs into the public schema instead, is as follows:

/usr/local/madlib/bin/madpack -p postgres -s public -c [email protected]:5432/databasename install
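
Once madpack finishes, a quick query can confirm that the functions were loaded. The following is a minimal sketch, assuming MADlib was installed into a schema named madlib as in the command format above. If you installed into public, as in the example command, substitute that schema name.

```sql
-- Assumption: MADlib was installed into the schema "madlib".
-- Report the installed MADlib version string.
SELECT madlib.version();

-- Count the functions registered in that schema; a result of zero
-- suggests the installation did not load anything.
SELECT count(*) AS madlib_function_count
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'madlib';
```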

Tip: Loading MADlib into a PostgreSQL database requires PL/Python, the extension that lets PostgreSQL run functions written in the Python language.
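
A minimal sketch of checking for and enabling PL/Python follows. It assumes that the PL/Python package for your PostgreSQL server is already installed at the operating-system level, and that the Python 2 variant named plpythonu is the one this MADlib release expects. The second point is an assumption, so check the MADlib documentation for your version.

```sql
-- Check whether the PL/Python language is already registered
-- in the current database (returns one row if it is).
SELECT lanname FROM pg_language WHERE lanname = 'plpythonu';

-- If the query above returns no rows, register it.
-- Assumption: the server-side PL/Python package is installed;
-- this statement must be run by a superuser.
CREATE EXTENSION plpythonu;
```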

Using MADlib

The picture below shows population, house price, crime, and other information for an area in the UK.

Using MADlib, you can easily run a multiple regression on the house price, population, and crime data in the figure above to find how house prices relate to population and crime. To do so, write an SQL statement that calls the MADlib regression function linregr_train. The SQL statement is as follows:
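
For readers without the original screenshot, a call of this kind can be sketched as below. The table name uk_area_stats, the output table uk_area_linregr, and the column names house_price, density, and crime are hypothetical stand-ins for the data described above, not the names used in the original query. The sketch also assumes MADlib is installed in the madlib schema.

```sql
-- Hypothetical names: uk_area_stats (input), uk_area_linregr (output),
-- house_price, density, crime (columns).

-- Remove any earlier model so that a rerun starts clean.
DROP TABLE IF EXISTS uk_area_linregr, uk_area_linregr_summary;

-- Fit house_price as a linear function of density and crime.
-- The leading 1 in the array adds an intercept term.
SELECT madlib.linregr_train(
    'uk_area_stats',            -- source table
    'uk_area_linregr',          -- output table for the fitted model
    'house_price',              -- dependent variable
    'ARRAY[1, density, crime]'  -- independent variables
);

-- Inspect the fitted model.
SELECT coef, r2, std_err, t_stats, p_values
FROM uk_area_linregr;
```

The arrays coef, std_err, t_stats, and p_values follow the order of the independent-variable array, so the first entry belongs to the intercept.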

The result is as follows:

The p-values in the figure above show that only the relationship between population density and house price is significant (p-value < 0.05). The regression coefficient for density is negative, which means that the higher the population density, the lower the house price (ordinary residential areas), and the lower the population density, the higher the house price (high-end residential and villa areas).
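
The same reading can be done in SQL rather than by eye. The sketch below assumes the model was written to a hypothetical output table named uk_area_linregr and was trained on an intercept plus two hypothetical columns, density and crime, in that order. It pairs each coefficient with its p-value and keeps only the terms significant at the 0.05 level.

```sql
-- Hypothetical names: uk_area_linregr (model table), density, crime.
-- The label array must list the terms in the same order as the
-- independent-variable array used for training.
SELECT attribute, coef, p_value
FROM (
    SELECT unnest(ARRAY['intercept', 'density', 'crime']) AS attribute,
           unnest(coef)     AS coef,
           unnest(p_values) AS p_value
    FROM uk_area_linregr
) AS per_variable
WHERE p_value < 0.05;
```

A negative coef in a surviving row indicates that the dependent variable falls as that variable rises, which is the pattern described above for density.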

Summary

MADlib lets you analyze data directly inside the database using SQL statements, which makes data analysis convenient. It is a practical and powerful data analysis tool.
