Is Unstructured Data Analysis a Fudge?

        The rise of big data has brought unstructured data analysis along with it. It is often said that 80% of the data in an enterprise is unstructured. Measured by occupied space, that ratio is roughly right; after all, audio and video files are huge. With that much data, it is natural to want to analyze it, and analysis of course requires suitable technical means.

       So why call unstructured data analysis technology a fudge?

There is no universal unstructured data computing technology

       Unstructured data comes in many types: audio and images, text and web pages, office documents, equipment logs, ...; each type has its own calculation and processing methods, such as speech recognition, image comparison, text search, graph structure calculation, etc. But there is no single general computing technique that works for all unstructured data. A speech recognition method cannot be used for image comparison, text search, or graph structure calculation.

       A vendor that is good at a certain technology will claim that specialty directly, rather than saying in general terms that it is good at unstructured data analysis. For example, a vendor whose face recognition is very accurate, or a company that specializes in mining sensitive words from text, will say exactly that, because it makes targeting users and application scenarios much easier. If a company says it is good at unstructured data analysis without naming a specific area, it is unclear what it can actually do.

The common technology for unstructured data is just storage

       Although many specialized technical fields can be classified as unstructured data processing, their overall scope of application is narrow. Most users do not use these specialized technologies at all; they only need to store the data. There is no general analysis and computing technology for unstructured data, but storage and the corresponding management (addition, deletion, retrieval, etc.) can be generalized. Unstructured data occupies a lot of space and often requires storage methods different from those used for structured data.
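       Why storage generalizes while analysis does not can be illustrated with a toy example. The sketch below is only an assumption-laden illustration: the `BlobStore` class and its method names are hypothetical, and it keeps everything in memory. The point is that adding, retrieving, and deleting work the same way on any bytes, whether they came from audio, an image, or a log file.

```python
import hashlib


class BlobStore:
    """Toy in-memory store (hypothetical): every object is just bytes plus a key."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        # The key is derived from the content; the store never inspects the format.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def retrieve(self, key: str) -> bytes:
        return self._blobs[key]

    def delete(self, key: str) -> bool:
        # Returns True if something was actually removed.
        return self._blobs.pop(key, None) is not None

    def total_bytes(self) -> int:
        return sum(len(b) for b in self._blobs.values())
```

       Nothing in this interface knows what the bytes mean, which is exactly why it can be general and why it says nothing about analysis.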

       However, if the amount of data is not particularly large, or there is no need for highly concurrent retrieval, most network file systems (such as HDFS) can already meet the storage and access requirements. A vendor that only claims to store and manage unstructured data will appear to offer little technical content. Such vendors therefore go out of their way to attach themselves to "analysis", but there is no substance behind it. By contrast, professional storage vendors that really can provide large-capacity, high-performance access simply talk about storage and do not make a point of mentioning analysis.

Universal analytics techniques lie in the accompanying structured data

       The collection of unstructured data is often accompanied by the collection of a lot of related structured data, such as the producer, production time, category, and duration of an audio or video file, ...; some unstructured data also turns into structured data after processing, for example when the visitor's IP, access time, and key search terms are extracted from web logs. So-called unstructured data analysis is often really aimed at this accompanying structured data. In this area there are many mature general computing technologies (such as relational algebra and relational databases).

       But talking only about structured data is no longer fashionable enough. To attract users, what is essentially structured data analysis has to be described as unstructured data analysis.

       As a user on the demand side, you need to know exactly what you want to do with the data. If it is just simple storage, an open source network file system such as HDFS is enough; if you have high-performance access requirements, you need to find a professional storage vendor; if you actually want to analyze the associated structured data, that is the already familiar database business; if you really have specific processing needs, you should look for vendors and technologies in the relevant specialized field. In short, do not talk in general terms about needing unstructured data analysis.
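       The web log case can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the sample lines use the common access-log layout, the search term is assumed to arrive in a `q` query parameter, and the table and function names are hypothetical. Once the fields are extracted, the analysis is an ordinary relational query.

```python
import re
import sqlite3
from urllib.parse import urlparse, parse_qs

# Hypothetical sample lines in a common access-log layout.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /search?q=big+data HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2023:13:56:01 +0000] "GET /search?q=hdfs HTTP/1.1" 200 1804',
    '198.51.100.7 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 512',
]

LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Turn one raw log line into (ip, access_time, search_term), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    query = parse_qs(urlparse(m.group("url")).query)
    term = query.get("q", [None])[0]  # assumed parameter name for searches
    return (m.group("ip"), m.group("time"), term)


def load_records(lines):
    """Load the extracted fields into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (ip TEXT, access_time TEXT, search_term TEXT)"
    )
    rows = [r for r in (parse_line(l) for l in lines) if r is not None]
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    return conn


def searches_per_ip(conn):
    """Plain SQL from here on: count searches per visitor IP."""
    cur = conn.execute(
        "SELECT ip, COUNT(search_term) FROM visits GROUP BY ip ORDER BY ip"
    )
    return cur.fetchall()
```

       Calling `searches_per_ip(load_records(SAMPLE_LOG))` runs a standard GROUP BY over the extracted rows; nothing about that step is specific to unstructured data.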

About the columnist

       Jiang Buxing, Founder and Chief Scientist of Runqian Software

       Master of Computer Science, Tsinghua University, and author of "Principles of Nonlinear Reporting Models" and other works. In 1989, he was a member of the first Chinese team to win the team championship at the International Mathematical Olympiad and won an individual gold medal. In 2000, he founded Runqian and proposed the nonlinear report model, which perfectly solved the problem of producing complex Chinese-style reports; this model has since become the standard in the report industry. In 2014, after 7 years of development, Runqian released the Set Calculator, a calculation engine that does not rely on the relational algebra model and that effectively improves the development and operation efficiency of complex structured big data calculations. In 2015, Runqian Software was named one of "2015 Forbes China's Top 100 Unlisted Potential Enterprises" by the Forbes Chinese website; in 2016 came a place among the "Top Ten Leaders in China's Software and Information Service Industry in 2016" selected by the Industrial Development Research Institute. In 2017, a new generation of independently developed products, including a data warehouse and a cloud database, is to be released soon.

Origin blog.csdn.net/iamonlyme/article/details/131933298