APP data collection

Table of Contents

1. App data capture analysis

2. Crawling ideas

2.1. Packet capture

2.2. HOOK technology

3. Pitfalls

Pitfall 1: Signature algorithm

Pitfall 2: The information retrieved over HTTP is inconsistent with the page display

Pitfall 3: Emulators

Pitfall 4: Accounts

4. Difficulty assessment

1. One star

2. Two stars

3. Three stars

4. Four stars

5. Tools

1. Packet capture tool

2. Decompile

6. Examples/Articles


The content of this article is compiled from material found on the Internet about collecting data from mobile apps, gathered for technical research.

In fact, crawling app data differs somewhat from crawling web page data. Web page data can be captured by simulating a visit to the site and then extracting the content of the page. App data is more often obtained by intercepting the data packets in transit (Wireshark, or Fiddler plus Python).

Generally speaking, Wireshark plus Fiddler gets us most of the data without much trouble. The problem arises when the traffic is encrypted with SSL/TLS or a similar scheme, in which case we are often stuck. In the past, if we had the private key used by the session, we could give it to Wireshark to decrypt the captured packets, but that only worked in the era when RSA was used to protect network traffic. Today most services have moved to perfect forward secrecy (PFS), so this method no longer applies. The point of forward secrecy is that each exchange uses a different key, so a single private key can no longer decrypt the packets of a whole session the way it could in the RSA era (in practice, this can still be worked around with a session-key logging feature such as the one browsers provide).

1. App data capture analysis

Any app data that can be seen on screen can be captured.

No fewer than 300 apps have been captured, analyzed, and studied.

For about 50% of apps, the request parameters can be analyzed and the data captured with packet-capture software alone.

About 30% of apps need to be decompiled so that the encryption algorithm can be analyzed before the data is captured.

About 10% of apps are hardened and must be unpacked first, then decompiled to analyze the encryption algorithm and capture the data.

The remaining 10% of apps hide the encryption algorithm behind signatures, certificates, device binding, and similar measures.

In general, no app is impossible to crawl; it is only a matter of time cost.

2. Crawling ideas

1. Packet capture

2. Hooking

2.1. Packet capture

This approach is easy to understand for anyone with coding or app development experience. Many apps communicate over web service protocols, and because the data is public, most of it is unencrypted. So as long as you monitor the network port and imitate the app, you can see how the app obtains its data.

We only need to write code that simulates the request. Whether it is a POST or a GET, we get back the information the request returns. Then, by analyzing the structure of the returned information, we can extract the data we want.

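import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor;
import us.codecraft.webmagic.scheduler.RedisScheduler;

// The class name is illustrative; the original shows only the main method.
public class SpiderDemo {
   public static void main(String[] args) {
       Spider.create(new GithubRepoPageProcessor())
               // Start crawling from https://github.com/****
               .addUrl("https://github.com/****")
               // Set the Scheduler: use Redis to manage the URL queue
               .setScheduler(new RedisScheduler("localhost"))
               // Set the Pipeline: save the results to files as JSON
               .addPipeline(new JsonFilePipeline("D:\\data\\webmagic"))
               // Run 5 threads concurrently
               .thread(5)
               // Start the crawler
               .run();
   }
}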
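The WebMagic example above targets web pages. For an app endpoint discovered through packet capture, replaying the request could be sketched with the HTTP client built into Java 11 and later. This is only a minimal sketch under assumptions: the URL https://api.example.com/list, the parameter names, and the ExampleApp/1.0 user agent are placeholders, not values taken from any real app.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class ReplayRequest {

    // Encode parameters as an application/x-www-form-urlencoded body.
    static String formEncode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Rebuild a captured POST request; the User-Agent imitates the app's client.
    static HttpRequest buildPost(String url, Map<String, String> params, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("User-Agent", userAgent)
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(params)))
                .build();
    }

    // Send the request and return the raw response body for later parsing.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Placeholder parameters; real names come from the captured traffic.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("page", "1");
        params.put("keyword", "hello world");

        HttpRequest request = buildPost("https://api.example.com/list", params, "ExampleApp/1.0");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formEncode(params));
        // Calling send(request) would perform the actual network call.
    }
}
```

Building the request separately from sending it makes it easy to compare the rebuilt request with the captured one before any traffic goes out.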

 

2.2. HOOK technology

Hooking is a technique for intercepting and altering behavior at the operating-system level. Since Android is open source, you can modify the system with the help of certain frameworks to achieve the behavior you want. For hooking we use the Xposed framework. Xposed is an open-source framework service that can change how programs run (including system services) without modifying the applications themselves. Many powerful modules can be built on it, so that applications run the way you want them to.

If you think of an Android phone as a castle, Xposed gives you a God's-eye view: you can see the details of how the castle operates, and you can also step in and change how it operates.

What does that mean? Simply put, you can use it to drive your app automatically. If we open the app in an emulator, we can tell it through code what to do at this step and what to do next. You can think of it as something like a key-press macro tool or a game cheat plug-in.

At every step, the data exchanged between the app and the server can be captured. This method is widely used against some mature apps, for example to collect data from a certain messaging app.

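import android.app.Activity;

import de.robv.android.xposed.IXposedHookLoadPackage;
import de.robv.android.xposed.XC_MethodHook;
import de.robv.android.xposed.XposedBridge;
import de.robv.android.xposed.callbacks.XC_LoadPackage.LoadPackageParam;

public class HookActivity implements IXposedHookLoadPackage {

   @Override
   public void handleLoadPackage(LoadPackageParam lpparam) throws Throwable {
       // Called for each app package as it is loaded.
       final String packageName = lpparam.packageName;
       XposedBridge.log("--------------------: " + packageName);
       try {
           // Hook every overload of Activity.onCreate and log the activity after it runs.
           XposedBridge.hookAllMethods(Activity.class, "onCreate", new XC_MethodHook() {
               @Override
               protected void afterHookedMethod(MethodHookParam param) throws Throwable {
                   XposedBridge.log("=== Activity onCreate: " + param.thisObject);
               }
           });
       } catch (Throwable error) {
           XposedBridge.log("xxxxxxxxxxxx: " + error);
       }
   }
}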

3. Pitfalls

Pitfall 1: Signature algorithm

Take the article list page and a detail page of a certain messaging app as an example. If you capture its HTTP traffic, you will find that one of the core URL parameters is generated in a way we cannot determine, which makes it impossible to crawl the information by calling the URL directly. If the signature algorithm cannot be cracked, the HTTP route is a dead end.

Pitfall 2: The information retrieved over HTTP is inconsistent with the page display

Take a detail page of the same messaging app as an example. Compared with what the page shows when opened in the app, the information crawled over HTTP is clearly less complete. Both approaches have to be used together to balance speed and completeness.

Pitfall 3: Emulators

Apps automatically detect the runtime environment and block what they do not like. The strictest is that same messaging app: it checks whether you are running on an emulator or a real device, and even which kernel is in use, and imposes restrictions accordingly. I have seen one expert go as far as asking a phone manufacturer to build customized real devices for this purpose.

Pitfall 4: Accounts

This pitfall is a big one. Accounts are hard to obtain and hard to maintain, and worse still is getting banned, which sends you back to square one overnight.

4. Difficulty assessment

1. One star

Apps of this type have no special protection; the URLs requested by the app can be opened directly in a browser.

Difficulty: None

2. Two stars

Apps of this type use cookie and session mechanisms, so requests must carry cookie information to get data.

Difficulties:

1. The request headers must carry the cookie value
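Carrying the cookie along could look roughly like the sketch below, which reduces captured Set-Cookie values to a single Cookie request header. The cookie names, values, and the URL https://api.example.com/profile are made up for illustration; a real app's cookies come from its captured login response.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

class CookieReplay {

    // Reduce captured Set-Cookie header values to one Cookie request header.
    // Attributes such as Path or HttpOnly are dropped; only name=value pairs are sent back.
    static String toCookieHeader(List<String> setCookieValues) {
        return setCookieValues.stream()
                .map(v -> v.split(";", 2)[0].trim())
                .filter(pair -> pair.contains("="))
                .collect(Collectors.joining("; "));
    }

    // Build a GET request that carries the cookie header.
    static HttpRequest buildGet(String url, String cookieHeader) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookieHeader)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical Set-Cookie values as they might appear in a capture.
        String cookie = toCookieHeader(List.of(
                "sid=abc123; Path=/; HttpOnly",
                "token=xyz; Max-Age=3600"));
        HttpRequest request = buildGet("https://api.example.com/profile", cookie);
        System.out.println(request.headers().firstValue("Cookie").orElse(""));
    }
}
```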

3. Three stars

When an app of this type sends a request, it adds an MD5 verification field to the headers. The field is computed by applying some special processing to the request URL parameters and then hashing the result. To crawl this type of app you need to decompile it, read a lot of code, and work out the hash algorithm and how the parameters are concatenated.

Difficulties:

1. Decompiling

2. The ability to read Android code

3. Finding the algorithm takes a lot of time and energy, which is the most painful part
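Once the algorithm has been recovered, reproducing the field is usually short. The sketch below shows one common shape for such a scheme, and it is an assumption rather than any particular app's algorithm: sort the parameters by name, join them as name=value pairs, append a secret string, and take the MD5 hex digest. The parameter names and the secret are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class Md5Signer {

    // Lowercase hex MD5 digest of a UTF-8 string.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumed scheme: sort parameters by name, join as k=v&k=v, append the secret, hash.
    static String sign(Map<String, String> params, String secret) {
        String joined = new TreeMap<>(params).entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return md5Hex(joined + secret);
    }

    public static void main(String[] args) {
        // Hypothetical parameters and secret, for illustration only.
        Map<String, String> params = Map.of("page", "1", "id", "42");
        System.out.println(sign(params, "hypothetical-secret"));
    }
}
```

Sorting through a TreeMap makes the signature independent of the order in which parameters were added, which is why many schemes of this kind sort before hashing.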

4. Four stars

When an app of this type sends a URL request, the backend encrypts the useful data in the response, so the actual data cannot be seen in the packet-capture tool. To crawl this type of app you have to decompile it first and then work out the algorithm used to encrypt the data. Only after the algorithm is cracked can the data be parsed.

Difficulties:

1. The required data cannot be analyzed with the packet-capture tool

2. Decompiling

3. The ability to read Android code and find the algorithm that encrypts the data

4. Finding the algorithm takes a lot of time and energy, which is the most painful part
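After the algorithm and key have been found in the decompiled code, decrypting a response is mostly mechanical. The following sketch assumes, purely for illustration, that the body is Base64 text encrypted with AES in CBC mode with PKCS5 padding; the key and IV are made-up 16-byte values, and a real app may use a different cipher, mode, or encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class PayloadCrypto {

    // Run AES/CBC/PKCS5Padding in the given mode (encrypt or decrypt).
    private static byte[] run(int mode, byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    // Decrypt a Base64-encoded response body into readable text.
    static String decrypt(String base64Body, byte[] key, byte[] iv) {
        byte[] cipherText = Base64.getDecoder().decode(base64Body);
        return new String(run(Cipher.DECRYPT_MODE, key, iv, cipherText), StandardCharsets.UTF_8);
    }

    // Encrypt text the same way; used here only to produce sample input.
    static String encrypt(String plain, byte[] key, byte[] iv) {
        byte[] cipherText = run(Cipher.ENCRYPT_MODE, key, iv, plain.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(cipherText);
    }

    public static void main(String[] args) {
        // Hypothetical 16-byte key and IV; real values come from the decompiled app.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        byte[] iv = "fedcba9876543210".getBytes(StandardCharsets.UTF_8);

        String body = encrypt("{\"name\":\"demo\"}", key, iv);
        System.out.println(body);
        System.out.println(decrypt(body, key, iv));
    }
}
```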

5. Tools

1. Packet capture tool

Wireshark (macOS)

Charles (macOS)

Fiddler (Windows)

2. Decompile

Apktool, dex2jar, jd-gui-windows

Jadx-gui: decompiles dex files directly; convenient and easy to use

JD-GUI: requires converting the dex file to a jar file first; supports jumping to functions

JEB: used less often

3. Hook tools

Xposed

Frida

 

6. Examples/Articles

[APP Reverse-Entry Level] Remember the reverse process of a gray chan live broadcast APP

app reverse engineering

[Original] How to use Xposed+JustTrustMe to break through SSL Pinning

Crawler Mobile App-Tips for Data Collection

Crawl mobile APP data

Detailed explanation of fiddler packet capture tool (2)


Origin blog.csdn.net/someby/article/details/108454889