Tutorial on Using the Ember Malware Dataset

    In April 2018, the cybersecurity firm Endgame released a large open-source dataset called EMBER. EMBER is a collection of features extracted from over 1 million benign and malicious PE files (Windows executables), a format commonly used to carry malware. Along with the dataset, the company published a tutorial on GitHub explaining how to use it. Following that tutorial, I set up the project's runtime environment and got it running. I ran into several problems along the way, which are collected below together with the fixes I used.

 

1. Run

pip install lief==0.8.3

Error: lief not found
Solution:
     Upgrade pip from the terminal, e.g. pip install --upgrade pip
 
Error: timed out

Solution:

     Raise pip's timeout with --default-timeout, substituting the package that timed out for Pillow:
     pip --default-timeout=100 install -U Pillow
 
2. lief still can't be found
     Reason: lief could not be found in the Python package index, so locate the package source and install it directly
     Solution:
     Search for "lief python" to find the project's GitHub link, then run the following in the terminal:
     pip install https://github.com/lief-project/packages/raw/lief-master-latest/pylief-0.8.3.dev.zip
 
3. Run  python train_ember.py [/path/to/dataset]
Error: tqdm not found
Solution:
     Search for tqdm on GitHub, then run the following in the terminal:
     pip install -e git+https://github.com/tqdm/tqdm.git@master#egg=tqdm
 
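Note: [/path/to/dataset] is the path of the folder that contains the extracted dataset (a folder, not a single file). For example, I extracted the dataset, renamed the folder ember_data, and placed it in the same directory as train_ember.py, so I ran python train_ember.py ember_data/
(the same applies below)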
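The folder-versus-file distinction in the note can be checked before starting a long training run. The following is a minimal sketch, not part of the EMBER project: check_dataset_dir is a hypothetical helper name, and the assumption that the extracted dataset consists of .jsonl files should be verified against your own copy.

```python
import os


def check_dataset_dir(path):
    """Return the sorted .jsonl file names in `path`, or raise if `path` is unusable."""
    if os.path.isfile(path):
        # The mistake the note warns about: a single file instead of the folder.
        raise NotADirectoryError(
            "%r is a single file; pass the folder that holds the extracted files" % path
        )
    if not os.path.isdir(path):
        raise FileNotFoundError("%r does not exist" % path)
    # Assumption: the extracted dataset is a set of .jsonl feature files.
    return sorted(name for name in os.listdir(path) if name.endswith(".jsonl"))
```

Calling check_dataset_dir("ember_data/") from the directory that holds train_ember.py should list the feature files if the layout matches the note; an empty list suggests the path points at the wrong folder.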
 
4. After the installation in the previous step, pip prints the following messages
ember 0.1.0 requires lightgbm==2.1.0, which is not installed.
ember 0.1.0 has requirement numpy==1.14.2, but you'll have numpy 1.13.3 which is incompatible.
ember 0.1.0 has requirement pandas==0.22.0, but you'll have pandas 0.20.3 which is incompatible.
ember 0.1.0 has requirement tqdm==4.21.0, but you'll have tqdm 4.23.2 which is incompatible.
Reason: the installed package versions do not match the versions that ember requires
 
Solution:
Reinstall the required versions:
pip install -v lightgbm==2.1.0
pip install -v numpy==1.14.2
pip install -v tqdm==4.21.0
conda install pandas=0.22.0
 
pandas is installed with conda because pip kept timing out. conda turned out to be much faster, so I recommend using it for this package from the start.
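After reinstalling, it can be worth confirming that the environment now matches the pins from the pip messages in step 4. The sketch below is my own illustration, not something shipped with ember: find_mismatches and installed_versions are hypothetical helper names, and installed_versions assumes Python 3.8 or newer for importlib.metadata.

```python
# Version pins copied from the pip messages in step 4.
REQUIRED = {
    "lightgbm": "2.1.0",
    "numpy": "1.14.2",
    "pandas": "0.22.0",
    "tqdm": "4.21.0",
}


def find_mismatches(required, installed):
    """Map package name -> (wanted, found) for every pin that is not satisfied.

    `found` is None when the package is absent from `installed`.
    """
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_versions(names):
    """Return {name: version} for the named packages that are installed."""
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # left out, so find_mismatches reports it as missing
    return found
```

Running find_mismatches(REQUIRED, installed_versions(REQUIRED)) should return an empty dict once all four packages are at the required versions.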
5. Training samples
[Figure: the corresponding instructions from the source tutorial]

Error: unrecognized arguments
 
Cause: the dataset path was wrong
 
Solution:
Put the extracted dataset (renamed ember_data) in the same directory as train_ember.py, then run python train_ember.py ember_data/
 
final result:
6. Run the classify_binaries.py file
[Figure: the corresponding instructions from the source tutorial]
[/path/to/model] is the model.txt file generated during training in the previous step. In step 5 I put the dataset folder in the same directory as train_ember.py, so it is also a sibling of classify_binaries.py.
Then run the following command:
python classify_binaries.py -m ember_data/model.txt
 
[Note]
For some reason model.txt does not show up in the file browser, although searching for it does find it. This should not stop the code from finding it.
 
[Running result] (with errors)
The script reports that there is no binary file; as I read it, the problem was still that it could not find my model.txt file.
At this point, use the terminal to enter the ember_dataset directory and type ll (two lowercase letter L's, the 12th letter of the English alphabet).
This refreshes the file listing; then close the folder and reopen it.
 
[Run again]
Back in the scripts directory in the terminal:
     python classify_binaries.py -m ember_data/model.txt
That still did not work, so I skipped this step for now.
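One possible explanation, offered as an assumption rather than something verified in this walkthrough, is that the script expects the files to classify as additional arguments after the model path, so a command that gives only -m is rejected. The sketch below reproduces that behaviour with a hypothetical argparse definition; build_parser and try_parse are illustrative names and the parser is not copied from classify_binaries.py.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for the script's command line, NOT copied from it."""
    parser = argparse.ArgumentParser(prog="classify_binaries")
    parser.add_argument("-m", dest="modelpath", required=True, help="trained model file")
    # Assumption: one or more files to classify must follow the options.
    parser.add_argument("binaries", nargs="+", help="PE files to classify")
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage error to stderr and exits on invalid input.
        return None
```

With this definition, try_parse(["-m", "ember_data/model.txt"]) is rejected, while adding a file path such as sample.exe (a made-up name) is accepted. If the real script behaves the same way, appending the path of an executable to the command in step 6 would be the thing to try.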
 
7. Continue with the remaining steps
Open a terminal in the scripts directory (the one containing train_ember.py), start a python3 interpreter, and enter:
     import ember
     ember.create_vectorized_features("ember_dataset/")
     ember.create_metadata("ember_dataset/")
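[Note]
ember_dataset/ is the dataset directory.
The source tutorial uses /data/ember/, which is not a path relative to the current directory, so pay close attention to the path when running this kind of code.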
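To make it obvious which directory a relative path such as ember_dataset/ actually resolves to, the two calls from this step can be wrapped so that they always receive an absolute path. This is a small sketch under assumptions: prepare_dataset is a hypothetical helper, the ember module is passed in as an argument so the helper itself does not import it, and it assumes both functions accept any valid directory path.

```python
import os


def prepare_dataset(ember_module, data_dir):
    """Run the two preparation calls from step 7 with an absolute path.

    Returns the absolute path that was used, so it can be printed and checked.
    """
    # Relative paths are resolved against the current working directory.
    abs_dir = os.path.abspath(data_dir)
    ember_module.create_vectorized_features(abs_dir)
    ember_module.create_metadata(abs_dir)
    return abs_dir
```

In the interpreter this would be used as import ember followed by print(prepare_dataset(ember, "ember_dataset/")); if the printed path is not the folder holding the extracted dataset, the interpreter was started from the wrong directory.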
……
Continue to execute the rest of the code without any problems
……
 
8. Finally, put an .exe file in the specified directory and classify it to check whether it is malicious

Source github link (dataset + tutorial + source code): https://github.com/endgameinc/ember

Company blog post: https://www.endgame.com/blog/technical-blog/introducing-ember-open-source-classifier-and-dataset
