Implementation ideas
Environment:
In the article https://blog.csdn.net/lockhou/article/details/113883940 we processed a set of .c files on Windows to generate the corresponding AST files, and then produced text vectors from the ASTs through node matching. Each .c file therefore has a matching .txt file storing its AST and a matching .txt file storing its text vector; all three files share the same name, because whether a file is vulnerable is encoded in the file name.
Ideas:
Our approach is to split the files into Train, Test, and Validation sets, then read each .c file, remove blanks and stop words, and save the converted data as a pickle file, which supplies the data needed for the subsequent word-vector conversion, model training, model testing, and model validation. Since we now want text vectors to carry structural information, we no longer read the .c files directly; instead we read each .c file's corresponding text-vector .txt file and perform the same operations, so that the data used in the subsequent word-vector conversion, model training, model testing, and model validation carries structural information.
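The read-then-pickle step described above can be sketched in isolation. This is a minimal hypothetical example (the file names and token format are assumptions, not taken from the original scripts): a text-vector .txt file is read instead of a .c file, split into tokens, and saved as a pickle for later training stages.

```python
import os
import pickle
import tempfile

# Hypothetical mini-pipeline: one text-vector .txt per function, pickled for later use.
workdir = tempfile.mkdtemp()
vec_path = os.path.join(workdir, "sample_func.txt")
with open(vec_path, "w") as fh:
    # A text vector in the comma-separated format produced later in this article.
    fh.write("int,func_name,params,param,int,return,\n")

# Read the text-vector file (instead of the .c file) and tokenize it.
tokens = open(vec_path).read().strip().strip(",").split(",")

# Save the token list as a pickle file for the word-vector / training stages.
pkl_path = os.path.join(workdir, "sample_func.pkl")
with open(pkl_path, "wb") as fh:
    pickle.dump(tokens, fh)

restored = pickle.load(open(pkl_path, "rb"))
print(restored)  # ['int', 'func_name', 'params', 'param', 'int', 'return']
```

The same pattern scales to whole Train/Test/Validation directories by wrapping the read-and-pickle step in an os.walk loop.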
Moving the process from Windows to Linux
step1 install the JDK (Java)
See the article for specific steps: https://blog.csdn.net/lockhou/article/details/113904085
step2 modify movefiles.py
I decided to create, at the same level as each directory that holds .c files after the move, a folder to store the AST files extracted from the .c files (the Preprocessed folder) and a folder for the converted text-vector files (the processed folder). So, while building the Train, Test, and Validation folders and their subfolders, we create Preprocessed and processed folders under the Non_vulnerable_functions and Vulnerable_functions folders of each combination, by adding the following code at lines 46-55:
saveDir = tempDir
tempDir = saveDir + '/' + "Preprocessed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
    os.mkdir(tempDir)
tempDir = saveDir + '/' + "processed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/processed
    os.mkdir(tempDir)
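As a side note, the exists-then-mkdir pair above can be collapsed with os.makedirs(exist_ok=True), which is also safe to re-run. This is a small standalone sketch using a throwaway temporary directory in place of a real Non_vulnerable_functions/Vulnerable_functions directory:

```python
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for one Non_vulnerable_functions/... directory
for sub in ("Preprocessed", "processed"):
    # makedirs with exist_ok=True replaces the os.path.exists()/os.mkdir() pair
    # and does not fail if the folder already exists.
    os.makedirs(os.path.join(root, sub), exist_ok=True)

print(sorted(os.listdir(root)))  # ['Preprocessed', 'processed']
```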
step3 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py
We now have places to store the ASTs and text vectors, so we only need to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make them easy to call, we reorganize the two files into functions and pass the required values as parameters at call time.
ProcessCFilesWithCodeSensor.py parameters:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file, e.g.
"G:\thesis\thesis\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of CodeSensor.jar, e.g.
"D:\codesensor\CodeSensor.jar" (this location is fixed, so it does not need to be passed in as a parameter)
3) PATH: the directory where the .c files are stored, e.g.
"G:\Thesis\Thesis\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"
ProcessRawASTs_DFT.py parameters:
1) FILE_PATH: the directory where the AST .txt files are stored, e.g.
"G:\Thesis\Thesis\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory where the text-vector .txt files are stored, e.g.
"G:\Thesis\Thesis\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"
According to the above parameter requirements, we organize the file content into functions as follows:
# ProcessCFilesWithCodeSensor.py
import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Get the .c files only
                file_path = os.path.join(fpathe, f)  # f is the .c file to be processed by CodeSensor
                # Run CodeSensor on each .c file and write its AST output to the Preprocessed directory.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    p = Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path],
                              stdout=output_file, stderr=STDOUT)
                    p.wait()  # wait for CodeSensor to finish so the AST file is complete before text_vector reads it
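The naming convention the function relies on (each foo.c yields Preprocessed/foo.txt with the same basename) can be checked in isolation. ast_output_path below is a hypothetical helper written for illustration, not part of the original scripts:

```python
import os

def ast_output_path(output_dir, c_filename):
    # Same basename as the .c file, with a .txt extension, under the output directory.
    base = os.path.splitext(c_filename)[0]
    return os.path.join(output_dir, base + ".txt")

print(ast_output_path("Preprocessed/", "avcodec.c"))  # Preprocessed/avcodec.txt
```

Keeping the basename identical across the .c file, the AST file, and the text-vector file is what lets the vulnerability label (carried in the file name) survive each conversion step.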
# ProcessRawASTs_DFT.py
def DepthFirstExtractASTs(file_to_process, file_name):
    lines = []
    subLines = ''
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty lines.
                line = line.strip('\n')
                str_lines = line.split('\t')
                if str_lines[0] != "water": # Skip lines starting with "water".
                    if str_lines[0] == "func":
                        # Add the return type of the function.
                        subElement = str_lines[4].split() # Deal with compound types such as "static int" or "static void".
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # No pointer type; a pointer type like "int *" splits into 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function.
                        lines.append("func_name") # Add a placeholder for the function name.
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter.
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type.
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type.
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable.
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable.
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable.
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    # All remaining node types are kept verbatim as single tokens.
                    if str_lines[0] in ("params", "stmnts", "arg", "if", "cond", "else",
                                        "stmts", "for", "forinit", "while", "return",
                                        "continue", "break", "goto", "forexpr", "sizeof",
                                        "do", "switch", "typedef", "default", "register",
                                        "enum", "union"):
                        lines.append(str_lines[0])
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
    return subLines
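To illustrate the traversal rules above, here is a simplified stand-in that implements only a small subset of them (the sample CodeSensor lines are invented for illustration; they assume CodeSensor's tab-separated format with the node type in column 0 and its text in column 4):

```python
def extract_tokens(raw_lines):
    # Simplified subset of DepthFirstExtractASTs: keep selected node types as tokens,
    # expand the type column for "func" and "param" nodes.
    keywords = {"params", "return", "if", "while", "for", "break", "continue"}
    tokens = []
    for line in raw_lines:
        if line.isspace():
            continue
        fields = line.rstrip("\n").split("\t")
        kind = fields[0]
        if kind == "water":           # "water" rows are plain source text; skip them
            continue
        if kind == "func":
            tokens.append(fields[4])  # return type, e.g. "int"
            tokens.append("func_name")
        elif kind == "param":
            tokens.append("param")
            tokens.append(fields[4])  # parameter type
        elif kind in keywords:
            tokens.append(kind)
    return ",".join(tokens) + ",\n"

sample = [
    "func\t1\t0\t4\tint\tmain\n",
    "params\t1\t8\t14\t\n",
    "param\t1\t9\t13\tint\n",
    "water\t1\t15\t16\t{\n",
    "return\t2\t4\t13\t\n",
]
print(extract_tokens(sample))  # int,func_name,params,param,int,return,
```

Each AST node thus contributes one or more tokens to a flat, comma-separated text vector, which is what the processed folder ends up holding for every function.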
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0
    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt':  # Get the AST .txt files only
                temp = DepthFirstExtractASTs(FILE_PATH + f, f)
                print(temp)
                f1 = open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w")
                f1.write(temp)
                f1.close()
                total_processed = total_processed + 1
    print("In total, " + str(total_processed) + " files were processed.")
step4 modify movefiles.py
We have now created the locations for the AST files and the text-vector files. We only need to run codesensor over every directory containing .c files, so that each .c file's AST is generated and stored under that directory's Preprocessed folder, and then call text_vector, so that each AST file in Preprocessed is converted into a text-vector file stored under the corresponding processed folder. So we add the following code at the end of movefiles.py, after all the folders have been created:
from ProcessCFilesWithCodeSensor import *
from ProcessRawASTs_DFT import *

for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i] + "/Preprocessed/", Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i] + "/Preprocessed/", Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i] + "/Preprocessed/", Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i] + "/Preprocessed/", Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i] + "/Preprocessed/", Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i] + "/Preprocessed/", Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i] + "/Preprocessed/", Non_vul_func_trainDir[i] + "/processed/")
    text_vector(Non_vul_func_testDir[i] + "/Preprocessed/", Non_vul_func_testDir[i] + "/processed/")
    text_vector(Non_vul_func_validationDir[i] + "/Preprocessed/", Non_vul_func_validationDir[i] + "/processed/")
    text_vector(Vul_func_trainDir[i] + "/Preprocessed/", Vul_func_trainDir[i] + "/processed/")
    text_vector(Vul_func_testDir[i] + "/Preprocessed/", Vul_func_testDir[i] + "/processed/")
    text_vector(Vul_func_validationDir[i] + "/Preprocessed/", Vul_func_validationDir[i] + "/processed/")
step5 Modify the removeComments_Blanks.py and LoadCFilesAsText.py files
Because we want to replace each .c file with its generated text-vector file, and removeComments_Blanks.py and LoadCFilesAsText.py currently read the .c files directly, strip their contents, and save the result as pickle files, we now substitute the text-vector files: we read the text-vector file from the processed folder, and the remaining txt-reading operations stay unchanged.
So, in removeComments_Blanks.py and LoadCFilesAsText.py, every directory path that points at the .c files must be extended down to the processed folder in the same directory, so that the scripts read the text-vector files, and the file-type check must look for .txt files instead of .c files.
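The substitution described above can be sketched as a small standalone function. load_text_vectors is a hypothetical stand-in for the loading loop in LoadCFilesAsText.py (the function name and the sample file are invented for illustration): it walks a processed folder, matches .txt instead of .c, and returns the token lists.

```python
import os
import tempfile

def load_text_vectors(processed_dir):
    # Walk the processed/ folder and collect one token list per text-vector .txt file
    # (the same role LoadCFilesAsText.py plays for .c files).
    samples = []
    for dirpath, _dirs, files in os.walk(processed_dir):
        for name in sorted(files):
            if os.path.splitext(name)[1] == ".txt":   # look for .txt instead of .c
                with open(os.path.join(dirpath, name)) as fh:
                    tokens = fh.read().strip().strip(",").split(",")
                samples.append((name, tokens))
    return samples

# Minimal demonstration with a throwaway directory standing in for processed/.
d = tempfile.mkdtemp()
with open(os.path.join(d, "f1.txt"), "w") as fh:
    fh.write("int,func_name,params,\n")

print(load_text_vectors(d))  # [('f1.txt', ['int', 'func_name', 'params'])]
```

From here the token lists can be cleaned and pickled exactly as the original scripts do for .c file contents, so the rest of the pipeline (word-vector conversion, training, testing, validation) is untouched.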