Implementation: converting c files to ASTs, then to text vectors, before word vector conversion

Implementation approach

Environment:

In the article https://blog.csdn.net/lockhou/article/details/113883940, we implemented, on Windows, the generation of an AST file for each of a series of c files, and the generation of a text vector from each AST file through node matching. As a result, each c file has a corresponding txt file storing its AST and a corresponding txt file storing its text vector. The three files share the same name, because whether a file is vulnerable is indicated by its file name.

Ideas:

Originally, the files are divided into Train, Test, and Validation sets, and then each .c file is read directly, blank lines and stop words are removed, and the resulting data is saved as a pickle file. The pickle file supplies the data needed for the subsequent word vector conversion, model training, model testing, and model validation. Since we now want the text vectors to carry structural information, we no longer read the c files directly. Instead, we read the text vector txt file that corresponds to each c file and apply the same operations, so that the data used for word vector conversion, model training, model testing, and model validation contains structural information.

Migrating from Windows to Linux

step1 Install the JDK (Java)

See the article for specific steps: https://blog.csdn.net/lockhou/article/details/113904085

step2 modify movefiles.py

After the move, I decided to create two folders in each directory that contains c files: one to store the AST files extracted from the c files (the Preprocessed folder) and one to store the converted text vector files (the processed folder). So, while building the Train, Test, and Validation folders and their subfolders, we also create the Preprocessed and processed folders under the Non_vulnerable_functions and Vulnerable_functions folders of each combination. To do this, add the following code at lines 46-55 of movefiles.py:

        saveDir = tempDir
        tempDir = saveDir + '/' + "Preprocessed"
        if not os.path.exists(tempDir):
            #Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
            os.mkdir(tempDir)
        tempDir = saveDir + '/' + "processed"
        if not os.path.exists(tempDir):
            #Non_vulnerable_functions/Non_vulnerable_functions/processed
            os.mkdir(tempDir)
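
The same folder creation can also be sketched as a small standalone helper. This is only an illustration: the function name ensure_output_dirs is hypothetical and not part of movefiles.py, and os.makedirs with exist_ok=True stands in for the explicit os.path.exists checks.

```python
import os

def ensure_output_dirs(c_file_dir):
    """Create Preprocessed/ and processed/ inside the directory holding the .c files.

    Hypothetical helper; the folder names follow the ones used in this post.
    """
    created = []
    for name in ("Preprocessed", "processed"):
        path = os.path.join(c_file_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created
```

Calling the helper twice on the same directory is harmless, which is convenient when movefiles.py is re-run.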

step2 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py

We now have a place to store the AST files and the text vector files, so all that remains is to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make these calls convenient, we reorganize each of the two files into a function and turn the values it needs into formal parameters that are passed in at call time.

ProcessCFilesWithCodeSensor.py parameters:

1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file
"G:\thesis\thesis\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of CodeSensor.jar
"D:\codesensor\CodeSensor.jar" (the location is fixed, so it does not need to be passed in as a parameter)
3) PATH: the directory where the .c files are stored
"G:\Thesis\Thesis\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"

ProcessRawASTs_DFT.py parameters:

1) FILE_PATH: the directory where the AST txt files are stored
"G:\Thesis\Thesis\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory where the txt files storing the text vectors are saved
"G:\Thesis\Thesis\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"

According to the above parameter requirements, we organize the file content into functions as follows:

#ProcessCFilesWithCodeSensor.py

import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"

    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Process the .c files only
                file_path = os.path.join(fpathe, f)  # the .c file that CodeSensor will parse

                # CodeSensor prints the AST to stdout; redirect it to <name>.txt in the output directory.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    # wait() blocks until Java exits, so the AST file is complete before it is read.
                    Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT).wait()
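
Because text_vector reads the AST files right after codesensor returns, each Java process must have finished writing its output by then. A minimal sketch of running a command with its output redirected to a file and blocking until it exits, using subprocess.run. The helper name run_to_file is hypothetical and is not part of ProcessCFilesWithCodeSensor.py.

```python
import subprocess

def run_to_file(command, output_path):
    """Run command, send stdout and stderr to output_path, and wait for it to exit.

    Hypothetical helper. subprocess.run only returns once the child process
    has finished, so the output file is complete when this function returns.
    """
    with open(output_path, "w") as output_file:
        result = subprocess.run(command, stdout=output_file, stderr=subprocess.STDOUT)
    return result.returncode

# Possible use with the paths from this post (not executed here):
# run_to_file(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], Full_path)
```

A non-zero return code is a cheap signal that Java or CodeSensor failed for that file, in which case the txt file holds an error message instead of an AST.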



# ProcessRawASTs_DFT.py

import os

def DepthFirstExtractASTs(file_to_process, file_name):
    
    lines = []
    subLines = ''
    
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')   
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain pointer type. If it contains pointer like (int *), it will be divided to 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1]) 
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])    
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add the name of the function
                    if str_lines[0] == "params":
                        lines.append("params")                    
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1]) 
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])    
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type                           
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")                    
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1]) 
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])    
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for") 	
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")   
                    if str_lines[0] == "switch":
                        lines.append("switch")   
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
                    
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
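    finally:
        f.close()
    # Return outside the finally block so that an exception raised while parsing is not silently discarded.
    return subLines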
 
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0

    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt':  # Process the AST .txt files only
                file_path = os.path.join(fpathe, f)  # the AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(file_path, f)
                print(temp)

                # Write the text vector to a .txt file with the same base name.
                with open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w") as f1:
                    f1.write(temp)

                total_processed = total_processed + 1

    print("Totally, there are " + str(total_processed) + " files.")
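
To make the node matching easier to follow, here is a condensed, table-driven sketch of the same idea together with a worked example. It is a simplification, not a drop-in replacement for DepthFirstExtractASTs: the names tokens_for_line and ast_lines_to_vector are hypothetical, pointer types are always kept as one token, and the sample lines are invented to match only the column positions the function reads (field 0 is the node type, field 4 is the type, operator, or callee), so they are not real CodeSensor output.

```python
# Node types that are copied into the text vector unchanged.
PLAIN_NODES = {
    "params", "stmnts", "arg", "if", "cond", "else", "stmts", "for",
    "forinit", "while", "return", "continue", "break", "goto", "forexpr",
    "sizeof", "do", "switch", "typedef", "default", "register", "enum", "union",
}

def split_type(type_text):
    """Split 'static int' into words, but keep a pointer type such as 'char *' whole."""
    parts = type_text.split()
    return [type_text] if "*" in parts else parts

def tokens_for_line(line):
    """Map one tab-separated AST line to zero or more text vector tokens."""
    fields = line.rstrip("\n").split("\t")
    node = fields[0]
    if node == "func":
        return split_type(fields[4]) + ["func_name"]  # the real name is masked
    if node in ("param", "decl"):
        return [node] + split_type(fields[4])
    if node == "op":
        return [fields[4]]
    if node == "call":
        return ["call", fields[4]]
    if node in PLAIN_NODES:
        return [node]
    return []  # "water" lines, blank lines, and unknown node types are dropped

def ast_lines_to_vector(lines):
    """Join the tokens of all lines into one comma-separated line."""
    tokens = [tok for line in lines for tok in tokens_for_line(line)]
    return ",".join(tokens) + ",\n"

# Invented sample input (assumed layout, see the note above).
SAMPLE_AST_LINES = [
    "func\t1:0\t4:1\t0\tstatic int\tadd",
    "params\t1:14\t1:28\t1",
    "param\t1:15\t1:20\t2\tint\ta",
    "water\t1:30\t1:31\t0\t{",
    "stmnts\t1:32\t4:0\t1",
    "decl\t2:4\t2:13\t2\tchar *\tbuf",
    "call\t3:4\t3:12\t2\tfree",
    "return\t3:14\t3:22\t2",
]
```

For the sample above, ast_lines_to_vector(SAMPLE_AST_LINES) yields the single line static,int,func_name,params,param,int,stmnts,decl,char *,call,free,return, followed by a newline. The function name is replaced by the fixed token func_name and the "water" line disappears, which is the same treatment DepthFirstExtractASTs gives them.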

step3 modify movefiles.py

We now have locations for the AST files and the text vector files. For each directory that contains c files, we call the codesensor function to generate an AST file for every c file and store it in the Preprocessed folder of that directory, and then call the text_vector function to convert every AST file in the Preprocessed folder into a text vector file stored in the processed folder of that directory. So we add the following code at the end of movefiles.py, that is, after all the folders have been created:

from ProcessCFilesWithCodeSensor import *
from ProcessRawASTs_DFT import *

for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
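
After this loop has run, every c file should have an AST file and a text vector file with the same base name. A quick sanity check can be sketched as follows. The function name missing_outputs is hypothetical, and the check assumes the Preprocessed and processed folders sit directly inside the directory that holds the c files.

```python
import os

def missing_outputs(c_file_dir):
    """Return base names of .c files that lack an AST file or a text vector file.

    Hypothetical check, assuming the folder layout described in this post.
    """
    missing = []
    for name in sorted(os.listdir(c_file_dir)):
        base, ext = os.path.splitext(name)
        if ext != ".c":
            continue
        ast_file = os.path.join(c_file_dir, "Preprocessed", base + ".txt")
        vector_file = os.path.join(c_file_dir, "processed", base + ".txt")
        if not (os.path.isfile(ast_file) and os.path.isfile(vector_file)):
            missing.append(base)
    return missing
```

An empty list for each of the six directories means the file names line up, which matters because the label of a sample is taken from its file name.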
    

step4 Modify the removeComments_Blanks.py and LoadCFilesAsText.py files

We want the generated text vector files to take the place of the c files. removeComments_Blanks.py and LoadCFilesAsText.py read the c files directly, clean their content, and save the result as pickle files. To substitute the text vector files, we only need to read them from the processed folder; the other operations applied to the file content remain unchanged.
So, in removeComments_Blanks.py and LoadCFilesAsText.py, every directory from which c files are read must be extended to the processed folder in the same directory as the c files, so that the text vector files are read instead, and the file type check must look for txt files instead of c files.
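
The two scripts themselves are not shown in this post, so the following is only a rough sketch of the change described above, under the assumption that each text vector file holds one comma-separated line. The names load_text_vectors and save_as_pickle are hypothetical.

```python
import os
import pickle

def load_text_vectors(root_dir):
    """Collect (file_name, tokens) pairs from every processed/ folder under root_dir."""
    samples = []
    for current_dir, dirs, files in os.walk(root_dir):
        dirs.sort()  # make the traversal order deterministic
        if os.path.basename(current_dir) != "processed":
            continue  # skip the .c files and the raw ASTs in Preprocessed/
        for name in sorted(files):
            if os.path.splitext(name)[1] != ".txt":  # txt files instead of c files
                continue
            with open(os.path.join(current_dir, name)) as handle:
                text = handle.read()
            # Drop the empty item left by the trailing comma. Note that an
            # operator token that is itself a comma would be lost by this split.
            tokens = [tok for tok in text.strip().split(",") if tok]
            samples.append((name, tokens))
    return samples

def save_as_pickle(samples, pickle_path):
    """Store the loaded samples for the later word vector conversion."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(samples, handle)
```

Keeping the file name next to the tokens preserves the information needed to tell vulnerable from non-vulnerable samples later on.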

step5 Finally, run the word vector script and the training and testing scripts as usual.


Origin blog.csdn.net/lockhou/article/details/113921061