Spam Detection: Feature Extraction and Model Training

Part 2

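# Load the saved word dictionary.
wordsDict = readDict(filepath='./results.pkl')
# Keep only the top 4000 entries.
wordsDict = getDictTopk(dict_data=wordsDict, topk=4000)
# Persist the reduced dictionary for the feature-extraction step.
saveDict(dict_data=wordsDict, savepath='./wordsDict.pkl')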

Explanation:

readDict() loads a previously saved word dictionary from disk.
getDictTopk() keeps only the top 4000 entries, which fixes the vocabulary size, and therefore the feature-vector length used in Part 3, at 4000.
saveDict() writes the reduced dictionary to ./wordsDict.pkl so that later steps can reload it.
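
The three helpers are not shown in this section, so the following is only a minimal sketch of how they might behave. It assumes the dictionary maps each word to an occurrence count, that "top" means "most frequent", and that the files are pickles. The names read_dict_sketch, save_dict_sketch and get_dict_topk_sketch are hypothetical stand-ins, not the original implementations.

```python
import pickle


def read_dict_sketch(filepath):
    """Load a pickled {word: count} dictionary (assumed file format)."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)


def save_dict_sketch(dict_data, savepath):
    """Write the dictionary to disk as a pickle."""
    with open(savepath, 'wb') as f:
        pickle.dump(dict_data, f)


def get_dict_topk_sketch(dict_data, topk):
    """Keep the topk entries with the highest counts (assumed ranking)."""
    ranked = sorted(dict_data.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:topk])
```

Because Python dictionaries preserve insertion order, the result of this sketch is ordered from most to least frequent, which gives every kept word a stable position.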

Part 3

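import os

# Local dataset directories ('新建文件夹' is Chinese for "New Folder"); adjust to your machine.
normal_path = 'D:/6wanDownload/新建文件夹/normal'
spam_path = 'D:/6wanDownload/新建文件夹/spam'
# Reload the 4000-word dictionary saved in Part 2.
wordsDict = readDict(filepath='./wordsDict.pkl')
normals = getFilesList(filepath=normal_path)
spams = getFilesList(filepath=spam_path)
fvs = []
# Normal messages first, so they occupy the first normal_len entries of fvs.
for normal in normals:
    fv = extractFeatures(filepath=os.path.join(normal_path, normal), wordsDict=wordsDict, fv_len=4000)
    fvs.append(fv)
normal_len = len(fvs)
# Spam messages follow the normal ones in the same list.
for spam in spams:
    fv = extractFeatures(filepath=os.path.join(spam_path, spam), wordsDict=wordsDict, fv_len=4000)
    fvs.append(fv)
spam_len = len(fvs) - normal_len
print('[INFO]: Normal-%d, Spam-%d' % (normal_len, spam_len))
# Combine the per-file vectors and save them; the file name records both counts.
fvs = mergeFv(fvs)
saveNparray(np_array=fvs, savepath='./fvs_%d_%d.npy' % (normal_len, spam_len))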

Explanation:

normal_path and spam_path point to the directories holding the normal and spam messages.
readDict() reloads the 4000-word dictionary saved in Part 2.
getFilesList() returns the file names in each directory.
For each normal file, extractFeatures() builds a feature vector of length 4000 (fv_len) from the dictionary, and the vector is appended to fvs.
After that loop, normal_len is set to len(fvs), the number of normal feature vectors.
The spam files are processed the same way and appended to the same list, so all normal vectors come before all spam vectors.
spam_len is the total length of fvs minus normal_len.
Both counts are printed as a sanity check.
mergeFv() combines the list of vectors into a single array, which replaces fvs.
saveNparray() writes that array to a .npy file whose name embeds both counts, for example fvs_7063_7775.npy.
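
extractFeatures() itself is not shown here. One plausible reading is a bag-of-words count vector, sketched below under several assumptions: a word's position in the dictionary is its index in the vector, tokens are separated by whitespace, and the function takes the message text directly rather than a file path so that the example is self-contained. The name extract_features_sketch is hypothetical, and the original may tokenize and weight words differently.

```python
def extract_features_sketch(text, words_dict, fv_len):
    """Count how often each dictionary word occurs in text."""
    # Assumption: a word's position in the dictionary is its vector index.
    index = {word: i for i, word in enumerate(list(words_dict)[:fv_len])}
    fv = [0] * fv_len
    for token in text.split():  # assumption: whitespace tokenization
        i = index.get(token)
        if i is not None:
            fv[i] += 1  # words outside the dictionary are ignored
    return fv


words = {'free': 10, 'money': 7, 'meeting': 3}
print(extract_features_sketch('free money free now', words, fv_len=3))
# prints [2, 1, 0]
```

Every message yields a vector of the same length regardless of its size, which is what allows the vectors to be stacked into one array afterwards.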

Part 4

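# Load the feature array saved in Part 3.
fvs = readNparray(filepath='fvs_7063_7775.npy')
# These counts must match the two numbers in the file name.
normal_len = 7063
spam_len = 7775
train(normal_len, spam_len, fvs)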

Explanation:

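readNparray() loads the feature array saved in Part 3.
normal_len and spam_len are hard-coded to 7063 and 7775, the counts embedded in the file name; they must be updated whenever the dataset changes.
train() trains a model on the feature vectors. Because the normal vectors are stored before the spam vectors, the two counts are presumably what train() uses to tell which rows belong to which class.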
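
The hard-coded counts can drift out of sync with the file name. As a hedged sketch, the counts could instead be parsed from the name, and class labels built from them. The helpers parse_counts_sketch and make_labels_sketch are hypothetical, and the 0 = normal, 1 = spam labeling is an assumption about what train() does internally.

```python
import re


def parse_counts_sketch(filename):
    """Recover (normal_len, spam_len) from a name like 'fvs_7063_7775.npy'."""
    match = re.search(r'fvs_(\d+)_(\d+)\.npy$', filename)
    if match is None:
        raise ValueError('unexpected file name: %s' % filename)
    return int(match.group(1)), int(match.group(2))


def make_labels_sketch(normal_len, spam_len):
    """Assumed labeling: 0 for normal rows, 1 for spam rows, in storage order."""
    return [0] * normal_len + [1] * spam_len


normal_len, spam_len = parse_counts_sketch('fvs_7063_7775.npy')
labels = make_labels_sketch(normal_len, spam_len)
print(normal_len, spam_len, len(labels))
# prints 7063 7775 14838
```

Deriving the counts from the file name keeps Part 4 consistent with whatever Part 3 last wrote, at the cost of depending on the naming pattern.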
