Process Test Data for Amex Default Prediction: Efficient Chunk Processing and Feature Engineering
This code block processes the test data for the Amex Default Prediction challenge. It reads the test CSV file in chunks, applies feature engineering to each chunk, saves each processed chunk to disk, and then saves the customer index covering all test files. The 'PROCESS_DATA' variable is a boolean flag: the block runs only when 'PROCESS_DATA' is True and is skipped otherwise.
The 'NUM_FILES' variable specifies how many chunks the test CSV file is split into. The 'rows' variable is an array holding the number of rows to read for each chunk, one entry per chunk.
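One way the per-chunk row counts could be derived is sketched below. It is an illustration rather than the original code: the helper name 'rows_per_chunk' is hypothetical, and it assumes that each customer's rows are contiguous in the file and that chunk boundaries should fall between customers so that no customer's history is split across two files.

```python
from itertools import groupby


def rows_per_chunk(customer_ids, num_files):
    """Return one row count per chunk, keeping each customer's rows together.

    Hypothetical helper: assumes customer_ids lists one ID per CSV row and
    that a customer's rows are contiguous.
    """
    # Number of rows belonging to each customer, in file order.
    counts = [sum(1 for _ in group) for _, group in groupby(customer_ids)]
    # Customers assigned to each chunk; the last chunk absorbs any remainder.
    per_chunk = len(counts) // num_files
    rows = []
    for k in range(num_files):
        start = k * per_chunk
        end = (k + 1) * per_chunk if k < num_files - 1 else len(counts)
        rows.append(sum(counts[start:end]))
    return rows


# Toy example: 5 customers with 2, 1, 3, 1 and 2 rows, split into 2 chunks.
ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]
rows = rows_per_chunk(ids, 2)
```

The counts always sum to the total number of rows, so reading 'rows[k]' rows per iteration covers the file exactly once.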
The 'test_customer_hashes' array is initially an empty array with data type 'int64'. This array stores the customer hashes of all test files.
The 'for' loop runs once per chunk, 'NUM_FILES' times in total. In each iteration, it reads one chunk of the test CSV file using 'cudf.read_csv()'. The 'skip' variable holds the number of rows already read in previous iterations, so each read starts where the previous one ended.
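The skip-and-read bookkeeping can be illustrated without a GPU. The sketch below uses only the Python standard library as a stand-in for 'cudf.read_csv()'; the helper 'read_chunk', the column name 'P_2', and the toy data are assumptions made for the example. With cudf, the same effect is presumably achieved through the reader's row-skipping and row-limit arguments.

```python
import csv
import io
from itertools import islice


def read_chunk(csv_text, skip, nrows):
    """Read `nrows` data rows after skipping `skip` data rows (header kept)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the header line is not counted as a data row
    return header, list(islice(reader, skip, skip + nrows))


# Toy CSV standing in for the test file (illustrative values only).
csv_text = "customer_ID,P_2\nc1,0.1\nc1,0.2\nc2,0.3\nc3,0.4\nc3,0.5\n"
rows = [2, 3]  # rows to read in each chunk

skip = 0
chunks = []
for k in range(len(rows)):
    header, chunk = read_chunk(csv_text, skip, rows[k])
    chunks.append(chunk)
    skip += rows[k]  # advance past the rows consumed in this iteration
```

After the loop, 'skip' equals the total number of data rows, which is a quick check that no rows were dropped or read twice.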
The 'feature_engineer()' function is called to perform feature engineering on the test data. The 'targets' parameter is set to 'None' because the test data lacks target labels.
The 'cust' variable holds the customer IDs from the current chunk. This information is extracted from the 'test' dataframe, sorted by index, and appended to the 'test_customer_hashes' array.
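The accumulation pattern described here can be sketched with NumPy as a CPU stand-in for the GPU arrays. The hash values below are made up for illustration, and how the real code turns customer IDs into int64 hashes is not shown in the text.

```python
import numpy as np

# Start from an empty int64 array, as described above.
test_customer_hashes = np.array([], dtype="int64")

# Made-up per-chunk customer hashes (illustrative values only).
chunk_customers = [
    np.array([30, 10, 20], dtype="int64"),
    np.array([50, 40], dtype="int64"),
]

for cust in chunk_customers:
    # Append this chunk's customers after those from earlier chunks.
    test_customer_hashes = np.concatenate([test_customer_hashes, cust])
```

Because chunks are appended in the order they are processed, position i in the final array corresponds to the i-th customer across all saved files.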
The 'data' variable stores the processed test data in a 3D array shaped as '(num_samples, num_features, seq_len)'. This data is saved to disk using 'cupy.save()'.
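A minimal sketch of the reshape-and-save step follows, again using NumPy in place of cupy ('numpy.save()' plays the role of 'cupy.save()' here). The sizes, the file name 'test_data_1.npy', and the assumption that the flat table stores one row per time step with every customer having exactly 'seq_len' rows are all illustrative and not taken from the original code.

```python
import os
import tempfile

import numpy as np

num_samples, num_features, seq_len = 4, 3, 5  # illustrative sizes

# Flat table: one row per (customer, time step), one column per feature.
flat = np.arange(num_samples * seq_len * num_features, dtype=np.float32)
flat = flat.reshape(num_samples * seq_len, num_features)

# Group rows by customer, then move features ahead of time steps to get
# the (num_samples, num_features, seq_len) layout described above.
data = flat.reshape(num_samples, seq_len, num_features).transpose(0, 2, 1)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test_data_1.npy")  # hypothetical file name
    np.save(path, data)
    loaded = np.load(path)
```

Reloading the file and checking its shape and dtype is a cheap way to confirm that what was written matches what the model expects to read.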
Finally, the customer index of all test files is saved to disk using 'cupy.save()'. The memory is then cleaned up by deleting the 'test' and 'data' variables and calling 'gc.collect()'.