S3 Partitioning

S3 bucket partitioning

S3 buckets will contain partitions for the data files to store. The partitions are created at the S3 input bucket & the S3 Output bucket.

S3 -> Bucket name -> Data source -> emitted zip files with timestamp

The emitted zip file contains the CSV data files with timestamp and a manifest file with a timestamp.
The folders and the files will be removed from the S3 emission bucket once NIFI copies the data.
Unprocessed data files will remain in the S3 emission bucket for one week and then they will be deleted automatically at the end of the week.

All the files in the S3 input bucket will be in the CSV format.
S3 input bucket follows a hierarchical partitioning based on the Data source, Year, Month, date and timestamp

S3 -> Bucket name -> Data source -> Year -> Year - Month -> date_Source name

Example for the S3 input bucket:

S3/cqube-gj-input/student_attendance/2020/2020-05/2020-05-29_student_attendance

All the files in the S3 Output bucket will be in the JSON format.
S3 output bucket follows the hierarchical partitioning based on the data source, Year, Month, date and timestamp, similar to the partitioning that the S3 input bucket follows.
Metadata files will have information of the latest updated output files which helps cQube to consider the latest output file during the visualization stage.

Last updated 3 years ago

Was this helpful?