The MSCK REPAIR TABLE command in Hive is used to update the metadata in the Hive metastore to reflect the current state of the partitions in the file system. This is particularly necessary for external tables where partitions might be added directly to the file system (such as HDFS or Amazon S3) without using Hive commands.
What MSCK REPAIR TABLE Does
- Scans the File System: It scans the file system (e.g., HDFS or S3) for Hive-compatible partitions that were added after the table was created.
- Updates Metadata: It compares the partitions in the table metadata with those in the file system. If it finds new partitions in the file system that are not in the metadata, it adds them to the Hive metastore.
- Partition Detection: It detects partitions by reading the directory structure and deriving each partition's column values from folder names that follow the key=value convention (for example, year=2024/month=01).
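The scan, compare, and add steps above can be illustrated with a small Python sketch. This is a simplified model, not Hive's actual implementation: the metastore is simulated as a plain Python set, the file system is a local directory, and the names parse_partition, discover_partitions, repair, and demo are hypothetical helpers invented for this example.

```python
import os
import tempfile


def parse_partition(rel_path):
    """Turn 'year=2024/month=01' into (('year', '2024'), ('month', '01')), or None."""
    spec = []
    for part in rel_path.split(os.sep):
        key, sep, value = part.partition("=")
        if not sep or not key or not value:
            return None  # not a key=value directory name
        spec.append((key, value))
    return tuple(spec)


def discover_partitions(table_root, partition_keys):
    """Walk table_root and return the set of partition specs found on disk."""
    found = set()
    for dirpath, _dirnames, _filenames in os.walk(table_root):
        rel = os.path.relpath(dirpath, table_root)
        if rel == ".":
            continue
        spec = parse_partition(rel)
        # Keep only directories whose keys match the partition columns, in order.
        if spec and tuple(k for k, _ in spec) == tuple(partition_keys):
            found.add(spec)
    return found


def repair(table_root, partition_keys, metastore_partitions):
    """Add partitions present on disk but missing from the simulated metastore."""
    missing = discover_partitions(table_root, partition_keys) - metastore_partitions
    metastore_partitions |= missing  # updates the caller's set in place
    return missing


def demo():
    """Build a throwaway table layout and repair a metastore that knows one partition."""
    with tempfile.TemporaryDirectory() as root:
        for sub in ("year=2024/month=01", "year=2024/month=02", "_tmp/scratch"):
            os.makedirs(os.path.join(root, *sub.split("/")))
        metastore = {(("year", "2024"), ("month", "01"))}
        added = repair(root, ["year", "month"], metastore)
        return added, metastore
```

In this sketch, demo() starts with a metastore that knows only one partition and a directory tree that holds two, plus an unrelated _tmp folder that is ignored because its name is not in key=value form. In Hive itself, the whole procedure is triggered by a single statement of the form MSCK REPAIR TABLE followed by the table name.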
Why MSCK REPAIR TABLE is Needed
- Partition Awareness: Hive stores a list of partitions for each table in its metastore. When new partitions are added directly to the file system, Hive is not aware of these partitions unless the metadata is updated. Running MSCK REPAIR TABLE ensures that the Hive metastore is synchronized with the actual data layout in the file system.
- Querying New Data: Without updating the metadata, queries on the table will not include the data in the new partitions. By running MSCK REPAIR TABLE, you make the new data…