Thank you for reaching out! We understand how frustrating it can be when your JSON and Parquet metadata isn't showing up in Purview after scanning from AWS S3. Let us walk you through the most common causes and how to fix them.
First, the good news — Microsoft Purview's Amazon S3 Multicloud Scanning Connector does fully support both JSON and Parquet file types for scanning and schema extraction. So the feature is there; we just need to make sure a few things are configured correctly on your end.
- Complex or nested data types: Purview's scanner cannot extract schema for Parquet files that contain complex data types like MAP, LIST, or STRUCT. The asset will still be discovered, but the schema/metadata tab will appear blank. Please verify whether your files contain any nested structures.
- S3 storage class: Files stored in Glacier storage classes are not supported for schema extraction, classification, or sensitivity labels. Please ensure your files are in S3 Standard storage.
- Integration Runtime (IR) setup: If you're using a Self-Hosted Integration Runtime (SHIR), you must install the 64-bit JRE 11 or OpenJDK on the SHIR machine. Without this, Parquet schema extraction will fail silently. Also, make sure your SHIR is updated to the latest version. If you're using the AWS auto-resolve IR, no additional setup is needed.
- Network access: The S3 connector does not support Purview private endpoints. Your environment must have public internet access to communicate with the Purview service. Please check that no firewall or network rules are blocking outbound connectivity.
- Parquet compression format: For compressed Parquet files, only Snappy compression is supported for schema extraction. If your files use GZIP, LZ4, or ZSTD, the schema won't be extracted.
- Scan status and logs: Navigate to Data Map → Data Sources → your S3 source → Scans tab and check whether the scan completed with "Succeeded" status. Look for any warnings like "Schema extraction not supported for complex types". For SHIR issues, check the Windows Event Viewer under the Integration Runtime section.
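If it helps to work through the checkpoints above systematically, here is a minimal Python sketch that encodes them as a checklist. It is only an illustration: `FileFacts` and `likely_causes` are hypothetical names, not part of any Purview or AWS API, and you would fill in the facts by hand after inspecting your own files and runtime.

```python
from dataclasses import dataclass, field

# Hypothetical helper: the rules simply mirror the checklist in this answer.
NESTED_TYPES = {"MAP", "LIST", "STRUCT"}
OK_COMPRESSION = {"SNAPPY", "UNCOMPRESSED", "NONE"}


@dataclass
class FileFacts:
    """Facts you gather manually about one file and your scan setup."""
    storage_class: str = "STANDARD"
    file_format: str = "parquet"
    compression: str = "SNAPPY"
    column_types: list = field(default_factory=list)
    uses_shir: bool = False
    shir_has_java: bool = True
    uses_private_endpoint: bool = False


def likely_causes(facts):
    """Return the checklist items that could explain a blank schema tab."""
    causes = []
    if facts.storage_class.upper() != "STANDARD":
        causes.append(f"storage class {facts.storage_class} is not S3 Standard")
    if facts.file_format.lower() == "parquet":
        nested = sorted(NESTED_TYPES & {t.upper() for t in facts.column_types})
        if nested:
            causes.append("complex types present: " + ", ".join(nested))
        if facts.compression.upper() not in OK_COMPRESSION:
            causes.append(f"compression {facts.compression} is not Snappy")
        if facts.uses_shir and not facts.shir_has_java:
            causes.append("SHIR machine has no 64-bit JRE 11 or OpenJDK")
    if facts.uses_private_endpoint:
        causes.append("S3 connector does not support Purview private endpoints")
    return causes


if __name__ == "__main__":
    example = FileFacts(compression="GZIP", column_types=["INT64", "STRUCT"])
    for cause in likely_causes(example):
        print("-", cause)
```

An empty result does not prove the configuration is correct; it only means none of the checkpoints listed here applies, and the scan logs are the next place to look.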
To help us narrow this down further, could you share:
- Is your scan completing successfully, or are there any warnings/errors?
- Which IR are you using — auto-resolve or Self-Hosted?
- What S3 storage class are your files in?
- Do your files contain nested/complex data types?
- Are the assets appearing in the Data Map at all (even with an empty schema)?
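For the storage class question, one way to get a quick overview is to group your object keys by storage class. The sketch below assumes you already have object listings shaped like the `Contents` entries of an S3 ListObjectsV2 response (dictionaries with `Key` and `StorageClass`); the function name `keys_by_storage_class` and the sample keys are hypothetical.

```python
from collections import defaultdict


def keys_by_storage_class(objects):
    """Group object keys by their storage class.

    `objects` is an iterable of dicts with "Key" and "StorageClass",
    the shape used by the `Contents` entries of S3 ListObjectsV2.
    Entries without a storage class are grouped under "UNKNOWN".
    """
    grouped = defaultdict(list)
    for obj in objects:
        grouped[obj.get("StorageClass", "UNKNOWN")].append(obj["Key"])
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical sample data. With boto3 installed and credentials set up,
    # the entries could instead come from something like:
    #   paginator = boto3.client("s3").get_paginator("list_objects_v2")
    #   for page in paginator.paginate(Bucket="your-bucket"):
    #       objects.extend(page.get("Contents", []))
    sample = [
        {"Key": "sales/2024.parquet", "StorageClass": "STANDARD"},
        {"Key": "archive/2019.json", "StorageClass": "GLACIER"},
    ]
    for storage_class, keys in keys_by_storage_class(sample).items():
        print(storage_class, keys)
```

Any keys that land outside the STANDARD group are the ones worth checking against the storage class checkpoint above.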
In summary, the inability to view JSON and Parquet metadata from AWS S3 in Purview is typically caused by one of the above factors: complex data types in your files, an unsupported storage class, a missing Java runtime on your SHIR, or a compression format other than Snappy. We'd encourage you to go through each checkpoint above, and once you share the additional details, we'll be happy to assist you further in resolving this completely.
References:
- Amazon S3 Multicloud Scanning Connector for Microsoft Purview
- Supported data sources and file types in Microsoft Purview
- Create and manage a self-hosted integration runtime
- Troubleshoot scans and connections in Microsoft Purview
Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.