GCP Public Datasets
Google Cloud Platform hosts a number of public datasets:
Public Datasets on Google Cloud Platform makes it easy for users to access and analyze data in the cloud. These datasets are freely hosted and accessible using a variety of data warehouse and analytics software, from open source Apache Spark to cutting edge Google technologies like Google BigQuery and Google Cloud Dataflow. From structured genomic or encyclopedic data to unstructured climate data, Public Datasets provide a playground for those new to big data and data analysis and a powerful repository for skilled researchers. You can also integrate with your application to add valuable insights for your users. Whatever your use case, these datasets are freely available on GCP.
The thing I find surprising is that the Common Crawl web archives aren’t on GCP, especially given Google’s web heritage. Apropos the late, lamented Fairness Doctrine, Common Crawl is hosted on AWS. There was a good, recent GCP Podcast episode with the Public Datasets team that had an e-mail contact. Maybe I’ll fire off a question.