As you learn data science, and even when you’re a long-time data scientist, you will eventually ask: “How can I acquire the data I need?” Sometimes, you need to obtain the data within your enterprise. Other times, you need data for analyzing broader trends, e.g., how fires in the West are impacting the supply chain, or whether there will be enough trained people to build a staff in a remote location.
So, where is all the data? A good first step is querying Google Dataset Search. Early in its development, Dataset Search returned results that buried the real gems of open data in a haystack of for-fee datasets. No more! Choose the free dataset filter to benefit from the massive number of open datasets on the Internet.
Perhaps you are interested in acceptance rates at business schools? You can enter the following query:
The results list several datasets at a popular dataset sharing site, Kaggle.
But these results are for all universities, not just business schools. Another query for business schools will produce datasets with lists of business schools. The information you want is buried in these two datasets, and you are now moving on in the data science pipeline to managing and wrangling your data.
You need to combine the two datasets to meet your need. That might not be straightforward: the names of the universities may be spelled differently in each dataset, so a simple join on the name would miss matching rows. I’ll address the next steps in preparing this data in subsequent posts. Note that this process is very similar to the Extract, Transform, and Load (ETL) process for building a data warehouse. In ETL, data is extracted from a company’s operational databases, then restructured and loaded into a data warehouse for analytical processing.
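To make the spelling problem concrete, here is a minimal sketch of one possible way to line up university names before combining two datasets. It uses only Python's standard library (`re` and `difflib`). The school names, the acceptance rates, the helper names `normalize_name` and `match_names`, and the 0.8 similarity cutoff are all made up for illustration; they do not come from the Kaggle datasets, and real data will likely need more careful cleaning.

```python
import difflib
import re


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())


def match_names(left, right, cutoff=0.8):
    """Map each name in `left` to its closest name in `right`, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; tune it
    against your own data and review the matches by hand.
    """
    normalized_right = {normalize_name(r): r for r in right}
    matches = {}
    for name in left:
        candidates = difflib.get_close_matches(
            normalize_name(name), list(normalized_right), n=1, cutoff=cutoff
        )
        matches[name] = normalized_right[candidates[0]] if candidates else None
    return matches


# Hypothetical rows standing in for the two downloaded datasets.
acceptance_rates = {
    "Univ. of Example": 0.25,
    "Sample State University": 0.60,
    "Nowhere Institute": 0.40,
}
business_schools = [
    "University of Example",
    "Sample State Univeristy",  # deliberate typo, as found in messy data
    "Other College",
]

matches = match_names(acceptance_rates, business_schools)

# Keep only the universities that also appear in the business school list.
combined = {
    matches[name]: rate
    for name, rate in acceptance_rates.items()
    if matches[name] is not None
}
print(combined)
```

Fuzzy matching like this can produce false matches, so it is worth inspecting the pairs it proposes before trusting the combined result.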
Your search for datasets has begun! You may discover data sharing sites you wish to join to take advantage of their instructional and data compilation resources. In addition, Kaggle runs data science and machine learning competitions to help you sharpen your skills. Many of the sites provide space for you to share the results of your analysis and other datasets you’ve collected.
Learn more about Google Dataset Search in the article “Discovering millions of datasets on the web.” Some other very powerful data source sites are:
- Economic Time Series Data from the University of Maryland
- FiveThirtyEight Data (and links to articles analyzing the data)
- The Educational Opportunity Project at Stanford University
- NJ Child Welfare Data Hub (in collaboration with Rutgers)
- NJ Open Data Center
- Pew Research Center Datasets
- Registry of Open Data on AWS
- Rutgers University Libraries Guide to Public Affairs, Public Service, and Public Administration Data
- US Bureau of Labor Statistics
- US CDC National Center for Health Statistics
- US Civil Rights Data Collection on Schools
- US Department of Education Data and Statistics
- US Government Open Data Catalog
Other compendiums of free data sources are:
—
Thanks to Jack Moreh and FreeRangeStock.com for permission to use the first illustration in this article. The illustration was retrieved from https://freerangestock.com/photos/38942/traders–winners-and-losers.html. Check out Jack’s work there! It’s great.