
Where is all the data?

As you learn data science, and even when you’re a long-time data scientist, you will eventually ask: “How can I acquire the data I need?”  Sometimes, you need to obtain the data within your enterprise.  Other times, you need data for analyzing broader trends, e.g., how fires in the West are impacting the supply chain, or whether there will be enough trained people to build a staff in a remote location.

So, where is all the data?  A good first step is querying Google Dataset Search.  Early in its development, Dataset Search returned results that buried the real gems of open data under a pile of for-fee datasets.  No more!  Choose the free dataset filter to benefit from the massive number of open datasets on the Internet.

Perhaps you are interested in acceptance rates at business schools?  You can enter the following query:

Google Dataset Search Query: university business school acceptance rates united states
Google Dataset Search Query

The results list several datasets at a popular dataset sharing site, Kaggle.

Three results from a Google Dataset Search Query, all from kaggle
Partial Results from Google Dataset Search Query

But these results cover all universities, not just business schools.  A second query for business schools will produce datasets that list business schools.  The information you want is buried in these two kinds of datasets, and you are now moving on in the data science pipeline to managing and wrangling your data.

Diagram of steps in the data science pipeline: data acquisition, data management, cleaning/wrangling, data exploration, hypothesis formation, data analysis, data visualization producing results or a feedback loop to an earlier step
The Data Science Pipeline

You need to combine the datasets to meet your need.  That might not be straightforward.  For example, the names of the universities may be spelled differently in the two datasets.  I’ll address the next steps in preparing this data in subsequent posts.  Note that this process is very similar to the Extract, Transform and Load (ETL) process for building a data warehouse.  In ETL, data is extracted from a company’s operational databases, and then restructured and loaded into a data warehouse for analytical processing.
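To make the name-matching problem concrete, here is a minimal sketch in Python using only the standard library.  It is an illustration, not the method used in later posts.  The helper names (normalize_name, match_names), the 0.85 similarity cutoff, and the sample school names are all assumptions invented for this example.

```python
import difflib
import re


def normalize_name(name: str) -> str:
    """Lowercase a name, spell out '&', drop punctuation, collapse whitespace."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9\s]", " ", name)
    return " ".join(name.split())


def match_names(left, right, cutoff=0.85):
    """Map each name in `left` to its closest name in `right`, or None.

    Names are compared after normalization. `cutoff` is the minimum
    difflib similarity ratio (0 to 1) accepted as a match; 0.85 is an
    arbitrary starting point, not a recommended value.
    """
    # If two names in `right` normalize to the same key, the later one wins.
    normalized_right = {normalize_name(r): r for r in right}
    candidates = list(normalized_right)
    matches = {}
    for name in left:
        close = difflib.get_close_matches(
            normalize_name(name), candidates, n=1, cutoff=cutoff
        )
        matches[name] = normalized_right[close[0]] if close else None
    return matches


# Hypothetical names, not taken from any real dataset.
acceptance_names = [
    "Example State University.",
    "Sample A&M University",
    "Placeholder College",
]
business_school_names = [
    "example state university",
    "Sample A and M University",
    "Nowhere Institute of Technology",
]

for left_name, right_name in match_names(acceptance_names, business_school_names).items():
    print(f"{left_name!r} -> {right_name!r}")
```

Names left unmatched (mapped to None) and matches that only barely clear the cutoff are worth reviewing by hand, since fuzzy matching can pair two different schools with similar names.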

Your search for datasets has begun!  You may discover data sharing sites you wish to join to take advantage of their instructional and data compilation resources.  In addition, Kaggle runs data science and machine learning competitions to help you sharpen your skills.  Many of the sites provide space for you to share the results of your analysis and other datasets you’ve collected.

Learn more about Google Dataset Search in the article Discovering millions of datasets on the web.  Some other very powerful data source sites are:

Other compendiums of free data sources are:

Thanks to Jack Moreh and FreeRangeStock.com for permission to use the first illustration in this article. The illustration was retrieved from https://freerangestock.com/photos/38942/traders–winners-and-losers.html.  Check out Jack’s work there!  It’s great.