Since my last Power BI dashboard on weather analysis in the U.S., I’ve had a few requests and questions about the data I used to build the dashboard and how I transformed it into a CSV file. In this post I’ll provide a step-by-step walkthrough of the process I used, along with a copy of the SSIS solution and the output files.
Step 1 – Get the data
The data for this dashboard was sourced from NOAA’s website. They have several different datasets that are free to download and are updated daily. My original goal was to use global weather data, but this proved to be quite a task, so I chose to scale it down and focus on the continental U.S. only. To get the data, I went to NOAA’s FTP site (ftp.ncdc.noaa.gov/pub/data/ghcn/daily/) and downloaded four files: ghcnd_hcn.tar.gz, ghcnd-countries.txt, ghcnd-states.txt, and ghcnd-stations.txt. The tar.gz file is the primary zipped file containing the measures captured by the different weather stations (the hcn file covers the U.S. Historical Climatology Network stations, which fit the narrowed scope).
I also read through the readme.txt file a few times to get familiar with the fields, how the metrics are represented in the files, and what additional files I would need.
Step 2 – Unzip the files and load them into a SQL database
Once I had the files downloaded, I used 7-Zip to unzip the ghcnd_hcn.tar.gz file. Unzipping it creates a subfolder with a number of .dly files, one per weather station. I then used SSIS and created several packages to load the data into a SQL Server database. For the .dly files I used a Foreach Loop Container to loop through all the files in the subfolder and extract the data into a SQL table.
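If you want to recreate something similar, one simple approach is to have the flat file source treat each fixed-width .dly line as a single wide column and land it in a staging table for parsing later. Here is a minimal sketch of what that could look like; the table and column names are hypothetical rather than the ones in the attached solution, and the record width comes from the readme.txt layout (station ID, year, month, element, then 31 day slots of value plus three flags).

-- Hypothetical staging table: one row per raw .dly line, parsed in Step 3.
CREATE TABLE dbo.Stage_DlyRaw
(
    RawLine    VARCHAR(300)  NOT NULL,  -- full fixed-width record (269 characters per readme.txt)
    SourceFile NVARCHAR(260) NULL       -- file name captured from the Foreach Loop variable
);

Keeping the raw line intact makes it easy to re-run or tweak the parsing logic without re-importing all of the files.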
I also created some simple data flow tasks to bring the metadata for the stations, states, and countries into SQL tables. In the end I didn’t use the countries data, since I narrowed the scope to just the United States.
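For reference, the station and state metadata can land in tables along these lines. These are illustrative names and types rather than the exact scripts in the attached zip, with the columns taken loosely from the layouts documented in readme.txt.

-- Hypothetical metadata tables for ghcnd-stations.txt and ghcnd-states.txt.
CREATE TABLE dbo.Station
(
    StationID   CHAR(11)      NOT NULL PRIMARY KEY,  -- GHCN station identifier
    Latitude    DECIMAL(9, 4) NULL,
    Longitude   DECIMAL(9, 4) NULL,
    Elevation   DECIMAL(7, 1) NULL,                  -- meters
    StateCode   CHAR(2)       NULL,
    StationName VARCHAR(50)   NULL
);

CREATE TABLE dbo.State
(
    StateCode CHAR(2)     NOT NULL PRIMARY KEY,
    StateName VARCHAR(50) NULL
);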
Step 3 – Parse the data from the .dly files, filter to the last 6 years, and unpivot it
This part was a bit time-consuming, as I had to break the values out into individual fields using SUBSTRING() while constantly referencing the readme.txt file for the position and length of each column. I also spent some time here on data validation and quality checks to make sure I didn’t skew or incorrectly alter the data. I added additional columns providing the temperature values in Fahrenheit, and broke the measures out into individual columns based on the value of the Element field (see the .sql file).
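To illustrate the kind of parsing involved, here is a hedged sketch against the hypothetical staging table above, not the exact query in the attached .sql file. Per readme.txt, the station ID sits in characters 1–11, the year in 12–15, the month in 16–17, the element code in 18–21, and day 1’s value in 22–26, with each subsequent day offset by 8 characters; TMAX and TMIN are stored in tenths of degrees Celsius, and -9999 marks a missing value.

-- Parse the header fields and day 1's value, convert temperatures to Fahrenheit,
-- and keep only roughly the last six years of data.
SELECT
    SUBSTRING(RawLine, 1, 11)               AS StationID,
    CAST(SUBSTRING(RawLine, 12, 4) AS INT)  AS [Year],
    CAST(SUBSTRING(RawLine, 16, 2) AS INT)  AS [Month],
    SUBSTRING(RawLine, 18, 4)               AS Element,        -- TMAX, TMIN, PRCP, SNOW, ...
    CAST(SUBSTRING(RawLine, 22, 5) AS INT)  AS Day1Value,      -- -9999 = missing
    CASE
        WHEN SUBSTRING(RawLine, 18, 4) IN ('TMAX', 'TMIN')
             AND CAST(SUBSTRING(RawLine, 22, 5) AS INT) <> -9999
        THEN CAST(SUBSTRING(RawLine, 22, 5) AS INT) / 10.0 * 9.0 / 5.0 + 32.0  -- tenths of °C to °F
    END                                     AS Day1Fahrenheit
FROM dbo.Stage_DlyRaw
WHERE CAST(SUBSTRING(RawLine, 12, 4) AS INT) >= YEAR(GETDATE()) - 5;

Repeating that pattern for all 31 day slots (or unpivoting them into rows) and then spreading the measures across columns by Element is what produces the one-row-per-day, one-column-per-measure shape used in the dashboard.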
Step 4 – Extract to CSV
This step was mainly so the data could be shared easily; I could have skipped it and used the SQL database directly as the data source when building the Power BI dashboard. I used SSIS for this as well, pulling the data from the SQL tables into two CSV files.
This is a pretty high-level summary of the ETL process, but if you feel like looking at the details of each step, I’ve included a zip file containing the solution files and the SQL queries I used when parsing the data. The CREATE TABLE scripts are in there too if you’re interested in re-creating the tables in your own environment, along with a copy of the CSV file for reference.
As a bonus, I created a new dashboard with some YOY (year-over-year) calculations and more insights around precipitation. Check it out and let me know your thoughts in the comments section below!
Note: if the dashboard doesn’t render properly, click on the Enter Full Screen Mode icon on the bottom right side of the dashboard.