Predicting Hurricanes

2/28/2021 Update: I have recently returned to this project to make use of the NOAA Big Data Program's offerings on both AWS and GCP, which weren't available when I started the series. Please check out the "2021 update" folder and the included code / notebooks.

Here's the code associated with my "Predicting Hurricanes" YouTube series. https://www.youtube.com/channel/UCPmLClJE0GmnZ4e7sW_Fu7A

This series attempts to teach you how to use the core engineering design process to tackle a problem using Machine Learning and raw, uncleaned data. This process is used effectively to build prototypes across every discipline of engineering, from mechanical and chemical to electrical and computer. The AGILE / SCRUM framework is great for improving a system that's already in place, but for coming up with an initial design (or even a minimum viable product), nothing beats the time-proven core engineering design process combined with good old fashioned project management techniques, like critical path analysis and resource leveling, which we'll also briefly touch on here. Following this process is how we figure out what precise problem needs to be solved, define our design requirements, research existing solutions to see what's been done and how we might improve on it or apply it in a different way, and then design a solution that will actually solve the problem.

Just a few notes up front:

  1. Instead of acting like a standard repository, where the latest version is reflected by what you see here, I'm going to keep the code from each episode in its associated folder. So, "Episode_5" may contain a newer version of the same file found in "Episode_4", but that's just so you can follow along with the series. I'll include a readme in each episode folder to help you use each file, but for the best experience, you should watch the associated episode.
  2. Coding doesn't really start until Episode 4, which is why that's the first folder. I wanted to emphasize the work that goes into setting up a big problem like this, and so episodes 1-3 discuss the early steps of the core engineering design process.
  3. Also, the datasets I use here are massive, so I will not be posting any of them here. If you want to follow along, you'll have to download them yourself using the FTP sites and the Python scripts I've written. Just a heads up though: I've basically filled up an entire 4TB hard drive, and I've only scratched the surface of the European data.
  4. I will be using Python with Keras/Tensorflow. I'm not going to be teaching Python at all, and I'm not going to be focusing too much on how many of these algorithms work. There are a number of awesome free courses out there that can teach you all that. I'd recommend Andrew Ng's Machine Learning course on Coursera and David Silver's course on YouTube as starting points. Instead, we're going to focus on the stuff that isn't really covered in any of the other courses I've seen out there. Namely, setting up the problem, working with real-world data, and training with utterly massive datasets.

If you're still onboard, you'll need a few things.

MySQL

Available here: https://www.mysql.com/downloads/

Make sure that you also install MySQL Workbench. Mine is installed in a Windows environment, but it should be pretty much the same if you install on Linux. I usually install the mysqlclient and mysql-connector-c Python libraries via Conda as well. There's a lot of documentation available to help you get started, but it's very straightforward if you've ever used any type of SQL before.
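
To give you an idea, here's a minimal sketch of pulling query results into a DataFrame with sqlalchemy and the mysqlclient driver. The database and table names ("hurricanes", "storm_tracks") are hypothetical placeholders; swap in your own credentials and schema.

    import pandas as pd
    from sqlalchemy import create_engine

    # "mysql+mysqldb" uses the mysqlclient driver installed via Conda.
    # User, password, database, and table below are placeholders.
    engine = create_engine("mysql+mysqldb://user:password@localhost:3306/hurricanes")

    # Pull a small sample into a DataFrame to confirm the connection works.
    df = pd.read_sql("SELECT * FROM storm_tracks LIMIT 10", engine)
    print(df.head())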

Anaconda (aka "Conda")

This is what you'll generally want to use to install Python and manage whatever environments and associated installed packages you want to set up. It's just so much easier to do this with Conda than any other distribution I've found. You won't be able to find every package in Conda, but the core ones, like numpy, Pandas, sqlalchemy, etc. are all there, and Conda makes sure that all the versions are compatible. There are almost always instances where you'll have to install additional packages. Just do those at the end, after everything that's in Conda. We'll be using PyGrib, for instance, which you'll need to install with pip. Installing these last doesn't guarantee you won't "break" anything in your environment, but it minimizes the risk. With that said, do make sure you set up a second environment for all of this. Never use just the base environment. It's very difficult to reset it if something goes wrong, whereas it's very easy to delete a broken environment and start over.
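
As a rough sketch, setting up that second environment looks something like this (the environment name and Python version here are just examples, and PyGrib comes last, via pip):

    conda create -n hurricanes python=3.7 numpy pandas sqlalchemy
    conda activate hurricanes
    pip install pygrib pika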

One other thing I like about Conda is it comes with Spyder, which is a pretty decent IDE that's great for working with data. A problem with using just a text editor with a terminal / command prompt is that it's annoying to view large datasets or even samples. With Spyder, you can view all your active variables in the upper right corner, and then simply double click one, like a numpy array or Pandas DataFrame, and you'll be able to see the whole dataset, with numerical columns color coded by value to help you spot outliers.

Also, you can easily create several iPython consoles to run numerous parallel instances of simple modules that you may have running. I use this when I set up clusters. You can alternatively use threading or multiprocessing to accomplish the same thing, but the problem there is printing output. There are ways to do that, but it's far easier to spot / debug issues that may only show up several hours into a run when each thread is running in its own iPython console, and if you code things up correctly, you won't necessarily have to start the whole experiment over again to fix the issue. Just fix it in each of the modules, and get them running again. As long as the central "learner" or whatever you want to call it hasn't choked on a bad input, you can keep running. I find this to be far safer when setting up a training run that could go for a week, and when you're working on something real-world and not just training a neural network to play a simple low resolution Atari game or learn on a cultivated, pre-processed sample dataset, you'll find this is really helpful.

RabbitMQ and Pika

RabbitMQ is a simple queueing / messaging service, and Pika is the python package we use to access the queues. It's accomplishing much the same thing that you would get from a Spark implementation on a bigger cluster, but for this problem and with just two desktop computers to work with, I decided to go with this approach, at least for now. Running a queuing service like this lets numerous threads working on multiple computers on a LAN all send data in the form of numpy arrays or even Pandas DataFrames to a central queue. Pulling data out of that queue is very fast, especially compared with running a query on a SQL database. What this lets us do is to store our data in a central database and have multiple threads running queries on that database in parallel, pulling samples and putting them in the queue so that our central "learner" (i.e. our thread that's actually training a neural network or whatever machine learning algorithm we're working with) can pull out a batch without having to wait for the SQL database to do any matching and return results. This lets us more fully utilize both our CPU and GPU while training... enabling our CPU to continue preparing and loading data while our GPU works on training.
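
Here's a minimal sketch of that worker/learner pattern with Pika. The host and queue names ("queue_host", "samples") are hypothetical, and the numpy batch is a stand-in for a real query result; each worker pickles a batch and publishes it, and the learner pulls batches off the queue without waiting on SQL.

    import pickle
    import numpy as np
    import pika

    # --- worker side: query the database, then push a sample to the queue ---
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="queue_host"))
    channel = connection.channel()
    channel.queue_declare(queue="samples")

    batch = np.random.rand(32, 10)  # stand-in for a real query result
    channel.basic_publish(exchange="", routing_key="samples",
                          body=pickle.dumps(batch))

    # --- learner side: pull a ready-made batch off the queue ---
    method, properties, body = channel.basic_get(queue="samples", auto_ack=True)
    if body is not None:
        batch = pickle.loads(body)
        print(batch.shape)  # (32, 10)

In practice you'd run many worker copies (one per iPython console, as described above) all publishing to the same queue, with the learner as the lone consumer.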

The only real problem I have with it is that the documentation isn't really written in plain English, and there's almost too much of it, so it's easy to get overwhelmed trying to get it set up. That's because it's a very powerful, highly configurable piece of software that can run on many network configurations, but for those just trying to use it on a home LAN, it's definitely information overload. All you really need to do is ensure that every computer in the cluster has the same Erlang cookie installed. You might also need to make sure that the cookie is saved in two locations on some computers, since the command line interface tool looks in a different folder for some reason. The Clustering Guide tells you where to install it: https://www.rabbitmq.com/clustering.html. If you're only using two computers, just use the RabbitMQ command prompt (in Windows) to join one to the other:

    rabbitmqctl stop_app
    rabbitmqctl join_cluster <other computer name - like rabbit@computer_name>
    rabbitmqctl start_app

In Linux, it's the same thing, just with 'sudo' (in Ubuntu) before each line. You only need to do this on one machine. To make sure it's working, you can use 'rabbitmqctl cluster_status', but I highly recommend instead installing the web interface tool by typing:

rabbitmq-plugins enable rabbitmq_management

Then, you just go to http://localhost:15672 in your web browser. You can probably log in the first time with guest/guest, but check out this thread for more information: https://stackoverflow.com/questions/22850546/cant-access-rabbitmq-web-management-interface-after-fresh-install/22854222#22854222. I generally set up my own account / password with the admin tag.

Considerations Moving Forward

There are many other ways of setting up a cluster to train a neural network, and many more ways to leverage multi-threading on a single machine to get some of the same benefits. If you only have one computer, Keras offers a nice solution called "fit_generator" that will do pretty much the same thing (i.e. preparing the next sample with your spare CPU capacity while training with the GPU). However, be aware that its multiprocessing mode only really works in Linux (per this issue: keras-team/keras#10842), so if you intend to go that route, get yourself a nice Linux distribution. Ubuntu is usually the best place for newbies to start because it's widely used and you can find answers to just about any issue on StackOverflow. I'm personally starting to eye the new Intel Clear Linux distribution though, which has been making the news lately by putting out benchmark results that seem almost too good to be true. You may want to check that out.
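
Here's a minimal sketch of that approach using a keras.utils.Sequence, so extra CPU workers prepare batches while the GPU trains. The model and the random data in __getitem__ are placeholders; in practice __getitem__ would run your database query or file read for batch idx.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.utils import Sequence

    class BatchLoader(Sequence):
        def __init__(self, n_batches=100, batch_size=32):
            self.n_batches, self.batch_size = n_batches, batch_size

        def __len__(self):
            return self.n_batches

        def __getitem__(self, idx):
            # Replace with a real query / file read for batch `idx`.
            x = np.random.rand(self.batch_size, 10)
            y = np.random.rand(self.batch_size, 1)
            return x, y

    model = Sequential([Dense(32, activation="relu", input_shape=(10,)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")

    # use_multiprocessing only behaves well on Linux, per the issue linked above.
    model.fit_generator(BatchLoader(), epochs=5, workers=4, use_multiprocessing=True)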

Also, as of Tensorflow 2.0 (which released after I made most of the original series), Keras is now the official API for defining neural networks and is included in all distributions. Further, there's now integrated support for training both on a cluster and on a single system with multiple GPUs. The only downside is that you need to rework your datasets into the tf.data format so they can be used by the new Estimator class, and tf.data can be a bit tricky. However, tf.data also allows us to do some pretty slick things when used in combination with Tensorflow's new feature columns: you can now easily build preprocessing and feature engineering directly into your model. I'll be making use of this in the 2021 update.
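
To illustrate the idea, here's a minimal sketch of the tf.data + feature-column combination. The column names ("wind_speed", "pressure") and the tiny in-memory dataset are made up for the example; the point is that the normalization lives inside the model itself via DenseFeatures.

    import tensorflow as tf

    features = {"wind_speed": [35.0, 80.0, 120.0], "pressure": [1002.0, 980.0, 940.0]}
    labels = [0, 1, 1]

    # tf.data pipeline: shuffle and batch a dict-of-columns dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).shuffle(3).batch(2)

    # Feature columns bake simple preprocessing into the model graph.
    feature_columns = [
        tf.feature_column.numeric_column("wind_speed", normalizer_fn=lambda x: x / 150.0),
        tf.feature_column.numeric_column("pressure", normalizer_fn=lambda x: x / 1050.0),
    ]

    model = tf.keras.Sequential([
        tf.keras.layers.DenseFeatures(feature_columns),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(dataset, epochs=5)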

And then there's Spark / PySpark. This seems to be the most commonly used platform amongst the bigger companies working with Machine Learning these days. It's pretty compelling, allowing you to store data across a cluster instead of simply using a cluster to increase processing capability the way I do here. With Spark, you can cache a huge amount of your data in memory across the cluster and work with it much faster using dataframes not too dissimilar from what we use in Pandas. It can even be used to accelerate MySQL queries by using more threads to pull the data, or by pre-fetching the most commonly used tables into memory.
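
For a taste, here's a minimal sketch of pulling a MySQL table into Spark and caching it across the cluster. The database, table, column, and credentials are hypothetical, and you'd need the MySQL JDBC driver jar on Spark's classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hurricanes").getOrCreate()

    # Read a table straight out of MySQL over JDBC (placeholder names below).
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/hurricanes")
          .option("dbtable", "storm_tracks")
          .option("user", "user")
          .option("password", "password")
          .load())

    df.cache()  # keep the table in memory across the cluster
    df.filter(df.wind_speed > 64).show()  # hypothetical column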

Finally, GCP and AWS have come a long way with their machine learning offerings. While the cost of training models in the cloud is arguably overpriced compared with using your own local resources, these services are an absolute must when it comes to deploying your final model(s) to production, and they offer a lot of quality-of-life tools that can really speed up the labor side of model development and data preprocessing. AWS SageMaker's Ground Truth service makes labeling data easy and fast. And, quite relevant to this project, Google Earth Engine can be tied directly into a Tensorflow model that you deploy on the Google AI Platform. My 2021 update jumps right into using data hosted on AWS, accessed directly with Python. I plan to build on this in the coming weeks to also integrate data from Google Earth Engine, and eventually deploy the model on GCP for live inference.
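
As a flavor of what that AWS access looks like, here's a minimal sketch of anonymously listing NOAA's public data with boto3. The bucket and prefix here (GOES-16 imagery) are just an example; see the 2021 update folder for the approach this project actually uses.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # NOAA's public buckets allow anonymous (unsigned) access.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(Bucket="noaa-goes16",
                                  Prefix="ABI-L2-CMIPF/2020/244/",
                                  MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])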

Stay tuned, and star the repo for updates.
