Women in Data Science (WiDS)

A Beginner’s Tutorial for the WiDS Datathon 2022 challenge

1/5/2022

 
Climate change is one of the critical challenges facing humanity today. Over the past few years, there have been widespread climate-driven disruptive events such as floods and wildfires. The devastation caused by these events has raised awareness of the urgency of the issue, and people and governments have started working together toward coordinated climate action. At WiDS, we believe that it will be important for future data scientists to gain familiarity with the mathematical and statistical models used to model climate data. For this reason, the focus of the WiDS Datathon this year is a climate-focused challenge: prediction of building energy consumption.
The challenges of predicting energy consumption are multiple and complex. Energy consumption depends on many building-related variables. In addition, climatic variables such as temperature, rainfall, and snowfall drive patterns of energy consumption that are both cross-sectional and temporal. For example, one would generally expect zip codes on the east coast to experience much higher snowfall in the winter than zip codes on the west coast, which is a cross-sectional pattern. At the same time, there are annual climate patterns that are clearly temporal. The challenge we are faced with is choosing a modeling methodology that permits us to model these patterns appropriately.

Step 1: Consider alternate modeling approaches
It is often very helpful to consider alternate approaches to a given modeling problem. Here I review two methodologies that you might consider when deciding how to develop a model. 

a) XGBoost
This approach might come in handy if you decide to focus on capturing the cross-sectional patterns in the data. In previous tutorials I’ve described how you might fit an XGBoost model, including how you could i) build a DMatrix and ii) work through hyper-parameter tuning. For your reference, here’s the blog:
https://www.widsconference.org/blog_archive/a-beginners-guide-to-the-wids-datathon-2021-challenge

Also, here’s a resource for more information: https://xgboost.readthedocs.io/en/stable/

Here’s an example (in the R programming language) of how you might create a DMatrix and try alternate parameters to fit an XGBoost model.
[Image: R code example]
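The two steps named above (building a DMatrix and trying alternate parameters) can also be sketched in Python. This is not the author's R code: the helper names `candidate_params` and `fit_candidates`, the parameter values, and the use of a separate validation set are all assumptions made for illustration.

```python
from itertools import product


def candidate_params(etas=(0.05, 0.1), max_depths=(3, 6)):
    """Return one XGBoost parameter dict per (eta, max_depth) combination.

    The values are illustrative assumptions, not tuned settings.
    """
    return [
        {
            "objective": "reg:squarederror",  # regression on a continuous target
            "eval_metric": "rmse",
            "eta": eta,
            "max_depth": depth,
        }
        for eta, depth in product(etas, max_depths)
    ]


def fit_candidates(X_train, y_train, X_valid, y_valid, num_boost_round=200):
    """Fit one model per candidate and return (best_params, best_rmse)."""
    # Imported here so candidate_params() works without xgboost installed.
    import xgboost as xgb

    # i) Build the DMatrix objects that XGBoost trains on.
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)

    best_params, best_rmse = None, float("inf")
    # ii) Try each parameter combination and keep the lowest validation RMSE.
    for params in candidate_params():
        history = {}
        xgb.train(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dvalid, "valid")],
            evals_result=history,
            verbose_eval=False,
        )
        rmse = history["valid"]["rmse"][-1]
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse
```

Calling `fit_candidates` with your own training and validation arrays would return the parameter combination with the lowest validation RMSE; in practice you would likely search a wider range of values than the four combinations used here.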
b) Deep Learning: Recurrent Neural Net
If you decide to pursue a temporal approach to modeling, you might consider deep learning, a family of models that has been very effective in modeling complex datasets. In this vast field, RNNs (Recurrent Neural Nets) have been particularly useful in modeling temporal patterns. Here I provide you with resources to further explore this area, focusing on resources that are freely available over the internet.
 
Here are a few free resources that discuss deep learning in detail:
  • https://www.deeplearningbook.org/  (by Ian Goodfellow et al.)
  • https://cs230.stanford.edu/lecture/
  • https://keras.io/
  • https://www.tensorflow.org/tutorials
 
As discussed by Goodfellow et al. (https://www.deeplearningbook.org/), RNNs (Recurrent Neural Nets) maintain state information from one step of the temporal sequence to the next. We are trying to compute the conditional distribution of the next sequence element given the past inputs. Decomposing the joint probability over the sequence of values into a series of one-step probabilistic predictions makes it possible to compute the full joint distribution across the whole sequence. Essentially, we are using the RNN to parametrize long-term relationships and dependencies between variables efficiently.
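In symbols, the decomposition described above can be written roughly as follows. The notation is an assumption chosen for illustration: x_t is the sequence element at step t, h_t is the state the network carries forward, and θ denotes the parameters shared across steps.

```latex
% Chain rule: the joint distribution as a product of one-step predictions
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})

% The RNN summarizes the past x_1, ..., x_{t-1} in a state h_{t-1},
% updated with the same function f and parameters \theta at every step
h_t = f(h_{t-1}, x_t; \theta), \qquad
P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid h_{t-1})
```

Because the same f and θ are reused at every step, the number of parameters does not grow with the length of the sequence, which is the sense in which the parametrization is efficient.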
 
There are problems you might run into while trying to build an RNN. As Goodfellow et al. further discuss, the functions involved are highly non-linear, and gradients propagated over many stages might vanish or explode. Architectures such as the LSTM help to overcome the vanishing gradient problem.
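To see why gradients can vanish or explode, consider a deliberately simplified sketch. It assumes a scalar linear recurrence h_t = w * h_{t-1}, which is far simpler than a real RNN; under that assumption the gradient of the final state with respect to the initial state is just w multiplied by itself once per stage.

```python
def gradient_factor(weight, steps):
    """Magnitude of the gradient after backpropagating through `steps` stages.

    Simplifying assumption: a scalar linear recurrence h_t = weight * h_{t-1},
    for which d h_T / d h_0 = weight ** steps.
    """
    factor = 1.0
    for _ in range(steps):
        factor *= weight  # one multiplication per stage
    return abs(factor)


for w in (0.9, 1.0, 1.1):
    print(f"weight={w}: factor after 100 steps = {gradient_factor(w, 100):.3e}")
```

With a weight slightly below 1 the factor shrinks toward zero over 100 stages, and with a weight slightly above 1 it grows into the thousands, even though the two weights differ only a little.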
 
The Deep Learning book also has practical recommendations on fitting your model. Goodfellow et al. recommend determining which components are overfitting or under-fitting. They also recommend using dropout, which they suggest is an excellent and easy-to-use regularizer. Batch normalization can also sometimes reduce generalization error. Understanding and tuning hyper-parameters (such as the learning rate), either manually or through grid search or random search, is highly recommended.
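As one illustration of the tuning advice above, here is a minimal random-search sketch using only the Python standard library. Everything named here is an assumption for illustration: `score_fn` is a hypothetical function that would train a model with a given configuration and return its validation error, and `toy_score` is a stand-in so that the example runs on its own.

```python
import math
import random


def sample_learning_rate(rng, low=1e-4, high=1e-1):
    """Draw a learning rate log-uniformly between low and high."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def random_search(score_fn, n_trials=20, seed=0):
    """Return (best_config, best_score) over n_trials random configurations.

    score_fn is a hypothetical callable: it takes a config dict and returns a
    validation error (lower is better).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {
            "learning_rate": sample_learning_rate(rng),
            "dropout": rng.choice([0.0, 0.2, 0.5]),
        }
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Stand-in objective for demonstration only: pretends 1e-2 is the ideal rate.
def toy_score(config):
    return abs(math.log10(config["learning_rate"]) + 2.0)


best_config, best_score = random_search(toy_score)
print(best_config, best_score)
```

The learning rate is sampled on a logarithmic scale because useful values often span several orders of magnitude; a grid search would instead loop over a fixed list of values for each hyper-parameter.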
 
I highly recommend you also look closely at the Keras API (https://keras.io/), which has many examples that specifically address the type of modeling problem we’re focused on. As you prepare the dataset, you might consider re-formatting it to reflect temporal dependencies. If you decide to pursue this direction, here’s a resource that focuses specifically on this subject.
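One common way to re-format a series so that it reflects temporal dependencies is a sliding window: each training input is a run of consecutive values and its target is the value that follows. The sketch below is a generic illustration, not the Datathon's prescribed preprocessing; the function name `make_windows` and the sample readings are hypothetical.

```python
def make_windows(series, window):
    """Turn a sequence into (inputs, targets) for next-step prediction.

    Each input is `window` consecutive values; its target is the value
    that immediately follows.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(list(series[start:start + window]))
        targets.append(series[start + window])
    return inputs, targets


# Hypothetical monthly readings, for illustration only.
monthly = [10, 12, 15, 19, 24, 28]
X, y = make_windows(monthly, window=3)
print(X)  # [[10, 12, 15], [12, 15, 19], [15, 19, 24]]
print(y)  # [19, 24, 28]
```

Recurrent layers in Keras expect input shaped as (samples, timesteps, features), so windows like these would typically be converted to an array and reshaped before being passed to a model.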

Step 2: Read advanced domain-specific background information
It might also be useful to get domain-specific knowledge to help you with the process of formulating a model. To help you get started, we have a tutorial from Marcus Voss and Nikola Milojevic-Dupont, domain experts at CCAI!

Step 3: Consider Research
Last year, we had a tremendous response to the Research Phase II of the Datathon. In response to the interest, we have significantly expanded Phase II this year. I recommend that you think about research questions that might emerge from your exploration of this data that you could pursue in Phase II. 

Finally, have fun! This exercise is really meant as an introduction to this vast field. We hope that, now that your curiosity is piqued, you will explore this area more fully.

Ready to try it yourself? Sign up to take the challenge!

A discussion by Sharada Kalanidhi, WiDS Datathon Co-Chair and Data Scientist at the Stanford Genome Technology Center (Department of Biochemistry), Stanford University School of Medicine. Her research interests are mathematical and statistical analysis of multi-omics data.

Related Articles:
  • Getting Started with Kaggle - WiDS 2022 Datathon Video
  • The Women in Data Science (WiDS) Datathon 2022 is Now Live on Kaggle!


© 2023 Women in data science. Women in Data Science is a Registered trademark of Stanford University. 
