Today is the first day of the Fall 2015 term at Waterloo, and I will be starting the second half of my third year. In this blog post, I will reflect upon the things I did this summer and what my goals are for the term.
I completed a summer internship in San Mateo, California about 2 weeks ago. During the internship I had the pleasure of working and learning alongside talented full-time employees on my team and other teams. I also received mentorship from my supervisor, who helped me reflect on my progress on a weekly basis, and I got to work closely with a bright intern on the same team who lives in the Bay Area. My internship helped me develop a strong interest in working with data, so I used the summer as an opportunity to learn more about the subject through online courses offered on edX.
Previously, I took Intro to Hadoop and MapReduce on Udacity, and learned about writing mappers and reducers for Hadoop Streaming to obtain aggregate statistics of large text files. In June, I took a course called Introduction to Big Data with Apache Spark and learned about easier ways of performing MapReduce. By chaining together calls such as reduce and passing a lambda function to each one, tasks such as getting the top k stores with the highest revenues from a list of transactions can be performed much more easily and succinctly.
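To give a feel for the top k stores task, here is a minimal sketch in plain Python rather than Spark itself. The sample transactions and the names add_sale and top_k_stores are made up for illustration. In Spark the same idea would usually be written against an RDD with calls such as reduceByKey and takeOrdered, but this version only assumes the standard library.

```python
from functools import reduce
import heapq

# Hypothetical sample data: (store, sale amount) pairs.
transactions = [
    ("store_a", 120.0),
    ("store_b", 75.5),
    ("store_a", 30.0),
    ("store_c", 210.0),
    ("store_b", 24.5),
]


def add_sale(totals, sale):
    """Reduce step: fold one (store, amount) pair into the running totals."""
    store, amount = sale
    totals[store] = totals.get(store, 0.0) + amount
    return totals


def top_k_stores(sales, k):
    """Return the k (store, revenue) pairs with the highest total revenue."""
    revenue_by_store = reduce(add_sale, sales, {})
    # The lambda picks the revenue out of each (store, revenue) pair.
    return heapq.nlargest(k, revenue_by_store.items(), key=lambda kv: kv[1])


top_two = top_k_stores(transactions, 2)
print(top_two)
```

The reduce call does the aggregation per store, and the lambda passed to heapq.nlargest decides what "top" means, which is the same division of labour as in the chained Spark version.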
In July, I took the successor of the Spark course, called Scalable Machine Learning. This course taught me how to use MLlib in Spark to run linear and logistic regression algorithms in a distributed manner. Furthermore, the course taught how to analyze the time and space complexities of distributed algorithms, which was somewhat dry for me but should nevertheless be understood to justify using them. I hope to review the complexity calculations in the future.
From June to August, I also took a course called The Analytics Edge. This is by far my favorite course, since it taught me the process of approaching a data analysis problem through many assignments on exciting data sets. Although the course is taught in R, the process is language-agnostic and can be followed in other languages such as Python using pandas and scikit-learn. The best part of this course was the in-class Kaggle competition. Even though I did not spend as much time on the competition as I wanted to, since my parents visited me and we travelled to southern California, I am happy that I still got to try out many techniques taught in the course. In the end my best model, measured by the area under the ROC curve (AUC), a metric of how well a model can classify observations, turned out to be a pretty simple logistic regression model. When I get more time I hope to write in more detail about the courses I took this past summer.
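For readers who have not met the AUC before, here is a small sketch of what the number means, written in plain Python. It relies on the standard interpretation of the AUC as the probability that a randomly chosen positive observation gets a higher score than a randomly chosen negative one, with ties counted as half. The function name roc_auc and the toy labels and scores are made up for illustration and are not from the competition.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0 or 1, scores are the model's predicted scores.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy example: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 0.5 means the model ranks positives no better than chance, and 1.0 means every positive is scored above every negative. The pairwise loop is fine for small examples, while libraries such as scikit-learn offer a faster implementation for real data sets.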
Besides pursuing my passion for working with data, I also visited the Grand Canyon, Las Vegas, Death Valley, Yosemite, Big Sur, Los Angeles, and San Diego. Here are a few pictures:
After I had lots of fun as an Orientation Leader welcoming new students to the university, a new term has come. Here are my most important goals for this new term - the so-called “New Term Resolutions”. I really hope that I won’t have any regrets when I look back at this list at the end of the term:
That is it for today. I hope I will have free time to write a new blog post soon.