I am a third-year graduate student in Software Engineering. My studies focus on Data Science, including analytics, visualization, statistics, data mining and machine learning. I have a 4.0 GPA and am loving every minute of my Harvard experience!
Fall/Spring 2016: I will be a member of the teaching staff for a new course at the Harvard School of Public Health. This will be an introductory course in Data Science that draws on concepts from Statistics, Computer Science and Software Engineering. We will teach the skills needed to manage and analyze data, covering exploratory data analysis, statistical inference and modeling, machine learning, and high-dimensional data analysis. We will also teach the skills needed to develop data products, including R programming, data wrangling, reproducible research, and communicating results.
Spring 2015: I have been working in Bioinformatics Algorithms. I just wrapped up a project on analyzing Single Nucleotide Polymorphisms in autosomal DNA. For this project I analyzed millions of data points in the human genome and wrote thousands of lines of code to do so. I wrote my own mutation code, which uses real-world data to build discrete probability distributions and generate realistic chromosomes that I could use for analysis. We made a very unpolished report (I still need to edit it) here.
Fall 2014 and 2015: I am a Teaching Fellow for CS109 Data Science, which also includes the courses STAT-121, AC-209 and CSCI-E109. I am also a student in STAT E-100 Introduction to Quantitative Methods for the Social Sciences and Humanities.
Fall 2013: I worked in a group of 4 Harvard graduate and doctoral researchers building a social network exploration tool, SEE:NET (Social Evolution Experiment Network Exploration Tool). This tool explores the Social Evolution data, which is part of the MIT Reality Commons project. The experiment involved 84 students who were monitored 24/7 for an entire year. The data included Bluetooth proximity data, WLAN location data, SMS data, call data and GPS data, as well as numerous survey datasets. We used visualization techniques such as time series exploration, force-directed node graphs, chord diagrams and heat maps. The project was included in the CS171 Hall of Fame.
Spring 2014: I worked in a group of 4 Harvard graduate and doctoral researchers building a topic model for United States K-12 schools. We are analyzing approximately 1M reviews from GreatSchools, the largest dataset of its kind. The project includes complete ingestion of the reviews data, text cleaning, pre-processing, and building multiple models, primarily using Latent Dirichlet Allocation (LDA). The research involves finding the ideal Alpha and Eta hyper-parameters via perplexity grid search routines. All of the coding is being done in Python, with further analysis in R. Our project is online and a summary can be found here.
This semester, Fall 2014, I finished an extensive analysis of the retail market for organic and non-organic food varieties. This included constructing multiple models, running F-tests, permutation tests and t-tests, normalizing and standardizing all data, and documenting the project extensively.
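For readers unfamiliar with permutation tests, here is a minimal sketch of a two-sided test for a difference in means. It is an illustration only, not the code from the project: the function name `permutation_test`, its parameters, and the sample values are hypothetical, and the retail data itself is not shown.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value. `a` and `b` are sequences of numbers.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))

    pooled = list(a) + list(b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point noise in ties.
        if diff >= observed - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical prices for two product groups (made-up numbers).
group_a = [2.9, 3.1, 3.4, 3.0, 3.3, 3.2]
group_b = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5]
p_value = permutation_test(group_a, group_b)
```

A small p-value suggests the observed difference in means would be unusual if group labels were interchangeable. Unlike a t-test, this approach does not assume normally distributed data.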
- 4.0 grade point average, recognized by Harvard faculty for distinguished contribution in Data Science and Big Data Analytics
- SAS 2014 Global Forum Scholarship recipient (1 of 20 nationally)
- Top 1% of all Kaggle competitors
- I love data
- I like riding my motorcycle
- I love conversation and would love to talk to you; please contact me
- Although I have a passion for academics and I work very hard, there is so much to be done in this world, so many social/economic problems, and I would love to help
Areas of expertise:
- Columnar Store MPP databases and distributed computing architectures for processing large amounts of data
- Extensive experience using the Hadoop ecosystem, including Hadoop, mrjob, Sqoop, Pig and Hive
- Exploratory data analysis using R, SAS, Python, Rattle, Weka, RapidMiner, etc. to generate hypotheses and intuition
- Node graph visualization using NetworkX, GraphViz, JGraph, JGraphT and Gephi
- Data munging/sampling/scraping/cleaning using a variety of Python tools such as Beautiful Soup, requests, pattern, fnmatch, re and Pandas
- Machine learning models, both supervised and unsupervised, such as KNN, K-means, SVM, Decision Trees, Naive Bayes, Recommender Systems, Neural Networks and Linear/Logistic Regression, covering Classification, Clustering, Dimensionality Reduction, Optimization, Regression, Cross Validation, Prediction and more
Past research includes:
- Developing my own optimization model for solving a modified version of the traveling salesman problem (TSP), including writing my own implementations of the 2-opt, 2.5-opt and 3-opt heuristics, to discover two optimum disjoint paths (Java)
- Developing an image recognition system using Support Vector Machines with a radial basis function (RBF) kernel, achieving > 97% accuracy
- Developing a complete text processing pipeline (tokenization, spell correction, lemmatization, etc.) using NLTK and parallel processing (Python)
- Parallel processing of large datasets using IPython's distributed RMI-like architecture, including "scatter/gather" of large datasets (Python)
- Using Python's Pyro4 framework to run LDA in parallel across multiple cores to deal with large datasets
- Development of distributed client/server architectures using Java RMI (Java)
- Analyzing US election outcomes using Logistic Regression
- Predictive analytics of Rotten Tomatoes reviews data using Multinomial Naive Bayes classifier
- Predictive analytics of Yelp restaurant data using Collaborative Filtering recommender system combined with global and local K-nearest neighbors as well as Gibbs Sampler
- Analysis of word positivity using Amazon EMR, Hadoop and Python mrjob
- Social network analysis using node graphs, R and Python
- Analysis of the voting patterns and bipartisanship of the 113th US Senate using node graphs and visualization
- Construction of a real-time, database-driven mashup between Google Maps and the BART API showing real-time arrival/departure times
- Building a complete Data Warehouse using Talend Open Studio and MySQL
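
As an illustration of the 2-opt heuristic mentioned in the TSP item above, here is a minimal Python sketch of the textbook version on a single closed tour. It is not the Java implementation from that project and does not cover the modified two-path problem or the 2.5-opt and 3-opt variants. The names `tour_length` and `two_opt` and the sample points are hypothetical.

```python
import math


def tour_length(tour, points):
    """Total length of a closed tour visiting `points` in `tour` order."""
    n = len(tour)
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n)
    )


def two_opt(tour, points):
    """Repeatedly reverse tour segments while doing so shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a node; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b) and (c,d) become (a,c) and (b,d).
                delta = (
                    math.dist(a, c) + math.dist(b, d)
                    - math.dist(a, b) - math.dist(c, d)
                )
                if delta < -1e-12:
                    # Reversing the segment between the edges applies the swap.
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour


# Hypothetical example: four corners of a unit square, visited in a crossing order.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
best = two_opt([0, 1, 2, 3], points)
```

2-opt is a local search: it stops at a tour that no single segment reversal can shorten, which is not guaranteed to be the global optimum.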