Some things I learned playing the role of ‘data scientist’

In general: Most of my career has been focused on Business Intelligence, which is gathering and visualizing historical data to know what happened. Data science is using historical data to predict the future.
For example: run a regression model based on a few key features of a user visiting eBay to try to predict the likelihood of that user buying something in the near future.
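To make that concrete, here is a minimal sketch of one common choice for "likelihood of buying": logistic regression, fit with plain gradient descent. The post does not say which model was actually used, and the two features (items viewed, items in cart) and all the numbers are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, y in zip(rows, labels):
            x = (1.0, *features)  # prepend 1.0 so weights[0] acts as the bias
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grads[j] += error * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict_proba(weights, features):
    """Estimated probability that a visitor with these features buys."""
    x = (1.0, *features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical history: (items viewed, items in cart) and whether they bought.
visitors = [(1, 0), (2, 0), (3, 0), (2, 1), (8, 2), (10, 3), (7, 1), (12, 4)]
bought = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic(visitors, bought)
p_engaged = predict_proba(weights, (9, 2))
p_casual = predict_proba(weights, (1, 0))
print(f"engaged visitor: {p_engaged:.2f}, casual visitor: {p_casual:.2f}")
```

In practice you would hand this to a library rather than write the loop yourself; the point is only that the model outputs a probability between 0 and 1 instead of a raw value.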

Roughly 80% of data science work is still spent on ETL and data wrangling before any modeling starts.
Below are some tips/things I learned while doing data science.

You’re either team R or team Python (both programming languages).
One can do correlations and regressions in Excel or SQL, but it’s like using a spoon to hammer nails.

R:
Coding in R syntax is a pain. It’s the opposite of Ruby in terms of readability.
R is memory intensive, so spin up a big-memory EC2 box.
Don’t forget to clear your R workspace when you run R BATCH files.

rm(list = ls())

When googling stuff for R, put [R] in the search query.

Python:
scikit-learn is your golden library.

Some Basics of Data Science:
Regression, Classification, Clustering, Dimensionality reduction.

Regression: use history to predict a future value, e.g. using hours spent studying math to predict the SAT math score.
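A minimal sketch of the hours-to-score idea, using ordinary least squares with a single feature. The hours and scores below are invented for illustration, and `fit_line` is a hypothetical helper name, not something from a library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up history: hours spent studying and the resulting math score.
hours = [5, 10, 15, 20, 25]
scores = [480, 520, 570, 600, 650]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 30  # forecast for a student who studies 30 hours
print(f"score = {intercept:.1f} + {slope:.2f} * hours; 30 hours -> {predicted:.0f}")
```

The slope reads as "points gained per extra hour", which is usually the number people actually want out of a regression like this.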

Classification: use data to assign something to a category, e.g. spammer detection based on user actions.
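As a rough sketch of spammer detection, here is a nearest-centroid classifier: average the features of each labeled group, then give a new user the label of the closest average. This is a deliberately simple stand-in, not necessarily what you would run in production, and the features (messages per hour, links per message) and data are hypothetical.

```python
import math

def centroid(rows):
    """Column-wise mean of a list of feature tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def train(examples):
    """examples: list of (features, label). Returns {label: centroid}."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical user actions: (messages per hour, links per message).
training = [
    ((2, 0.1), "normal"),
    ((3, 0.0), "normal"),
    ((1, 0.2), "normal"),
    ((40, 0.9), "spammer"),
    ((55, 1.0), "spammer"),
    ((35, 0.8), "spammer"),
]

model = train(training)
print(predict(model, (45, 0.7)))
print(predict(model, (4, 0.1)))
```

Note that the messages-per-hour feature dominates the distance here because of its larger scale; with real data you would normally rescale the features first.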

Clustering: use data to identify and group similar objects into sets, e.g. scraping a log of takeout orders to identify restaurant types.
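One way to sketch the grouping step is plain k-means: assign each point to its nearest centroid, move each centroid to the mean of its points, repeat. The per-restaurant features (average order price, average items per order), the data, and the choice of two starting centroids are all assumptions for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means. Returns (centroids, labels), where labels[i] is the
    index of the cluster that points[i] was assigned to."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        centroids = new_centroids
    return centroids, labels

# Hypothetical features: (average order price in dollars, average items per order).
restaurants = [(8, 2), (9, 3), (10, 2), (45, 6), (50, 5), (48, 7)]

# Start from two of the data points; real code would usually pick these randomly.
centroids, labels = kmeans(restaurants, centroids=[restaurants[0], restaurants[-1]])
print(labels)
print(centroids)
```

No labels go in; the algorithm only returns group numbers, and it is up to you to look at each group and decide it means something like "cheap lunch spot" versus "sit-down dinner".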

Machine Learning: using data patterns to do predictive analytics.

Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then help it reach an accurate conclusion when given new data. (This is the one I am applying to data sets to find data gold.) For example: features x, y, z of a student indicate a high likelihood of failure.
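The train-then-predict workflow can be sketched with about the simplest learner there is: a single cut-off on one feature (sometimes called a decision stump), learned from labeled examples and then checked on examples it has not seen. The feature (attendance rate) and every data point here are hypothetical.

```python
def learn_threshold(examples):
    """Pick the cut-off that misclassifies the fewest training examples,
    predicting 'fail' whenever the feature value is below the cut-off."""
    candidates = sorted({x for x, _ in examples})

    def errors(t):
        return sum((x < t) != failed for x, failed in examples)

    return min(candidates, key=errors)

def accuracy(threshold, examples):
    """Share of examples where the threshold rule matches the true label."""
    correct = sum((x < threshold) == failed for x, failed in examples)
    return correct / len(examples)

# Hypothetical labeled data: (attendance rate, did the student fail?).
train_set = [(0.95, False), (0.90, False), (0.85, False), (0.80, False),
             (0.60, True), (0.55, True), (0.40, True), (0.70, True)]
# Held-out examples the learner never sees during training.
test_set = [(0.92, False), (0.50, True), (0.75, True), (0.88, False)]

threshold = learn_threshold(train_set)
print(f"learned cut-off: {threshold}")
print(f"accuracy on held-out students: {accuracy(threshold, test_set):.2f}")
```

Keeping a held-out set is the important habit: accuracy measured on the training examples alone tends to flatter the model.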

Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships therein. (this is more or less AI, which I’m not involved in)

Data Science Studio is a great GUI tool for data wrangling and some quick data modeling.
