Credit to Peter James Thomas:
Most companies don’t realize the effort involved in Business Intelligence. Often they just throw semi-technical business analysts at a problem instead of committing dedicated technical resources to Data Management and Analytics Reporting.
Taken from a Dataiku meetup slide. This picture hit close to home.
I switched to wordpress.com as my host. I will most likely switch to AWS later.
Redshift is based on a branch of PostgreSQL 8.0.2 (PostgreSQL 8.0.2 was released in 2005).
Here’s all the unsupported fancy PostgreSQL stuff, taken directly from Amazon’s manual.
The big ones are: no stored procedures, no constraint enforcement, no triggers, no table functions, and no upserts.
However, don’t ever forget what an Amazon Redshift query can do:
SELECT COUNT(DISTINCT column_name) FROM table;
runs roughly 200x faster than PostgreSQL on a billion-row table.
These PostgreSQL features are not supported in Amazon Redshift.
Do not assume that the semantics of elements that Amazon Redshift and PostgreSQL have in common are identical. Make sure to consult the Amazon Redshift Developer Guide SQL Commands to understand the often subtle differences.
Unique, primary key, and foreign key constraints are permitted, but they are informational only. They are not enforced by the system, but they are used by the query planner.
Best Practices for Micro-Batch Loading on Amazon Redshift, an article from the AWS blog.
I work with Redshift every day now at Amazon. It’s a very useful big-data warehouse tool.
Here’s a blog post about loading data into it. It’s very S3-dependent, with heavy use of the COPY command.
Some quick notes:
-It’s faster to drop and reload big tables via staging areas.
-Split input files into pieces and load them in parallel.
-Use the COPY option STATUPDATE OFF.
-Avoid VACUUMing tables when possible.
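To make a couple of those points concrete, here’s a hedged sketch of a COPY against a split-file S3 prefix. The table, bucket, and IAM role names are all made up for illustration:

```sql
-- Hypothetical table, bucket, and role names.
-- Pointing COPY at a key prefix loads every matching file
-- (part-000, part-001, ...) across the slices in parallel.
COPY events_staging
FROM 's3://my-bucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
GZIP
DELIMITER '|'
STATUPDATE OFF;  -- skip the automatic statistics refresh on this load
```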
You could just read the main points in the how-to guide.
Here’s the quick and easy version; do the following in a single transaction:
1. Create staging table “tablename_staging” like main table
2. Copy data from S3 into staging table
3. Delete rows in main table that are already present in staging table
4. Copy all rows from staging table to main table
5. Drop staging table
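In SQL, the five steps above could look roughly like this. The table name and the `id` join key are placeholders; substitute your real table and natural key:

```sql
BEGIN;

-- 1. staging table with the same layout as the main table
CREATE TABLE tablename_staging (LIKE tablename);

-- 2. bulk load from S3 (placeholder bucket and role)
COPY tablename_staging
FROM 's3://my-bucket/batch/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy';

-- 3. delete rows in the main table already present in staging
DELETE FROM tablename
USING tablename_staging
WHERE tablename.id = tablename_staging.id;

-- 4. copy all rows from staging into the main table
INSERT INTO tablename
SELECT * FROM tablename_staging;

-- 5. drop the staging table
DROP TABLE tablename_staging;

COMMIT;
```

Doing it all in one transaction means readers never see a half-merged batch.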
Redshift is:
Fast like a Ferrari
Cheap like a Ford Fiesta
Useful like a minivan
Self-driving auto-magic like a Tesla with Autopilot
Really fancy features under the hood:
-Interleaved sort keys
-Columnar distributed storage
-Smart parallel execution
-I/O optimization (returns results fast)
-Easy to add nodes; very scalable
-Shared-nothing architecture
-Less need for DBAs; monitor and manage it in the AWS console
And the trade-offs:
-Amazon cloud only
-Dependent on S3
-Only useful for very, very large datasets
-A limited number of concurrent queries
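A hedged sketch of where those knobs show up in DDL (table and column names are made up; columnar storage and parallel execution need no syntax at all):

```sql
-- Hypothetical fact table: you choose the distribution and sort keys,
-- Redshift handles columnar storage and parallelism automatically.
CREATE TABLE page_views (
    view_date  DATE          NOT NULL,
    user_id    BIGINT        NOT NULL,
    url        VARCHAR(2048)
)
DISTKEY (user_id)                          -- rows spread across nodes by user
INTERLEAVED SORTKEY (view_date, user_id);  -- equal weight to both filter columns
```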
Snowflake: a data warehouse in the cloud (especially Amazon’s).
Snowflake is basically a scalable analytics database. Data is stored and shared on AWS S3 buckets rather than inside Snowflake itself. You spin up Snowflake, ingest and load data into its proprietary parallel columnar SQL engine for analytics processing, and afterwards run SQL against it. The syntax is mostly PL/SQL- and/or T-SQL-like. In the Hadoop world, Snowflake is a SQL-based data warehouse you run on demand.
Looker: cloud-based business intelligence reporting and dashboards.
Looker’s a new BI tool. It does reports and dashboards, and allows data exploration. What makes it different is that it’s cloud-hosted and uses developer-like workflows (LookML, a YAML-like modeling syntax; Git version control; release pushes). It’s very fast to build out in the hands of a seasoned data engineer or data architect. It also simplifies a lot of common data warehouse tasks (auto-generated time-dimension lookups, rolling totals, data manipulation), and it has connectors to most data sources via JDBC.
What Looker and Snowflake have in common: they’re both AWS-cloud-based, easy to get set up fast, and use SQL as the lingua franca.
TensorFlow : Google’s machine learning API can be super powerful.
Just make JSON REST calls to the endpoint and get results back from Google’s machine learning libraries. You can:
1. identify an image (image classification)
2. parse speech into text
3. translate languages
4. use for predictive analytics to run wide and deep learning on datasets
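For instance, an image-classification call is just a POST of a JSON body to the Cloud Vision endpoint (https://vision.googleapis.com/v1/images:annotate); the bucket and file name below are made up:

```json
{
  "requests": [
    {
      "image": { "source": { "imageUri": "gs://my-bucket/photo.jpg" } },
      "features": [ { "type": "LABEL_DETECTION", "maxResults": 5 } ]
    }
  ]
}
```

The response comes back as JSON label annotations with confidence scores.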
Google Cloud Platform : Google’s equivalent of AWS.
Key takeaways: they charge by the minute, not by the hour like Amazon.
An extremely cheap pay-as-you-use model on an extremely fast network, with access to Google’s pipes and servers.
A ton of useful cloud tools: SaaS, PaaS, BaaS, etc.
Compute Engine : like AWS EC2 boxes, except more scalable.
App Engine : like AWS Elastic Beanstalk.
Cloud Storage : like AWS S3.
Data tools : the ones I’m interested in.
Google Bigtable : similar to Amazon DynamoDB, a NoSQL data store. It’s the basic storage engine behind Gmail, Google Maps, etc., and the original inspiration for Hadoop’s HBase.
Google BigQuery : a web interface to scan and query millions, billions, even trillions of rows.
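For example, assuming standard SQL mode, one statement against a real Google public sample table aggregates the full Wikipedia revision history:

```sql
-- Scan the public Wikipedia revision-history sample table
SELECT title, COUNT(*) AS revisions
FROM `bigquery-public-data.samples.wikipedia`
GROUP BY title
ORDER BY revisions DESC
LIMIT 10;
```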
Google Dataproc : managed Hadoop, Spark, Pig, and Hive combined into one interface. Spin up a cluster of multiple nodes in seconds and run a Spark, Pig, or Hive job using Google’s compute power.
(better than Amazon EMR)
Summary: pay-as-you-go cloud computing services. Watch out, Amazon Web Services and Azure.
Google Data Studio : a neat visualization app for data.