Amazon Redshift’s Unsupported PostgreSQL Features

Redshift is based on a branch of PostgreSQL 8.0.2 (PostgreSQL 8.0.2 was released in 2005).

Here’s all the unsupported fancy PostgreSQL stuff, taken directly from Amazon’s manual.

The big ones are: no stored procedures, no constraint enforcement, no triggers, no table functions, and no upserts.

However, don’t ever forget that an Amazon Redshift query like:

select count(distinct column_name) from table;

can run around 200x faster than PostgreSQL on a billion-row table.

Unsupported PostgreSQL Features

These PostgreSQL features are not supported in Amazon Redshift.

Important

Do not assume that the semantics of elements that Amazon Redshift and PostgreSQL have in common are identical. Make sure to consult the Amazon Redshift Developer Guide SQL Commands to understand the often subtle differences.

  • Only the 8.x version of the PostgreSQL query tool psql is supported.
  • Table partitioning (range and list partitioning)
  • Tablespaces
  • Constraints
    • Unique
    • Foreign key
    • Primary key
    • Check constraints
    • Exclusion constraints

    Unique, primary key, and foreign key constraints are permitted, but they are informational only. They are not enforced by the system, but they are used by the query planner. (A quick sketch of this behavior follows this list.)

  • Inheritance
  • Postgres system columns. Amazon Redshift SQL does not implicitly define system columns. However, the PostgreSQL system column names cannot be used as names of user-defined columns. See http://www.postgresql.org/docs/8.0/static/ddl-system-columns.html
  • Indexes
  • NULLS clause in Window functions
  • Collations. Amazon Redshift does not support locale-specific or user-defined collation sequences. See Collation Sequences.
  • Value expressions
    • Subscripted expressions
    • Array constructors
    • Row constructors
  • Stored procedures
  • Triggers
  • Management of External Data (SQL/MED)
  • Table functions
  • VALUES list used as constant tables
  • Recursive common table expressions
  • Sequences
  • Full text search
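
Since the constraints item above is the one that bites people, here’s a minimal sketch of the informational-only behavior (the table and column names are made up for illustration):

create table dim_user
( user_id integer not null primary key, -- declared, but never enforced
email varchar(256) not null
);

insert into dim_user values (1, 'a@example.com');
insert into dim_user values (1, 'b@example.com'); -- no error: the duplicate key loads fine

-- Both rows now exist. Keeping user_id unique is the loader's job;
-- the planner still uses the declared key when optimizing joins.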

Best Practices for Micro-Batch Loading on Amazon Redshift

This is a summary of the AWS blog article of the same name.

I work with Redshift every day now at Amazon. It’s a very useful big data warehouse tool.
Here’s a blog post about loading data into it. The approach is very S3-dependent, with heavy use of the COPY command.

Some quick notes:
-It’s faster to drop and reload big tables through staging areas.
-Split input files into pieces and load them in parallel.
-Use the COPY option ‘STATUPDATE OFF’.
-Avoid VACUUMing tables when possible.

You could just read the main points in the how-to guide.

Here’s the quick and easy version; do the following in a single transaction (a sketch in SQL follows the steps):
1. Create staging table “tablename_staging” like main table
2. Copy data from S3 into staging table
3. Delete rows in main table that are already present in staging table
4. Copy all rows from staging table to main table
5. Drop staging table
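
A minimal sketch of those five steps, assuming a main table named event_log keyed on an id column and loading from a made-up S3 prefix (the bucket path and IAM role are placeholders you’d substitute):

begin;

-- 1. Create the staging table like the main table
create table event_log_staging (like event_log);

-- 2. Copy from S3 into staging; a shared prefix lets COPY load the split files in parallel
copy event_log_staging
from 's3://my-bucket/event_log/part'
credentials 'aws_iam_role=<your-role-arn>' -- placeholder, not a real ARN
delimiter ',' statupdate off;

-- 3. Delete rows in the main table that are already present in staging
delete from event_log
using event_log_staging
where event_log.id = event_log_staging.id;

-- 4. Copy all rows from staging into the main table
insert into event_log select * from event_log_staging;

-- 5. Drop the staging table
drop table event_log_staging;

commit;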


Amazon Redshift is an amazing database product

Redshift is:
Fast like a Ferrari
Cheap like a Ford Fiesta
Useful like a minivan
Self-driving auto-magic like a Tesla with Autopilot

Key features:

Really fancy features under the hood:
-interleaved sort keys (see the sketch after this list)
-columnar distributed storage
-smart parallel execution
-IO optimization (returns results fast)
-easy to add nodes; very scalable
-shared-nothing architecture
-less need for DBAs; monitor and manage it from the AWS console
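
Here’s a hedged sketch of DDL that exercises the distribution and sort-key features (the table and columns are invented for illustration):

create table page_views
( view_date date not null,
user_id bigint not null,
page_url varchar(512)
)
diststyle key distkey (user_id) -- co-locate each user's rows on one slice
interleaved sortkey (view_date, user_id); -- gives equal weight to both filter columns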

Caveats:
-Amazon cloud only
-Heavy dependence on S3
-Only really shines on very large datasets
-A limited number of concurrent queries


Review of two new cloud BI tools: Snowflake and Looker

Snowflake: data warehouse in the cloud (specifically Amazon)

Snowflake is basically a scalable analytics database. Data is stored and shared on AWS S3 buckets rather than inside Snowflake itself. You spin up Snowflake, ingest and load data into its proprietary parallel columnar SQL engine for analytics processing, and afterwards you run SQL against it. The syntax is mostly PL/SQL- and/or T-SQL-like. In the Hadoop world, Snowflake is a SQL-based data warehouse you run on demand.

Pros:

  • Magically scale compute nodes up/down with a few clicks. No need for database tuning.
  • Ingest JSON data directly from flat files (see the sketch after this list).
  • Pay only for what you use.
  • SQL-based platform, so there’s no need to learn another language.
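
A small sketch of that JSON ingestion flow, with made-up stage, table, and field names (Snowflake lands the JSON in a variant column you then query with path syntax):

create table raw_events (v variant);

copy into raw_events
from @my_stage/events/ -- @my_stage is a hypothetical named stage
file_format = (type = 'json');

select v:user_id::string as user_id,
       v:event_type::string as event_type
from raw_events;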

Cons:

  • Locked in to the Amazon cloud vendor.
  • All your data needs to be in S3, cleaned into a semi-structured form, before Snowflake can read and load it.
  • The SQL dialect is not completely up to date with the latest standard (missing some window functions, etc.).

Looker: Cloud-based Business Intelligence reporting and dashboards.

Looker’s a new BI tool. It does reports, dashboards, and data exploration. What makes it different is that it’s cloud-hosted and uses developer-style frameworks (LookML, a YAML-like modeling language; Git version control; release pushes). It’s very fast to build up in the hands of a seasoned data engineer / data architect. It also simplifies a lot of common data warehouse tasks (auto-generated time dimension lookups, rolling totals, data manipulation), and it has connectors to most data sources via JDBC.

Pros:

  • Easy to set up quickly and get baseline reports working.
  • Cloud-based, so you can just point it at your database and let them host, or set up an on-premise instance yourself.
  • Git version control and rollback, something most BI tools lack.
  • Relatively cheap to embed into existing applications.

Cons:

  • Still quite new; doesn’t have all the mature BI features built out.
  • Visualizations are still simple grids, bars, lines, pies, etc.
  • Requires a fairly technical person to set up the LookML schemas before business analysts can self-serve and explore data.
  • You need to know SQL well to troubleshoot results.

What Looker and Snowflake have in common: they’re both AWS cloud-based, quick to set up, and use SQL as the lingua franca.


DevFestDC: key takeaways on the Google Cloud products

TensorFlow: Google’s machine learning library can be super powerful.
Just make JSON REST calls to the endpoint and get results back from Google’s machine learning models.
Use cases:
1. Identify an image (image classification)
2. Parse speech into text
3. Translate languages
4. Predictive analytics, running wide-and-deep learning on datasets

Google Cloud Platform: Google’s answer to AWS.
Key takeaway: they charge by the minute, not by the hour like Amazon.
An extremely cheap pay-as-you-use model on an extremely fast network; access to Google’s pipes and servers.

A ton of useful cloud tools: SaaS, PaaS, BaaS, etc.

Compute Engine: AWS EC2 boxes, except more scalable.
App Engine: AWS Elastic Beanstalk.
Cloud Storage: AWS S3.

Data tools: the ones I’m interested in.

Google Bigtable: Similar to Amazon DynamoDB, a NoSQL data store. It’s the basic storage engine behind Gmail, Google Maps, etc., and the design Hadoop’s HBase was modeled on.

Google BigQuery: Web interface to scan and query millions, billions, even trillions of rows.
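
For a feel of the interface, here’s a small standard-SQL query you could run from the BigQuery web UI against one of Google’s public sample datasets:

-- top ten most frequent words across Shakespeare's works
select word, sum(word_count) as total
from `bigquery-public-data.samples.shakespeare`
group by word
order by total desc
limit 10;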

Google Dataproc: Managed Hadoop, Spark, Pig, and Hive combined into one interface. Spin up a multi-node cluster in seconds and run a Spark, Pig, or Hive job using Google’s compute power.
(better than Amazon EMR)

Summary: pay-as-you-go cloud computing. Watch out, Amazon Web Services and Azure.

Google Data Studio: a neat data visualization app.


Basic database table creation and load from CSV using MySQL and Postgres

Basic database table creation with MySQL and PostgreSQL.
The starting point for most data applications is getting the data feeds in and populating the tables.
Here’s an example of the process: loading a stock_history table sourced from the Yahoo Finance API.

Make a table :

mysql:

create table stock_history
( id bigint not null primary key AUTO_INCREMENT,
symbol varchar(10) not null,
data_date date not null,
open_at decimal(14,6) null,
close_at decimal(14,6) null,
high decimal(14,6) null,
low decimal(14,6) null,
volume decimal(24,0) null
) ;
create index sidx on stock_history (symbol);

postgres:

create table stock_history
( id SERIAL primary key,
symbol varchar not null,
data_date date not null,
open_at money null,
close_at money null,
high money null,
low money null,
volume numeric null
) ;
create index sidx on stock_history (symbol);

 

Load a table from CSV source
mysql:

LOAD DATA INFILE '/tmp/data_source.csv'
INTO TABLE stock_history
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(symbol, data_date, open_at, close_at, high, low, volume); -- explicit list so the auto-increment id isn't mapped from the file


postgres:

copy stock_history (symbol, data_date, open_at, close_at, high, low, volume)
from '/tmp/data_source.csv' with delimiter ',' csv header;
-- server-side COPY needs superuser file access; from psql you can use \copy instead
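
For reference, here’s the shape of /tmp/data_source.csv that both loads expect: a header row plus comma-delimited fields, with the id column omitted so each database auto-generates it (the rows below are made-up values):

symbol,data_date,open_at,close_at,high,low,volume
AAPL,2016-01-04,102.61,105.35,105.37,101.95,67649400
MSFT,2016-01-04,54.32,54.80,54.94,53.39,53778000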

I’m using the stock_quote gem to source my stock CSV feed.


My personal review of a few top Business Intelligence tools

The Business Intelligence space changes every year, but some things remain the same. Here are my reviews of some of the most common BI software on the market; some of it I have used extensively.

 

Top Dogs: (aka expensive enterprise software)

  • MicroStrategy
  • Cognos
  • Oracle BI / OBIEE
  • SAP Business Objects

Mid-Tier: (mid-priced, BI software for mid-cap companies)

  • Tableau
  • SQL Server Reporting Services / SQL Server Data Tools (SSRS/SSDT/SSDT-BI), or whatever Microsoft decides to name it.
  • Looker
  • Domo
  • QlikView
  • GoodData
  • LogiAnalytics

Free (Open Source): (some are semi-free)

  • Jaspersoft
  • Pentaho
  • Kibana
  • Druid

These are all my opinions, based purely on my experience working with the software. I’ll write this up like awards, by category:

Worst BI Software Still Used by Big Corporations: 

  • Business Objects: The entire SAP Business Objects product list consists of smaller BI companies that BO bought back in the ’90s and early 2000s and rebranded. None of the products work well, and many of the tools do the same thing (Webi/Crystal). It’s remarkably hard to set up, but they sell you consulting services along with the tools.
  • Cognos: gets second place. It’s an IBM-owned product, similar to SAP-owned BOBJ in many ways, but more solid, less buggy, and fairly well supported. Not much innovation since the company was purchased by IBM in 2008.

Best BI Software Used by Big Corporations: 

  • MicroStrategy: I’m biased because I worked at MicroStrategy. However, this tool is rock solid when set up correctly. It never crashes; the company grew organically and never tried to buy other companies to bolt features onto the product. MSTR is, however, fairly expensive, very complex to learn, and only affordable by large-cap corporations. It has a ton of features and a super complex/expensive software licensing model. Great support and training teams.
  • Oracle BI: It’s not bad if you’re already on the Oracle database platform, which it extends with BI reporting and dashboarding. Nothing special or spectacular by any means, but it has the 24/7 Oracle support that enterprises need.

Mid-Tier: Worst

  • SQL Server / Power BI, etc.: If only Micro$oft could stick with one product and not launch a new BI offering every two years. Is it SSRS? Or Power BI? Or what? It generally works okay, assuming you’re on the MSFT boat of tools: SQL Server, SharePoint, Excel, and Azure. SSIS keeps changing, which annoys the ETL guys. It deliberately does not integrate well with non-Microsoft products.
  • Logi Analytics: it’s not bad, but to use it well you need to be very familiar with Microsoft .NET programming.

Mid-Tier: Best

  • Looker: Looker is amazing for data engineers and database developers who can architect a proper schema. Looker has one of the quickest integration paths to a nice BI portal / embedded data product. It is, however, lacking in visualization libraries, and the UI could use more polish. It’s a well-made product on the rise.
  • Tableau: Tableau is a close second to Looker. Tableau’s strength is its visualization library, data exploration, and ultra ease of use for non-technical people analyzing data. Tableau Server, however, is not as robust as the MicroStrategy/Cognos BI portals of the world.

Free-Tier: Worst

  • JasperSoft: This one’s ‘free’ and open source, but it’s not easy to use and its WYSIWYG editor isn’t great. Being free is its highest selling point; it doesn’t do anything well. It’s really a freemium product designed to lock in cheap companies.
  • Pentaho: It’s free and it works okay. It’s not great, and it looks pretty bad, but it integrates well with Kettle ETL. It’s a super simple BI tool, but slow and bloated, with a bad UI.

Free-Tier: Best

  • Kibana: this is as free (open source) as it gets. It’s used with Elasticsearch and often lumped in with Hadoop/Spark stacks to analyze big data sets. I think it’s more useful for log data; I’m not sure about BI slice-and-dice analysis yet. I am currently looking into it.
  • Druid: This is another free, open source one that I know can do slice-and-dice analysis. It’s an insanely fast columnar store that scales to petabytes. The downside is that, I think, it requires a distributed data store (Hadoop/Spark) to realize its full power.

Unknowns:

I didn’t get to evaluate or dig far enough into some of the newer BI tools that are popular:

  • Domo: I think the key feature here is data federation.
  • Splunk: analyzing large volumes of log data.
  • QlikView: (update this later)
  • GoodData: (update this later)
  • SiSense: (update this later)
  • Alteryx: (update this later)