Data Ninjaeering

DevFestDC : key takeaways of the Google Cloud Products

Posted on September 26, 2016 by mx

TensorFLow : google’s machine learning api can be super powerful.
Just make json REST calls to the end-point and get results based on google’s machine learning lib.
Uses cases:
1. identify an image (image classification)
2. parse speech into text
3. translate languages
4. use for predictive analytics to run wide and deep learning on datasets

Google Cloud Platform : It is Google’s AWS cloud.
Key takeaways: They only charge by the minute and not by the hour like amazon.
Extremely cheap pay as you use model on an extremely fast network; Access to Google’s pipes and servers.

A ton of of useful Cloud tools: SaaS, PaaS, BaaS etc..

Compute Engine : AWS EC2 boxes except more scaleable.
App Engine : AWS Beanstalk
Cloud Storage : AWS S3

data tools : the ones I’m interested in.

Google’s Big Table : Similar to amazon DynoDB, a nosql data store. The basic store engine of gmail,googlemaps and etc. Google’s original Hbase hapdoop data store.

Google BigQuery : Web interface to scan and query millions, billions, trillions of rows.

Google DataProc : Managed Hadoop, Spark Pig, Hive combined into one interface. Spin up a multiple cluster nodes in seconds and run a spark, pig, hive job using google’s compute power.
(better than amazon emr)

Summary: pay-as-you-go for cloud computing service. Watch out Amazon Web Services and Azure.

google data studio : neat visualization app for data.

Posted in big data, machine learning, Uncategorized | Tagged DevFestDc, GCP | Leave a comment

basic database table creation and load from csv using mysql and postgres

Posted on September 22, 2016 by mx

Basic database table creation with MySql and PostGreSQL.
The starting point to most data applications is getting the data feeds and populating the tables.
here’s an example of the process I’m loading a stock_history table from yahoo finance api source.

Make a table :

mysql:
create table stock_history ( id bigint not null primary key AUTO_INCREMENT, symbol varchar(10) not null, data_date date not null, open_at decimal(14,6) null, close_at decimal(14,6) null, high decimal(14,6) null, low decimal(14,6) null, volume decimal(24,0) null ) ; create index sidx on stock_history (symbol);
postgres:
create table stock_history ( id SERIAL primary key, symbol varchar not null, data_date date not null, open_at money null, close_at money null, high money null, low money null, volume numeric null ) ; create index sidx on stock_history (symbol);

Load a table from CSV source
mysql:

LOAD DATA INFILE “/tmp/data_source.csv”
INTO TABLE stock_tickers
COLUMNS TERMINATED BY ‘,’
OPTIONALLY ENCLOSED BY ‘”‘
ESCAPED BY ‘”‘
LINES TERMINATED BY ‘n’
IGNORE 1 LINES;

postgres:
copy stock_history (symbol, data_date, open_at, close_at, high, low, volume) from '/tmp/data_source.csv' WITH HEADER DELIMITER ',' csv;

I’m using stock_quote gem to source my stock csv feed.

Posted in data wrangling, relational databases | Tagged database, mysql, postgresql, sql, tips | Leave a comment

My personal review of a few top Business Intelligence tools

Posted on September 6, 2016 by mx

The Business Intelligence Space changes every year but some things remain the same. Here’s my reviews of some of the most common BI software in the market; some I have used extensively.

Top Dogs: (aka expensive enterprise software)

MicroStrategy
Cognos
Oracle BI / OBIEE
SAP Business Objects

Mid-Tier: (mid-priced, BI software for mid-cap companies)

Tableau
Sql Server Reporting Suite/ Sql Server Data Tools (SSRS/SSDT/SSDT-BI) , whatever Microsoft decide to names it.
Looker
Domo
Qwikview
GoodData
LogiAnalytics

Free (Open Source) : (some are semi free)

Jaspersoft
Pentaho
Kibana
Druid

These are all my opinion and only how I feel based on working with the software. I’ll write this up like awards by category:

Worst BI Software Still Used by Big Corporations:

Business Objects: The entirely of SAP Business Objects software list is consisting BO buying smaller BI companies back in the 90’s and early 2000s’ and rebranding them. None of the products work well and many of the tools do the same thing (Webi/Crystal). It’s remarkably hard to setup, but they sell you consulting services with the tools.
Cognos : gets second place. It’s a IBM owned product similar to SAP owned BOBJ in many ways. However more solid, less buggy than BO and fairly well supported. Not much innovation since company got purchased by IBM in 2008.

Best BI Software Used by Big Corporations:

MicroStrategy : I’m biased because I worked at MicroStrategy. However, this tool is rock solid when setup correctly. It never crashes; they grew organically and never tried to buy other companies to incorporate features into the product. MSTR is however fairly expensive, very complex to learn, and only affordable by large-cap corporations. It’s got a ton of features and super complex/expensive software licensing model. Great support and training teams.
Oracle BI: It’s not bad if you’re already on the Oracle database platform. It extends BI reporting and dash-boarding. Nothing special or spectacular by any means. It has the Oracle support enterprises need the 24/7 support feature.

Mid-Tier: Worst

SQL Server / PowerBI etc: If Micro$oft could decide to stick with one and not make a new BI product every 2 years. Is it SSRS? or PowerBI or what? It generally works okay assuming you’re on the MSFT boat of tools: SQL Server, Sharepoint, Excel and Azure. SSIS keeps changing which annoys ETL guys. It does not have good integration with non Microsoft products on purpose.
Logi Analytics : it’s not bad; but to use it well requires you need to be very familiar with Microsoft .net programming.

Mid-Tier: Best

Looker : Looker is amazing. For data engineers and database developers who can architect a proper schema. Looker has some of the quickest integration path to setup a nice BI portal/embedded data product. It is however lacking in visualization libraries and UI could use more polish. It’s a well made product on the rise.
Tableau : Tableau gets a close second to Looker. Tableau’s strength is it’s visualization library, data exploration and ultra ease of use for non-technicals to analyze data. Tableau Server is however, not as robust as the MicroStrategy/Cognos BI portals of the world.

Free-Tier: Worst

JapserSoft: This one’s ‘free’ and open source. But it’s not easy to use , it’s wysiwyg isn’t great to use. It’s free and that is the highest selling point. It doesn’t do anything well. It’s really a freemium product for cheap companies to get locked in.
Pentaho : it’s free, it works okay. It’s not great; but it’s free. It looks pretty bad. However, it integrates well with Kettle etl. It’s super simple BI tool but slow, bloated with bad UI.

Free-Tier: Best

Apache Kibana : this is free(open source) as it gets. It is used with elastic search and often clumped with the hadoop/spark to analyze big data sets. I think it’s more useful for log data. Not sure of BI slice and dice analysis yet. I am currently looking into it.
Apache Druid: This is another (open source free) one that I know can do slice and dice analysis. It’s an insanely fast columnar store that’s scaleable to Petabytes. Downside is I think it requires a distributed data store (Hadoop/spark) for realization the full power of Druid.

Unknowns:

I didn’t get to evaluate or dig far enough into some of the newer BI tools that are popular:

Domo : I think the key feature here is data federation.
Splunk : large logs of data to analyze
Quikview: (update this later)
Gooddata: (update this later)
SiSense: (update this later)
Alteryx: (update this later)

Posted in BI reporting, Business Intelligence | Leave a comment

Why PostgreSQL is the better MySQL

Posted on August 31, 2016 by mx

Ever since MySql has been purchased by Oracle, it has been lagging in development in the open source space. MariaDB , Percona, Aurora are spin offs that try to address it. MySql is the original M of the LAMP stack. However, things are more becoming more LNPP. (linux, nginx, postgres, php) or just Rails with Postgres.

Anyhow here are advantages and differences of PostgreSql vs MySql. We will use: PostgreSQL 9.3 vs MySql 5.6.

Things PostGreSQL has that MySQL doesn’t have.

CHECK constraints (checks values of fields)
True SERIALIZABLE isolation (avoid dirty reads)
Arrays (like p[values])
hstore (hash data type)
Composite Data types (like structs in C)
Window functions (row_number,partition over)
set functions like : general series
User defined aggregates and operators
Common table expressions (and recursive Cte’s)
XML and JSON Support
MVCC (multi-version concurrency control)
PostGis (GIS support)
Transactional DDL (rollback schema changes)
Fancier Indexing (GIN, GIST)
Un-logged Tables (like mysql myISAM)
inheritance (tables inherit other tables woah!)

Things PostGreSQL doesn’t have that MySQL have.

Session variables (@:= hackery)
a ton of storage engines types for varying use.
Upserting aka insert on duplicate key
Event scheduler that Mysql has basically Unix Cron
GROUP BY … ROLLUP summaries in Mysql.
native ability to cross a query between databases. Postgres needs foreign_data_wrapper.

Things MySQL doesn’t have :

CTE’s, cte recursion and tree traversal.
Native easy to use Json and XML parsing
Temp table have self join limitations
Fancy functions written in procedural language
unions, intersections
transaction isolation.
Full outer joins (c’mon man)

Things to note on differences:

MySQL uses nonstandard ‘#’ to begin a comment line; PostgreSQL doesn’t. Instead, use ‘–‘ (double dash), as this is the ANSI standard, and both databases understand it.
MySQL uses ‘ or ” to quote values (i.e. WHERE name = “John”). This is not the ANSI standard for databases. PostgreSQL uses only single quotes for this (i.e. WHERE name = ‘John’). Double quotes are used to quote system identifiers; field names, table names, etc. (i.e. WHERE “last name” = ‘Smith’).
MySQL uses ` (accent mark or backtick) to quote system identifiers, which is decidedly non-standard.
PostgreSQL is case-sensitive for string comparisons.
Database, table, field and columns names in PostgreSQL are case-independent, unless you created them with double-quotes around their name, in which case they are case-sensitive. In MySQL, table names can be case-sensitive or not, depending on which operating system you are using.
PostgreSQL and MySQL seem to differ most in handling of dates, and the names of functions that handle dates.
PostgreSQL supports many standard parts of SQL that MySQL doesn’t, such as sub-queries, unions, intersections, etc.

If I were starting fresh a new web site/app today I’d go PostgreSQL over MySQL or NoSQL: Mongo/Cassandra (if scaling for big data).

Posted in data wrangling, relational databases | Tagged database, mysql, postgresql, relations databases, research, trends | Leave a comment

Big Data == teens talking about sex

Posted on August 31, 2016 by mx

Dan Ariely. Duke University Professor.

Big data may be sexy. But companies I’ve worked with or interviewed are still dealing with small-to-medium size data management problems. #smalldataproblems

Posted in big data, Uncategorized | Tagged big data, quote, visual | Leave a comment

OPEN SOURCE BIG DATA SOFTWARE AS HOW I CURRENTLY UNDERSTAND IT. part 2

Posted on August 30, 2016 by mx

https://github.com/mxento/Big-Data-Dictionary

I created a big data dictionary for myself and anyone else who gets confused by all the big data applications out there and just want a simple one or two sentence descriptions for reference.

Posted in big data | Leave a comment

The Report or Dashbard is taking forever to load

Posted on August 29, 2016 by mx

A general rule with BI reports and dashboards is 10 seconds or less for a report or 30 seconds for dashboards.

But quite often an analyst will run a report and it never comes back.
They’ll say something like so and so report is running slow or broken.

Here’s some ways to troubleshoot this.

1. You’re pulling too much data from the table and it’s not cached.
Try adding a view filter or report filter. Different BI apps have different terms.
Edit the report so it does a where statement against the data-source.
Otherwise, if there’s a cube/cache process for this set of reports then by all means cache it.

2. There’s a messed join somewhere. This is likely case with MicroStrategy. Managing any BI tool’s sql generation is always a hit or miss. Sometimes it the BI architect messed up the schema relationships. Sometimes the Looker explore wasn’t meant to cross 5 objects. Accidental many to many joins can be caused by the report’s sql generator. If it never comes back; this is a possible reason.

3. Someone wrote some bad custom sql job that ‘works’ but doesn’t really work well. You’re going to have to pull up your sleeves and dig into the details. I hope you’re not the unlucky sob to taking over from some previous consultant’s Cognos project that was out-sourced to India and being held together with bubble gum and duct tape.

Tips to fix SQL performance:
-never join on text/string/varchar fields between two tables, if you do index it.
-index join columns and try to keep join columns always as integers.
-remove complexity like multiple levels of sql subqueries and other craziness.
-Use optimized CTE’s which are just fast temp tables. (Sql Server only case).
-Never ever use cursors for reporting.

Posted in BI reporting, data wrangling | Tagged BI, Reporting, tips | Leave a comment

Some Google Trends Research of Big Data Vendors

Posted on August 27, 2016 by mx

https://ssl.gstatic.com/trends_nrtr/680_RC25/embed_loader.js trends.embed.renderExploreWidget(“TIMESERIES”, {“comparisonItem”:[{“keyword”:”Cloudera”,”geo”:””,”time”:”2008-07-27 2016-08-27″},{“keyword”:”mapR”,”geo”:””,”time”:”2008-07-27 2016-08-27″},{“keyword”:”hortonworks”,”geo”:””,”time”:”2008-07-27 2016-08-27″},{“keyword”:”hbase”,”geo”:””,”time”:”2008-07-27 2016-08-27″},{“keyword”:”bigtable”,”geo”:””,”time”:”2008-07-27 2016-08-27″}],”category”:0,”property”:””}, {});
Cloudera’s peaking EMR’s actually declining.

https://ssl.gstatic.com/trends_nrtr/680_RC25/embed_loader.js

trends.embed.renderExploreWidget(“TIMESERIES”, {“comparisonItem”:[{“keyword”:”/m/05ynw”,”geo”:””,”time”:”all”},{“keyword”:”/m/04y3k”,”geo”:””,”time”:”all”},{“keyword”:”/m/0120vr”,”geo”:””,”time”:”all”},{“keyword”:”/m/0120tv”,”geo”:””,”time”:”all”},{“keyword”:”/m/01vw9z”,”geo”:””,”time”:”all”}],”category”:0,”property”:””}, {});
The decline of the web 2.0 Relational Data Stores. Looks like RDS peaked at web 2.0 when google ipo’ed.
https://ssl.gstatic.com/trends_nrtr/680_RC25/embed_loader.js

https://ssl.gstatic.com/trends_nrtr/680_RC25/embed_loader.js trends.embed.renderExploreWidget(“TIMESERIES”, {“comparisonItem”:[{“keyword”:”/m/01ls32″,”geo”:””,”time”:”all”},{“keyword”:”/m/016jq3″,”geo”:””,”time”:”all”},{“keyword”:”/m/0bs2j8q”,”geo”:””,”time”:”all”},{“keyword”:”/m/018j6p”,”geo”:””,”time”:”all”},{“keyword”:”/m/0148v7″,”geo”:””,”time”:”all”}],”category”:0,”property”:””}, {});
Software Engineers always in demand. DBA’s and SysAdmins are in decline. BI remains steady. 2004-Present.

trends.embed.renderExploreWidget(“TIMESERIES”, {“comparisonItem”:[{“keyword”:”/m/0bs2j8q”,”geo”:””,”time”:”all”},{“keyword”:”/m/0fdjtq”,”geo”:””,”time”:”all”},{“keyword”:”/m/05nrgx”,”geo”:””,”time”:”all”},{“keyword”:”/m/01hyh_”,”geo”:””,”time”:”all”},{“keyword”:”/m/04y7lrx”,”geo”:””,”time”:”all”}],”category”:0,”property”:””}, {});
All the new tech on the uptick. Aws,Azure, machine learning, big data are all rising to new highs.

Future proof your tech career. Learn to work on cloud based systems with big data backends as software developer and have some expertise in machine learning.

Posted in data analysis | Tagged big data, hadoop vendors, research, trends | Leave a comment

Open Source big data software as how I currently understand it.

Posted on August 26, 2016 by mx

Since I’m moving away from relational databases and data warehousing (so 2008).
I’m self learning Big Data Architecture.

This Open Source big data software list as how I currently understand it.

Hadoop : hdfs is the basic distributed file store; like a database but spread about lots of linux boxes. Many things run on top of it. Best of which is Spark which is gaining popularity.
There’s variants of hadoop similar to different linux os distros: Apache Hadoop (vanilla), MapR, Hbase, Cloudera, MongoDB. I don’t know the differences too well.

MapReduce: language/job to take run queries against hadoop. It’s syntax is hard; basically script how to jump between boxes and split data set and reduce the problem from a DAG. Lot of people prefer abstraction layer to do it. Hive is Sql based , which translates to map-reduce jobs. MR jobs are generally slow and does in big batches.

Apache Spark: is an memory Haddoop App that does apparently everything. Spark is fast, probably why people like it. Spark Sql Spark Stream, everything’s growing.
It also has spark ML (machine learning) library.

Apache Kafka: a popular open source pub/sub data brokering service also used for streaming. It does real-time processing and log aggregation and monitoring.

Apache Zookeeper: a cluster management software, open sourced. To manage all the different nodes’s and linux servers.

Apache Mesos: The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with resource management and scheduling across entire datacenter and cloud environments. so basically == Yarn 2.0.

Apache Storm: Big time streaming processing of incoming big data. It does lambda function or what used to be transforms on datasets fast.

Apache Druid : In memory slice and dice cube that sits on top big data stores.

Kibana : to have data visualization and reporting from the distributed systems.

Airflow : Open source data engineering pipeline tool. Used to parallel etl data from nodes to target stores. AirBnb made it.

Apache Nifi : Log collection service, used to collect IOT or similar volumes of data.

Apache Hue: web app gui for interfacing with Hadoop data stores.

Apache Lucene Search:
Apache Solr : open source search through all the stuff.
Elastic Search : distributed searching of nosql json same as Solr mostly

NoSql key-value stores:

Cassandra : Most reliable and super scalable data store of key-values. Very hard to analytics on.
Redis : in memory data store , good for caching hot data like websites, not a long term data store.

One or more familarity with these languages needed to be hand one with Big Data:
Java,Scala,Python,Go,Kafka,Storm, SQL

My next post will cover:
Drill, Flink, Yarn, Pig, Mahout, Tez, parquet, Impala, Sqoop, Avro, Oozie, Bigtop, HCatalog, Luigi

Posted in big data | Tagged big data, hadoop tools | Leave a comment

SQL tip: To get first 10 Rows from a Table and profile the columns

Posted on August 25, 2016 by mx

For people working with database tables:
Most will want to check out the columns in the table and do a quick scan to get 10 rows to sample data in the table.

Here’s the SQL syntax for doing that with some of the biggest relational database vendors.

The first query gets 10 rows from a table called ‘Students’.
The second query shows all the columns in this table.
The third query is an alternative style (if exists).

MySql Syntax
SELECT * FROM Students LIMIT 10;


# Two ways to do in mysql

DESCRIBE Students;

SELECT * FROM information_schema.columns WHERE table_name = 'Students';

Postgres syntax
SELECT * FROM Students LIMIT 10;


# this one below only works in console

d+ Students

# or SELECT column_name, data_type FROM INFORMATION_SCHEMA.COLUMNS where table_name = 'Students';

MicroSoft Sql Server T-Sql syntax
SELECT TOP 10 * FROM Students GO

SELECT * FROM information_schema.columns WHERE table_name = 'Students' ORDER BY ordinal_position GO

Oracle PL/Sql syntax
SELECT * FROM Students WHERE ROWNUM <=10;

DESCRIBE Students;

Posted in data wrangling | Tagged database, sql, tip | Leave a comment