SlideShare’s full of useful knowledge

Best practices for PySpark programming

A history of SQL and the advanced features the big vendors have added over the last 30 years.


What I think of every time I hear Stakeholders

 

Credits to: http://darrenwalsh.co.uk/drawings/major-steakholder/

 


Re-Blog: 10 Risks that Beset Data Programmes

Credits to Peter James Thomas:

https://www.linkedin.com/pulse/10-risks-beset-data-programmes-peter-james-thomas

  1. Not establishing a dedicated team. The team never escapes from “the day job” or legacy / BAU issues; the past prevents the future from being built.
  2. Staff lack skills and prior experience of data programmes. Sub-optimal functionality, slippages, later performance problems, higher ongoing support costs. Time is also wasted educating people rather than getting on with work.
  3. Not establishing an appropriate management / governance structure. The programme is not aligned with business needs, is not able to get necessary time with business users and cannot negotiate the inevitable obstacles that block its way.
  4. Poor programme management. The programme loses direction. Time is expended on non-core issues. Milestones are missed. Expenditure escalates beyond budget.
  5. Big Bang approach. Too much time goes by without any value being created. The eventual Big Bang is instead a damp squib. Large sums of money are spent without any benefits.
  6. Lack of focus on interim deliverables. Business units become frustrated and seek alternative ways to meet their pressing needs. This leads to greater fragmentation and reputational damage to the programme.
  7. Insufficient time spent understanding source system data and how data is transformed as it flows between systems. This leads to data capabilities that do not reflect business transactions with fidelity. There is inconsistency with reports directly drawn from source systems. Reconciliation issues arise.
  8. Not enough up-front focus on understanding key business decisions and the information necessary to take them. Analytic capabilities do not focus on what people want or need, leading to poor adoption and benefits not being achieved.
  9. Lack of leverage of new data capabilities in front-end / digital systems. These systems are less effective. The data team is jealous about its own approved capabilities being the only way that users should get information, rather than adopting a more pragmatic and value-added approach.
  10. Education is an afterthought, training is technology- rather than business-focused. People neither understand the capabilities of new analytical tools, nor how to use them to derive business value. Again this leads to poor adoption and little return on investment.

 

Most companies don’t realize the effort involved in Business Intelligence. Often they just throw semi-technical business analysts at a problem instead of committing dedicated technical resources to Data Management and Analytics Reporting.


The reality of a data worker.

[Image: data_worker]

Taken from a Dataiku meetup slide. This picture hit close to home.


Things to note when migrating web hosts

  • Back up everything! All the files on the host. HostMonster, my previous host, had this thing of raising the monthly fee by $1 at every yearly renewal.
  • In the domain manager, set a 301 DNS redirect to the new host ASAP.
  • Find cheap (or free) hosting. GitHub Pages is a nice option for a mostly static website.
  • Don’t forget to change Gmail’s MX servers for email redirects, or your email breaks.
  • I switched to wordpress.com as my blog host instead of running my own LAMP WordPress instance. Less work. Export the XML for all the posts.
  • That’s all; a little harder than Squarespace but still smarter than Myspace.
  • Cons: I’m missing the many Google Analytics tracking and SEO extensions I had from running my own WordPress.

New Year, New Site

I switched to wordpress.com as my host. I will most likely switch to AWS later.


Amazon Redshift’s Unsupported Features of PostgreSQL

Redshift is based on a branch of PostgreSQL 8.0.2 (PostgreSQL 8.0.2 was released in 2005).

Here’s all the unsupported fancy PostgreSQL stuff, taken directly from Amazon’s manual.

The big ones are: no stored procedures, no constraint enforcement, no triggers, no table functions, and no upserts.

However, don’t ever forget that an Amazon Redshift query like:

SELECT COUNT(DISTINCT column_name) FROM table;

can run 200x faster than PostgreSQL on a billion-row table.

Unsupported PostgreSQL Features

These PostgreSQL features are not supported in Amazon Redshift.

Important

Do not assume that the semantics of elements that Amazon Redshift and PostgreSQL have in common are identical. Make sure to consult the Amazon Redshift Developer Guide SQL Commands to understand the often subtle differences.

  • Only the 8.x version of the PostgreSQL query tool psql is supported.
  • Table partitioning (range and list partitioning)
  • Tablespaces
  • Constraints
    • Unique
    • Foreign key
    • Primary key
    • Check constraints
    • Exclusion constraints

    Unique, primary key, and foreign key constraints are permitted, but they are informational only. They are not enforced by the system, but they are used by the query planner. (See the example after this list.)

  • Inheritance
  • Postgres system columns. Amazon Redshift SQL does not implicitly define system columns. However, the PostgreSQL system column names cannot be used as names of user-defined columns. See http://www.postgresql.org/docs/8.0/static/ddl-system-columns.html
  • Indexes
  • NULLS clause in Window functions
  • Collations. Amazon Redshift does not support locale-specific or user-defined collation sequences. See Collation Sequences.
  • Value expressions
    • Subscripted expressions
    • Array constructors
    • Row constructors
  • Stored procedures
  • Triggers
  • Management of External Data (SQL/MED)
  • Table functions
  • VALUES list used as constant tables
  • Recursive common table expressions
  • Sequences
  • Full text search
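
A minimal sketch of the “informational only” constraint behavior mentioned above, using a hypothetical table and column names. Redshift accepts the declaration and the planner uses it, but nothing stops the duplicate insert:

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,  -- declared, but never enforced
    order_date DATE
);

INSERT INTO orders VALUES (1, '2017-01-01');
INSERT INTO orders VALUES (1, '2017-01-01');  -- succeeds; the duplicate key is not rejected

SELECT order_id, COUNT(*) FROM orders GROUP BY order_id;  -- shows the duplicate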

Best Practices for Micro-Batch Loading on Amazon Redshift

Best Practices for Micro-Batch Loading on Amazon Redshift, an article from the AWS blog.

I work with Redshift every day now at Amazon. It’s a very useful big data warehouse tool.
Here’s a blog post about loading data into it. The workflow is very S3-dependent, with heavy use of the COPY command.

Some quick notes:
-It’s faster to drop and reload big tables via staging areas.
-Split input files into pieces and load them in parallel.
-Use the COPY option ‘STATUPDATE OFF’.
-Avoid VACUUMing tables when possible.

You could just read the main points in the how-to guide.

Here’s the quick and easy version; do the following in a single transaction (sketched in SQL below):
1. Create a staging table “tablename_staging” like the main table
2. Copy data from S3 into the staging table
3. Delete rows in the main table that are already present in the staging table
4. Copy all rows from the staging table to the main table
5. Drop the staging table
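
A minimal SQL sketch of those five steps. The table and key names (sales, id), the S3 path, and the IAM role are hypothetical placeholders:

BEGIN;

-- 1. Staging table shaped like the main table
CREATE TEMP TABLE sales_staging (LIKE sales);

-- 2. Load from S3; a key prefix picks up all the split files in parallel
COPY sales_staging
FROM 's3://mybucket/sales/part'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
STATUPDATE OFF;

-- 3. Remove rows that the staging data will replace
DELETE FROM sales
USING sales_staging
WHERE sales.id = sales_staging.id;

-- 4. Merge the fresh rows in
INSERT INTO sales
SELECT * FROM sales_staging;

-- 5. Clean up (temp tables also vanish at session end)
DROP TABLE sales_staging;

COMMIT;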


Amazon Redshift is an amazing database product

Redshift is:
Fast like a Ferrari
Cheap like a Ford Fiesta
Useful like a minivan
Self-driving auto-magic like a Tesla with Autopilot

Key features:

Really fancy features under the hood (see the DDL sketch after this list):
-Interleaved sort keys
-Columnar distributed storage
-Smart parallel execution
-I/O optimization (returns results fast)
-Easy to add nodes; scalable
-Shared-nothing architecture
-Less need for DBAs; monitor and manage it in the AWS console
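
A minimal sketch of how the distribution and sort features surface in the DDL, with a hypothetical table; the column choices are illustrative, not a tuning recommendation:

CREATE TABLE page_views (
    user_id   BIGINT,
    page_id   BIGINT,
    view_date DATE
)
DISTSTYLE KEY
DISTKEY (user_id)                          -- co-locate each user's rows on one slice
INTERLEAVED SORTKEY (view_date, page_id);  -- equal weight to both filter columns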

Caveats:
-Amazon cloud only
-Heavily dependent on S3
-Only useful for very, very large datasets
-A limited number of concurrent queries


Review of Two New Cloud BI Tools: Snowflake and Looker

Snowflake: data warehouse in the cloud (specifically Amazon)

Snowflake is basically a scalable analytics database. Data is stored / shared in AWS S3 buckets rather than inside Snowflake itself. You spin up Snowflake, ingest and load data into its proprietary parallel columnar SQL engine for analytics processing, and afterwards you run SQL against it. The syntax is mostly PL/SQL- and/or T-SQL-like. In the world of Hadoop, Snowflake is a SQL-based data warehouse you run on demand.
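
A minimal sketch of that ingest-then-query flow in Snowflake SQL; the table name, stage name, and JSON fields are hypothetical:

-- VARIANT columns hold semi-structured data such as JSON
CREATE TABLE events_raw (v VARIANT);

-- Load JSON files from an S3-backed stage
COPY INTO events_raw
FROM @my_s3_stage/events/
FILE_FORMAT = (TYPE = 'JSON');

-- Query JSON fields with path notation and casts
SELECT v:user_id::STRING     AS user_id,
       v:event_time::TIMESTAMP AS event_time
FROM events_raw;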

Pros:

  • Magically scale compute nodes up/down with a few clicks. No need for database tuning.
  • Ingest JSON data directly from flat files / text files.
  • Pay-only-for-what-you-use platform.
  • SQL-based platform, so there’s no need to learn another language.

Cons:

  • Vendor lock-in to the Amazon cloud.
  • All your data needs to be in S3 and cleaned into a semi-structured form before Snowflake can read and load it.
  • The SQL dialect is not completely up to date with the latest standard (missing some window functions, etc.).

Looker: cloud-based Business Intelligence reporting and dashboards.

Looker’s a new BI tool. It does reports and dashboards, and it allows data exploration. What makes it different is that it’s cloud hosted and uses developer-style frameworks (LookML, a YAML-like language; Git version control; release pushes). It’s very fast to build up in the hands of a seasoned data engineer / data architect. It also simplifies a lot of common data warehouse tasks (auto-generated time dimension lookups, rolling totals, data manipulation), and it has connectors to most data sources via JDBC.

Pros:

  • Easy to set up quickly and get baseline reports working.
  • Cloud based, so you can just point it at your database and host with them, or set up an on-premise instance yourself.
  • Git version control and rollback, something most BI tools lack.
  • Relatively cheap to embed into existing applications.

Cons:

  • Still quite new and doesn’t have all the mature BI features built out.
  • Visualizations are still simple grids, bars, lines, pies, etc.
  • Requires a fairly technical person to set up LookML schemas before business analysts can self-serve and explore data.
  • Need to know SQL well to troubleshoot results.

What Looker and Snowflake have in common: they’re both AWS cloud based, easy to get set up fast, and use SQL as the lingua franca.
