The Data Engineering 2021

I consider these tools the New Data Engineering the current standard :

Here’s a visual guideline for modern data engineer roadmap

credit to reddit /r/dataengineering

Posted in Uncategorized | Leave a comment

How Netflix does data in AWS

Image for post
Separate data : stuff in motion vs stuff at rest. Use bunch of latest open source and aws tools.

Posted in Uncategorized | Leave a comment

GCP for AWS professionals

Someone made a GCP lookup list for AWS cloud people

Service comparisons

The following table provides a side-by-side comparison of the various services available on AWS and Google Cloud.

Service Category Service AWS Google Cloud
Compute IaaS Amazon Elastic Compute Cloud Compute Engine
PaaS AWS Elastic Beanstalk App Engine
FaaS AWS Lambda Cloud Functions
Containers CaaS Amazon Elastic Kubernetes Service, Amazon Elastic Container Service Google Kubernetes Engine
Containers without infrastructure AWS Fargate Cloud Run
Container registry Amazon Elastic Container Registry Container Registry
Networking Virtual networks Amazon Virtual Private Cloud Virtual Private Cloud
Load balancer Elastic Load Balancer Cloud Load Balancing
Dedicated interconnect AWS Direct Connect Cloud Interconnect
Domains and DNS Amazon Route 53 Google Domains, Cloud DNS
CDN Amazon CloudFront Cloud CDN
DDoS firewall AWS Shield, AWS WAF Google Cloud Armor
Storage Object storage Amazon Simple Storage Service Cloud Storage
Block storage Amazon Elastic Block Store Persistent Disk
Reduced-availability storage Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access Cloud Storage Nearline and Cloud Storage Coldline
Archival storage Amazon Glacier Cloud Storage Archive
File storage Amazon Elastic File System Filestore
In-memory data store Amazon ElastiCache for Redis Memorystore
Database RDBMS Amazon Relational Database Service, Amazon Aurora Cloud SQLCloud Spanner
NoSQL: Key-value Amazon DynamoDB FirestoreCloud Bigtable
NoSQL: Indexed Amazon SimpleDB Firestore
In-memory data store Amazon ElastiCache for Redis Memorystore
Data analytics Data warehouse Amazon Redshift BigQuery
Query service Amazon Athena BigQuery
Messaging Amazon Simple Notification Service, Amazon Simple Queueing Service Pub/Sub
Batch data processing Amazon Elastic MapReduce, AWS Batch DataprocDataflow
Stream data processing Amazon Kinesis Dataflow
Stream data ingest Amazon Kinesis Pub/Sub
Workflow orchestration Amazon Data Pipeline, AWS Glue Cloud Composer
Management tools Deployment AWS CloudFormation Cloud Deployment Manager
Cost management AWS Budgets Cost Management
Operations Monitoring Amazon CloudWatch Cloud Monitoring
Logging Amazon CloudWatch Logs Cloud Logging
Audit logging AWS CloudTrails Cloud Audit Logs
Debugging AWS X-Ray Cloud Debugger
Performance tracing AWS X-Ray Cloud Trace
Security & identity IAM Amazon Identity and Access Management Cloud Identity and Access Management
Secret management AWS Secrets Manager Secret Manager
Encrypted keys AWS Key Management Service Cloud Key Management Service
Resource monitoring AWS Config Cloud Asset Inventory
Vulnerability scanning Amazon Inspector Web Security Scanner
Threat detection Amazon GuardDuty Event Threat Detection (beta)
Microsoft Active Directory AWS Directory Service Managed Service for Microsoft Active Directory
Machine learning Speech Amazon Transcribe Speech-to-Text
Vision Amazon Rekognition Cloud Vision
Natural Language Processing Amazon Comprehend Cloud Natural Language API
Translation Amazon Translate Cloud Translation
Conversational interface Amazon Lex Dialogflow Enterprise Edition
Video intelligence Amazon Rekognition Video Video Intelligence API
Auto-generated models Amazon SageMaker Autopilot AutoML
Fully managed ML Amazon SageMaker AI Platform
Internet of Things IoT services Amazon IoT Cloud IoT
Posted in Uncategorized | Leave a comment

Modern Data Engineering is Complicated

Modern Data Engineering is Complicated.
There are so many things to know to be good.
Languages : SQL , Python, Scala
Operating Systems : Linux, bash shell
Cloud : AWS, Azure, GCP
Data Pipelines : Airflow, Kubeflow
DevOps : Kubernetes, Docker, VPCs, IAM etc.
Relational Database : PostGres, Mysql, Sql Server
MPP Databases : Redshift, Google Big Query, Snowflake
Big Data Storage : S3, HDFS, Google storage
Data Architecture : Machine Learning models, data warehouse models
Streaming Data : Kafka, Kinesis Firebase, Flink, Storm
No SQL: Dynodb , Mongo
Business Intelligence related : Tableau, Looker, Sisense
Data Lakes : S3 or hdfs or databricks.


Posted in Uncategorized | Leave a comment

My other hobby is Stocks

My other hobby is investing and picking stocks.
Here’s my other blog:

Posted in Uncategorized | Leave a comment

How do I unlock blocking queries in Amazon Redshift?

I noticed my dimension table was stuck when I query it for etl job and the truncate statement fails to return.

“Locking is a protection mechanism that controls how many sessions can access a table at the same time and which operations can be performed in those sessions. Most relational databases use row-level locks. Amazon Redshift uses table-level locks. You might experience locking conflicts if you perform frequent DDL operations on user tables, such as ALTER, DROP or TRUNCATE, or DML changes such as UPDATE, DELETE, or INSERT. ”

I use this handy check what’s the locking blocked pids with this code:

use the blocking pid column from this output to find the jobs to kill.

select a.txn_owner, a.txn_db, a.xid,, a.txn_start, a.lock_mode, a.relation as table_id,nvl(trim(c."name"),d.relname) as tablename, a.granted, as blocking_pid ,datediff(s,a.txn_start,getdate())/86400||' days '||datediff(s,a.txn_start,getdate())%86400/3600||' hrs '||datediff(s,a.txn_start,getdate())%3600/60||' mins '||datediff(s,a.txn_start,getdate())%60||' secs' as txn_duration
from svv_transactions a 
left join (select pid,relation,granted from pg_locks group by 1,2,3) b 
on a.relation=b.relation and a.granted='f' and b.granted='t' 
left join (select * from stv_tbl_perm where slice=0) c 
left join pg_class d on a.relation=d.oid
where  a.relation is not null;
select pg_terminate_backend(blocking_PID);

Remember amazon redshift locks the whole table. 


Posted in etl | Leave a comment

Creating a Tableau Report level View Filter (1 report varying view depending on tableau user logged in)

Creating a Report level View Filter in Tableau (1 report varying view depending on tableau user logged in)

use case:

You want a report to show an employee’s daily sales and for the employee to only see his data and manager to see all his employees’ data.

1. create mapping table with user_manager along with the daily sales data

table 1: user_manager

user_name, manager_name,
mike_s, boss_a
mike_s, mike_s

table 2: user_sales

date, sales_amt, user_name

2018/03/04, $20, mike_s
2018/03/03, $20, mike_s

join two objects together in tableau or before it.

select a.user_name, a.manager_name,, b._sales_amt from user_manager a join user_sales b on a.user_name= b.user_name

now we have:   user_name, manager_name, date, sales_amt

2. create tableau to import the entire object of two tables.

3. create calculated field called permission similar. *most important*
with formula username() = manager_name

4. add and apply permissions filter to tableau workbook

source of this example:

Posted in BI reporting, Business Intelligence, tableau | Tagged , | Leave a comment

Deep Learning versus Machine Learning in One Picture


Found on:  and

Sometimes a simple regression model will do.

Posted in data science, machine learning | Tagged | Leave a comment

mysql tricks: do instant table swap to mitigate mysql deadlock error.

A deadlock in MySQL happens when two or more transactions mutually hold and request for locks, creating a cycle of dependencies. … InnoDB automatically detects transaction deadlocks, rollbacks a transaction immediately and returns an error. It uses a metric to pick the easiest transaction to rollback. MyISAM engine locks the entire table from read/write requests.

For example you have a near realtime table that you need to update via etl every 10 min , and that same table is used by your clients to read the current status of say batch processes status.

Avoid the mysql deadlock error by re-loading table in near-realtime. First to check if your tables are locked run as root or db admin :

show processlist ;

How do you write to a table while someone’s reading it without deadlock? Simple: load a staging table of the same table structure and do a rename swap in mysql.

Here’s the steps in list format

  1. create a table with the same structure of the original one.
  2. have etl load that table instead.
  3. do some data validity checks.
  4. do rename table swap code:  a->c, b->a, c->b  (a = prod, b = stage, c=temp)

example mysql code:

  1. create table current_status_stage like  current_status_stage;

  2. — load table from source system via mysql command line or otherwise. ex csv

  3. LOAD DATA INFILE "/tmp/current_status.csv"
    INTO TABLE current_status_stage
    ESCAPED BY '"'
  4. RENAME TABLE reporting.current_status TO reporting.current_status_old, reporting.current_status_stage To reporting.current_status;RENAME TABLE reporting.current_status_old TO reporting.current_status_stage

Continue reading

Posted in etl, relational databases, Uncategorized | Tagged , , , , | Leave a comment

Improve Amazon Redshift table performance fast & easy.

Assuming Redshift tables are un-compressed because most people don’t do it.

  1. list all the big tables by size, here’s a script
  2. run analyze compression
    analyze compression public.report_table_name;

    Example results:

    table column encoding est_reduction_pct
    report_table_name rowid raw 0
    report_table_name create_date zstd 9.49
    report_table_name create_day delta 38.72
    report_table_name create_date_key zstd 28.98
    report_table_name type zstd 67.59
    report_table_name type_desc zstd 58.25
    report_table_name created_by zstd 35.17
  3. recreate table with new compression encoding and see performance gains.

DROP TABLE IF EXISTS public.report_table_name cascade;

CREATE TABLE public.report_table_name
( rowid INT IDENTITY(1,1),
create_date timestamp encode zstd,
create_day date encode zstd,
type integer encode zstd,
type_desc varchar(28) encode zstd,
created_by varchar(30) encode zstd,
data_refresh_date date encode raw
) ;

Amazon manual regarding compression command:

Posted in big data | Tagged , | Leave a comment