MySQL tricks: do an instant table swap to mitigate MySQL deadlock errors.

A deadlock in MySQL happens when two or more transactions each hold locks and request locks held by the other, creating a cycle of dependencies. … InnoDB automatically detects transaction deadlocks, immediately rolls back one of the transactions and returns an error. It uses a metric to pick the easiest transaction to roll back. The MyISAM engine instead locks the entire table against read/write requests.

For example, you have a near-realtime table that you need to update via ETL every 10 minutes, and that same table is used by your clients to read the current status of, say, batch processes.

Avoid the MySQL deadlock error by re-loading the table in near realtime. First, to check whether your tables are locked, run as root or a DB admin:

show processlist;
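
As a hedged aside (assuming the InnoDB engine on MySQL 5.7, where the information_schema lock view below still exists; MySQL 8.0 moved this information to performance_schema), you can also inspect blocking directly:

    -- prints the latest detected deadlock and current lock activity
    SHOW ENGINE INNODB STATUS;

    -- shows which transaction is waiting on which lock (MySQL 5.7 and earlier)
    SELECT * FROM information_schema.innodb_lock_waits;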

How do you write to a table while someone is reading it, without a deadlock? Simple: load a staging table with the same structure and do a rename swap in MySQL.

Here are the steps:

  1. Create a table with the same structure as the original one.
  2. Have the ETL load that staging table instead.
  3. Do some data validity checks (a sketch is included after the example code below).
  4. Do the rename-table swap: a -> c, b -> a, c -> b (a = prod, b = stage, c = temp).

Example MySQL code:

  1. create table current_status_stage like current_status;

  2. -- load the table from the source system via the mysql command line or otherwise, e.g. from a CSV file:

  3. LOAD DATA INFILE "/tmp/current_status.csv"
    INTO TABLE current_status_stage
    COLUMNS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    ESCAPED BY '"'
    LINES TERMINATED BY '\n';
  4. RENAME TABLE reporting.current_status TO reporting.current_status_old,
                  reporting.current_status_stage TO reporting.current_status;
     RENAME TABLE reporting.current_status_old TO reporting.current_status_stage;
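
A minimal sketch of the validity checks from step 3, assuming the staging table should never load empty and that prod and stage row counts should stay in the same ballpark; the checks themselves are illustrative, not from the original post:

    -- sanity check before the swap: the stage table must not be empty
    SELECT COUNT(*) FROM reporting.current_status_stage;

    -- compare stage vs. prod row counts side by side;
    -- only run the RENAME swap if the difference looks reasonable
    SELECT
      (SELECT COUNT(*) FROM reporting.current_status_stage) AS stage_rows,
      (SELECT COUNT(*) FROM reporting.current_status)       AS prod_rows;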


Improve Amazon Redshift table performance fast & easy.

Assuming your Redshift tables are uncompressed, because most people don't set column encodings.

  1. list all the big tables by size (an example query is sketched after the CREATE TABLE statement below).
  2. run analyze compression
    analyze compression public.report_table_name;

    Example results:

    table              column           encoding  est_reduction_pct
    report_table_name  rowid            raw        0
    report_table_name  create_date      zstd       9.49
    report_table_name  create_day       delta     38.72
    report_table_name  create_date_key  zstd      28.98
    report_table_name  type             zstd      67.59
    report_table_name  type_desc        zstd      58.25
    report_table_name  created_by       zstd      35.17
  3. recreate the table with the new compression encodings, reload it, and see the performance gains.

DROP TABLE IF EXISTS public.report_table_name cascade;

CREATE TABLE public.report_table_name
( rowid INT IDENTITY(1,1),
create_date timestamp encode zstd,
create_day date encode zstd,
type integer encode zstd,
type_desc varchar(28) encode zstd,
created_by varchar(30) encode zstd,
data_refresh_date date encode raw
) ;
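
For step 1 above, a minimal sketch of the "big tables by size" query; it assumes access to the SVV_TABLE_INFO system view, where size is reported in 1 MB blocks:

    -- biggest tables first; "size" is the number of 1 MB blocks used
    SELECT "schema", "table", size, tbl_rows
    FROM svv_table_info
    ORDER BY size DESC
    LIMIT 20;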

Amazon's manual for the ANALYZE COMPRESSION command:

http://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE_COMPRESSION.html


SlideShare is full of useful knowledge

Best practices for PySpark programming

A history of SQL and the advanced features the big vendors have added over the last 30 years.


What I think of every time I hear Stakeholders

 

credits to: http://darrenwalsh.co.uk/drawings/major-steakholder/

 


Re-Blog: 10 Risks that Beset Data Programmes

Credit to Peter James Thomas:

https://www.linkedin.com/pulse/10-risks-beset-data-programmes-peter-james-thomas

  1. Not establishing a dedicated team. The team never escapes from “the day job” or legacy / BAU issues; the past prevents the future from being built.
  2. Staff lack skills and prior experience of data programmes. Sub-optimal functionality, slippages, later performance problems, higher ongoing support costs. Time is also wasted educating people rather than getting on with work.
  3. Not establishing an appropriate management / governance structure. The programme is not aligned with business needs, is not able to get necessary time with business users and cannot negotiate the inevitable obstacles that block its way.
  4. Poor programme management. The programme loses direction. Time is expended on non-core issues. Milestones are missed. Expenditure escalates beyond budget.
  5. Big Bang approach. Too much time goes by without any value being created. The eventual Big Bang is instead a damp squib. Large sums of money are spent without any benefits.
  6. Lack of focus on interim deliverables. Business units become frustrated and seek alternative ways to meet their pressing needs. This leads to greater fragmentation and reputational damage to the programme.
  7. Insufficient time spent understanding source system data and how data is transformed as it flows between systems. This leads to data capabilities that do not reflect business transactions with fidelity. There is inconsistency with reports directly drawn from source systems. Reconciliation issues arise.
  8. Not enough up-front focus on understanding key business decisions and the information necessary to take them. Analytic capabilities do not focus on what people want or need, leading to poor adoption and benefits not being achieved.
  9. Lack of leverage of new data capabilities in front-end / digital systems. These systems are less effective. The data team is jealous about its own approved capabilities being the only way that users should get information, rather than adopting a more pragmatic and value-added approach.
  10. Education is an afterthought, training is technology- rather than business-focused. People neither understand the capabilities of new analytical tools, nor how to use them to derive business value. Again this leads to poor adoption and little return on investment.

 

Most companies don’t realize the effort involved in Business Intelligence. Often they just throw semi-technical business analysts at a problem instead of committing dedicated technical resources to Data Management and Analytics Reporting.


The reality of a data worker.

[Image: data_worker]

Taken from a Dataiku meetup slide.  This picture hit close to home.


Things to note when migrating web hosts

  • Back up everything! All the files on the host. HostMonster, my previous host, had this thing about raising the monthly fee by $1 at every yearly renewal.
  • In the domain manager, set up a 301 redirect (domain forwarding) to the new host ASAP.
  • Find cheap (or free) hosting; GitHub Pages is a nice option for a mostly static website.
  • Don't forget to update Gmail's MX records for email forwarding, or your email breaks.
  • I switched to wordpress.com as my blog host instead of running my own LAMP WordPress instance. Less work. Export the XML for all the posts.
  • That's all; a little harder than Squarespace but still smarter than Myspace.
  • Cons: I'm missing the Google Analytics tracking and SEO extensions I had from running my own WordPress install.