Notes from a Data Warehousing workshop with PragmaticWorks hosted by Microsoft in Reston, VA

The morning started with a quirk: I had decided to ask the PragmaticWorks presenter the favor of inquiring if any of the attendees could give me a ride to the metro. There was an awkward moment when he said “I cannot do that, sorry” but when I asked the audience a nice pair of techies had mercy on me and my carlessness. To be fair, no one should expect a presenter to be a cab driver as well, of course; the gentleman was clearly knowledgeable making the whole seminar was very informative.

As I was sitting grateful for the assurance of a ride, a thought crossed my mind that I might win the random attendee reward, a Microsoft Surface 2 tablet. Who knew that not only this would happen but that my ride to and on the metro would turn into a fascinating conversation with a hyper intelligent and equally attractive geek with beautiful green eyes.

But let’s get back to the topic of…

What exactly is modern data warehousingaccording to Microsoft?

Not surprisingly, the presenter started with a sample of the old / new questions businesses are asking:

  • What’s the social sentiment for my brand or products? – Social and web analytics
  • How do I optimize my fleet based on weather and traffic patterns? – Live data feeds
  • How do I better predict future outcomes? – Advanced analytics

Referring to the #iatethebones social campaign during which KFC pulled data through Hadoop and stayed ahead of the million of submissions by using the social media response to predict their sales.

An interesting, and unexpected illustration of the use of predictive analytics to identify top items purchased before hurricane hits was shared. Whether it is an illustration of the power of big data or of the unpredictable nature of humans, you be the judge, but the top of the list is surely surprising:

  1. Strawberry Pop Tarts
  2. Bottled water
  3. Bleach, mops and other cleaning supplies
  4. Flashlights / candles
  5. First-aid supplies
  6. Generators
  7. Batteries
  8. Ice

Then we moved to the claim that 75% of CIO desire to worry less about systems and more about innovation. I can only imagine the question being asked: “What would you like to worry about more?…” But let’s continue

Microsoft’s Data Warehousing Proposition

Microsoft aims to make SQL Server the fastest and most affordable database for customers of all sizes. The goal is to offer a simplified data warehouse management through:

  • Massive scalability at a low cost
  • Flexibility and Choice
  • Complete Data Warehousing Solution

Introducing Parallel Data Warehousing

  • Pre-build hardware + software appliance
  • Co-engineered with HP and Dell
  • Pre-built hardware
  • Appliance installed in 1-2 days
  • Microsoft provides full call support
  • Hardware partner provides onsite break/fix support

Microsoft Data Warehousing Solutions

Of course, at the center of anything Microsoft offers is SQL Server 2012 and its enterprise stack of data management technologies (SQL Server, SSRS, SSIS, SSAS):

  • Scalable / reliable SMP platform for data warehousing or any hardware
  • Ideal for data marts or small to mid-side

MS-BI-Architecture

Where things got interesting is when the talk turned to the notion of a data warehousing appliance built upon Microsoft technology as a relatively affordable implementation of a parallel data warehouse.

Why Parallel Data Warehouse (PDW):

  • Scalable
  • Reliable
  • Fast

Symmetric Multi-Processing (SMP)

  • Multiple CPUs used to complete individual processes simultaneously
  • All CPUs share the same memory, disks, and network controllers
  • All SQL Server implementations up until now have been SMP

Massively Parallel Processing (MPP)

  • Uses many separate CPUs running in parallel to execute a single program
  • Each CPU has its own memory
  • Applications must be segmented, using high-speed communications between nodes

What intelligence driven organizations are focusing on:

  • Retail services: what customers do we want to attract and how do we maximize our profitability?
  • Healthcare: how do we better analyze census, clinical and provider data together for an integrated view of our business?
  • Data services: how do we combine data from new sources faster and provide a platform for growth to our clients?
  • Major credit card firm: how does Hadoop better improve our ability to provide analysis to our analysis?

Seamlessly add capacity:

  • Smallest (53 TB) to largest (6 PB)
  • Start small linear, scale out
  • Add capacity up to 6 petabytes.

Designed for parallel processing

Data loads in parallel into multiple instances of SQL Server.

Next-Gen performance

  • xVelocity – Fast data query processing
  • Columnstore provides dramatic performance
  • Updatable and clustered xVelocity columnstore
  • Stores data in columnar format
  • Memory-optimized for next-generation performance
  • Updatable to support bulk and / or trickle loading
  • Up to 50X faster
  • Up to 15X compression
  • Save time and costs
  • Real-time DW

Design goals

  • Load data efficiently and non-obtrusively, respecting concurrent queries and loads
  • Reduce table fragmentation
  • Provide system recovery capabilities
  • Providing multiple data loading options

Data loading options

  • DWLoader Utility
  • SQL Server Integration Services (SSIS)
  • Create table as select (CTAS)
  • Standard SQL DML Statements (Insert / Select)

Hadoop ecosystem with its capability of handling vast unstructured data sets, creates a challenge of integrating the noSQL data into a SQL Server-based data warehousing solution. Microsoft’s answer is a very intriguing and is called Polybase but is, sadly, not available yet.

What is Polybase and what are its goals?

  • Seamless integration with Hadoop via regular T-SQL
  • Unifies relational  and non-relational data
  • Enables in-place queries and familiar BI tools
  • Part of an overall Microsoft Bid Data Story

Polybase Details

  • External tables and full SQL query access to data stored in HDFS
  • Imposing “structure” on the unstructured data in HDFS
  • Joining ‘on-the-fly’ PDW data with data from HDFS

The conclusion

Having attended a number of conferences (TDWI, Strata) and seminars, I take a very pragmatic approach to the choice of IT architecture, familiar as I am with the technical preferences and overall reality of my work place. I tend to gravitate towards Microsoft’s BI solution because so far it has proven very successful and relatively affordable answer to my team’s challenges.

Thus I was delighted to hear the fellow who would give me a ride, Jonathan, ask directly the most relevant question: on the approximate cost of the PDW solution. The answer — more of a informed guess was: a quarter rack would apparently priced around $250,000 not including discounts and excluding the data loading services.Let’s just say that while this is not pocket money, it is not Exadata either.

Having a previous Microsoft agreement would be helpful, of course. The maintenance agreement offered has two levels, and two providers — with Microsoft for the software, and with HP or Dell, correspondingly, on the appliance (the Dell solution is packaged and architected a bit differently than the one from HP but is available nevertheless.)

All I can say so far is that to spend this type of money on a Parallel Data Warehousing appliance at the non-profit where I manage the Web & Data Services team, I would need executive support. When we get there though, it would be a move in the right direction — architecturally and technologically — for my team!

In the mean time, I will try not to be tempted too much by the Surface 2, and will enjoy this gorgeous first day of spring! For this beautiful day, thank you, PragmaticWorks and Ramsey Informatics!

Leave a Reply

Your email address will not be published. Required fields are marked *