OTOlabs

OTOlabs // otolabs-posts

Posts

An update of my SQL Server “Bad Habits” series

Tuesday, March 9th, 2010

badhabit

Last October, I started a series of blog posts over on sqlblog.com that highlighted some bad practices I observe from time to time.  Earlier this year, I felt motivated to pick the series up again.
(more…)

Upcoming Webinar: Customer Experience Among the Top 8 Travel Sites

Wednesday, February 24th, 2010

(more…)

Injection is not always about SQL

Sunday, February 7th, 2010

I think anybody even remotely involved with databases these days has seen the xkcd comic about Bobby Tables : http://xkcd.com/327/ 

Basically, the comic warns against SQL injection, and reminds you to sanitize your database inputs rather than blindly append incoming data to queries.  There are more elaborate discussions about this concept in these articles on MSDN, and of course your favorite search engine will have many results as well:

    Books Online : SQL Injection

    MSDN Mag : Stop SQL Injection Attacks Before They Stop You

It is important to remember, though, that not all injection attacks are intended to screw up your database or gain unauthorized access to your data.  Some vectors are a little more subtle, and while the damage isn’t always as severe, it is still something you need to be aware of to prevent fraudulent activity or even just to keep egg off your face.  I’ll illustrate with two examples:

 
Letting customers dictate your pricing

I remember several years ago, I demonstrated to an online shop why they shouldn’t add items to a shopping cart by passing (among other things) price information in the query string.  I showed them how I was able to buy a bunch of T-shirts at $0.01 each, and encouraged them to verify whether my cart would have gone through like that.  As it turned out, it would have.  For pointing out this fundamental flaw, they added a free t-shirt to my order, and eliminated the shipping charge.  I am curious how quickly they would have caught this - or if some folks even got away with it unnoticed - had I not pointed it out. I think it would be really hard to trace back and prove the buyer to be at fault.

 
Letting customers change your messaging

This morning, fellow MVP Simon Sabin (@simon_sabin) pointed out that Toyota Ireland had a similar flaw in a site used for their recent recall issues.  Basically you could change their message to you .  Now arguably this doesn’t really harm anyone (because you are only able to change your own messaging), except it was fun to get a screen shot of something like this before they fixed the exploit (remarkably quickly, I might add):



 click to embiggen


Summary

So just as a reminder, when you’re accepting input from users, validate and sanitize!  And when you’re accepting input from your own site, make sure it really comes from your site - there is no reason to expose contextual information in the query string, as this just invites users to tinker.  Unless you have a really good reason to use QueryString parameters, protect your data and variables by passing them via post or session.

Mixing OLTP and reporting using indexed views

Tuesday, December 29th, 2009

I’m not a big fan of denormalizing, nor of repeating redundant redundant information in a database when those facts can already be derived from other information.  A classic example of the latter is when I see questions on newsgroups, forums or StackOverflow that ask how they can update a table’s rank column to reflect the current rank based on some other criteria in the table.  The problem with this is that you have to run the update every time any DML operation touches any row in the table, and if you can calculate that rank in an UPDATE query, you can also calculate that rank in a SELECT query, so why not just figure it out in real time?  An example of the former is when we want to maintain a second table with aggregates from another table.  This is where I want to spend my time today.

In reality, many of us deal with OLTP systems that must also serve as the reporting source, so it’s not always feasible to calculate aggregates in real time against a constantly moving target.  In several of these scenarios I’ve used indexed views, where we have tables that are inserted often, read often for statistics, but rarely or never updated… sure, you pay a little hit up front on the insert, but the benefit achieved during real-time reporting was worth it.  Let me give you an idea of what I was dealing with initially, and how I have fixed it over time to perform even better.

One of our systems is an SaaS e-mail platform, where we offer the ability for our customers to distribute messaging to their customers (mostly via e-mail, but we also support fax, SMS, widgets/gadgets, and social media).  For brevity, let’s call these people “recipients.”  If you stripped the schema down to the bare essentials for metadata, you’d see this (I’m going to leave out the foreign key and other constraints, as they are either obvious or irrelevant):

CREATE TABLE dbo.Domains
(
  DomainID INT PRIMARY KEY,
  DomainName VARCHAR(255)
);

CREATE TABLE dbo.Recipients
(
  RecipientID INT PRIMARY KEY,
  DomainID INT,
  LocalPart VARCHAR(64)
  /* , ... other columns ... */
); 

 
Obviously we store many other details about recipients, such as demographic information, preferences, opt-in status, subscriptions, etc.  But for this discussion, the above is sufficient.  Notice that for space savings we don’t actually store the e-mail address of the recipient, though it is required and is enforced to be unique via a constraint on (DomainID, LocalPart); this information can easily be derived using a view (you can even make this an indexed view if you want quicker access to e-mail addresses, but I’ll leave that for another day):

CREATE VIEW dbo.vRecipientDetails
AS
    SELECT
        r.RecipientID,
        EmailAddress = r.LocalPart '@' d.DomainName
    FROM
        dbo.Recipients AS r
    INNER JOIN
        dbo.Domains AS d
        ON r.DomainID d.DomainID;

 
RecipientID is the primary key because it is used in a lot of related tables, most importantly, stats tables.  We record every single transaction for a recipient: when a message is attempted to be sent to them; when a successful delivery occurs; when a bounce occurs; when the recipient opens the message or clicks on a link in the message; when the recipient changes their preferences; or, when the user declines to receive further communications from our customer.  The thought of storing LocalPart + DomainID (or the fully composed e-mail address) in all of these other tables that would grow and grow over time made a surrogate representation a very easy choice for the primary key.  I didn’t want to make the discussion revolve around this, but I’ve seen several raised eyebrows in the past about surrogate primary keys and wanted to assure you that, in this system, I think it is the best choice.  (If you want to debate that, let’s have a different discussion, as long as it doesn’t turn religious.)

Okay, so I mentioned this is an OLTP system, and I mentioned that we record all of the individual transactions for a recipient in various stats tables.  Let’s take a quick look at how one of these stats tables looked for the first, oh, 5 years of their existence:

CREATE TABLE dbo.Deliveries
(
  MessageID INT-- references a table dbo.Messages
  RecipientID INT,
  EventDate SMALLDATETIME
  /* , ... other columns... */
);

 
So a customer would send out a message targeted to, say, 50,000 recipients.  During the next few hours we would fill up this table with successful deliveries (and other similar tables with bounces, opens, clicks, etc).  Soon after inception of the system, we found that our customers wanted to pull real-time statistics on how the messages were going… mostly on the lookout for high bounce rates (well, low delivery rates too, I guess).  Of course this was very contentious if they pulled reports while the tables were loading.  So in order to give a much more satisfactory experience to the user pulling reports in our web UI, we created indexed views that would automatically maintain the number of deliveries, bounces, etc.:

CREATE VIEW dbo.vMessageDeliveries
WITH SCHEMABINDING
AS
  SELECT
    MessageID,
    c = COUNT_BIG(*)
  FROM
    dbo.Deliveries
  GROUP BY
    MessageID;
GO

CREATE UNIQUE CLUSTERED INDEX m
  ON dbo.vMessageDeliveries(MessageID);

 
So now the inserts were a little more expensive but, as mentioned above, this was much better for overall system performance (at least perceived performance) and, more importantly, customer happiness.

Then customers wanted more details on their statistics.  They wanted to see how their messages were getting into, say, Hotmail compared to AOL.  We have this information, but it was quite expensive to retrieve.  Remember earlier I noted that we don’t typically store redundant information, but this means that in cases like this, we need to do a lot of joins.  So for example, to provide a responsive report that showed real-time message status per domain, we created indexed views like this:

CREATE VIEW dbo.vMessageDomainDeliveries
WITH SCHEMABINDING
AS
  SELECT
    d.MessageID,
    r.DomainID,
    c = COUNT_BIG(*)
  FROM
    dbo.Deliveries AS d
  INNER JOIN
    dbo.Recipients AS r
    ON r.RecipientID d.RecipientID
  GROUP BY
    d.MessageID,
    r.DomainID;
GO

CREATE UNIQUE CLUSTERED INDEX md
  ON dbo.vMessageDomainDeliveries(MessageIDDomainID);

 
Of course this again added more strain to the insert process on the deliveries table, since it now had to maintain two indexed views.  In isolation, this still yielded better overall performance than essentially expanding that view and trying to get at the domain aggregates in real time.  But it introduced a new problem to the system: severe blocking.  Picture the case where deliveries are being recorded while a recipient is trying to update their preferences or change their e-mail address.  Since both processes need to lock and potentially update the index on the indexed view, one has to wait for the other.  This can lead to a sad face on an end user.

What I ended up doing was to recant my disdain for storing redundant information, and store the domain information in the deliveries table.  Yes, this was an extra 4 bytes, but as we were moving the whole system to a new 2008 cluster with a much faster I/O subsystem, we would get much more than that back with page compression.  So starting over, we still have the same Domains and Recipients tables, but now domain-based reports are driven from objects that look like these:

CREATE TABLE dbo.Deliveries
(
  MessageID INT,
  RecipientID INT,
  DomainID INT,
  EventDate SMALLDATETIME,
  /* , ... other columns ... */
);
GO

CREATE VIEW dbo.vMessageDomainDeliveries
WITH SCHEMABINDING
AS
  SELECT
    MessageID,
    DomainID,
    c = COUNT_BIG(*)
  FROM
    dbo.Deliveries
  GROUP BY
    MessageID,
    DomainID;
GO

CREATE UNIQUE CLUSTERED INDEX md
  ON dbo.vMessageDomainDeliveries(MessageIDDomainID);

 
The up-front cost of calculating the domain of each recipient during insert is likely very equivalent to the system doing the same lookup while maintaining the indexed view (except we can use snapshot isolation in our query, but can’t really enforce the same during clustered index updates).  And in addition to no longer causing blocking on updates to the recipients table, we also get a much more accurate representation of history: the delivery is marked with the domain the recipient had on that day, since their e-mail address could have changed 20 times since then.  All in all it has turned out that storing the domain information multiple times has helped contribute to making a very busy system suddenly seem almost idle.

The primary lesson learned here: indexed views can be a very helpful tool in your arsenal, but they can bring you down if you try to over-use them.  Another lesson learned: do not assume that redundancy is the devil.  It can help out in many scenarios, even if it feels a little dirty.

How do I chat thee? Let me count the ways.

Saturday, November 14th, 2009

How do I chat thee?
Let me count the ways:

I can chat thee old school
face to face and toe to toe
our meat mouths moving
air vibes chafing the drums in our ears.

I can chat thee through chat boards
Do they still have chat boards?
Do they still know if you’re a dog?
Do the pheromones still hang as sweet and loud?

I can chat thee on AOL
the running man messenger
claims my home page as his own
and where did that tool bar come from?

I can chat thee on Twitter
but does everyone see our repartee?
Does our use of hash tags
mean it’s for their benefit?

I can chat thee on Facebook
in so many ways
wall to wall or that box in the corner
or we both can make comments on something we Like.

I can chat thee on Google
by email, by Wave, by another box in the corner.
Don’t mind that they’re watching
every thing that we say, every thing that we might do.

I can chat thee on Mobile
with Facebook and Twitter and old AOL
with SMS we can text our innermost thoughts
to each others mobile phone, with pictures.

How did I chat thee?
On what channel did we parse
our arrows of connection?
And did we leave a trace?

Email is dead - long live email!

Tuesday, October 13th, 2009

This Wall Street Journal article is significant not so much for it’s insight into the changing face of internet messaging (which boils down to “Kids these days!”) but rather for the acknowledgment that email is losing (lost?) it’s place as the primary communication channel. Now any article that ends with a whinge about why not try just “talking to someone in person” isn’t likely to have much to say about how to manage your place in this new world of communications, but it does do a good job of describing the angst that many people have about a world where Twitter, Facebook and Yammer are where the real conversations are happening.  The article also recognizes the increased importance of defining, managing and projecting your identity (but misses the critical role that email has in managing your identity).

 

Helping people kick bad SQL Server habits

Saturday, October 10th, 2009
Bad Habits in T-SQL

(more…)

Owyang and Li on Google’s Stealth Social Network Play

Wednesday, October 7th, 2009

Jermiah Owyang and Charlene Li have written an spot-on post about Google’s stealth social media implementation. Google has quietly rolled out all of the components of a social media presence including profiling, trusted communication, sharing, and content management. It has done so one element at a time, with each element powerful enough to stand on its own.

 

 

The rise and fall of TV advertising?

Tuesday, October 6th, 2009
baldwin

(more…)

Sometimes a service outage can be a good thing

Saturday, September 26th, 2009

Gmail Service Disruption

(more…)

Learn More About Our Services

To learn more about our services, please fill out and submit the form below:

 
 
 
 
 

I'm interested in the following:

 

 

 
 

 

OTOlabs Postings