What did you learn today? - SQL
Phil Denoncourt's Technology Rants
 
 Wednesday, March 21, 2007
Reviewing Databases by phildenoncourt

There is plenty of guidance on reviewing code.  (Look at CodeFrisk.Com/Guidance.aspx for some links.)  I’ve found, though, that the most critical piece of a traditional application almost never undergoes any type of review.  Most applications live and die by their database, and yet it rarely gets reviewed.  If you’re lucky (although you might not think so), you’ll have a DBA check things over, but DBAs are mostly concerned with performance.  Maintainability and adherence to standards are not at the top of their list.  

 

I’ve listed some of the things I look at when I review a database:

--Appropriate Normalization

It’s tough to say what is and isn’t appropriate normalization.  You’ll know it when you see it.  Stuff like having a separate table for the US states or one called Gender is usually a red flag of having gone too far.  Conversely, having the same non-key field duplicated in multiple tables is usually a sign that normalization hasn’t gone far enough. 

 

--Is all access through stored procedures

Most development shops have a policy that says, “All data access shall be done through stored procedures”.  If you have that policy, check to make sure this is being adhered to.  An easy way to check this is to deny access to the tables and allow access to the stored procedures.
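
Here is a minimal sketch of that check.  The Customer table, the Customer_GetByID procedure and the AppUser role are made-up names for illustration, so substitute your own objects.

```sql
-- Hypothetical table and procedure, created here only so the sketch is complete.
CREATE TABLE dbo.Customer (
    CustomerID int NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);
GO
CREATE PROCEDURE dbo.Customer_GetByID @CustomerID int
AS
    SELECT CustomerID, Name FROM dbo.Customer WHERE CustomerID = @CustomerID;
GO
-- Hypothetical role that the application's login would belong to.
CREATE ROLE AppUser;
-- No direct table access...
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.Customer TO AppUser;
-- ...only access through the stored procedure.
GRANT EXECUTE ON dbo.Customer_GetByID TO AppUser;
```

Because the procedure and the table have the same owner, calls through the procedure should keep working, while ad hoc statements (and dynamic SQL inside a procedure) that touch the table directly fail with a permission error.  Those failures are the violations you are looking for.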

 

--Look for premature optimization

I’ve seen times when developers are using join hints or locking hints in their queries.  “If I force the query to use a merge join, it goes 5x faster”.  Meaning it goes 5 times faster on the development box, which has far fewer rows and 3 fewer processors than the production machine.  Once it goes to the production machine, there’s a good chance that SQL Server would have decided upon a different plan to execute the statement, making the hint destructive on the production machine.
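
One rough way to find candidates for this part of the review is to search the module definitions for hint keywords.  This is only a sketch: the list of patterns is my own guess at the usual suspects, it is far from complete, and it will also match text that merely appears in comments.

```sql
-- List procedures, views, functions and triggers whose source mentions a common hint.
SELECT SCHEMA_NAME(o.schema_id) AS SchemaName,
       o.name                   AS ObjectName,
       o.type_desc              AS ObjectType
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE m.definition LIKE '%MERGE JOIN%'     -- join hints
   OR m.definition LIKE '%HASH JOIN%'
   OR m.definition LIKE '%LOOP JOIN%'
   OR m.definition LIKE '%FORCE ORDER%'    -- query hint
   OR m.definition LIKE '%NOLOCK%'         -- locking hint
   OR m.definition LIKE '%WITH (INDEX%'    -- index hint
ORDER BY SchemaName, ObjectName;
```

Each hit still needs a human to decide whether the hint is justified.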

 

--Are objects secured and scripted

Eventually, when you roll the database into a production environment, access to the database will (or should) be locked down.  Is the development environment the same?  Do the scripts that you have under source control (you do have the database creation scripts under source control, don’t you?) include securing the objects?

 

--Are updates applied in the same order

I can think of no better way to create an application that suffers from chronic deadlock situations than to have one stored procedure update tables A, B, & C (in that order) and have another stored procedure update tables C, B, & A (in that order).  Updates should always be applied in the same order.  Otherwise, you’re increasing the chance that a row in A will be locked by one stored proc while waiting for a row in table C locked by the other stored proc.  Generally the order is enforced naturally because you have to respect foreign keys, but every now and then I come across this.
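
As a sketch of what "the same order" looks like, here are two procedures that both touch A, then B, then C.  The tables, columns and procedure names are all hypothetical.

```sql
-- Hypothetical tables, created only so the sketch is complete.
CREATE TABLE dbo.A (ID int NOT NULL PRIMARY KEY, Amount money NOT NULL);
CREATE TABLE dbo.B (ID int NOT NULL PRIMARY KEY, Amount money NOT NULL);
CREATE TABLE dbo.C (ID int NOT NULL PRIMARY KEY, Amount money NOT NULL);
GO
CREATE PROCEDURE dbo.Amount_Add @ID int, @Amount money
AS
BEGIN
    SET XACT_ABORT ON;   -- roll the whole transaction back on a run-time error
    BEGIN TRANSACTION;
    UPDATE dbo.A SET Amount = Amount + @Amount WHERE ID = @ID;
    UPDATE dbo.B SET Amount = Amount + @Amount WHERE ID = @ID;
    UPDATE dbo.C SET Amount = Amount + @Amount WHERE ID = @ID;
    COMMIT TRANSACTION;
END
GO
CREATE PROCEDURE dbo.Amount_Reset @ID int
AS
BEGIN
    SET XACT_ABORT ON;
    BEGIN TRANSACTION;
    -- Even if the logic "naturally" starts with C, lock A first, then B, then C.
    UPDATE dbo.A SET Amount = 0 WHERE ID = @ID;
    UPDATE dbo.B SET Amount = 0 WHERE ID = @ID;
    UPDATE dbo.C SET Amount = 0 WHERE ID = @ID;
    COMMIT TRANSACTION;
END
GO
```

With a shared order, one caller simply waits for the other at table A instead of each holding a lock that the other needs.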

 

--Biblical stored procedures

By biblical, I mean volume, not divinely inspired.  I’m referring to stored procedures that would use up more than one printer cartridge if you were to print them.  Nowadays when you see a large stored procedure, it’s one of two things:

1) Embedded business logic – This seems to be a remnant of client server programming where you were forced to put your business logic in the stored procedure.  It’s arguable whether or not the database is the best place to keep your business logic.  Most people (including me) believe it’s not.  Business logic should be kept in a business logic component that validates, enforces and calculates.

2) Poorly defined schema.  Most validations that you see taking place in stored procs could be controlled by using schema constructs like check constraints, triggers, foreign keys and defaults.  Put that type of information in the schema where it can be enforced consistently everywhere rather than burdening the stored procedure.
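
To illustrate the second point, here is a small sketch of validations expressed in the schema instead of in a stored procedure.  The OrderHeader and OrderLine tables are hypothetical.

```sql
-- Hypothetical parent table.
CREATE TABLE dbo.OrderHeader (
    OrderID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL
        CONSTRAINT DF_OrderHeader_OrderDate DEFAULT (GETDATE())   -- default instead of proc code
);

-- Hypothetical child table.
CREATE TABLE dbo.OrderLine (
    OrderLineID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    OrderID int NOT NULL
        CONSTRAINT FK_OrderLine_OrderHeader
        REFERENCES dbo.OrderHeader (OrderID),                     -- "does the order exist?" check
    Qty int NOT NULL
        CONSTRAINT CK_OrderLine_Qty CHECK (Qty > 0),              -- range validation
    Cost money NOT NULL
        CONSTRAINT CK_OrderLine_Cost CHECK (Cost >= 0)
);
```

Every insert and update is held to these rules no matter which procedure (or ad hoc statement) issues it.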

 

--Does the Development database match what’s in source control

To reinforce the earlier point, database schema objects need to be kept under source control.  Having a database backup plan isn’t sufficient.  Putting the objects in source control allows you to manage changes and releases much more effectively.  Otherwise you’ll be struggling to determine which objects have changed in the development database and need to be deployed, and figuring out who added a field to a table.  Keeping the schema under source control used to be a very manual process, but now tools like Visual Studio for Database Professionals make this painless.

 

--Cloned stored procedures / Views.

This is very common.  When you need data, most developers don’t look to see if there is an existing stored procedure or view that satisfies their needs.   They’ll create a brand new one.  Then you end up with a database that has 3 different stored procedures that get a customer record by its ID.  Now all these duplicated procs have to be maintained.  Best prevention for this is to have a strict naming convention for your procedures. 

 

Those are some of the quick patterns I look for when reviewing a database.  What kinds of things stand out when you look over a database?

 

Wouldn’t you feel comfortable having your code reviewed by an expert?  Go to CodeFrisk.com to see how I can proofread your code at a reasonable price.

Wednesday, March 21, 2007 3:59:42 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Development | SQL  | 
 Sunday, November 19, 2006
Sample Databases by phildenoncourt

For the past month, I've been focusing on getting up to speed on the data mining features of SQL 2005. Really amazing stuff. I'll be giving a presentation on this in the near future. One of the things that took a significant amount of time was putting together a database that I could use for testing. I didn't want to use AdventureWorks or something like that because I wanted something that had more data and was more "real world".

So I downloaded the past 20 years of stock prices from a public quote server, the past 10 years of foreign currency prices, and a slew of economic data from the Federal Reserve. (See picture below). The database has over 7,300 companies (I tried to get all NASDAQ, NYSE, and AMEX tickers) with over 16 million quotes.

Now I'm creating models and running predictions. For the most part, I'm able to exercise all of the data mining algorithms. I haven't found the secret formula to the stock market yet, but someday…

If anybody else is interested in getting a copy of this database, let me know via the contact link. Because it's over a gigabyte in size and I don't have massive bandwidth allowances in my hosting account, it has to be transported via postal mail. PayPal me $30 to cover the cost of burning a DVD and sending it, and I'll get it out to you.

Sunday, November 19, 2006 3:45:32 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   SQL  | 
 Wednesday, September 13, 2006

I was sitting in this meeting in which I was only peripherally involved, and there's only so much doodling you can do.  So I did what most other people in the meeting were doing: I started daydreaming. 

I'm a big fan of GUIDs, but there's all this noise about how slow they are because they're bigger than an int, making for inefficient searching.  So I started to wonder, how much slower are GUIDs than ints? So I started to do the math, and for a table with 4 billion rows, I was coming out with only one more logical read if the key was a GUID rather than an int.... I know.. Didn't make sense to me.  So I put it to a practical test.

I created two tables.  One called IntWithData and the other called GuidWithData.


I then populated them with a million rows each.  --As a side note, I started the scripts to populate the tables at about the same time, and they finished at about the same time.  Something I think I will research further is how much slower it is to insert a record in a GUID table compared to an INT table.  Common sense tells you it's supposed to be way slower to populate a GUID table--
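
For anyone who wants to repeat the test, here is a sketch of what the two tables and the load could look like.  Only the table names and the ID key come from this post.  The Data column, its type and the use of NEWID() are assumptions on my part.

```sql
-- Assumed shape: a clustered key plus one small fixed-width column.
CREATE TABLE dbo.IntWithData (
    ID   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Data datetime NOT NULL DEFAULT (GETDATE())
);

CREATE TABLE dbo.GuidWithData (
    ID   uniqueidentifier NOT NULL DEFAULT (NEWID()) PRIMARY KEY CLUSTERED,
    Data datetime NOT NULL DEFAULT (GETDATE())
);

-- Load a million rows into each table, one row at a time.
SET NOCOUNT ON;
DECLARE @i int;
SET @i = 0;
WHILE @i < 1000000
BEGIN
    INSERT INTO dbo.IntWithData DEFAULT VALUES;
    INSERT INTO dbo.GuidWithData DEFAULT VALUES;
    SET @i = @i + 1;
END
```

NEWID() hands out values in random order, so new rows land all over the clustered index rather than at the end.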

I took a look at the statistics for the index by using the following statements:
SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.GuidWithData'),null,null,'DETAILED')
SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.IntWithData'),null,null,'DETAILED')

That produced these results:

GuidWithData (one column per index level; level 0 is the leaf level):

index_level                     0                1                2
database_id                     6                6                6
object_id                       2105058535       2105058535       2105058535
index_id                        1                1                1
partition_number                1                1                1
index_type_desc                 CLUSTERED INDEX  CLUSTERED INDEX  CLUSTERED INDEX
alloc_unit_type_desc            IN_ROW_DATA      IN_ROW_DATA      IN_ROW_DATA
index_depth                     3                3                3
avg_fragmentation_in_percent    99.037           96.2963          0
fragment_count                  5919             27               1
avg_fragment_size_in_pages      1                1                1
page_count                      5919             27               1
avg_page_space_used_in_percent  68.85677         67.68663         8.314801
record_count                    1000000          5919             27
ghost_record_count              0                0                0
version_ghost_record_count      0                0                0
min_record_size_in_bytes        31               23               23
max_record_size_in_bytes        31               23               23
avg_record_size_in_bytes        31               23               23
forwarded_record_count          NULL             NULL             NULL

 

IntWithData (one column per index level; level 0 is the leaf level):

index_level                     0                1                2
database_id                     6                6                6
object_id                       2073058421       2073058421       2073058421
index_id                        1                1                1
partition_number                1                1                1
index_type_desc                 CLUSTERED INDEX  CLUSTERED INDEX  CLUSTERED INDEX
alloc_unit_type_desc            IN_ROW_DATA      IN_ROW_DATA      IN_ROW_DATA
index_depth                     3                3                3
avg_fragmentation_in_percent    0.423403         0                0
fragment_count                  317              9                1
avg_fragment_size_in_pages      8.195584         1                1
page_count                      2598             9                1
avg_page_space_used_in_percent  99.84113         46.33886         1.420806
record_count                    1000000          2598             9
ghost_record_count              0                0                0
version_ghost_record_count      0                0                0
min_record_size_in_bytes        19               11               11
max_record_size_in_bytes        19               11               11
avg_record_size_in_bytes        19               11               11
forwarded_record_count          NULL             NULL             NULL

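This is tough to read, but what you should be getting out of this is: the Guid table is very fragmented (about 99% at the leaf level) and the Int table is not (under 1%).  Also, the Guid table has about 2.3 times as many leaf pages as the Int table (5,919 versus 2,598), partly because its rows are bigger and partly because its pages are only about 69% full.  Ideally you would be running an overnight process that reorganizes pages in your DB, and that would help with the fragmentation.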
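
That overnight process could be as simple as one of the following statements.  Treat this as a sketch: the fill factor of 80 is an illustrative guess, not something I measured.

```sql
-- Option 1: defragment the leaf level in place; the table stays available.
ALTER INDEX ALL ON dbo.GuidWithData REORGANIZE;

-- Option 2: rebuild the indexes from scratch, leaving free space on each page
-- so that randomly ordered GUID inserts cause fewer page splits.
ALTER INDEX ALL ON dbo.GuidWithData REBUILD WITH (FILLFACTOR = 80);
```

Rerunning the sys.dm_db_index_physical_stats query afterwards shows how much of the fragmentation went away.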

So now the test query.  The question I wanted to answer was:  How many logical reads are needed to read a record out of the Int table and how many are needed to read out of the Guid table?

SET STATISTICS IO ON

SELECT * FROM IntWithData WHERE ID = 619284

SELECT * FROM GuidWithData
WHERE ID = '7DD78950-D3CD-4016-8D92-738A6E0666F2' -- I had to find this value in my data

The results:
....

(1 row(s) affected)

Table 'IntWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

Table 'GuidWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

3 logical reads for each.  That means that with 1 million rows, a single-row lookup costs the same whether the key is a Guid or an int.  This is because the lookup walks a balanced tree: one page is read at each level of the index, and both indexes are 3 levels deep (index_depth in the statistics above).  A binary search over a million keys shouldn't take more than 20 tries, and many of those "tries" happen within the same index page, so they don't cost an additional read.  At some volume, and I haven't calculated exactly where, there will be a difference of one logical read, but that's it.
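
A rough way to look for that volume is to estimate the index depth from the record sizes in the statistics above.  This is a back-of-the-envelope sketch: the 8,096 usable bytes per page and the 2-byte slot entry per row are the standard page layout figures, and the calculation assumes completely full pages, which the fragmented Guid table clearly does not have.

```sql
DECLARE @RowCount float, @PageBytes float, @SlotBytes float;
SET @RowCount  = 1000000;   -- change this to try other volumes
SET @PageBytes = 8096;      -- usable bytes on an 8 KB page
SET @SlotBytes = 2;         -- slot array entry per row

SELECT p.KeyType,
       p.LeafPages,
       p.FanOut,
       -- one leaf level plus however many upper levels are needed to reach a single root page
       1 + CEILING(LOG(p.LeafPages) / LOG(p.FanOut)) AS EstimatedDepth
FROM (
    SELECT k.KeyType,
           CEILING(@RowCount / FLOOR(@PageBytes / (k.LeafRecordBytes + @SlotBytes))) AS LeafPages,
           FLOOR(@PageBytes / (k.IndexRecordBytes + @SlotBytes)) AS FanOut
    FROM (
        -- record sizes taken from avg_record_size_in_bytes at level 0 and level 1
        SELECT 'int' AS KeyType, 19 AS LeafRecordBytes, 11 AS IndexRecordBytes
        UNION ALL
        SELECT 'uniqueidentifier', 31, 23
    ) AS k
) AS p;
```

The estimated depth is the number of logical reads a single-row lookup should need.  Raising @RowCount until the two estimates differ gives a ballpark for where the extra read shows up.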

Hopefully, I haven't misinterpreted the results of this test.  I don't think I have.  The bottom line is that Guids take up more physical space, but you're able to find them just as fast as ints.

 

Wednesday, September 13, 2006 9:51:55 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Development | SQL  | 
 Friday, August 18, 2006

I was preparing for the SQL Server Exam 70-441 (Designing Database Solutions by Using Microsoft® SQL Server™ 2005) and I kept coming across a whole bunch of small features that you don’t hear a lot about.  I’m not talking about the big ones everyone already knows about like CLR Integration, DDL triggers, XML, and ranking functions.  These are ones that took me by pleasant surprise.  I think they’re really cool, so I wanted to point them out.

 

Included Columns / Covering Indexes

Say you had a critical/oft-used report that ran a statement like:

 

SELECT Name, ProductNumber
FROM Production.Product
WHERE Style='U' AND Class='L'

 

If you wanted to put an index on the table to speed things up, in the past you’d put an index on Style, Class, Name, and ProductNumber.  That would remove the need to go from the index page to the table page to get all the information for the query, but it makes every one of those columns part of the index key.  SQL 2005 has added the INCLUDE clause to the CREATE INDEX statement.

 

CREATE INDEX IX_Product_Class_Style
ON Production.Product (Class,Style)
INCLUDE (Name, ProductNumber);

 

What this is doing is indexing just the Class and Style fields of the table.  The Name and ProductNumber values are stored only at the leaf nodes of the index.  This saves space in the upper-level index pages, meaning you can get more rows in an index page, reducing I/O and disk space.

 

Dynamic Management Views

Seriously, these are pretty cool.  Really.  If you haven’t looked at these, take the time to play around with a few. 

 

With these new views, you are able to get high visibility into the SQL Server engine.  Wondering what’s about to be written to the disk?  Query sys.dm_io_pending_io_requests.  Troubleshooting execution plans?  Use sys.dm_exec_cached_plans and sys.dm_exec_plan_attributes.  Need to see what’s running on the server right now?  sys.dm_os_tasks.  How often is an index used?  There’s sys.dm_db_index_usage_stats that will answer that question.  There are over 80 views that provide very useful information when you are troubleshooting problems or just curious as to how SQL Server organizes itself.
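
As a small sketch of the index usage question, this query joins sys.dm_db_index_usage_stats to sys.indexes for the current database.  The choice of columns and the sort order are just my preference.

```sql
-- Least-read indexes first; user_updates shows what each one costs to maintain.
SELECT OBJECT_NAME(s.object_id) AS TableName,
       i.name                   AS IndexName,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_seeks + s.user_scans + s.user_lookups;
```

Keep in mind that these counters start over when the SQL Server service restarts, so an index with no reads may simply not have been needed yet.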

 

sp_create_plan_guide

Every now and then, I have a query that runs like a jackrabbit on my development machine, but runs like a three-legged turtle on a production machine.  This is usually because SQL Server decided not to use an index on the production machine due to differences in data density, or because running the query with parallelism is slower with some statements.  To troubleshoot these situations, I end up modifying the stored procedure to add various hints.

 

Rather than modify the stored procedure, you can attach a hint to an existing query using the sp_create_plan_guide stored proc.  This way you can experiment without modifying the stored procedure and the hint doesn’t become part of your codebase.  MSDN documentation can be found here.
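
A sketch of what that looks like for the parallelism case.  The Customer_GetOrders procedure, its SELECT statement and the guide name are hypothetical.  The procedure has to exist already, and @stmt has to match the statement text inside it exactly.

```sql
-- Attach OPTION (MAXDOP 1) to one statement inside a (hypothetical) stored procedure.
EXEC sp_create_plan_guide
    @name            = N'PG_Customer_GetOrders_NoParallel',
    @stmt            = N'SELECT OrderID, OrderDate, Total FROM dbo.CustomerOrder WHERE CustomerID = @CustomerID',
    @type            = N'OBJECT',
    @module_or_batch = N'dbo.Customer_GetOrders',
    @params          = NULL,                      -- must be NULL for OBJECT plan guides
    @hints           = N'OPTION (MAXDOP 1)';

-- When the experiment is over, remove the guide again.
EXEC sp_control_plan_guide N'DROP', N'PG_Customer_GetOrders_NoParallel';
```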

 

OUTPUT Clause

A lot of people already know about this pretty cool feature, but I didn’t.  In the past, if you did an insert into a table that had an identity column, you would look at @@IDENTITY or scope_identity() after the insert.  That always seemed hackish to me.

 

You can now add an OUTPUT clause to an INSERT, UPDATE, or DELETE statement.  That will return the rows changed by the statement.  In an INSERT, the clause goes between the column list and VALUES.

 

INSERT INTO Person.Address
      (AddressLine1,AddressLine2,City
      ,StateProvinceID,PostalCode,rowguid,ModifiedDate)
OUTPUT INSERTED.AddressID
VALUES(@AddressLine1,@AddressLine2,@City
      ,@StateProvinceID,@PostalCode,@rowguid,@ModifiedDate);

 

You use the pseudo tables INSERTED and DELETED, just like in triggers.
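
Here is a sketch of both pseudo tables at work in an UPDATE, capturing the before and after values into a table variable.  The city and postal code values are made up, and the transaction is rolled back so the sample data is left alone.

```sql
DECLARE @Changed TABLE (
    AddressID int,
    OldCity   nvarchar(30),
    NewCity   nvarchar(30)
);

BEGIN TRANSACTION;

UPDATE Person.Address
SET City = N'Boston'
OUTPUT INSERTED.AddressID, DELETED.City, INSERTED.City
    INTO @Changed (AddressID, OldCity, NewCity)
WHERE PostalCode = N'02134';

ROLLBACK TRANSACTION;   -- table variables keep their rows after a rollback

SELECT AddressID, OldCity, NewCity FROM @Changed;
```

In an UPDATE the OUTPUT clause sits after SET, and in a DELETE it sits before WHERE.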

 

Storing BLOBs in Table page

Inevitably, on each project I work on, there is a requirement to have a field capable of storing more than 8k in a table.  It’s usually a field where the user can enter unlimited comments about a customer.  So I define it as a text field.  I always feel guilty about it because the user’s text will rarely be more than 100 bytes and I know it’s going to be stored on a different page.

 

In SQL 2005, if the data will fit in the same page as the rest of the record, it will place it in the same page.  This works with varchar/varbinary/nvarchar(max) and XML datatypes.  The behavior is described here at MSDN.
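
A sketch of the comments field done the new way.  The CustomerNote table is hypothetical.  The sp_tableoption call only spells out the default setting (0 keeps values in the row when they fit), so it is there for illustration rather than because it is required.

```sql
-- Hypothetical table with an "unlimited" comments column.
CREATE TABLE dbo.CustomerNote (
    CustomerNoteID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID     int NOT NULL,
    Comments       varchar(max) NULL
);

-- 0 = keep large value types in the row when they fit, 1 = always store them off-row.
EXEC sp_tableoption N'dbo.CustomerNote', 'large value types out of row', 0;

-- Check the current setting.
SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE name = 'CustomerNote';
```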

Persistence of Computed Columns

This is kind of neat.  We’ve had the ability to define computed columns in tables for a while.  They worked exactly as they’re named.  When the column was needed, it was computed.  Now the results of the computed column can be saved to disk.  This allows you to index computed columns.  You can also feel free to create computed columns that are computationally intense, since the cost is paid when the row is written rather than every time it is read.

 

CREATE TABLE LineItem (
      LineItemID int IDENTITY(1,1),
      OrderID int,
      ItemDesc varchar(512),
      Qty int,
      Cost money,
      LineTotal AS Qty * Cost PERSISTED
)
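
With the column persisted, indexing it is an ordinary CREATE INDEX.  A short sketch against the LineItem table above, where the index name and the 1000 threshold are made up:

```sql
-- Index the persisted computed column.
CREATE INDEX IX_LineItem_LineTotal ON LineItem (LineTotal);

-- A query like this can now seek on the index instead of computing Qty * Cost per row.
SELECT LineItemID, OrderID, LineTotal
FROM LineItem
WHERE LineTotal > 1000;
```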

 

Those are all the ones I found.  What have you noticed as small new features in SQL that make life easier?

Friday, August 18, 2006 9:01:27 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]   Development | SQL  | 
Copyright © 2008 Phil Denoncourt III. All rights reserved.