There is plenty of guidance on reviewing code. (Look at CodeFrisk.Com/Guidance.aspx for some links). I’ve found that the one absolutely most critical piece of a traditional application never undergoes any type of review. Most applications live and die by their database, and yet it rarely gets reviewed. If you’re lucky (although you might not think so), you’ll have a DBA check things over, but they are mostly concerned with performance. Maintainability and adherence to standards are not at the top of their list.
I’ve listed some of the things I look at when I review a database:
-Appropriate Normalization
It’s tough to say what is and isn’t appropriate normalization. You’ll know it when you see it. Stuff like having a separate table for the US states or one called Gender is usually a red flag of having gone too far. Conversely, having the same field in multiple tables is usually a sign that normalization hasn’t gone far enough.
--Is all access through stored procedures
Most development shops have a policy that says, “All data access shall be done through stored procedures”. If you have that policy, check to make sure this is being adhered to. An easy way to check this is to deny access to the tables and allow access to the stored procedures.
--Look for premature optimization
I’ve seen times when developers are using Join hints or locking hints in their query. “If I force the query to use a merge join, it goes 5x faster”. Meaning it goes 5 times faster on the development box which has far fewer rows and 3 fewer processors than the production machine. Once it goes the to production machine, there’s a good chance that SQL will decide upon a different plan to execute the statement, making the hint destructive on the production machine
--Are objects secured and scripted
Eventually, when you roll the database into a production environment, access to the database will (or should) be locked down. Is the development environment the same? Are the scripts that you have under source control (you do have the database creation scripts under source control don’t you?) include securing of the object.
--Are updates applied in the same order
I can think of no better way to create an application that suffers from chronic deadlock situations than to have one stored procedure update tables A, B, & C (in that order) and have another stored procedure update tables C, B, & A (in that order). Updates should always be applied in the same order. Otherwise, you’re increasing the chance that a row in A will be locked by one stored proc while waiting for a row in table C locked by the other stored proc. Generally the order is enforced naturally because you have to respect foreign keys, but every now and then I come across this.
--Biblical stored procedures
By biblical, I mean volume, not divinely inspired. I’m referring to stored procedures that use up more than one printer cartridge if you were to print it. Nowadays when you see a large stored procedure, it’s one of two things:
1) Embedded business logic – This seems to be a remnant of client server programming where you were forced to put your business logic in the stored procedure. It’s arguable about whether or the database is the best place to keep your business logic. Most people (including me) believe it’s not. Business logic should be kept in a business logic component that validates, enforces and calculates.
2) Poorly defined schema. Most validations that you see taking place in stored procs could be controlled by using schema constructs like check constraints, triggers, foreign keys and defaults. Put that type of information in the schema where it can be enforced consistently everywhere rather than burdening the stored procedure.
--Does the Development database match what’s in source control
To reinforce this, database schema objects need to be kept under source control. Having a database backup plan isn’t sufficient. Putting the objects in source control allow you to manage changes and releases much more effectively. Otherwise you’ll be struggling to determine which objects have changed in the development database and need to be deployed, and figuring out who added a field to table. Keeping the schema under source control used to be a very manual process, but now tools like Visual Studio for Database Professionals make this painless.
--Cloned stored procedures / Views.
This is very common. When you need data, most developers don’t look to see if there is an existing stored procedure or view that satisfies their needs. They’ll create a brand new one. Then you end up with a database that has 3 different stored procedures that get a customer record by its ID. Now all these duplicated procs have to be maintained. Best prevention for this is to have a strict naming convention for your procedures.
Those are some of the quick patterns I look for when reviewing a database. What are the kind of things that stand out when you look over a database?
Wouldn’t you feel comfortable having your code reviewed by an expert? Go to CodeFrisk.com to see how I can proofread your code at a reasonable price.
On Wednesday, March21st, I'll be giving an introductary talk on Data Mining for the Boston Access User's Group. I was scheduled to do this last month, but I got sick.
Introduction to Data Mining in SQL 2005 Data mining is the act of deriving patterns and trend that exist in your data. SQL 2005 provides a host of different algorithms that allow you to explore your data for hidden patterns and trends. Popular uses of data mining models are used to predict future sales, target customers for marketing, examine patterns of navigation through a website, and group products that are likely to be purchased together.
Phil will give an introduction to data mining concepts, an overview of the different algorithms, and how to incorporate the data mining results into your applications.
Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications. When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.
As a developer, I love to write code. Getting involved with a problem, putting my head down, coming up a solution and getting immersed in implementing it is an awesome feeling.
I also enjoy reading code. I like to look at a set of code and try to understand what the developer was thinking. You can tell a lot about how a person thinks by reading their code. To me, participating in code reviews is a lot of fun.
I originally wrote these articles 6 months ago when I was about to launch a new service to review code (For details, go to http://www.codefrisk.com/). I’m now much too busy to review code, but these articles might be of value, so I’m uploading them now. I’ll be posting a set of articles about code reviews, best practices, coding standards… Here’s the first article in the series:
Coding Standards – How to write a useful Coding Standard document
Coding standards are an agreement in a development team on how the code for an application will be structured. They promote consistency and easier maintenance. A well-written and enforced standards policy will reduce the time that it takes for new developers to feel comfortable with an application.
I’ve worked with many organizations over the years, most of which have some sort of coding standards documentation. The needs of each organization are very different. Some organizations sold their source code, so they needed to make sure there was a very high level of consistency. Others were in regulated industries and needed to conform to various practices. Others were small shops that didn’t see tremendous value in having standards.
The standards document had varying degrees of success. Some negative factors in coding standards I’ve seen are:
- Aggressively guarded by one person
- Enforcing rules that were necessary in older technologies, but no longer hold relevance (Especially if they were burned by the problem)
- Too slim
- Too thick
- Too strict
- Begun, but never completed.
- Finished, but never enforced.
I’ve written a few coding standards docs and want to share what I’ve found useful, and what you should try to avoid.
First of all, if you’re the person tasked with writing the documentation, make sure that you involve the rest of the development team as much as possible. You shouldn’t take the Moses approach; climb a mountain and come down with a set of commandments. If the developers on your team aren’t consulted on the coding standards, you’ll see a lot of them not adhering to them. These aren’t your coding standards; they are the teams coding standards.
Recognize that you aren’t writing a document with a stone and chisel. Coding standards must evolve as your team has learned new techniques. It’s also not uncommon to see coding standard docs that never get finished because the author feels the document must be 100% “complete” before they release it. Take an iterative approach and get the document out as soon as possible and continue to refine it. It’s impossible to get it all right the first, second, or even tenth time. You’ll probably find that most code reviews will result in changes to the standards document.
Kick off a discussion for an hour or two and brainstorm with the development team about possible standards. There will be contention. If the team can’t come to an agreement after 5 minutes, table the discussion point for later. If people have time to think about something, they will usually be able to justify or back off on their opinion. The important thing is to limit the time of the discussion in the interests of getting an initial version within the same day. Good developers will talk about coding all day.
Areas to discuss would include:
- Naming standards – Camel, Pascal, Hungarian, Dewey Decimal System?
- Documentation expectations – XML Documentation, when/where to put comments, class headers…
- Design Patterns that are expected to be used – Factories, Observers, Decorators…
- Automated Unit Testing strategy – TDD? Tool type, Code Coverage…
- Exception handling – Handling, Reporting…
- Use of constants/resource files – Yes, No? One resource for the entire project, one constant class?
- Complexity limits for methods - > 25 is a lot of branches
- Data access methods - Stored Procs, Dynamic SQL, DataSets, ORM, Async, Transaction strategy…
- Common components to use – Enterprise Library, Log4net, Infragistics…
- Localization strategy - Needed? Roll your own or use .NET framework...
- Versioning – Versioning assemblies, strong name, signing, delay signing…
- Build strategy – Nightly builds, one solution or many solutions, creating references
- !Code Formatting
For the most part try to stay away from talking about code formatting. Formatting standards inevitably come down to one developer’s preference over another. Capable developers will be able to read the code regardless of how it is formatted.
After you’ve had your discussion with the team, hole yourself up in your office/cube and write up the standards doc. Unless you’re a dreadfully slow typist, this shouldn’t take more than 1 – 3 hours.
The important thing is to make sure you get feedback from your team on the document. If they don’t feel involved and disagree with a particular piece, they will exhibit a little bit of passive-aggressive behavior and not adhere to the standard.
The Coding Standards document sets up expectations at code reviews. It is a declaration of what is expected from a developer when they submit code. Without some set of coding standards, code reviews can quickly become counterproductive.
Coding standard docs aren’t hard, aren’t intrusive, and are a valuable resource. When there is a set of guidelines to use when coding, it is much easier to bring new developers into the project. The benefits justify the one day of effort it takes to put one in place.
Wouldn’t you feel comfortable having your code reviewed by an expert? Go to CodeFrisk.com to see how I can proofread your code at a reasonable price.
On Wednesday, February 21st, I'll be giving an introductary talk on Data Mining for the Boston Access User's Group.
Introduction to Data Mining in SQL 2005 Data mining is the act of deriving patterns and trend that exist in your data. SQL 2005 provides a host of different algorithms that allow you to explore your data for hidden patterns and trends. Popular uses of data mining models are used to predict future sales, target customers for marketing, examine patterns of navigation through a website, and group products that are likely to be purchased together.
Phil will give an introduction to data mining concepts, an overview of the different algorithms, and how to incorporate the data mining results into your applications.
Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications. When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.
I'll be speaking at two user groups in the upcoming weeks:
Introduction to WCF (Windows Communication Foundation).
.NET 3.0 was released in November. WCF is one piece of the 3.0 puzzle and is a substantial improvement over ASP.NET web services.
Phil will show the overall architecture and how easy it is to develop WCF Services.
Phil Denoncourt is a .NET consultant, who over the past 5 years has developed a wide range of .NET applications and has over 20 years experience writing software. He is the leader of the New England C# User Group and has acquired the MCPD, MCTS, MCSD, MCDBA, MCSE, MCAD, MCSA and MCP+SB certifications.
When not coding, or spending time with his wife and 5 children, Phil is an avid fan of the Boston Red Sox and New England Patriots.
Jan 11 2007 at New England C# Users Group
Jan 18 2007 at New Hampshire .NET Users Group
For the past month, I've been focusing on getting up to speed on the data mining features of SQL 2005. Really amazing stuff. I'll be giving a presentation on this in the near future. One of the things that took a significant amount of time was putting together a database that I could use for testing. I didn't want to use AdventureWorks or something like that because I wanted something that had more data and more "real world".
So I downloaded the past 20 years of stock prices from a public quote server, the past 10 years of foreign currency prices, and a slew of economical data from Federal Reserve. (See picture below). The database has over 7,300 companies (I tried to get all NASDAQ, NYSE, and AMEX tickers) with over 16 million quotes.
Now I'm creating models and running predictions. For the most part, I'm able to exercise all of the data mining algorithms. I haven't found the secret formula to the stock market yet, but someday…
If anybody else is interested getting a copy of this database, let me know via the contact link. Because it's over a gigabyte in size and I don't have massive bandwidth allowances in my hosting account, it has to be transported via postal mail. PayPal me $30 to cover the cost of burning a DVD and sending it, and I'll get it out to you.
I was sitting in this meeting in which I was only peripherally involved and there's only so much doodling you can do. So I did what most other people in the meeting were doing, I started daydreaming.
I'm a big fan of GUIDs, but there's all this noise about how slow they are because they're bigger than an int making for inefficient searching. So I started to wonder, how much slower are GUIDs than ints? So I started to do the math, and for a table with 4 billion rows, I was coming out with only one more logical read if the key was a GUID than an int.... I know.. Didn't make sense to me. So I put it to a practical test.
I created two tables. One called IntWithData and the other called GuidWithData.
 
I then populated them with a million rows. --As a side note, I started the scripts to populate the tables at about the same time, and they finished at the same time. Something I think I will research further is how much slower it is to insert a record in a GUID table compared to an INT table. Common sense tells you it's supposed be way slower to populate a GUID--
I took a look at the statistics for the index by using the following statements: SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.GuidWithData'),null,null,'DETAILED') SELECT * FROM sys.dm_db_index_physical_stats(db_id(),object_id('dbo.IntWithData'),null,null,'DETAILED')
That produced these results:
| database_id |
object_id |
index_id |
partition_number |
index_type_desc |
alloc_unit_type_desc |
index_depth |
index_level |
avg_fragmentation_in_percent |
fragment_count |
avg_fragment_size_in_pages |
page_count |
avg_page_space_used_in_percent |
record_count |
ghost_record_count |
version_ghost_record_count |
min_record_size_in_bytes |
max_record_size_in_bytes |
avg_record_size_in_bytes |
forwarded_record_count |
| 6 |
2105058535 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
0 |
99.037 |
5919 |
1 |
5919 |
68.85677 |
1000000 |
0 |
0 |
31 |
31 |
31 |
NULL |
| 6 |
2105058535 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
1 |
96.2963 |
27 |
1 |
27 |
67.68663 |
5919 |
0 |
0 |
23 |
23 |
23 |
NULL |
| 6 |
2105058535 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
2 |
0 |
1 |
1 |
1 |
8.314801 |
27 |
0 |
0 |
23 |
23 |
23 |
NULL |
| database_id |
object_id |
index_id |
partition_number |
index_type_desc |
alloc_unit_type_desc |
index_depth |
index_level |
avg_fragmentation_in_percent |
fragment_count |
avg_fragment_size_in_pages |
page_count |
avg_page_space_used_in_percent |
record_count |
ghost_record_count |
version_ghost_record_count |
min_record_size_in_bytes |
max_record_size_in_bytes |
avg_record_size_in_bytes |
forwarded_record_count |
| 6 |
2073058421 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
0 |
0.423403 |
317 |
8.195584 |
2598 |
99.84113 |
1000000 |
0 |
0 |
19 |
19 |
19 |
NULL |
| 6 |
2073058421 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
1 |
0 |
9 |
1 |
9 |
46.33886 |
2598 |
0 |
0 |
11 |
11 |
11 |
NULL |
| 6 |
2073058421 |
1 |
1 |
CLUSTERED |
INDEX |
IN_ROW_DATA |
3 |
2 |
0 |
1 |
1 |
1 |
1.420806 |
9 |
0 |
0 |
11 |
11 |
11 |
NULL |
This is tough to read, but what you should be getting out of this is: Guid table is very fragmented. Int table is not. Also the Guid table had about 2.25 more pages than the Int table. Ideally you would be running an overnight process that reorganized pages in your DB, and that would help with the fragmentation.
So now the test query. The question I wanted to answer was: How many logical reads are needed to read a record out of the Int table and how many are needed to read out of the Guid table?
SET STATISTICS IO ON
SELECT * FROM IntWithData WHERE ID = 619284
SELECT * FROM GuidWithData WHERE ID = '7DD78950-D3CD-4016-8D92-738A6E0666F2' -- I had to find this value in my data
The results: .... (1 row(s) affected)
Table 'IntWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
Table 'GuidWithData'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
3 logical reads. That means in a table with 1 million rows, there should be no difference between a table with Guid as a key and one that has an int. This is because the records are searched for using a binary search algorithm. In a perfectly balanced tree, it shouldn't take the algorithm more than 20 tries to find the key with many of those "tries" are contained in the same index page. At some volume, and I haven't calculated exactly where, there will be a difference of one logical read, but that's it.
Hopefully, I haven't misinterpreted the results of this test. I don't think I have. The bottom line is that Guids take up more physical space, but you're able to find them just as fast as ints.
I was preparing for the SQL Server Exam 70-441 (Designing Database Solutions by Using Microsoft® SQL Server™ 2005) and I kept coming across a whole bunch of small features that you don’t hear a lot about. I’m not talking about the big ones everyone already knows about like CLR Integration, DDL triggers, XML, and ranking functions. These are ones that took me by pleasant surprise. I think they’re really cool, so I wanted to point them out.
Included Columns / Covering Indexes
Say you had a critical/oft-used report that ran a statement like:
SELECT Name, ProductNumber
FROM Production.Product
WHERE Style='U' AND Class='L'
If you wanted to put an index on the table to speed things up, you’d put an index on Style, Class, Name, and ProductNumber. That would remove the need to go from the index page to the table page to get all the information for the query. SQL 2005 has added the include clause to the Create Index statement.
CREATE INDEX IX_Product_Class_Style
ON Production.Product (Class,Style)
INCLUDE (Name, ProductNumber);
What this is doing is indexing just the Class and Style fields of the table. The Name and ProductNumber are stored at the leaf nodes of the index. This saves space in the index pages, meaning you can get more rows in an index page, reducing I/O and Disk Space.
Dynamic Management Views
Seriously, these are pretty cool. Really. If you haven’t looked at these, take the time to play around with a few.
With these new views, you are able to get high visibility into the SQL Server engine. Wondering what’s about to be written to the disk? Query sys.dm_io_pending_io_requests. Troubleshooting execution plans? Use sys.dm_exec_cached_plans and sys.dm_exec_plan_attributes. Need to see what’s running on the server right now? sys.dm_os_tasks. How often is an index used? There’s sys.dm_db_index_usage_stats that will answer that question. There are over 80 views that provide very useful information when you are troubleshooting problems or just curious as to how SQL Server organizes itself.
sp_create_plan_guide
Every now and then, I have a query that runs like a jackrabbit on my development machine, but runs like 3 legged turtle on a production machine. This is usually because it decided not to use an index on the production machine due to differences in data density or because running the query with parallelism is slower with some statements. To troubleshoot these situations, I end up modifying the stored procedure to add various hints.
Rather than modify the stored procedure, you can specify a hint for an existing query plan using the sp_create_plan_guide stored proc. This way you can experiment without modifying the stored procedure and the hint doesn’t become part of your codebase. MSDN documentation can be found here.
OUTPUT Clause
A lot of people already know about this pretty cool feature, but I didn’t. In the past, if you do an insert to a table, and the table has an identity column, after the insert, you would look at @@IDENTITY or scope_identity(). That always seemed hackish to me.
You can now add an OUTPUT clause to the end of an INSERT, UPDATE, or DELETE statement. That will return the rows changed by the statement.
INSERT INTO Person.Address
(AddressLine1,AddressLine2,City
,StateProvinceID,PostalCode,rowguid,ModifiedDate)
VALUES(@AddressLine1,@AddressLine2,@City
,@StateProvinceID,@PostalCode,@rowguid,@ModifiedDate)
OUTPUT INSERTED.AddressID;
You use the pseudo tables INSERTED and DELETED, just like in triggers.
| |