A Simple Question
One of the most frequent questions that comes up in the realm of enterprise applications is:
Who has touched/modified this data?
Short questions usually have long answers, and this one is no exception. To understand the implications of this question, one has to understand the principle of temporal versioning. This essentially means that time is treated as a dimension over which changes to data are persisted and observed. A full discussion on this topic can be found here. This post identifies a new feature in SQL Server 2016 called temporal tables and contrasts it with the manual process of creating and managing temporal tables in previous versions of SQL Server. The following example provides a basis for further discussion which in turn answers the question.
Imagine a table with the following schema:
CREATE TABLE [dbo].[Orders]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Amount] [money] NOT NULL,
[Comments] [nvarchar](4000) NULL,
[LastModifiedBy] [nvarchar](255) NOT NULL,
CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ([Id] ASC)
)
A simple orders table with Id
as the primary key, Amount to track the money spent on an order, a Comments column to track qualitative details about the order, and a LastModifiedBy
column to track who modified it.
In the process of consuming this table, imagine that a row was inserted:
INSERT INTO [dbo].Orders ([Amount], [Comments], [LastModifiedBy])
VALUES (5.32, 'Tall Caramel Macchiato', 'bob.blue@example.com')
If we ask the question who modified this data, the answer is simple: the user bob.blue@example.com
was the person who last touched this row. Now assume that this row was inserted with an error and that the amount needs to be updated from 5.32
to 4.89
. Also assume that the correction was completed by a different user as follows:
UPDATE [dbo].[Orders]
SET [Amount] = 4.89, [LastModifiedBy] = 'alice.green@example.com'
WHERE [Id] = 1 -- 1 being the ID of theorder row we just inserted.
If we ask the same question again, the answer from the system's perspective is simple: alice.green@example.com
is the user that last touched this row. While the answer from the system's perspective is simple, it is also incomplete. More often than not, due to compliance reasons, businesses not only want to know who modified the data the last time, but also who modified it since it was inserted into the table.
A naive implementation would be to simply attach a timestamp column with well-defined semantics of how to determine valid rows versus invalid ones. But in the process of doing this, we cannot maintain Id as a primary key because we end up violating the uniqueness constraint. Including the timestamp column in the primary key is also just as bad because it confounds the process of maintaining foreign key relationships against the table.
So what is the solution then?
Prior to SQL Server 2016
In earlier versions of SQL Server, the solution was to create a table with a similar schema that was identified as a history table. The following schema defines a history table that corresponds to the Orders
table that we established earlier:
CREATE TABLE [dbo].[Orders_History]
(
[Id] [int] NOT NULL,
[Amount] [money] NOT NULL,
[Comments] [nvarchar](4000) NULL,
[LastModifiedBy] [nvarchar](255) NOT NULL,
[ValidFrom] [datetime2] NOT NULL,
[ValidTo] [datetime2] NOT NULL
)
Note that Id
is not a primary key and that this schema has two additional columns ValidFrom
and ValidTo
. These additional columns help identify the validity of the row over time. I have omitted additional DML triggers that would have to be written against the Orders
table. However, the entire process would work as follows:
-
When a row is inserted into the
Orders
table, a row would also be inserted into for theOrders_History
table with the same values forId
,Amount
,Comments
, andLastModifiedBy
. Furthermore,ValidFrom
would have a value of the time of insertion into the Orders table andValidTo
would have the value ofEndOfTime
(a symbolic value that indicates that this row is still valid, you can use a value like December 31st 9999 23:59:59). -
If a row is modified in the
Orders
table, a row in theOrders_History
table with a matching primary key, that is still valid based on theValidTo
value, is expired by updating itsValidTo
value to the time of when the update occurred on the Orders table. After this, step 1 is repeated for theOrders_History
table with the new values of the updated row. -
If a row is deleted in the
Orders
table, we would perform the same procedure as step 2, except theValidTo
value would be updated to the time of deletion from theOrders
table.
So for the series of events described earlier (just an insert and update on the Orders
table), we would have to do something like this via triggers:
-- Initial insert
INSERT INTO [dbo].Orders_History
(
[Id],
[Amount],
[Comments],
[LastModifiedBy],
[ValidFrom],
[ValidTo]
)
VALUES
(
1,
5.32,
'Tall Caramel Macchiato',
'bob.blue@example.com',
GETDATE(),
'9999.12.31 23:59:59'
)
-- Update
DECLARE @now datetime2 = GETDATE()
UPDATE [dbo].Orders_History
SET [ValidTo] = @now
WHERE [Id] = 1 AND [ValidTo] = '9999.12.31 23:59:59'
INSERT INTO [dbo].Orders_History
(
[Id],
[Amount],
[Comments],
[LastModifiedBy],
[ValidFrom],
[ValidTo]
)
VALUES
(
1,
4.89,
'Tall Caramel Macchiato',
'alice.green@example.com',
@now,
'9999.12.31 23:59:59'
)
At this point, if we are asked the same question again, we have the ability to dive into the history table and obtain information about all users that have modified a certain row of data in the OLTP Orders
table.
While this feature allows us to provide a richer experience for end users of an enterprise application, it is non-trivial, very tedious, and stressful to manage especially if you're trying to evolve your schema.
SQL Server 2016 CTP 2
As of SQL Server 2016 CTP 2, all this has been taken care of for us. We now have the ability to define a history table without directly having to manage the schema of the history table (even though we have the option / flexibility to do so). This in turn allows us to work with OLTP tables / data transparently and develop richer features more efficiently.
For the Orders
table we can simply create the table as follows:
CREATE TABLE [dbo].[Orders]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Amount] [money] NOT NULL,
[Comments] [nvarchar](4000) NULL,
[LastModifiedBy] [nvarchar](255) NOT NULL,
[ValidFrom] [datetime2] GENERATED ALWAYS AS ROW START NOT NULL,
[ValidTo] [datetime2] GENERATED ALWAYS AS ROW END NOT NULL
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ([Id] ASC)
)
WITH (SYSTEM_VERSIONING = ON)
That's it! No complicated triggers or managing history tables independently. You can essentially set up a history table with 3 extra lines of code as shown below, and that is extremely powerful!
...
[ValidFrom] [datetime2] GENERATED ALWAYS AS ROW START NOT NULL,
[ValidTo] [datetime2] GENERATED ALWAYS AS ROW END NOT NULL
PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
...
For more information on how you can use this feature to your advantage, see Getting Started with System-Versioned Temporal Tables.
The Path Forward
With this new feature, it's going to be so much easier to report on historical artifacts in database, from an audit perspective or a statistical perspective. One last note about the temporal feature is the general size of the database. Even though this feature is meant to capture data by slowly changing dimensions (SCD), slowly for any given businesses is a relative term, and database sizes can grow to be very large relatively quickly. While there is not much on the topic yet, I suspect that Stretch Database, another feature of SQL Server 2016, will play a key role in mitigating the size of databases with large history tables.