More Normal Data
Last week we normalized the most import bits of my reading data. This week I plan to ramble a bit about how we might add some more data and the various complications that come from that.
The primary purpose (tracking books read) is served by the final data model from last week but maybe we want to some extra metadata around our books (or readers). Some of that will likely end up directly in our relational model but as we add meta-data we need to start making decisions about how our database will be modeled and it will affect the performance whether that’s adding stored data or putting data in places that work better for some use cases than others.
The first order of business is adding unique usernames as well as passwords so our readers can log in and access their reading records. As this is a a social application we want to give users the ability to share (or not their reading data). The changes here are simple and don’t change the model or it’s overall efficiency.
Maybe we want to be more specific about the edition or version of the book they read (or listened to). We now need to make some decisions about what we are going to optimize for: do we keep edition and format information stored with the single book record? Or do we start adding more records to the book table itself? Either way we need to better adopt the standard way of recording information about books. ISBN numbers would allow us to track specific media (a particular edition, audio book etc.) but there are many books that were published before the ISBN standard was adopted so it’s not a good candidate key for our table because a book may not have one. (I have a very lovely copy of the Complete Works of Shakespeare that was published in late 1800s: should I be able to add this book to my collection in the application?) Do we create our own standard or is there or resource we can leverage to import data when it is missing?
I hate re-inventing the wheel so I am going to go down the road of simply allowing data from WorldCat or Open Library. For the sake of this thought experiment let’s decide we’ll be getting our book records from WorldCat. Both APIs return JSON data that we can parse and store as needed. Currently I believe a single record for a book with edition information stored in the same record makes sense. We don’t have (or want) an unlimited number of columns to store this information so we’ll be added a JSON column to our books table where we can store multiple JSON records. We can store up to 2GB of JSON data if we implement the column as NVARCHAR(MAX). For performance reasons Microsoft recommends limiting the column to NVARCHAR(4000) if know for sure you won’t grow the JSON beyond 8kb. We can return specific parts of the JSON document via queries or simply return the document in its entirety to allow the application to do whatever it needs to do with the information. At the database level we can ensure the data is formatted correctly with a constraint and data in the document is available for indexing as well. This is feeling like a good choice. I’d absolutely plan on adding an index of the media id from the source to the table. The columns for media and format are now folded into our JSON column and go away. We can optimize our JSON by ensuring we condense our record appropriately. Looking at the information stored in WorldCat a book record might look something like this:
{
    "Summary" : "\"A Prayer for the Crown-Shy is a story of kindness and love from one of the foremost practitioners of hopeful SF. After touring the rural areas of Panga, Sibling Dex (a Tea Monk of some renown) and Mosscap (a robot sent on a quest to determine what humanity really needs) turn their attention to the villages and cities of the little moon they call home. They hope to find the answers they seek, while making new friends, learning new concepts, and experiencing the entropic nature of the universe. Becky Chambers's new series continues to ask: in a world where people have what they want, does having more even matter?\"-- Provided by publisher",
    "Series" : "Monk & Robot",
    "Genre" : ["Science Fiction", "Dystopian Fiction"],
    "Subjects" : ["Being in love", "Dystopian fiction", "Fiction", "Friendship", "Friendship Fiction", "Gender-nonconforming people", "Gender-nonconforming people", "Fiction", "Monks", "Monks Fiction", "Robots", "Robots Fiction", "Science fiction", "Self-consciousness (Awareness)", "Self-consciousness (Awareness) Fiction"],
    "Editions" : [
         {      
            "editionID" : "1",
            "Physical Description" : "152 pages; 21 cm.",
            "ISBN" : ["9781250236234", "1250236231"],
            "OCLC Number" : "1300756362",
            "Language" : "English",
            "Year" : "2022",
            "Publisher" : "Tordotcom, a Tom Doherty Associates Book, New York"
        },
        {
            "editionID" : "2",
            "Physical Description" : "eBook",
            "ISBN" : ["9781250236241", "125023624X"],
            "OCLC Number" : "1300754050",
            "Language" : "English",
            "Year" : "2022",
            "Publisher" : "Tordotcom, a Tom Doherty Associates Book, New York"
        },
        {
            "editionID" : "3",
            "Physical Description" : "1 Volume",
            "ISBN" : ["9781250891266", "1250891264"],
            "OCLC Number" : "1337962228",
            "Language" : "English",
            "Year" : "2022",
            "Publisher" : "Pan Macmillan Australia"            
        }
    ]
}
We can store a fair amount of information very compactly (and as raggedly as necessary depending on the quality of the edition record). We get to collapse the existing genre and format columns into the JSON document. We have to traverse the JSON document for edition information but we can also index our table for common searched values. At a minimum the OCLC identifier (so we can quickly determine if we have already imported a record), the genre and subject fields are also likely candidates for indexing for users to discover new books. Thankfully our overall model hasn’t changed much!
Our readers will likely want to rate and review books and we’ll want to allow this as we’ll maybe able to monetize this by selling books that users recommend to each other. We’ll need a place to store reviews as well as collecting and storing some information about what sort of books they are interested in. Zooming in a bit we just need a few columns to hold reviews: we could create a unique reviewId but our application won’t hold multiple reviews for the same title by the same reader so we can expect the combination of readerID and bookId to serve as a primary key for this table. We’ll also add a column to our books table to hold the average rating for the book.
Lastly we want to add the ability for our users to add some profile information. Let’s look at what information we might need and/or want: records that make it easy or discuss or address someone (pronouns & name). A biography so users can describe themselves to other users. That’s pretty much it. An argument could be made that as a company we need to capture a birthdate that was provided to us as proof they can consent to use the service so let’s add that as well:
And there we have it. Our basic reading and book discovery data set. It’s got everything to it needs to be functional plus some nice to have extra information as well and we’ve limited the amount of radioactive data (PII) to just two fields: email and birthdate.
As a thought experiment it performs fantastic! At scale it may still run into some issues however. Here are some further considerations off the top of my head as this model gets implemented physically on the disk:
- Can we be more smarter about data types?
- As the data scales up would it make sense to create some more entities?
- We have a number of potential many to many relationships would further research into implementing a graph database make sense? Are our needs simple enough they can be implemented inside of SQL Server or is another solution required1?
- What other data/functionality is going to be duct taped to the our original application2.