GSoC 2009 – Taxonomy Architecture
5 years, 3 months ago Posted in: Project 1

Objectives

The primary objective of a classification framework is to decouple categorization from object creation and management, and to provide a human-friendly, visually recognizable, flexible alternative to search indices of objects.

Starting from the directory architecture used by virtually every operating system, methods of classification have been in use for a long time, under the principle of decoupling classification from the creation, management, storage and identification of objects.

Classification, and its currently popular variation, tagging, are widely used in many web frameworks and desktops. Gmail started it by replicating a folder structure, with the concept of virtually placing an object into more than one bucket, perhaps following the soft linking that has existed in file systems for decades.

This was given a new look when web services like Gmail and Flickr adopted taxonomy classification for the very specific purpose of classifying only a few selected objects: we could choose to apply the classification or live without it. Later, a similar classification concept was adopted by many desktop applications, such as iTunes and Explorer in Microsoft Windows Vista, and by web frameworks.

Apart from the classification framework’s primary objective of decoupling the categorization logic from content management, we will focus on the following indispensable qualities of a web framework. The frequency of operations must also be taken into account in determining the crucial factors. Implementation-specific details are discussed under design considerations.

The database is normalized to 3NF to ensure extensibility while remaining scalable. Frequently used queries, such as those for relationships, object mapping and leaf membership, are designed to be scalable.

Reliability is achieved through hooks on the different states of an object, and vice versa, and possibly through cron tasks.

Vocabularies

  • Leaf : The atomic unit of taxonomy labeling, which is associated with items to categorize them. Here the words term and tag are used interchangeably for a leaf. Leaves may go by similar names, which we call aliases.
  • Alias: For example, university and college may mean the same thing, and it is redundant and misleading to have two different labels for a single term. Hence college can be made an alias of university, so that only the one term, university, is used, and whenever a user specifies the term college, it is interpreted as university.
  • Tree: An umbrella unit for leaves that acts as a bucket and defines the properties, rules and interactions of the leaves underneath. It has the following properties
    • Hierarchy : The type of hierarchy involved (refer structure below)
    • Controlled: Whether leaves can be created outside the admin forms (com_taxonomy)
    • Relations: What types of relations are allowed: only parent-child, or peer-to-peer as well.
  • Tree mapping : A tree is useless unless it is mapped to an object manager, such as a component. A particular mapping involves a tree and an extension. It further defines the following rules
    • Required: Whether each object must be associated with a leaf from this tree under this association
    • Multiple: Whether multiple leaves can be associated with a single object
    • Weight: The factor that determines the priority for a tree when an extension is calling for all associated trees or all associated terms for an object under that extension. Heavier weights sink.
  • Leaf mapping: Guided by the tree mapping rules and the properties of the tree, a leaf is associated with an object that belongs to the extension given in the tree mapping. This is also known as labeling or tagging.
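
The vocabulary above can be sketched as a small data model. This is only an illustration under assumptions: the class names, field names, the `label` helper and the extension name `com_content` are hypothetical and not taken from the project's code. It shows how alias resolution, the controlled property, and the required and multiple rules might interact when an object is labeled.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """A bucket of leaves; all field names here are assumptions."""
    name: str
    hierarchy: str = "flat"      # "flat", "single" or "multiple"
    controlled: bool = False     # True: leaves are created only via admin forms
    leaves: set = field(default_factory=set)
    aliases: dict = field(default_factory=dict)  # alias -> canonical leaf

    def resolve(self, term: str) -> str:
        """Map an alias to its canonical leaf; other terms pass through."""
        return self.aliases.get(term, term)


@dataclass
class TreeMapping:
    """Links one tree to one extension and carries the mapping rules."""
    tree: Tree
    extension: str
    required: bool = False
    multiple: bool = True
    weight: int = 0


def label(mapping: TreeMapping, terms: list) -> list:
    """Return the leaves to attach to one object, enforcing the rules."""
    leaves = []
    for term in terms:
        leaf = mapping.tree.resolve(term)
        if leaf not in mapping.tree.leaves:
            if mapping.tree.controlled:
                raise ValueError(f"unknown leaf in controlled tree: {leaf}")
            mapping.tree.leaves.add(leaf)  # uncontrolled trees create leaves on use
        if leaf not in leaves:
            leaves.append(leaf)
    if mapping.required and not leaves:
        raise ValueError("at least one leaf is required")
    if not mapping.multiple and len(leaves) > 1:
        raise ValueError("only one leaf is allowed per object")
    return leaves


# Hypothetical example data: "college" is an alias of "university".
topics = Tree("topics", leaves={"university"}, aliases={"college": "university"})
tags = TreeMapping(topics, "com_content", required=True, multiple=True)
print(label(tags, ["college", "university", "research"]))  # ['university', 'research']
```

Keeping the rules on the mapping rather than on the tree is what lets the same tree be required for one extension and optional for another.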

Although a unified tree resembling the structure of domain names would be adequate, we consider a forest of trees, for the reasons explained under design considerations. The top-level member is a tree, whose attributes define the usage of the tree and its structure. Leaves are members of a particular tree. Although unrestrained membership was considered, for reasons explained under design considerations a leaf is confined to only one tree. It can still be used by multiple extensions, abiding by the concept of define once, use everywhere.

Several kinds of structures can be built under a taxonomy framework. However, there are three fundamental structures on top of which a complex design can be built, and we call them hierarchies.

Hierarchies

1. Free terms

This is also known as a flat structure, tagging, floating terms, etc. A tree built with such a structure offers its end users a straightforward, simple taxonomy system.

Figure: Flat hierarchy

2. Single Hierarchy

This is commonly understood as a taxonomy tree, where a leaf is either the root leaf or a child of another leaf. Each leaf therefore has at most one parent but may have many children, resembling a top-down hierarchy.

Figure: Single hierarchy

3. Multiple Hierarchies

The requirement for flexibility brings in some rarely used features. Multiple hierarchy is one such feature, rarely used but still essential for completeness. It allows multiple parents, so that branching off is possible both upwards and downwards.

Figure: Multiple hierarchies

The intention of separating the logic of a structure from its building blocks is to maintain maximum flexibility, that is, being able to achieve virtually any level of complexity in building a taxonomy system, while ensuring usability, so that users are not led into the wilderness. Under this design, structures are built seamlessly on the aforementioned blocks.
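
The three hierarchies differ only in how many parents a leaf may have, which a short check can make concrete. The sketch below is illustrative: the function names and the `parents` representation are assumptions, and the cycle check is an added assumption that the text does not state (a leaf should not be its own ancestor).

```python
def has_cycle(parents: dict) -> bool:
    """Detect a leaf that is its own ancestor, using depth-first search."""
    done = set()

    def visit(leaf, path):
        if leaf in path:
            return True
        if leaf in done:
            return False
        path.add(leaf)
        found = any(visit(p, path) for p in parents.get(leaf, ()))
        path.discard(leaf)
        done.add(leaf)
        return found

    return any(visit(leaf, set()) for leaf in parents)


def check_hierarchy(kind: str, parents: dict) -> bool:
    """Return True if the parent links are allowed under the given hierarchy.

    `parents` maps each leaf to the set of its parent leaves.
    """
    if kind == "flat":
        # Free terms: no leaf may have a parent.
        return all(not p for p in parents.values())
    if kind == "single":
        # At most one parent per leaf, any number of children.
        limit_ok = all(len(p) <= 1 for p in parents.values())
    elif kind == "multiple":
        # Any number of parents per leaf.
        limit_ok = True
    else:
        raise ValueError(f"unknown hierarchy: {kind}")
    return limit_ok and not has_cycle(parents)


flat = {"php": set(), "joomla": set()}
single = {"science": set(), "physics": {"science"}, "optics": {"physics"}}
multi = {"science": set(), "art": set(), "photography": {"science", "art"}}

print(check_hierarchy("flat", flat))       # True
print(check_hierarchy("single", multi))    # False: photography has two parents
print(check_hierarchy("multiple", multi))  # True
```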

Popular structures

  1. Free tagging – Similar to the tagging feature available in WordPress, Gmail, or Flickr. This is a flat, multiple, uncontrolled tree, and possibly not required.
  2. Categories – Similar to the categories used in Joomla! or in WordPress. Hierarchical trees with single hierarchy. Mostly controlled, and possibly required and single, which means the user must opt for one and only one leaf per object.
  3. User groups – Similar to Organic Groups used in Drupal. Hierarchical trees with single hierarchy, controlled, single, and required.
  4. Book – Hierarchical trees with single hierarchy, not controlled, required and single.
  5. Channels – Multiple hierarchies, not controlled, multiple and possibly required.
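
The five structures above amount to different combinations of the same four settings, which could be captured as presets. The following is a sketch under assumptions: the `Preset` type, the key names and the `split` helper are hypothetical, and where the list says "possibly" or "mostly" the value chosen is only a guess at a sensible default.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    hierarchy: str    # "flat", "single" or "multiple"
    controlled: bool  # tree property
    multiple: bool    # tree-mapping rule
    required: bool    # tree-mapping rule


# "Possibly" and "mostly" in the list are resolved here to a default guess.
PRESETS = {
    "free_tagging": Preset("flat", controlled=False, multiple=True, required=False),
    "categories": Preset("single", controlled=True, multiple=False, required=True),
    "user_groups": Preset("single", controlled=True, multiple=False, required=True),
    "book": Preset("single", controlled=False, multiple=False, required=True),
    "channels": Preset("multiple", controlled=False, multiple=True, required=True),
}


def split(preset: Preset):
    """Separate the settings stored on the tree from those stored on the mapping."""
    tree_props = {"hierarchy": preset.hierarchy, "controlled": preset.controlled}
    mapping_rules = {"multiple": preset.multiple, "required": preset.required}
    return tree_props, mapping_rules


print(split(PRESETS["channels"]))
```

The split makes the coordination visible: half of each preset belongs to the tree and half to its mapping with an extension.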

The examples above emphasize the coordination required between creating a tree and mapping it to an extension. Although a user or an extension could do that by writing directly to the tables or by using the backend forms, and so build a model, careful planning is needed.

Database Structure

Both the ability to handle a large number of records and extensibility are considered in the design, and hence the schema is normalized to 3NF. The following tables build the taxonomy framework

  • tree : Builds the forest of trees and stores the complete information necessary to build a complete tree. All the attributes are editable through the management component; however, a quick build will be made available for common types of trees.
  • tree_map : Stores the correspondence between a tree and another extension, for example the content component. A whole tree can be linked to an extension. Such normalization enables reuse of a single tree for multiple applications. Weight determines the priority a particular linkage takes over a similar linkage.
  • leaf : Contains the atomic information about the term, the tree it belongs to, etc. A weight is also added to fix the order of preference when multiple leaves line up for a particular request.
  • leaf_map : Used to match terms with an item, for example a content post.
  • leaf_hierachy : Contains the relationship between two leaves. The types of relationship currently supported are parent and peer, which translate respectively to the parent-child relationship that makes up a hierarchical tree and the peer-to-peer relationship that makes up a cluster tree.
  • leaf_alias : Contains the list of aliases for a leaf. It is essential when the taxonomy framework is used by components that interact directly with humans, where the responsibility of remembering the right term cannot be enforced. Under such circumstances aliases let words like college, colleges, etc. all resolve to the same term, university.
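
One way the six tables might fit together is sketched below with an in-memory SQLite database. The table names follow the list above (including its spelling of `leaf_hierachy`), but every column name, key and constraint is an assumption made for illustration, as are the helper functions and the extension name `com_content`; the actual schema may differ.

```python
import sqlite3

# Column names and constraints are assumptions, not the project's schema.
SCHEMA = """
CREATE TABLE tree (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                   hierarchy TEXT NOT NULL, controlled INTEGER NOT NULL);
CREATE TABLE tree_map (tree_id INTEGER REFERENCES tree(id), extension TEXT NOT NULL,
                       required INTEGER, multiple INTEGER, weight INTEGER DEFAULT 0,
                       PRIMARY KEY (tree_id, extension));
CREATE TABLE leaf (id INTEGER PRIMARY KEY, tree_id INTEGER REFERENCES tree(id),
                   name TEXT NOT NULL, weight INTEGER DEFAULT 0);
CREATE TABLE leaf_map (leaf_id INTEGER REFERENCES leaf(id), extension TEXT NOT NULL,
                       object_id INTEGER NOT NULL,
                       PRIMARY KEY (leaf_id, extension, object_id));
CREATE TABLE leaf_hierachy (leaf_id INTEGER REFERENCES leaf(id),
                            related_id INTEGER REFERENCES leaf(id),
                            relation TEXT CHECK (relation IN ('parent', 'peer')),
                            PRIMARY KEY (leaf_id, related_id, relation));
CREATE TABLE leaf_alias (alias TEXT PRIMARY KEY, leaf_id INTEGER REFERENCES leaf(id));
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# Hypothetical sample data: one flat tree mapped to one extension.
db.execute("INSERT INTO tree VALUES (1, 'topics', 'flat', 0)")
db.execute("INSERT INTO tree_map VALUES (1, 'com_content', 0, 1, 0)")
db.executemany("INSERT INTO leaf VALUES (?, 1, ?, ?)",
               [(1, "university", 0), (2, "research", 5)])
db.execute("INSERT INTO leaf_alias VALUES ('college', 1)")
db.executemany("INSERT INTO leaf_map VALUES (?, 'com_content', 42)", [(1,), (2,)])


def leaves_for(extension, object_id):
    """Leaf names attached to one object, lighter weights first."""
    rows = db.execute(
        """SELECT leaf.name FROM leaf_map
           JOIN leaf ON leaf.id = leaf_map.leaf_id
           WHERE leaf_map.extension = ? AND leaf_map.object_id = ?
           ORDER BY leaf.weight, leaf.name""",
        (extension, object_id))
    return [name for (name,) in rows]


def resolve(term):
    """Return the canonical leaf name for an alias, or the term itself."""
    row = db.execute(
        """SELECT leaf.name FROM leaf_alias
           JOIN leaf ON leaf.id = leaf_alias.leaf_id
           WHERE leaf_alias.alias = ?""", (term,)).fetchone()
    return row[0] if row else term


print(leaves_for("com_content", 42))  # ['university', 'research']
print(resolve("college"))             # university
```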

Design Considerations

Hierarchy / free terms :

There are two popular taxonomy structures: the hierarchical tree (mostly single hierarchy) and free tagging (flat hierarchy). Although these are quite sufficient for content management tasks, complex structures like albums with free tagging, user groups, user roles (for privilege-granting purposes), etc. require a variety of features. This is achieved by isolating a structure from its building blocks: hierarchies, extension mapping (and its properties) and tree properties.

Concentrated / Distributed

Sometimes it is necessary to focus on the features that serve 90% of uses, giving up total flexibility in order to boost performance.

In principle the design needs no tree at all, as in the domain name hierarchy, where everything starts at the 0th level (the root domain, the dot) and expands through 1st- and 2nd-level domains, all equal in representation. Similarly, leaves alone could have formed the taxonomy framework, with the ability to map an extension to a particular leaf that governs all its children.

However, we chose to implement multiple trees, sacrificing that flexibility for gains in performance and, more importantly, for support of multiple hierarchies, or more precisely multiple parents. This allows greater extensibility in breadth and depth.

Define here, use everywhere

Following the popular paradigm “write once, use many”, the taxonomy framework is expected to be the unified solution for all classification requirements. It must therefore be able to link with any implementation seamlessly. This is achieved by mapping a tree to a particular extension, thus making all the terms underneath available at the extension’s disposal.
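
As a rough illustration of defining a tree once and using it from several extensions, the sketch below maps one tree to two extensions and lists the terms each extension can see. The data, the function names and the extension names `com_content` and `com_weblinks` are hypothetical; the ordering follows the rule stated earlier that heavier weights sink.

```python
# (tree, extension, weight) rows, as a tree_map table might hold them.
TREE_MAP = [
    ("tags", "com_content", 10),
    ("categories", "com_content", 0),
    ("tags", "com_weblinks", 0),
]

# Leaves are defined once per tree, regardless of how many extensions use it.
LEAVES = {"tags": ["joomla", "gsoc"], "categories": ["news"]}


def trees_for(extension):
    """Trees mapped to an extension; heavier weights sink to the end."""
    rows = [(weight, tree) for tree, ext, weight in TREE_MAP if ext == extension]
    return [tree for weight, tree in sorted(rows)]


def terms_for(extension):
    """All terms an extension may use, grouped in tree order."""
    return [leaf for tree in trees_for(extension) for leaf in LEAVES[tree]]


print(trees_for("com_content"))   # ['categories', 'tags']
print(terms_for("com_weblinks"))  # ['joomla', 'gsoc']
```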

One Response

  1. ssnobben says:

    Great great great thoughts !

    I really love this work you put into Joomla and hope that core main dev will allow this as soon as possible for Joomla!