Chapter 12. MediaWiki
Sumana Harihareswara and Guillaume Paumier

From the start, MediaWiki was developed specifically to be Wikipedia's software. Developers have worked to facilitate reuse by third-party users, but Wikipedia's influence and bias have shaped MediaWiki's architecture throughout its history.
Wikipedia is one of the top ten websites in the world, currently getting about 400 million unique visitors a month. It gets over 100,000 hits per second. Wikipedia isn't commercially supported by ads; it is entirely supported by a non-profit organization, the Wikimedia Foundation, which relies on donations as its primary funding model. This means that MediaWiki must not only run a top-ten website, but also do so on a shoestring budget. To meet these demands, MediaWiki has a heavy bias towards performance, caching and optimization. Expensive features that can't be enabled on Wikipedia are either reverted or disabled through a configuration variable; there is an endless balance between performance and features.
The influence of Wikipedia on MediaWiki's architecture isn't limited to performance. Unlike generic content management systems (CMSes), MediaWiki was originally written for a very specific purpose: supporting a community that creates and curates freely reusable knowledge on an open platform. This means, for example, that MediaWiki doesn't include regular features found in corporate CMSes, like a publication workflow or access control lists, but does offer a variety of tools to handle spam and vandalism.
So, from the start, the needs and actions of a constantly evolving community of Wikipedia participants have affected MediaWiki's development, and vice versa. The architecture of MediaWiki has been driven many times by initiatives started or requested by the community, such as the creation of Wikimedia Commons, or the Flagged Revisions feature. Developers made major architectural changes because the way that MediaWiki was used by Wikipedians made it necessary.
MediaWiki has also gained a solid external user base by being open source software from the beginning. Third-party reusers know that, as long as such a high-profile website as Wikipedia uses MediaWiki, the software will be maintained and improved. MediaWiki used to be really focused on Wikimedia sites, but efforts have been made to make it more generic and better accommodate the needs of these third-party users. For example, MediaWiki now ships with an excellent web-based installer, making the installation process much less painful than when everything had to be done via the command line and the software contained hardcoded paths for Wikipedia.
Still, MediaWiki is and remains Wikipedia's software, and this shows throughout its history and architecture.
This chapter is organized as follows:
- Historical Overview gives a short overview of the history of MediaWiki, or rather its prehistory, and the circumstances of its creation.
- MediaWiki Code Base and Practices explains the choice of PHP, the importance and implementation of secure code, and how general configuration is handled.
- Database and Text Storage dives into the distributed data storage system, and how its structure evolved to accommodate growth.
- Requests, Caching and Delivery follows the execution of a web request through the components of MediaWiki it activates, including the different caching layers and the asset delivery system.
- Languages details the pervasive internationalization and localization system, why it matters, and how it is implemented.
- Users presents how users are represented in the software, and how user permissions work.
- Content details how content is structured, formatted and processed to generate the final HTML; a subsection focuses on how MediaWiki handles media files.
- Customizing and Extending MediaWiki explains how JavaScript, CSS, extensions, and skins can be used to customize a wiki, and how they modify its appearance and behavior; a subsection presents the software's machine-readable web API.
12.1. Historical Overview

Phase I: UseModWiki

Wikipedia was launched in January 2001. At the time, it was mostly an experiment to try to boost the production of content for Nupedia, a free-content, but peer-reviewed, encyclopedia created by Jimmy Wales. Because it was an experiment, Wikipedia was originally powered by UseModWiki, an existing GPL wiki engine written in Perl, using CamelCase for links and storing all pages in individual text files with no history of changes made.
It soon appeared that CamelCase wasn't really appropriate for naming encyclopedia articles. In late January 2001, UseModWiki developer and Wikipedia participant Clifford Adams added a new feature to UseModWiki: free links; i.e., the ability to link to pages with a special syntax (double square brackets), instead of automatic CamelCase linking. A few weeks later, Wikipedia upgraded to the new version of UseModWiki supporting free links, and enabled them.
While this initial phase isn't about MediaWiki per se, it provides some context and shows that, even before MediaWiki was created, Wikipedia started to shape the features of the software that powered it. UseModWiki also influenced some of MediaWiki's features; for example, its markup language. The Nostalgia Wikipedia contains a complete copy of the Wikipedia database from December 2001, when Wikipedia still used UseModWiki.
Phase II: The PHP Script

In 2001, Wikipedia was not yet a top ten website; it was an obscure project sitting in a dark corner of the Interwebs, unknown to most search engines, and hosted on a single server. Still, performance was already an issue, notably because UseModWiki stored its content in a flat file database. At the time, Wikipedians were worried about being inundated with traffic following articles in the New York Times, Slashdot and Wired.
So in summer 2001, Wikipedia participant Magnus Manske (then a university student) started to work on a dedicated Wikipedia wiki engine in his free time. He aimed to improve Wikipedia's performance using a database-driven app, and to develop Wikipedia-specific features that couldn't be provided by a "generic" wiki engine. Written in PHP and MySQL-backed, the new engine was simply called the "PHP script", "PHP wiki", "Wikipedia software" or "phase II".
The PHP script was made available in August 2001, shared on SourceForge in September, and tested until late 2001. As Wikipedia suffered from recurring performance issues because of increasing traffic, the English language Wikipedia eventually switched from UseModWiki to the PHP script in January 2002. Other language versions also created in 2001 were slowly upgraded as well, although some of them would remain powered by UseModWiki until 2004.
As PHP software using a MySQL database, the PHP script was the first iteration of what would later become MediaWiki. It introduced many critical features still in use today, like namespaces to organize content (including talk pages), skins, and special pages (including maintenance reports, a contributions list and a user watchlist).
Phase III: MediaWiki

Despite the improvements from the PHP script and database backend, the combination of increasing traffic, expensive features and limited hardware continued to cause performance issues on Wikipedia. In 2002, Lee Daniel Crocker rewrote the code again, calling the new software "Phase III" (http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/2794). Because the site was experiencing frequent difficulties, Lee thought there "wasn't much time to sit down and properly architect and develop a solution", so he "just reorganized the existing architecture for better performance and hacked all the code". Profiling features were added to track down slow functions.
The Phase III software kept the same basic interface, and was designed to look and behave as much like the Phase II software as possible. A few new features were also added, like a new file upload system, side-by-side diffs of content changes, and interwiki links.
Other features were added over 2002, like new maintenance special pages, and the "edit on double click" option. Performance issues quickly reappeared, though. For example, in November 2002, administrators had to temporarily disable the "view count" and "site" statistics which were causing two database writes on every page view. They would also occasionally switch the site to read-only mode to maintain the service for readers, and disable expensive maintenance pages during high-access times because of table locking problems.
In early 2003, developers discussed whether they should properly re-engineer and re-architect the software from scratch, before the fire-fighting became unmanageable, or continue to tweak and improve the existing code base. They chose the latter solution, mostly because most developers were sufficiently happy with the code base, and confident enough that further iterative improvements would be enough to keep up with the growth of the site.
In June 2003, administrators added a second server, the first database server separate from the web server. (The new machine was also the web server for non-English Wikipedia sites.) Load-balancing between the two servers would be set up later that year. Admins also enabled a new page-caching system that used the file system to cache rendered, ready-to-output pages for anonymous users.
June 2003 is also when Jimmy Wales created the non-profit Wikimedia Foundation to support Wikipedia and manage its infrastructure and day-to-day operations. The "Wikipedia software" was officially named "MediaWiki" in July, as wordplay on the Wikimedia Foundation's name. What was thought at the time to be a clever pun would confuse generations of users and developers.
New features were added in July, like the automatically generated table of contents and the ability to edit page sections, both still in use today. The first release under the name "MediaWiki" happened in August 2003, concluding the long genesis of an application whose overall structure would remain fairly stable from there on.
12.2. MediaWiki Code Base and Practices

PHP

PHP was chosen as the framework for Wikipedia's "Phase II" software in 2001; MediaWiki has grown organically since then, and is still evolving. Most MediaWiki developers are volunteers contributing in their free time, and there were very few of them in the early years. Some software design decisions or omissions may seem wrong in retrospect, but it's hard to criticize the founders for not implementing some abstraction which is now found to be critical, when the initial code base was so small, and the time taken to develop it so short.
For example, MediaWiki uses unprefixed class names, which can cause conflicts when PHP core and PECL (PHP Extension Community Library) developers add new classes: MediaWiki's Namespace class had to be renamed to MWNamespace to be compatible with PHP 5.3. Consistently using a prefix for all classes (e.g., "MW") would have made it easier to embed MediaWiki inside another application or library.
Relying on PHP was probably not the best choice for performance, since it has not benefitted from improvements that some other dynamic languages have seen. Using Java would have been much better for performance, and simplified execution scaling for back-end maintenance tasks. On the other hand, PHP is very popular, which facilitates recruiting new developers.
Even if MediaWiki still contains "ugly" legacy code, major improvements have been made over the years, and new architectural elements have been introduced to MediaWiki throughout its history. They include the Parser, SpecialPage, and Database classes, the Image class and the FileRepo class hierarchy, ResourceLoader, and the Action hierarchy. MediaWiki started without any of these things, but all of them support features that have been around since the beginning. Many developers are interested primarily in feature development and architecture is often left behind, only to catch up later as the cost of working within an inadequate architecture becomes apparent.
Security

Because MediaWiki is the platform for high-profile sites such as Wikipedia, core developers and code reviewers have enforced strict security rules. (See the detailed guide.) To make it easier to write secure code, MediaWiki gives developers wrappers around HTML output and database queries to handle escaping. To sanitize user input, a developer uses the WebRequest class, which analyzes data passed in the URL or via a POSTed form. It removes "magic quotes" and slashes, strips illegal input characters and normalizes Unicode sequences. Cross-site request forgery (CSRF) is avoided by using tokens, and cross-site scripting (XSS) by validating inputs and escaping outputs, usually with PHP's htmlspecialchars() function. MediaWiki also provides (and uses) an XHTML sanitizer with the Sanitizer class, and database functions that prevent SQL injection.
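As a minimal sketch (not taken from MediaWiki's code) of how these wrappers fit together, the following assumes the WebRequest, User and Html interfaces of this era; the 'target' parameter name is made up:

```php
<?php
// Read user input through WebRequest instead of $_GET/$_POST directly;
// it strips slashes and illegal characters and normalizes Unicode.
$target = $wgRequest->getText( 'target' );

// Guard a state-changing request against CSRF with an edit token.
if ( !$wgUser->matchEditToken( $wgRequest->getVal( 'token' ) ) ) {
    // reject the request as a possible forgery
}

// Escape on output to prevent XSS: Html::element() escapes both the
// attribute values and the text content it is given.
echo Html::element( 'span', array( 'class' => 'target' ), $target );
```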
Configuration

MediaWiki offers hundreds of configuration settings, stored in global PHP variables. Their default value is set in DefaultSettings.php, and the system administrator can override them by editing LocalSettings.php.
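For instance, a LocalSettings.php fragment might override a few of these defaults; the variable names below are real settings whose defaults live in DefaultSettings.php, and the values are only illustrative:

```php
<?php
$wgSitename      = "My Wiki";  // site name displayed throughout the UI
$wgLanguageCode  = "fr";       // default content language
$wgEnableUploads = true;       // allow users to upload media files
```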
MediaWiki used to over-depend on global variables, including for configuration and context processing. Globals cause serious security problems in combination with PHP's register_globals setting (which MediaWiki hasn't needed since version 1.2). This system also limits potential abstractions for configuration, and makes it more difficult to optimize the start-up process. Moreover, the configuration namespace is shared with variables used for registration and object context, leading to potential conflicts. From a user perspective, global configuration variables have also made MediaWiki seem difficult to configure and maintain. MediaWiki development has been a story of slowly moving context out of global variables and into objects. Storing processing context in object member variables allows those objects to be reused in a much more flexible way.
12.3. Database and Text Storage

MediaWiki has been using a relational database backend since the Phase II software. The default (and best-supported) database management system (DBMS) for MediaWiki is MySQL, which is the one that all Wikimedia sites use, but other DBMSes (such as PostgreSQL, Oracle, and SQLite) have community-supported implementations. A sysadmin can choose a DBMS while installing MediaWiki, and MediaWiki provides both a database abstraction and a query abstraction layer that simplify database access for developers.
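As a rough sketch of what the abstraction layer looks like to a developer (assuming the wfGetDB() and select() interfaces of this era; the query itself is only an example):

```php
<?php
// wfGetDB() returns a Database object; select() builds the SQL and
// escapes values for whichever DBMS the wiki is configured to use.
$dbr = wfGetDB( DB_SLAVE );                    // a read connection
$res = $dbr->select(
    'page',                                    // table
    array( 'page_id', 'page_title' ),          // fields to fetch
    array( 'page_namespace' => 0 ),            // WHERE conditions, escaped
    __METHOD__                                 // caller name, for profiling
);
foreach ( $res as $row ) {
    // $row->page_id and $row->page_title are now available
}
```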
Figure 12.1: Database schema

The current layout contains dozens of tables. Many are about the wiki's content (e.g., page, revision, category, and recentchanges). Other tables include data about users (user, user_groups), media files (image, filearchive), caching (objectcache, l10n_cache, querycache) and internal tools (job for the job queue), among others, as shown in Figure 12.1. (Complete documentation of the database layout in MediaWiki is available.) Indices and summary tables are used extensively in MediaWiki, since SQL queries that scan huge numbers of rows can be very expensive, particularly on Wikimedia sites. Unindexed queries are usually discouraged.
The database went through dozens of schema changes over the years, the most notable being the decoupling of text storage and revision tracking in MediaWiki 1.5.
Figure 12.2: Main content tables in MediaWiki 1.4 and 1.5

In the 1.4 model, the content was stored in two important tables, cur (containing the text and metadata of the current revision of the page) and old (containing previous revisions); deleted pages were kept in archive. When an edit was made, the previously current revision was copied to the old table, and the new edit was saved to cur. When a page was renamed, the page title had to be updated in the metadata of all the old revisions, which could be a long operation. When a page was deleted, its entries in both the cur and old tables had to be copied to the archive table before being deleted; this meant moving the text of all revisions, which could be very large and thus take time.
In the 1.5 model, revision metadata and revision text were split: the cur and old tables were replaced with page (pages' metadata), revision (metadata for all revisions, old or current) and text (text of all revisions, old, current or deleted). Now, when an edit is made, revision metadata don't need to be copied around tables: inserting a new entry and updating the page_latest pointer is enough. Also, the revision metadata don't include the page title anymore, only its ID: this removes the need to update all of a page's revisions when it is renamed.
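A simplified sketch of the writes this involves follows; it is not MediaWiki's actual edit code, but the table and column names match the 1.5 schema described above ($newText, $pageId and $summary are assumed to be in hand):

```php
<?php
$dbw = wfGetDB( DB_MASTER );                   // writes go to the master

// 1. Store the new text.
$dbw->insert( 'text',
    array( 'old_text' => $newText, 'old_flags' => 'utf-8' ), __METHOD__ );
$textId = $dbw->insertId();

// 2. Record the revision metadata, pointing at the stored text.
$dbw->insert( 'revision',
    array( 'rev_page' => $pageId, 'rev_text_id' => $textId,
           'rev_comment' => $summary ), __METHOD__ );
$revId = $dbw->insertId();

// 3. Move the page's pointer; nothing is copied between tables.
$dbw->update( 'page',
    array( 'page_latest' => $revId ),
    array( 'page_id' => $pageId ), __METHOD__ );
```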
The revision table stores metadata for each revision, but not the revision text; instead, each row contains a text ID pointing to the text table, which holds the actual text. When a page is deleted, the text of all revisions of the page stays there and doesn't need to be moved to another table. The text table is composed of a mapping of IDs to text blobs; a flags field indicates if the text blob is gzipped (for space savings) or if the text blob is only a pointer to external text storage. Wikimedia sites use a MySQL-backed external storage cluster with blobs of a few dozen revisions. The first revision in a blob is stored in full, and subsequent revisions of the same page are stored as diffs relative to the previous revision; the blobs are then gzipped. Because the revisions are grouped per page, they tend to be similar, so the diffs are relatively small and gzip works well. The compression ratio achieved on Wikimedia sites nears 98%.
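Loosely following what Revision::getRevisionText() does, interpreting that flags field might look like this simplified sketch (the external-storage fetch is omitted):

```php
<?php
$flags = explode( ',', $row->old_flags );      // e.g. "gzip" or "external"
$text  = $row->old_text;
if ( in_array( 'external', $flags ) ) {
    // old_text is only a pointer such as "DB://cluster1/12345";
    // the actual blob must be fetched from external storage.
}
if ( in_array( 'gzip', $flags ) ) {
    $text = gzinflate( $text );                // stored compressed
}
```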
On the hardware side, MediaWiki has built-in support for load balancing, added as early as 2004 in MediaWiki 1.2 (when Wikipedia got its second server—a big deal at the time). The load balancer (MediaWiki's PHP code that decides which server to connect to) is now a critical part of Wikimedia's infrastructure, which explains its influence on some algorithm decisions in the code. The system administrator can specify, in MediaWiki's configuration, that there is one master database server and any number of slave database servers; a weight can be assigned to each server. The load balancer will send all writes to the master, and will balance reads according to the weights. It also keeps track of the replication lag of each slave. If a slave's replication lag exceeds 30 seconds, it will not receive any read queries to allow it to catch up; if all slaves are lagged more than 30 seconds, MediaWiki will automatically put itself in read-only mode.
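A hypothetical three-server setup could be described with the real $wgDBservers setting (hostnames and credentials here are made up):

```php
<?php
$wgDBservers = array(
    // The first entry is the master; load 0 keeps ordinary reads off it.
    array( 'host' => 'db1.example.org', 'dbname' => 'wikidb',
           'user' => 'wikiuser', 'password' => 'secret',
           'type' => 'mysql', 'load' => 0 ),
    // Slaves share the read load according to their weights.
    array( 'host' => 'db2.example.org', 'dbname' => 'wikidb',
           'user' => 'wikiuser', 'password' => 'secret',
           'type' => 'mysql', 'load' => 1 ),
    array( 'host' => 'db3.example.org', 'dbname' => 'wikidb',
           'user' => 'wikiuser', 'password' => 'secret',
           'type' => 'mysql', 'load' => 1 ),
);
```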
MediaWiki's "chronology protector" ensures that replication lag never causes a user to see a page that claims an action they've just performed hasn't happened yet: for instance, if a user renames a page, another user may still see the old name, but the one who renamed will always see the new name, because he's the one who renamed it. This is done by storing the master's position in the user's session if a request they made resulted in a write query. The next time the user makes a read request, the load balancer reads this position from the session, and tries to select a slave that has caught up to that replication position to serve the request. If none is available, it will wait until one is. It may appear to other users as though the action hasn't happened yet, but the chronology remains consistent for each user.