Ideas for analyzing and visualizing Mediawiki
Version analyzed: 1.26.2
Release date: Nov. 2015
Folder size: 96.5 MB.
1. Source code
Size of PHP-only files: 17.6 MB.
Analysis approach: Static program analysis
Besides the default Mediawiki statistics (https://www.openhub.net/p/mediawiki), we conducted some experiments that treat the programming code as text.
Figure 01. Word cloud of the eight first-level PHP files, approx. 3044 lines of code.
Generated with wordle.net (150 max words, no English common words)
The most frequent word is “DIR”, which appears 1418 times, mainly in the autoload.php file. DIR refers to __DIR__, one of the eight “magic constants” provided by PHP. __DIR__ was introduced in PHP 5.3 (June 2009) and resolves to the directory of the file in which it is used.
Figure 2 shows the word cloud generated if we manually remove “DIR” in Wordle. Figure 3 shows the result if we regenerate the whole cloud without the autoload.php file:
Figure 2. Manually removing the word DIR from the word cloud (left-click on the word within Wordle.net) | Figure 3. Regenerating the word cloud without the DIR word instances |
Table 1 explains the occurrences of the six most frequent words in Figure 3:
Word | Occurrences | Use |
File | 99 | Mainly in comments, as __FILE__ and as variable $file |
PHP | 71 | Mainly in comments or as document type, i.e. <?php |
Return | 71 | Mainly as the expected result of a command, i.e. return; |
Params | 68 | Mainly in comments and as a variable $params |
Name | 61 | In comments, as variable ($name) and echoed message |
Array | 58 | As data structuring and handling of parameters |
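The counts in Table 1 could be approximated outside Wordle with a short script. The following is a minimal sketch, not the procedure used for the figures: the tokenization rule, the case folding, and the stop-word list are assumptions, since Wordle's own rules are not documented here, and the function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a tiny illustrative stop-word list, not Wordle's actual list.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "or",
              "for", "if", "this", "that", "it", "be", "as", "on", "with"}


def tokenize(text):
    """Split text into alphabetic words, so __DIR__ yields 'DIR' and $file yields 'file'."""
    return re.findall(r"[A-Za-z]+", text)


def word_counts(texts, stop_words=STOP_WORDS):
    """Count case-insensitive word occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        for word in tokenize(text):
            lowered = word.lower()
            if lowered not in stop_words:
                counts[lowered] += 1
    return counts


def count_php_files(folder, skip=()):
    """Count words in the first-level *.php files of `folder`, skipping file names in `skip`."""
    paths = [p for p in sorted(Path(folder).glob("*.php")) if p.name not in skip]
    return word_counts(p.read_text(encoding="utf-8", errors="replace") for p in paths)


if __name__ == "__main__":
    sample = "<?php\n// This file returns the file name\n$file = __DIR__ . '/index.php';\nreturn $file;\n"
    print(word_counts([sample]).most_common(3))
```

Calling `count_php_files(folder, skip=("autoload.php",))` on a local copy of the source tree would correspond to the regeneration described for Figure 3. The totals may differ from those reported, because the counting rules are only guessed.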
Relationships among words in code
The following word tree shows the default visualization of branches.
Figure 04. Word tree of Mediawiki source code.
https://www.jasondavies.com/wordtree/
If we search for the word “file” (99 occurrences), we obtain the following tree (figure 4-b), which shows more clearly the different uses of the word according to its context in the code phrase.
Figure 04-b. Word tree of Mediawiki source code, starting with the word “file”.
https://www.jasondavies.com/wordtree/
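The idea behind a word tree can be sketched as a nested structure in which each branch holds the words that follow a chosen root word, together with how often that continuation occurs. This is an illustrative sketch only; it does not reproduce the implementation of the tool linked above, and the function names and token rule are hypothetical.

```python
import re


def build_word_tree(text, root, depth=3):
    """Return a nested dict of the words that follow `root`, up to `depth` words deep.

    Each node maps a word to {"count": n, "next": {...}}.
    """
    # Assumption: words are runs of letters, so $file and __FILE__ both match "file".
    tokens = re.findall(r"[A-Za-z]+", text)
    tree = {}
    for i, token in enumerate(tokens):
        if token.lower() != root.lower():
            continue
        node = tree
        for follower in tokens[i + 1 : i + 1 + depth]:
            entry = node.setdefault(follower, {"count": 0, "next": {}})
            entry["count"] += 1
            node = entry["next"]
    return tree


def render(tree, indent=0):
    """Return the tree as indented text lines, most frequent branches first."""
    lines = []
    for word, entry in sorted(tree.items(), key=lambda item: -item[1]["count"]):
        lines.append("  " * indent + f"{word} ({entry['count']})")
        lines.extend(render(entry["next"], indent + 1))
    return lines


if __name__ == "__main__":
    text = "file name is set. file name is empty. file path"
    print("\n".join(render(build_word_tree(text, "file", depth=2))))
```

Branches shared by many occurrences get a high count, which is what a word tree draws as larger text.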
The source code that served as input data for these text visualizations was, as mentioned before, PHP. Such code is processed on the server side, and users do not see variable or function names in the code rendered by the browser.
We can compare relationships between the server-side source code and the output delivered to the browser. To that end, the following picture shows the HTML source code of a given Wikipedia article, namely 4chan:
Figure 5
From the image we identify the HTML elements <span> and <cite>; the HTML attributes class and href; and the HTML character entity &amp; (for text encoding). “amp” could be related to the word params in the source code because the platform is highly customizable and, more interestingly, oriented towards external tools for handling and managing content. Regarding <cite> and href, they both relate to external links that support the verifiability principle championed by Wikipedia.
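The same kind of observation can be made by counting elements, attributes, and character entities instead of reading them off a picture. The sketch below uses Python's standard html.parser module. The class name and the sample markup are hypothetical and are not taken from the actual 4chan article.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupCounter(HTMLParser):
    """Count start tags, attribute names, and named character entities in text."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entities such as &amp;
        # through handle_entityref instead of silently decoding them.
        super().__init__(convert_charrefs=False)
        self.tags = Counter()
        self.attributes = Counter()
        self.entities = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attributes[name] += 1

    def handle_entityref(self, name):
        self.entities[name] += 1


if __name__ == "__main__":
    # Hypothetical fragment, for illustration only.
    sample = ('<cite class="citation web"><a href="https://example.org/">Source</a>'
              ' &amp; <span class="reference-text">note</span></cite>')
    counter = MarkupCounter()
    counter.feed(sample)
    counter.close()
    print(counter.tags.most_common())
    print(counter.attributes.most_common())
    print(counter.entities.most_common())
```

Run on the saved HTML of an article, the three counters give the frequencies of elements, attributes, and entities found in the rendered page.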
Maintainability and cyclomatic complexity
Using existing methods and tools for software metrics (most notably http://phpmetrics.org), we analysed the PHP code of core Mediawiki components. These metrics yield mainly two values for each file analyzed: cyclomatic complexity and maintainability index.
Due to technical limitations (the memory allocated to run the phar script from a terminal), we split the files by folder. Each figure shows the results for one group:
Fig. 05. Red: thumb, profileinfo, vectortemplate, monobook, skincologne; Yellow: api.php, img_auth.php, skinmodern; Green: opensearch.php, load.php, index.php, autoload.php
Fig. 06. Red: includes/actions/RawAction, includes/actions/InfoAction, includes/actions/HistoryAction; Yellow: /actions/CreditsAction, /actions/Action, /actions/RollbackAction, /actions/RevertAction, /actions/WatchAction; Green: /actions/SpecialPageAction, /actions/PurgeAction, /actions/MarkpatrolledAction
Fig. 07. Red: EditPage.php, GlobalFunctions.php, Title.php, OutputPage.php, Linker.php, Revision.php
Fig. 08. Red: maintenance/Maintenance, maintenance/CopyFileBackend, maintenance/importImages, maintenance/namespaceDupes
Fig. 09. Red: extensions/ParserFunctions/Expr, extensions/ParserFunctions/ParserFunctions_body, extensions/Cite/Cite_body, ConfirmEdit/Simple/Captcha
The same tool, phpmetrics, can also visualize a relation map of classes. In this case, the selected element is the class at the top of the image, EditPage, which is used by all other highlighted classes across the six files analyzed: EditPage.php, GlobalFunctions.php, Title.php, OutputPage.php, Linker.php, Revision.php.
Figure 10. Relational map of PHP classes
To test the above PHP metrics, a reasonable hypothesis is that the files represented as small yellow or green circles are edited and maintained less often than those shown as large red circles. We can verify this idea thanks to Phabricator, a collaborative platform for Mediawiki contributors.
We selected two PHP files from the source code: index.php and includes/EditPage.php. The first file was last edited four months before the analysis, while the latter was edited one day before it. From these two files we conclude that there is a certain correlation between the metrics and developer activity.
File | Cyclomatic complexity | Maintainability index | Last modified | Source |
index.php | 1 | 121 | Nov 15 2015, 8:14 PM | https://phabricator.wikimedia.org/diffusion/MW/browse/master/ |
includes/EditPage.php | 555 | 35 | Sat, Mar 19, 12:20 AM | https://phabricator.wikimedia.org/diffusion/MW/history/master/includes/EditPage.php |
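For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined. It is an approximation under stated assumptions and not the phpmetrics implementation: the complexity estimate counts decision keywords in the raw text (including those inside comments and strings, and ignoring ternary operators), and the maintainability index uses the classic formula 171 - 5.2 ln(V) - 0.23 CC - 16.2 ln(LOC), where V is the Halstead volume. Whether phpmetrics applies exactly this variant is an assumption.

```python
import math
import re

# Assumption: these tokens approximate the decision points of a PHP file.
# A real tool parses the code; this regex also counts keywords in comments and strings.
DECISION_PATTERN = re.compile(
    r"\b(?:if|elseif|for|foreach|while|case|catch)\b|&&|\|\|"
)


def approximate_cyclomatic_complexity(php_source):
    """Return 1 plus the number of decision points found in the source text."""
    return 1 + len(DECISION_PATTERN.findall(php_source))


def maintainability_index(halstead_volume, cyclomatic_complexity, lines_of_code):
    """Classic maintainability index; higher values mean easier maintenance."""
    return (171
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(lines_of_code))


if __name__ == "__main__":
    snippet = "<?php\nif ($a && $b) { foreach ($x as $y) { echo $y; } } elseif ($c) { return; }\n"
    print(approximate_cyclomatic_complexity(snippet))
```

Under these definitions, a file with many branches and many lines, such as includes/EditPage.php, is expected to score high on complexity and low on maintainability, which matches the direction of the values in the table.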
The complexity of the Mediawiki source code extends beyond mere access via web browsers. Indeed, the software feeds other tools that have been developed for multiple purposes: from monitoring and tracking changes to visualizing statistics and even to geolocating in real time the place where content edits come from.
From the standpoint of Wikipedia content, there is a difference between common users and renowned Wikipedia editors. Aaron Swartz conducted experiments showing that a small group of anonymous contributors is responsible for the core of substantial modifications, while the troops of registered editors concentrate on making content comply with editorial standards.
These editors are also known as the Recent Changes Patrol. There is currently a list of more than 30 tools of several types (desktop, mobile, web-based) dedicated to easing their editorial work (https://en.wikipedia.org/wiki/Wikipedia:Recent_changes_patrol#Tools), for example Wikipedia Vision (http://www.lkozma.net/wpv/index.html), Snuggle (https://snuggle-en.wmflabs.org/), and Vandal Fighter. These tools are part of Wikimedia Labs, which contains almost 300 tools, instantiated almost 900 times.
On the developer side, there are also differences and gaps. First, in order to become a developer, besides coding in PHP and mastering the magic words (https://www.mediawiki.org/wiki/Help:Magic_words), one only reaches the status of “maintainer” after a long process of recognition. The following step would be to become a system administrator (there are 95, of whom only 10 act as volunteers), but at this level one has to be appointed by the board of trustees (i.e. Jimmy Wales himself).
In conclusion, Mediawiki as software and Wikipedia as content are neither neutral nor entirely transparent.
2. Graphical user interface
2.1. Wikipedia template
Mediawiki offers a predefined template for articles, talk, history, user pages, etc.
We can simplify such templates as follows:
Wikipedia template in early 2004
Wikipedia template in early 2016
It is possible to appreciate that larger empty spaces are dedicated to content, while panes and lateral menus are dedicated to options and actions.
Among the main differences between versions, we believe it is important to mention:
A first intuition is that “View source” and “View history” are more relevant to potential contributors, so the options for them to log in and review their contribution activity appear closer, ready-to-hand, or simply as a reminder.
2.2. Different views of the “normal” page
Because Mediawiki and Wikipedia are web software, they depend on the rendering environment. We point out three cases:
2.2.1. The “normal” view
This is the Wikipedia page as seen by most users that have modern desktop web browsers running on modern operating systems.
2.2.2. Portable device view
This is the Wikipedia page as seen by most users using portable devices. One main difference to highlight is that “View history” is not natively accessible from smartphones (it can be accessed only if users change to desktop mode within the portable device web browser).
2.2.3. No style view
As is well known, web users might choose to display a web page without CSS styles. This is done mostly to test the layout for simple print purposes or to check the structure of HTML documents.
It is interesting to note that, in the case of Wikipedia, when we disable CSS styles the links to access “View history” and “Talk” are situated in the lower part of the content. This might be frustrating for occasional content editors.
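Without CSS, a browser lays elements out in the order in which they appear in the HTML source, so the position of a link in the unstyled page follows from its position in the markup. The sketch below measures where a link sits in source order. The sample markup, class name, and function name are hypothetical and do not reproduce real Mediawiki output.

```python
from html.parser import HTMLParser


class LinkOrder(HTMLParser):
    """Record the text of each <a> element in document (source) order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside_link = False

    def handle_data(self, data):
        if self._inside_link:
            self.links[-1] += data


def relative_position(html, link_text):
    """Return 0.0 if the link is the first in source order, 1.0 if it is the last."""
    parser = LinkOrder()
    parser.feed(html)
    parser.close()
    labels = [label.strip() for label in parser.links]
    index = labels.index(link_text)  # raises ValueError if the link is absent
    return index / max(len(labels) - 1, 1)


if __name__ == "__main__":
    # Hypothetical page: article content first, navigation links afterwards.
    page = ('<div id="content"><a href="/a">First link</a><a href="/b">Second link</a></div>'
            '<div id="navigation"><a href="/talk">Talk</a><a href="/history">View history</a></div>')
    print(relative_position(page, "View history"))
```

A value close to 1.0 for “View history” or “Talk” on a saved article page would be consistent with those links appearing near the bottom of the no-style view.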
2.2.4. “Listen to this article”
Wikipedia offers a spoken-word version of articles. This option can be accessed by clicking the “speaker” icon.
In the case of visually impaired users, there is a fundamental difference between the recently edited versions of an article and the spoken audio track. Specifically for the 4chan article, the latest audible version was recorded on October 10th, 2010.
Please refer to: https://en.wikipedia.org/wiki/File:4chan.ogg
2.2.5. Comparisons between views
Normal view | No-style view | Source code view |
2.3. Graphical interface interventions