Ideas for analyzing and visualizing Mediawiki

Version analyzed: 1.26.2

Release date: Nov. 2015

Folder size: 96.5 MB.

1. Source code

Size of PHP-only files: 17.6 MB.

Analysis approach: Static program analysis

Besides the default Mediawiki statistics (https://www.openhub.net/p/mediawiki), we conducted some experiments that depict the program code as text.

Figure 01. Word cloud of the eight first-level PHP files, approx. 3044 lines of code.

Generated with wordle.net (150 max words, no English common words)

The most frequent word is “DIR”, which appears 1418 times, mainly in the autoload.php file. DIR refers to __DIR__, one of the eight “magic constants” provided by PHP. Introduced in PHP 5.3 (June 2009), __DIR__ evaluates to the directory of the file in which it appears.

Figure 2 shows the word cloud obtained when we manually remove “DIR” in Wordle. Figure 3 shows the result when we regenerate the whole cloud without the autoload.php file:
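The word counts behind such a cloud can be approximated with a short script. The following is a minimal sketch, not the procedure used by wordle.net: the stop list, the case-insensitive tokenization, and the folder name in the usage note are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical stop list; Wordle's own list of common English words is not reproduced here.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "or", "for", "if", "this", "that", "it"}


def count_words(text, stop_words=STOP_WORDS):
    """Count alphabetic tokens case-insensitively, skipping stop words.

    Underscores and dollar signs are dropped, so __DIR__ counts as "dir"
    and $file counts as "file".
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(t.lower() for t in tokens if t.lower() not in stop_words)


def count_words_in_files(folder, exclude=()):
    """Sum word counts over the *.php files directly inside `folder`."""
    total = Counter()
    for path in sorted(Path(folder).glob("*.php")):
        if path.name in exclude:
            continue  # e.g. skip autoload.php to regenerate the cloud without it
        total += count_words(path.read_text(encoding="utf-8", errors="replace"))
    return total


if __name__ == "__main__":
    # Illustrative line of PHP, used only to show the tokenization.
    sample = "require __DIR__ . '/includes/WebStart.php';"
    print(count_words(sample).most_common(5))
```

Assuming the release is unpacked in a folder named mediawiki-1.26.2 (a hypothetical path), count_words_in_files("mediawiki-1.26.2", exclude={"autoload.php"}) would count the first-level files without autoload.php. The totals will not necessarily match Wordle's, since its tokenization rules may differ.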

Figure 2.

Manually removing the word DIR from the word cloud (left-click on the word in Wordle.net)

Figure 3.

Regenerating the word cloud without DIR word instances

Table 1 explains the occurrences of the six most frequent words in Figure 3:

Word | Occurrences | Use
File | 99 | Mainly in comments, as __FILE__ and as variable $file
PHP | 71 | Mainly in comments or as document type, i.e. <?php
Return | 71 | Mainly as expected result of a command, i.e. return;
Params | 68 | Mainly in comments and as a variable $params
Name | 61 | In comments, as variable ($name) and echoed message
Array | 58 | As data structuring and handling of parameters

Relationships among words in code

The following word tree depicts the default visualization of branches. 

Figure 04. Word tree of Mediawiki source code.

https://www.jasondavies.com/wordtree/ 

If we search for the word “file” (99 occurrences), we obtain the following tree (Figure 04-b), which shows more clearly the different uses of the word according to its context in each line of code.

Figure 04-b. Word tree of Mediawiki source code, starting with the word “file”.

https://www.jasondavies.com/wordtree/ 
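The idea behind a word tree rooted at a given word can be sketched as a nested dictionary of the words that follow each occurrence. This is only an illustration under simple assumptions (alphabetic tokens, a fixed depth), not the algorithm used by the wordtree tool; the function names are hypothetical.

```python
import re


def build_word_tree(text, root, depth=3):
    """Nest the words that follow each occurrence of `root`, up to `depth` words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    tree = {}
    for i, token in enumerate(tokens):
        if token.lower() != root.lower():
            continue
        node = tree
        # Walk down the tree, creating one level per following word.
        for following in tokens[i + 1 : i + 1 + depth]:
            node = node.setdefault(following, {})
    return tree


def print_tree(tree, indent=0):
    """Print the tree with two spaces of indentation per level."""
    for word, children in tree.items():
        print("  " * indent + word)
        print_tree(children, indent + 1)


if __name__ == "__main__":
    # Made-up sample text, not taken from the Mediawiki source.
    sample = "the file does not exist. the file is missing. the file is a directory"
    print_tree(build_word_tree(sample, "file", depth=2))
```

Branches that share a prefix (here “is missing” and “is a”) are merged under a common node, which is what lets the visualization group the different contexts of a word.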

The source code that served as input for these text visualizations was, as mentioned before, PHP. Such code is processed on the server side, so users do not see variable or function names in the HTML rendered by the browser.

We can compare the server-side source code with the output it generates. The following picture shows the HTML source code of a given Wikipedia article, namely 4chan:

Figure 5

From the image we identify the HTML elements <span> and <cite>, the HTML attributes class and href, and the HTML character entity &amp; (which encodes the ampersand). amp could be related to the word params in the source code because the platform is highly customizable and, more interestingly, oriented towards external tools for handling and managing content. Regarding <cite> and href, both relate to external links that support the verifiability principle championed by Wikipedia.
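The same observation can be made without a screenshot by counting elements, attributes and entities in the page source. The sketch below uses Python's standard html.parser module; the sample markup is invented for illustration and is not copied from the 4chan article.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and attribute names while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attributes = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attributes[name] += 1


def count_markup(html):
    """Return (tag counts, attribute counts, number of &amp; entities)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    # The parser decodes entities, so &amp; is counted on the raw text.
    return parser.tags, parser.attributes, html.count("&amp;")


if __name__ == "__main__":
    # Hypothetical fragment resembling a citation.
    sample = (
        '<p><cite class="citation web">'
        '<a href="http://example.org/?a=1&amp;b=2">Example</a></cite> '
        '<span class="reference-accessdate">Retrieved</span></p>'
    )
    tags, attributes, entities = count_markup(sample)
    print(tags.most_common(), attributes.most_common(), entities)
```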

Maintainability and cyclomatic complexity

Using existing methods and tools for software metrics (most notably http://phpmetrics.org), we analysed the PHP code of core components of Mediawiki. These metrics yield mainly two values for each file analyzed: cyclomatic complexity and maintainability index.

Due to technical limitations (the memory allocated to run the phar script from a terminal), we split the files by folder. Each figure shows the results for one group:

Fig. 05

Red: thumb, profileinfo, vectortemplate, monobook, skincologne

Yellow: api.php, img_auth.php, skinmodern

Green: opensearch.php, load.php, index.php, autoload.php

Fig. 06

Red: includes/actions/RawAction, includes/actions/InfoAction, includes/actions/HistoryAction

Yellow: /actions/CreditsAction, /actions/Action, /actions/RollbackAction, /actions/RevertAction, /actions/WatchAction

Green: /actions/SpecialPageAction, /actions/PurgeAction, /actions/MarkpatrolledAction

Fig. 07

Red: EditPage.php, GlobalFunctions.php, Title.php, OutputPage.php, Linker.php, Revision.php

Fig. 08

Red: maintenance/Maintenance, maintenance/CopyFileBackend, maintenance/importImages, maintenance/namespaceDupes

Fig. 09

Red: extensions/ParserFunctions/Expr, extensions/ParserFunctions/ParserFunctions_body, extensions/Cite/Cite_body, ConfirmEdit/Simple/Captcha
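To give an intuition of what cyclomatic complexity measures, it can be roughly approximated as one plus the number of decision points in a file. The sketch below counts branching keywords with a regular expression. It is a deliberately naive assumption-laden approximation, not the method used by phpmetrics, which parses the code: the keyword list is an assumption, ternary operators are ignored, and keywords inside string literals are miscounted.

```python
import re

# Branching keywords and operators counted as decision points (assumed list).
DECISION_PATTERN = re.compile(
    r"\b(?:if|elseif|for|foreach|while|case|catch)\b|&&|\|\|"
)
# Block comments, // comments and # comments; string literals are not handled.
COMMENT_PATTERN = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)


def approx_cyclomatic_complexity(php_source):
    """Return 1 + the number of decision points found outside comments."""
    code = COMMENT_PATTERN.sub("", php_source)
    return 1 + len(DECISION_PATTERN.findall(code))


# Made-up PHP fragment, not taken from the Mediawiki source.
SAMPLE = """<?php
function pick($a, $b) {
    if ($a && $b) {          // counted: if, &&
        return 1;
    } elseif ($a) {          // counted: elseif
        return 2;
    }
    foreach ($b as $x) { }   // counted: foreach
    return 0;
}
"""

if __name__ == "__main__":
    print(approx_cyclomatic_complexity(SAMPLE))  # four decision points + 1 = 5
```

A file with no branching at all scores 1, which is consistent with a value of 1 for a short entry script, while large classes with many conditionals accumulate hundreds of decision points.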

The same tool, phpmetrics, can also visualize a relation map of classes. In this case, the selected element is the class at the top of the image, EditPage, which is used by all the other highlighted classes across the six files analyzed: EditPage.php, GlobalFunctions.php, Title.php, OutputPage.php, Linker.php, Revision.php.

Figure 10. Relational map of PHP classes

To test the above PHP metrics, a reasonable hypothesis is that files represented as small yellow or green circles are edited less often than those represented as big red circles. We can check this idea with Phabricator, a collaborative platform for Mediawiki contributors.

We selected two PHP files from the source code: index.php and includes/EditPage.php. The first file was last edited four months before the analysis, while the latter was edited one day before it. Although two files are too few to generalize, this suggests some correlation between the metrics and developer activity.

File: index.php

Cyclomatic complexity = 1

Maintainability index = 121

Last modified: Nov 15 2015, 8:14 PM

https://phabricator.wikimedia.org/diffusion/MW/browse/master/

File: includes/EditPage.php

Cyclomatic complexity = 555

Maintainability index = 35

Last modified: Sat, Mar 19, 12:20 AM

https://phabricator.wikimedia.org/diffusion/MW/history/master/includes/EditPage.php
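The inverse relation between the two values listed above follows from how the maintainability index is commonly defined: it decreases as Halstead volume, cyclomatic complexity and lines of code grow. The sketch below implements the classic formula; phpmetrics may apply a variant (for instance with an additional comment weight), and the input values in the usage example are hypothetical, not measurements of Mediawiki files.

```python
import math


def maintainability_index(halstead_volume, cyclomatic_complexity, lines_of_code):
    """Classic maintainability index: 171 - 5.2 ln(V) - 0.23 CC - 16.2 ln(LOC)."""
    return (
        171
        - 5.2 * math.log(halstead_volume)
        - 0.23 * cyclomatic_complexity
        - 16.2 * math.log(lines_of_code)
    )


if __name__ == "__main__":
    # Hypothetical inputs chosen only to contrast a small file with a large one.
    small = maintainability_index(halstead_volume=200, cyclomatic_complexity=1, lines_of_code=20)
    large = maintainability_index(halstead_volume=50000, cyclomatic_complexity=555, lines_of_code=3000)
    print(round(small, 1), round(large, 1))
```

Under this definition, a file with very high cyclomatic complexity necessarily scores low on maintainability, so the two metrics are not independent observations.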

The complexity of the Mediawiki source code extends beyond mere access via web browsers. Indeed, the software feeds other tools that have been developed for multiple purposes: from monitoring and tracking changes to visualizing statistics and even geolocating in real time the place where content edits come from.

From the standpoint of Wikipedia content, there is a difference between common users and renowned Wikipedia editors. Aaron Swartz conducted experiments to show how a small part of anonymous contributors are responsible for the core of substantial modifications, while the troops of registered editors concentrate on making content comply with editorial standards.

These editors are also known as the Recent Changes Patrol. Currently there is a list of more than 30 tools of several types (desktop, mobile, web-based) dedicated to easing their editorial work (https://en.wikipedia.org/wiki/Wikipedia:Recent_changes_patrol#Tools), for example: Wikipedia Vision: http://www.lkozma.net/wpv/index.html, Snuggle: https://snuggle-en.wmflabs.org/, and Vandal Fighter. These tools are part of the Wikimedia Labs, which contains almost 300 tools, instantiated almost 900 times.

On the developers' side, there are also differences and gaps. First, in order to become a developer, besides coding in PHP and mastering the magic words (https://www.mediawiki.org/wiki/Help:Magic_words), one only reaches the status of “maintainer” after a long process of recognition. The following step would be to become a system administrator (there are 95, of whom only 10 act as volunteers), but at this level one has to be appointed by the board of trustees (which includes Jimmy Wales).

In conclusion, Mediawiki as software and Wikipedia as content are neither neutral nor entirely transparent.

2. Graphical user interface

2.1. Wikipedia template

Mediawiki offers a predefined template for articles, talk, history, user pages, etc.

We can simplify such templates as follows:

Wikipedia template in early 2004

Wikipedia template in early 2016

We can see that the larger empty areas are dedicated to content, while panes and side menus are dedicated to options and actions.

Among the main differences between versions, we believe it is important to mention:

A first intuition is that “View source” and “View history” are more relevant to potential contributors, so the options for logging in and reviewing their contribution activity appear closer, ready-to-hand, or simply as a reminder.

2.2. Different views of the “normal” page

Because Mediawiki/Wikipedia are web software, they depend on the rendering environment. We point out the following cases:

2.2.1. The “normal” view

This is the Wikipedia page as seen by most users who have modern desktop web browsers running on modern operating systems.

2.2.2. Portable device view

This is the Wikipedia page as seen by most users using portable devices. One main difference to highlight is that “View history” is not natively accessible from smartphones (it can be accessed only if users change to desktop mode within the portable device web browser).

2.2.3. No style view

As is well known, web users may choose to display a web page without CSS styles. This is done mostly to test the layout for simple print purposes or to check the structure of HTML documents.

It is interesting to note that, in the case of Wikipedia, when we disable CSS styles the links to access “View history” and “Talk” are placed in the lower part of the content. This might be frustrating for occasional content editors.
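Without CSS, a browser lays elements out in the order in which they appear in the HTML source, so the position of a link in the unstyled page can be estimated by listing links in document order. The sketch below does this with Python's standard html.parser module; the sample markup is a made-up simplification and not Wikipedia's actual page structure.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the visible text of each <a> element in document order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append("".join(self._text).strip())
            self._in_link = False


def link_order(html):
    """Return link texts in the order they appear in the source."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical page: content first, navigation links afterwards.
    sample = (
        '<div id="content"><a href="/wiki/A">A</a> <a href="/wiki/B">B</a></div>'
        '<div id="nav"><a href="?action=history">View history</a></div>'
    )
    links = link_order(sample)
    print(links.index("View history"), "of", len(links))
```

A link whose index is close to the total number of links sits near the bottom of the unstyled page, whatever position the stylesheet gives it in the normal view.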

2.2.4. “Listen to this article”

Wikipedia offers a spoken-word version of articles. This option can be accessed by clicking the “speaker” icon labeled “Listen to this article”.

For visually impaired users, there can be a substantial gap between the most recently edited version of an article and the spoken audio track. Specifically for the 4chan article, the latest audio version was recorded on October 10th, 2010.

Please refer to: https://en.wikipedia.org/wiki/File:4chan.ogg

2.2.5. Comparisons between views

Normal view

No-style view

Source code view

2.3. Graphical interface interventions