The Firefox accessibility engine is responsible for providing assistive technologies like screen readers with the information they need to access web page content.
For the past couple of years, the Firefox accessibility team have been working on a major re-architecture of the accessibility engine to significantly improve its speed, reliability and maintainability.
We call this project “Cache the World”.
In this post, I explain the reasons for such a massive undertaking and describe how the new architecture solves these issues.
The need for speed
The biggest motivation for this project is to make Firefox faster when used with screen readers and other assistive technologies, particularly on Windows.
Let’s start by taking a look at some numbers.
The table below provides the approximate time taken to perform various tasks with Firefox and the NVDA screen reader, both before and after this re-architecture.
| Task | Before (no cache) | After (with cache) |
| --- | --- | --- |
| Load nsCSSFrameConstructor.cpp on Searchfox, which contains a table with over 12,000 rows | | |
| Load the WHATWG HTML spec, a very large document | | |
| Open a Gmail message from the inbox | | |
| Close a Gmail message, returning to the inbox | | |
| Switch Slack channels | | |
These times will differ widely depending on the computer used, whether the page has been loaded before, network speed, etc.
However, the relative comparison should give you some idea of the performance improvements provided by the new architecture.
So, why were things so slow in the first place?
To understand that, we must first take a little trip through browser history.
Note that I’ve glossed over some minor details below for the sake of brevity and simplicity.
In the beginning
Once upon a time, browsers were much simpler than they are now.
The browser was a single operating system process.
Even if there were multiple tabs or documents with iframes, everything still happened within a single process.
This worked reasonably well for assistive technologies, which use the accessibility tree to get information about the user interface and web content.
Operating system accessibility APIs were already being used to expose and query accessibility trees in other applications.
Although these APIs had to be extended somewhat to expose the rich semantics and complex structure of web content, browsers used them in fundamentally the same way as any other application: a single accessibility tree exposed from a single process.
Assistive technologies sometimes need to make large numbers of queries to perform a task; e.g. locating the next heading on a page.
However, making many queries across processes can become very slow due to the overhead of context switching, copying and serialising data, etc.
To make this faster, some assistive technologies and operating system frameworks ran their own code inside the browser process, known as in-process code.
This way, large batches of queries could be executed very fast.
In particular, Windows screen readers query the entire accessibility tree and build their own representation of the document called a virtual buffer.
As the web grew rapidly in usage and complexity, so too did the risk of security exploits.
To improve performance, stability and security, browsers started to move different web pages into separate processes.
Internet Explorer 8 used different processes for different tabs, but a web page was still drawn in the same process in which the page was loaded.
The accessibility tree was also exposed from that same process and assistive technologies could still inject code into that process.
This meant that there was no change for assistive technologies, which could still get direct, fast access to the content.
To further improve security, Chrome took a stricter approach as a fundamental part of its design.
Web content processes were sandboxed so that they had as little access as possible, delegating tasks requiring more privileges to other processes through tightly controlled communication channels.
This meant that assistive technologies could not access the web content process containing the accessibility tree, nor could they inject code into that process.
Several years later, Firefox adopted a similar design, resulting in similar problems for accessibility.
The discovery of the Meltdown and Spectre attacks led both browsers to go even further and isolate iframes in their own processes, which made the situation even more complicated for accessibility.
At first, Chrome experimented with handling accessibility queries in the main UI process and relaying them to the appropriate web content process.
Because accessibility API queries are synchronous, the entire UI and the web content process were blocked until each accessibility query completed and returned its result.
This made queries unacceptably slow, especially for large batches as described above.
This also caused some obscure stability and reliability issues.
Chrome abandoned that approach in favour of caching the accessibility trees from all other processes in the main UI process.
Rather than synchronous queries between processes, Chrome asynchronously pushes the accessibility trees from each web content process.
This does require some additional time and processor power when pages load and update, as well as using extra memory for the cache.
On the other hand, it means that assistive technologies have direct, in-process, fast access to the content as they did before in other browsers.
Firefox’s solution, take 1
Firefox was designed long before Chrome and long before the complex world which necessitated multiple, sandboxed processes.
This meant that re-architecting Firefox to use multiple processes was a massive undertaking which took years and a great deal of resources.
Great care had to be taken to ensure that Firefox remained reliable for the hundreds of millions of users who depended on it every day.
Firefox built a very minimal cache in the main process containing only the tree structure and the role (button, heading, etc.) of each node.
All other queries were relayed synchronously to the appropriate web content process.
On Linux and Mac, where large batches of queries are far less common and virtual buffers aren’t used, this was acceptable for the most part.
On Windows, as Chrome discovered, this was completely unacceptable.
Not only was it unusably slow, it was also very unstable because COM (the Windows communication mechanism used by accessibility) allows re-entry; i.e. another call can be handled while an earlier call is still running.
The Firefox multi-process communication framework wasn’t designed to handle re-entry.
Thus, another approach was required on Windows.
The accessibility team considered implementing a full cache of all accessibility trees in the main process.
However, Mozilla needed to ship multi-process Firefox as soon as possible.
It was believed that it would take too long to implement a full cache and get it right.
Getting anything wrong could result in the wrong information being communicated to assistive technologies, which could be potentially disastrous for users who already depended on Firefox.
There are also other downsides to a full cache as outlined earlier.
Instead, Firefox used some advanced (and somewhat obscure) features of COM to allow assistive technologies to communicate with the accessibility tree in content processes.
To mitigate the performance problems caused by large batches of queries, a partial cache was provided for each node.
Querying a node still required a synchronous, cross-process call, but instead of just returning one piece of information, the cache was populated with other commonly retrieved information for that single node.
This meant that some subsequent queries for that node were very fast, since they were answered from the cache.
All of this was done using a COM lightweight client-side handler.
The entire cache for all nodes was invalidated whenever anything changed.
While naive, this reduced the risk of stale information.
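To make the idea concrete, here is a minimal Python sketch of that partial cache. This is purely illustrative (the real implementation was a COM lightweight client-side handler in C++, and the names here are invented): the first query for a node makes one expensive cross-process call that returns a bundle of commonly retrieved properties, later queries for that node are answered from the cache, and any change invalidates everything.

```python
class PartialNodeCache:
    """Illustrative model of the old per-node partial cache."""

    def __init__(self, fetch_bundle):
        # fetch_bundle(node_id) stands in for the synchronous
        # cross-process call; it returns a dict of commonly
        # retrieved properties for that single node.
        self._fetch_bundle = fetch_bundle
        self._cache = {}
        self.cross_process_calls = 0

    def query(self, node_id, prop):
        # First query for a node crosses processes and populates the
        # cache; subsequent queries for that node are answered locally.
        if node_id not in self._cache:
            self.cross_process_calls += 1
            self._cache[node_id] = self._fetch_bundle(node_id)
        return self._cache[node_id].get(prop)

    def invalidate_all(self):
        # Naive wholesale invalidation whenever anything changes,
        # trading speed for a lower risk of stale information.
        self._cache.clear()
```

For example, querying the name and then the role of the same node costs only one cross-process call; after `invalidate_all()`, the next query crosses processes again.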
The performance with assistive technologies took a massive step backwards when this was first released in Firefox 57.
Over time, we were able to improve this significantly by extending the COM handler cache.
Eventually, we reached a point where we could not improve the speed any further with the current architecture.
Because software other than assistive technology uses accessibility APIs (e.g. Windows touch, East Asian input methods and enterprise SSO tools), this was even impacting users without disabilities in some cases.
Furthermore, COM was never designed to handle the massive number of objects in many web accessibility trees, resulting in severe stability problems that are difficult or even impossible to fix.
The complexity of this architecture and the need for different implementations on different operating systems made the accessibility engine overly complex and difficult to maintain.
This is particularly important given the small size of our team.
When we revamped our Android and Mac implementations in 2019 and 2020, we had to implement more operating system specific tweaks to ensure decent performance, which took time and further complicated the code.
This wouldn’t have been necessary with the full cache.
Of course, maintaining the caching code has its own cost.
However, this work can be more easily distributed across the entire team, rather than relying on the specific expertise of individual team members in particular operating systems.
Enter Cache the World
Our existing architecture served us well for a few years.
However, as the problems began to mount, we decided to go back to the drawing board.
We concluded that the downsides of the full cache were far outweighed by the growing problems with our existing architecture and that careful design could help us mitigate those downsides.
Thus, the Cache the World project was born to re-architect the accessibility engine.
In the new architecture, similar to Chrome, Firefox asynchronously pushes the accessibility trees from each web content process to the main UI process.
When assistive technologies query the accessibility tree, all queries are answered from the cache without any calls between Firefox processes.
When a page updates, the content process asynchronously pushes a cache update to the main process.
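A minimal sketch of this flow, with invented names and a queue standing in for the asynchronous IPC channel: the content side pushes updates without blocking, the main process applies them when they arrive, and assistive technology queries never leave the main process.

```python
from collections import deque


class MainProcessCache:
    """Illustrative model of the asynchronously pushed cache."""

    def __init__(self):
        self._tree = {}          # node_id -> {property: value}
        self._pending = deque()  # stands in for the async IPC channel

    def push_update(self, node_id, props):
        # Content process side: queue an update; this never blocks.
        self._pending.append((node_id, dict(props)))

    def apply_pending(self):
        # Main process side: drain queued updates when they arrive.
        while self._pending:
            node_id, props = self._pending.popleft()
            self._tree.setdefault(node_id, {}).update(props)

    def query(self, node_id, prop):
        # Assistive technology query: answered entirely from the
        # cache, with no call into the content process.
        return self._tree.get(node_id, {}).get(prop)
```

The key property is that `query` is always local and synchronous, while all cross-process traffic is asynchronous and batched.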
The speed improvement has far exceeded our expectations, and unlike the old architecture, we still have a great deal of room to improve further, since we have complete control over how and when the cache is updated.
As for code maintenance, once this is fully released, we will be able to remove around 20,000 lines of code, with the majority of that being operating system specific.
The journey to the world of caching
Aside from the code needed to manage the cache and update it for many different kinds of changes, this project required a few other major pieces of work worth mentioning.
First, while not strictly required for the cache itself, we wanted to share as much code as possible between the cached and non-cached implementations.
In particular, there is a layer of code to support the accessibility APIs specific to each operating system and we didn’t want to maintain two completely separate versions of this.
So, we created a unified accessibility tree, with a base Accessible class providing an interface and functionality common to both implementations (LocalAccessible for the existing non-cached tree and RemoteAccessible for the cached tree).
Other code, especially operating system specific code, then had to be updated accordingly to use this unified tree.
Second, the Windows specific accessibility code was previously entangled with the core accessibility code.
Rather than being a separate class hierarchy, Windows functionality was implemented in subclasses of what is now called LocalAccessible.
This made it impossible for the Windows code to support the separate cached implementation.
Fixing this involved separating the Windows implementation into a separate class hierarchy (MsaaAccessible).
Third, the code which provided access to text (words, lines, formatting, spelling errors, etc.) depended heavily on being able to query Firefox’s layout engine directly.
It dealt with text containers rather than individual chunks of text, which was not ideal for efficient caching.
There were also a lot of bugs causing asymmetric and inconsistent results.
We replaced this with a completely new implementation based on text ranges called TextLeafRange.
It still needs to use the layout engine to determine line boundaries, but it can do this for individual chunks of text and it provides symmetric, consistent results.
TextLeafRange is far better suited to the Mac text API and will make the Windows UI Automation text pattern much easier to implement when we get to that.
We also replaced the code for handling tables, which similarly depended heavily on the layout engine, with a new implementation called CachedTableAccessible.
Fourth, our Android accessibility code required significant re-design.
On Android, unlike other operating systems, the Firefox browser engine lives in a separate thread from the Android UI.
Since accessibility queries arrive on the UI thread, we had to provide thread-safe access to the accessibility cache on Android.
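A minimal sketch of that thread-safety requirement, assuming nothing about Firefox's actual classes: reads from the UI thread and writes from the engine thread are serialised through a lock.

```python
import threading


class ThreadSafeCache:
    """Illustrative lock-guarded cache for cross-thread access."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def update(self, node_id, props):
        # Called from the browser engine thread.
        with self._lock:
            self._data.setdefault(node_id, {}).update(props)

    def query(self, node_id, prop):
        # Called from the Android UI thread.
        with self._lock:
            return self._data.get(node_id, {}).get(prop)
```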
Finally, screen coordinates and hit testing, which is used to figure out what node is at a particular point on the screen, were an interesting challenge.
Screen positioning on the modern web can be very complicated, involving scrolling, multiple layers, floating content, transforms (repositioning/translation, scaling, rotation, skew), etc.
We cache the coordinates and size of each node relative to its parent and separately cache scroll positions and transforms.
This minimises cache updates when content is scrolled or moved.
Using this data, we then calculate the absolute coordinates on demand when an assistive technology asks for them.
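A simplified Python sketch of that on-demand calculation (the field names and the treatment of transforms are invented; real transforms involve full matrices): walk from the node to the root, summing each node's cached parent-relative position and subtracting each ancestor's scroll offset.

```python
def absolute_position(nodes, node_id):
    """Compute absolute coordinates from cached relative ones.

    nodes: node_id -> {"parent": id or None, "x": int, "y": int,
                       "scroll_x": int, "scroll_y": int}
    where (x, y) is the node's position relative to its parent.
    """
    x = y = 0
    current = node_id
    while current is not None:
        node = nodes[current]
        x += node["x"]
        y += node["y"]
        parent = node["parent"]
        if parent is not None:
            # Scrolled content moves up/left relative to its
            # container, so subtract the parent's scroll position.
            x -= nodes[parent].get("scroll_x", 0)
            y -= nodes[parent].get("scroll_y", 0)
        current = parent
    return x, y
```

Because positions are stored parent-relative, scrolling a container only requires updating that container's cached scroll offset, not every descendant.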
For hit testing, we use the layout engine to determine which elements are visible on screen and sort them from the top layer to the bottom layer.
We cache this as a flat list of nodes called the viewport cache.
When an assistive technology asks for the node at a given point on the screen, we walk that list, returning the first node which contains the given point.
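The walk itself is straightforward; here is an illustrative sketch (data layout invented for the example), where the list is ordered topmost layer first and the first bounding box containing the point wins:

```python
def hit_test(viewport_cache, x, y):
    """Return the topmost node containing the point, or None.

    viewport_cache: list of (node_id, left, top, width, height)
    tuples, ordered from the topmost layer to the bottommost.
    """
    for node_id, left, top, width, height in viewport_cache:
        if left <= x < left + width and top <= y < top + height:
            return node_id
    return None
```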
How is Firefox’s cache different to Chrome’s?
While Firefox’s cache is similar to (and inspired by) Chrome’s, there are some interesting differences.
First, to keep its cache up to date, Chrome has a cache serialiser which is responsible for sending cache updates.
When content changes, Chrome notifies the serialiser that a node has changed.
The specific change is mostly irrelevant to the serialiser; it just re-serialises the entire node.
The serialiser keeps track of what nodes have already been sent.
When walking the tree, it sends any new nodes it encounters and ignores any nodes that were already sent and haven’t been changed.
In contrast, Firefox uses its existing accessibility events and specific cache update requests to determine what changes to send.
When a node is added or removed, Firefox fires a show or hide event.
This event is used to send information about a subtree insertion or removal to the main process.
The web content process doesn’t specifically track what nodes have been sent, but rather, it relies on the correctness of the show and hide events.
For other changes to nodes, Firefox uses existing events to trigger cache updates where possible.
Where it doesn’t make sense to have an event, code has been added to trigger specific cache updates.
The cache updates only include the specific information that changed.
We’ve spent years refining the events we fire, and incorrect events tend to cause problems for assistive technologies and thus need to be fixed regardless, so we felt this was the best approach for Firefox.
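The shape of Firefox's update messages can be sketched like this (the event structure is invented for illustration; it is not Firefox's actual IPC format): show and hide events carry whole subtrees, while other updates carry only the fields that changed.

```python
def apply_event(cache, event):
    """Apply one illustrative cache-update event to a flat cache."""
    kind = event["type"]
    if kind == "show":
        # Subtree insertion: add every node in the serialised subtree.
        for node in event["subtree"]:
            cache[node["id"]] = {k: v for k, v in node.items()
                                 if k != "id"}
    elif kind == "hide":
        # Subtree removal: drop the listed nodes.
        for node_id in event["ids"]:
            cache.pop(node_id, None)
    elif kind == "update":
        # Only the specific fields that changed are included.
        cache.setdefault(event["id"], {}).update(event["fields"])
    return cache
```

Correctness therefore hinges on the show/hide events being right, which is exactly the property the team has spent years refining.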
Second, Chrome includes information about word boundaries in its cache.
In contrast, Firefox calculates word boundaries on demand, which saves memory and reduces cache update complexity.
We can do this because we have access to the code which calculates word boundaries in both our main process and our content processes, since our main process renders web content for our UI.
Third, hit testing is implemented differently.
I described how Firefox implements hit testing earlier.
Rather than maintaining a viewport cache, Chrome first gets an approximate result using just the cached coordinates in the tree.
It then sends an asynchronous request to cache a more accurate result for the next query at a nearby point on the screen.
Our hope is that the viewport cache will make initial hit testing more accurate in Firefox, though this strategy may well need some refinement over time.
So, when can I use this awesomeness?
I’m glad you asked!
The new architecture is already enabled in Firefox Nightly.
So far, we’ve received very positive feedback from users.
Assuming all continues to go well, we plan to enable this for Windows and Linux users in Firefox 110 beta in January 2023.
After that, we will roll this out in stages to Windows and Linux users in Firefox 111 or 112 release.
There is still a little work to do on Mac to fully benefit from the cache, particularly for text interaction, but we hope to release this for Mac soon after Windows.
Now, go forth and enjoy a cached world!
I recently got a new laptop: a Dell XPS 15 9510.
While this is a pretty nice machine overall, its audio drivers are an abomination.
Among other things, the Waves MaxxAudio software it ships with eventually leaks all of your system memory if you use audio constantly for hours, which is the case for screen reader users.
I eventually got fed up and disabled the Waves crap, but this makes it impossible for me to use the headset mic on my EarPods.
To work around that, I bought an Apple USB-C to 3.5-mm Headphone Jack Adapter.
As well as supporting the mic on the EarPods, this adapter also supports the volume and play/pause buttons!
However, play/pause only plays or pauses.
In contrast, on the iPhone, pressing it twice skips to the next track and pressing it thrice skips to the previous track.
I discovered that these buttons simply get sent as media key presses.
So, I wrote a little AutoHotkey script to intercept the play/pause button and translate double and triple presses into next track and previous track, respectively.
It also supports holding the button.
Single press and hold currently does nothing, but you could adjust this to do whatever you want.
On the iPhone, double/triple press and hold fast forwards/rewinds, respectively.
Support for triggering fast forward/rewind in Windows apps is sketchy - I couldn’t get it to work anywhere - so these are currently mapped to shift+nextTrack and shift+previousTrack.
This way, you have the option of binding those keys in whatever app you’re using.
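The actual script is written in AutoHotkey, but the core timing logic is simple enough to sketch in Python (the 0.3-second grouping threshold is an assumption for the example): presses arriving within the threshold of each other are grouped, and the group size selects the action.

```python
def _action_for(count):
    if count == 1:
        return "playPause"       # single press
    if count == 2:
        return "nextTrack"       # double press
    return "previousTrack"       # triple (or more) presses


def classify_presses(timestamps, threshold=0.3):
    """Group button presses by time and map group size to an action.

    timestamps: ascending times (in seconds) at which the play/pause
    button was pressed. Returns the list of resulting actions.
    """
    actions = []
    count = 0
    last = None
    for t in timestamps:
        if last is not None and t - last > threshold:
            # Gap exceeded: close out the previous group.
            actions.append(_action_for(count))
            count = 0
        count += 1
        last = t
    if count:
        actions.append(_action_for(count))
    return actions
```

Two presses 0.1 s apart become a single "nextTrack"; two presses a second apart become two separate "playPause" actions.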
You can get the code or download a build.
I’m finally learning to cook some decent food, so I need to be able to read recipes.
For a while, I was reading them out of Simplenote on my iPhone.
However, I encountered several frustrations with this approach (and this applies to any notes or text app really):
- When you’re not editing, Simplenote shows the note such that each line is an item for VoiceOver; i.e. you flick right to read the next line.
However, if you bump the screen or perform the wrong gesture accidentally, you can easily lose your position.
If the screen locks or you have to switch apps, you lose your position completely, since VoiceOver doesn’t restore focus to the last focused item in apps.
- When editing, you can review the note line by line using the rotor.
The advantage here is that the cursor doesn’t get lost when you switch apps or the screen locks.
However, lines can be smaller than is ideal due to the screen size, so one recipe instruction might get split across multiple lines.
Also, moving the editing cursor with VoiceOver is notoriously buggy, often getting stuck, etc.
Finally, again, if you bump the screen or perform the wrong gesture, you can lose your position (or worse, accidentally type text into the document).
- The screen lock problem could be solved by disabling auto lock, but that obviously has an impact on battery.
- Having to repeatedly take my phone out of my pocket to read the next instruction was impractical, especially given the risk of losing my spot in the recipe.
Leaving it on a bench somewhere meant I had to keep walking back to wherever my phone was located, which was similarly tedious.
This might seem simple enough, but when you’re moving around a lot, using your hands for other things, getting your hands dirty, etc., it just isn’t efficient.
I considered a couple of solutions:
- I tried looking for an iOS app that could read recipes using Siri.
If such an app exists, I couldn’t find it.
Any normal recipe app would likely have the same problems as above for VoiceOver users.
- Google Home and Amazon Alexa can read recipes interactively using voice commands.
I don’t own either of those, but I was willing to consider the purchase.
However, they can only read recipes from partner sites.
This means you can’t read recipes from other sources or recipes you’ve customised… and I tend to tweak recipes quite a bit for my own convenience.
So, I resigned myself to developing my own solution to read recipes with Siri.
This isn’t specific to recipes.
It can be any line based text.
For example, it could be equally useful for other kinds of instructions where you need to be able to move step by step, but might have delays (maybe many minutes) between reading each step.
How it Works
As explained above, I need to be able to edit and read customised recipes.
I find it much easier to edit long text on my laptop.
So, my solution takes the text from a simple text file stored on iCloud Drive.
This way, I can edit the text on my laptop, save it directly to iCloud Drive and have it reflected almost immediately in my reader solution without any extra effort.
Recipes usually have at least two sections (e.g. Ingredients and Method).
It’s sometimes helpful to be able to jump between those.
The solution allows me to use a Markdown style heading (# heading text) to mark section headings.
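A minimal sketch of that convention in Python (a hypothetical helper, not the actual Scriptable code): split the text into non-empty lines and record which line indices are headings.

```python
def parse_lines(text):
    """Split text into readable lines and find heading indices.

    A line starting with "# " is treated as a section heading,
    following the Markdown-style convention described above.
    """
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    headings = [i for i, l in enumerate(lines) if l.startswith("# ")]
    return lines, headings
```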
The solution can be used while the phone is locked.
An added advantage of this is that it can even be triggered from a HomePod, with the responses read on the HomePod, though I usually prefer to use my AirPods.
I can then use these Siri commands (i.e. after saying “Hey Siri”):
- Read next: Read the next line of text.
- Read previous: Read the previous line of text.
- Read repeat: Repeat the line of text that was last read.
- Read next section: Jump to the next section heading and read it.
- Read previous section: Jump to the previous section heading and read it.
In all cases, the solution keeps track of the last line that was read until you next use a command.
Even if I wait an hour, I’ll still be exactly where I last left it.
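The commands above all revolve around one persistent bookmark: the index of the last line read. Here is an illustrative Python model of that logic (the real solution stores its state via Shortcuts/Scriptable, and these names are invented):

```python
class LineReader:
    """Illustrative model of the Siri reading commands."""

    def __init__(self, lines):
        self.lines = lines
        self.bookmark = -1  # nothing read yet

    def next_line(self):
        if self.bookmark < len(self.lines) - 1:
            self.bookmark += 1
        return self.lines[self.bookmark]

    def previous_line(self):
        if self.bookmark > 0:
            self.bookmark -= 1
        return self.lines[max(self.bookmark, 0)]

    def repeat_line(self):
        return self.lines[max(self.bookmark, 0)]

    def next_section(self):
        # Jump forward to the next "# " heading, if any.
        for i in range(self.bookmark + 1, len(self.lines)):
            if self.lines[i].startswith("# "):
                self.bookmark = i
                break
        return self.lines[max(self.bookmark, 0)]
```

Because every command reads and updates the same bookmark, the position survives arbitrarily long pauses between commands.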
Sometimes, I want to be able to quickly review many instructions; e.g. if I’m looking for multiple ingredients or reading ahead to see what’s coming up.
In this case, I can use the Siri command “read browse” while the phone is unlocked.
This presents the instructions in a WebView so I can flick right and left between them with VoiceOver.
I can also use the headings rotor to move between headings.
When it opens, it focuses the line I last read.
Furthermore, if I double tap one of the lines, it sets that as the “bookmark”; i.e. the last line read.
For example, if I double tap the second instruction in the method section of a recipe and later say “Hey Siri, read next”, Siri will read me the third instruction in the method section.
While in this browsing view, each line occupies almost the entire screen.
This might be useful if you’re reading notes for a live talk you’re giving and don’t want to risk losing your spot if you bump the screen.
Reading on Apple Watch
A few weeks ago, I bought an Apple Watch.
I began to wonder: could I somehow make use of the Apple Watch for something similar to “read browse”?
The Apple Watch has three nice advantages here:
- It’s on your wrist, so you don’t have to worry about locating it, picking it up, accidentally dropping it, etc.
- Although it does go to sleep, you don’t have to unlock it once it’s on your wrist and unlocked.
- It’s much more water resistant than phones, so I’m less worried about sticking my grubby hands all over it.
Now, the solution works on Apple Watch too.
For technical reasons, it unfortunately can’t focus the last line I read with Siri and doesn’t support heading navigation.
When the Apple Watch wakes up after going to sleep, VoiceOver doesn’t restore focus to the line I last read.
However, because the screen is so small and the scroll position is kept, I can just tap the screen to read the last line (or at least one very nearby), so this isn’t a real problem on the watch.
Interestingly, I find I now use the watch to read recipes far more than Siri.
Before I got the watch, I’d been using the Siri solution with my phone for a few months and was reasonably happy with it.
However, speaking Siri commands can be slow if you’re reading several instructions in quick succession.
Also, Siri would sometimes misunderstand my commands; e.g. trying to read text messages instead of “read next”.
Also, I found I wanted to read ahead more often with some recipes and having to find my phone, pick it up, unlock it, etc. to use “read browse” was slightly annoying.
That said, I suspect I’ll still use Siri in some cases.
It’s useful being able to interactively read instructions in multiple ways depending on what I need at the time.
The current solution is implemented using iOS Shortcuts and Scriptable.
Here’s the code for the Scriptable script.
Unfortunately, getting this set up is pretty tedious because you have to manually create a bunch of Siri shortcuts.
- Import the script into Scriptable.
The easiest way to do this is to copy the file into the Scriptable folder on iCloud Drive.
- In the Shortcuts app, create a shortcut for “Read next”:
- Add the Scriptable “Run Script” action and choose the SiriInteractiveReader script.
- Tap the “Show More” button for that action.
- Under “Texts”, add a new item and enter the text: nextLine
- Ensure “Run In App” and “Show when Run” are both off.
- Add the “Show Result” action.
- Name the shortcut “Read next”.
- Duplicate this shortcut for the rest of the Siri reading commands.
Aside from the name of the shortcut, the difference in each shortcut will be the text entered under “Texts” in the “Run Script” action:
- For Read previous: previousLine
- For Read repeat: repeatLine
- For Read next section: nextSection
- For Read previous section: previousSection
- For the “Read browse” shortcut, the text under “Texts” should be “browse”.
“Run In App” must be on.
The “Show Result” action should be removed (so there’s only the “Run Script” action).
- If you have an Apple Watch, you can add a shortcut to support this.
- Again, you need the “Run Script” action with the script set to SiriInteractiveReader.
- Under Texts, enter the text: list
- Add the “Get Dictionary Value” action.
- Set the key for that action to: list
- Add the “Choose Item from List” action.
- Name the shortcut “Read watch” and ensure “Show on Apple Watch” is on.
- Note that you should run this action from the Shortcuts app on the watch, not from Siri.
The text you want to read should be placed in a file called SiriInteractiveReader.txt in the Scriptable folder on iCloud Drive.
I learned a great deal throughout the process of implementing this.
Here are some learnings that might be of interest to others working with iOS Shortcuts and Scriptable.
- If you want to have Siri read a response without also saying “That’s done” or similar, you need to turn off “Show when Run”, return the text as output from your Scriptable script and use the “Show Result” action in Shortcuts.
The intuitive way to speak text using Siri is to use Scriptable’s Speech.speak function.
If you do this, Siri seems to want to speak “That’s done” or similar.
In contrast, this doesn’t happen when you use “Show Result”.
The added advantage is that the shortcut will display the text on screen if run outside of Siri.
- If you have Scriptable present a WebView with “Run In App” turned off, you won’t be able to activate anything in the WebView.
Also, window.close doesn’t work in Scriptable WebViews.
I was hoping to use this to handle moving the reading bookmark when the user taps a line.
Instead, I used a scriptable:// URL to open the script with a specific parameter, which also dismisses the WebView.
- There is no Scriptable app for Apple Watch (yet).
However, as long as you don’t try to present any UI whatsoever from within Scriptable, you can still make use of Scriptable in shortcuts run on the watch.
The script will run on the phone.
You can still present UI, but you have to do it with Shortcuts actions, which are able to run on the watch.
You can get creative here to present UI based on output from a Scriptable script, as I do using the “Choose Item from List” in the watch shortcut above.
Ideally, it’d be good to develop this into an app so it’s not so tedious (probably impossible for many users) to install.
I considered this, but I’m one of those strange developers who still uses a text editor and prefers to design GUIs in markup languages or code.
The prospect of learning and using Xcode and having to use a GUI builder is not something I’m at all motivated to do in my spare time.
I’ve read you can design iOS GUI in code to some extent, but it looks super painful.
An awesome feature in Firefox that has existed forever is the ability to assign keywords to bookmarks.
For example, I could assign the word “bank” to take me directly to the login page for my bank.
Then, all I have to do is type “bank” into the address bar and press enter, and I’m there.
Another awesome feature in Firefox is the ability to use the address bar to switch to open tabs.
For example, if I want to switch to my Twitter tab, I can type “% twitter” into the address bar, then press down arrow and enter, and I’m there.
Inspired by these two features, I started to wonder: what if you could have tab keywords to quickly switch to tabs you use a lot?
If you only have 8 tabs you use a lot, you can switch to the first 8 tabs with control+1 through control+8.
If you have more than that, you can search them with the address bar, but that gets messy if you have multiple pages with similar titles or a page title doesn’t contain keywords that are quick to search.
For example, if you have both Facebook Messenger and Twitter Direct Messages open, you can’t just type “% mes” because that will match both.
If you’re on bug triage duty and have a bug list open, the list might not have a useful title.
Wouldn’t it be nice to just type “tm” to switch to Twitter Direct Messages or “fm” to switch to Facebook Messenger?
Now you can!
Trying to integrate this into the Firefox address bar seemed pretty weird.
Among other things, it wasn’t clear what a good user experience would be for setting a keyword for a tab.
So, I decided to do this in an add-on.
Rather than writing my own from scratch, I found Fast Tab Switcher and contributed the feature to that.
Version 2.7.0 of Fast Tab Switcher has now been released which includes this feature.
How it Works
First, install Fast Tab Switcher if you haven’t already.
Then, to assign a keyword to a tab:
- Switch to the tab.
- Press control+space to open Fast Tab Switcher.
- Type the = (equals) character into the text box.
- Type the keyword to assign and press enter.
For example, to assign the keyword “fm”, you would press control+space, type “=fm” and press enter.
To switch to a tab using its keyword, press control+space, type the keyword and press enter.
Note that the keyword must be an exact match.
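A rough Python model of that matching behaviour (Fast Tab Switcher is actually a JavaScript WebExtension; these names are invented for illustration): a query starting with "=" assigns a keyword to the current tab, an exact keyword match wins, and otherwise a case-insensitive substring match against tab titles is used as a fallback.

```python
class TabSwitcher:
    """Illustrative model of keyword-based tab switching."""

    def __init__(self, tabs):
        self.tabs = tabs       # tab_id -> title
        self.keywords = {}     # keyword -> tab_id
        self.current = None    # the currently focused tab

    def handle(self, query):
        if query.startswith("="):
            # "=fm" assigns the keyword "fm" to the current tab.
            self.keywords[query[1:]] = self.current
            return self.current
        if query in self.keywords:
            # Exact keyword match takes priority.
            return self.keywords[query]
        for tab_id, title in self.tabs.items():
            # Fall back to a substring match against titles.
            if query.lower() in title.lower():
                return tab_id
        return None
```

With both Facebook Messenger and Twitter Direct Messages open, "mes" matches by title ambiguously, but "tm" jumps straight to the tab it was assigned to.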
Keywords stay assigned to tabs even if you close Firefox, as long as you have Firefox set to restore tabs.
Enjoy super fast tab switching!
At CSUN this year, I attended the open source math accessibility sprint co-hosted by the Shuttleworth Foundation and Benetech, where major players in the field gathered to discuss and hack on various aspects of open source math accessibility. My team, which also included Kathi Fletcher, Volker Sorge and Derek Riemer, tackled reading of mainstream math with open source tools.
Last year, NVDA introduced support for reading and interactive navigation of math content in web browsers and in Microsoft Word and PowerPoint. To facilitate this, NVDA uses MathPlayer 4 from Design Science. While MathPlayer is a great, free solution that is already helping many users, it is closed source, proprietary software, which severely limits its future potential. Thus, there is a great need for a fully open source alternative.
Some time ago, Volker Sorge implemented support for math in ChromeVox and later forked this into a separate project called Speech Rule Engine (SRE). There were two major pieces to our task:
- One of the things that sets MathPlayer above other math accessibility solutions is its use of the more natural ClearSpeak speech style. In contrast, MathSpeak, the speech style used by SRE and others, was designed primarily for dictation and is not well suited to efficient understanding of math, at least without a great deal of training. So, we needed to implement ClearSpeak in SRE. Because this is a massive task that would take months to complete (and this was a one day hackathon!), we chose to implement just a few ClearSpeak rules, just enough to read the quadratic equation.
Our goal for the end of the day was to present NVDA and SRE reading and interactively navigating the quadratic equation in Microsoft Word using ClearSpeak, including one pause in speech specified by ClearSpeak. (ClearSpeak uses pauses to make reading easier and to naturally communicate information about the math expression.) I'm pleased to say we were successful! Obviously, this was very much a "proof of concept" implementation and there is a great deal of further work to be done, both in NVDA and SRE. Thanks to my team for their excellent work and to Benetech and the Shuttleworth Foundation for hosting the event and inviting me!
As a result of this work, I was subsequently nominated by Kathi Fletcher for a Shuttleworth Foundation Flash Grant. In short, this is a small grant I can put towards a project of my choice, with the only condition being to "live openly" and share it with the world. And I figured polishing NVDA's integration with SRE was a fitting project for this grant. So, in the coming months, I plan to release an NVDA add-on package which allows users to easily install and use this solution. Thanks to Kathi for nominating me and to the Shuttleworth Foundation for supporting this! Watch this space for more details.
Introduction
In my last post, I waxed lyrical about the surprising complexity of the seemingly simple aria-label/ledby. Thanks to those who took the time to read it and provide their valuable thoughts. In particular, Steve Faulkner commented that he’d started working on “doc/test files suggesting what screen readers should announce from accname/description info”. Talk about responsive! Thanks Steve! His inclusion of description opened up another can of worms for me, so I thought I’d continue the trend and let the worms spill out right here. Thankfully, this particular can is somewhat smaller than the first one!
What are you on about this time?
Steve’s new document suggests that for an a tag with an href, screen readers should:
Announce accname + accdescription (if present and different from acc name), ignore element content.
I don’t agree with the “ignore element content” bit in all cases; see the “Why not just use the accessible name?” section of my label post for why. However, the bit of interest here is the suggestion that accDescription should be reported.
Well, of course it should! The spec says!
The spec allows elements to be described, so many argue that it logically follows that a supporting screen reader should always read the description. I strongly disagree.
While the label is primary information for many elements (including links), I believe the description is “secondary” information. The ARIA spec says that:
a label should be concise, where a description is intended to provide more verbose information.
“More verbose information” is the key here. It is reasonable to assume that users will not always be interested in this level of verbosity. If the information was important enough to be read always, why not just stick it in the label?
What on earth do you mean by secondary information?
I think of descriptions rather like tooltips. A tooltip isn’t always on screen, but rather, appears only when, say, the user moves their mouse over the associated element. The information is useful, but the user doesn’t always need to see it. They only need to see it if the element is of particular interest.
Case in point: the HTML title attribute is most often presented as a tooltip and… wait for it… is usually presented as the accessible description (unless there’s no name).
But most screen reader users don’t use a mouse!
Quite so. But moving the mouse to an element can be generalised: some gesture that indicates the user is specifically interested in/wishes to interact with this element. When a user is just reading, they’re not doing this.
Why is this such a big deal?
Imagine you’re reading an article about the changing landscape of device connectors in portable computers over the years:
(I use the title attribute here because it’s easier than aria-describedby, but the same could be done with aria-describedby.)
<p>There have been many different types of connections for peripheral devices in portable computers over the years: <a href="pcmcia" title="Personal Computer Memory Card International Association">PCMCIA</a>, <a href="usb" title="Universal Serial Bus">USB</a> and <a href="sata" title="Serial ATA">E-SATA</a>, just to name a few.</p>
Imagine you’re reading this as a flat document, either line by line or all at once. Let’s check that out with all descriptions reported:
There have been many different types of connections for peripheral devices in portable computers over the years: link, PCMCIA, Personal Computer Memory Card International Association, link, USB, Universal Serial Bus, and link, E-SATA, Serial ATA, just to name a few.
Wow. That’s insanely verbose and not overly useful unless I’m particularly interested in the linked article. And that’s just one small sentence! If sighted users don’t have to see this all the time, why should I as a screen reader user?
Here’s another example based loosely on an issue item in the NVDA GitHub issue list:
<a href="issue/5612">Support for HumanWare Brailliant B using USB HID</a>
<a href="label/Braille" title="View all Braille issues">Braille</a>
<a href="label/enhancement" title="View all enhancement issues">enhancement</a><br>
#5612 opened <span title="16 Dec. 2015, 9:49 am AEST">2 days ago</span>
by <a href="user/jcsteh" title="View all issues opened by jcsteh">jcsteh</a>
Let’s read that entire item with descriptions:
link, Support for HumanWare Brailliant B using USB HID, link, Braille, View all Braille issues, link, enhancement, View all enhancement issues, #5612 opened 2 days ago, 16 Dec. 2015, 9:49 am AEST, by jcsteh, View all issues opened by jcsteh
In what universe is that efficient?
Slight digression: complete misunderstanding of description
As an aside, GitHub’s real implementation of this is actually far worse because they incorrectly use the aria-label attribute where I’ve used the title attribute, so you lose the real labels altogether. You get something like this:
link, Support for HumanWare Brailliant B using USB HID, link, View all Braille issues
which doesn’t even make sense. David MacDonald outlined this exact issue in his comment on my label post:
The most common mistake I’m correcting for aria-label/ledby is when it over rides the text in the element, or associated label and when that text or associated html label is important. For instance, a bit of help text on an input. They should use describedby but they don’t understand the difference between accName and accDescription.
Still, the spec is fairly clear on this point, so I guess this one is just up to evangelism.
So are you saying description should never be read? What’s the point of it, then?
Not at all. I’m saying it shouldn’t “always” be read.
When, then?
When there is “some gesture that indicates the user is specifically interested in/wishes to interact with this element”. For a screen reader, simply moving line by line through a document doesn’t satisfy this. Sure, the user is interacting with the device, but that’s because screen readers inherently require interaction; they aren’t entirely passive like sight. For me (and, surprise surprise, for NVDA), this “gesture” means something like tabbing to the link, moving to it using single letter navigation, using a command to query information about the current element, etc.
But VoiceOver reads it!
With VoiceOver, you usually move to each element individually. You don’t (at least not as often) move line by line (like you do with NVDA), where there can be several elements reported at once. With the individual element model, it makes sense to read the description because you’re dealing with a single element at a time and the user may well be interested in that specific element. And if the user really doesn’t care about it, they can always just move on to the next element early.
So now you’re saying we can’t have interoperability. Dude, make up your mind already!
Recall this from my last post:
If we want interoperability, we need solid rules. I’m not necessarily suggesting that this be compulsory or prescriptive; different AT products have different interaction models and we also need to allow for preferences and innovation.
This is one of those “different interaction models” examples.
Rich Schwerdtfeger commented on my last post:
The problem we have with AT vendors is that many have lobbied very hard for us to NOT dictate what they should do.
Examples like these are one reason AT vendors push back on this.
So, uh, what are we supposed to do?
I’m optimistic that there’s a middle ground: guidelines which allow for reasonable interoperability without restricting AT’s ability to innovate and best suit their users’ needs. As in software development, a bit of well-considered abstraction goes a long way to ensuring future longevity.
In this case, perhaps the guidelines could use the “secondary content” terminology I used above or something similar. They might say that for an a tag with an href, the name should be presented as the primary content if overridden using aria-label/ledby and the description should be treated as secondary content. This leaves it up to the AT vendor to decide exactly when this secondary content is presented based on the interaction model, while still providing some idea of how to best ensure interoperability.
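To make the primary/secondary idea concrete, here’s a minimal sketch of how an AT might gate description presentation on an interest gesture. This is entirely hypothetical; the gesture names and rules are illustrative, not taken from NVDA or any other real screen reader:

```python
# Hypothetical sketch only; gesture names and the rule itself are
# illustrative, not any real screen reader's implementation.
INTEREST_GESTURES = {"tab", "single_letter_nav", "query_element"}

def announce(name, description, gesture):
    """Build speech for an element: name always; description only when
    the user has shown specific interest in this element."""
    parts = [name]
    if description and description != name and gesture in INTEREST_GESTURES:
        parts.append(description)
    return ", ".join(parts)

# Passive line-by-line reading stays concise...
assert announce("USB", "Universal Serial Bus", "read_line") == "USB"
# ...while tabbing to the link reveals the secondary information.
assert announce("USB", "Universal Serial Bus", "tab") == "USB, Universal Serial Bus"
```

The point is simply that the decision hinges on the interaction, not on whether a description exists.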
But sometimes, I hate ARIA. Yes, you heard me. I said it. Sometimes, it drives me truly insane.
Let’s take aria-label and aria-labelledby. They’re awesome. Authors can just use them to make screen readers speak the right thing. Simple, right?
The most frustrating part is that people frequently argue that assistive technology products aren’t following the spec when their particular use case doesn’t work as expected. Others bemoan the lack of interoperability between AT products and often blame the AT vendors. But actually, the ARIA spec and guidelines don’t say (not even in terms of recommendations) anything about what ATs should do. They talk only about what browsers should expose, and herein begins a great deal of misunderstanding, argument and confusion. And when we do try to fix one seemingly obvious use case, we often break another seemingly obvious use case.
In this epic ramble, I’ll attempt to explain just how complicated this superficially trivial issue is, primarily so I can avoid having this argument over and over and over again. While this is specifically related to aria-label/aria-labelledby, it’s worth noting there are similar cans of worms lurking in many other aspects of ARIA. Also, I specifically discuss screen readers with a focus on NVDA in particular, but some of this should still be relevant to other AT.
Why not just use the accessible name?
Essentially, aria-label/ledby alters what a browser exposes as the “name” of an element via accessibility APIs. Furthermore, ARIA specifies when the name should be calculated from the “text” of descendant elements. So before we even get into aria-label/ledby, let’s address the question: why don’t screen readers just use the name wherever it is present?
The major problem with this is that the “name” is just text. It doesn’t provide any semantic or formatting information.
Take this example:
<a href="foo"><em>bar</em> bas</a>
A browser will expose “bar bas” as the name of the link exactly as you might expect. But that “bar bas” is just text. What about the fact that “bar” was emphasised? If we just take the name, that information is lost. In this example:
<a href="foo"><img src="bar.png" alt="bar"> bas</a>
the name is again “bar bas”. But if we just take the name, the fact that “bar” is a graphic is lost.
These are overly simple, contrived examples, but imagine how this begins to matter once you have more complex content.
In short, content is more than just the name.
Just use it when aria-label/ledby is present.
Okay. So we can’t always use the name. But if aria-label/ledby is present, then we can use the name, right?
Wrong. To disprove this, all we have to do is take a landmark:
<div role="navigation" aria-label="Main">Lots of navigation links here</div>
Now, our screen reader comes along looking for content and sees there’s a name, which it happily uses as the content for the entire element. Oops. All of our navigation links just disappeared. All we have left is “Main”. (Of course, no screen reader actually does or has ever done this as far as I'm aware.)
That’s just silly. You obviously don’t do it for landmarks!
Well, sure, but this raises the question: when do we use it and when don’t we? “Common sense” isn’t sufficient for people, let alone computers. We need clear, unambiguous rules. There is no document which provides any such guidance for AT, so each product has to try to come up with its own rules. And thus, the cracks in the mythical utopia of interoperability begin to emerge.
That really sucks. But enough doom and gloom. Let’s try to come up with some rules here.
Render aria-label/ledby before the real content?
Yup, this would fix the landmark case. It is bad for a case like this, though:
<button aria-label="Close">X</button>
That “X” is meaningless semantically, so the author thoughtfully used aria-label. If we use both the name and content, we’ll get “Close X”. Yuck!
Landmarks are just special. You can still use aria-label/ledby as content for everything else.
Not so much. Consider this tweet-like example:
<li tabindex="-1" aria-labelledby="user message time">
<a id="user" href="alice">@Alice</a>
<a id="time" href="6min">6 minutes ago</a>
<span id="message">Wow. This blog is horrible: <a href="http://blog.jantrid.net/">http://blog.jantrid.net/</a></span>
<a href="conv">View conversation</a>
<a href="reply">Reply</a>
</li>
Twitter.com uses this technique, though the code is obviously nothing like this. The “li” element is the tweet. It’s focusable and you can move between tweets by pressing j and k. The aria-labelledby means you get a nice, efficient summary experience when navigating between tweets; e.g. the time gets read last, the View conversation and Reply controls are excluded, etc. But if we used the name as content, we’d lose the formatting, links in the message, and the View conversation and Reply controls. If we render the name before the content, we end up with serious duplication.
Can I at least label links and buttons?
Believe it or not, I actually have good news this time: yes, you can. But why links and buttons? And what else falls into this category? We need a proper rule here, remember.
There are certain elements such as links, buttons, graphics, headings, tabs and menu items where the content is always what makes sense as the label. While it isn’t clear that it can be used for this determination, the ARIA spec includes a characteristic of “Name From: contents” which neatly categorises these controls.
Thus, we reach our first solid rule: if the ARIA characteristic “Name From: contents” applies, aria-label/ledby should completely override the content.
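In sketch form, that rule might look like this. The role set here is an illustrative subset of ARIA’s “Name From: contents” roles, not the complete list, and this is not any real screen reader’s code:

```python
# Illustrative subset of ARIA's "Name From: contents" roles; not exhaustive.
NAME_FROM_CONTENTS = {"link", "button", "heading", "tab", "menuitem"}

def rendered_content(role, author_name, content):
    """Decide what a screen reader renders for an element: an
    author-supplied name (aria-label/ledby) overrides content only
    for name-from-contents roles."""
    if author_name and role in NAME_FROM_CONTENTS:
        return author_name
    return content

# aria-label completely overrides a button's content...
assert rendered_content("button", "Close", "X") == "Close"
# ...but never swallows a landmark's content.
assert rendered_content("navigation", "Main", "Lots of navigation links here") \
    == "Lots of navigation links here"
```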
What about check boxes and radio buttons?
Check boxes and radio buttons don’t quite fit this rule. The problem is that the label is often (but not always) presented separately from the check box element itself, as is the case with the standard HTML input tag:
<input id="cheese" type="checkbox"><label for="cheese">Cheese</label>
The equivalent using ARIA would be:
<div role="checkbox" aria-labelledby="cheeseLabel"> </div><div id="cheeseLabel">Cheese</div>
In most cases, a screen reader will see both the check box and label elements separately. If we say the name should always be rendered for check boxes, we’ll end up with double cheese: the first instance will be the name of the check box, with the second being the label element itself. Duplication is evil, primarily because it causes excessive verbosity.
Okay, so we choose one of them. But which one?
Ignore the label element, obviously. Duh.
Perhaps. In fact, WebKit and derivatives choose to strip out the label element altogether as far as accessibility is concerned in some cases. But what about the formatting and other semantic info?
Let’s try this example in Google Chrome, which has its roots in WebKit:
<input type="checkbox" id="agree"><label for="agree">I agree to the <a href="terms">terms and conditions</a></label>
The label element gets stripped out, leaving a check box and a link. If I read this in NVDA browse mode, I get:
check box not checked, I agree to the terms and conditions, link, terms and conditions
Ug. That’s horrible. In contrast, this is what we get in Firefox (where the label isn’t stripped):
check box not checked, I agree to the, link, terms and conditions
Ignoring the label element means we also lose its original position relative to other content. Particularly in tables, this can be really important, since the position of the label in the table might very much help you to understand the structure of the form or aid in navigation of the table.
Fine. So use the label element and ignore the name of the check box.
Great. You just broke this example:
<div role="checkbox" aria-label="Muahahaha"> </div>
Make up your mind!
I know, right? The problem is that both of these suck.
The solution I eventually implemented in NVDA is that for check boxes and radio buttons, if the label is invisible, we do render the name as the content for the check box. Finally, another solid rule.
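As a sketch (again hypothetical, not NVDA’s actual code), the check box/radio button rule looks something like this:

```python
def checkable_content(author_name, label_visible, label_text):
    """For check boxes and radio buttons: a visible label element is
    rendered at its own position in the document, so also rendering the
    name would duplicate it ("double cheese"); if the label is
    invisible, the name is all we have, so render it as content."""
    if label_visible:
        return label_text  # the label element itself is the content
    return author_name

# Visible <label>: render the label element, not the name.
assert checkable_content("Cheese", True, "Cheese") == "Cheese"
# Invisible label (e.g. aria-label only): render the name as content.
assert checkable_content("Muahahaha", False, "") == "Muahahaha"
```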
Sweet! And this applies to other form controls too, yeah?
Alas, no. The trouble with other form controls like text boxes, list boxes, combo boxes, sliders, etc. is that their label could never be considered their “content”. Their content is the actual stuff entered into the control; e.g. the text typed into a text box.
If the label is visible, it’s easy: we render the label element and ignore the name of the control. If it isn’t visible, currently, NVDA browse mode doesn’t present it at all.
To solve this, we need to present the label separately. For a flat document representation such as NVDA browse mode, this is tricky, since the label isn’t the “content” of anything. I think the best solution for NVDA here is to present the name of the control as meta information, but only if the label isn’t visible. I haven’t yet implemented this.
Rocking. Can the label override the content for divs, spans and table cells?
No, because if it did, again, we’d lose formatting and semantic info. These elements in particular can contain just about any amount of anything. Do we really want to risk losing that much formatting/info? See the Twitter example above for just a taste of what we might lose.
Another problem with this is the title attribute. Remember I mentioned that aria-label/ledby just alters what the browser exposes as the “name”? The problem is that other things can be exposed as the name, too. If there is no other name, the title attribute will be used if present. I’d say it’s quite likely that the title attribute has been used on quite a lot of divs and spans in the wild, perhaps even table cells. If we replaced the content in this case, that would be… rather unfortunate.
Some have argued that for table cells, we should at least append the aria-label/ledby. Aside from the nasty duplication that might result, this raises a new category of use cases: those where the label should be appended to the content, not override it. With a new category come the same questions: what are the rules for this category? And would this make sense for all use cases? It certainly seems sketchy to me, and sketchy just isn’t okay here. Again, we need solid, unambiguous rules.
Stop! Stop! I just can’t take it any more!
Yeah, I hear you. Welcome to my pain! But seriously, I hope this has given some insight into why this stuff is so complicated. It seems so simple when you consider a few use cases, but that simplicity starts to fall apart once you dig a little deeper. Trying to produce “common sense” behaviour for the multitude of use cases becomes extremely difficult, if not downright impossible.
If we want interoperability, we need solid rules. I’m not necessarily suggesting that this be compulsory or prescriptive; different AT products have different interaction models and we also need to allow for preferences and innovation. Right now, though, there’s absolutely nothing.
I recently had to deploy a Flask web app for Hush Little Baby Early Childhood Music Classes (shameless plug) with uWSGI. (Sidenote: Flask + SQLAlchemy + WTForms = awesome.) I ran into an extremely exasperating issue which I thought I'd document here in case anyone else runs into it.
Despite the fact that uWSGI recommends that you run a separate instance for each app, I prefer the dynamic app approach. While I certainly understand why separate instances are recommended, I think per-app instances waste resources, especially when they have a lot of common dependencies, including Python itself. I also set uWSGI to use multiple threads. Unfortunately, with Flask, this is a recipe for disaster.
As soon as Flask is imported by a dynamic app in this configuration, uWSGI instantly hangs and stops responding altogether. The only option is to kill -9. After hours of late night testing, debugging, muttering, cursing, finally going to bed and then more of the same the next day, I finally thought to try disabling threads in uWSGI. And it… worked.
Still, I needed a little bit of concurrency, didn't want to use multiple processes and didn't want to abandon the dynamic app approach. It occurred to me that if it worked fine with per-app instances (I didn't actually test this, but surely someone would have reported such a problem) and a single thread, then it should work if Flask were imported before the threading stuff happened. This led me to discover the shared-pyimport option. Sure enough, if I specify flask as a shared import (though a non-shared import might work just as well), it works even with threads > 1. Hooray!
I still don't know if this is a bug in Flask, a Flask dependency or uWSGI or whether it's just a configuration that can never work for reasons I don't understand. I don't really have time to debug it, so I'm just happy I found a solution.
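For reference, here’s a sketch of the kind of configuration described above. The socket path, thread count and other values are illustrative rather than my actual setup; the important line is the last one:

```ini
[uwsgi]
; dynamic apps: the web server passes UWSGI_SCRIPT per request/vhost
vhost = true
socket = /run/uwsgi.sock
master = true
; a little concurrency without multiple processes
threads = 4
; import Flask in the main interpreter before any threading happens,
; avoiding the hang described above
shared-pyimport = flask
```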
Josh is fairly clingy with Jen at the moment, especially at night. One evening last week, Jen, Josh and I were all lying in bed, with me cuddling Josh. We were wondering whether Josh would be happy with that this night. Soon after, Josh, who has been babbling mamama for a while now, said with total clarity, "mmmuummm." Surely that was just amusing but unintentional? He doesn't know what Mum means yet. A few seconds later, "mmmmmuuuuummmmm." Right. Even if it was unintentional, how could we resist that? Jen took him and he settled without further protest.
For a while now, we've had a Samsung LCD TV, Samsung Blueray player and Palsonic PVR in our lounge room, as well as an old 2005 notebook running Windows 7 connected to the TV for watching video files, media on the web, etc. I've recently made some major enhancements to this setup. I think they're pretty cool, cost effective and don't require lots of different devices, so I thought I'd document them here.
TV speakers really suck. For a while now, we've wanted to be able to listen to audio, particularly music, in decent quality. So, after my usual several months of research and deliberation, I bought a set of Audioengine A5+ powered bookshelf speakers. They cost around AU$400 and we're very much loving them. They're quite small and the amp is built into the left speaker, which suits well given the limited space on the TV cabinet. They have dual inputs, enabling both the notebook and TV to be connected simultaneously.
I've used foobar2000 as my audio player for years and saw no reason to diverge from that here. Our music library is now on the notebook and added to foobar2000. In addition, I'm gradually building playlists for various occasions/moods.
Having to interact with the notebook to control music sucks, so I installed the TouchRemote plugin for foobar2000. This enables us to control everything, including browsing and searching the entire library, from our iPhones and iPad using the Remote iOS app. (I could have used iTunes for this, but I despise iTunes. :))
We don't own a digital radio. However, we mostly listen to ABC radio stations, which all have internet streams. I added all of these internet streams to a separate "Radio Stations" playlist in foobar2000. This shows up in Remote, so listening to radio can be controlled from there too.
Although our music library is on the notebook, there are times when we might have audio on one of our iOS devices which we want to hear on the lounge room speakers. Of course, we could connect the device to the speakers, but that's inconvenient and sooo 20th century. Apple AirPlay allows media from iOS devices to be streamed wirelessly to a compatible receiver. I installed Shairport4w on the notebook, which enables it to be used as an AirPlay audio receiver.
This has already been useful in a way I didn't initially consider. Michael and Nicole were over for dinner and Michael wanted to play us an album he had on his iPhone. He was able to simply stream it using AirPlay without even getting up from the couch and his glass of red wine. Facilitating laziness is awesome. :)
For video files, we use Media Player Classic - Home Cinema. We don't watch too many of these, so a proper library, etc. isn't important. However, we can't currently control it remotely, which is a minor annoyance. There are several ways we could do this such as the RemoteX Premium iOS app or a web server plugin, but requiring yet another app or web browser is ugly. I wish there were a way to control this using the iOS Remote app. :(
This isn't entertainment, but it hardly warranted a separate post. We own a Canon MP560 printer/scanner, which we're very happy with. It has built-in Wi-Fi, which is nice because it means the printer can live in a separate room and we can print from anywhere in the house. Unfortunately, it doesn't support Apple AirPrint, which means Jen, who primarily uses her iPad, can't print to it. To solve this, I set up the printer on the notebook, shared the printer and installed AirPrint for Windows. It works very nicely.