Robot journalism: The difference between scraping the internet vs using structured data

There’s an important discussion going on in our industry around the use of robots to produce journalism. Having worked in this field for several years, we’ve heard sceptics question how it’s possible to build algorithms that create consistently reliable and correct articles. The concern is valid when the underlying data is collected by scraping the internet. But when you build your automated content on structured data sets, the risk of error is minimal.

It’s easy to understand the disquiet in the debate about automated journalism, and the confusion at the root of it. As with all technology, it can be used for nefarious ends. With modern computer programming it’s possible to create an algorithm that scrapes the internet for just the kind of data or content you’re looking for, and then writes stories to suit, for example, someone’s political purposes.

But this technology can also be used for good. There are lots of serious news publishers – with journalistic principles at the core – who use automated journalism to strengthen their business, some of whom work with United Robots. The process they use is very different from the one described above.

First of all, there’s the data. In order to produce reliable, factual texts, you need to work from structured sets of quality data, such as land registry data or sports results, which are not only correct, but which will be consistently available over time. The automation workflow then includes careful analysis of the data, a verified language process and, finally, distribution on the right platform, to the right audience.

With United Robots’ technology, the algorithms are managed by man and machine in tandem. The journalists and editors we work with determine how an article should be constructed: that it should include, for example, a headline of a certain type and/or length, a stand-first to some specification, a number of facts, and a summary at the end.
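An editor-defined article template of this kind can be thought of as a small data structure plus a rendering step. The sketch below is purely illustrative – the field names and constraints are our own invention, not United Robots’ actual schema:

```python
from dataclasses import dataclass


@dataclass
class ArticleTemplate:
    """One possible shape for an editor-defined template.

    All fields here are hypothetical examples of the kinds of
    constraints a newsroom might set.
    """
    headline_max_chars: int
    standfirst_max_chars: int
    min_facts: int
    include_summary: bool = True


def render(template: ArticleTemplate, headline: str, standfirst: str,
           facts: list[str], summary: str = "") -> str:
    """Assemble an article body, enforcing the template's constraints."""
    assert len(headline) <= template.headline_max_chars, "headline too long"
    assert len(standfirst) <= template.standfirst_max_chars, "stand-first too long"
    assert len(facts) >= template.min_facts, "not enough facts"
    parts = [headline, standfirst, *facts]
    if template.include_summary and summary:
        parts.append(summary)
    return "\n\n".join(parts)
```

The point of the sketch is that the template – not the machine – decides what a finished article must contain; the code merely refuses to emit a text that violates the editors’ rules.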

The structure of the texts and rules around what angles to look for can be defined in quite some detail by the newsroom. So for example, when we build articles about property sales, a property may be defined as a “mansion” if the house is at least X in size and the land measures a minimum of Y square metres or acres. Or if we generate sports texts – let’s say football – a “sensational” turn of events may be if XYZ happens. And a “crushing defeat” may require a margin of only three goals in the top league, but six in division 4. The work in the newsroom to determine what is what is an interesting process in itself, and forces editors and reporters to really think through the language and values they use.

Once the rules for the text structure and angles are established by man, machine takes over. And what machines – the robots – contribute is that they never make factual or logical errors. If a fact is in the data, it’s correct and may be included (and consequently, if it’s not in the data, it will not feature). In other words, we build text on insights gleaned from the data analysis alone, with the rule system set in line with the journalistic principles and style sheets of the newsroom in question.

From a business perspective, working from structured data sets consistently published over time – as opposed to scraping for data – means a guaranteed volume of articles will be regularly generated, without which you can’t build sustainable news products and services. And only with structured data can you ensure the quality and reliability necessary to maintain trust in your journalism.