An abundance of text: Five reasons for establishing a national text labelling and annotation platform for the education sector.

Introduction
In a previous article I described the reasons for establishing a national text labelling and annotation platform for the education sector. The basic premise for the idea is a simple one. Firstly, establish a national text labelling and annotation platform for the education sector, whose capabilities expand over a period of time. Secondly, build a corpus of labelled and annotated text that is specific to the education sector and to all points on the student life cycle. And thirdly, use this large body of text to create AIED products and services to meet the needs of students, teachers, campus support teams and other stakeholders within the education sector. All of these endeavours will be underpinned by appropriate ethical and governance arrangements, and a sustainable business model.

The reasons for establishing a national text labelling and annotation platform are listed as follows:

1. Schools, colleges and universities generate and hold a very large volume of textual data that can be pooled and labelled.
2. An opportunity to establish good ethical practices that can be applied to the collection, annotation and use of training data.
3. An opportunity to establish good governance arrangements for managing training data across the education sector.
4. A large and growing library of annotated text could support the creation of multiple AIED services that address common problems faced by educational institutions.
5. The establishment of a national text labelling and annotation service supports the creation of new AIED services through a shared, participative and collaborative model.

As a follow-up to the previous article, I would like to elaborate on my original thoughts and give a more detailed rationale for establishing a national text labelling and annotation platform for the education sector.

The case for establishing a national text labelling and annotation platform.

Schools, colleges and universities generate and hold a very large volume of textual data that can be pooled and labelled.
Schools, colleges and universities generate and hold a large volume of data in the form of text. This includes online course materials on a learning management system, student work, teachers' feedback to their students, comments made by teachers and campus support teams as they support students through their studies, documents on an intranet, student complaints, student responses to survey questions that record their views about their courses and the campus services they have used, and much more. At present, the vast majority of this textual data is not labelled or annotated, and it lies dormant on the platforms that store it. Further large stores of text are held by awarding bodies, in the form of student responses to open-ended exam questions, and by educational publishers.

When you examine the text that is held by each school, college, university, awarding body or educational publisher, there is a lot of overlap. If subsets of this large body of text are labelled and annotated, value could be derived from the creation of new AIED services that support students, teachers and campus support teams at all points on the student life cycle. Some of the services that could be developed are listed in the commentary on the fourth reason for establishing a national text labelling and annotation platform.

Because the text held by educational institutions is not labelled, annotated or classified, the value of this data may not be fully realised: it may be difficult for educators and students to find and access relevant information. In addition, the lack of labels and annotations may make it difficult to understand the context and significance of the data. One potential solution is to establish a national text labelling and annotation platform whose systems and processes are designed to support the labelling and annotation of text that is generated, processed and held by schools, colleges and universities. This could involve using tools such as natural language processing and machine learning to label and annotate text automatically, or it could involve manual annotation by teachers, administrators and the wider education community.
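To make the idea of combining manual and automatic labelling more concrete, here is a minimal sketch in Python. It assumes a handful of manually labelled sentences and uses a small naive Bayes classifier to suggest a label for new text. The class name, the label names and the example sentences are all hypothetical, and a real platform would use far more data and a more capable model.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text, split on whitespace and strip basic punctuation."""
    words = (w.strip(".,;:!?()'\"").lower() for w in text.split())
    return [w for w in words if w]


class NaiveBayesLabeller:
    """A tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def train(self, labelled_texts):
        """Learn word counts from (text, label) pairs supplied by annotators."""
        for text, label in labelled_texts:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        """Return the most probable label for a piece of unlabelled text."""
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_examples in self.label_counts.items():
            score = math.log(n_examples / total_examples)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical sentences labelled by hand (invented for illustration).
manually_labelled = [
    ("The library closes too early during exam week", "campus_services"),
    ("The canteen queue is very long at lunchtime", "campus_services"),
    ("The feedback on my essay was clear and helpful", "teaching"),
    ("My tutor explained the assignment brief well", "teaching"),
]

labeller = NaiveBayesLabeller()
labeller.train(manually_labelled)
suggested = labeller.predict("The feedback from my tutor was helpful")
print(suggested)
```

In practice the suggested label would be shown to a teacher or data labeller for confirmation, and each confirmed label would be added to the training data.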

Adding labels and annotations to text data makes it easier for students and educators to search for and find relevant information. This need grows with the proliferation of textual data on our campuses. Labels and annotations also give the data context and structure, making it easier to organise, navigate, understand and use, and to draw insights from. Labelling and annotating text data in a standardised manner can also improve interoperability, that is, the ability of different systems and platforms to communicate and exchange data with one another. When institutions use the same conventions, labels and annotations to describe their data, it becomes easier to share and exchange that data between institutions and organisations within the education sector.
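As an illustration of what a shared convention might look like, the sketch below defines a single annotation record and moves it between two systems as JSON. The field names, the label and the version string are assumptions made for this example; they are not part of any existing standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Annotation:
    """One labelled span of text. All field names here are hypothetical."""
    text: str             # the annotated span
    label: str            # label drawn from a shared, agreed vocabulary
    start: int            # character offset where the span begins
    end: int              # character offset where the span ends (exclusive)
    annotator_role: str   # for example "teacher" or "data_labeller"
    scheme_version: str   # version of the shared labelling scheme


def export_record(annotation):
    """Serialise an annotation so that another institution can read it."""
    return json.dumps(asdict(annotation), sort_keys=True)


def import_record(payload):
    """Rebuild an annotation from the JSON produced by export_record."""
    return Annotation(**json.loads(payload))


source = "The feedback on my essay arrived three weeks late."
record = Annotation(
    text=source[4:12],
    label="feedback_timeliness",
    start=4,
    end=12,
    annotator_role="teacher",
    scheme_version="0.1",
)

payload = export_record(record)     # what one institution sends
received = import_record(payload)   # what another institution reads
print(received == record)
```

Because both sides agree on the same fields and label vocabulary, the receiving system can use the record without any bespoke conversion.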

In some cases, labelling and annotating text data may be required for regulatory or legal purposes. By ensuring that text data is properly labelled and annotated, schools, colleges and universities can help to ensure compliance with relevant laws and regulations, such as those that pertain to data privacy or accessibility. Furthermore, the labelling and annotation of text can improve the accuracy and reliability of AIED services, especially when the data that is used to train the systems is representative of real-world educational contexts.

An opportunity to establish good ethical practices that can be applied to the collection, annotation and use of training data.
At first sight, the activities that support data labelling and text annotation may appear to be simple, but there are many issues that need to be addressed if the work is to be conducted ethically. For example, we need to guard against unethical employment practices in the hiring of data labellers to support text annotation projects. Data labellers will be subject specialist teachers or domain experts, so the monetary reward for carrying out data labelling needs to be commensurate with the work that they are undertaking. Some data labelling and text annotation projects may involve the handling of sensitive material, such as text about self-harm, mental health or suicide. In these instances, support will need to be provided to the individuals who are involved with such projects. Guidance and practice will need to be in place to guard against bias, which may lead to discriminatory practices against individuals or specific groups of students and teachers. For example, the labelling of postal codes, gender, race, age, learning support needs and more will all need to be addressed. In many cases, red lines will need to be drawn about what is not acceptable practice.

Here are some ethical practices that can be applied to the collection, annotation, and use of training data within the education sector. The sector and the institutions that operate within it will need to obtain explicit consent from individuals whose text is being collected, annotated, or used. This includes clearly communicating the purpose of the collection or annotation, how the text will be used, and any potential risks or benefits associated with participation. They will need to protect the privacy and confidentiality of the text that is collected, annotated, or used. Schools, colleges and universities will need to ensure that the collection, annotation, or use of the text is fair and non-discriminatory, and is not biased against any particular group or individual. They will need to clearly communicate the purpose, process, and use of the data that is to be collected. And the sector will need to allow individuals or institutions to choose whether or not to participate in the collection, annotation, and use of the training data.

An opportunity to establish good governance arrangements for managing training data across the education sector.
A national text labelling and annotation service would allow good governance arrangements to standardise the labelling and annotation processes, whether for small-scale AIED projects or for national AIED projects that are designed to support hundreds of thousands of students across the UK. If textual data is supplied by thousands of education institutions across the UK, appropriate governance arrangements will need to be put in place to enable the delivery of AIED services that are trusted by everyone who comes to use them, and by the institutions that supply text to the service. The lack of a national governance framework for text labelling and annotation services could lead to a fragmented landscape, with individual organisations and companies designing and implementing their data labelling projects in isolation from their counterparts and with little or no interoperability between one service and another. A national governance framework will also provide advice and guidance regarding controversial annotation projects that involve facial recognition, the use of sensitive datasets or projects that do not align with the needs of the education sector. Recent reports from UNESCO and Jisc begin to address the ethical concerns of designing, using and managing AIED services, but more guidance is still needed regarding data labelling and text annotation.

Establishing good governance arrangements for training data management in the education sector can help ensure that the use of training data is aligned with the values and goals of the sector. This can involve setting clear policies and procedures for the collection, storage, and use of training data, as well as establishing oversight mechanisms to ensure compliance with these policies. Good governance arrangements can also involve engaging with students, teachers, parents, and other stakeholders to ensure that their concerns and preferences are taken into account. By involving all stakeholders in the decision-making process, the education sector can ensure that training data is used in a way that reflects the values and goals of the sector, and that the rights and interests of all stakeholders are protected. Good governance arrangements can also help ensure that training data is used ethically and responsibly, and that the privacy and confidentiality of individuals are respected. By establishing good governance arrangements, the education sector can build trust and confidence in the use of training data, which can help support the development and use of effective and innovative education technologies.

If there are poor governance arrangements within the education sector for managing training data, there can be several ramifications. Firstly, poor governance arrangements could erode trust and confidence in the use of AIED services, particularly if there are concerns about the ethical use of training data, which can hinder the development and adoption of these new and emerging services. Secondly, they could increase the risk of legal and regulatory violations, particularly if there are concerns about the privacy and security of training data. This can result in costly fines and legal action, which can damage the reputation of AIED services. Thirdly, poor governance arrangements can lead to inefficient use of resources, as there may be a lack of clear policies and procedures for managing training data, leading to confusion and duplication of effort. Fourthly, poor oversight may result in the loss of value from the training data, as there may be a lack of clarity about how the data can be used, which could lead to missed opportunities for innovation. Finally, poor governance arrangements could reduce the potential impact of using a national pool of training data for the education sector.

A national data labelling platform may also inform the development of a governance framework that addresses concerns relating to algorithmic authority within the education sector. For instance, algorithmic systems can be biased if they are trained on data that does not reflect the diversity of the education sector. This can lead to unfair outcomes, such as recommendations or decisions that are disproportionately negative for certain groups. It is important to ensure that the algorithms used in the education sector are transparent and explainable, so that stakeholders can understand how decisions are being made and have confidence in their fairness and accuracy. The sector will also need to ensure that the collection of data and its use protects individual privacy and confidentiality. It is important to ensure that algorithmic systems are used as tools to support human decision-making, rather than replacing human judgment entirely.
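One simple check that a governance framework might ask for is a comparison of how often an algorithmic system produces a negative outcome for different groups. The sketch below is a minimal illustration of that idea only. The group names, the outcome values and the records are invented, and a single rate comparison is not a complete fairness audit.

```python
from collections import defaultdict


def negative_rate_by_group(decisions):
    """Return the share of negative outcomes for each group.

    `decisions` is a list of (group, outcome) pairs, where outcome is
    either "negative" or "positive" (hypothetical values).
    """
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for group, outcome in decisions:
        totals[group] += 1
        if outcome == "negative":
            negatives[group] += 1
    return {group: negatives[group] / totals[group] for group in totals}


# Invented records for two hypothetical groups of students.
decisions = [
    ("group_a", "negative"), ("group_a", "positive"),
    ("group_a", "positive"), ("group_a", "positive"),
    ("group_b", "negative"), ("group_b", "negative"),
    ("group_b", "positive"), ("group_b", "positive"),
]

rates = negative_rate_by_group(decisions)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap does not prove that a system is unfair, but it is a signal that a person should review both the training data and the decisions before the system is relied upon.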

A large and growing library of annotated text could support the creation of multiple AIED services that address common problems faced by educational institutions.
One of the factors that contributes to complexity is the growing volume of data that permeates our education institutions. This is especially true if we see every facet of a campus as an entity that produces, manipulates or consumes data. The traditional tools that are used to manage data, such as management information systems, CRM systems, business intelligence systems, learning management systems and data dashboards, all too often fall short. In many cases, they exacerbate the problems that are associated with complexity and urgency. This problem was noted during the 1950s by Norbert Wiener, who stated that whilst communication mechanisms do become more efficient, they are subject to increasing levels of entropy. Wiener also stated that external agents could be introduced to control entropy. The agents that support the day-to-day operations of an AIED service are designed to carry out narrow and well-defined cognitive tasks and, in doing so, they reduce the level of complexity on our campuses. Mark Weiser and John Seely Brown later coined the term calm technology to describe the nature of these services (Hussain, 2019). One can easily envisage how a large pool of labelled and annotated textual data can be used as a basis for creating new AIED services. A pool of annotated data will also reduce the time and costs that are associated with developing these services. Some of these services are listed below. I am sure that you are able to think of many more.

A survey platform that hosts a bank of open-ended questions that could be posed to students and other stakeholders. As recipients respond to a question using free-form text, the administrator of the survey is presented with a real-time tally and summary of responses. Text annotation can be used to gather feedback from students about their courses or campus services. Educational institutions value the voice of their students, and they regularly seek their opinion to inform the development of courses and campus services. Current digital data collection tools allow for the rapid and large-scale collection of responses to closed questions. However, the processing of responses to open-ended questions remains slow and protracted. The FirstPass platform is well placed to support this use case by taking advantage of text classification and sentiment analysis. My colleagues and I at Bolton College envisage that text classifiers can be set up and trained, and then used to support the delivery of open-ended survey questions. The administrators of the survey would then be able to access a textual and graphical summary of each question, which updates in real time as more students respond.
If subject-specific text can be labelled and annotated, services can be developed to support teachers and students with the formative assessment process. Bolton College's FirstPass platform is an early example of such a service. With regard to formative assessment, teachers have a high volume of work that is associated with giving effective feedback, especially when their students respond to open-ended questions using free-form text. This invariably means that students have to wait for a few days or more before they receive feedback from their teachers. Bolton College's FirstPass platform has demonstrated that a text annotation platform can aid and support the formative assessment process, specifically with regard to providing real-time feedback to students as they compose free-form text responses to open-ended questions that have been posed by their teachers. One of the key properties of the FirstPass platform is that its subject topic classifiers become more reliable and accurate as they are exposed to additional annotated text that is sourced from teachers and students as they interact with the platform.
Quality assurance tools for moderating formative and summative assessment practices, the feedback that is presented to students or the comments that are made by teachers and campus support teams. We are familiar with tools such as Grammarly that support us as we write online. A text labelling and annotation platform will underpin services that perform a similar job as they help teachers and campus support teams to author commentary about a student, such as formative assessment feedback or SMART targets that have been agreed with a student. These micro services will be embedded in virtually all online services that are used by students, teachers and campus support teams. Whilst these services support the end user, quality assurance teams will be able to use AIED tools that help them sample work that has been marked by teachers and examiners.
Tools to support the authoring of online tutorials and training packages for students and campus staff. Instructional design packages such as Articulate 360, Adobe Captivate, Lectora, Nolej AI and others will also make use of labelled and annotated text. For example, an instructional designer may call upon a natural language service to assist with the authoring of an introduction, assessment activity or summary for an online tutorial or training package.
There will be numerous opportunities to develop tools that will support the day-to-day tasks and workflows that are undertaken by schools, colleges and universities. In high-stakes environments such as the education sector, it must be reiterated that individuals will need to remain in the loop at all times and that algorithmic authority must not override the decisions made by students, teachers and campus support teams.
We will see the fruits of the text labelling and annotation platform manifesting themselves in many of the day-to-day services that students, teachers and campus support teams use to support their studies and work. Machine Learning Operations will become an important element of all online campus services, but this will only be possible if the costs of training, deploying and maintaining the models fall substantially. The advent of the national text labelling and annotation platform will play an important role in supporting this endeavour.
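To give a flavour of the first service in the list above, the following sketch keeps a running tally of open-ended survey responses as they arrive. It is only an illustration: the word lists are a crude, hypothetical stand-in for a trained sentiment classifier, and the names SurveyTally and sentiment are invented for this example rather than taken from the FirstPass platform.

```python
from collections import Counter

# Hypothetical word lists standing in for a trained sentiment classifier.
POSITIVE = {"helpful", "clear", "great", "good", "supportive"}
NEGATIVE = {"late", "confusing", "poor", "slow", "unhelpful"}


def sentiment(response):
    """Label a free-text response as positive, negative or neutral."""
    words = {w.strip(".,;:!?").lower() for w in response.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


class SurveyTally:
    """Keeps a running count of labels for one open-ended question."""

    def __init__(self):
        self.counts = Counter()

    def add_response(self, response):
        """Classify a new response and update the tally straight away."""
        label = sentiment(response)
        self.counts[label] += 1
        return label

    def summary(self):
        """Return the current counts as a plain dictionary."""
        return dict(self.counts)


tally = SurveyTally()
for response in [
    "The tutor was helpful and clear",
    "Feedback arrived late",
    "The course runs on Tuesdays",
    "Timetable changes were confusing and slow",
]:
    tally.add_response(response)

print(tally.summary())
```

The summary can be read at any point, which is what would allow a survey administrator to watch the picture change as more students respond.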

The national text labelling and annotation platform should be designed so that it can incubate and stimulate potential ideas and services. Its intention is to act as a catalyst for emerging ideas and innovations that surround the use of natural language processing across the broader education sector.

The establishment of a national text labelling and annotation service supports the creation of new AIED services through a shared, participative and collaborative model.
A national text labelling and annotation platform and the services that emerge from it will be well placed to support a shared, participatory and collaborative model. One of the defining traits of AIED services, and one that sets them apart from legacy EdTech services, is that their design, use and performance are built on that shared, participatory and collaborative model. If cognitive computing becomes present in every digital service that students or teachers touch, and if we see technology as a human activity, the national text labelling and annotation platform and the digital services that emerge from it will themselves be shaped and altered by their use. For example, as teachers and data labellers train subject topic classifiers on Bolton College's FirstPass platform, and as students respond to open-ended questions that are based on these classifiers, the volume of training data rises, which in turn improves the performance of the platform as it tries to offer real-time feedback to students. The means of production for AIED services such as the national text labelling and annotation platform depend on leveraging the strengths that are inherent in large numbers of people who fulfil tasks that were previously carried out by the few. The participatory model lends itself particularly well to the education sector, where the larger group is motivated towards shared goals. If this is the case, the national text labelling and annotation platform can indeed be developed with a communal spirit (Hussain, 2021).

Wang et al. (2012) state that annotation projects, such as the ones that will emerge from the national text labelling and annotation platform, will require various forms of motivation to achieve the end goal of annotation. They highlight obvious motivations of the participant such as profit or altruism, but they state ‘that the space of possible motivations and dimensions of crowdsourcing have not been fully explored.’ In the case of the education sector, the primary motives for participants are very likely to be the desire to improve the student experience, to support student wellbeing, to support students as they journey through their studies, to raise attainment levels and so forth. These motivations are well placed to support the participatory model that underpins the national text labelling and annotation platform. Whilst Wang et al. (2012) focus on the potential benefits, Ye and Kankanhalli (2017, p. 103) draw on social exchange theory to review some of the more problematic areas of crowdsourcing platforms, such as the difference between an individual’s actual participation and planned intention, the costs and benefits of participating, and the effect of trust on participation. Their work highlights that individuals participate in an exchange when they expect the rewards to exceed the costs; they tend to maximise benefits and minimise costs from exchanges.

Whilst Ye and Kankanhalli (2017, p. 102) state that crowdsourcing models can be very productive, they caution project teams before they embark on asking people to participate. One of their first concerns is sustaining involvement by individuals on a crowdsourcing platform. Since crowdsourcing platforms are entirely dependent on the voluntary behaviour of their members, they must understand what motivates members to become and remain active throughout a variety of tasks. An important distinction between platforms that thrive and those that fail is the level of ongoing activity of their members (Boons, Stam and Barkema, 2015, p. 720). This will be a primary concern for the management team behind the national text labelling and annotation platform as it attempts to build the volunteer base for the platform. However, this can be addressed, in part, if data labellers are hired and paid to label and annotate text.

A participatory process is important when crowdsourcing labelled data; however, Zheng and Chen (2019) state that project teams need to guard against deteriorating model accuracy as a wider pool of individuals becomes involved in the annotation process. They go on to argue that repeated labelling by the crowd leads to an improvement in model accuracy, especially with regard to label inference. Label inference takes place when a natural language processing or classification model has acquired a sufficient volume of annotated data to infer a label for unseen, unlabelled text. Bolton College’s FirstPass service, for example, is able to infer the correct label(s) for sentences that its natural language classification models have not seen before.
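One common and simple way to benefit from repeated labelling is to ask several people to label the same item and keep the majority answer. The sketch below shows that idea only. It is not the framework described in the cited paper, and the function name, the agreement threshold and the example votes are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(votes_by_item, min_agreement=0.6):
    """Resolve repeated crowd labels by majority vote.

    Returns {item_id: (label_or_None, agreement)}. The label is None
    when too few annotators agree, so the item can be sent for review.
    """
    resolved = {}
    for item_id, votes in votes_by_item.items():
        if not votes:
            continue  # nothing to resolve for items with no labels yet
        label, count = Counter(votes).most_common(1)[0]
        agreement = count / len(votes)
        accepted = label if agreement >= min_agreement else None
        resolved[item_id] = (accepted, agreement)
    return resolved


# Invented votes from several annotators on two hypothetical sentences.
votes_by_item = {
    "sentence_1": ["photosynthesis", "photosynthesis", "respiration"],
    "sentence_2": ["respiration", "photosynthesis"],
}

resolved = aggregate_labels(votes_by_item)
print(resolved)
```

Items that do not reach the agreement threshold could be routed to a subject specialist, so that a wider pool of annotators does not come at the expense of label quality.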

Establishing and maintaining a national text labelling and annotation platform will require a significant investment of resources. This may include investment in infrastructure, including staff to manage the platform. There are a number of different business models that could potentially be used to finance and sustain a national text labelling and annotation platform for the education sector. One option is to fund the platform through government grants, subsidies, or other forms of public funding. This could be an attractive option if the goal of the platform is to serve the public good, and if the platform is expected to generate benefits that justify the public investment. This option also enables the platform to serve as an open educational resource. Another option is to seek private investment to finance and sustain the platform. Private investment may be particularly appropriate if the platform is expected to generate significant revenues through the sale of products or services. A third option is to finance the platform through user fees, such as fees charged to institutions or individuals that use the platform. This could be attractive if the platform is expected to generate significant value for its users, and if users are willing and able to pay for that value. And the fourth option could be to finance the platform through partnerships with private sector firms, foundations, or other organisations that are interested in supporting the development of education technologies. Partnerships can help provide access to additional resources, expertise, and networks, and may be a good way to leverage additional funding and support for the platform. Ultimately, the best business model for financing and sustaining a national text labelling and annotation platform will depend on a number of factors, including the goals and objectives of the platform, the value it is expected to generate, and the resources and capabilities of the education sector. It may be necessary to experiment with different business models and to adapt the model over time in order to ensure the long-term sustainability of the platform.

Summary
The abundance of text within the education sector, and the growing complexity that this brings to schools, colleges and universities, makes it an ideal arena for asking how natural language processing services can be put to use to serve the needs of students, teachers and campus support teams. With a little imagination, it is not difficult to list the services that could be developed to support the sector. If these services are to be trusted in a high-stakes environment, the education sector will need to ensure that appropriate ethical and governance arrangements are in place to oversee their development, and that these arrangements can flex and respond to the technological advances that will pervade AIED services. A national text labelling and annotation platform that is underpinned by a shared, participative and collaborative model is more likely to foster the development of AIED services that are trusted and supported by organisations that operate within the education sector and by the communities that they serve.

References

Boons, M., Stam, D. and Barkema, H.G. (2015). Feelings of Pride and Respect as Drivers of Ongoing Member Activity on Crowdsourcing Platforms. Journal of Management Studies, 52(6), pp.717–741. doi:10.1111/joms.12140.
Wang, A., Hoang, C.D.V. and Kan, M.-Y. (2012). Perspectives on crowdsourcing annotations for natural language processing. Language Resources and Evaluation, 47(1), pp.9–31. doi:10.1007/s10579-012-9176-1.
Ye, H. (Jonathan) and Kankanhalli, A. (2017). Solvers’ participation in crowdsourcing platforms: Examining the impacts of trust, and benefit and cost factors. The Journal of Strategic Information Systems, [online] 26(2), pp.101–117. doi:10.1016/j.jsis.2017.02.001.
Zheng, L. and Chen, L. (2019). DLTA: A Framework for Dynamic Crowdsourcing Classification Tasks. IEEE Transactions on Knowledge and Data Engineering, 31(5), pp.867–879. doi:10.1109/tkde.2018.2849385.