Sunday, May 26, 2024

Hood County superintendents react to STAAR testing’s new AI grading system


Students across Texas will soon be guinea pigs for a brand-new AI grading system, as the Texas Education Agency (TEA) unveils yet another change to its state-mandated exam.

But Hood County superintendents are a little apprehensive about this new method.

According to the Texas Tribune, the TEA is rolling out an automated scoring engine for open-ended questions on the State of Texas Assessment of Academic Readiness (STAAR) test for reading, writing, science and social studies.

The STAAR test — which measures students’ understanding of state-mandated core curriculum — was previously graded by temporary human scorers. But with TEA’s recent change, written answers on the STAAR test will now be graded automatically by computers.

“The writing portion of the STAAR test is graded through an AI software,” Granbury Independent School District Superintendent Dr. Jeremy Glenn told the Hood County News. “This is a new technology, so our district is still learning how the Texas Education Agency plans to use the software to grade students’ writing samples.”

The technology, which uses natural language processing (a building block of artificial intelligence chatbots such as GPT-4), will save the state agency about $15 million to $20 million per year that it would otherwise spend on hiring human scorers through a third-party contractor, according to the Texas Tribune.

This new grading method comes after the STAAR test was redesigned last year. The Texas Tribune reports that the test now includes fewer multiple-choice questions and more open-ended questions — known as constructed response items. After the redesign, there are six to seven times more constructed response items.

While Glenn said this new method will be more cost-effective and will result in a quicker turnaround time on scoring students’ exams, he still has concerns regarding AI’s ability to “interpret and score a student’s creative writing assignment.”

“Writing is a unique skill,” Glenn said. “While that technology might exist, many educators are skeptical because we have not been exposed to it, nor have we seen evidence that it is a reliable scoring method for STAAR. From what we have seen so far in a limited sample, students failed at a much higher rate when graded by AI as opposed to a human scorer.”

The automated technology, also known as hybrid scoring, was already used on a limited basis in December 2023. The state overall saw an increase in zeros on constructed responses that month, but the TEA said other factors are at play, according to the Texas Tribune. In December 2022, the only way to score a zero was by not providing an answer at all. With the STAAR redesign in 2023, students can receive a zero for responses that may answer the question but lack any coherent structure or evidence.

“It appears to be a good thing, but I don’t have confidence in another factor that can lead to more discrepancies on already questionable results,” Lipan ISD Superintendent Ralph Carter said. “The changes that were made last year already are now being tested in the courts. The stakes are too high for us to put trust in an unverified technology. This technology was used on a limited basis in the December 2023 retesting and produced more zeros than ever before.”

According to a slideshow on TEA’s website, the automated scoring engine goes through a rigorous programming process that is led and checked by humans. The engine uses a sample of about 3,000 exam responses that have already received two rounds of human grading. From this field sample, the automated scoring engine learns the characteristics of responses, and it is programmed to assign the same scores a human would give.

The Texas Tribune reports that as students complete their tests this spring, the computer will first grade all the constructed responses and then a quarter of the responses will be rescored by humans.

When the computer has low confidence in the score it assigned, the response will be automatically reassigned to a human. The same happens when the computer encounters a type of response that its programming does not recognize, such as one that uses slang or words in a language other than English, according to the Texas Tribune.
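The routing rules described above can be sketched in a few lines of Python. This is only an illustration of the logic as reported, not the TEA's actual software: the function names, the 0.8 confidence threshold and the random sampling method are all assumptions made for the example, since the agency's real cutoffs and selection process are not described here.

```python
import random


def needs_human_review(confidence, recognized, threshold=0.8):
    """Return True if a response should be reassigned to a human scorer.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    # Unrecognized responses (slang, non-English words) always go to a human.
    if not recognized:
        return True
    # Low-confidence scores also go to a human.
    return confidence < threshold


def pick_rescoring_sample(response_ids, fraction=0.25, seed=0):
    """Pick about a quarter of responses for human rescoring.

    Random sampling is an assumption; the selection method is not stated.
    """
    ids = list(response_ids)
    rng = random.Random(seed)
    return set(rng.sample(ids, round(len(ids) * fraction)))
```

Under these assumptions, a response scored with 0.5 confidence would be sent to a human, while a recognized response scored with 0.95 confidence would keep its computer score unless it also landed in the rescoring sample.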

Carter said districts were told over the last year that the test was changing, but with AI “bots” grading tests, it is clear to him that the TEA wants public schools to fail.

“Acceptance of the validity of the scores is going to be a huge challenge,” Carter said. “Also, validity is going to be a long time coming for anyone who understands statistical analysis. There will also be issues with teaching strategies as well as constructed responses for younger test takers.”

While Glenn and Carter both had strong concerns regarding the new system, Tolar ISD Superintendent Travis Stilwell said he is unsure where he stands on AI grading exams.

“Obviously, a negative is that the personal touch is taken away. However, I do understand that it will make grading faster and maybe more consistent,” Stilwell said. “But, when dealing with writing samples, it is important for the reader to be able to ‘feel’ what someone is trying to say. I am not certain that this will be possible with AI. Overall, the unknown makes it a little scary. I guess time will tell whether it works in this instance.”

According to TEA’s slideshow, automated scoring technology is over a decade old and is widely used, including in Texas. Approximately 180,000 Texas students each year use the Texas Success Initiative Assessment (TSIA), which relies on automated scoring technology, to meet their graduation requirement. Additionally, more than 21 states currently employ autoscoring for their state assessments.

“The main concern for GISD and districts around the state is that our students’ work is scored in a fair and equitable manner,” Glenn said. “Students work hard to prepare for the STAAR test. It is important that creativity, individuality and emotions in their writing are scored correctly and valued.”

He added that he hopes the AI software has been thoroughly vetted for validity and reliability on every student’s STAAR test.

“We (GISD) trust that the Texas Education Agency has done its best to ensure that any AI scoring program will work to the benefit of our students and districts,” he said.