HackerRank's ATS: Resume Scoring is a Luck Filter

HackerRank's Open-Source ATS: When Resume Scoring Becomes a Roll of the Dice

HackerRank recently open-sourced its applicant tracking system (ATS), Hiring Agent, sparking widespread discussion on LinkedIn and Reddit. The tool promises to automate resume screening using large language models (LLMs), but early testing reveals a troubling flaw: the same unchanged resume can score anywhere from 66 to 99 out of 100 from one run to the next. This is not a one-off bug. It is a fundamental design issue that turns hiring into a luck filter.

Dan Kinsky, a software engineer, put the tool through its paces. On his first run, his resume scored a solid 90/100. After cleaning up some debug print statements, the score dropped to 74. A third run? 88. He then automated 100 runs and found scores ranging from 66 to 99. With a hypothetical cutoff at 85, his resume would fail 65% of the time—despite being identical each time. This variance isn't an edge case; it's the norm.
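The cutoff arithmetic can be sketched in a few lines. This is only an illustration: the function name `fail_rate` is made up here, and the score list is a placeholder. Its first three values are the ones reported above and the rest are invented for the example; they are not Kinsky's data.

```python
from typing import Sequence


def fail_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of runs whose score falls strictly below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s < cutoff) / len(scores)


# Placeholder scores for ONE unchanged resume scored repeatedly.
# 90, 74, 88 come from the article; the other values are invented.
example_scores = [90, 74, 88, 66, 99, 81, 79, 92, 70, 84]

print(f"fail rate at cutoff 85: {fail_rate(example_scores, 85):.0%}")
```

Because the resume never changes between runs, every failure counted here comes from the scorer's variance, not from the candidate.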

How the ATS Works

The system parses a PDF resume into text, then calls an LLM six times to extract structured data: basics, work history, education, skills, projects, and awards. It also scrapes the candidate's GitHub profile and top repositories for additional context. Finally, the LLM grades the resume on a 100-point scale, with up to 20 bonus points.
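A rough sketch of that flow follows. It is not the project's code: `evaluate_resume`, `Evaluation`, the prompt wording, and the `llm` and `fetch_github` callables are all hypothetical stand-ins. It also assumes the PDF has already been converted to text and that grading is a single LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# The six sections the article says are extracted, one LLM call each.
SECTIONS = ("basics", "work", "education", "skills", "projects", "awards")


@dataclass
class Evaluation:
    sections: Dict[str, str] = field(default_factory=dict)
    github_context: str = ""
    grade: str = ""


def evaluate_resume(
    resume_text: str,
    llm: Callable[[str], str],           # hypothetical: prompt in, text out
    fetch_github: Callable[[str], str],  # hypothetical: basics in, repo summary out
) -> Evaluation:
    result = Evaluation()
    # Step 1: one extraction call per section (six calls in total).
    for name in SECTIONS:
        prompt = f"Extract the '{name}' section as structured data:\n{resume_text}"
        result.sections[name] = llm(prompt)
    # Step 2: gather GitHub context for the candidate.
    result.github_context = fetch_github(result.sections["basics"])
    # Step 3: a final grading call (assumed to be a single call).
    grading_prompt = (
        "Grade this candidate on a 100-point scale.\n"
        f"Sections: {result.sections}\nGitHub: {result.github_context}"
    )
    result.grade = llm(grading_prompt)
    return result


# Stub LLM that just records prompts, so the sketch runs offline.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


evaluation = evaluate_resume("resume text", fake_llm, lambda basics: "no public repos")
print(f"LLM calls made: {len(calls)}")
```

Under these assumptions a single resume passes through seven sampled LLM calls, and each one is a point where two runs can diverge.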

The scoring breakdown is as follows:

Open source contributions: 35 points
Personal projects: 30 points
Work experience: 25 points
Technical skills: 10 points
Bonus points: Up to 20 for startup experience, a portfolio site, a technical blog, etc.
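Written out as data, the breakdown looks roughly like this. The names `BASE_WEIGHTS`, `MAX_BONUS` and `total_score` are illustrative, and the combination rule is an assumption: each category is clamped to its cap and the bonus is added on top. The article does not spell out how the tool combines them.

```python
from typing import Dict

# Category caps as listed above.
BASE_WEIGHTS: Dict[str, int] = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
    "technical_skills": 10,
}
MAX_BONUS = 20


def total_score(category_scores: Dict[str, float], bonus: float = 0) -> float:
    """Clamp each category to its cap, then add the clamped bonus (assumed rule)."""
    base = 0.0
    for name, cap in BASE_WEIGHTS.items():
        raw = category_scores.get(name, 0)
        base += max(0, min(raw, cap))
    return base + max(0, min(bonus, MAX_BONUS))


# Share of the base score that depends on public code.
code_heavy_share = (
    BASE_WEIGHTS["open_source"] + BASE_WEIGHTS["personal_projects"]
) / sum(BASE_WEIGHTS.values())

print(f"base total: {sum(BASE_WEIGHTS.values())}")
print(f"open source + projects share: {code_heavy_share:.0%}")
```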

The default model is Gemma3:4b, running at a temperature of 0.1—a low setting intended to push the model toward deterministic outputs. Yet, the scores still vary wildly.
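To see why a low temperature narrows the output distribution without collapsing it, here is a minimal sketch of temperature-scaled softmax, the standard mechanism behind the setting. The logits are made-up numbers for three candidate tokens, not values from Gemma or from the tool.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 4) for p in probs]}")
```

At 0.1 the top token dominates, but the runner-up keeps a small nonzero probability. Over the hundreds of tokens in a graded response, those small chances can add up to a different answer.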

Consistency Where It Doesn't Matter, Chaos Where It Does

Digging into the individual categories reveals a stark contrast. Technical skills came back as 8/10 in 98 out of 100 runs, an almost perfectly consistent result. Why? Because technical skills are a checklist: the resume either lists React or it doesn't. There's little for the LLM to judge subjectively.

Projects, on the other hand, show enormous variation. The LLM struggles to consistently evaluate qualitative aspects like architectural complexity or real-world deployment. Sometimes a project is praised, other times it's deemed lacking—a coin toss at best.

Work experience is the most troubling. Every single run scored 25/25, regardless of the candidate. A junior engineer with one internship gets full marks, as does a principal engineer with decades of experience. The prompt for this category is only two lines long, with no rubric or examples. It reads: "Analyze the 'work' and 'volunteer' sections for real-world, internship, or production experience. SPECIAL CONSIDERATION: Give extra points for founder roles, co-founder positions, or early-stage engineer roles (first 10-20 employees) at startups." There are no anchors for what constitutes a 15 versus a 25, making the score meaningless.
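A per-category comparison like the one above can be reproduced with a small helper. The function `category_spread` and the four runs below are illustrative placeholders that mirror the pattern described, not measured data.

```python
import statistics
from typing import Dict, List, Tuple


def category_spread(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float, float]]:
    """Return (min, max, population std dev) per category across runs."""
    spread = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        spread[name] = (min(values), max(values), statistics.pstdev(values))
    return spread


# Placeholder runs for one unchanged resume (invented for illustration).
runs = [
    {"technical_skills": 8, "personal_projects": 14, "work_experience": 25},
    {"technical_skills": 8, "personal_projects": 26, "work_experience": 25},
    {"technical_skills": 8, "personal_projects": 19, "work_experience": 25},
    {"technical_skills": 7, "personal_projects": 22, "work_experience": 25},
]

spread = category_spread(runs)
for name, (lowest, highest, stdev) in spread.items():
    print(f"{name:18s} min={lowest} max={highest} stdev={stdev:.2f}")
```

Zero spread is not automatically a good sign. A category pinned at its maximum is perfectly consistent and still tells a recruiter nothing.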


Temperature 0 Doesn't Fix It

Lowering the temperature to 0 doesn't solve the non-determinism. A GitHub issue opened in October 2025 shows scores of 27, 34, 32, 34, 34, and 30 across six consecutive runs at temperature 0. This isn't a bug that can be tuned away—it's a fundamental limitation of LLMs when asked to make subjective judgments.
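The spread in those six runs is easy to quantify. The scores below are the ones quoted from the issue; the summary statistics are just standard-library arithmetic.

```python
import statistics

# Six consecutive temperature-0 scores quoted from the GitHub issue.
scores = [27, 34, 32, 34, 34, 30]

lowest, highest = min(scores), max(scores)
mean = statistics.mean(scores)
stdev = statistics.pstdev(scores)  # population standard deviation

print(f"min={lowest} max={highest} range={highest - lowest}")
print(f"mean={mean:.2f} stdev={stdev:.2f}")
print(f"distinct scores: {len(set(scores))} of {len(scores)} runs")
```

A seven-point range on identical input means any cutoff placed between 27 and 34 would accept the same candidate on some runs and reject them on others.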

Even switching to a more powerful model like Gemini doesn't eliminate the problem. While the distribution tightens—scores cluster between 48 and 64—a cutoff at 60 still means failing 28% of the time through no fault of the applicant.

The Broader Implications for Hiring

The tool's heavy weighting on open source and projects (65% of the base score) further biases results. An engineer with 30 years of experience who built critical infrastructure like Amazon S3 might have little to show on GitHub. Under this system, they'd score poorly compared to a junior developer with a few open source contributions. As Kinsky notes, "Some of the best engineers I know have built things that never ended up on GitHub."

This isn't just a technical curiosity—it has real-world consequences. Companies relying on AI screening risk filtering out top talent based on randomness. The process becomes a luck filter, not a quality filter. As one critic put it, "You might as well throw out half the resumes and tell the applicants you don't fuck with bad luck."

What This Means for Job Seekers and Employers

For job seekers, the takeaway is sobering: your resume's score is partly a matter of chance. Even a perfectly tailored resume can fail or pass based on the LLM's mood. For employers, the risk is even greater. Adopting such tools without understanding their limitations could lead to systematically biased hiring decisions.

LLMs excel at parsing structured data and checking checklists—like whether a candidate knows Python. But they are fundamentally ill-suited for subjective evaluations like judging the quality of work experience or the complexity of a project. As Kinsky concludes, "Use an LLM to parse a resume into structured data—great. Use one to check whether someone knows Python—amazing. Use one to judge whether a candidate's experience is worth 18 points or 24 points? You get a vibe-check."

The broader hiring landscape is already shifting. A Business Insider report highlights that job seekers in 2026 face a new world of AI-driven processes, where resumes must be tailored to both human and machine readers. Meanwhile, compensation data is becoming more dynamic, and the need for decision-ready data is growing. But if the tools used to screen candidates are fundamentally unreliable, the entire system is at risk.

For engineers with influence over their company's hiring pipeline, the message is clear: proceed with caution. AI screening tools can be powerful, but they are not a substitute for human judgment. As the cybersecurity industry warns, even open-source AI models can be nearly as effective as proprietary ones—for better or worse. In hiring, the stakes are too high to leave to chance.