Monday, February 10, 2025

Using AI/LLM to Implement a News Headline Filter

There are news topics that I'd rather avoid being bombarded with, but options for filtering headlines on the news sites are generally very limited.  You can tell Google News to block certain sources, and there's a "fewer like this" option which doesn't seem to do anything, but neither will block topics (e.g. "politics", "reality TV shows", etc.), so you still end up seeing things you don't want even in your "personalized" views.  Fortunately, sites like Google News provide RSS feeds that make it easier to get the list of headlines, links and descriptions without scraping websites, which can be brittle because the sites can change their layout at any time.
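As a rough illustration of why an RSS feed is easier to work with than a scraped page, here is a minimal sketch that pulls the title, link and description out of RSS 2.0 text using only the Python standard library.  The function name parse_rss and the sample feed are made up for this example and are not taken from my actual program; fetching the feed over the network is left out.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return a list of {title, link, description} dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    items = []
    # Each headline in an RSS 2.0 feed is an <item> element under <channel>.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return items


# Hypothetical sample feed, used only to show the shape of the data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example headline one</title>
      <link>https://example.com/1</link>
      <description>First description</description>
    </item>
    <item>
      <title>Example headline two</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""

headlines = parse_rss(SAMPLE_FEED)
```

Because the feed has a fixed structure, this keeps working even if the site redesigns its pages.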

I decided to write my own filter to take out things that I'm not interested in and publish the result to a site that I can access from any browser.  The simplest way (and one that doesn't even need a site to host the result) is to have a list of blocked words and drop any headlines containing those words.  As long as the app can access the RSS feed, even a mobile app can easily handle this kind of filtering, but creating and maintaining an effective block list becomes a challenge.  A political news article isn't going to be titled "Political Article on City Council Votes On Artificial Turf at Parks".  This is a good use case for a language model that categorizes a headline instead of a keyword filter.
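To make the limitation concrete, here is a small sketch of the block-list approach.  The function name filter_by_blocklist, the blocked words and the sample headlines are all invented for illustration.

```python
import re


def filter_by_blocklist(headlines, blocked_words):
    """Drop any headline containing a blocked word (whole word, case-insensitive)."""
    if not blocked_words:
        return list(headlines)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in blocked_words) + r")\b",
        re.IGNORECASE,
    )
    return [h for h in headlines if not pattern.search(h)]


# Hypothetical data for illustration only.
sample_headlines = [
    "City Council Votes On Artificial Turf at Parks",
    "Election results spark debate over politics",
    "New telescope images released",
]
blocked = ["politics", "election"]

kept = filter_by_blocklist(sample_headlines, blocked)
```

With this data, the city council headline is kept because none of the blocked words appear in it, even though most readers would call it political.  Catching it would mean adding "council", "votes" and so on, and the list keeps growing from there.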

I created a prompt in natural language explaining what I don't want to see, along with some additional details to supplement the common definitions, then sent my instructions with the list of headlines to the language model and had it return a filtered list:
Remove all political headlines.  Headlines that includes political figures, and celebrities who are active in politics such as Elon Musk should also be removed.
Being able to use natural language to handle categorization makes it so much easier to build the app.  The categorization would've been the most challenging part of the project, but the LLM allowed me to get good results very quickly (it took just a few minutes to get the first results, plus some more time to tweak the prompt).  Another benefit of large language models such as Gemini is that they understand multiple languages, so while I gave my instructions in English, the model can filter headlines in French, Chinese, Japanese, etc.

Using a language model does mean giving up some control, such as relying on the model's interpretation of what "political" means.  Prompts can help refine that interpretation, but sometimes a headline still makes it through the filter, and working out why is harder than with a conventional algorithm where you can read the code and follow its reasoning.

I encountered a problem where the LLM's response stopped midway.  This was because LLMs have limits on input and output tokens (how much information you can send in and how large a response the model will send back).  The input token limit is so high that I'm not likely to have enough headlines to even remotely approach it (Gemini is 1 million tokens for the free tier and 2 million tokens for the paid tier, and I'll be using around 20,000).  The output limit is much smaller (~8k), so if I wanted it to send back the complete details of the filtered headlines (title and link), it wouldn't be able to.  To address this problem, I send the LLM the headlines with an index and have it return just the indexes.  For 100 headlines, the size of the output is less than 200 tokens.
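The index trick can be sketched roughly as follows.  This is not the code from my program: the helper names, the wording of the return-format instruction, and the choice to have the model return the indexes to keep (rather than the ones to drop) are all assumptions made for the example, and the model's reply is a stand-in string instead of a real API call.

```python
import re


def build_prompt(instructions, headlines):
    """Number each headline so the model can answer with indexes only."""
    numbered = "\n".join(f"{i}: {title}" for i, title in enumerate(headlines))
    return (
        f"{instructions}\n\n"
        "Return only the index numbers of the headlines to keep, "
        "separated by commas.\n\n"
        f"Headlines:\n{numbered}"
    )


def parse_indices(response_text, count):
    """Extract valid, de-duplicated indexes from the model's reply."""
    seen = set()
    kept = []
    for token in re.findall(r"\d+", response_text):
        i = int(token)
        # Ignore anything out of range in case the model makes up an index.
        if i < count and i not in seen:
            seen.add(i)
            kept.append(i)
    return kept


def keep_by_index(items, indices):
    """Map the returned indexes back to the full headline records."""
    return [items[i] for i in indices]


# Hypothetical data; fake_response stands in for what the model might send back.
titles = ["Headline A", "Headline B", "Headline C"]
prompt = build_prompt("Remove all political headlines.", titles)
fake_response = "0, 2"
filtered = keep_by_index(titles, parse_indices(fake_response, len(titles)))
```

Since the title and link never have to come back from the model, the reply stays tiny no matter how long the headlines are, and the full details are looked up locally from the index.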

A cron job executes the program and writes the result as JSON to the server, where the web page loads the JSON and displays the headlines.
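The JSON-writing step might look something like the sketch below.  The function name, the file layout and the write-to-a-temporary-file-then-rename step are my assumptions for this example rather than a description of the actual program.

```python
import json
import os


def write_headlines_json(items, path):
    """Write the filtered headlines as JSON, replacing the old file in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English headlines readable in the file.
        json.dump(items, f, ensure_ascii=False, indent=2)
    # Rename over the old file so the web page never reads a half-written one.
    os.replace(tmp_path, path)
```

The web page then only needs to fetch that one static file, so the server doesn't have to run any code when someone loads the page.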


The code is on GitHub, and you can see that it is a very simple program.
