Deduplicate threads on Newest forum tab #38

Merged
noah merged 2 commits from deduplicate-forum-newest into main 2024-02-16 03:56:16 +00:00

This is some work-in-progress code to make the Newest tab on the forum nicer when a popular thread is getting tons of responses, so that the flood of replies doesn't dominate a whole page of results.

The original query simply selects ALL forum comments, ordered by creation time:

```sql
SELECT
  comments.id AS comment_id,
  threads.id AS thread_id,
  forums.id AS forum_id,
  comments.updated_at AS updated_at
FROM comments
LEFT OUTER JOIN threads ON (table_name='threads' AND table_id=threads.id)
LEFT OUTER JOIN forums ON (threads.forum_id=forums.id)
-- and a handful of filters similar to the query below
```

To de-duplicate by thread, I could add a `SELECT DISTINCT ON (threads.id)` to ensure that each thread ID is only returned once. However, with DISTINCT ON, PostgreSQL requires the DISTINCT ON expression to match the leftmost ORDER BY expression, so the query has to be ordered by thread ID first. Ordering by thread ID is not useful for the "Newest" tab, where what I need to order by is the timestamp of the most recent comment.
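
To illustrate how the sort order interacts with DISTINCT ON, here is a minimal sketch in Python rather than SQL. It assumes DISTINCT ON behaves as "keep the first row of each group in sort order". The sample rows and the `distinct_on_thread` and `newest_first` helpers are hypothetical, made up for this illustration, and are not part of the site's code.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (comment_id, thread_id, updated_at).
rows = [
    (1, 10, "2024-02-01"),
    (2, 10, "2024-02-14"),
    (3, 20, "2024-02-10"),
    (4, 10, "2024-02-15"),
    (5, 20, "2024-02-03"),
]


def distinct_on_thread(sorted_rows):
    """Keep the first row of each thread_id run, like DISTINCT ON (thread_id).

    The rows must already be sorted with thread_id as the leading key,
    which mirrors the ORDER BY requirement.
    """
    return [next(group) for _, group in groupby(sorted_rows, key=itemgetter(1))]


def newest_first(result):
    """Stand-in for the outer ORDER BY updated_at DESC."""
    return sorted(result, key=itemgetter(2), reverse=True)


# ORDER BY thread_id only: the row kept per thread is whichever happens to
# come first (here, insertion order; in PostgreSQL it is unpredictable).
only_thread = sorted(rows, key=itemgetter(1))
arbitrary = newest_first(distinct_on_thread(only_thread))

# ORDER BY thread_id, updated_at DESC: two stable sorts, minor key first,
# so the first row of each thread is its newest comment.
thread_then_newest = sorted(
    sorted(rows, key=itemgetter(2), reverse=True), key=itemgetter(1)
)
latest = newest_first(distinct_on_thread(thread_then_newest))

print(arbitrary)
print(latest)
```

With only the thread ID as sort key, thread 20 ends up listed above thread 10 even though thread 10 has the most recent comment. Adding the timestamp as a second sort key makes each thread be represented by its newest comment.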

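I found some info on StackOverflow [here](https://stackoverflow.com/questions/9796078/selecting-rows-ordered-by-some-column-and-distinct-on-another) and [here](https://stackoverflow.com/questions/9795660/postgresql-distinct-on-with-different-order-by) on how you can ORDER BY a different column than the one you SELECT DISTINCT ON. I'm currently trying the subquery approach.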

The raw SQL query I'm trying to build so far looks like this (annotated):

```sql
SELECT comment_id, thread_id, forum_id, updated_at
FROM (
  -- subquery so we can DISTINCT ON thread id, de-duplicating by thread
  -- so we will only see each thread ID one time in the result
  SELECT DISTINCT ON (threads.id)
    comments.id AS comment_id,
    threads.id AS thread_id,
    forums.id AS forum_id,
    comments.updated_at AS updated_at
  FROM comments
  LEFT OUTER JOIN threads ON (table_name='threads' AND table_id=threads.id)
  LEFT OUTER JOIN forums ON (threads.forum_id=forums.id)
  WHERE table_name = 'threads'
  AND forums.category IN ('Rules and Announcements','Nudists','Exhibitionists','Photo Boards','Anything Goes')
  AND EXISTS (
    SELECT 1
    FROM users
    WHERE users.id = comments.user_id
    AND users.status = 'active'
  )
  -- the subquery MUST order by thread ID first because of the DISTINCT ON clause;
  -- the second sort key makes DISTINCT ON keep the newest comment of each thread
  -- (without it, which comment is kept per thread is unpredictable).
  -- however, ordering the result by thread ID would break the "Newest" requirement,
  -- we want the overall result ordered by comment timestamp!
  ORDER BY threads.id, comments.updated_at DESC
) AS subquery
-- so here, we order the subquery result by the comment timestamp
ORDER BY subquery.updated_at DESC;
```
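
For comparison, another common way to get "one row per thread, newest first" is a `ROW_NUMBER()` window function instead of DISTINCT ON. The sketch below is not the query from this PR. It runs against an in-memory SQLite database (which has no DISTINCT ON), and it assumes a simplified, hypothetical `comments` table with a direct `thread_id` column rather than the real `table_name`/`table_id` pair, with no forum or user filters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema and sample data for illustration only.
conn.executescript(
    """
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER NOT NULL,
        updated_at TEXT NOT NULL
    );
    INSERT INTO comments (id, thread_id, updated_at) VALUES
        (1, 10, '2024-02-01 09:00:00'),
        (2, 10, '2024-02-14 18:00:00'),
        (3, 20, '2024-02-10 08:30:00'),
        (4, 10, '2024-02-15 12:00:00'),
        (5, 20, '2024-02-03 07:15:00');
    """
)

QUERY = """
SELECT comment_id, thread_id, updated_at
FROM (
    SELECT
        id AS comment_id,
        thread_id,
        updated_at,
        -- number the comments of each thread, newest first
        ROW_NUMBER() OVER (
            PARTITION BY thread_id
            ORDER BY updated_at DESC, id DESC
        ) AS rank_in_thread
    FROM comments
) AS ranked
-- keep only the newest comment of each thread
WHERE rank_in_thread = 1
ORDER BY updated_at DESC
"""

newest = conn.execute(QUERY).fetchall()
for row in newest:
    print(row)
```

Each thread appears once, represented by its most recent comment, and the outer ORDER BY is free to sort by timestamp because the de-duplication no longer depends on the sort order of the inner query.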
noah added 1 commit 2024-02-15 05:49:46 +00:00
noah added 1 commit 2024-02-16 03:53:37 +00:00
noah changed title from WIP: Deduplicate threads on Newest forum tab to Deduplicate threads on Newest forum tab 2024-02-16 03:56:08 +00:00
noah merged commit 85d2f4eee9 into main 2024-02-16 03:56:16 +00:00
noah deleted branch deduplicate-forum-newest 2024-02-16 03:56:16 +00:00
Reference: nonshy/website#38