Request for a deduplication step in the generate_synthetic_queries_over_documents function to avoid returning duplicated synthetic questions.
### Feature Description

# Semantic Deduplication

## Current State: Absence of any Deduplication step

In `llama_index/finetuning/cross_encoders/dataset_gen.py`, `generate_synthetic_queries_over_documents(...)` generates synthetic questions per document chunk using an LLM and aggregates them into a flat list. However, there is currently no deduplication step before the final list of questions is returned.

Currently, we have (see [full code](https://github.com/run-llama/llama_index/blob/f6b8a6d00d20f222975ac60f3111f0d29d41b462/llama-index-finetuning/llama_index/finetuning/cross_encoders/dataset_gen.py#L35C1-L88C21)):

```py
questions.extend(response_questions)
...
return questions
```

All generated questions are appended and returned as-is, without normalization or deduplication. This can result in identical questions across multiple chunks, as well as near-duplicate questions that differ only in minor formatting. Further down the training pipeline, this can lead to redundant training examples in