news_recommendation_service

所属分类:数值算法/人工智能
开发工具:Jupyter Notebook
文件大小:0KB
下载次数:0
上传日期:2023-11-26 00:45:47
上 传 者sh-1993
说明:  新闻推荐服务
(news recommendation service)

文件列表:
News_Recommendation_Platform_ipynb鈥.ipynb (165950, 2023-11-25)
Photo/ (0, 2023-11-25)
Photo/2.png (294019, 2023-11-25)
Photo/3.png (249787, 2023-11-25)
Photo/4.png (463412, 2023-11-25)
Photo/photo1.png (82502, 2023-11-25)
app.py (4354, 2023-11-25)
requirements.txt (4177, 2023-11-25)
static/ (0, 2023-11-25)
static/script.js (1257, 2023-11-25)
static/style.css (1842, 2023-11-25)
templates/ (0, 2023-11-25)
templates/index.html (3880, 2023-11-25)

# News Recommendation Service `Ge Jiang` github: https://github.com/egotist0/news_recommendation_service [toc] ## Introduction EVA is an open-source artificial intelligence relational database that supports deep learning models. It is designed to facilitate artificial intelligence database applications capable of leveraging deep learning models to process both structured and unstructured data. The database has built-in support for popular vector databases like Pinecone. ***The project aims to extend the news recommendation tool developed in Project 1 into a practical web platform.*** It utilizes *EvaDB, the ChatGPT model, and Pinecone for semantic similarity matching*, providing features such as document summarization, keyword extraction, and entity recognition for handling database documents. *Also uses Flask to build a web application service*. Based on a user's previous reading history, the tool selects the top 10 articles from a library that are most likely to align with the reader's preferences and presents the recommendations. ___________ ## Data Sources The article data is derived from [kaggle](https://www.kaggle.com/datasets/snapcrack/all-the-news), encompassing 143,000 articles sourced from 15 prominent American publications, such as the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, the Guardian, NPR, Reuters, Vox, and the Washington Post. This project exclusively focuses on the "articles1.csv" file within the dataset, containing 50,000 news articles (Articles 1-50,000). The file encompasses various attributes, including: ![photo1](https://github.com/egotist0/news_recommendation_service/tree/master/Photo/photo1.png) _____________ ## Related Work ### Pinecone Pinecone is an emerging service and tool crafted to aid organizations and developers in the effective management and utilization of extensive vector data. It stands out as a high-performance vector indexing and retrieval system tailored specifically for machine learning applications. Functioning as a robust infrastructure, Pinecone excels in storing, indexing, and searching vector embeddings. These embeddings represent data points numerically, capturing their semantic information and relationships. Widely applicable across domains like natural language processing, computer vision, recommendation systems, and anomaly detection, vector embeddings play a crucial role. A standout feature of Pinecone lies in its adept handling of high-dimensional vector data. Leveraging advanced indexing techniques, including approximate nearest neighbor search algorithms, Pinecone facilitates swift and accurate retrieval of similar vectors. This capability proves invaluable in scenarios demanding real-time or near-real-time responses, such as personalized recommendations or similarity-based searches. ### GloVe The Global Vectors for Word Representation (GloVe) stands as a widely embraced word embedding model crafted by researchers at Stanford University. This model represents words as dense vectors in a high-dimensional space, capturing their semantic relationships through co-occurrence patterns. GloVe merges global statistical insights with local context to generate word vectors. Through the factorization of a co-occurrence matrix, GloVe produces word vectors where the dot product signifies the likelihood of word co-occurrence. GloVe excels in capturing both syntactic and semantic relationships among words, rendering it invaluable for various natural language processing tasks, including word similarity computation, text classification, and machine translation. Known for its simplicity, efficiency, and effectiveness, GloVe has garnered widespread adoption in both academic and industrial settings. Pre-trained GloVe word vectors are readily available for multiple languages, seamlessly integrated into machine learning models, thereby fostering advancements in language understanding and text analysis. ____________ ## Technical Details ### Data Preprocessing Use Python's Pandas library to process data in CSV format. The code is as follows: ```python def prepare_data(data): # rename id column and remove unnecessary columns data.rename(columns={"Unnamed: 0": "article_id"}, inplace=True) data.drop(columns=['date'], inplace=True) # extract only first few sentences of each article for quicker vector calculations data['content'] = data['content'].fillna('') data['content'] = data.content.swifter.apply( lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4])) data['title_and_content'] = data['title'] + ' ' + data['content'] # create a vector embedding based on title and article columns encoded_articles = model.encode( data['title_and_content'], show_progress_bar=True) data['article_vector'] = pd.Series(encoded_articles.tolist()) return data def process_file(filename): data = pd.read_csv(filename) data = prepare_data(data) upload_items(data) return data ``` ### Build Vector Index I have established a vector index utilizing the Title and Content attributes for each article. The vectors are configured with a dimension of 300, and the metric is specified as "cosine." This index is constructed within Pinecone to streamline subsequent similarity searches across the entire content and titles of the article database. ```python def initialize_pinecone(): cursor = evadb.connect().cursor() warnings.filterwarnings("ignore") # Set api key api_key = '' os.environ["PINECONE_API_KEY"] = api_key openai.api_key = "" # Set environment environment = 'gcp-starter' os.environ["PINECONE_ENV"] = environment PINECONE_API_KEY = os.environ["PINECONE_API_KEY"] pinecone.init(api_key=api_key, environment=environment) def delete_existing_pinecone_index(): if PINECONE_INDEX_NAME in pinecone.list_indexes(): pinecone.delete_index(PINECONE_INDEX_NAME) def create_pinecone_index(): pinecone.create_index( dimension=300, name=PINECONE_INDEX_NAME, metric="cosine", shards=1) pinecone_index = pinecone.Index(index_name=PINECONE_INDEX_NAME) return pinecone_index def upload_items(data): upsert_batch = [] for i, row in data.iterrows(): upsert_batch.append((str(row.id), row.article_vector)) if len(upsert_batch) > 500: pinecone_index.upsert(upsert_batch) upsert_batch = [] # Process any remaining data in upsert_batch if upsert_batch: pinecone_index.upsert(upsert_batch) ``` ### Model The vector creation model employed is GloVe. ```python def create_model(): model = SentenceTransformer('average_word_embeddings_komninos') return model ``` ### Recommendation Logic Pinecone is designed to fetch articles that closely align with a specified object vector, determined by the similarity of the index. Given that a reader might have engaged with various articles, the reference points for the similarity search consist of multiple vectors. Given that the Similarity function in EvaDB lacks the capability to search for the similarity of multiple objects, I have chosen to utilize the query API directly offered by Pinecone. ```python def query_pinecone(reading_history_ids): reading_history_ids_list = list(map(int, reading_history_ids.split(','))) reading_history_articles = uploaded_data.loc[uploaded_data['id'].isin( reading_history_ids_list)] article_vectors = reading_history_articles['article_vector'] reading_history_vector = [*map(mean, zip(*article_vectors))] query_results = pinecone_index.query( vector=[reading_history_vector], top_k=10) res = query_results['matches'] results_list = [] for idx, item in enumerate(res): results_list.append({ "title": titles_mapped[int(item.id)], "publication": publications_mapped[int(item.id)], "score": item.score, }) return json.dumps(results_list) ``` ### Web Service Implemented in Flask ```python app = Flask(__name__) app.route("/") def index(): return render_template("index.html") @app.route("/api/search", methods=["POST", "GET"]) def search(): if request.method == "POST": return query_pinecone(request.form.history) if request.method == "GET": return query_pinecone(request.args.get("history", "")) return "Only GET and POST methods are allowed for this endpoint" if __name__ == '__main__': app.run() ``` ___________ ## Sample + Homepage ![2](https://github.com/egotist0/news_recommendation_service/tree/master/Photo/2.png) + Select any articles you like in initial recommentation ![2](https://github.com/egotist0/news_recommendation_service/tree/master/Photo/3.png) + Tap ==Submit== to get recommentation News and select new interested articles ![2](https://github.com/egotist0/news_recommendation_service/tree/master/Photo/4.png) ________ ## Usage + Create a new conda env or use python venv ```bash conda create --name your_env_name python=3.10 ``` + Activate your env ```bash conda activate your_env_name ``` + CD to your current working dir + Install the required python package. ```bash conda create --name your_env_name --file requirements.txt ``` + Download data from https://www.kaggle.com/datasets/snapcrack/all-the-news using wget. + Run app.py ```bash python app.py ``` Then you can see the website on http://127.0.0.1:5000/ ## Testing Approach The detection method in this section aligns with Project 1 — utilizing the EvaDB ChatGPT API interface. Input consists of the titles and content of the articles the reader has previously read, along with the titles and content of ten recommended articles. Subsequently, ChatGPT assesses them from a semantic perspective. ________ ## Reference + https://www.kaggle.com/datasets/snapcrack/all-the-news/data + https://docs.pinecone.io/docs/choosing-index-type-and-size + https://nlp.stanford.edu/projects/glove/ref=hackernoon.com + https://github.com/thawkin3/pinecone-demo + https://evadb.readthedocs.io/en/latest/source/reference/evaql/create_table.html

近期下载者

相关文件


收藏者