1
Data Driven Job Search Engine Using Skills and Company Attribute Filters
About me: this project was done as part of an internship last summer at EverString. It is an attempt to make job search much more personalized for applicants, using attributes that are not present in current job search engines (such as number of employees, skills, technologies used, industry, and revenue). Most current job search engines only let users search for jobs by title, date posted, experience level, company, and salary. Note that this was tested only on a fairly small dataset of 1.5 million job postings; the results could be much better with a larger job postings corpus. I cannot present more results or search queries here, as this project was done as part of my summer internship and I no longer have access to the code.
2
Data Extraction and Processing
Skills Data: Extracted a set of skills mentioned in professional social networks using DBpedia. Normalized, lemmatized, and filtered the skills from 750k down to 73k. E.g., "C#/.net", "C# / .net", "C# & .net", and "C# and .net" are all mapped to "C# and .net". E.g., "systems installations", "system installations", "systems installation", and "system installing" are all mapped to "system installation". This gives us a skill set that people actually use in professional social networks. We also construct a lemma dictionary that keeps track of each original skill and its normalized form; a minimal sketch of this normalization is shown below.
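A minimal sketch of the normalization and lemma-dictionary step, assuming NLTK's WordNet lemmatizer as the lemmatization backend (the actual tooling is not stated in the source, and the real pipeline clearly handles more variants, e.g. verb-to-noun mappings like "installing" to "installation"):

```python
import re
from collections import defaultdict

from nltk.stem import WordNetLemmatizer  # assumes nltk with WordNet data installed

lemmatizer = WordNetLemmatizer()

def normalize_skill(raw: str) -> str:
    """Lowercase, unify separators, and lemmatize each plain-word token."""
    s = raw.lower().strip()
    # Map separators like "/" and "&" (with optional spaces) to the word "and",
    # so "C#/.net", "C# / .net", and "C# & .net" all become "c# and .net".
    s = re.sub(r"\s*[/&]\s*", " and ", s)
    s = re.sub(r"\s+", " ", s)
    # Lemmatize alphabetic tokens: "systems installations" -> "system installation".
    tokens = [lemmatizer.lemmatize(t) if t.isalpha() else t for t in s.split()]
    return " ".join(tokens)

# Lemma dictionary: normalized skill -> set of original surface forms.
lemma_dict: dict[str, set[str]] = defaultdict(set)
for raw in ["C#/.net", "C# / .net", "C# & .net", "C# and .net",
            "systems installations", "system installation"]:
    lemma_dict[normalize_skill(raw)].add(raw)
```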
3
Data Extraction and Processing
Job Postings Data: Extracted job postings data from different companies' web pages and the Indeed API. Attributes extracted: company name, URL, job description, job title, company address. Title normalization and parsing. Company name normalization, so that the name can be used to generate the domain name, which is useful for data merging. Extract the skills lemma dictionary with counts for each job description. Company name to website/domain name (C2D). Join with EverString's company knowledge database to populate company-related firmographics. After normalization, titles are classified into management levels (C-level, VP level, director, manager, non-manager/regular) and into departments. Company name normalization removes terms like "ltd" and "pvt" and stop words like "technology", "management", "service", etc. C2D converts company names like "Amazon Drive", "Amazon Web Services", "Amazon Prime", "AmazonFresh", "Amazon HVH", "Amazon Corporate LLC", "Amazon Logistics", "Amazon Web Services, Inc", "Amazon.com.dedc, LLC", "Amazon Fulfillment", "Amazon Fulfillment Services", and "Amazon" to the company's website, amazon.com. By joining with the company DB we populate company-related features such as employee size, revenue, technologies, funding, followers, recruiter contacts, and the micro industries the company belongs to. A toy sketch of the normalization and C2D steps is shown below.
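A toy sketch of the company-name normalization and C2D steps, assuming hand-picked lists of legal suffixes and generic stop terms (the actual term lists and domain-resolution logic are not given in the source; the real pipeline presumably validates candidate domains against EverString's company knowledge database rather than guessing ".com"):

```python
import re

# Hypothetical stop-term lists for company-name normalization.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "pvt", "corp", "co"}
GENERIC_TERMS = {"technology", "technologies", "management", "service", "services",
                 "web", "corporate", "drive", "prime", "fulfillment", "logistics"}

def normalize_company(name: str) -> str:
    """Drop legal suffixes and generic stop terms from a company name."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [t for t in tokens if t not in LEGAL_SUFFIXES | GENERIC_TERMS]
    return " ".join(kept)

def company_to_domain(name: str) -> str:
    """C2D: collapse a normalized company name into a candidate domain."""
    core = normalize_company(name).replace(" ", "")
    return core + ".com"

# These variants all resolve to "amazon.com":
for variant in ["Amazon Web Services, Inc", "Amazon Prime",
                "Amazon Corporate LLC", "Amazon"]:
    print(company_to_domain(variant))
```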
4
LTU term weighting scheme (TF-IDF)
Compute the LTU (TF-IDF) weight for every skill in a job posting, where: tf – term frequency, the number of times a skill is listed in the job posting; docLen – document length, the total number of skills in the job posting, counted with repetitions; nDocs – the total number of job postings in the corpus; df – document frequency, the number of job postings that require this skill; avgDocLen – the average number of skills per job posting across the corpus. Ranking the skills on plain TF-IDF alone does not give good results, since it takes only the document-level job description into consideration. LTU uses a pivoted document-length normalization scheme, which gives a word appearing a given number of times in a shorter document more weight than a word appearing the same number of times in a longer document. Because job descriptions contain a lot of company-specific information and benefits text, and we observed many skill words being picked up from exactly that text, we need an additional weighting scheme on top of LTU (next slide). A hedged reconstruction of the LTU formula is shown below.
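The formula image itself is not reproduced in the transcript. A minimal sketch of the standard LTU scheme consistent with the variables listed above, assuming the usual pivoted-normalization slope of 0.2 (the source does not confirm the exact constants):

$$\mathrm{LTU}(\text{skill}, d) = \frac{(1 + \log tf)\,\log\frac{nDocs}{df}}{(1 - s) + s \cdot \frac{docLen}{avgDocLen}}, \qquad s = 0.2$$

```python
import math

def ltu_weight(tf: int, df: int, n_docs: int, doc_len: int,
               avg_doc_len: float, slope: float = 0.2) -> float:
    """Standard LTU weighting with pivoted document-length normalization.
    The slope value 0.2 is the common default, not confirmed by the source."""
    if tf == 0 or df == 0:
        return 0.0
    l = 1.0 + math.log(tf)                                     # L: dampened term frequency
    t = math.log(n_docs / df)                                  # T: inverse document frequency
    u = 1.0 / ((1.0 - slope) + slope * doc_len / avg_doc_len)  # U: pivoted length norm
    return l * t * u
```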
5
Weighting on top of TF-IDF
Takes job titles into consideration when generating the weights. A count matrix is generated with title n-grams as rows and skills as columns, counting the number of occurrences of each skill under each title n-gram. The weight for each skill in a job posting is generated from this matrix, giving different weights to skills under different titles. Prob(skill | titleNgram) – this term weights a skill higher if the probability of finding it under similar title n-grams is higher. Prob(skill) – this term penalizes the TF-IDF weight if a particular skill is found across many different titles; the intuition is that a company posts many different titles with the same company description and benefits text, so we penalize terms that come from that text. The final weight for a skill in a job posting is generated by averaging the weights computed for each of the title n-grams, and the final score for a skill is the product of its TF-IDF value and this weight. Why use this weighting scheme on top of TF-IDF? Much of the information in a job posting acts as noise for the TF-IDF model, so we use the job titles as one additional dimension for better ranking of the skills. Count matrix construction: split the title to get its n-grams. E.g., for the title "Big Data Software Engineer" we take all bigram sequences ("Big Data", "Data Software", and "Software Engineer") and add them to the array of unigram title grams. Intuitively, we want a skill to be weighted higher if it occurs more frequently under a title n-gram; e.g., skills like C, C++, Java, and OOP appear more frequently in job postings with "Software" in the title, so this scheme weights those skills higher. We also tried SVD on the TF-IDF matrix to learn topics from the projected dimension space, and PPMI (positive pointwise mutual information) weighting models. A sketch of the count-matrix construction and weighting computation follows.
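A sketch of the count-matrix construction and title-based weighting. The exact functional form of the per-n-gram weight is not reproduced in the transcript; this implements the stated intuition as the ratio Prob(skill | titleNgram) / Prob(skill), averaged over the title's n-grams, so the real formula on the slide may differ in detail. The helper names (title_ngrams, add_posting, title_weight) are illustrative:

```python
from collections import Counter, defaultdict

def title_ngrams(title: str, n_max: int = 2) -> list[str]:
    """Unigrams plus higher n-grams, e.g. "Big Data Software Engineer" ->
    ["big", "data", "software", "engineer",
     "big data", "data software", "software engineer"]."""
    tokens = title.lower().split()
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# Count matrix: counts[ngram][skill] = occurrences of the skill in postings
# whose title contains that n-gram.
counts: defaultdict[str, Counter] = defaultdict(Counter)
skill_totals: Counter = Counter()

def add_posting(title: str, skills: list[str]) -> None:
    for gram in title_ngrams(title):
        counts[gram].update(skills)
    skill_totals.update(skills)

def title_weight(skill: str, title: str) -> float:
    """Average over the title's n-grams of P(skill | ngram) / P(skill)."""
    total = sum(skill_totals.values())
    if total == 0 or skill_totals[skill] == 0:
        return 0.0
    p_skill = skill_totals[skill] / total
    ratios = []
    for gram in title_ngrams(title):
        gram_total = sum(counts[gram].values())
        if gram_total:
            ratios.append((counts[gram][skill] / gram_total) / p_skill)
    return sum(ratios) / len(ratios) if ratios else 0.0

def final_skill_score(tfidf: float, skill: str, title: str) -> float:
    # Final score = TF-IDF (LTU) value times the title-based weight.
    return tfidf * title_weight(skill, title)
```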
6
Ranking the Filtered Results
The filtered search results are ranked using a formula over the following factors. Avg(weight(skill)) – average of the weights of the skills in a job posting, considering only the skills mentioned in the user's search query. feedback – a factor computed from the number of user clicks. af – Alexa factor, computed for each company from its Alexa rank. ef – employment factor, computed from the current number of employees at the company. nlf – number-of-lemmas factor, computed from the number of company-specific lemma keywords. csk – sqrt(micro-industry keyword score for the company). Now that we have a table of each job posting with its company firmographics and ranked skills, we can use this table to serve search queries and perform some analytics. For filtering, we retrieve the job postings that satisfy all the conditions in the user's search query; the filtered postings are then ranked with the formula and shown to the user. A hedged sketch of this scoring step follows.
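A hedged sketch of the scoring step. The slide's formula image is not reproduced in the transcript, so the way the factors combine is assumed here (a simple product); only the list of factors and the square root on the micro-industry keyword score are given by the source:

```python
import math
from dataclasses import dataclass

@dataclass
class CompanyFactors:
    feedback: float            # factor computed from the number of user clicks
    af: float                  # Alexa factor, from the company's Alexa rank
    ef: float                  # employment factor, from current employee count
    nlf: float                 # number-of-lemmas factor, from company-specific lemma keywords
    industry_kw_score: float   # micro-industry keyword score for the company

def posting_score(query_skill_weights: list[float], c: CompanyFactors) -> float:
    """Rank score for one filtered posting. query_skill_weights holds the
    weights of only those skills the user mentioned in the query."""
    avg_w = sum(query_skill_weights) / len(query_skill_weights)
    csk = math.sqrt(c.industry_kw_score)
    return avg_w * c.feedback * c.af * c.ef * c.nlf * csk  # combination assumed
```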
7
Results: User Search Query
A user with a bachelor's degree and Python and Scala programming skills wants to search for jobs in the "engineering" vertical, at companies that use jQuery technology, have revenue greater than 1 million USD, and have between 50 and 200 employees. The revenue numbers shown in the diagram are in thousands. The top 40 company results are listed in decreasing order of the number of job postings per company that matched all the user-given attributes in the query. In total, 1,064 job postings in our corpus matched the search query. A hedged sketch of this filtering step is shown below.
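A hedged sketch of the filtering step for this query, assuming the joined posting/firmographics table is available as a pandas DataFrame (the file name and column names are illustrative, not the actual schema):

```python
import pandas as pd

# One row per job posting, joined with company firmographics (hypothetical schema).
postings = pd.read_parquet("postings_with_firmographics.parquet")

mask = (
    postings["skills"].apply(lambda s: {"python", "scala"} <= set(s))   # user skills
    & postings["technologies"].apply(lambda t: "jquery" in t)           # company tech
    & (postings["department"] == "engineering")                         # vertical
    & (postings["revenue_usd"] > 1_000_000)                             # revenue filter
    & postings["employee_count"].between(50, 200)                       # company size
)

# Rank the filtered postings by the score described on the previous slide.
results = postings[mask].sort_values("score", ascending=False)
```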
8
Results: Job search results for the top 3 companies
These are the results for the top 3 companies in the table we saw earlier. Superficially looking at the job titles, you can see that most of the postings match the search query. Because these results are ranked over only the top 3 companies' postings, we see many postings from the same company clustered together; another reason for the clustering is that several company-level attributes contribute to the score of each posting. If we ranked the results without restricting to the top 3 companies, I am sure the top 40 would show much more diverse company names. A few results that appear to be anomalies are explained next.
9
Results: Contact information of the recruiters for the top 40 companies. You can see here the contact details of the relevant recruiters for each company; mobile numbers and addresses are not displayed, to protect privacy. The recruiters' job titles (current employees) are classified into 5 management levels: C-level, VP-level, director, manager, and regular.
10
Results: The search query has Python and Scala, so why are we getting results like "Full Stack Java Developer"? Job key: d93199bf4c06f3b. Link: Why are we getting the search results in clusters? Because much of the ranking scheme depends on company-specific attributes, the filtered results tend to appear in clusters corresponding to each company. The results could be better if we had more job postings data. Comparing with Google and Indeed. If it doesn't work: Link:
11
Analytics 1
12
Analytics 2
13
Questions?
14
Thank You!