# SentenceTransformer based on sentence-transformers/all-distilroberta-v1

Based on the Sentence-BERT approach (Reimers & Gurevych, 2019, arXiv:1908.10084).
This is a sentence-transformers model fine-tuned from sentence-transformers/all-distilroberta-v1 on the ai-job-embedding-finetuning dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Full model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
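Because the final `Normalize()` module L2-normalizes every embedding, cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch of that property with NumPy, using toy 2-d vectors in place of real 768-dim model output:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the Normalize() module does."""
    return v / np.linalg.norm(v)

# Toy 2-d vectors standing in for 768-dim model output
a = l2_normalize(np.array([3.0, 4.0]))
b = l2_normalize(np.array([4.0, 3.0]))

print(np.linalg.norm(a))  # ~1.0: every normalized embedding has unit norm
# ... so a plain dot product already equals cosine similarity
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(a @ b, cosine))  # True
```

This is why dot-product and cosine similarity are interchangeable for this model's embeddings.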
## Usage

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tudorizer/distilroberta-ai-job-embeddings")

# Run inference
queries = [
    "Here\u0027s a concise job search query:\n\nData Analyst (5+ years) - Power BI \u0026 Excel expertise required for remote construction data analysis role with competitive pay and benefits.\n\nThis query highlights the essential skills and requirements mentioned in the job description, while excluding generic terms like data science or software engineering.",
]
documents = [
    'Qualifications)\n\n 5+ years of data analytic, data validation, data manipulation experience Six Sigma yellow or green belt certification Strong Power BI skills Strong Excel skills\n\nHow To Stand Out (Preferred Qualifications)\n\n Six Sigma Black Belt certification\n\n#DataAnalysis #RemoteWork #CareerGrowth #CompetitivePay #Benefits\n\nAt Talentify, we prioritize candidate privacy and champion equal-opportunity employment. Central to our mission is our partnership with companies that share this commitment. We aim to foster a fair, transparent, and secure hiring environment for all. If you encounter any employer not adhering to these principles, please bring it to our attention immediately. Talentify is not the EOR (Employer of Record) for this position. Our role in this specific opportunity is to connect outstanding candidates with a top-tier employer.\n\nTalentify helps candidates around the world to discover and stay focused on the jobs they want until they can complete a full application in the hiring company career page/ATS.',
    'Skill set Required: Primary:Python, Scala, AWS servicesNoSQL storage databases such Cassandra and MongoDBApache Beam and Apache SparkAmazon Redshift, Google BigQuery, and Snowflake Secondary:Java, Go languageMicroservices frameworks such as Kubernetes and Terraform.',
    "experienced data scientist who thrives on innovation and craves the vibrancy of a startup environment.\nResponsibilitiesProven experience in applying advanced data science algorithms such as neural networks, SVM, random forests, gradient boosting machines, or deep learning.Demonstrable expertise in at least three classes of advanced algorithms.Prior experience with live recommender systems and their implementation.Proficiency in deep learning frameworks, preferably TensorFlow.Proven track record in implementing scalable, distributed, and highly available systems on Cloud Platform (AWS, Azure, or GCP).Strong machine learning and AI skills.Strong communication skills, adaptability, and a thirst for innovation.High autonomy, ownership, and leadership mentality are crucial as you will be a pivotal member shaping our organization's future.Strong skills in data processing with R, SQL, Python, and PySpark.\nNice to haveSolid understanding of the computational complexity involved in model training and inference, especially in the context of real-time and near real-time applications.Familiarity with the management and analysis of large-scale assets.A team player with a collaborative mindset who is eager to learn and apply new methods and tools.A sense of pride and ownership in your work, along with the ability to represent your team confidently to other departments.",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 768] [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.4716, -0.1088, -0.0557]])
```
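For semantic search, you typically rank documents by their similarity to each query. Since the similarity output is a `(num_queries, num_documents)` matrix, ranking is an argsort per row. A sketch of that step with NumPy, reusing the scores printed above in place of live model output:

```python
import numpy as np

def rank_documents(similarities: np.ndarray) -> np.ndarray:
    """Return document indices sorted best-first for each query row."""
    return np.argsort(-similarities, axis=1)

# Scores shaped like the model.similarity() output above
scores = np.array([[0.4716, -0.1088, -0.0557]])
ranking = rank_documents(scores)
print(ranking)  # [[0 2 1]]
```

Here the data-analyst job description (index 0) is correctly ranked first for the data-analyst query.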
## Evaluation

TripletEvaluator on the ai-job-validation and ai-job-test datasets:

| Metric | ai-job-validation | ai-job-test |
|---|---|---|
| cosine_accuracy | 0.9902 | 1.0 |
## Training Dataset

Columns: query, job_description_pos, and job_description_neg

| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| type | string | string | string |

Samples:

| query | job_description_pos | job_description_neg |
|---|---|---|
| Here's a concise job search query with 3 specialized skills or areas of expertise that are distinct to the role: | Requirements | skills and supercharge careers. We help discover passion—the driving force that makes one smile and innovate, create, and make a difference every day. The Hexaware Advantage: Your Workplace BenefitsExcellent Health benefits with low-cost employee premium.Wide range of voluntary benefits such as Legal, Identity theft and Critical Care CoverageUnlimited training and upskilling opportunities through Udemy and Hexavarsity |
| Here's a concise job search query with 3 specialized skills: | skills, including prioritizing, problem-solving, and interpersonal relationship building.Strong experience in SDLC delivery, including waterfall, hybrid, and Agile methodologies.Experience delivering in an agile environment.Skills:Proficient in SQLTableau | requirements, ultimately driving significant value and fostering data-informed decision-making across the enterprise. |
| Here's a concise job search query: | experience as a lead full stack Java developer with strong JSP and servlets and UI development along with some backend technologies experience Another primary skill is Team handling and responsible for Junior developer's code reviews and onsite/offshore coordination experience is a must. | skills and data science knowledge to create real-world impact. You'll work closely with your clients to understand their questions and needs, and then dig into their data-rich environments to find the pieces of their information puzzle. You'll develop algorithms and systems and use the right combination of tools and frameworks to turn sets of disparate data points into objective answers to help clients make informed decisions. Ultimately, you'll provide a deep understanding of the data, what it all means, and how it can be used. |
Loss: MultipleNegativesRankingLoss with these parameters:

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
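MultipleNegativesRankingLoss uses in-batch negatives: for each query in a batch, its paired job description is the target and every other document in the batch serves as a negative; cosine similarities are scaled (by 20.0 here) and fed through cross-entropy. A NumPy sketch of that computation, not the library implementation:

```python
import numpy as np

def mnr_loss(query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; doc i is the positive for query i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                    # (batch, batch): other docs are negatives
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Four orthogonal toy "embeddings": aligned pairs give ~0 loss,
# misaligned pairs (positives shifted by one row) give a high loss
ident = np.eye(4, 8)
shifted = np.roll(ident, 1, axis=0)
print(round(mnr_loss(ident, ident), 4))    # 0.0
print(round(mnr_loss(ident, shifted), 2))  # 20.0
```

The large scale sharpens the softmax so the model is pushed to separate positives from in-batch negatives decisively; this is also why larger batches effectively provide more negatives per query.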
## Evaluation Dataset

Columns: query, job_description_pos, and job_description_neg

| | query | job_description_pos | job_description_neg |
|---|---|---|---|
| type | string | string | string |

Samples:

| query | job_description_pos | job_description_neg |
|---|---|---|
| Here's a concise job search query with 3 specialized skills: | requirements and deliver innovative solutionsPerform data cleaning, preprocessing, and feature engineering to improve model performanceOptimize and fine-tune machine learning models for scalability and efficiencyEvaluate and improve existing ML algorithms, frameworks, and toolkitsStay up-to-date with the latest trends and advancements in the field of machine learning | skills.50% of the time candidate will need to manage and guide a team of developers and the other 50% of the time will be completing the technical work (hands on). Must have previous experience with this (i.e., technical lead)Code review person. Each spring. Coders will do developing then candidate will be reviewing code and auditing the code to ensure its meeting the standard (final eye)Migrating to a data warehouse. |
| Here's a concise job search query with 3 specialized skills or areas of expertise that are distinct to the role: | requirements and building relationships.Drive risk-based data and integration decisions to minimize ERP implementation risks.Lead data extraction, transformation, and loading from legacy sources into Dynamics 365.Design, develop, and troubleshoot integrations with Dynamics 365 and other systems.Develop and maintain documentation for data processes and integration architecture.Enhance the enterprise data strategy in collaboration with leadership.Build and deploy scalable data pipelines and APIs to support evolving data needs.Drive data integrations for future acquisitions and ensure data integrity and governance.Collaborate with stakeholders to design and implement data models, dashboards, and reports. | Qualifications: |
| Here's a concise job search query: | 1-year contract \| Strong scripting/programming skills in Python, time series analysis experience (OSI PI, PI AF), and data visualization. This query highlights the key requirements mentioned in the job description, excluding generic data science or software engineering skills. | |
Loss: MultipleNegativesRankingLoss with these parameters:

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
## Training Hyperparameters

Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 1
- warmup_ratio: 0.1
- batch_sampler: no_duplicates

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}

## Training Logs

| Epoch | Step | ai-job-validation_cosine_accuracy | ai-job-test_cosine_accuracy |
|---|---|---|---|
| -1 | -1 | 0.9902 | 1.0 |
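The non-default hyperparameters above map onto the SentenceTransformerTrainer API (sentence-transformers v3+). A hedged sketch, assuming the triplet columns shown earlier; the dataset Hub id and split names are placeholders, not stated on this card:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

# Hub id of the ai-job-embedding-finetuning dataset is not given above; "..." is a placeholder
dataset = load_dataset("...")

args = SentenceTransformerTrainingArguments(
    output_dir="distilroberta-ai-job-embeddings",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],       # split names assumed, not from this card
    eval_dataset=dataset["validation"],
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```

The no_duplicates batch sampler matters with MultipleNegativesRankingLoss: a duplicate query in a batch would make one query's positive another's false negative.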
## Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

MultipleNegativesRankingLoss:

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Base model: sentence-transformers/all-distilroberta-v1